Welcome to our Data Engineering Portfolio

Information

  • Project Team: Justine Guirauden and Volcy Desmazures
  • School: ESIEE Paris (2025-2026)
  • Course: Data Engineering (DE1 & DE2)
  • Track: D — Aviation (OpenSky Network)
  • Objective: Engineering high-performance pipelines for data-intensive aviation workloads.

Semester 2: Aviation Track & Data-Intensive Systems

This semester focused on Track D (Aviation), processing high-velocity data from the OpenSky Network. We developed an end-to-end Lakehouse architecture designed to handle real-world aeronautical constraints: noise, late-arriving data, and complex spatial relationships.

Major Project: Aviation Lakehouse & Analytics

We built a multi-layered pipeline (Bronze/Silver/Gold) using Spark 4.0 to transform raw transponder signals into a queryable knowledge base.

  • Real-time Monitoring: Implemented Structured Streaming to track airspace density. We achieved a throughput of 57,883 rows/sec, far exceeding our SLO of 100 rows/sec.
  • Graph Intelligence: Modeled aircraft proximity as a dynamic network. Running PageRank over 6,264 vertices and 256,461 edges, we identified critical flight corridors and traffic hubs.
  • Advanced Curation: Generated an AI-ready corpus of 692,430 documents. We implemented quality filters (xxHash64 deduplication) resulting in a 45.36% pass ratio, which highlights the challenge posed by sentence length in synthetic METAR text.
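
The curation step can be illustrated with a minimal sketch in plain Python rather than Spark. It assumes a minimum word-count filter followed by exact-duplicate removal keyed on a 64-bit hash of the normalized text. Because xxHash64 is not in the Python standard library, `hashlib.blake2b` with an 8-byte digest stands in for it here. `MIN_WORDS`, `fingerprint`, `curate`, and the sample sentences are hypothetical and not taken from the project.

```python
import hashlib

MIN_WORDS = 5  # hypothetical quality threshold, not taken from the project


def fingerprint(text: str) -> int:
    """64-bit hash of the normalized text (BLAKE2b used as a stand-in for xxHash64)."""
    normalized = " ".join(text.lower().split())
    digest = hashlib.blake2b(normalized.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def curate(documents):
    """Keep the first copy of each document that meets the length threshold."""
    seen = set()
    kept = []
    for doc in documents:
        if len(doc.split()) < MIN_WORDS:
            continue  # too short to pass the quality filter
        key = fingerprint(doc)
        if key in seen:
            continue  # same text after case and whitespace normalization
        seen.add(key)
        kept.append(doc)
    return kept


# Hypothetical weather sentences used only to exercise the filters.
docs = [
    "Wind 270 degrees at 12 knots, visibility 10 km.",
    "wind 270 degrees at 12 knots,  visibility 10 km.",
    "Sky clear.",
    "Temperature 14 C, dew point 9 C, QNH 1013 hPa.",
]
kept = curate(docs)
pass_ratio = len(kept) / len(docs)
print(f"{len(kept)}/{len(docs)} kept, pass ratio {pass_ratio:.2%}")
```

On this toy input the second sentence is dropped as a duplicate and the third as too short, so the pass ratio is simply the number of kept documents divided by the number of input documents.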

Full Report & Code


Semester 2: Engineering Notes & Labs Archive

Detailed practical assignments and technical documentation covering streaming, indexing, and graph algorithms for the Aviation track.

  • Lab 1: Streaming Pipeline

    • Objective: High-throughput ingestion of OpenSky vectors.
    • Engineering Note: Optimized micro-batch processing by tuning shuffle partitions (n=8) and applying a 10-minute watermark to handle out-of-order state vectors.
    • Access Notebook: Assignment 1
  • Lab 2: Inverted Index

    • Objective: NLP-based retrieval system for aviation weather reports.
    • Engineering Note: Designed a columnar search index. By migrating to Parquet, we reduced the storage footprint to 34.44% of the original size (104 MB vs 304 MB).
    • Access Notebook: Assignment 2
  • Lab 3: Graph Processing

    • Objective: Proximity network analysis and PageRank calculation.
    • Engineering Note: Focused on eliminating repeated shuffles via repartition(8, "src"), which cut execution time by 27% during iterative PageRank updates.
    • Access Notebook: Assignment 3
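
The 10-minute watermark from Lab 1 can be sketched as a simplified model in plain Python. This is not the Structured Streaming code from the notebook. It assumes the watermark is the maximum event time seen in earlier micro-batches minus the delay, and that events older than the watermark are discarded. The function `apply_watermark`, the `icao24` values, and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta

WATERMARK_DELAY = timedelta(minutes=10)  # the 10-minute delay quoted in Lab 1


def apply_watermark(batches, delay=WATERMARK_DELAY):
    """Split events into accepted and dropped using an event-time watermark.

    The watermark only advances after a batch has been processed, so every
    event in a batch is compared against the same threshold.
    """
    max_event_time = None
    accepted, dropped = [], []
    for batch in batches:
        watermark = None if max_event_time is None else max_event_time - delay
        for event in batch:
            if watermark is not None and event["time"] < watermark:
                dropped.append(event)  # too late: older than the watermark
            else:
                accepted.append(event)
        batch_max = max((e["time"] for e in batch), default=None)
        if batch_max is not None and (max_event_time is None or batch_max > max_event_time):
            max_event_time = batch_max
    return accepted, dropped


# Hypothetical state vectors: two micro-batches with out-of-order arrivals.
noon = datetime(2025, 1, 1, 12, 0)
batches = [
    [{"icao24": "a", "time": noon},
     {"icao24": "b", "time": noon + timedelta(minutes=20)}],
    [{"icao24": "c", "time": noon + timedelta(minutes=15)},  # 5 min late: kept
     {"icao24": "d", "time": noon + timedelta(minutes=5)}],  # 15 min late: dropped
]
accepted, dropped = apply_watermark(batches)
print([e["icao24"] for e in accepted], [e["icao24"] for e in dropped])
```

After the first batch the maximum event time is 12:20, so the watermark for the second batch is 12:10. The choice of delay trades completeness against how long state must be retained: a longer delay accepts later events but keeps more state in memory.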

Semester 1: Foundations (Synthesis)

The first semester focused on the fundamentals of the Data Engineering stack, from environment containerization to advanced physical data modeling.

Project: Food Facts Analysis

We optimized a 1.1 GB global nutritional dataset (Open Food Facts) through rigorous cleaning and schema enforcement.

  • Storage Efficiency: Reduced data footprint from 1.1 GB to 0.34 MB (-99.9%).
  • Query Performance: Improved execution speed by 3.4x using predicate pushdown and sortWithinPartitions.
  • View Semester 1 Report | View DE1 Notebook
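
Why sorting helps predicate pushdown can be shown with a small plain-Python model of Parquet row groups. This is only a sketch of the general mechanism, not the project's Spark code: each row group stores min/max statistics, and a filter can skip any group whose statistics prove that no row matches. The `RowGroup` class, the helper functions, and the `sugar` values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    rows: list
    min_value: float
    max_value: float


def build_row_groups(values, group_size):
    """Chunk values into row groups and record min/max statistics for each."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_greater_than(groups, threshold):
    """Return matching rows and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # statistics prove that no row in this group can match
        groups_read += 1
        matches.extend(v for v in group.rows if v > threshold)
    return matches, groups_read


# Hypothetical sugar content values (g per 100 g), in arrival order.
sugar = [3, 48, 12, 51, 7, 60, 1, 55, 20, 45, 9, 52]

unsorted_matches, unsorted_read = scan_greater_than(build_row_groups(sugar, 4), 50)
sorted_matches, sorted_read = scan_greater_than(build_row_groups(sorted(sugar), 4), 50)
print(f"unsorted: {unsorted_read} groups read, sorted: {sorted_read} groups read")
```

With unsorted data every row group contains at least one value above the threshold, so all three groups are read. Once the values are sorted, the matching rows cluster in the last group and the other two are skipped on statistics alone.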

Labs Archive (DE1)


Technical Stack

  • Engine: Apache Spark 4.0 (PySpark)
  • Storage: Parquet (Snappy), Medallion Lakehouse Architecture
  • Graph: Spark GraphX (Distributed Graph Processing)
  • Ops: Docker, Makefile, Quartz (Documentation)
  • Deployment: Cloudflare Pages & Zero Trust

About this site

This portfolio follows the “Docs as Code” philosophy:

  • Content authored in Markdown.
  • Site generated with Quartz.
  • CI/CD via GitHub Actions to Cloudflare Pages.