Welcome to our Data Engineering Portfolio
Information
- Project Team: Justine Guirauden and Volcy Desmazures
- School: ESIEE Paris (2025-2026)
- Course: Data Engineering (DE1 & DE2)
- Track: D — Aviation (OpenSky Network)
- Objective: Engineering high-performance pipelines for data-intensive aviation workloads.
Semester 2: Aviation Track & Data-Intensive Systems
This semester focused on Track D (Aviation), processing high-velocity data from the OpenSky Network. We developed an end-to-end Lakehouse architecture designed to handle real-world aeronautical constraints: noise, late-arriving data, and complex spatial relationships.
Major Project: Aviation Lakehouse & Analytics
We built a multi-layered pipeline (Bronze/Silver/Gold) using Spark 4.0 to transform raw transponder signals into a queryable knowledge base.
- Real-time Monitoring: Implemented Structured Streaming to track airspace density. We achieved a throughput of 57,883 rows/sec, well above our SLO of 100 rows/sec.
- Graph Intelligence: Modeled aircraft proximity as a dynamic network. Running PageRank over 6,264 vertices and 256,461 edges, we identified critical flight corridors and traffic hubs.
- Advanced Curation: Generated an AI-ready corpus of 692,430 documents. Quality filters, including xxHash64 deduplication, yielded a 45.36% pass ratio, highlighting the challenges posed by synthetic METAR sentence length.
Full Report & Code
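The curation step described above (hash-based deduplication plus a quality filter, summarized as a pass ratio) can be illustrated with a small sketch. This is not the project's code: `hash64`, `curate`, the `min_words` threshold, and the sample strings are hypothetical, and a truncated BLAKE2b digest stands in for xxHash64, which is not part of the Python standard library.

```python
import hashlib


def hash64(text: str) -> int:
    """64-bit content hash. Stand-in for xxHash64 (BLAKE2b truncated to 8 bytes)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def curate(docs, min_words=5):
    """Keep documents that pass a length filter and are not exact duplicates.

    min_words is a hypothetical threshold, not the project's actual setting.
    """
    seen = set()
    kept = []
    for doc in docs:
        normalized = " ".join(doc.split())  # collapse whitespace before hashing
        if len(normalized.split()) < min_words:
            continue  # rejected: too short
        key = hash64(normalized)
        if key in seen:
            continue  # rejected: exact duplicate of an earlier document
        seen.add(key)
        kept.append(normalized)
    return kept


def pass_ratio(kept_count, total_count):
    """Percentage of documents that survived the filters."""
    return 100.0 * kept_count / total_count if total_count else 0.0


# Hypothetical METAR-style strings, not real project data.
sample = [
    "LFPG 121200Z 24008KT 9999 FEW030 18/09 Q1015",
    "LFPG 121200Z 24008KT  9999 FEW030 18/09 Q1015",  # duplicate once whitespace is normalized
    "LFPO 121200Z CAVOK",                              # too short for the length filter
    "LFPB 121230Z 25010KT 8000 BKN020 17/10 Q1014",
]
kept = curate(sample)
print(len(kept), f"{pass_ratio(len(kept), len(sample)):.2f}%")
```

Hashing a normalized form of each document keeps the duplicate check to one set lookup per document, which is the property that makes a 64-bit hash attractive at corpus scale.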
Semester 2: Engineering Notes & Labs Archive
Detailed practical assignments and technical documentation covering streaming, indexing, and graph algorithms for the Aviation track.
Lab 1: Streaming Pipeline
- Objective: High-throughput ingestion of OpenSky vectors.
- Engineering Note: Optimized micro-batch processing by tuning shuffle partitions (n=8) and applying a 10-minute watermark to handle out-of-order state vectors.
- Access Notebook: Assignment 1
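To make the watermark idea in the note above concrete, here is a simplified, single-process model of event-time watermarking. It is a sketch under assumptions, not the lab's Spark code: `filter_late_events` and the sample timestamps are hypothetical, and the model only captures the core rule that the watermark trails the newest event time seen so far by a fixed delay and advances between micro-batches.

```python
from datetime import datetime, timedelta

WATERMARK_DELAY = timedelta(minutes=10)  # the 10-minute figure from the note above


def filter_late_events(batches, delay=WATERMARK_DELAY):
    """Return one (accepted, dropped) pair per micro-batch.

    Simplified model: the watermark is (max event time seen so far - delay)
    and only advances between batches, so each batch is checked against the
    watermark computed from earlier batches.
    """
    max_event_time = None
    results = []
    for batch in batches:
        watermark = max_event_time - delay if max_event_time is not None else None
        accepted = [t for t in batch if watermark is None or t >= watermark]
        dropped = [t for t in batch if watermark is not None and t < watermark]
        results.append((accepted, dropped))
        if batch:
            newest = max(batch)
            if max_event_time is None or newest > max_event_time:
                max_event_time = newest
    return results


# Hypothetical event times for three micro-batches.
base = datetime(2025, 1, 1, 12, 0)
batches = [
    [base, base + timedelta(minutes=5)],
    [base + timedelta(minutes=20), base + timedelta(minutes=2)],  # 12:02 is late but within 10 minutes
    [base + timedelta(minutes=3)],                                # older than 12:20 - 10 min: dropped
]
results = filter_late_events(batches)
for accepted, dropped in results:
    print(len(accepted), "accepted,", len(dropped), "dropped")
```

The delay is a trade-off: a longer watermark tolerates more out-of-order data but forces the engine to keep state for longer.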
Lab 2: Inverted Index
- Objective: NLP-based retrieval system for aviation weather reports.
- Engineering Note: Designed a columnar search index. By migrating to Parquet, we reduced the storage footprint to 34.44% of the original size (104 MB vs 304 MB).
- Access Notebook: Assignment 2
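As an illustration of the data structure behind this lab, the sketch below builds a minimal in-memory inverted index and answers AND queries against it. The function names and the three sample reports are hypothetical; the lab itself stores its index in Parquet, which this example does not attempt to reproduce.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.upper().split():
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}


def search(index, query):
    """Return ids of documents containing every query token (AND semantics)."""
    tokens = query.upper().split()
    if not tokens:
        return []
    result = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        result &= set(index.get(token, []))
    return sorted(result)


# Hypothetical METAR-style reports, not real project data.
reports = {
    1: "LFPG 121200Z 24008KT 9999 FEW030",
    2: "LFPO 121200Z 25010KT 8000 BKN020",
    3: "LFPG 121230Z 24008KT CAVOK",
}
index = build_inverted_index(reports)
print(search(index, "LFPG 24008KT"))  # [1, 3]
print(search(index, "121200Z"))       # [1, 2]
```

Stored as (token, document id) rows, a structure like this maps naturally onto a columnar format, since repeated tokens compress well.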
Lab 3: Graph Processing
- Objective: Proximity network analysis and PageRank calculation.
- Engineering Note: Focused on shuffle reduction via repartition(8, "src"), achieving a 27% execution time gain during iterative PageRank updates.
- Access Notebook: Assignment 3
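For readers unfamiliar with the algorithm, here is a textbook power-iteration PageRank in plain Python. It is a sketch, not the lab's Spark implementation: the `pagerank` function, its defaults, and the four-vertex graph are assumptions for illustration. Grouping edges by source once before the loop is a loose single-process analogue of partitioning by "src": each iteration can then reuse the same layout instead of reorganizing the edges.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a list of (src, dst) edges."""
    # Group outgoing edges by source once, so every iteration reuses the layout.
    out_links = defaultdict(list)
    vertices = set()
    for src, dst in edges:
        out_links[src].append(dst)
        vertices.update((src, dst))

    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices without outgoing edges is spread uniformly.
        dangling = sum(ranks[v] for v in vertices if v not in out_links)
        incoming = defaultdict(float)
        for src, dsts in out_links.items():
            share = ranks[src] / len(dsts)
            for dst in dsts:
                incoming[dst] += share
        ranks = {
            v: (1.0 - damping) / n + damping * (incoming[v] + dangling / n)
            for v in vertices
        }
    return ranks


# Hypothetical proximity graph: three aircraft all pass close to "C".
edges = [("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]
ranks = pagerank(edges)
for vertex, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(vertex, round(score, 4))
```

This variant keeps the ranks normalized so they sum to 1, which makes scores comparable across graphs of different sizes; other formulations leave them unnormalized.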
Semester 1: Foundations (Synthesis)
The first semester focused on the fundamentals of the Data Engineering stack, from environment containerization to advanced physical data modeling.
Project: Food Facts Analysis
We optimized a 1.1 GB global nutritional dataset (Open Food Facts) through rigorous cleaning and schema enforcement.
- Storage Efficiency: Reduced data footprint from 1.1 GB to 0.34 MB (-99.9%).
- Query Performance: Improved execution speed by 3.4x using Predicate Pushdown and sortWithinPartitions.
- View Semester 1 Report | View DE1 Notebook
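The pairing of predicate pushdown with sorting can be explained with a small model of min/max statistics. Parquet records such statistics per row group, and a reader can skip any group whose range cannot satisfy the filter. The sketch below is a plain Python illustration with hypothetical function names and made-up values; it does not read Parquet files or reproduce the project's measurements.

```python
def make_row_groups(values, group_size):
    """Split values into fixed-size chunks and record min/max per chunk."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups


def scan_with_pushdown(groups, lower, upper):
    """Return matching rows and the number of chunks actually read."""
    matches, chunks_read = [], 0
    for group in groups:
        if group["max"] < lower or group["min"] > upper:
            continue  # statistics prove no row can match: skip the chunk
        chunks_read += 1
        matches.extend(v for v in group["rows"] if lower <= v <= upper)
    return matches, chunks_read


# Hypothetical numeric column, not project data.
values = [510, 20, 340, 95, 480, 15, 300, 60, 450, 30, 275, 80]
unsorted_groups = make_row_groups(values, group_size=4)
sorted_groups = make_row_groups(sorted(values), group_size=4)

rows_unsorted, read_unsorted = scan_with_pushdown(unsorted_groups, 0, 50)
rows_sorted, read_sorted = scan_with_pushdown(sorted_groups, 0, 50)
print(read_unsorted, read_sorted)  # 3 1
```

Both scans return the same rows, but the sorted layout gives each chunk a narrow value range, so the filter reads one chunk instead of three.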
Labs Archive (DE1)
- Lab 1: Environment & Docker Setup
- Lab 2: SQL & Analytical Data Modeling
- Lab 3: Orchestrated Data Pipelines
Technical Stack
- Engine: Apache Spark 4.0 (PySpark)
- Storage: Parquet (Snappy), Medallion Lakehouse Architecture
- Graph: Spark GraphX (Distributed Graph Processing)
- Ops: Docker, Makefile, Quartz (Documentation)
- Deployment: Cloudflare Pages & Zero Trust
About this site
This portfolio follows the “Docs as Code” philosophy:
- Content authored in Markdown.
- Site generated with Quartz.
- CI/CD via GitHub Actions to Cloudflare Pages.