Welcome to our Data Engineering website
Information
- Project Team: Justine Guirauden and Volcy Desmazures
- School: ESIEE Paris (2025-2026)
- Course: Data Engineering
- Track: D — Aviation (OpenSky Network)
- Objective: Engineering high-performance pipelines for data-intensive aviation workloads.
This site documents our practical labs as well as our final project on Big Data pipeline optimization.
Semester 2: Aviation Track & Data-Intensive Systems
During this semester, we focused on Track D (Aviation), processing real-world data from the OpenSky Network. We implemented complex pipelines covering streaming, indexing, and graph processing.
Major Project: End-to-End Aviation Lakehouse
We built a comprehensive pipeline to analyze European airspace traffic. Our system supports batch analytics, real-time monitoring, and advanced research tools.
- Our Architecture: A multi-layered Lakehouse (Bronze/Silver/Gold) using PySpark and Parquet.
- Key Features:
  - Real-time airspace density monitoring via Structured Streaming.
  - Centrality analysis of flight networks using PageRank.
  - Industrial-grade NLP corpus for aviation report retrieval.
- Result: A reproducible framework capable of handling >100,000 aircraft records with optimized shuffle paths.
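The centrality analysis mentioned above can be illustrated with a minimal power-iteration PageRank on a toy route graph. This is a conceptual sketch in pure Python, not our Spark implementation; the airport codes, damping factor, and iteration count are illustrative assumptions.

```python
# Minimal PageRank by power iteration on a toy airport route graph.
# Node names and parameters are illustrative, not from our pipeline.

def pagerank(edges, damping=0.85, iters=50):
    nodes = {n for e in edges for n in e}
    out = {n: [d for s, d in edges if s == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * rank[n] / len(out[n])
                for d in out[n]:
                    new[d] += share
            else:
                # Dangling node: redistribute its mass uniformly.
                for d in nodes:
                    new[d] += damping * rank[n] / len(nodes)
        rank = new
    return rank

# Toy graph: CDG acts as a hub receiving traffic from the other airports.
edges = [("ORY", "CDG"), ("LYS", "CDG"), ("NCE", "CDG"), ("CDG", "ORY")]
ranks = pagerank(edges)
```

The hub (here CDG) accumulates the highest score, which is exactly how we identified key traffic hubs at scale.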
Semester 2 Engineering Notes
- Lab 1: Streaming Optimization → We achieved a 10x throughput improvement by tuning shuffle partitions and watermarks for METAR weather reports. View Note
- Lab 2: Inverted Indexing → We designed a columnar search index for aviation text, reducing storage footprint by 65.7% compared to CSV. View Note
- Lab 3: Graph Processing → We modeled aircraft proximity as a graph and identified key traffic hubs using iterative algorithms. View Note
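The inverted-indexing idea from Lab 2 can be sketched in a few lines of pure Python: map each token to the set of report IDs containing it, then answer conjunctive queries by set intersection. The sample reports are made up for illustration; our actual index is columnar and built with Spark.

```python
from collections import defaultdict

# Toy inverted index: token -> set of report ids containing it.
# Sample aviation reports are invented for illustration.
reports = {
    1: "runway closed due to snow",
    2: "snow and low visibility at runway 27",
    3: "bird strike reported on departure",
}

index = defaultdict(set)
for doc_id, text in reports.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search(*terms):
    """Return ids of reports containing all query terms (AND semantics)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()
```

Storing only postings per distinct term (rather than repeating full rows, as CSV does) is what drives the storage reduction reported above.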
Semester 1: Foundations
In the first semester, we focused on the fundamentals of the Data Engineering stack, from containerization to analytical modeling.
- Project: Food Facts Analysis
- We optimized a 1.1 GB dataset down to 0.34 MB (-99.9%).
- We improved query speed by 3.4x using Predicate Pushdown.
- View Semester 1 Report
- Labs Archive:
Our Technical Stack
- Engine: Apache Spark (PySpark 3.x)
- Storage: Parquet (Snappy), Lakehouse Architecture
- Ops: Docker, Makefile, GitHub Actions
- Deployment: Cloudflare Pages & Zero Trust
Final Project: Local Lakehouse & Optimization
We built a local Lakehouse capable of processing real and complex data while meeting strict performance objectives (SLOs).
The Topic: Nutritional Analysis (Open Food Facts)
We analyzed the evolution of the nutritional quality of global food products (Sugar, Fat, Nutriscore).
- Data: ~1.1 GB of raw CSV, highly denormalized (>150 columns).
- Stack: PySpark (Spark 3.x), Parquet, Local Single Node.
Key Results
We compared a “naive” pipeline (Baseline) against our optimized pipeline (Silver/Gold layers).
| Metric | Result | Technical Impact |
|---|---|---|
| Storage | -99.9% (1.1 GB → 0.34 MB) | Snappy compression + drastic cleaning |
| Speed (Q3) | 3.4x faster | Predicate pushdown & data skipping |
| Latency | 228 ms | Optimized reads via sorting (`sortWithinPartitions`) |
Access the Report
This project demonstrates how rigorous physical design (Sorting, Partitioning, Projection) can transform an unusable dataset into a high-performance Datamart.
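The data-skipping effect behind the Q3 speedup can be modeled in plain Python: sorted data yields tight min/max statistics per row group, so a range predicate can discard whole groups without reading them. This is a conceptual model of Parquet row-group pruning, not Spark's actual reader; the group size and value range are assumptions.

```python
# Conceptual model of Parquet row-group pruning ("data skipping").
# Each row group carries min/max stats; a predicate on a sorted column
# lets the reader discard groups whose range cannot match.

def make_row_groups(values, group_size):
    groups = []
    for i in range(0, len(values), group_size):
        chunk = values[i:i + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups

def scan(groups, lo, hi):
    """Read only groups whose [min, max] range overlaps [lo, hi]."""
    hits, groups_read = [], 0
    for g in groups:
        if g["max"] < lo or g["min"] > hi:
            continue  # pruned using stats alone, no row read
        groups_read += 1
        hits.extend(v for v in g["rows"] if lo <= v <= hi)
    return hits, groups_read

# Sorted column: a narrow filter touches a single row group out of ten.
sorted_col = make_row_groups(sorted(range(100)), group_size=10)
hits, groups_read = scan(sorted_col, 42, 47)
```

On unsorted data the same predicate would overlap many groups; sorting (e.g. via `sortWithinPartitions`) is what makes the statistics selective.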
Labs
Here are all the practical labs completed, covering Data Engineering fundamentals, from containerization to data pipelines.
- Lab 1: Environment & Docker
  - Skills Acquired: Environment setup, containerization.
  - Access Lab 1
- Lab 2: SQL & Data Modeling
  - Skills Acquired: Analytical queries, data structuring.
  - Access Lab 2
- Lab 3: Data Pipelines
  - Skills Acquired: Orchestration and transformation.
  - Access Lab 3
About this site
This portfolio is built using the “Docs as Code” approach:
- Generated with Quartz.
- Hosted on Cloudflare Pages.
- Secured by Cloudflare Zero Trust (Access Policies).