Welcome to our Data Engineering website

Information

  • Project Team: Justine Guirauden and Volcy Desmazures
  • School: ESIEE Paris (2025-2026)
  • Course: Data Engineering
  • Track: D — Aviation (OpenSky Network)
  • Objective: Engineering high-performance pipelines for data-intensive aviation workloads.

This site documents our practical labs as well as our final project on Big Data pipeline optimization.


Semester 2: Aviation Track & Data-Intensive Systems

During this semester, we focused on Track D (Aviation), processing real-world data from the OpenSky Network. We implemented complex pipelines covering streaming, indexing, and graph processing.

Major Project: End-to-End Aviation Lakehouse

We built a comprehensive pipeline to analyze European airspace traffic. Our system supports batch analytics, real-time monitoring, and advanced research tools.

  • Our Architecture: A multi-layered Lakehouse (Bronze/Silver/Gold) using PySpark and Parquet.
  • Key Features:
    • Real-time airspace density monitoring via Structured Streaming.
    • Centrality analysis of flight networks using PageRank.
    • Industrial-grade NLP corpus for aviation report retrieval.
  • Result: A reproducible framework capable of handling >100,000 aircraft records with optimized shuffle paths.
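The Bronze/Silver/Gold layering above runs in PySpark; the idea itself fits in a few lines. Here is a minimal stdlib-only sketch (the field names are illustrative stand-ins, not the real OpenSky schema): each layer takes the previous one and makes it stricter and more query-ready.

```python
# Minimal Bronze/Silver/Gold sketch (stdlib only; field names are
# illustrative, not the real OpenSky schema).
# Bronze = raw records, Silver = cleaned/typed, Gold = aggregated.

# Bronze: raw ingested records, strings everywhere, possibly malformed.
bronze = [
    {"icao24": "abc123", "alt": "11000", "country": "France"},
    {"icao24": "def456", "alt": "",      "country": "Germany"},  # missing altitude
    {"icao24": "abc789", "alt": "9500",  "country": "France"},
]

# Silver: drop malformed rows, cast types.
silver = [
    {"icao24": r["icao24"], "alt": int(r["alt"]), "country": r["country"]}
    for r in bronze
    if r["alt"].isdigit()
]

# Gold: aggregate per country (aircraft count and mean altitude).
gold = {}
for r in silver:
    g = gold.setdefault(r["country"], {"count": 0, "alt_sum": 0})
    g["count"] += 1
    g["alt_sum"] += r["alt"]
summary = {c: {"count": g["count"], "mean_alt": g["alt_sum"] / g["count"]}
           for c, g in gold.items()}
print(summary)
```

In the real pipeline each layer is a Parquet table and the transformations are Spark DataFrame operations, but the contract between layers is the same.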

Read our Full Aviation Report

Semester 2 Engineering Notes

  • Lab 1: Streaming Optimization. We achieved a 10x throughput improvement by tuning shuffle partitions and watermarks for METAR weather reports. View Note
  • Lab 2: Inverted Indexing. We designed a columnar search index for aviation text, reducing its storage footprint by 65.7% compared to raw CSV. View Note
  • Lab 3: Graph Processing. We modeled aircraft proximity as a graph and identified key traffic hubs using iterative algorithms. View Note
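The hub detection in Lab 3 can be illustrated with a toy iterative PageRank in plain Python (the airport codes and edges below are invented for the example; the actual lab runs the equivalent iteration at scale in Spark):

```python
# Toy PageRank over a tiny directed "flight network" (invented example
# data; the real lab runs the same fixed-point iteration on Spark).
graph = {
    "CDG": ["LHR", "FRA"],
    "LHR": ["CDG"],
    "FRA": ["CDG", "LHR"],
}

damping = 0.85
ranks = {node: 1.0 / len(graph) for node in graph}

for _ in range(50):  # iterate until ranks stabilize
    contribs = {node: 0.0 for node in graph}
    for node, outlinks in graph.items():
        share = ranks[node] / len(outlinks)  # split rank across outgoing edges
        for dest in outlinks:
            contribs[dest] += share
    ranks = {node: (1 - damping) / len(graph) + damping * c
             for node, c in contribs.items()}

hub = max(ranks, key=ranks.get)
print(hub, round(ranks[hub], 3))
```

With every node having outgoing edges, total rank mass is conserved at 1.0 across iterations, which is a useful sanity check when debugging the Spark version.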

Semester 1: Foundations

In the first semester, we focused on the fundamentals of the Data Engineering stack, from containerization to analytical modeling.


Our Technical Stack

  • Engine: Apache Spark (PySpark 3.x)
  • Storage: Parquet (Snappy), Lakehouse Architecture
  • Ops: Docker, Makefile, GitHub Actions
  • Deployment: Cloudflare Pages & Zero Trust

Final Project: Local Lakehouse & Optimization

We built a local Lakehouse capable of processing real, complex data while meeting strict service-level objectives (SLOs).

The Topic: Nutritional Analysis (Open Food Facts)

We analyzed the evolution of the nutritional quality of global food products (sugar, fat, Nutri-Score).

  • Data: ~1.1 GB of raw CSV, highly denormalized (>150 columns).
  • Stack: PySpark (Spark 3.x), Parquet, Local Single Node.
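A large part of taming such a wide dataset is projection: keeping only the handful of columns the analysis needs out of the >150 in the raw dump. A stdlib-only sketch of that step (column names are invented stand-ins for the real Open Food Facts schema, which the pipeline does in Spark):

```python
import csv
import io

# Wide "raw" CSV with many columns (invented names standing in for the
# >150-column Open Food Facts dump).
raw = io.StringIO()
writer = csv.writer(raw)
writer.writerow([f"col_{i}" for i in range(150)]
                + ["sugars_100g", "fat_100g", "nutriscore_grade"])
writer.writerow(["x"] * 150 + ["12.5", "3.1", "c"])
raw.seek(0)

# Projection: keep only the analysis columns before writing the next layer.
keep = ["sugars_100g", "fat_100g", "nutriscore_grade"]
reader = csv.DictReader(raw)
projected = [{k: row[k] for k in keep} for row in reader]

out = io.StringIO()
w = csv.DictWriter(out, fieldnames=keep)
w.writeheader()
w.writerows(projected)

# The projected output is a small fraction of the raw width.
print(len(out.getvalue()), "<", len(raw.getvalue()))
```

With Parquet the same projection also happens at read time, since a columnar reader only touches the columns a query selects.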

Key Results

We compared a “naive” pipeline (Baseline) against our optimized pipeline (Silver/Gold layers).

  Metric      | Result Obtained             | Technical Impact
  ----------- | --------------------------- | ----------------
  Storage     | −99.9% (1.1 GB → 0.34 MB)   | Snappy compression + drastic cleaning
  Speed (Q3)  | x3.4 faster                 | Predicate pushdown & data skipping
  Latency     | 228 ms                      | Optimized reads via sorting (sortWithinPartitions)

Access the Report

This project demonstrates how rigorous physical design (Sorting, Partitioning, Projection) can transform an unusable dataset into a high-performance Datamart.
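The data-skipping effect of sorting can be illustrated without Spark. Parquet stores min/max statistics per row group; once a column is sorted (e.g. via sortWithinPartitions), those ranges become narrow and disjoint, so a reader can prove most groups irrelevant to a selective predicate. A stdlib simulation of that mechanism:

```python
# Why sorting enables data skipping: per-chunk min/max stats (as Parquet
# keeps per row group) let a reader skip chunks that cannot match a
# range predicate. Stdlib simulation on a sorted integer column.

values = sorted(range(1000))             # sorted column, as after sortWithinPartitions
chunk_size = 100                         # stand-in for a Parquet row group
chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
stats = [(c[0], c[-1]) for c in chunks]  # per-chunk (min, max)

def scan(lo, hi):
    """Count chunks actually read for the predicate lo <= v <= hi."""
    read, hits = 0, []
    for (cmin, cmax), chunk in zip(stats, chunks):
        if cmax < lo or cmin > hi:
            continue                     # skipped: stats prove no match
        read += 1
        hits.extend(v for v in chunk if lo <= v <= hi)
    return read, hits

read, hits = scan(250, 280)
print(f"read {read}/{len(chunks)} chunks, {len(hits)} matching rows")
```

On unsorted data the same predicate would typically intersect every chunk's min/max range, forcing a full scan; sorting is what makes the statistics selective.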

Read the Full Project Report

View the Jupyter Notebook (Source Code)


Labs

Here are all the practical labs completed, covering Data Engineering fundamentals, from containerization to data pipelines.

  • Lab 1: Environment & Docker

    • Skills Acquired: Environment setup, containerization.
    • Access Lab 1
  • Lab 2: SQL & Data Modeling

    • Skills Acquired: Analytical queries, data structuring.
    • Access Lab 2
  • Lab 3: Data Pipelines

    • Skills Acquired: Orchestration and transformation.
    • Access Lab 3

About this site

This portfolio is built using the “Docs as Code” approach:

  • Generated with Quartz.
  • Hosted on Cloudflare Pages.
  • Secured by Cloudflare Zero Trust (Access Policies).