Welcome to our Data Engineering website

Information

  • Project Team: Justine Guirauden and Volcy Desmazures
  • School: ESIEE Paris (2025-2026)
  • Course: Data Engineering
  • Track: D — Aviation (OpenSky Network)
  • Objective: Engineering high-performance pipelines for data-intensive aviation workloads.

This site documents our practical labs as well as our final project on Big Data pipeline optimization.


Semester 2: Aviation Track & Data-Intensive Systems

During this semester, we focused on Track D (Aviation), processing real-world data from the OpenSky Network. We implemented complex pipelines covering streaming, indexing, and graph processing.

Major Project: End-to-End Aviation Lakehouse

We built a comprehensive pipeline to analyze European airspace traffic. Our system supports batch analytics, real-time monitoring, and advanced research tools.

  • Our Architecture: A multi-layered Lakehouse (Bronze/Silver/Gold) using PySpark and Parquet.
  • Key Features:
    • Real-time airspace density monitoring via Structured Streaming.
    • Centrality analysis of flight networks using PageRank.
    • Industrial-grade NLP corpus for aviation report retrieval.
  • Result: A reproducible framework capable of handling >100,000 aircraft records with optimized shuffle paths.
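The Bronze/Silver/Gold layering above runs in PySpark; the idea itself fits in a few lines. Here is a minimal stdlib-only sketch (the field names are illustrative stand-ins, not the real OpenSky schema): each layer takes the previous one and makes it stricter and more query-ready.

```python
# Minimal Bronze/Silver/Gold sketch (stdlib only; field names are
# illustrative, not the real OpenSky schema).
# Bronze = raw records, Silver = cleaned/typed, Gold = aggregated.

# Bronze: raw ingested records, strings everywhere, possibly malformed.
bronze = [
    {"icao24": "abc123", "alt": "11000", "country": "France"},
    {"icao24": "def456", "alt": "",      "country": "Germany"},  # missing altitude
    {"icao24": "abc789", "alt": "9500",  "country": "France"},
]

# Silver: drop malformed rows, cast types.
silver = [
    {"icao24": r["icao24"], "alt": int(r["alt"]), "country": r["country"]}
    for r in bronze
    if r["alt"].isdigit()
]

# Gold: aggregate per country (aircraft count and mean altitude).
gold = {}
for r in silver:
    g = gold.setdefault(r["country"], {"count": 0, "alt_sum": 0})
    g["count"] += 1
    g["alt_sum"] += r["alt"]
summary = {c: {"count": g["count"], "mean_alt": g["alt_sum"] / g["count"]}
           for c, g in gold.items()}
print(summary)
```

In the real pipeline each layer is a Parquet table and the transformations are Spark DataFrame operations, but the contract between layers is the same.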

Read our Full Aviation Report

Semester 2 Engineering Notes

  • Lab 1: Streaming Optimization. We achieved a 10x throughput improvement by tuning shuffle partitions and watermarks for METAR weather reports. View Note
  • Lab 2: Inverted Indexing. We designed a columnar search index for aviation text, reducing its storage footprint by 65.7% compared to raw CSV. View Note
  • Lab 3: Graph Processing. We modeled aircraft proximity as a graph and identified key traffic hubs using iterative algorithms. View Note
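The hub detection in Lab 3 can be illustrated with a toy iterative PageRank in plain Python (the airport codes and edges below are invented for the example; the actual lab runs the equivalent iteration at scale in Spark):

```python
# Toy PageRank over a tiny directed "flight network" (invented example
# data; the real lab runs the same fixed-point iteration on Spark).
graph = {
    "CDG": ["LHR", "FRA"],
    "LHR": ["CDG"],
    "FRA": ["CDG", "LHR"],
}

damping = 0.85
ranks = {node: 1.0 / len(graph) for node in graph}

for _ in range(50):  # iterate until ranks stabilize
    contribs = {node: 0.0 for node in graph}
    for node, outlinks in graph.items():
        share = ranks[node] / len(outlinks)  # split rank across outgoing edges
        for dest in outlinks:
            contribs[dest] += share
    ranks = {node: (1 - damping) / len(graph) + damping * c
             for node, c in contribs.items()}

hub = max(ranks, key=ranks.get)
print(hub, round(ranks[hub], 3))
```

With every node having outgoing edges, total rank mass is conserved at 1.0 across iterations, which is a useful sanity check when debugging the Spark version.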

Semester 1: Foundations

In the first semester, we focused on the fundamentals of the Data Engineering stack, from containerization to analytical modeling.


Our Technical Stack

  • Engine: Apache Spark (PySpark 3.x)
  • Storage: Parquet (Snappy), Lakehouse Architecture
  • Ops: Docker, Makefile, GitHub Actions
  • Deployment: Cloudflare Pages & Zero Trust

Final Project: Local Lakehouse & Optimization

We built a local Lakehouse capable of processing real, complex data while meeting strict service-level objectives (SLOs).

The Topic: Nutritional Analysis (Open Food Facts)

We analyzed the evolution of the nutritional quality of global food products (sugar, fat, Nutri-Score).

  • Data: ~1.1 GB of raw CSV, highly denormalized (>150 columns).
  • Stack: PySpark (Spark 3.x), Parquet, Local Single Node.
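A large part of taming such a wide dataset is projection: keeping only the handful of columns the analysis needs out of the >150 in the raw dump. A stdlib-only sketch of that step (column names are invented stand-ins for the real Open Food Facts schema, which the pipeline does in Spark):

```python
import csv
import io

# Wide "raw" CSV with many columns (invented names standing in for the
# >150-column Open Food Facts dump).
raw = io.StringIO()
writer = csv.writer(raw)
writer.writerow([f"col_{i}" for i in range(150)]
                + ["sugars_100g", "fat_100g", "nutriscore_grade"])
writer.writerow(["x"] * 150 + ["12.5", "3.1", "c"])
raw.seek(0)

# Projection: keep only the analysis columns before writing the next layer.
keep = ["sugars_100g", "fat_100g", "nutriscore_grade"]
reader = csv.DictReader(raw)
projected = [{k: row[k] for k in keep} for row in reader]

out = io.StringIO()
w = csv.DictWriter(out, fieldnames=keep)
w.writeheader()
w.writerows(projected)

# The projected output is a small fraction of the raw width.
print(len(out.getvalue()), "<", len(raw.getvalue()))
```

With Parquet the same projection also happens at read time, since a columnar reader only touches the columns a query selects.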

Key Results

We compared a “naive” pipeline (Baseline) against our optimized pipeline (Silver/Gold layers).

  Metric      | Result Obtained             | Technical Impact
  ----------- | --------------------------- | ----------------
  Storage     | −99.9% (1.1 GB → 0.34 MB)   | Snappy compression + drastic cleaning
  Speed (Q3)  | x3.4 faster                 | Predicate pushdown & data skipping
  Latency     | 228 ms                      | Optimized reads via sorting (sortWithinPartitions)

Access the Report

This project demonstrates how rigorous physical design (Sorting, Partitioning, Projection) can transform an unusable dataset into a high-performance Datamart.
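The data-skipping effect of sorting can be illustrated without Spark. Parquet stores min/max statistics per row group; once a column is sorted (e.g. via sortWithinPartitions), those ranges become narrow and disjoint, so a reader can prove most groups irrelevant to a selective predicate. A stdlib simulation of that mechanism:

```python
# Why sorting enables data skipping: per-chunk min/max stats (as Parquet
# keeps per row group) let a reader skip chunks that cannot match a
# range predicate. Stdlib simulation on a sorted integer column.

values = sorted(range(1000))             # sorted column, as after sortWithinPartitions
chunk_size = 100                         # stand-in for a Parquet row group
chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
stats = [(c[0], c[-1]) for c in chunks]  # per-chunk (min, max)

def scan(lo, hi):
    """Count chunks actually read for the predicate lo <= v <= hi."""
    read, hits = 0, []
    for (cmin, cmax), chunk in zip(stats, chunks):
        if cmax < lo or cmin > hi:
            continue                     # skipped: stats prove no match
        read += 1
        hits.extend(v for v in chunk if lo <= v <= hi)
    return read, hits

read, hits = scan(250, 280)
print(f"read {read}/{len(chunks)} chunks, {len(hits)} matching rows")
```

On unsorted data the same predicate would typically intersect every chunk's min/max range, forcing a full scan; sorting is what makes the statistics selective.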

Read the Full Project Report

View the Jupyter Notebook (Source Code)


Labs

Here are all the practical labs completed, covering Data Engineering fundamentals, from containerization to data pipelines.

  • Lab 1: Environment & Docker

    • Skills Acquired: Environment setup, containerization.
    • Access Lab 1
  • Lab 2: SQL & Data Modeling

    • Skills Acquired: Analytical queries, data structuring.
    • Access Lab 2
  • Lab 3: Data Pipelines

    • Skills Acquired: Orchestration and transformation.
    • Access Lab 3

About this site

This portfolio is built using the “Docs as Code” approach:

  • Generated with Quartz.
  • Hosted on Cloudflare Pages.
  • Secured by Cloudflare Zero Trust (Access Policies).