DE2 — Final Project Report

Course: Data Engineering II (Data-Intensive Workloads) — ESIEE Paris 2025-2026
Track: D — Aviation (OpenSky Network)
Iterative path: A — Graph Processing
Names: Justine Guirauden and Volcy Desmazures
Date: May 2026


1. Use-case and Dataset

Problem statement

Build a reproducible, end-to-end data-intensive pipeline over European airspace traffic data that enables:

  1. Analysts to query per-sector aircraft density and per-aircraft kinematic profiles (batch ETL)
  2. Operations to monitor real-time airspace load via streaming aggregations
  3. NLP engineers to retrieve aviation-domain reports by keyword (inverted index)
  4. Researchers to identify the most central aircraft in the traffic network (PageRank)
  5. LLM teams to consume a curated, deduplicated aviation text corpus

Dataset

PropertyValue
SourceOpenSky Network — states_2017-06-05-00.csv
FormatCSV, 16 columns, header row
Records~5 000 (synthetic) / up to 100 000+ (real OpenSky snapshot)
Time range2017-06-05 00:00–01:00 UTC (1-hour window)
Known issues~3–5% null baroaltitude; whitespace padding in callsign; no row deduplication in raw file

Schema (silver, after typing)

time (Long), icao24 (String, PK), lat/lon (Double), velocity/heading/vertrate (Double), callsign (String), onground/alert/spi (Boolean), squawk (String), baroaltitude/geoaltitude (Double), lastposupdate/lastcontact (Double), derived: event_time (Timestamp), year/month (Int), row_id (Long, surrogate)


2. System and SLOs

Hardware & Spark config

  • Machine: laptop, Intel Core i7, 16 GB RAM
  • OS: Ubuntu 24 (WSL2) or macOS
  • Spark: 4.0.0, Python 3.10, JupyterLab, local[*]
  • spark.sql.shuffle.partitions = 8 (tuned for local single-node)

SLOs

SLOThresholdStage
Streaming throughput≥ 100 rows/s processedStreaming
Text query latency≤ 2 000 ms (single-term, cached index)Text
PageRank convergencemax Δrank < 0.001Graph
Parquet/CSV ratio≤ 60% (Parquet ≤ 60% of CSV size)Text / ETL
Full pipeline latency≤ 10 min on local machineEnd-to-end
LLM pass ratio≥ 80% of documents pass quality filtersLLM prep

3. Batch ETL Pipeline Design

Lineage diagram

data/project/raw/*.csv  (Bronze — immutable, as-is CSV)
        │
        │  schema enforcement (opensky_schema, PERMISSIVE mode)
        │  filter: icao24 IS NOT NULL AND time IS NOT NULL
        │  domain validation: lat ∈ [-90,90], lon ∈ [-180,180], velocity ≥ 0
        │  derive: event_time = from_unixtime(time)
        │          year, month from event_time
        │          callsign = trim(callsign)
        │          row_id = xxhash64(icao24, time)
        │  deduplicate: dropDuplicates(["icao24","time"])
        ▼
outputs/project/silver/  (Parquet — typed, clean, deduplicated)
        │
        ├──► gold/density/        groupBy(year,month,sector,hour).agg(count,avg_vel,avg_alt)
        │                         partitionBy(year, month)
        │
        └──► gold/aircraft_profile/  groupBy(icao24,callsign,year,month).agg(n_pos,avg_vel,max_alt)
                                     partitionBy(year, month)

Schema contracts

  • Nullability: icao24 and time are NOT NULL (dropped if missing)
  • Domain: lat/lon validated; invalid velocity rows filtered
  • Surrogate key: xxhash64(icao24, time) — fast, collision-resistant, no shuffle required
  • Natural key: (icao24, time) — used for deduplication

Key metrics

RunStageRowsBytesTime (ms)
r1Bronze landing1 526 689199 784 9276 811
r1Bronze → Silver1 526 68961 052 46013 381
r1Silver → Gold362 21312 690

Note: 0 rows removed at deduplication step — no duplicates found in the raw file for this run. Bronze size is the raw CSV on disk (163.9 MiB read by Spark per the physical plan); the Silver Parquet footprint (61 MB) represents a 3.3× compression over the shuffle-written intermediate.


4. Streaming Ingestion

Configuration

  • Source: file-based (readStream.csv), maxFilesPerTrigger=1
  • Event time: from_unixtime(time) cast to Timestamp
  • Watermark: 10 minutes (tolerates late ATC data re-transmissions)
  • Window: 5-minute tumbling, keyed by (sector_lat, sector_lon)
  • Aggregates: count(*), avg(velocity), max(baroaltitude), countDistinct(icao24)
  • Output mode: append (only closed windows emitted)
  • Checkpoint: outputs/project/streaming_checkpoint/ (exactly-once)
  • Trigger: processingTime = "10 seconds"

Evidence

MetricMeasuredSLOStatus
processedRowsPerSecond57 883≥ 100PASS
triggerExecution (ms)2 971 (avg, batches 2–8)
inputRowsPerSecond15 146

Steady-state measurements taken over batches 2–8 (batch 0 excluded as cold-start). The Spark Streaming UI confirms Avg Process/sec = 57 882.86 and Avg Input/sec = 15 145.78 over a 7-minute run (9 batches total, latest batch 7).

proof/query_progress.json contains the full lastProgress JSON.
proof/plan_streaming.txt contains the formatted physical plan.


5. Text Processing

Corpus

Built from the OpenSky dataset: each aircraft position record is converted to a structured English sentence describing status, altitude, velocity, heading, and squawk code. One document = one record (doc_id = icao24_time).

Normalization pipeline

  1. lower() → uniform case
  2. regexp_replace([^a-z0-9\s], "") → strip punctuation
  3. split([\s\W]+) → tokenize
  4. filter(len > 1) → remove single-char tokens
  5. Stop-word filter: 36-word list including METAR-specific filler (meters, knots, degrees, spelled-out digits)

Index schema

token (String) | doc_ids (Array<String>) | freq (Long)

Query latency

Termfreqpostingslatency (ms)SLO ≤ 2 000 ms
aircraft1 526 6891 526 6896 907FAIL
landing00471PASS
squawk1 526 6891 526 6893 105FAIL

Analysis: aircraft and squawk appear in every document (universal terms in the synthetic corpus), producing maximal posting lists (1 526 689 entries) that exceed the 2 s SLO. landing is absent from the vocabulary (freq = 0), yielding a fast index-miss path (471 ms). The SLO breach reflects the degenerate distribution of a synthetic corpus — in a real, varied METAR corpus, high-selectivity terms would dominate queries. Mitigation: add a TF-IDF cutoff to skip ultra-high-frequency tokens, or pre-materialise posting-list sizes to short-circuit at query time.

Storage footprint

FormatSize (bytes)FilesNotes
Parquet104 788 488Column pruning on token filter
CSV304 244 880Full scan, no predicate pushdown
Ratio34.44%SLO ≤ 60% — PASS

6. Iterative Workload — Graph PageRank

Choice and rationale

Path A — Graph Processing. The OpenSky dataset provides position data (lat, lon, time) that naturally encodes spatial proximity relationships. A route graph is not available in the raw feed, so a proximity co-presence graph is constructed: two aircraft sharing the same 2°×2° lat/lon sector in the same 5-minute window are connected by an edge. PageRank on this graph identifies aircraft most central in the traffic flow — those frequenting dense corridors (LFPG approach, Mediterranean overflight).

Algorithm

  • Vertices: unique icao24 identifiers (airborne only)
  • Edges: distinct co-present pairs per sector/window (strict inequality to avoid duplicates)
  • PageRank formula: rank(v) = (1−d)/N + d × Σ rank(u)/outdeg(u)
  • Damping: d = 0.85 | Convergence: max|Δrank| < 0.001

Graph statistics

MetricValue
Vertices6 264
Edges256 461

Per-iteration metrics

IterBaseline (ms)Repartitioned (ms)Delta (ms)
01 171535−636
1869481−388
2711578−133
3659609−50
4764849+85
Avg835610−225

Convergence reached at iteration 4: max Δrank = 0.000971 < 0.001 PASS.

Partitioning comparison

StrategyAvg iter (ms)Shuffle behaviour
Default hash835Exchange on both sides of join
repartition(8, "src")610Exchange only on ranks side
Gain≈ 27%

proof/plan_before.txt and proof/plan_after.txt confirm the Exchange node reduction.

Skew analysis

MetricValue
Skew ratio (max/mean)10.3×
Graph size6 264 vertices, 256 461 edges

A ratio of 10.3× indicates moderate skewrepartition(8, "src") is sufficient to reduce join shuffle. The threshold for more aggressive measures (salting, hot-vertex filtering) would be > 15×.


7. LLM Data Readiness

Pipeline

silver data (text corpus)
    │  filter: text IS NOT NULL
    │  filter: length(text) ≥ 100 chars
    │  withColumn: content_hash = xxhash64(text)
    │  dropDuplicates(["content_hash"])
    │  withColumn: word_count, source, version, curated_at
    ▼
outputs/project/llm_ready/  (Parquet)

Data card

FieldValue
NameDE2 Aviation LLM-Ready Dataset
SourceOpenSky Network states_2017-06-05-00 (Track D)
Size0 rows / 0 bytes
Schemadoc_id, text, word_count, content_hash, source, version, curated_at
Quality filtersNOT NULL, length ≥ 100, xxhash64 dedup
Raw input1 526 689 rows
Pass ratio0.00% 45.36% (SLO: ≥ 80%) — FAIL
Intended useRAG retrieval over aviation reports; LLM fine-tuning for aviation NLP
Versionv1.0

Root cause: The synthetic METAR sentences generated from the OpenSky schema are structurally short (typically < 100 characters), causing all 1 526 689 rows to be rejected by the length(text) ≥ 100 quality filter. Fix: lower the threshold to ≥ 50 chars for synthetic data, or enrich the sentence template to include full field descriptions (callsign, squawk, sector, trajectory vector) to push average length above 100 chars. On real METAR reports the filter would pass comfortably.

Lowering the quality filter threshold from 100 to 50 characters improved the retention rate to 45.36%, but it remains insufficient to meet the 80% pass ratio SLO; this confirms that the synthetic METAR sentences lack structural diversity and are predominantly removed during the dropDuplicates step.

This is a data-engineering deliverable only — no LLM is run. The pipeline prepares structured Parquet files consumable by any HuggingFace datasets.load_dataset("parquet", ...) call.


8. Physical Design & Optimization

Strategies applied

TechniqueWhereEffect
shuffle.partitions = 8All stagesRemoves 192 empty tasks on 8-core local machine
partitionBy(year, month)Gold tablesEnables partition pruning on date-range queries
repartition(8, "src")PageRankCollocates out-edges with rank data, reduces join shuffle
.cache().count() in PageRank loopIterativeBreaks DAG lineage growth, prevents exponential plan depth
xxhash64 surrogate keySilverNo shuffle needed; pure local hash
Parquet for silver/gold/index/LLMAll batchColumn pruning + Snappy compression

Before/after plans

  • proof/plan_etl_silver.txt — silver cleaning plan (SortMergeJoin on dedup key)
  • proof/plan_before.txt — PageRank join with full exchange on both sides
  • proof/plan_after.txt — PageRank join with exchange only on ranks side

Metrics comparison

StageMetricBeforeAfterGain
PageRankavg iter (ms)83561027%
Text indexParquet/CSV ratio34.44%PASS
LLM preppass ratio0.00%FAIL (template too short)

9. Results and Limits

SLO summary

SLOThresholdResultStatus
Streaming throughput≥ 100 rows/s57 883 rows/sPASS
Text query latency≤ 2 000 ms6 907 ms (aircraft) / 471 ms (landing)PARTIAL — universal terms fail
Storage ratio≤ 60%34.44%PASS
LLM pass ratio≥ 80%0.00%FAIL (synthetic sentences too short)
PageRank convergenceΔ < 0.001Δ = 0.000971 at iter 4PASS

Pipeline integration

Raw OpenSky CSV
 └─ Bronze (CSV, immutable)                          199.8 MB — 1 526 689 rows
     └─ Silver (Parquet, typed+deduped)               61.1 MB — 1 526 689 rows (0 dupes removed)
         ├─ Gold/density + Gold/aircraft_profile       0.4 MB — partitionBy(year,month)
         ├─ Streaming (5-min windows → Parquet sink)   57 883 rows/s processed  [feeds real-time dashboards]
         ├─ Text corpus → inverted index              104.8 MB Parquet | 11 823 unique tokens | ratio 34.44%
         ├─ Proximity graph → PageRank               6 264 vertices | 256 461 edges | converges iter 4
         └─ LLM-ready dataset                          0 rows (length filter rejects synthetic sentences)

Known limits

  • Scale: real SLO validation requires ≥ 10M rows; synthetic dataset of 10 000 rows meets structural requirements but may not stress-test shuffle paths
  • Streaming: file-based source simulates real streaming; a Kafka source would reflect production latency more accurately
  • Graph: proximity graph depends on sector granularity; 2° cells may merge distinct approach paths in dense areas (LFPG + LFPO both in Paris sector)
  • LLM quality: synthetic METAR sentences are structurally short (< 100 chars), causing a 0% pass rate on the length filter; raising the threshold to ≥ 50 chars or enriching the template would restore the SLO
  • Text index SLO: universal terms (aircraft, squawk) generate maximal posting lists that systematically exceed the 2 s latency threshold — the SLO applies to selective, real-world vocabulary

Future work

  • Replace file-based streaming with Kafka connector
  • Add TF-IDF features alongside inverted index for relevance ranking; pre-filter tokens with df > 90% of corpus
  • Run PageRank on the full 24-hour OpenSky snapshot (≥ 50 000 aircraft) to trigger real skew
  • Add language detection filter to LLM pipeline (fastText)
  • Enrich METAR sentence template to exceed the 100-char quality threshold

DE2 Final Project Report — ESIEE Paris 2025-2026 — Track D Aviation