Engineering Note - DE2 Assignment 1 : Streaming Pipeline Optimization
Course: Data Engineering II - ESIEE Paris 2025-2026
Track: D - Aviation (METAR weather reports)
1. Pipeline Design
The pipeline implements a Structured Streaming architecture designed to process aviation data in real-time.
OpenSky CSV dataset
│ (split into 10 batches, dropped one-by-one)
▼
data/landing/ ←── readStream (CSV, schema=event_schema, maxFilesPerTrigger=1)
│
│ .withColumn("event_time", from_unixtime(col("time")).cast("timestamp"))
▼
df_stream (16 columns + event_time)
│
│ .withWatermark("event_time", WATERMARK_DELAY)
│ .groupBy(window("event_time", "10 min"), "icao24")
│ .agg(count, avg(velocity), max(baroaltitude))
▼
windowed DataFrame
│
│ writeStream → format("parquet"), outputMode("append")
│ trigger(processingTime=TRIGGER)
│ checkpointLocation → outputs/lab1/checkpoint/
▼
outputs/lab1/stream_sink/ (Parquet part files, partitioned by trigger batch)
The source simulates a real feed by copying pre-split CSV files into data/landing/. Spark’s FileStreamSource detects new files via directory listing, guaranteeing exactly-once ingestion combined with the checkpoint WAL.
- Transformations: Windowed aggregation (10-minute windows, 5-minute sliding) to count occurrences per
icao24. - Sink: Parquet files in
appendmode with full checkpointing for fault tolerance and exactly-once semantics.
2. Watermark Reasoning
We implemented a Watermark (set to 5 minutes in the final run) on the event_time column.
- Goal: Manage memory by allowing Spark to discard “old” state data that falls outside the watermark threshold.
- Impact: By reducing the watermark from 10m to 5m, we ensure the State Store remains lean, preventing unbounded memory growth during long-running streaming jobs.
3. Performance & Optimization Results
We compared three configurations (r1, r2, r3) to measure the impact of shuffle and trigger optimizations.
| Configuration | Shuffle Partitions | Trigger Interval | Avg Process Rate (rows/sec) | Status |
|---|---|---|---|---|
| r1 (Baseline) | 200 (Default) | 10s | ~8,958 | Completed |
| r2 (Optimized) | 5 | 30s | ~51,362 | Completed |
| r3 (Final) | 5 | 30s | ~90,657 | Active/Stable |
Analysis of Gains:
- Shuffle Optimization: Changing
spark.sql.shuffle.partitionsfrom 200 to 5 was the most significant factor. In local mode, 200 partitions created massive overhead for small batches. Reducing this allowed the CPU to focus on processing rather than task scheduling. - Throughput: We achieved a 10x improvement in processing speed between r1 and r3.
- Stability: The DAG visualization for r3 confirms a streamlined execution plan with minimal exchange overhead.
4. Conclusion
The optimized pipeline (r3) demonstrates that for local or low-parallelism environments, reducing partition overhead and aligning trigger intervals with data arrival rates significantly improves throughput and reduces latency.
Authored by Justine Guirauden & Volcy Desmazures - ESIEE Paris DE2 Assignment 1, April 2026