Engineering Note - DE2 Assignment 1 : Streaming Pipeline Optimization

Course: Data Engineering II - ESIEE Paris 2025-2026
Track: D - Aviation (METAR weather reports)

1. Pipeline Design

The pipeline implements a Structured Streaming architecture designed to process aviation data in real-time.

OpenSky CSV dataset
        │  (split into 10 batches, dropped one-by-one)
        ▼
 data/landing/   ←── readStream (CSV, schema=event_schema, maxFilesPerTrigger=1)
        │
        │  .withColumn("event_time", from_unixtime(col("time")).cast("timestamp"))
        ▼
 df_stream  (16 columns + event_time)
        │
        │  .withWatermark("event_time", WATERMARK_DELAY)
        │  .groupBy(window("event_time", "10 min"), "icao24")
        │  .agg(count, avg(velocity), max(baroaltitude))
        ▼
 windowed DataFrame
        │
        │  writeStream → format("parquet"), outputMode("append")
        │               trigger(processingTime=TRIGGER)
        │               checkpointLocation → outputs/lab1/checkpoint/
        ▼
 outputs/lab1/stream_sink/  (Parquet part files, partitioned by trigger batch)

The source simulates a real feed by copying pre-split CSV files into data/landing/. Spark’s FileStreamSource detects new files via directory listing, guaranteeing exactly-once ingestion combined with the checkpoint WAL.

  • Transformations: Windowed aggregation (10-minute windows, 5-minute sliding) to count occurrences per icao24.
  • Sink: Parquet files in append mode with full checkpointing for fault tolerance and exactly-once semantics.

2. Watermark Reasoning

We implemented a Watermark (set to 5 minutes in the final run) on the event_time column.

  • Goal: Manage memory by allowing Spark to discard “old” state data that falls outside the watermark threshold.
  • Impact: By reducing the watermark from 10m to 5m, we ensure the State Store remains lean, preventing unbounded memory growth during long-running streaming jobs.

3. Performance & Optimization Results

We compared three configurations (r1, r2, r3) to measure the impact of shuffle and trigger optimizations.

ConfigurationShuffle PartitionsTrigger IntervalAvg Process Rate (rows/sec)Status
r1 (Baseline)200 (Default)10s~8,958Completed
r2 (Optimized)530s~51,362Completed
r3 (Final)530s~90,657Active/Stable

Analysis of Gains:

  • Shuffle Optimization: Changing spark.sql.shuffle.partitions from 200 to 5 was the most significant factor. In local mode, 200 partitions created massive overhead for small batches. Reducing this allowed the CPU to focus on processing rather than task scheduling.
  • Throughput: We achieved a 10x improvement in processing speed between r1 and r3.
  • Stability: The DAG visualization for r3 confirms a streamlined execution plan with minimal exchange overhead.

4. Conclusion

The optimized pipeline (r3) demonstrates that for local or low-parallelism environments, reducing partition overhead and aligning trigger intervals with data arrival rates significantly improves throughput and reduces latency.

Authored by Justine Guirauden & Volcy Desmazures - ESIEE Paris DE2 Assignment 1, April 2026