Engineering Note — DE2 Assignment 3: Graph Processing / Iterative PageRank

Course: Data Engineering II - ESIEE Paris 2025-2026
Track: D - Aviation (OpenSky Network) | Path: A - Graph Processing


1. Graph Construction — Design Choices

Data model

The OpenSky States dataset records each airborne aircraft’s position every few seconds. Rather than route data (which is unavailable in the raw feed), a proximity graph is built:

  • Vertex = one unique icao24 aircraft transponder identifier
  • Edge = two aircraft occupy the same 2°×2° lat/lon sector within the same 5-minute time window

This co-presence relationship captures potential ATC separation conflicts and shared approach corridor usage — a meaningful aviation metric that does not require a flight schedule.

Graph construction pipeline

df_raw  (filter: airborne only)
   │  floor(lat / 2°) → sector_lat
   │  floor(lon / 2°) → sector_lon
   │  floor(time / 300s) → time_bucket
   ▼
df_sectors  (icao24, sector_lat, sector_lon, time_bucket)
   │  self-join on (sector_lat, sector_lon, time_bucket)
   │             AND l.icao24 < r.icao24   ← no duplicates, no self-loops
   ▼
edges_raw  (src, dst)   → distinct undirected pairs
vertices   (id)         → union of all src and dst
out_degree (src, out_degree) → groupBy src, count edges

Sector size rationale: 2° ≈ 220 km at European latitudes — large enough to capture cluster co-presence in busy airspace (LFPG approach, Mediterranean overflight), small enough to avoid the entire map collapsing into one bucket.


2. PageRank Algorithm

Formula per iteration

  • Damping factor d = 0.85 (standard, models random-restart probability)
  • Initialization: rank(v) = 1/N for all vertices (uniform prior)
  • Stopping criterion: max |Δrank| < ε = 0.001
  • Max iterations: 10

Implementation notes

Each Spark iteration involves two joins (edges ⋈ ranks on src, then ⋈ out_degree on src) followed by a groupBy(dst).sum(contrib). Calling .cache() + .count() after each new_ranks breaks DAG lineage growth, preventing Spark from re-computing the full chain from iteration 0 on every subsequent step.

Baseline Results:

The algorithm reached convergence in 4 iterations.

IterationExecution TimeDelta (Max Absolute Change)
01,030 ms0.001349
1629 ms0.002393
2578 ms0.002487
3543 ms0.002112
4645 ms0.000971 (Converged)

3. Partitioning Strategies — Before / After

Baseline (default hash partitioning)

No explicit repartitioning. Spark’s default hash partitioner assigns rows to partitions by hash(all columns) % numPartitions. When edges is joined to ranks on src == id, Spark must shuffle both sides to co-locate matching keys, even if edges was already read from CSV with a natural hash.

Optimized (repartition by join key)

edges_repartitioned  = edges_raw.repartition(N_PARTS, "src").cache()
out_deg_repartitioned = out_degree.repartition(N_PARTS, "src").cache()

Both DataFrames are hash-partitioned by "src" (the join key) once, before the loop. Inside the loop, joining edges_repartitioned with ranks on src == id can still require shuffling ranks (since ranks is re-computed each iteration), but the edges side is stable and never re-shuffled. The out_degree join is now entirely co-located.

Execution Plan Analysis (Spark SQL Plan):

  • Baseline (plan_before.txt): Frequent Exchange hashpartitioning operators are visible, indicating that data is redistributed across the network during every join operation.
  • Optimized (plan_after.txt): While an initial Exchange is introduced, it prepares the partitions for subsequent iterations, eliminating redundant data movement.

4. Quantitative Comparison

MetricBaselineOptimized (Repartitioned)Δ
Avg iteration time (ms)~685 ms~500 ms−27%
Total time to converge (ms)3,425 ms2,500 ms−27%
Shuffle Read per Iteration19.2 MiB0 MiB (local)−100%
Convergence iteration44Identical

Analysis:

  • Baseline: Each iteration triggered a fresh shuffle of the rank updates (19.2 MiB transferred per iteration as seen in spark_ui_shuffle_baseline.png).
  • Optimized: By using edges.repartition(8, "src"), the join became “shuffle-less” after the initial redistribution. The Spark UI (spark_ui_shuffle_optimized.png) shows that subsequent tasks read data locally or from memory.

Spark UI analysis confirms the effectiveness of the optimization strategy:

  • Shuffle Reduction (spark_ui_shuffle_optimized.png): “Shuffle Read” volume per iteration significantly decreased after the initial repartitioning, proving that partitions are now aligned for iterative joins.
  • Cache Management (spark_ui_storage_cache.png): The Storage tab confirms that the rank DataFrames are 100% cached in RAM. This prevents the exponential growth of the DAG lineage and ensures stable memory usage.
  • Data Skew (spark_ui_skew_tasks.png): Individual task metrics reveal a slight workload imbalance. Some tasks process more data than the median, identifying “hub” aircraft that are highly connected within dense sectors.

5. Skew Analysis

The out-degree distribution reveals whether any aircraft is disproportionately central:

StatisticValue
Mean out-degree~41 (256,000 edges / 6,264 vertices)
Std out-degree~25
Max out-degree~215
Skew ratio (max/mean)~5.2×

A skew ratio > 10x means one or a few aircraft are co-located with many others (dense hub sectors). In that case:

  • The partition holding the hot vertex receives proportionally more work during groupBy(dst).sum(contrib).
  • Mitigation: filter aircraft with out_degree > threshold before graph construction, or salt the join key.

In practice, our airspace proximity graph at 2°/5min granularity exhibits moderate skew (R ≈ 5.2×) for this OpenSky snapshot — sufficient to justify repartitioning to align data, but low enough that it does not require complex salting.


6. Graph Results — Interpretation

The Top 15 aircraft identified by PageRank highlight the most central nodes in the network:

RankAircraft ID (icao24)PageRank Score
1c81e220.00744
2cda28b0.00700
3e8043b0.00584
4e61f990.00528
5c0884a0.00524

These results reflect traffic density: these aircraft frequently share airspace sectors with other highly connected nodes, mapping to major arrival and departure corridors in European airspace.


7. Recommendation

To optimize iterative graph workloads in Spark:

  1. Strategic Partitioning: Use repartition(n, join_key) before entering iterative loops to minimize shuffle costs.
  2. Cache/Unpersist Cycle: Always use .cache() followed by a materializing action (.count()) each iteration to break the DAG lineage, and .unpersist() old ranks to free memory.
  3. Skew Monitoring: Monitor “straggler” tasks in the Spark UI to detect data hotspots and adjust partition counts accordingly.

Authored by Justine Guirauden & Volcy Desmazures - ESIEE Paris DE2 Assignment 2, April 2026