Engineering Note — DE2 Assignment 3: Graph Processing / Iterative PageRank

Course: Data Engineering II - ESIEE Paris 2025-2026
Track: D - Aviation (OpenSky Network) | Path: A - Graph Processing

1. Graph Construction — Design Choices

Data model

The OpenSky States dataset records each airborne aircraft’s position every few seconds. Rather than route data (which is unavailable in the raw feed), a proximity graph is built:

Vertex = one unique icao24 aircraft transponder identifier
Edge = two aircraft occupy the same 2°×2° lat/lon sector within the same 5-minute time window

This co-presence relationship captures potential ATC separation conflicts and shared approach corridor usage — a meaningful aviation metric that does not require a flight schedule.

Graph construction pipeline

df_raw  (filter: airborne only)
   │  floor(lat / 2°) → sector_lat
   │  floor(lon / 2°) → sector_lon
   │  floor(time / 300s) → time_bucket
   ▼
df_sectors  (icao24, sector_lat, sector_lon, time_bucket)
   │  self-join on (sector_lat, sector_lon, time_bucket)
   │             AND l.icao24 < r.icao24   ← no duplicates, no self-loops
   ▼
edges_raw  (src, dst)   → distinct undirected pairs
vertices   (id)         → union of all src and dst
out_degree (src, out_degree) → groupBy src, count edges

Sector size rationale: 2° ≈ 220 km at European latitudes — large enough to capture cluster co-presence in busy airspace (LFPG approach, Mediterranean overflight), small enough to avoid the entire map collapsing into one bucket.

2. PageRank Algorithm

Formula per iteration

$rank (v) = \frac{1 - d}{N} + d \cdot \sum_{u \in in-neighbours (v)} \frac{rank ( u )}{outdeg ( u )}$

Damping factor d = 0.85 (standard, models random-restart probability)
Initialization: rank(v) = 1/N for all vertices (uniform prior)
Stopping criterion: max |Δrank| < ε = 0.001
Max iterations: 10

Implementation notes

Each Spark iteration involves two joins (edges ⋈ ranks on src, then ⋈ out_degree on src) followed by a groupBy(dst).sum(contrib). Calling .cache() + .count() after each new_ranks breaks DAG lineage growth, preventing Spark from re-computing the full chain from iteration 0 on every subsequent step.

Baseline Results:

The algorithm reached convergence in 4 iterations.

Iteration	Execution Time	Delta (Max Absolute Change)
0	1,030 ms	0.001349
1	629 ms	0.002393
2	578 ms	0.002487
3	543 ms	0.002112
4	645 ms	0.000971 (Converged)

3. Partitioning Strategies — Before / After

Baseline (default hash partitioning)

No explicit repartitioning. Spark’s default hash partitioner assigns rows to partitions by hash(all columns) % numPartitions. When edges is joined to ranks on src == id, Spark must shuffle both sides to co-locate matching keys, even if edges was already read from CSV with a natural hash.

Optimized (repartition by join key)

edges_repartitioned  = edges_raw.repartition(N_PARTS, "src").cache()
out_deg_repartitioned = out_degree.repartition(N_PARTS, "src").cache()

Both DataFrames are hash-partitioned by "src" (the join key) once, before the loop. Inside the loop, joining edges_repartitioned with ranks on src == id can still require shuffling ranks (since ranks is re-computed each iteration), but the edges side is stable and never re-shuffled. The out_degree join is now entirely co-located.

Execution Plan Analysis (Spark SQL Plan):

Baseline (plan_before.txt): Frequent Exchange hashpartitioning operators are visible, indicating that data is redistributed across the network during every join operation.
Optimized (plan_after.txt): While an initial Exchange is introduced, it prepares the partitions for subsequent iterations, eliminating redundant data movement.

4. Quantitative Comparison

Metric	Baseline	Optimized (Repartitioned)	Δ
Avg iteration time (ms)	~685 ms	~500 ms	−27%
Total time to converge (ms)	3,425 ms	2,500 ms	−27%
Shuffle Read per Iteration	19.2 MiB	0 MiB (local)	−100%
Convergence iteration	4	4	Identical

Analysis:

Baseline: Each iteration triggered a fresh shuffle of the rank updates (19.2 MiB transferred per iteration as seen in spark_ui_shuffle_baseline.png).
Optimized: By using edges.repartition(8, "src"), the join became “shuffle-less” after the initial redistribution. The Spark UI (spark_ui_shuffle_optimized.png) shows that subsequent tasks read data locally or from memory.

Spark UI analysis confirms the effectiveness of the optimization strategy:

Shuffle Reduction (spark_ui_shuffle_optimized.png): “Shuffle Read” volume per iteration significantly decreased after the initial repartitioning, proving that partitions are now aligned for iterative joins.
Cache Management (spark_ui_storage_cache.png): The Storage tab confirms that the rank DataFrames are 100% cached in RAM. This prevents the exponential growth of the DAG lineage and ensures stable memory usage.
Data Skew (spark_ui_skew_tasks.png): Individual task metrics reveal a slight workload imbalance. Some tasks process more data than the median, identifying “hub” aircraft that are highly connected within dense sectors.

5. Skew Analysis

The out-degree distribution reveals whether any aircraft is disproportionately central:

Statistic	Value
Mean out-degree	~41 (256,000 edges / 6,264 vertices)
Std out-degree	~25
Max out-degree	~215
Skew ratio (max/mean)	~5.2×

A skew ratio > 10x means one or a few aircraft are co-located with many others (dense hub sectors). In that case:

The partition holding the hot vertex receives proportionally more work during groupBy(dst).sum(contrib).
Mitigation: filter aircraft with out_degree > threshold before graph construction, or salt the join key.

In practice, our airspace proximity graph at 2°/5min granularity exhibits moderate skew (R ≈ 5.2×) for this OpenSky snapshot — sufficient to justify repartitioning to align data, but low enough that it does not require complex salting.

6. Graph Results — Interpretation

The Top 15 aircraft identified by PageRank highlight the most central nodes in the network:

Rank	Aircraft ID (icao24)	PageRank Score
1	c81e22	0.00744
2	cda28b	0.00700
3	e8043b	0.00584
4	e61f99	0.00528
5	c0884a	0.00524

These results reflect traffic density: these aircraft frequently share airspace sectors with other highly connected nodes, mapping to major arrival and departure corridors in European airspace.

7. Recommendation

To optimize iterative graph workloads in Spark:

Strategic Partitioning: Use repartition(n, join_key) before entering iterative loops to minimize shuffle costs.
Cache/Unpersist Cycle: Always use .cache() followed by a materializing action (.count()) each iteration to break the DAG lineage, and .unpersist() old ranks to free memory.
Skew Monitoring: Monitor “straggler” tasks in the Spark UI to detect data hotspots and adjust partition counts accordingly.

Authored by Justine Guirauden & Volcy Desmazures - ESIEE Paris DE2 Assignment 2, April 2026

Quartz 4

Explorer

engineering_note_lab3