Engineering Note — DE2 Assignment 3: Graph Processing / Iterative PageRank
Course: Data Engineering II - ESIEE Paris 2025-2026
Track: D - Aviation (OpenSky Network) | Path: A - Graph Processing
1. Graph Construction — Design Choices
Data model
The OpenSky States dataset records each airborne aircraft’s position every few seconds. Rather than route data (which is unavailable in the raw feed), a proximity graph is built:
- Vertex = one unique
icao24aircraft transponder identifier - Edge = two aircraft occupy the same 2°×2° lat/lon sector within the same 5-minute time window
This co-presence relationship captures potential ATC separation conflicts and shared approach corridor usage — a meaningful aviation metric that does not require a flight schedule.
Graph construction pipeline
df_raw (filter: airborne only)
│ floor(lat / 2°) → sector_lat
│ floor(lon / 2°) → sector_lon
│ floor(time / 300s) → time_bucket
▼
df_sectors (icao24, sector_lat, sector_lon, time_bucket)
│ self-join on (sector_lat, sector_lon, time_bucket)
│ AND l.icao24 < r.icao24 ← no duplicates, no self-loops
▼
edges_raw (src, dst) → distinct undirected pairs
vertices (id) → union of all src and dst
out_degree (src, out_degree) → groupBy src, count edges
Sector size rationale: 2° ≈ 220 km at European latitudes — large enough to capture cluster co-presence in busy airspace (LFPG approach, Mediterranean overflight), small enough to avoid the entire map collapsing into one bucket.
2. PageRank Algorithm
Formula per iteration
- Damping factor
d = 0.85(standard, models random-restart probability) - Initialization:
rank(v) = 1/Nfor all vertices (uniform prior) - Stopping criterion:
max |Δrank| < ε = 0.001 - Max iterations: 10
Implementation notes
Each Spark iteration involves two joins (edges ⋈ ranks on src, then ⋈ out_degree on src) followed by a groupBy(dst).sum(contrib). Calling .cache() + .count() after each new_ranks breaks DAG lineage growth, preventing Spark from re-computing the full chain from iteration 0 on every subsequent step.
Baseline Results:
The algorithm reached convergence in 4 iterations.
| Iteration | Execution Time | Delta (Max Absolute Change) |
|---|---|---|
| 0 | 1,030 ms | 0.001349 |
| 1 | 629 ms | 0.002393 |
| 2 | 578 ms | 0.002487 |
| 3 | 543 ms | 0.002112 |
| 4 | 645 ms | 0.000971 (Converged) |
3. Partitioning Strategies — Before / After
Baseline (default hash partitioning)
No explicit repartitioning. Spark’s default hash partitioner assigns rows to partitions by hash(all columns) % numPartitions. When edges is joined to ranks on src == id, Spark must shuffle both sides to co-locate matching keys, even if edges was already read from CSV with a natural hash.
Optimized (repartition by join key)
edges_repartitioned = edges_raw.repartition(N_PARTS, "src").cache()
out_deg_repartitioned = out_degree.repartition(N_PARTS, "src").cache()Both DataFrames are hash-partitioned by "src" (the join key) once, before the loop. Inside the loop, joining edges_repartitioned with ranks on src == id can still require shuffling ranks (since ranks is re-computed each iteration), but the edges side is stable and never re-shuffled. The out_degree join is now entirely co-located.
Execution Plan Analysis (Spark SQL Plan):
- Baseline (
plan_before.txt): FrequentExchange hashpartitioningoperators are visible, indicating that data is redistributed across the network during every join operation. - Optimized (
plan_after.txt): While an initialExchangeis introduced, it prepares the partitions for subsequent iterations, eliminating redundant data movement.
4. Quantitative Comparison
| Metric | Baseline | Optimized (Repartitioned) | Δ |
|---|---|---|---|
| Avg iteration time (ms) | ~685 ms | ~500 ms | −27% |
| Total time to converge (ms) | 3,425 ms | 2,500 ms | −27% |
| Shuffle Read per Iteration | 19.2 MiB | 0 MiB (local) | −100% |
| Convergence iteration | 4 | 4 | Identical |
Analysis:
- Baseline: Each iteration triggered a fresh shuffle of the rank updates (19.2 MiB transferred per iteration as seen in
spark_ui_shuffle_baseline.png). - Optimized: By using
edges.repartition(8, "src"), the join became “shuffle-less” after the initial redistribution. The Spark UI (spark_ui_shuffle_optimized.png) shows that subsequent tasks read data locally or from memory.
Spark UI analysis confirms the effectiveness of the optimization strategy:
- Shuffle Reduction (
spark_ui_shuffle_optimized.png): “Shuffle Read” volume per iteration significantly decreased after the initial repartitioning, proving that partitions are now aligned for iterative joins. - Cache Management (
spark_ui_storage_cache.png): The Storage tab confirms that the rank DataFrames are 100% cached in RAM. This prevents the exponential growth of the DAG lineage and ensures stable memory usage. - Data Skew (
spark_ui_skew_tasks.png): Individual task metrics reveal a slight workload imbalance. Some tasks process more data than the median, identifying “hub” aircraft that are highly connected within dense sectors.
5. Skew Analysis
The out-degree distribution reveals whether any aircraft is disproportionately central:
| Statistic | Value |
|---|---|
| Mean out-degree | ~41 (256,000 edges / 6,264 vertices) |
| Std out-degree | ~25 |
| Max out-degree | ~215 |
| Skew ratio (max/mean) | ~5.2× |
A skew ratio > 10x means one or a few aircraft are co-located with many others (dense hub sectors). In that case:
- The partition holding the hot vertex receives proportionally more work during
groupBy(dst).sum(contrib). - Mitigation: filter aircraft with
out_degree > thresholdbefore graph construction, or salt the join key.
In practice, our airspace proximity graph at 2°/5min granularity exhibits moderate skew (R ≈ 5.2×) for this OpenSky snapshot — sufficient to justify repartitioning to align data, but low enough that it does not require complex salting.
6. Graph Results — Interpretation
The Top 15 aircraft identified by PageRank highlight the most central nodes in the network:
| Rank | Aircraft ID (icao24) | PageRank Score |
|---|---|---|
| 1 | c81e22 | 0.00744 |
| 2 | cda28b | 0.00700 |
| 3 | e8043b | 0.00584 |
| 4 | e61f99 | 0.00528 |
| 5 | c0884a | 0.00524 |
These results reflect traffic density: these aircraft frequently share airspace sectors with other highly connected nodes, mapping to major arrival and departure corridors in European airspace.
7. Recommendation
To optimize iterative graph workloads in Spark:
- Strategic Partitioning: Use
repartition(n, join_key)before entering iterative loops to minimize shuffle costs. - Cache/Unpersist Cycle: Always use
.cache()followed by a materializing action (.count()) each iteration to break the DAG lineage, and.unpersist()old ranks to free memory. - Skew Monitoring: Monitor “straggler” tasks in the Spark UI to detect data hotspots and adjust partition counts accordingly.
Authored by Justine Guirauden & Volcy Desmazures - ESIEE Paris DE2 Assignment 2, April 2026