Engineering Note - DE2 Assignment 2: Inverted Index Pipeline
Course: Data Engineering II - ESIEE Paris 2025-2026
Track: D - Aviation (METAR weather reports)
1. Pipeline Architecture
data/corpus/aviation_metar.csv
│ (doc_id, text) — one METAR report per row
▼
spark.read.csv (schema-enforced)
│
│ regexp_replace(lower(text), "[^a-z0-9\s]", "")
│ split("[\\s\\W]+") → array[token]
│ explode → (doc_id, token) pairs
│ filter: len > 1, token NOT IN stop_words
▼
df_filtered (doc_id, token)
│
│ groupBy("token")
│ .agg(collect_list("doc_id") → doc_ids,
│ count("*") → freq)
▼
inverted_index (token | doc_ids: array<string> | freq: long)
│
├──→ write.parquet("outputs/lab2/inverted_index/")
└──→ write.csv ("outputs/lab2/inverted_index_csv/")
(doc_ids serialized as "|"-joined string)
2. Corpus Design (Track D)
The METAR corpus is derived from the OpenSky States dataset used in Lab 1. Each aircraft state record is transformed into a free-text sentence describing status, altitude, velocity, heading, and squawk code. This produces a natural-language corpus where aviation terminology (landing, airborne, squawk, altitude, heading) has meaningful frequency distributions.
One document = one aircraft position report = one row with a unique doc_id (icao24_timestamp).
- Total input rows: 1,526,689
- Initial tokens generated: 32,203,784
- Post-filter tokens: 19,818,447
3. Normalization Choices
| Step | Choice | Rationale |
|---|---|---|
| Lowercase | F.lower() | Case-insensitive matching; METAR uses all-caps |
| Punctuation | regexp_replace([^a-z0-9\s], "") | Removes slash, hyphen in METAR codes |
| Tokenize | split([\s\W]+) | Splits on any whitespace or punctuation boundary |
| Min length | len > 1 | Drops single chars (e.g., “z” from “051200Z”) |
| Stop-words | 36-word list (English + METAR filler) | meters, knots, degrees, numeric words inflate the index without adding search value |
Stop-word list rationale: A generic English list was extended with METAR-specific filler: unit words (meters, knots, degrees), spelled-out digits (zero–nine), and modal verbs. This reduces token count by ~25–40% while preserving all semantically meaningful aviation terms.
4. Index Schema
root
|-- token : string (not null) ← index key
|-- doc_ids : array<string> ← postings list
|-- freq : long ← document frequency (df)
collect_list (not collect_set) is used to retain all occurrence positions, enabling future TF (term-frequency) computation per document. The freq column equals len(doc_ids) for this corpus because each (doc_id, token) pair appears exactly once after normalization.
Performance of Build index:
The index construction was performed on a local Spark session with 4GB of RAM allocated to the driver and executors to ensure stability.
- Parquet Build Duration: 41 seconds
- CSV Build Duration: 42 seconds
- Unique terms indexed: 70,293
- Memory Management: A significant spill of 1568.4 MiB was recorded during the
ObjectHashAggregatephase. This metric highlights why the initial memory increase to 4GB was critical to prevent “Java Heap Space” failures during the shuffle-heavy aggregation.
5. Query Latency & Plan Analysis
Query performance was optimized using the .cache() method on the resulting inverted index. Analysis of Stage 46 in the Spark UI shows:
- Median Latency: 3.0 ms per task.
- Cache Efficiency: The presence of the
InMemoryTableScannode in the execution DAG confirms that subsequent searches bypass disk I/O entirely, serving results directly from the 2.1 GiB of allocated storage memory.
The formatted plan (proof/plan_query.txt) shows:
- PushedFilters:
[IsNotNull(token), EqualTo(token, aircraft)]— Parquet row-group statistics allow skipping irrelevant row groups. - Columnar Efficiency: Unlike the CSV format which requires a full row scan, the Parquet scan only fetches the specific columns required for the query (
token,doc_ids), drastically reducing I/O overhead.
6. Storage Footprint Comparison
| Format | Size (Bytes) | Files | Notes |
|---|---|---|---|
| Parquet | 123,192,287 bytes | 6 files | Columnar + Snappy compression |
| CSV | 359,635,037 bytes | 6 files | Plain text, pipe-joined strings |
Key Findings:
- Storage Ratio: 34.25%
- Performance: Parquet is 65.7% smaller than CSV. This massive gain is due to Parquet’s ability to compress repeated tokens and efficiently encode the long strings of document IDs compared to the verbatim text storage of CSV.
7. Optimization (shuffle.partitions)
We explicitly set spark.sql.shuffle.partitions = 8 (vs default 200) to align with the local machine’s CPU cores. This prevents the overhead of Spark’s default 200 partitions, which would have created hundreds of tiny, inefficient tasks for our 70k unique terms, thereby reducing total execution time. This was first observed in Lab 1 and applies equally here.
Authored by Justine Guirauden & Volcy Desmazures - ESIEE Paris DE2 Assignment 2, April 2026