Engineering Note - DE2 Assignment 2: Inverted Index Pipeline

Course: Data Engineering II - ESIEE Paris 2025-2026
Track: D - Aviation (METAR weather reports)

1. Pipeline Architecture

data/corpus/aviation_metar.csv
        │  (doc_id, text) — one METAR report per row
        ▼
 spark.read.csv (schema-enforced)
        │
        │  regexp_replace(lower(text), "[^a-z0-9\s]", "")
        │  split("[\\s\\W]+") → array[token]
        │  explode → (doc_id, token) pairs
        │  filter: len > 1, token NOT IN stop_words
        ▼
 df_filtered  (doc_id, token)
        │
        │  groupBy("token")
        │  .agg(collect_list("doc_id") → doc_ids,
        │       count("*")            → freq)
        ▼
 inverted_index  (token | doc_ids: array<string> | freq: long)
        │
        ├──→ write.parquet("outputs/lab2/inverted_index/")
        └──→ write.csv    ("outputs/lab2/inverted_index_csv/")
                          (doc_ids serialized as "|"-joined string)

2. Corpus Design (Track D)

The METAR corpus is derived from the OpenSky States dataset used in Lab 1. Each aircraft state record is transformed into a free-text sentence describing status, altitude, velocity, heading, and squawk code. This produces a natural-language corpus where aviation terminology (landing, airborne, squawk, altitude, heading) has meaningful frequency distributions.

One document = one aircraft position report = one row with a unique doc_id (icao24_timestamp).

Total input rows: 1,526,689
Initial tokens generated: 32,203,784
Post-filter tokens: 19,818,447

3. Normalization Choices

Step	Choice	Rationale
Lowercase	`F.lower()`	Case-insensitive matching; METAR uses all-caps
Punctuation	`regexp_replace([^a-z0-9\s], "")`	Removes slash, hyphen in METAR codes
Tokenize	`split([\s\W]+)`	Splits on any whitespace or punctuation boundary
Min length	`len > 1`	Drops single chars (e.g., “z” from “051200Z”)
Stop-words	36-word list (English + METAR filler)	`meters`, `knots`, `degrees`, numeric words inflate the index without adding search value

Stop-word list rationale: A generic English list was extended with METAR-specific filler: unit words (meters, knots, degrees), spelled-out digits (zero–nine), and modal verbs. This reduces token count by ~25–40% while preserving all semantically meaningful aviation terms.

4. Index Schema

root
 |-- token   : string  (not null)  ← index key
 |-- doc_ids : array<string>       ← postings list
 |-- freq    : long                ← document frequency (df)

collect_list (not collect_set) is used to retain all occurrence positions, enabling future TF (term-frequency) computation per document. The freq column equals len(doc_ids) for this corpus because each (doc_id, token) pair appears exactly once after normalization.

Performance of Build index:

The index construction was performed on a local Spark session with 4GB of RAM allocated to the driver and executors to ensure stability.

Parquet Build Duration: 41 seconds
CSV Build Duration: 42 seconds
Unique terms indexed: 70,293
Memory Management: A significant spill of 1568.4 MiB was recorded during the ObjectHashAggregate phase. This metric highlights why the initial memory increase to 4GB was critical to prevent “Java Heap Space” failures during the shuffle-heavy aggregation.

5. Query Latency & Plan Analysis

Query performance was optimized using the .cache() method on the resulting inverted index. Analysis of Stage 46 in the Spark UI shows:

Median Latency: 3.0 ms per task.
Cache Efficiency: The presence of the InMemoryTableScan node in the execution DAG confirms that subsequent searches bypass disk I/O entirely, serving results directly from the 2.1 GiB of allocated storage memory.

The formatted plan (proof/plan_query.txt) shows:

PushedFilters: [IsNotNull(token), EqualTo(token, aircraft)] — Parquet row-group statistics allow skipping irrelevant row groups.
Columnar Efficiency: Unlike the CSV format which requires a full row scan, the Parquet scan only fetches the specific columns required for the query (token, doc_ids), drastically reducing I/O overhead.

6. Storage Footprint Comparison

Format	Size (Bytes)	Files	Notes
Parquet	123,192,287 bytes	6 files	Columnar + Snappy compression
CSV	359,635,037 bytes	6 files	Plain text, pipe-joined strings

Key Findings:

Storage Ratio: 34.25%
Performance: Parquet is 65.7% smaller than CSV. This massive gain is due to Parquet’s ability to compress repeated tokens and efficiently encode the long strings of document IDs compared to the verbatim text storage of CSV.

7. Optimization (shuffle.partitions)

We explicitly set spark.sql.shuffle.partitions = 8 (vs default 200) to align with the local machine’s CPU cores. This prevents the overhead of Spark’s default 200 partitions, which would have created hundreds of tiny, inefficient tasks for our 70k unique terms, thereby reducing total execution time. This was first observed in Lab 1 and applies equally here.

Authored by Justine Guirauden & Volcy Desmazures - ESIEE Paris DE2 Assignment 2, April 2026

Quartz 4

Explorer

engineering_note_lab2