Engineering Note - DE2 Assignment 2: Inverted Index Pipeline

Course: Data Engineering II - ESIEE Paris 2025-2026
Track: D - Aviation (METAR weather reports)


1. Pipeline Architecture

data/corpus/aviation_metar.csv
        │  (doc_id, text) — one METAR report per row
        ▼
 spark.read.csv (schema-enforced)
        │
        │  regexp_replace(lower(text), "[^a-z0-9\s]", "")
        │  split("[\\s\\W]+") → array[token]
        │  explode → (doc_id, token) pairs
        │  filter: len > 1, token NOT IN stop_words
        ▼
 df_filtered  (doc_id, token)
        │
        │  groupBy("token")
        │  .agg(collect_list("doc_id") → doc_ids,
        │       count("*")            → freq)
        ▼
 inverted_index  (token | doc_ids: array<string> | freq: long)
        │
        ├──→ write.parquet("outputs/lab2/inverted_index/")
        └──→ write.csv    ("outputs/lab2/inverted_index_csv/")
                          (doc_ids serialized as "|"-joined string)

2. Corpus Design (Track D)

The METAR corpus is derived from the OpenSky States dataset used in Lab 1. Each aircraft state record is transformed into a free-text sentence describing status, altitude, velocity, heading, and squawk code. This produces a natural-language corpus where aviation terminology (landing, airborne, squawk, altitude, heading) has meaningful frequency distributions.

One document = one aircraft position report = one row with a unique doc_id (icao24_timestamp).

  • Total input rows: 1,526,689
  • Initial tokens generated: 32,203,784
  • Post-filter tokens: 19,818,447

3. Normalization Choices

StepChoiceRationale
LowercaseF.lower()Case-insensitive matching; METAR uses all-caps
Punctuationregexp_replace([^a-z0-9\s], "")Removes slash, hyphen in METAR codes
Tokenizesplit([\s\W]+)Splits on any whitespace or punctuation boundary
Min lengthlen > 1Drops single chars (e.g., “z” from “051200Z”)
Stop-words36-word list (English + METAR filler)meters, knots, degrees, numeric words inflate the index without adding search value

Stop-word list rationale: A generic English list was extended with METAR-specific filler: unit words (meters, knots, degrees), spelled-out digits (zeronine), and modal verbs. This reduces token count by ~25–40% while preserving all semantically meaningful aviation terms.


4. Index Schema

root
 |-- token   : string  (not null)  ← index key
 |-- doc_ids : array<string>       ← postings list
 |-- freq    : long                ← document frequency (df)

collect_list (not collect_set) is used to retain all occurrence positions, enabling future TF (term-frequency) computation per document. The freq column equals len(doc_ids) for this corpus because each (doc_id, token) pair appears exactly once after normalization.

Performance of Build index:

The index construction was performed on a local Spark session with 4GB of RAM allocated to the driver and executors to ensure stability.

  • Parquet Build Duration: 41 seconds
  • CSV Build Duration: 42 seconds
  • Unique terms indexed: 70,293
  • Memory Management: A significant spill of 1568.4 MiB was recorded during the ObjectHashAggregate phase. This metric highlights why the initial memory increase to 4GB was critical to prevent “Java Heap Space” failures during the shuffle-heavy aggregation.

5. Query Latency & Plan Analysis

Query performance was optimized using the .cache() method on the resulting inverted index. Analysis of Stage 46 in the Spark UI shows:

  • Median Latency: 3.0 ms per task.
  • Cache Efficiency: The presence of the InMemoryTableScan node in the execution DAG confirms that subsequent searches bypass disk I/O entirely, serving results directly from the 2.1 GiB of allocated storage memory.

The formatted plan (proof/plan_query.txt) shows:

  • PushedFilters: [IsNotNull(token), EqualTo(token, aircraft)] — Parquet row-group statistics allow skipping irrelevant row groups.
  • Columnar Efficiency: Unlike the CSV format which requires a full row scan, the Parquet scan only fetches the specific columns required for the query (token, doc_ids), drastically reducing I/O overhead.

6. Storage Footprint Comparison

FormatSize (Bytes)FilesNotes
Parquet123,192,287 bytes6 filesColumnar + Snappy compression
CSV359,635,037 bytes6 filesPlain text, pipe-joined strings

Key Findings:

  • Storage Ratio: 34.25%
  • Performance: Parquet is 65.7% smaller than CSV. This massive gain is due to Parquet’s ability to compress repeated tokens and efficiently encode the long strings of document IDs compared to the verbatim text storage of CSV.

7. Optimization (shuffle.partitions)

We explicitly set spark.sql.shuffle.partitions = 8 (vs default 200) to align with the local machine’s CPU cores. This prevents the overhead of Spark’s default 200 partitions, which would have created hundreds of tiny, inefficient tasks for our 70k unique terms, thereby reducing total execution time. This was first observed in Lab 1 and applies equally here.


Authored by Justine Guirauden & Volcy Desmazures - ESIEE Paris DE2 Assignment 2, April 2026