For Assignment 1, I used generative AI (ChatGPT) to help me:
- Understand the difference between RDD and DataFrame approaches – I asked for explanations of how Spark reads data line by line with RDDs versus column-wise with DataFrames, and why this leads to different word counts.
- Interpret the Spark code – I got step-by-step explanations for code snippets like the RDD word-count pipeline (`flatMap`, `map`, `reduceByKey`, `sortBy`) and the DataFrame equivalent using `split`, `explode`, and `groupBy`.
- Explain performance metrics – I asked for a breakdown of the wall time, RSS, and peak memory metrics displayed by the `%%timemem` cell magic.
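The RDD pipeline mentioned above can be mirrored step by step in plain Python. This is a sketch of the same transformation sequence (flatMap → map → reduceByKey → sortBy) on ordinary lists, not actual Spark code; the sample `lines` data is invented for illustration:

```python
# Pure-Python mirror of the Spark RDD word-count pipeline (illustrative only).
lines = ["spark makes word counts easy", "word counts with spark"]

# flatMap: split each line into words and flatten into one list
words = [w for line in lines for w in line.split(" ")]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each distinct word
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n

# sortBy: order words by descending count
top = sorted(counts.items(), key=lambda kv: -kv[1])
print(top[:3])  # → [('spark', 2), ('word', 2), ('counts', 2)]
```

In Spark itself these steps would be lazy transformations distributed across partitions; the DataFrame version (`split`, `explode`, `groupBy().count()`) expresses the same logic declaratively so the optimizer can plan it column-wise.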
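`%%timemem` appears to be a course-provided magic, so its internals are not shown here. As a rough illustration of what "wall time" and "peak memory (RSS)" mean, this standard-library sketch measures both around a workload (Unix-only, since it uses `resource`; the list-building workload is an arbitrary example):

```python
import time
import resource

start = time.perf_counter()
# Arbitrary workload to measure: build a list of a million integers
data = list(range(1_000_000))
wall = time.perf_counter() - start  # elapsed real ("wall clock") time in seconds

# Peak resident set size (RSS) of this process so far:
# kilobytes on Linux, bytes on macOS
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"wall time: {wall:.3f}s, peak RSS: {peak}")
```

Wall time reflects total elapsed real time (including any I/O or scheduling delays), while RSS is the physical memory actually resident for the process; the peak value captures the high-water mark rather than the memory in use at the end of the cell.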