This week marked an exciting milestone in my data engineering journey as I completed Module 5 of the Data Engineering Zoomcamp, which focused on batch processing with Apache Spark.
The module began with setting up Apache Spark and PySpark on my Linux machine:
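A minimal sketch of that setup, assuming PySpark was installed via pip (the app name and local master setting here are just illustrative choices):

```python
# Minimal local Spark setup sketch; assumes `pip install pyspark` has been run.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")            # run locally, using all available cores
    .appName("zoomcamp-module5")   # arbitrary application name
    .getOrCreate()
)

print(spark.version)  # printing the version confirms the session is alive
```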
Getting that first Spark session running and seeing the version number appear was a small but satisfying victory.
The homework assignment centered on analyzing the NYC Yellow Taxi dataset for October 2024:
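Loading the data looked roughly like this; the file name and download URL follow the TLC trip-data naming convention, so treat them as assumptions rather than exact homework instructions:

```python
# Sketch: read the October 2024 Yellow Taxi parquet file into a DataFrame.
# The file is assumed to have been downloaded beforehand, e.g. with:
#   wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-10.parquet
df = spark.read.parquet("yellow_tripdata_2024-10.parquet")

df.printSchema()   # inspect column names and types
print(df.count())  # total number of records in the month
```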
Data Processing Steps:
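A representative sketch of the processing stage; the partition count and output path here are assumptions for illustration, not the exact homework values:

```python
# Repartition the DataFrame and write it back out as parquet.
# The partition count (4) and output path are assumed for this sketch.
df_partitioned = df.repartition(4)
df_partitioned.write.mode("overwrite").parquet("data/yellow/2024-10/")
```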
Key Learning: Spark's lazy evaluation means transformations aren't executed until an action (such as count() or write()) is called, which makes the pipeline more efficient because Spark can optimize the whole chain of operations before actually processing any data.
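A small illustration of that laziness, using the trip_distance column from the Yellow Taxi schema:

```python
from pyspark.sql import functions as F

# A transformation only builds the query plan; nothing is executed yet.
long_trips = df.filter(F.col("trip_distance") > 10)

# The action triggers execution of the whole optimized plan.
print(long_trips.count())
```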
The analysis questions pushed me to apply various Spark functions and techniques:
Question 3: Trip Count Analysis
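My approach was along these lines; the specific pickup date used in the filter and the tpep_pickup_datetime column name are assumptions based on the Yellow Taxi schema, shown here only to illustrate the pattern of counting trips on a single day:

```python
from pyspark.sql import functions as F

# Sketch of a per-day trip count: cast the pickup timestamp to a date
# and keep only trips that started on that day, then count them.
trips_on_day = df.filter(
    F.to_date(F.col("tpep_pickup_datetime")) == "2024-10-15"
)
print(trips_on_day.count())
```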