My Journey Through Module 5 of the Data Engineering Zoomcamp

This week marked an exciting milestone in my data engineering journey as I completed Module 5 of the Data Engineering Zoomcamp, which focused on batch processing with Apache Spark.

🛠️ Setting Up the Environment

The module began with setting up Apache Spark and PySpark on my Linux machine:
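Here's a minimal sketch of what that first test looked like (assuming PySpark is installed, e.g. via `pip install pyspark`, with a compatible Java runtime available):

```python
from pyspark.sql import SparkSession

# Build a local Spark session that uses all available cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test") \
    .getOrCreate()

# Print the Spark version to confirm everything is wired up
print(spark.version)
```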

Getting that first Spark session running and seeing the version number appear was a small but satisfying victory.

📊 Working with NYC Yellow Taxi Data

The homework assignment centered on analyzing the NYC Yellow Taxi dataset for October 2024:

Data Processing Steps (see the code sketch after this list):

  1. Read the parquet file into a Spark DataFrame
  2. Examined the schema and data structure
  3. Repartitioned the data into 4 partitions
  4. Measured the resulting partition sizes (about 23MB each)
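In code, the pipeline looked roughly like this (a sketch under my own assumptions: the input file name `yellow_tripdata_2024-10.parquet` and the output path are illustrative, so adjust them to your setup):

```python
# 1. Read the parquet file into a Spark DataFrame
df = spark.read.parquet("yellow_tripdata_2024-10.parquet")

# 2. Examine the schema and a few sample rows
df.printSchema()
df.show(5)

# 3. Repartition the data into 4 partitions and write it back out
df.repartition(4).write.parquet("data/yellow/2024-10/", mode="overwrite")
```

The size of each partition file can then be checked on disk, for example with `ls -lh data/yellow/2024-10/`.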

Key Learning: Spark's lazy evaluation means transformations aren't executed until an action is called. This makes the entire pipeline more efficient, because Spark can optimize the full chain of operations before actually processing any data.
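A quick illustration of the difference between transformations and actions (the column names are from the Yellow Taxi schema as I remember it, so treat them as assumptions; `df` is the DataFrame from the sketch above):

```python
from pyspark.sql import functions as F

# Transformations: Spark only builds an execution plan here, nothing runs yet
long_trips = df \
    .filter(F.col("trip_distance") > 10) \
    .select("tpep_pickup_datetime", "trip_distance")

# Action: only now does Spark optimize the plan and actually process the data
print(long_trips.count())
```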

🔍 Finding Insights in the Data

The analysis questions pushed me to apply various Spark functions and techniques:

Question 3: Trip Count Analysis