What I Learned in Module 4 of the Data Engineering Zoomcamp

Today I completed Module 4 of the Data Engineering Zoomcamp, which focuses on Analytics Engineering with dbt (data build tool). The module walked me through transforming raw taxi data into analytics-ready models that answer business questions. It was both challenging and rewarding, and it pushed me to apply theory to real-world data problems.

Setting Up the Environment: From Raw Data to Cloud Storage

Any good data project starts with solid data engineering. I began by preparing three massive NYC Taxi & Limousine Commission (TLC) datasets in Google Cloud Platform:

  1. Green Taxi dataset: 7,778,101 records spanning 2019-2020
  2. Yellow Taxi dataset: 109,047,518 records spanning 2019-2020
  3. For-Hire Vehicle dataset: 43,244,696 records from 2019

I did this with Python scripts. The process taught me about handling large datasets efficiently, working with compressed files, and moving data between systems without overwhelming local resources.
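
To illustrate the point about not overwhelming local resources, here is a minimal sketch of reading a compressed file in fixed-size chunks. It assumes the source files are gzip-compressed CSVs; the function names and chunk size are hypothetical and chosen only for illustration. They are not taken from the actual scripts.

```python
import csv
import gzip
from itertools import islice
from typing import Dict, Iterator, List


def iter_csv_chunks(path: str, chunk_size: int = 100_000) -> Iterator[List[Dict[str, str]]]:
    """Yield rows from a gzip-compressed CSV in lists of at most chunk_size rows."""
    # Text mode decompresses the file as it is read, so the data is never
    # expanded on disk or loaded into memory all at once.
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk


def count_rows(path: str, chunk_size: int = 100_000) -> int:
    """Count data rows while holding only one chunk in memory at a time."""
    return sum(len(chunk) for chunk in iter_csv_chunks(path, chunk_size))
```

Each chunk could then be handed to whatever upload step the pipeline uses, so peak memory depends on the chunk size rather than on the size of the file.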

dbt Project Structure: Building a Logical Data Transformation Flow

With the raw data available in BigQuery, I designed a dbt project following modern analytics engineering principles:

  1. Staging Layer: Created the initial models (stg_green_tripdata, stg_yellow_tripdata, stg_fhv_tripdata).
  2. Dimension Layer: Created dimension tables.
  3. Fact Layer: Built fact tables like fact_trips, along with specialized analytical models.
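
dbt works out the run order from the ref() calls between models, so these layers form a dependency graph. Below is a minimal sketch of that ordering in plain Python. It assumes fact_trips selects from the green and yellow staging models and from a dimension table called dim_zones. The name dim_zones and the dependency edges are hypothetical, used here only for illustration.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each model maps to the set of models it selects from.
# The staging and fact names come from the project; "dim_zones" and the
# exact edges below are assumptions made for this example.
MODEL_DEPENDENCIES: Dict[str, Set[str]] = {
    "stg_green_tripdata": set(),
    "stg_yellow_tripdata": set(),
    "stg_fhv_tripdata": set(),
    "dim_zones": set(),
    "fact_trips": {"stg_green_tripdata", "stg_yellow_tripdata", "dim_zones"},
}


def build_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the models in an order where each one follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for model in build_order(MODEL_DEPENDENCIES):
        print(model)
```

Whatever the actual edges are, the property that matters is the same: a fact model is only built after every staging and dimension model it depends on.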