Data Engineering with dbt: Mastering Analytics Engineering Through NYC Taxi Data

What I Learned in Module 4 of the Data Engineering Zoomcamp

Today I completed Module 4 of the Data Engineering Zoomcamp, which covers analytics engineering with dbt (data build tool). In this module I transformed raw NYC taxi data into analytics-ready models that answer business questions about revenue trends, fare distributions, and travel times. The process was both challenging and rewarding, because it pushed me to apply theoretical knowledge to real-world data problems.

Setting Up the Environment: From Raw Data to Cloud Storage

Any good data project starts with solid data engineering. I began by loading three large NYC Taxi & Limousine Commission (TLC) datasets into Google Cloud Platform:

- Green Taxi dataset: 7,778,101 records spanning 2019-2020
- Yellow Taxi dataset: 109,047,518 records spanning 2019-2020
- For-Hire Vehicle dataset: 43,244,696 records from 2019

Using Python scripts, I:

- Generated download URLs for all data files
- Downloaded each compressed CSV file to temporary storage
- Uploaded them to Google Cloud Storage buckets
- Created external tables in BigQuery with appropriate schemas
- Validated record counts to ensure data integrity before proceeding
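
The first two steps and the final validation can be sketched roughly as follows. This is a minimal illustration, not the scripts I actually ran: BASE_URL is a placeholder host, the <service>_tripdata_<YYYY>-<MM>.csv.gz file naming is an assumption, and the helper names (build_urls, download_to_temp, validate_counts) are hypothetical. Only the year ranges and record counts come from the list above.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path

# Placeholder host (assumption): substitute the real location of the files.
BASE_URL = "https://example.com/nyc-tlc"

# Years covered by each dataset, as listed above.
DATASETS = {"green": [2019, 2020], "yellow": [2019, 2020], "fhv": [2019]}

# Record counts reported above, used as the validation target.
EXPECTED_ROWS = {
    "green": 7_778_101,
    "yellow": 109_047_518,
    "fhv": 43_244_696,
}


def build_urls(service: str, years: list[int]) -> list[str]:
    """Return one URL per month, assuming one compressed CSV per month."""
    return [
        f"{BASE_URL}/{service}/{service}_tripdata_{year}-{month:02d}.csv.gz"
        for year in years
        for month in range(1, 13)
    ]


def download_to_temp(url: str) -> Path:
    """Stream a file to temporary storage without loading it into memory."""
    with urllib.request.urlopen(url) as response:
        with tempfile.NamedTemporaryFile(suffix=".csv.gz", delete=False) as tmp:
            shutil.copyfileobj(response, tmp)
            return Path(tmp.name)


def validate_counts(actual: dict[str, int]) -> list[str]:
    """Return the datasets whose loaded row count differs from the expected one."""
    return [
        name
        for name, expected in EXPECTED_ROWS.items()
        if actual.get(name) != expected
    ]


# Build the full download plan: 24 files each for green and yellow, 12 for FHV.
all_urls = {name: build_urls(name, years) for name, years in DATASETS.items()}
```

Streaming each file with shutil.copyfileobj keeps memory use flat regardless of file size, which matters for a dataset as large as the yellow taxi data. The counts passed to validate_counts would come from a row-count query against each BigQuery table.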

This process taught me about handling large datasets efficiently, working with compressed files, and the challenges of moving data between systems without overwhelming local resources.

dbt Project Structure: Building a Logical Data Transformation Flow

With the raw data available in BigQuery, I designed a dbt project following modern analytics engineering principles:

Staging Layer: Created initial models (stg_green_tripdata, stg_yellow_tripdata, stg_fhv_tripdata) that:
- Cleaned and standardized column names
- Applied appropriate type casting
- Handled data quality issues like duplicate rows
- Generated surrogate keys for easier joining

Dimension Layer: Created dimension tables including:
- dim_zones: Normalized representation of taxi zones
- dim_fhv_trips: Clean representation of for-hire vehicle trips

Fact Layer: Built fact tables like fact_trips and specialized analytical models:
- fct_taxi_trips_quarterly_revenue: For revenue trend analysis
- fct_taxi_trips_monthly_fare_p95: For fare distribution analysis
- fct_fhv_monthly_zone_traveltime_p90: For travel time analysis
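
To make the staging steps more concrete, here is a small Python sketch of deduplication followed by surrogate key generation. In the dbt project this logic lives in SQL models. The code below only illustrates the idea, and the column names (vendorid, pickup_datetime, tripid), the sample rows, and the choice of an MD5 hash are assumptions made for the example.

```python
import hashlib


def surrogate_key(*fields) -> str:
    """Hash the given fields into a single stable key (MD5 of the joined values)."""
    joined = "-".join("" if field is None else str(field) for field in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def deduplicate(rows, key_columns):
    """Keep the first row seen for each combination of key columns."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up sample rows; the first two are duplicates of each other.
rows = [
    {"vendorid": 1, "pickup_datetime": "2019-01-01 00:10:00", "fare": 8.5},
    {"vendorid": 1, "pickup_datetime": "2019-01-01 00:10:00", "fare": 8.5},
    {"vendorid": 2, "pickup_datetime": "2019-01-01 00:12:00", "fare": 12.0},
]
KEY_COLUMNS = ["vendorid", "pickup_datetime"]

# Deduplicate first, then attach a key built from the same columns.
keyed_rows = [
    {**row, "tripid": surrogate_key(*(row[column] for column in KEY_COLUMNS))}
    for row in deduplicate(rows, KEY_COLUMNS)
]
```

Because the key is derived from the same columns used for deduplication, each surviving row gets a distinct and reproducible identifier that downstream models can join on.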
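
The p95 and p90 suffixes in the model names refer to percentiles. As an illustration of what such a figure means, the sketch below computes a percentile with linear interpolation between the two nearest ranks. The function name percentile_cont and the fare values are made up for the example, and the actual models may compute percentiles differently.

```python
import math


def percentile_cont(values, fraction):
    """Return the percentile at `fraction` (0 to 1) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(values)
    # Fractional position of the requested percentile in the sorted list.
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    # Interpolate between the two neighbouring values.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


# Made-up fares for one month; a single expensive trip dominates the upper tail.
fares = [5.0, 7.5, 9.0, 12.0, 52.0]
fare_p95 = percentile_cont(fares, 0.95)
fare_median = percentile_cont(fares, 0.5)
```

A high percentile such as p95 describes the expensive end of the fare distribution, so it reacts to unusually costly trips in a way the median does not.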