Today I completed Module 4 of the Data Engineering Zoomcamp, focusing on Analytics Engineering with dbt (data build tool). This module took me on a comprehensive journey through transforming raw taxi data into sophisticated analytics-ready models that answer complex business questions. The process was both challenging and rewarding, pushing me to apply theoretical knowledge to real-world data problems.
Any good data project starts with proper data engineering. I began by preparing three massive NYC Taxi & Limousine Commission (TLC) datasets in Google Cloud Platform: green taxi, yellow taxi, and for-hire vehicle (FHV) trip data.
Using Python scripts, I:
This process taught me about handling large datasets efficiently, working with compressed files, and the challenges of moving data between systems without overwhelming local resources.
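To make the point about not overwhelming local resources concrete, here is a minimal sketch of reading a gzip-compressed CSV in fixed-size batches instead of loading it all at once. It is not the script used in this project: the function name `iter_csv_gz_batches`, the `batch_size` parameter, and the tiny in-memory sample are all assumptions made for illustration.

```python
import csv
import gzip
import io
from typing import BinaryIO, Dict, Iterator, List


def iter_csv_gz_batches(
    raw: BinaryIO, batch_size: int = 100_000
) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of row dicts from a gzip-compressed CSV stream.

    Only one batch is held in memory at a time, so the full file never
    has to fit in RAM. (Hypothetical helper, for illustration only.)
    """
    # gzip.open accepts an already-open binary file object; "rt" decodes to text.
    with gzip.open(raw, mode="rt", newline="") as text:
        reader = csv.DictReader(text)
        batch: List[Dict[str, str]] = []
        for row in reader:
            batch.append(row)
            if len(batch) >= batch_size:
                yield batch
                batch = []
        if batch:  # emit the final, possibly smaller, batch
            yield batch


# Tiny in-memory example standing in for a real compressed trip file.
sample = io.BytesIO(gzip.compress(b"a,b\n1,2\n3,4\n5,6\n"))
batches = list(iter_csv_gz_batches(sample, batch_size=2))
print([len(b) for b in batches])
```

In a real pipeline each batch would be handed to whatever uploads or loads the data, so memory use stays bounded by `batch_size` rather than by file size.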
With the raw data available in BigQuery, I designed a dbt project following modern analytics engineering principles:
Staging models (stg_green_tripdata, stg_yellow_tripdata, stg_fhv_tripdata) that:
dim_zones: Normalized representation of taxi zones
dim_fhv_trips: Clean representation of for-hire vehicle trips
fact_trips
and specialized analytical models:
fct_taxi_trips_quarterly_revenue: For revenue trend analysis
fct_taxi_trips_monthly_fare_p95: For fare distribution analysis
fct_fhv_monthly_zone_traveltime_p90: For travel time analysis
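The p95 and p90 models are written in dbt SQL, but the idea behind them can be sketched in a few lines of Python. The example below groups fares by month and takes the 95th percentile of each group, using linear interpolation between the closest ranks. The function names, the `(month, fare)` tuple layout, and the sample values are assumptions for illustration. Whether this interpolation rule matches the SQL function used in the actual models is also an assumption.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def percentile_cont(values: List[float], q: float) -> float:
    """Percentile with linear interpolation between closest ranks; q is in [0, 1]."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)  # fractional index into the sorted values
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def monthly_p95(trips: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group (month, fare) pairs by month and return each month's p95 fare."""
    by_month: Dict[str, List[float]] = defaultdict(list)
    for month, fare in trips:
        by_month[month].append(fare)
    return {month: percentile_cont(fares, 0.95) for month, fares in by_month.items()}


# Made-up sample values, not real TLC data.
sample_trips = [("2019-01", 10.0), ("2019-01", 20.0), ("2019-02", 5.0)]
print(monthly_p95(sample_trips))
```

The same pattern with `q=0.90` and trip durations grouped by month and zone would correspond to the travel time model.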