Building an End-to-End Data Pipeline for NYC Citi Bike

Why Citi Bike Data?

Citi Bike, New York City’s bike-sharing system, generates millions of trip records each month, capturing detailed data on rides, stations, and user behavior. This data is a goldmine for understanding urban mobility patterns, but without an automated pipeline, processing these large datasets and extracting actionable insights is a daunting task. My project aimed to solve this by building a scalable batch pipeline to ingest, store, transform, and visualize Citi Bike’s 2024 trip data.

Key Questions to Answer

  • How does Citi Bike usage vary over time, and what seasonal patterns emerge in 2024?
  • What’s the breakdown of rides between electric and classic bikes, and how does this impact operations?
  • How do ride patterns differ between member and casual users?
  • Which stations are the most popular starting points, and what does this suggest about urban mobility in NYC?
  • What’s the average trip duration, and how can this inform bike maintenance or rebalancing strategies?

Project Overview

The NYC Bike Rides Pipeline is a batch data pipeline that processes Citi Bike 2024 trip data, stores it in a data lake, transforms it in a data warehouse, and visualizes key metrics through an interactive dashboard. Built entirely on Google Cloud Platform (GCP), the pipeline leverages modern data engineering tools to automate the process and deliver insights.

Architecture

The pipeline follows a batch processing workflow, orchestrated end-to-end using Apache Airflow. Here’s a breakdown of each stage:

  1. Data Ingestion: I sourced the Citi Bike 2024 trip data (CSV files) from a public S3 bucket. The data is downloaded and uploaded to a Google Cloud Storage (GCS) bucket named naza_nyc_bike_rides, which serves as the data lake (a minimal ingestion sketch appears after this list).
  2. Orchestration: Apache Airflow, running locally in a Docker container, orchestrates the pipeline. A custom DAG (etl.py) manages the monthly ingestion, loading, and transformation steps, ensuring the pipeline runs smoothly for each month of 2024.
  3. Data Warehouse: Raw data is loaded from GCS into Google BigQuery’s staging dataset (nyc_bikes_staging). To optimize query performance, I partitioned the tables by the started_at timestamp and clustered them by start_station_id. After transformation, the data is loaded into the production dataset (nyc_bikes_prod).
  4. Transformations: Using dbt Cloud, I transformed the raw data into a production-ready dataset. This involved cleaning (e.g., removing records with null values in critical fields like started_at and rideable_type), aggregating (e.g., rides by month), and modeling the data for analysis.
  5. Visualization: The final dataset in nyc_bikes_prod is connected to Looker Studio, where I built an interactive dashboard to visualize key metrics like total rides, rideable type breakdown, and top start stations.
  6. Infrastructure as Code (IaC): I used Terraform to provision the GCS bucket (naza_nyc_bike_rides) and BigQuery datasets (nyc_bikes_staging and nyc_bikes_prod), ensuring reproducibility and scalability.
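
As a rough illustration of the ingestion step, here is a minimal Python sketch that downloads one month of trip data and uploads it to the data-lake bucket. The source URL, local path, and object naming are placeholders for illustration, not the project's actual values.

```python
# Minimal sketch of the ingestion step: download one month of trip data
# and upload the CSV to the GCS data lake. The source URL, local path,
# and object name are placeholders; the real DAG resolves these per month.
import requests
from google.cloud import storage

SOURCE_URL = "https://example-public-bucket.s3.amazonaws.com/202401-citibike-tripdata.csv"  # placeholder
LOCAL_FILE = "/tmp/202401-citibike-tripdata.csv"
GCS_BUCKET = "naza_nyc_bike_rides"
GCS_OBJECT = "raw/2024/01/citibike-tripdata.csv"  # placeholder layout

def download_month(url: str, local_path: str) -> None:
    """Stream the monthly CSV to local disk."""
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)

def upload_to_gcs(local_path: str, bucket_name: str, object_name: str) -> None:
    """Upload the file to the data-lake bucket."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    bucket.blob(object_name).upload_from_filename(local_path)

if __name__ == "__main__":
    download_month(SOURCE_URL, LOCAL_FILE)
    upload_to_gcs(LOCAL_FILE, GCS_BUCKET, GCS_OBJECT)
```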

Implementation Details

  1. Setting Up the Infrastructure with Terraform: I started by defining the infrastructure in Terraform, which allowed me to provision the GCS bucket and BigQuery datasets as code. The terraform/ directory contains main.tf and variables.tf, where I specified the GCP project ID, region, and resource names. Running terraform init and terraform apply set up the infrastructure in minutes, ensuring consistency across environments.
  2. Orchestrating with Apache Airflow: Airflow was the backbone of the pipeline, automating the ingestion, loading, and transformation steps. I containerized Airflow using Docker (docker-compose.yml) and configured it with environment variables in a .env file (e.g., GCP project ID, bucket name). The DAG (etl.py) in the dags/ folder handles the monthly batch processing, downloading CSV files, uploading them to GCS, and loading them into BigQuery; a simplified sketch of such a DAG follows.
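
This is a heavily simplified sketch of what a monthly batch DAG can look like. The task names, schedule, and placeholder callables are illustrative stand-ins, not the actual contents of etl.py.

```python
# Simplified sketch of a monthly batch DAG; task names, schedule, and the
# placeholder callables are illustrative, not the project's actual etl.py.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_month(**context):
    """Download the month's CSV and push it to the GCS data lake (placeholder)."""
    ...

def load_to_bigquery(**context):
    """Load the month's file from GCS into the BigQuery staging dataset (placeholder)."""
    ...

with DAG(
    dag_id="nyc_bike_rides_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@monthly",  # one run per month of 2024
    catchup=True,                  # backfill every month of the year
) as dag:
    ingest = PythonOperator(task_id="ingest_month", python_callable=ingest_month)
    load = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)
    ingest >> load
```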

Challenges and Solutions

Challenge 1: Airflow Setup with Cloud Composer: Initially, I planned to use Google Cloud Composer for Airflow, but I ran into setup issues, including dependency conflicts and longer-than-expected provisioning times. I pivoted to running Airflow locally with Docker, which gave me more control and faster iteration cycles. This taught me the importance of flexibility when working with cloud-managed services.

Challenge 2: Handling Large Datasets in BigQuery: The Citi Bike dataset, with over 1.1 million records, required careful optimization in BigQuery. I fine-tuned performance by partitioning tables by the started_at timestamp and clustering by start_station_id, which significantly reduced query costs and improved performance for downstream analyses.
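
For reference, this is roughly what a partitioned and clustered load into the staging dataset can look like with the google-cloud-bigquery client. The project ID, GCS object path, and table name are placeholders.

```python
# Rough sketch of loading a monthly CSV from GCS into a BigQuery staging table
# partitioned by started_at and clustered by start_station_id.
# The project ID, object path, and table name are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project ID

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    time_partitioning=bigquery.TimePartitioning(field="started_at"),
    clustering_fields=["start_station_id"],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://naza_nyc_bike_rides/raw/2024/01/citibike-tripdata.csv",  # placeholder path
    "my-gcp-project.nyc_bikes_staging.rides",                      # placeholder table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```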

Challenge 3: Data Quality Issues: Some records had null values in critical fields like started_at, rideable_type, and start_station_name. I addressed this in dbt by filtering out these records during transformation, ensuring the dashboard reflected accurate insights.
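
The cleaning itself lives in a dbt SQL model, but the rule is simple enough to show as an equivalent pandas filter, purely for illustration; the file name and column list mirror the Citi Bike trip schema.

```python
# Illustrative pandas equivalent of the dbt cleaning step: drop records that
# are missing critical fields. The actual project applies this filter in a
# dbt SQL model, not in Python.
import pandas as pd

df = pd.read_csv("202401-citibike-tripdata.csv", parse_dates=["started_at", "ended_at"])

critical_fields = ["started_at", "rideable_type", "start_station_name"]
clean = df.dropna(subset=critical_fields)

print(f"Dropped {len(df) - len(clean)} of {len(df)} records with missing critical fields")
```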

What I Learned

While I’ve worked with tools like BigQuery, Airflow, and Terraform before, this project deepened my understanding of how to apply them in a real-world context:

  • Tool Integration: I gained a deeper appreciation for integrating a full suite of tools (GCP, Airflow, dbt, Terraform) into a cohesive, scalable pipeline. Each tool has its strengths, and orchestrating them effectively is key to a successful project.
  • Performance Optimization: Fine-tuning BigQuery for large datasets with partitioning and clustering was a great exercise in balancing cost and performance, especially for a dataset of this scale.
  • Best Practices in Orchestration and IaC: I refined my approach to orchestrating complex workflows with Airflow and provisioning infrastructure with Terraform, focusing on modularity and reproducibility.
  • Data Storytelling: The project reinforced the power of data storytelling—turning raw trip data into actionable insights about urban mobility highlighted the importance of a solid pipeline as the foundation for impactful visualization.

Conclusion

Building the NYC Bike Rides Pipeline for DE Zoomcamp 2025 was an incredible learning experience. It allowed me to apply my data engineering skills to a real-world problem, from ingestion to visualization, while uncovering meaningful insights about urban mobility in NYC. The project also highlighted the importance of automation, optimization, and storytelling in data engineering.

FAQs

Q: What is the purpose of the project?
A: The project aims to build a scalable batch pipeline to ingest, store, transform, and visualize Citi Bike’s 2024 trip data, uncovering trends in urban mobility and rider behavior.

Q: What are the key questions the project aims to answer?
A: The project aims to answer questions such as how Citi Bike usage varies over time, how rides break down between electric and classic bikes, and how ride patterns differ between member and casual users.

Q: What are the main tools used in the project?
A: The project uses Google Cloud Platform (GCP), Apache Airflow, dbt Cloud, and Terraform.

Q: What are the main challenges faced during the project?
A: The project faced challenges such as Airflow setup issues, handling large datasets in BigQuery, and data quality issues.

Q: What did you learn from the project?
A: The project deepened my understanding of integrating a full suite of tools, fine-tuning BigQuery for large datasets, and best practices in orchestration and IaC.
