
bike-share-data-engineering

Introduction

This is the capstone project for the Udacity Data Engineering Nanodegree. The purpose of the capstone project is to combine everything learned throughout the program.


Goal

The goal of this project is to build a data pipeline that cleans, transforms, and loads bike share data from Bay Wheels. Users can then analyze bike traffic together with station, time, region, and weather information.

Technologies

We created 3 buckets on AWS S3 for data storage. S3 offers easy-to-use management features, so anyone can organize data and configure finely tuned access controls to meet specific business, organizational, and compliance requirements.
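
A minimal sketch of the bucket setup, assuming boto3 with AWS credentials already configured; the script-bucket name here is illustrative (the two data-lake bucket names are described in the Data Model section below):

import boto3

s3 = boto3.client("s3", region_name="us-west-2")

buckets = [
    "shuaishao-raw-datalake",        # raw data
    "shuaishao-bikeshare-datalake",  # optimized data
    "shuaishao-bikeshare-scripts",   # ETL scripts (hypothetical name)
]

for name in buckets:
    # Outside us-east-1, S3 requires an explicit LocationConstraint.
    s3.create_bucket(
        Bucket=name,
        CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
    )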

Apache Spark is used for the ETL pipeline. We spun up a Spark cluster on EMR; Spark is a memory-based, distributed computation architecture.
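
The overall ETL pattern on EMR looks roughly like the following hedged PySpark sketch (paths and column handling are illustrative; the actual transformations are described in the Data Quality and Schema sections below):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("bike_share_etl").getOrCreate()

# Read raw trip data from the raw data lake (path is illustrative).
trips = spark.read.csv(
    "s3://shuaishao-raw-datalake/trip_data/",
    header=True,
    inferSchema=True,
)

# ...cleaning and transformation happen here (see Data Quality below)...

# Write the optimized fact table back to S3 as partitioned Parquet.
(
    trips
    .withColumn("start_time", F.to_timestamp("start_time"))
    .withColumn("start_month", F.month("start_time"))
    .write.mode("overwrite")
    .partitionBy("start_month")
    .parquet("s3://shuaishao-bikeshare-datalake/trips/")
)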

The last piece is Apache Airflow, which connects all the components of the pipeline and orchestrates them. Airflow has a built-in UI, so we can track the whole process and troubleshoot with the pipeline logs.

Data Quality

trip_data

The trip dataset includes more columns than we need. Columns such as start_station_name, start_station_longitude, and start_station_latitude are removed because the station dataset already contains this information; only start_station_id and end_station_id are kept. We can join the trips table with the stations table whenever detailed station information is needed.

The trip dataset also includes records with null station information; these records need to be filtered out, since station information is required to compute traffic.
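
A minimal PySpark sketch of this cleaning step, assuming trips is the raw trip DataFrame and the column names follow the Bay Wheels trip schema:

# Keep only the columns needed for the fact table; station details live in
# the stations dimension and can be joined back in via the station ids.
trip_columns = [
    "duration_sec", "start_time", "end_time",
    "start_station_id", "end_station_id",
    "bike_id", "user_type",
]

trips_clean = (
    trips
    .select(*trip_columns)
    # Drop records without station information; traffic cannot be computed for them.
    .dropna(subset=["start_station_id", "end_station_id"])
)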

station_data

The station dataset includes unnecessary columns, which will be removed.

weather_data

The weather dataset is acquired through an API.

Because the stations are distributed across different cities in the San Francisco Bay Area, we need weather information for all of these cities. We generated the weather dataset with the following steps (a hedged sketch follows the list):

  • Extract latitude and longitude information from the station dataset
  • Use the latitude and longitude information to get the nearest weather station with the API
  • Obtain the weather data for all of the matched weather stations with the API; the results show that we only need weather data from 5 weather stations:
    • 74506
    • KSJC0
    • 72494
    • 72493
    • KRHV0
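
Assuming the Meteostat Python package as the weather API (an assumption; the station ids above match its id format), the lookup could look roughly like this, with illustrative coordinates and date range:

from datetime import datetime
from meteostat import Daily, Stations

# (latitude, longitude) pairs extracted from the station dataset (illustrative values).
station_coords = [(37.3382, -121.8863), (37.7749, -122.4194)]

# Find the nearest weather station for each coordinate pair.
weather_station_ids = set()
for lat, lon in station_coords:
    nearest = Stations().nearby(lat, lon).fetch(1)
    weather_station_ids.add(nearest.index[0])

# Fetch daily weather data for each matched weather station.
start, end = datetime(2019, 1, 1), datetime(2019, 12, 31)
for ws_id in weather_station_ids:
    Daily(ws_id, start, end).fetch().to_csv(f"dataset/weather_{ws_id}.csv")
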
region_data

The region dataset is small and complete. No action is needed.

Data Model

We created 2 separate buckets on AWS S3 for storing raw data and optimized data.

  1. shuaishao-raw-datalake
    • This bucket has 4 folders storing trip data, station data, weather data and region data respectively.
  2. shuaishao-bikeshare-datalake
    • This bucket stores the optimized datasets; we use a star schema with 1 fact table and 5 dimension tables.

Schema

Fact table

  1. trips
    • duration_sec; int; duration of a trip in seconds
    • start_time; datetime; start time of a trip; Primary Key
    • end_time; datetime; end time of a trip
    • start_station_id; int; id of start station; Foreign Key
    • end_station_id; int; id of end station; Foreign Key
    • bike_id; int; id of bike
    • user_type; string; type of user
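
For reference, a hedged PySpark schema for the trips fact table, mirroring the column list above:

from pyspark.sql.types import (
    IntegerType, StringType, StructField, StructType, TimestampType,
)

trips_schema = StructType([
    StructField("duration_sec", IntegerType()),
    StructField("start_time", TimestampType()),      # Primary Key
    StructField("end_time", TimestampType()),
    StructField("start_station_id", IntegerType()),  # FK -> stations
    StructField("end_station_id", IntegerType()),    # FK -> stations
    StructField("bike_id", IntegerType()),
    StructField("user_type", StringType()),
])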

Dimension tables

  1. stations
    • station_id; int; Primary Key
    • station_short_name; string; station short name
    • station_external_id; string
    • station_has_kiosk; bool
    • station_rental_methods; list
    • station_capacity; int
    • station_region_id; int; Foreign Key
    • station_type; string
    • station_name; string
    • station_longitude; float
    • station_latitude; float
    • weather_station_id; string; Foreign Key
  2. start_time

    • start_time; timestamp; Primary Key
    • start_year; int
    • start_month; int
    • start_dayofyear; int
    • start_dayofmonth; int
    • start_week; int
    • start_dayofweek; int
    • start_hour; int
  3. end_time

    • end_time; timestamp; Primary Key
    • end_year; int
    • end_month; int
    • end_dayofyear; int
    • end_dayofmonth; int
    • end_week; int
    • end_dayofweek; int
    • end_hour; int
  4. weather

    • station; string; Primary Key
    • date; date
    • temperature; float
    • temperature_min; float
    • temperature_max; float
    • precipitation; float
    • winddirection; float
    • windspeed; float
    • pressure; float
  5. regions

    • region_id; int; Primary Key
    • region_name; string
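
The start_time and end_time dimensions can be derived directly from the trip timestamps. A minimal PySpark sketch, assuming trips_clean from the cleaning step above:

import pyspark.sql.functions as F

start_time_dim = (
    trips_clean
    .select("start_time")
    .dropDuplicates()
    .withColumn("start_year", F.year("start_time"))
    .withColumn("start_month", F.month("start_time"))
    .withColumn("start_dayofyear", F.dayofyear("start_time"))
    .withColumn("start_dayofmonth", F.dayofmonth("start_time"))
    .withColumn("start_week", F.weekofyear("start_time"))
    .withColumn("start_dayofweek", F.dayofweek("start_time"))
    .withColumn("start_hour", F.hour("start_time"))
)
# The end_time dimension is built the same way from end_time.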

Data Pipeline

The pipeline includes two DAGs.

  1. The first DAG creates the buckets for the raw data, the Python script, and the optimized data, uploads the raw data, and checks data quality.

  2. The second DAG runs the ETL on the raw data to generate the optimized data, and checks the data quality of the optimized data.
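
A hedged skeleton of how the second DAG could be wired together (Airflow 2.x import paths; the task callables are hypothetical placeholders, not the repo's exact implementation):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def submit_spark_etl(**context):
    # Hypothetical callable: submit the Spark ETL step to the EMR cluster.
    pass

def check_data_quality(**context):
    # Hypothetical callable: verify row counts and nulls in the optimized tables.
    pass

with DAG(
    dag_id="bike_data_etl_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered manually in this project
    catchup=False,
) as dag:
    etl = PythonOperator(task_id="run_spark_etl", python_callable=submit_spark_etl)
    quality = PythonOperator(task_id="check_data_quality", python_callable=check_data_quality)

    etl >> quality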

Getting started

Use cases

  • The optimized dataset can be queried using Amazon Athena or Apache Spark (see the query sketch after this list).
    • Identify cycling routes by popularity to help with maintenance and planning. For example, if we know that the route from station A to station B is the most popular route, we could consider:
      • Increasing the number of bikes at station A (increasing capacity)
      • Deploying more stations around station A
      • Scheduling more frequent maintenance for station A
  • The optimized data can also be fetched and written into a database (e.g., MySQL, PostgreSQL, Neo4j)
    • The dataset could be written into Neo4j for visualizing cycling routes (e.g., stations are nodes and routes are edges)
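
For example, a hedged PySpark sketch of the popular-route query, assuming the optimized trips and stations tables have been loaded as DataFrames:

import pyspark.sql.functions as F

popular_routes = (
    trips
    .groupBy("start_station_id", "end_station_id")
    .agg(F.count("*").alias("trip_count"))
    .orderBy(F.desc("trip_count"))
)

# Join with the stations dimension for readable names, then inspect the top routes.
popular_routes.join(
    stations.select("station_id", "station_name"),
    popular_routes.start_station_id == stations.station_id,
).show(10)
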
Step 1: Clone repository to local machine
git clone https://github.com/shao-shuai/bike-share-data-engineering.git
Step 2: Create a Python virtual environment
python3 -m venv venv             # create virtualenv
source venv/bin/activate         # activate virtualenv
pip install -r requirements.txt  # install requirements
Step 3: Download the dataset

dataset link

Create a directory called dataset in this repo and place the downloaded data in it.

Step 4: Start Airflow Container
docker-compose up

Step 5: Start the Airflow UI

Visit http://localhost:8080 in your browser and log in to Airflow.

Username: user

Password: password

Step 6: Configure Airflow connections
  1. Click Connections under the Admin tab.

  2. Select Create under Connections.
  3. On the create connection page, enter the following values:
  • Conn Id: Enter aws_credentials.
  • Conn Type: Enter Amazon Web Services.
  • Login: Enter your Access key ID from the IAM User credentials.
  • Password: Enter your Secret access key from the IAM User credentials.
  • Extra: Add the default region name. { "region_name": "us-west-2" }
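
Alternatively, the same connection can be created from the command line inside the Airflow container (assuming Airflow 2.x CLI syntax):

airflow connections add 'aws_credentials' \
    --conn-type 'aws' \
    --conn-login '<ACCESS_KEY_ID>' \
    --conn-password '<SECRET_ACCESS_KEY>' \
    --conn-extra '{"region_name": "us-west-2"}'
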
Step 7: Change default EMR config in Airflow
  1. Click on the Admin tab and select Connections.
  2. Select the 'emr_default' connection
  3. Copy everything from src/helper/emr_default.json and paste into the field 'Extra'
  4. Click save
Step 8: Start raw_datalake_dag

This DAG creates all buckets and uploads raw data. The DAG takes about 10 minutes.

Step 9: Start bike_data_etl_dag

This pipeline extracts the data from the raw data bucket, transforms it with the Spark cluster on EMR, and writes it back to S3. The pipeline takes about 1 hour.

Addressing Other Scenarios

  1. The data was increased by 100x
    • Currently we are using m4.large EC2 instances; if the data increased by 100x, we would consider using m5.large or m5.xlarge instances.
    • We are also currently using 3 workers; adding more workers would be another option.
    • Currently Airflow runs in a single container on my laptop. In a production environment, deploying Airflow on a Kubernetes-coordinated cluster would be a better solution.
  2. The pipelines would be run daily by 7 am.
    • This can be handled with Airflow's scheduler (see the snippet after this list).
  3. The database needed to be accessed by 100+ people.
    • If performance drops with more users, we could consider using an MPP-based data warehouse such as Redshift.
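
A minimal sketch of the scheduling change for scenario 2, reusing the DAG skeleton above (cron expression for 07:00 every day):

from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="bike_data_etl_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 7 * * *",  # run every day at 07:00
    catchup=False,
) as dag:
    ...  # same ETL and data quality tasks as before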

Special thanks

Thanks to brfulu; I learned a lot from your repo, including code readability and well-organized documentation.
