GitHub - zachliu/Insight-TrafficJam: Insight Data Engineering

#TrafficJam

#Introduction

This is a data engineering project at Insight Data Science. The project has two goals:

Provide an API for drivers, city planners, and data scientists to analyze long-term trends in traffic patterns using metrics such as average car volume and speed.
Provide a framework for real-time monitoring of traffic information, so that a user can find the best route to take and select a specific road to view its historical data.

#Data Set

Historical: The project is based on historical traffic volume data for nearly 60,000 major roads in New York State, collected over 10 years. The data is available as a time series. The following table provides a snapshot of the raw data set:

roadID, timestamp, car count

Real-Time: The historical dataset is played back to simulate real-time behavior.
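
The playback idea can be sketched as follows. This is a minimal illustration, not the code in producer.py: the timestamp format, the `speedup` factor, and the names `parse_record` and `replay` are all assumptions made for the example.

```python
import time
from datetime import datetime

# Assumed timestamp layout; the README does not state the real format.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_record(line):
    """Split a 'roadID, timestamp, car count' line into typed fields."""
    road_id, ts, count = [field.strip() for field in line.split(",")]
    return road_id, datetime.strptime(ts, TS_FORMAT), int(count)


def replay(lines, speedup=3600.0, sleep=time.sleep):
    """Yield records in order, pausing between consecutive records.

    The pause is the gap between their timestamps divided by `speedup`
    (hypothetical knob: 3600 replays one hour of data per second).
    """
    previous_ts = None
    for line in lines:
        record = parse_record(line)
        if previous_ts is not None:
            gap = (record[1] - previous_ts).total_seconds()
            if gap > 0:
                sleep(gap / speedup)
        previous_ts = record[1]
        yield record
```

In a pipeline like this one, each yielded record would then be handed to the Kafka producer for publishing. The `sleep` argument is injectable so the pacing can be checked without actually waiting.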

AWS Clusters: A distributed AWS cluster of four EC2 machines is used for this project. All the components (ingestion, batch, and real-time processing) are configured and run in distributed mode, with one master and three workers.

#Data Processing Framework

Ingestion Layer (Kafka): The raw data is consumed by a message broker, configured in publish-subscribe mode. Related files: producer.py, consumer.py.
Batch Layer (HDFS, Spark): A Kafka consumer stores the data in HDFS. Additional columns are added to the dataset to generate the metrics listed under Data Transformations. Spark then generates tables representing the aggregate views that serve user queries. Related files: myBatch.py
Speed Layer (Storm): The topology for processing real-time data comprises a Kafka spout and a bolt (with a tick interval of 2.5 sec). The data is filtered so that only clean (uncorrupted) entries are stored. Related files: stormBolt.py, topology.yaml
Front-end (Flask): The car volume for each road is rendered on Google Maps using four colors and updated at 1 sec intervals via Flask. Historical data is shown as a line plot via Highcharts. Related files: views.py, map.js.
Libraries and APIs: Cassandra, Pyleus, Kafka-python, Google Maps
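
The speed layer's filter-and-tick behavior can be sketched in plain Python. This is an illustration under assumptions, not the contents of stormBolt.py: the definition of a "clean" entry (three fields, non-negative integer count) and the names `parse_clean` and `VolumeAccumulator` are hypothetical, and the Storm/Pyleus wiring is omitted.

```python
from collections import defaultdict


def parse_clean(line):
    """Return (road_id, count) for a clean entry, or None if it is corrupted.

    Assumption: clean means exactly three comma-separated fields, a
    non-empty road ID and timestamp, and a non-negative integer count.
    """
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3 or not parts[0] or not parts[1]:
        return None
    try:
        count = int(parts[2])
    except ValueError:
        return None
    if count < 0:
        return None
    return parts[0], count


class VolumeAccumulator:
    """Accumulates car counts per road between ticks."""

    def __init__(self):
        self.counts = defaultdict(int)

    def on_record(self, line):
        """Called for every incoming message; corrupted entries are dropped."""
        parsed = parse_clean(line)
        if parsed is not None:
            road_id, count = parsed
            self.counts[road_id] += count

    def on_tick(self):
        """Called on each tick (2.5 sec in the topology); returns and resets totals."""
        snapshot = dict(self.counts)
        self.counts.clear()
        return snapshot
```

In the real topology, the dictionary returned by `on_tick` would be what gets written to the store that the front-end reads from.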

#Data Transformations

The following metrics are computed via a MapReduce operation on the raw dataset (Spark):

Total car count in a month
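
The shape of this computation can be shown with a small pure-Python sketch of the map and reduce-by-key steps. It is not the code in myBatch.py: the timestamp format, the choice to key totals by road and month, and the names `to_month_pair` and `monthly_totals` are assumptions. In Spark the same pattern would typically be an RDD `map` followed by `reduceByKey`.

```python
from collections import defaultdict
from datetime import datetime

# Assumed timestamp layout; the README does not state the real format.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_month_pair(line):
    """Map step: turn one raw line into ((road_id, 'YYYY-MM'), count)."""
    road_id, ts, count = [field.strip() for field in line.split(",")]
    month = datetime.strptime(ts, TS_FORMAT).strftime("%Y-%m")
    return (road_id, month), int(count)


def monthly_totals(lines):
    """Reduce step: sum the counts that share the same (road_id, month) key."""
    totals = defaultdict(int)
    for key, count in map(to_month_pair, lines):
        totals[key] += count
    return dict(totals)
```

For example, two January readings of 5 and 7 cars on the same road would produce a single entry of 12 for that road and month.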

#Live Demo

A live demo of the project is available here: trafficjam.today or trafficjam.online. A snapshot of the map with highlighted roads:

#Presentation

The presentation slides are available here: trafficjam.today/slideshare

#Instructions to Run this Pipeline

Install Python packages: sudo pip install kafka-python cassandra-driver pyleus

Run the Kafka producer: python kafka/producer.py

Run the Kafka consumer: python kafka/kafka_consumer.py

Run Spark: spark-submit --packages TargetHolding/pyspark-cassandra:0.1.5 ~/Insight-TrafficJam/spark/myBatch.py

Build the Storm topology: pyleus build topology.yaml

Submit pyleus topology: pyleus submit -n 54.174.177.48 topology.jar
