adityag020/PySparkApplication


Case Study: Creating a PySpark Application

Analytics:

The application should perform the analyses below and store the results of each one; a minimal sketch of one such analysis, using only the DataFrame API, follows the list.

  • Analysis 1: Find the number of crashes (accidents) in which the persons killed are male.
  • Analysis 2: How many two-wheelers are booked for crashes?
  • Analysis 3: Which state has the highest number of accidents in which females are involved?
  • Analysis 4: Which are the Top 5th to 15th VEH_MAKE_IDs that contribute to the largest number of injuries, including deaths?
  • Analysis 5: For all the body styles involved in crashes, mention the top ethnic user group of each unique body style.
  • Analysis 6: Among the crashed cars, what are the Top 5 Zip Codes with the highest number of crashes in which alcohol is the contributing factor? (Use the Driver Zip Code.)
  • Analysis 7: Count the distinct Crash IDs where no damaged property was observed, the damage level (VEH_DMAG_SCL~) is above 4, and the car avails insurance.
  • Analysis 8: Determine the Top 5 Vehicle Makes where drivers are charged with speeding-related offences, have licensed drivers, use the top 10 used vehicle colours, and have cars licensed in the Top 25 states with the highest number of offences (to be deduced from the data).
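
As an illustration, Analysis 1 might look roughly like the sketch below. The file name (Primary_Person_use.csv) and column names (CRASH_ID, PRSN_GNDR_ID, DEATH_CNT) are assumptions about the input schema, not confirmed by this page; the point is that the logic stays entirely within the DataFrame API.

    # Sketch of Analysis 1 -- file and column names are assumed, not taken from the repo
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("CrashAnalysis").getOrCreate()

    # Load the persons file (assumed schema: CRASH_ID, PRSN_GNDR_ID, DEATH_CNT, ...)
    persons_df = spark.read.csv("data/Primary_Person_use.csv", header=True, inferSchema=True)

    # Count distinct crashes in which a male person was killed
    male_killed_crashes = (
        persons_df
        .filter((F.col("PRSN_GNDR_ID") == "MALE") & (F.col("DEATH_CNT") > 0))
        .select("CRASH_ID")
        .distinct()
        .count()
    )
    print(male_killed_crashes)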

Expected Output:

  1. Develop an application that is modular and follows software engineering best practices (e.g. classes, docstrings, functions, config driven, command-line executable through spark-submit).
  2. Code should be properly organized into folders as a project.
  3. Input data sources and outputs should be config driven (a minimal config sketch follows this list).
  4. Code should be developed strictly using the DataFrame APIs (do not use Spark SQL).
  5. Upload the entire project to a GitHub repo.
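
To make the config-driven requirement concrete, here is a minimal sketch of a config.yaml and a small helper module. The keys, paths, and module name (src/utils.py) are illustrative assumptions, not the repository's actual contents.

    # config.yaml -- hypothetical keys and paths
    input:
      persons: data/Primary_Person_use.csv
      units: data/Units_use.csv
    output:
      analysis_1: output/analysis_1

    # src/utils.py -- sketch of config loading and DataFrame I/O helpers
    import yaml
    from pyspark.sql import DataFrame, SparkSession

    def load_config(path: str) -> dict:
        """Read the YAML configuration into a dictionary."""
        with open(path, "r") as f:
            return yaml.safe_load(f)

    def read_csv(spark: SparkSession, path: str) -> DataFrame:
        """Load a CSV input as a DataFrame with a header and inferred schema."""
        return spark.read.csv(path, header=True, inferSchema=True)

    def write_output(df: DataFrame, path: str) -> None:
        """Persist an analysis result as CSV at the configured path."""
        df.write.mode("overwrite").csv(path, header=True)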

Process to follow:

  1. Create a virtual environment and install all packages and dependencies (if using VS Code).
  2. Go to the project directory: $ cd DataEngineer
  3. In a Bash terminal, run $ make. This runs the commands in the Makefile and builds the project for spark-submit; a new folder named "dist" is created and the code artefacts are copied into it.
  4. From the dist folder, src.zip is passed to spark-submit via --py-files so the packaged modules are available on the cluster.
  5. In the CLI, run the job from the dist folder (a sketch of the main.py entry point appears after the note below): $ cd dist && spark-submit --master "local[*]" --py-files src.zip --files config.yaml main.py && cd ..
NOTE: VS Code screenshots are attached.
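
For completeness, the main.py entry point invoked above might look roughly like the sketch below. The config keys and module layout mirror the earlier config sketch and are assumptions rather than the repository's actual code.

    # main.py -- sketch of the spark-submit entry point; names are illustrative
    import yaml
    from pyspark.sql import SparkSession

    def main():
        """Run the analyses, driven by config.yaml."""
        spark = SparkSession.builder.appName("PySparkApplication").getOrCreate()

        # config.yaml is shipped with the job via --files and read from the working directory
        with open("config.yaml", "r") as f:
            config = yaml.safe_load(f)

        # Example: load one input, run an analysis, and write the result to the configured path
        persons_df = spark.read.csv(config["input"]["persons"], header=True, inferSchema=True)
        # ... call each analysis function here and write its output to config["output"][...] ...

        spark.stop()

    if __name__ == "__main__":
        main()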
