quancore/sales_record_etl

This is a homework project to design an ETL pipeline on AWS. Task specifications:

  • The dataset is available here: https://eforexcel.com/wp/wp-content/uploads/2020/09/2m-Sales-Records.zip
  • The solution should be a collection of Lambda functions working together to fetch the file from the HTTPS endpoint, uncompress it, convert it to Parquet, and partition the data on S3 by the value of the Country field, defining all resources needed to query the final dataset via Athena. The original file should be archived under another prefix on the same bucket. The solution should be deployable via a CloudFormation template that contains all the relevant AWS resources.
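
For reference, the fetch step can be as small as a single Lambda handler that pulls the zip over HTTPS and stages it on S3. The sketch below is a minimal illustration, not the repo's actual download.py; the environment variable names, the raw/ key, and the handler name are assumptions.

    import os
    import urllib.request

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical configuration; the repo's actual variable names may differ.
    SOURCE_URL = os.environ.get(
        "SOURCE_URL",
        "https://eforexcel.com/wp/wp-content/uploads/2020/09/2m-Sales-Records.zip",
    )
    BUCKET = os.environ["DATA_BUCKET"]
    RAW_KEY = os.environ.get("RAW_KEY", "raw/2m-Sales-Records.zip")

    def handler(event, context):
        """Fetch the zipped dataset over HTTPS and stage it on S3."""
        tmp_path = "/tmp/sales_records.zip"  # /tmp is Lambda's writable scratch space
        urllib.request.urlretrieve(SOURCE_URL, tmp_path)
        s3.upload_file(tmp_path, BUCKET, RAW_KEY)
        return {"bucket": BUCKET, "key": RAW_KEY}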

Proposed solution

(Solution architecture diagram)

  • download_unzip_lambda: A directory containing the Lambda functions (a sketch of the unzip handler follows this list; the download handler is sketched under the task specification above). It includes:
    • A common Dockerfile specification for the Lambda functions.
    • AWS Lambda functions download.py and unzip.py with different entrypoints.
  • glue: A directory containing glue_convert_partition.py, a PySpark script that defines the AWS Glue ETL job (see the PySpark sketch after this list).
  • cloudformation: A directory containing scripts for deploying the CloudFormation templates.
    • cloudformation.yaml: The initial CloudFormation template.
    • cloudformation_s3_ecr.yaml: A CloudFormation template for the codebase components (artifact S3 bucket, ECR repository) used to create a change set on the stack.
    • deploy_cloudformation.sh: An end-to-end bash script to deploy the CloudFormation stack.
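
A minimal sketch of the unzip step, assuming the download handler passes the staged key in the invocation event; the bucket variable, the csv/ and archive/ prefixes, and the event shape are assumptions, not the repo's exact contract.

    import os
    import zipfile

    import boto3

    s3 = boto3.client("s3")

    BUCKET = os.environ["DATA_BUCKET"]  # hypothetical variable name

    def handler(event, context):
        """Unzip the staged archive, publish the CSV, and archive the original zip."""
        key = event["key"]  # assumed to be passed along by the download step
        local_zip = "/tmp/archive.zip"
        s3.download_file(BUCKET, key, local_zip)

        with zipfile.ZipFile(local_zip) as zf:
            for name in zf.namelist():
                if name.endswith(".csv"):
                    zf.extract(name, "/tmp")
                    s3.upload_file(f"/tmp/{name}", BUCKET, f"csv/{name}")

        # Archive the original file under another prefix on the same bucket,
        # then remove the working copy, per the task specification.
        s3.copy_object(
            Bucket=BUCKET,
            Key=f"archive/{os.path.basename(key)}",
            CopySource={"Bucket": BUCKET, "Key": key},
        )
        s3.delete_object(Bucket=BUCKET, Key=key)
        return {"bucket": BUCKET, "prefix": "csv/"}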
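The Glue job itself reduces to a short PySpark script: read the extracted CSV, then write Parquet partitioned by Country so Athena can prune partitions on that column. A sketch assuming hypothetical input_path/output_path job arguments, which glue_convert_partition.py may name differently:

    import sys

    from awsglue.context import GlueContext
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Hypothetical job arguments; the repo's script may name these differently.
    args = getResolvedOptions(sys.argv, ["JOB_NAME", "input_path", "output_path"])

    spark = GlueContext(SparkContext.getOrCreate()).spark_session

    # Read the extracted CSV, inferring the schema from the header row.
    df = (
        spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv(args["input_path"])
    )

    # Write Parquet partitioned by Country so Athena can prune on that column.
    df.write.mode("overwrite").partitionBy("Country").parquet(args["output_path"])

With the Parquet output in place, a Glue table definition (or crawler) declared in the CloudFormation template exposes the Country-partitioned dataset to Athena, per the task specification.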
