quancore/sales_record_etl

This is a homework project to design an ETL pipeline on AWS. Task specifications:

  • The dataset is available here: https://eforexcel.com/wp/wp-content/uploads/2020/09/2m-Sales-Records.zip
  • The solution should be a collection of Lambda functions working together to fetch the file from the HTTPS endpoint, uncompress it, convert it to Parquet, and partition the data on S3 by the value of the Country field, defining all resources needed to query the final dataset via Athena. The original file should be archived under another prefix on the same bucket. The solution should be deployable via a CloudFormation template that contains all the relevant AWS resources.
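
For reference, the fetch step can be as small as a single Lambda handler that pulls the zip over HTTPS and stages it on S3. The sketch below is a minimal illustration, not the repo's actual download.py; the environment variable names, the raw/ key, and the handler name are assumptions.

    import os
    import urllib.request

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical configuration; the repo's actual variable names may differ.
    SOURCE_URL = os.environ.get(
        "SOURCE_URL",
        "https://eforexcel.com/wp/wp-content/uploads/2020/09/2m-Sales-Records.zip",
    )
    BUCKET = os.environ["DATA_BUCKET"]
    RAW_KEY = os.environ.get("RAW_KEY", "raw/2m-Sales-Records.zip")

    def handler(event, context):
        """Fetch the zipped dataset over HTTPS and stage it on S3."""
        tmp_path = "/tmp/sales_records.zip"  # /tmp is Lambda's writable scratch space
        urllib.request.urlretrieve(SOURCE_URL, tmp_path)
        s3.upload_file(tmp_path, BUCKET, RAW_KEY)
        return {"bucket": BUCKET, "key": RAW_KEY}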

Proposed solution

(Solution architecture diagram)

  • download_unzip_lambda: A directory containing the Lambda functions (a sketch of the unzip handler follows this list; the download handler is sketched under the task specification above). It includes:
    • A common Dockerfile specification for the Lambda functions.
    • AWS Lambda functions download.py and unzip.py with different entrypoints.
  • glue: A directory containing glue_convert_partition.py, a PySpark script that defines the AWS Glue ETL job (see the PySpark sketch after this list).
  • cloudformation: A directory containing scripts for deploying the CloudFormation templates.
    • cloudformation.yaml: The initial CloudFormation template.
    • cloudformation_s3_ecr.yaml: A CloudFormation template for the codebase components (artifact S3 bucket, ECR repository) used to create a change set on the stack.
    • deploy_cloudformation.sh: An end-to-end bash script to deploy the CloudFormation stack.
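
A minimal sketch of the unzip step, assuming the download handler passes the staged key in the invocation event; the bucket variable, the csv/ and archive/ prefixes, and the event shape are assumptions, not the repo's exact contract.

    import os
    import zipfile

    import boto3

    s3 = boto3.client("s3")

    BUCKET = os.environ["DATA_BUCKET"]  # hypothetical variable name

    def handler(event, context):
        """Unzip the staged archive, publish the CSV, and archive the original zip."""
        key = event["key"]  # assumed to be passed along by the download step
        local_zip = "/tmp/archive.zip"
        s3.download_file(BUCKET, key, local_zip)

        with zipfile.ZipFile(local_zip) as zf:
            for name in zf.namelist():
                if name.endswith(".csv"):
                    zf.extract(name, "/tmp")
                    s3.upload_file(f"/tmp/{name}", BUCKET, f"csv/{name}")

        # Archive the original file under another prefix on the same bucket,
        # then remove the working copy, per the task specification.
        s3.copy_object(
            Bucket=BUCKET,
            Key=f"archive/{os.path.basename(key)}",
            CopySource={"Bucket": BUCKET, "Key": key},
        )
        s3.delete_object(Bucket=BUCKET, Key=key)
        return {"bucket": BUCKET, "prefix": "csv/"}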
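The Glue job itself reduces to a short PySpark script: read the extracted CSV, then write Parquet partitioned by Country so Athena can prune partitions on that column. A sketch assuming hypothetical input_path/output_path job arguments, which glue_convert_partition.py may name differently:

    import sys

    from awsglue.context import GlueContext
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Hypothetical job arguments; the repo's script may name these differently.
    args = getResolvedOptions(sys.argv, ["JOB_NAME", "input_path", "output_path"])

    spark = GlueContext(SparkContext.getOrCreate()).spark_session

    # Read the extracted CSV, inferring the schema from the header row.
    df = (
        spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv(args["input_path"])
    )

    # Write Parquet partitioned by Country so Athena can prune on that column.
    df.write.mode("overwrite").partitionBy("Country").parquet(args["output_path"])

With the Parquet output in place, a Glue table definition (or crawler) declared in the CloudFormation template exposes the Country-partitioned dataset to Athena, per the task specification.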
