How to properly develop and deploy an AWS Glue job using AWS Glue interactive sessions and AWS CDK

PS: A SAM version is available on the sam branch.

This repo aims to demonstrate how to develop an AWS Glue job efficiently:

Be able to develop locally
Get a fast feedback loop
Be able to commit with no manual copy-paste between tools

In addition, this repo shows how to deploy the AWS Glue job through a CI/CD pipeline using infrastructure as code.

Two options are proposed here: "Use this repo" or "Do it yourself".

Use this repo

Prerequisites

Clone this repo

```
git clone https://github.com/flochaz/aws-glue-job-e2e-dev-life-cycle.git
cd aws-glue-job-e2e-dev-life-cycle
```

Set up a virtual env
```
python3 -m venv .venv
source .venv/bin/activate
```

Install CDK
```
npm install -g aws-cdk
```

Deploy dev env

To run the Glue job locally, we need a few supporting resources, such as:

an IAM role to assume while running the local notebook
a Glue database to store the data
a Glue crawler to extract the schema and data from the raw source CSV files
Trigger the crawler ...

This CDK app deploys all of these for you, so that you are ready to work on the Glue job itself.

Install deps
```
pip install -r requirements.txt
```
Bootstrap account
```
cdk bootstrap
```
Deploy Glue role, crawler etc.

```
cdk deploy infrastructure
```

Local dev experience

AWS Glue offers a way to run your job remotely while developing locally, through the Interactive Sessions feature.

Set up interactive session:

```
pip install -r requirements-dev.txt
SITE_PACKAGES=$(pip show aws-glue-sessions | grep Location | awk '{print $2}')
jupyter kernelspec install $SITE_PACKAGES/aws_glue_interactive_sessions_kernel/glue_pyspark # Add "--user" if getting "[Errno 13] Permission denied: '/usr/local/share/jupyter'"
jupyter kernelspec install $SITE_PACKAGES/aws_glue_interactive_sessions_kernel/glue_spark # Add "--user" if getting "[Errno 13] Permission denied: '/usr/local/share/jupyter'"
```
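
The awk step above keeps only the second whitespace-separated field of the Location line, so a site-packages path that contains a space would be cut short. As a minimal sketch, the same lookup could be done in Python. The helper names `parse_location` and `kernel_dirs` are hypothetical and not part of this repo; the kernel directory names are the ones used in the commands above.

```python
from pathlib import Path


def parse_location(pip_show_output: str) -> Path:
    """Return the install location reported by `pip show <package>`."""
    for line in pip_show_output.splitlines():
        if line.startswith("Location:"):
            # Keep everything after the first colon so paths with spaces survive.
            return Path(line.split(":", 1)[1].strip())
    raise ValueError("no 'Location:' line found in pip show output")


def kernel_dirs(site_packages: Path) -> list[Path]:
    """Build the two kernelspec directories passed to `jupyter kernelspec install`."""
    base = site_packages / "aws_glue_interactive_sessions_kernel"
    return [base / "glue_pyspark", base / "glue_spark"]
```

Feeding it the captured output of `pip show aws-glue-sessions` should give the two directories to pass to `jupyter kernelspec install`.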

Set up the Glue role by copying the output named awsConfigUPDATE from the previous cdk deploy command into ~/.aws/config under [default]:
```
cat ~/.aws/config
[default]
glue_role_arn=xxxxxx
```
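
If you would rather script this step than edit the file by hand, here is a minimal sketch using Python's standard `configparser`. The function name `set_glue_role_arn` is hypothetical, and the role ARN must be the value printed by your own cdk deploy. Note that `configparser` drops comments when it rewrites the file, so back up ~/.aws/config first if it contains any.

```python
import configparser
from pathlib import Path


def set_glue_role_arn(config_path: Path, role_arn: str, section: str = "default") -> None:
    """Set glue_role_arn in one section of an AWS CLI config file."""
    # interpolation=None keeps '%' characters in existing values untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read(config_path)  # a missing file is treated as empty
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, "glue_role_arn", role_arn)
    config_path.parent.mkdir(parents=True, exist_ok=True)
    with config_path.open("w") as handle:
        parser.write(handle)
```

For example, `set_glue_role_arn(Path.home() / ".aws" / "config", "<role ARN from the cdk output>")` would update the [default] section and leave the other keys in place.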

Launch notebook

```
jupyter notebook # add "--ip 0.0.0.0" if running in a remote IDE such as Cloud9 (PS: you will also need to open your security group for TCP connections on port 8888!)
```
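
When the notebook runs in a remote IDE, a quick way to tell whether the security group actually lets you through is to try a TCP connection to port 8888 from your own machine. This is only a sketch: the function name `is_port_open` is hypothetical, and the host you pass is whatever public address your remote instance has.

```python
import socket


def is_port_open(host: str, port: int = 8888, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved host names.
        return False
```

A `False` result while Jupyter is running on the instance suggests that the security group rule (or the `--ip 0.0.0.0` flag) is still missing.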

Play with glue_job_source/data_cleaning_and_lambda.ipynb
Commit your changes to git
Optionally deploy your changes to dev env
```
cdk deploy infrastructure
```

Deploy through pipeline

If you deploy to the same account and region, first destroy your dev stack to avoid resource collisions (especially the Glue role, crawler, database, etc.):
```
cdk destroy infrastructure
```

Create a repo by deploying the pipeline stack
```
cdk deploy GlueJobPipelineStack
```

Push code to repo

```
# Remove the GitHub origin
git remote remove origin
# Add the CodeCommit repo as origin
git remote add origin <YOUR CODE COMMIT REPO URL (THE COMMAND SHOULD BE FOUND IN THE PREVIOUS "cdk deploy GlueJobPipelineStack" output)>
git push -u origin master
```

Observe the deployment in CodePipeline

Do it yourself

Get into your aws account
Set up your online IDE: Cloud9
Add your glue job (you can take this one for instance https://github.com/aws-samples/aws-glue-samples/blob/master/examples/data_cleaning_and_lambda.py)
Add interactive sessions + notebook CI/CD (optional)
https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Quick hack
1. vim ~/.aws/config (set glue_role_arn)
2. vim ~/.aws/credentials
3. jupyter notebook --ip 0.0.0.0
4. jupyter nbconvert --to script ./data_cleaning_and_lambda.ipynb
Create your first CDK app
Add glue infrastructure: https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_glue_alpha/README.html
Glue database
Glue Role
Glue Crawler
Glue Job
Add CI/CD using the official doc or workshop
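
For the nbconvert step in the quick hack above, it can help to see roughly what turning a notebook into a job script involves. The sketch below is not nbconvert's implementation: it reads the notebook JSON, keeps only the code cells, and simply drops lines that start with `%` or `!` (such as the interactive sessions magics), whereas nbconvert keeps them in rewritten form. The function names are hypothetical, and the line filter is naive (it would also drop a continuation line that happens to start with `%`).

```python
import json
from pathlib import Path


def notebook_to_script(notebook: dict) -> str:
    """Join the code cells of a parsed .ipynb document into one script."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # The notebook format stores source as a list of lines or a single string.
        text = source if isinstance(source, str) else "".join(source)
        kept = [
            line for line in text.splitlines()
            if not line.lstrip().startswith(("%", "!"))  # drop magics and shell escapes
        ]
        if any(line.strip() for line in kept):
            chunks.append("\n".join(kept))
    return "\n\n".join(chunks) + "\n"


def convert_file(ipynb_path: Path, script_path: Path) -> None:
    """Read a notebook from disk and write the extracted script to script_path."""
    notebook = json.loads(ipynb_path.read_text(encoding="utf-8"))
    script_path.write_text(notebook_to_script(notebook), encoding="utf-8")
```

Calling `convert_file(Path("data_cleaning_and_lambda.ipynb"), Path("data_cleaning_and_lambda.py"))` would write a plain Python script that can be reviewed before it is used as the Glue job source.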

TODO

Inject config (such as output_bucket, stage, database name etc ...)
Add dev life cycle diagram and screenshots
Add example for external file inclusion in notebook with aws s3Sync and %extra_py_files etc.
Add integration tests to pipeline
Describe how to add stage with manual approval
Fix CDK unit tests

Feel free to contribute!
