
ETL

Introduction

A fake startup called Sparkify wants to analyze the data they've been collecting on their new music streaming app about:

  1. songs;
  2. user activity.

The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs of user activity on the app (folder log_data) and in a directory of JSON metadata about the songs in the app (folder song_data).

They'd like a data engineer to create:

  • a PostgreSQL database with tables designed to optimize queries on song play analysis
    • using fact and dimension tables for a star schema
  • an ETL pipeline that transfers data from files in two local directories into these tables in PostgreSQL.

Project datasets

There are two datasets: song dataset and log dataset.

Song dataset

  • Subset of real data from the Million Song Dataset
  • Each file is in JSON format and contains metadata about a song and the artist of that song (see the reading sketch after this list).
    • Example content of file song_data/A/A/A/TRAAAAW128F429D538.json:
      {
        "num_songs": 1,
        "artist_id": "ARD7TVE1187B99BFB1",
        "artist_latitude": null,
        "artist_longitude": null,
        "artist_location": "California - LA",
        "artist_name": "Casual",
        "song_id": "SOMZWCG12A8C13C480",
        "title": "I Didn't Mean To",
        "duration": 218.93179,
        "year": 0
      }
  • The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:
    • song_data/A/B/C/TRABCEI128F424C983.json
    • song_data/A/A/B/TRAABJL12903CDCF1A.json
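Each song file holds a single JSON object like the one above, so it can be read with the standard json module. The following is a minimal, illustrative sketch of extracting the fields that feed the songs and artists tables (described in the schema section below); it is independent of how etl.py actually reads the files:

import json

# Read one song file; each file contains a single JSON object describing a song and its artist.
with open("data/song_data/A/A/A/TRAAAAW128F429D538.json") as f:
    song = json.load(f)

# Fields that feed the songs and artists dimension tables.
song_row = [song[k] for k in ("song_id", "title", "artist_id", "year", "duration")]
artist_row = [song[k] for k in ("artist_id", "artist_name", "artist_location",
                                "artist_latitude", "artist_longitude")]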

Log dataset

  • Log files in JSON format generated by an event simulator, based on the songs in the dataset above. They simulate activity logs from a music streaming app according to specified configurations (see the reading sketch after this list).
    • Example content of the first two lines of file log_data/2018/11/2018-11-01-events.json:
    {"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"39"}
    {"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":0,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Home","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
  • The log files are partitioned by year and month. For example, here are filepaths to two files in this dataset:
    • log_data/2018/11/2018-11-12-events.json
    • log_data/2018/11/2018-11-13-events.json
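Unlike the song files, each log file contains one JSON object per line (JSON Lines), as the example above shows. A minimal reading sketch, assuming pandas is among the packages in requirements.txt; the song-play events and the millisecond ts field are the parts the ETL cares about:

import pandas as pd

# Each log file is in JSON Lines format: one event object per line.
df = pd.read_json("data/log_data/2018/11/2018-11-01-events.json", lines=True)

# Keep only song-play events; 'ts' is a Unix timestamp in milliseconds.
df = df[df["page"] == "NextSong"]
df["start_time"] = pd.to_datetime(df["ts"], unit="ms")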

Schema for song play analysis

Fact table

  1. songplays: records in log data associated with song plays, i.e., records with page NextSong
    • songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
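The log events identify a song only by its title, artist name, and duration, so filling in song_id and artist_id presumably requires a lookup against the songs and artists dimension tables described below. A hedged sketch of such a query, written as a constant in the style of sql_queries.py (the actual query in the repository may differ):

# Hypothetical lookup: resolve song_id and artist_id from the fields available in a log event.
song_select = """
    SELECT s.song_id, a.artist_id
    FROM songs s
    JOIN artists a ON s.artist_id = a.artist_id
    WHERE s.title = %s AND a.name = %s AND s.duration = %s;
"""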

Dimension tables

  1. users: users in the app
    • user_id, first_name, last_name, gender, level
  2. songs: songs in music database
    • song_id, title, artist_id, year, duration
  3. artists: artists in music database
    • artist_id, name, location, latitude, longitude
  4. time: timestamps of records in songplays broken down into specific units
    • start_time, hour, day, week, month, year, weekday
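start_time comes from the millisecond ts field of the log events, so the remaining columns are plain datetime components. A minimal sketch with pandas (assumed to be in the requirements; the 'week' accessor shown needs pandas >= 1.1, older versions expose t.dt.week instead):

import pandas as pd

# 'ts' values such as 1541105830796 are Unix timestamps in milliseconds.
t = pd.to_datetime(pd.Series([1541105830796, 1541106106796]), unit="ms")

# One row per timestamp, matching the columns of the time table.
time_df = pd.DataFrame({
    "start_time": t,
    "hour": t.dt.hour,
    "day": t.dt.day,
    "week": t.dt.isocalendar().week,
    "month": t.dt.month,
    "year": t.dt.year,
    "weekday": t.dt.weekday,
})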

Project structure

├── README.md: this file
├── constants.py: database constants
├── create_tables.py: create a clean database with all tables. Run this file to reset the tables
|                     before running the ETL scripts.
├── data
│   ├── log_data: JSON log files
│   └── song_data: JSON song files
├── etl.ipynb: read and process a single file from `song_data` and `log_data`, and load the data
|              into the tables. This notebook contains detailed instructions on the ETL process for
|              each of the tables, and serves as a basis for `etl.py`
├── etl.py: read and process JSON files and insert data into dimension and fact tables (sketched below)
├── requirements.txt: project requirements (Python libraries)
├── requirements_dev.txt: additional requirements used for development
├── sql_queries.py: SQL commands (DROP TABLE, CREATE TABLE, INSERT INTO, SELECT); imported in
|                   `create_tables.py`, `etl.py` and `etl.ipynb`
└── test.ipynb: display the first few rows of each table to check if the database is correct
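At a high level, etl.py presumably walks each data directory, applies a per-file processing function, and inserts rows through a PostgreSQL driver such as psycopg2. A hedged sketch of that driver loop; the function names and connection string are illustrative (the real values would live in constants.py and sql_queries.py):

import glob
import os

import psycopg2


def process_data(cur, conn, filepath, func):
    """Find every JSON file under `filepath` and apply `func` to each of them."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(glob.glob(os.path.join(root, "*.json")))
    for datafile in all_files:
        func(cur, datafile)  # e.g. a process_song_file or process_log_file function
        conn.commit()


# Hypothetical connection string; the actual constants live in constants.py.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()
# process_data(cur, conn, "data/song_data", func=process_song_file)
# process_data(cur, conn, "data/log_data", func=process_log_file)
conn.close()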

Python environment

Setup

Create a conda environment called etl-env and install the requirements in it.

conda create -yn etl-env python=3.7 --file requirements.txt

Development

Use the new conda environment in jupyter notebook.

conda install -n base nb_conda_kernels
conda activate etl-env
python -m pip install -r requirements_dev.txt
conda deactivate
jupyter notebook

Now you can run the Jupyter notebooks etl.ipynb and test.ipynb.

Database

Setup PostgreSQL database (macOS)

Install PostgreSQL:

  1. Install Homebrew (macOS): http://brew.sh/
  2. Install PostgreSQL (macOS): brew install postgresql

Start service and create default user and database:

  1. Start PostgreSQL with homebrew-services: brew services run postgresql

    • Alternative: pg_ctl -D /usr/local/var/postgres start
    • brew services start <service> starts <service> now and registers it to start at login, while brew services run <service> only runs it now, without starting it at login (or on boot).
  2. Connect to default PostgreSQL database and create user:

    $ psql --dbname postgres
    postgres=# \du
    postgres=# CREATE ROLE student WITH LOGIN CREATEDB PASSWORD 'student';
    postgres=# \du
    postgres=# \quit
  3. Connect to default PostgreSQL database as the new user and create database:

    $ psql --dbname postgres --username student
    postgres=> \list
    postgres=> CREATE DATABASE studentdb;
    postgres=> \list
    postgres=> GRANT ALL PRIVILEGES ON DATABASE studentdb TO student;
    postgres=> \list
    postgres=> \connect studentdb
    postgres=> \quit

Run the ETL

python create_tables.py && python etl.py

To debug: psql --dbname sparkifydb --username student
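Besides psql, the load can be sanity-checked from Python, in the spirit of test.ipynb; a minimal sketch, assuming psycopg2 and the credentials created during the database setup:

import psycopg2

# Hypothetical connection string, matching the user and database used by the ETL scripts.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()
for table in ("songplays", "users", "songs", "artists", "time"):
    cur.execute(f"SELECT COUNT(*) FROM {table};")
    print(table, cur.fetchone()[0])
conn.close()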

If you receive the error DETAIL: There is 1 other session using the database., close all other connections except the current one. The other session is probably a running Jupyter notebook, so restart it, or run this query:

SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = current_database() AND pid <> pg_backend_pid();

Cleanup

  1. Remove the Python environment: conda env remove -n etl-env

  2. Remove all tables:

    $ psql --dbname sparkifydb
    sparkifydb=# \dn+
    sparkifydb=# DROP SCHEMA public CASCADE;
    sparkifydb=# \dn+
    sparkifydb=# CREATE SCHEMA public;
    sparkifydb=# GRANT ALL ON SCHEMA public TO public;
    sparkifydb=# COMMENT ON SCHEMA public IS 'standard public schema';
    sparkifydb=# \dn+
    sparkifydb=# \quit

    Or remove the database and the user:

    $ psql --dbname postgres
    postgres=# DROP DATABASE IF EXISTS sparkifydb;
    postgres=# DROP USER student;
    postgres=# \quit
  3. Stop the PostgreSQL service: brew services stop postgresql

    • Alternative: pg_ctl -D /usr/local/var/postgres stop
    • Check by listing all services managed by brew services (postgresql should be stopped): brew services list

Possible questions to ask

Examples of queries:

  • Top 10 most common years when songs were published:

    SELECT year, COUNT(*)
    FROM songs
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10;
  • Distribution of song duration:

    SELECT
      percentile,
      PERCENTILE_DISC(percentile) WITHIN GROUP (ORDER BY duration) AS duration
    FROM songs, GENERATE_SERIES(0, 1, 0.25) AS percentile
    GROUP BY 1;
  • Top 10 most common first names and last names of users:

    SELECT first_name, COUNT(*)
    FROM users
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10;
    
    SELECT last_name, COUNT(*)
    FROM users
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10;
  • Top 10 most common artist locations:

    SELECT location, COUNT(*)
    FROM artists
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10;
  • Percentage of users per gender and level:

    SELECT
      gender,
      ROUND(100 * CAST(COUNT(*) AS DECIMAL) / (SELECT COUNT(*) FROM users), 2) AS percentage
    FROM users
    GROUP BY 1
    ORDER BY 1;
    
    SELECT
      level,
      ROUND(100 * CAST(COUNT(*) AS DECIMAL) / (SELECT COUNT(*) FROM users), 2) AS percentage
    FROM users
    GROUP BY 1
    ORDER BY 1;
    
    WITH per_gender_level AS (
    SELECT gender, level, COUNT(*) AS count
    FROM users
    GROUP BY 1, 2),
    
    per_gender AS (
    SELECT gender, COUNT(*) AS count
    FROM users
    GROUP BY 1)
    
    SELECT
      per_gender_level.gender,
      per_gender_level.level,
      ROUND(100 * (CAST(per_gender_level.count AS DECIMAL) / per_gender.count), 2) AS percentage
    FROM per_gender_level
    INNER JOIN per_gender ON per_gender.gender = per_gender_level.gender
    ORDER BY 1, 2;
  • OLAP cube for users' gender and level:

    SELECT gender, level, COUNT(*)
    FROM users
    WHERE gender IS NOT NULL AND level IS NOT NULL
    GROUP BY CUBE (1, 2)
    ORDER BY 1, 2;
  • Summary of the number of songs per session (min, average, max):

    SELECT MIN(n_songs), AVG(n_songs), MAX(n_songs)
    FROM (SELECT session_id, COUNT(*) AS n_songs
          FROM songplays
          GROUP BY 1) AS songs_per_session;
  • Top 10 most active users:

    SELECT user_id, COUNT(*)
    FROM songplays
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10;
  • Top 10 most played songs and artists:

    SELECT song_id, COUNT(*)
    FROM songplays
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10;
    
    SELECT artist_id, COUNT(*)
    FROM songplays
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10;
