
ETL

Introduction

A fake startup called Sparkify wants to analyze the data they've been collecting on their new music streaming app about:

  1. songs;
  2. user activity.

The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs of user activity on the app (folder log_data) and in a directory of JSON metadata about the songs in the app (folder song_data).

They'd like a data engineer to create:

  • a PostgreSQL database with tables designed to optimize queries on song play analysis
    • using fact and dimension tables for a star schema
  • an ETL pipeline that transfers data from files in two local directories into these tables in PostgreSQL.

Project datasets

There are two datasets: song dataset and log dataset.

Song dataset

  • Subset of real data from the Million Song Dataset
  • Each file is in JSON format and contains metadata about a song and the artist of that song (see the reading sketch after this list).
    • Example content of file song_data/A/A/A/TRAAAAW128F429D538.json:
      {
        "num_songs": 1,
        "artist_id": "ARD7TVE1187B99BFB1",
        "artist_latitude": null,
        "artist_longitude": null,
        "artist_location": "California - LA",
        "artist_name": "Casual",
        "song_id": "SOMZWCG12A8C13C480",
        "title": "I Didn't Mean To",
        "duration": 218.93179,
        "year": 0
      }
  • The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:
    • song_data/A/B/C/TRABCEI128F424C983.json
    • song_data/A/A/B/TRAABJL12903CDCF1A.json
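Each song file holds a single JSON object like the one above, so it can be read with the standard json module. The following is a minimal, illustrative sketch of extracting the fields that feed the songs and artists tables (described in the schema section below); it is independent of how etl.py actually reads the files:

import json

# Read one song file; each file contains a single JSON object describing a song and its artist.
with open("data/song_data/A/A/A/TRAAAAW128F429D538.json") as f:
    song = json.load(f)

# Fields that feed the songs and artists dimension tables.
song_row = [song[k] for k in ("song_id", "title", "artist_id", "year", "duration")]
artist_row = [song[k] for k in ("artist_id", "artist_name", "artist_location",
                                "artist_latitude", "artist_longitude")]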

Log dataset

  • Log files in JSON format generated by an event simulator, based on the songs in the dataset above. They simulate activity logs from a music streaming app according to specified configurations (see the reading sketch after this list).
    • Example content of the first two lines of file log_data/2018/11/2018-11-01-events.json:
    {"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"39"}
    {"artist":null,"auth":"Logged In","firstName":"Kaylee","gender":"F","itemInSession":0,"lastName":"Summers","length":null,"level":"free","location":"Phoenix-Mesa-Scottsdale, AZ","method":"GET","page":"Home","registration":1540344794796.0,"sessionId":139,"song":null,"status":200,"ts":1541106106796,"userAgent":"\"Mozilla\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/35.0.1916.153 Safari\/537.36\"","userId":"8"}
  • The log files are partitioned by year and month. For example, here are filepaths to two files in this dataset:
    • log_data/2018/11/2018-11-12-events.json
    • log_data/2018/11/2018-11-13-events.json
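Unlike the song files, each log file contains one JSON object per line (JSON Lines), as the example above shows. A minimal reading sketch, assuming pandas is among the packages in requirements.txt; the song-play events and the millisecond ts field are the parts the ETL cares about:

import pandas as pd

# Each log file is in JSON Lines format: one event object per line.
df = pd.read_json("data/log_data/2018/11/2018-11-01-events.json", lines=True)

# Keep only song-play events; 'ts' is a Unix timestamp in milliseconds.
df = df[df["page"] == "NextSong"]
df["start_time"] = pd.to_datetime(df["ts"], unit="ms")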

Schema for song play analysis

Fact table

  1. songplays: records in log data associated with song plays, i.e., records with page NextSong
    • songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
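The log events identify a song only by its title, artist name, and duration, so filling in song_id and artist_id presumably requires a lookup against the songs and artists dimension tables described below. A hedged sketch of such a query, written as a constant in the style of sql_queries.py (the actual query in the repository may differ):

# Hypothetical lookup: resolve song_id and artist_id from the fields available in a log event.
song_select = """
    SELECT s.song_id, a.artist_id
    FROM songs s
    JOIN artists a ON s.artist_id = a.artist_id
    WHERE s.title = %s AND a.name = %s AND s.duration = %s;
"""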

Dimension tables

  1. users: users in the app
    • user_id, first_name, last_name, gender, level
  2. songs: songs in music database
    • song_id, title, artist_id, year, duration
  3. artists: artists in music database
    • artist_id, name, location, latitude, longitude
  4. time: timestamps of records in songplays broken down into specific units
    • start_time, hour, day, week, month, year, weekday
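start_time comes from the millisecond ts field of the log events, so the remaining columns are plain datetime components. A minimal sketch with pandas (assumed to be in the requirements; the 'week' accessor shown needs pandas >= 1.1, older versions expose t.dt.week instead):

import pandas as pd

# 'ts' values such as 1541105830796 are Unix timestamps in milliseconds.
t = pd.to_datetime(pd.Series([1541105830796, 1541106106796]), unit="ms")

# One row per timestamp, matching the columns of the time table.
time_df = pd.DataFrame({
    "start_time": t,
    "hour": t.dt.hour,
    "day": t.dt.day,
    "week": t.dt.isocalendar().week,
    "month": t.dt.month,
    "year": t.dt.year,
    "weekday": t.dt.weekday,
})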

Project structure

├── README.md: this file
├── constants.py: database constants
├── create_tables.py: create a clean database with all tables. Run this file to reset the tables
|                     before running the ETL scripts.
├── data
│   ├── log_data: JSON log files
│   └── song_data: JSON song files
├── etl.ipynb: read and process a single file from `song_data` and `log_data`, and load the data
|              into the tables. This notebook contains detailed instructions on the ETL process for
|              each of the tables, and serves as a basis for `etl.py`
├── etl.py: read and process JSON files and insert data into dimension and fact tables (sketched below)
├── requirements.txt: project requirements (Python libraries)
├── requirements_dev.txt: additional requirements used for development
├── sql_queries.py: SQL commands (DROP TABLE, CREATE TABLE, INSERT INTO, SELECT); imported in
|                   `create_tables.py`, `etl.py` and `etl.ipynb`
└── test.ipynb: display the first few rows of each table to check if the database is correct
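At a high level, etl.py presumably walks each data directory, applies a per-file processing function, and inserts rows through a PostgreSQL driver such as psycopg2. A hedged sketch of that driver loop; the function names and connection string are illustrative (the real values would live in constants.py and sql_queries.py):

import glob
import os

import psycopg2


def process_data(cur, conn, filepath, func):
    """Find every JSON file under `filepath` and apply `func` to each of them."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        all_files.extend(glob.glob(os.path.join(root, "*.json")))
    for datafile in all_files:
        func(cur, datafile)  # e.g. a process_song_file or process_log_file function
        conn.commit()


# Hypothetical connection string; the actual constants live in constants.py.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()
# process_data(cur, conn, "data/song_data", func=process_song_file)
# process_data(cur, conn, "data/log_data", func=process_log_file)
conn.close()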

Python environment

Setup

Create a conda environment called etl-env and install the requirements in it.

conda create -yn etl-env python=3.7 --file requirements.txt

Development

Use the new conda environment in jupyter notebook.

conda install -n base nb_conda_kernels
conda activate etl-env
python -m pip install -r requirements_dev.txt
conda deactivate
jupyter notebook

Now you can run the Jupyter notebooks etl.ipynb and test.ipynb.

Database

Setup PostgreSQL database (macOS)

Install PostgreSQL:

  1. Install Homebrew (macOS): http://brew.sh/
  2. Install PostgreSQL (macOS): brew install postgresql

Start service and create default user and database:

  1. Start PostgreSQL with homebrew-services: brew services run postgresql

    • Alternative: pg_ctl -D /usr/local/var/postgres start
    • brew services start <service> starts <service> now and registers it to start at login, while brew services run <service> only runs it now, without starting it at login (or on boot).
  2. Connect to default PostgreSQL database and create user:

    $ psql --dbname postgres
    postgres=# \du
    postgres=# CREATE ROLE student WITH LOGIN CREATEDB PASSWORD 'student';
    postgres=# \du
    postgres=# \quit
  3. Connect to default PostgreSQL database as the new user and create database:

    $ psql --dbname postgres --username student
    postgres=> \list
    postgres=> CREATE DATABASE studentdb;
    postgres=> \list
    postgres=> GRANT ALL PRIVILEGES ON DATABASE studentdb TO student;
    postgres=> \list
    postgres=> \connect studentdb
    postgres=> \quit

Run the ETL

python create_tables.py && python etl.py

To debug: psql --dbname sparkifydb --username student
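Besides psql, the load can be sanity-checked from Python, in the spirit of test.ipynb; a minimal sketch, assuming psycopg2 and the credentials created during the database setup:

import psycopg2

# Hypothetical connection string, matching the user and database used by the ETL scripts.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()
for table in ("songplays", "users", "songs", "artists", "time"):
    cur.execute(f"SELECT COUNT(*) FROM {table};")
    print(table, cur.fetchone()[0])
conn.close()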

If you receive the error DETAIL: There is 1 other session using the database., close all other connections except the current one. The other session is probably a running Jupyter notebook, so restart it, or run this query:

SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = current_database() AND pid <> pg_backend_pid();

Cleanup

  1. Remove the Python environment: conda env remove -n etl-env

  2. Remove all tables:

    $ psql --dbname sparkifydb
    sparkifydb=# \dn+
    sparkifydb=# DROP SCHEMA public CASCADE;
    sparkifydb=# \dn+
    sparkifydb=# CREATE SCHEMA public;
    sparkifydb=# GRANT ALL ON SCHEMA public TO public;
    sparkifydb=# COMMENT ON SCHEMA public IS 'standard public schema';
    sparkifydb=# \dn+
    sparkifydb=# \quit

    Or remove the database and the user:

    $ psql --dbname postgres
    postgres=# DROP DATABASE IF EXISTS sparkifydb;
    postgres=# DROP USER student;
    postgres=# \quit
  3. Stop the PostgreSQL service: brew services stop postgresql

    • Alternative: pg_ctl -D /usr/local/var/postgres stop
    • Check by listing all services managed by brew services (postgresql should be stopped): brew services list

Possible questions to ask

Examples of queries:

  • Top 10 most common years when songs were published:

    SELECT year, COUNT(*)
    FROM songs
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10;
  • Distribution of song duration:

    SELECT
      percentile,
      PERCENTILE_DISC(percentile) WITHIN GROUP (ORDER BY duration) AS duration
    FROM songs, GENERATE_SERIES(0, 1, 0.25) AS percentile
    GROUP BY 1;
  • Top 10 most common first names and last names of users:

    SELECT first_name, COUNT(*)
    FROM users
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10;
    
    SELECT last_name, COUNT(*)
    FROM users
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10;
  • Top 10 most common artist locations:

    SELECT location, COUNT(*)
    FROM artists
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10;
  • Percentage of users per gender and level:

    SELECT
      gender,
      ROUND(100 * CAST(COUNT(*) AS DECIMAL) / (SELECT COUNT(*) FROM users), 2) AS percentage
    FROM users
    GROUP BY 1
    ORDER BY 1;
    
    SELECT
      level,
      ROUND(100 * CAST(COUNT(*) AS DECIMAL) / (SELECT COUNT(*) FROM users), 2) AS percentage
    FROM users
    GROUP BY 1
    ORDER BY 1;
    
    WITH per_gender_level AS (
    SELECT gender, level, COUNT(*) AS count
    FROM users
    GROUP BY 1, 2),
    
    per_gender AS (
    SELECT gender, COUNT(*) AS count
    FROM users
    GROUP BY 1)
    
    SELECT
      per_gender_level.gender,
      per_gender_level.level,
      ROUND(100 * (CAST(per_gender_level.count AS DECIMAL) / per_gender.count), 2) AS percentage
    FROM per_gender_level
    INNER JOIN per_gender ON per_gender.gender = per_gender_level.gender
    ORDER BY 1, 2;
  • OLAP cube for users' gender and level:

    SELECT gender, level, COUNT(*)
    FROM users
    WHERE gender IS NOT NULL AND level IS NOT NULL
    GROUP BY CUBE (1, 2)
    ORDER BY 1, 2;
  • Summary of the number of songs per session (min, average, max):

    SELECT MIN(n_songs), AVG(n_songs), MAX(n_songs)
    FROM (SELECT session_id, COUNT(*) AS n_songs
          FROM songplays
          GROUP BY 1) AS songs_per_session;
  • Top 10 most active users:

    SELECT user_id, COUNT(*)
    FROM songplays
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10;
  • Top 10 most played songs and artists:

    SELECT song_id, COUNT(*)
    FROM songplays
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10;
    
    SELECT artist_id, COUNT(*)
    FROM songplays
    GROUP BY 1
    ORDER BY 2 DESC
    LIMIT 10;
