nlp-president-speeches

Extract and visualise mentions of climate change issues in presidential speeches (from all time).

Everything is Dockerized.

brew install make
# see list of available commands
make

# Load the data, start api and frontend client
# (you might want to go grab a ☕️) 
# look out for the climate-related sentences being printed to console
make runall

# Inspect the model generation and play around (stop the api containers first)
# Open the link from terminal and go to 'app' directory in the sidebar
make etl-notebook

1. Process + Load data (`etl`)

Read .txt from filesystem
Model iteration in Jupyter Notebooks
Text processing and classification using NLTK and PyTorch
All environments using docker/docker-compose

2. Serve (`api`)

Simple Flask app exposing REST endpoints
Integration with VSCode debugger
Hot reloading
Again all dockerized

3. Visualise (`client`)

React app displaying the data
Charting with Victory
Display all sentences about climate issues with related metadata (who, when)
Highlight sentences in the context of the full speech

How does it work

The database is shared between the ETL and the API as a Docker volume (etl_postgres_data).
The ETL loads the data as follows:
- Read each folder of president speeches and insert the president and speech into the database
- Classify sentences with a pretrained classification model based on RoBERTa, using the SimpleTransformers library
- The model was trained on a corpus of sentences labelled as either about climate change or not about climate change
- Note: many sentences are misclassified, e.g. most sentences containing numbers end up in the climate set. Larger and more representative training datasets could mitigate this
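The first loading step can be sketched roughly as below. This is a minimal illustration and not the project's actual ETL code: the folder layout (`<root>/<president>/<speech>.txt`) and the function name `iter_speeches` are assumptions, and the database insert is left out.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_speeches(root: Path) -> Iterator[Tuple[str, str, str]]:
    """Yield (president, speech_title, text) for every .txt file under root.

    Assumed (hypothetical) layout: root/<president>/<speech>.txt
    """
    # One sub-folder per president; anything that is not a folder is skipped.
    for president_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # One .txt file per speech inside each president folder.
        for speech_file in sorted(president_dir.glob("*.txt")):
            text = speech_file.read_text(encoding="utf-8", errors="replace")
            yield president_dir.name, speech_file.stem, text
```

Each yielded tuple would then be written to the president and speech tables before the sentences are classified.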

That model took too long to return predictions :(, so I reverted to the original "classifier": does the sentence contain one or more keywords from a set, e.g. ["climate", "environment", "green"]?
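A keyword check like this could look roughly as follows. It is a sketch under assumptions: the keyword set only contains the three examples above, the function name is made up, and matching on whole lowercase words (rather than substrings) is one possible choice, not necessarily what the ETL does.

```python
import re
from typing import AbstractSet

# Example keywords from the text above; the real set may be longer.
CLIMATE_KEYWORDS = frozenset({"climate", "environment", "green"})


def is_climate_sentence(
    sentence: str, keywords: AbstractSet[str] = CLIMATE_KEYWORDS
) -> bool:
    """Return True if the sentence contains at least one keyword as a whole word."""
    # Lowercase and split into alphabetic words so "Climate," matches "climate".
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return not words.isdisjoint(keywords)
```

Whole-word matching avoids hits such as "Greenland", but it also misses inflections such as "environmental"; substring matching makes the opposite trade-off.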

What's next

Use larger datasets for training/evaluation, speed up prediction by passing in more sentences at once or using a smaller model
Share db models between etl and api
Set up Python linting / static analysis
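The "more sentences at once" idea could be sketched like this. The helper names and the default batch size are assumptions, and `predict_fn` is a hypothetical stand-in for whatever the model exposes (for SimpleTransformers, a thin wrapper around the model's `predict` method that returns only the predicted labels would fit).

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of items, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_in_batches(
    sentences: Sequence[str],
    predict_fn: Callable[[List[str]], Sequence[int]],
    batch_size: int = 64,
) -> List[int]:
    """Call predict_fn once per batch instead of once per sentence."""
    labels: List[int] = []
    for batch in chunked(sentences, batch_size):
        # predict_fn is expected to return one label per input sentence.
        labels.extend(predict_fn(list(batch)))
    return labels
```

Whether this helps depends on the model and hardware; the batch size is worth tuning rather than taking the default here.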

Split president names into firstname/lastname for better sorting
Return pre-sorted / precomputed data from extra endpoints -- then we need to do less manipulation on the client
Transform / format data when we receive it in the client (get rid of those underscores 🤮)
Add (more) tests in JS and Python
