
FSDL Course 2022 - Weak Supervision and Deep Learning with text data

Full Stack Deep Learning Course 2022 project

Team 44:

Description

Source data

AG is a collection of more than 1 million news articles. The articles were gathered from more than 2,000 news sources by ComeToMyHead, an academic news search engine that has been running since July 2004, over more than one year of activity. The dataset is provided to the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html

The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

Weakly supervised dataset built from unlabeled data in the AG News dataset

https://huggingface.co/datasets/bergr7/weakly_supervised_ag_news

from datasets import load_dataset

# download the weakly supervised AG News dataset from the Hugging Face Hub
dataset = load_dataset("bergr7/weakly_supervised_ag_news")

...
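To check what was downloaded, you can print the available splits and peek at a record. A minimal sketch; the "train" split name is an assumption based on typical Hugging Face datasets:

print(dataset)                 # shows the available splits and their sizes
example = dataset["train"][0]  # assumes a "train" split exists
print(example)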

Requirements

Setup project

Create a virtual environment using conda: make conda-env-and-update

Usage

The project is made up of several services, which run either as Docker containers or as public online services.

[main-diagram: overview of the project's services]

This project has the following services:

  1. jupyter: a Jupyter notebook server with support for NVIDIA A100-SXM4-40GB GPUs (CUDA 11.6); it contains scripts for training and evaluating models
  2. lambda: a lambda function for serving predictions (see the example request after this list)
  3. streamlit: a Streamlit application for displaying a user interface
  4. rubrix: a Rubrix server for annotating data
  5. elastic: an Elasticsearch server for storing data from Rubrix
  6. kibana: a Kibana server for visualizing data from Elasticsearch
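For illustration, once the lambda service is up you could request a prediction over HTTP. This is only a sketch: it assumes the container exposes the standard AWS Lambda runtime interface emulator endpoint on port 9000, and the payload shape is hypothetical:

import requests

# hypothetical local endpoint and payload, for illustration only
response = requests.post(
    "http://localhost:9000/2015-03-31/functions/function/invocations",
    json={"text": "Stocks rallied on Wall Street today."},
)
print(response.json())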

Rubrix

Launch the rubrix service locally (you also need the elasticsearch and kibana services):

docker compose -f docker-compose.yml up -d rubrix elasticsearch kibana

Afterward, you should be able to access the web app at http://localhost:6900/.

The default username and password are rubrix and 1234.
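As a quick smoke test, you can log a record from Python with the rubrix client. A minimal sketch; the dataset name is hypothetical and the init call uses rubrix's documented defaults:

import rubrix as rb

# point the client at the local server (default API key; adjust if you changed it)
rb.init(api_url="http://localhost:6900", api_key="rubrix.apikey")

# log a single unlabeled record to a (hypothetical) dataset
record = rb.TextClassificationRecord(inputs="Wall St. Bears Claw Back Into the Black")
rb.log(records=record, name="ag_news_weak_supervision")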

Rules on Rubrix
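A minimal sketch of defining a labeling rule and building weak labels with rubrix's weak-supervision module; the query, label, and dataset name here are illustrative:

from rubrix.labeling.text_classification import Rule, WeakLabels

# a rule maps an Elasticsearch-style query to a label
business = Rule(query="stocks OR shares OR market", label="Business")

# apply the rules to a logged dataset to build the weak-label matrix
weak_labels = WeakLabels(rules=[business], dataset="ag_news_weak_supervision")
print(weak_labels.summary())  # per-rule coverage and overlap stats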

Register model in W&B and convert to TorchScript

  1. Add a registered model in the W&B model registry (guide here)

  2. Run the command using the name in the registry as 'model_name':

    python cli/deploy.py register-and-convert-model --artifact-name="[entity]/[project]/[artifact_name]:[alias]" --model-name="[entity]/[project]/[model_name]"

This registers the model artifact in the W&B model registry and converts it to the TorchScript format.
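A rough sketch of what the two steps amount to; the artifact paths mirror the placeholders above, and the dummy model class is a stand-in, not the project's actual code:

import torch
import wandb
from torch import nn

run = wandb.init(project="fsdl-project")  # hypothetical project name

# 1. link the trained model artifact into the W&B model registry
artifact = run.use_artifact("[entity]/[project]/[artifact_name]:[alias]")
run.link_artifact(artifact, "[entity]/[project]/[model_name]")

# 2. convert the model to TorchScript so it can be served without the Python class
class DummyClassifier(nn.Module):  # stand-in for the real model class
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(768, 4)

    def forward(self, x):
        return self.linear(x)

model = DummyClassifier().eval()
scripted = torch.jit.script(model)
scripted.save("model.torchscript.pt")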

Use docker-compose

All the services can be launched using docker-compose; its configuration is spread across several YAML files that can be chained with the -f flag. Read more about this in the docker-compose documentation. You will need to provide a .env file with the variables listed in .env.sample for this deployment to work.

In particular, the following files are used:

  • docker-compose.yml (base) is used to run the project in a container. It contains all the services apart from the proxy.
  • docker-compose.override.yml (override) is used to run the project as a development environment. It exposes several volumes to allow quick prototyping.
  • docker-compose.nvidia.yml (nvidia) extends the jupyter container and adds support for GPU usage.

These can be launched using make, e.g. make dev.all.up launches all the services by combining the base and override configs. The Makefile contains a list of all the commands.
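For example, chaining the files manually looks like this (the nvidia file is only useful on a machine with an NVIDIA GPU):

docker compose -f docker-compose.yml -f docker-compose.override.yml up -d
docker compose -f docker-compose.yml -f docker-compose.override.yml -f docker-compose.nvidia.yml up -d jupyter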

Contributing

As the project is made up of multiple services, please follow the guide corresponding to the service you want to contribute to.

Entire project

If you need to work on a feature that requires all the services to be active and running (e.g. active learning loop with notebook and rubrix), you can run make dev.all.up.build.

Jupyter: Main app and notebooks

  1. Create a dev environment either with conda or with Docker (the suggested way):

    • Docker: Launch the jupyter service locally using the dev config for docker-compose. This mounts the local directory into the container, allowing you to change local files and see the changes in the container straight away:

      docker compose build jupyter
      docker compose -f docker-compose.yml -f docker-compose.override.yml up -d jupyter

      (Optional) Install pre-commit hooks: make install-pre-commit

  2. Update the code in the following folders:

    • services/jupyter/src/app: contains the main code of the application (model training, deployment)
    • services/jupyter/src/notebooks: contains useful notebooks for data exploration, testing, ...

AWS Lambda

Please follow the guide here

Streamlit UI

Please follow the guide here
