
Jigsaw - Multilingual Toxic Comment Classification


  • Built a multilingual text classification model, using data provided by Google Jigsaw, to predict the probability that a comment is toxic.

  • The data contains 435,775 text comments in 7 different languages.

  • An RNN model was used as a baseline. The BERT-Multilingual-base and XLM-RoBERTa models were then fine-tuned to improve on it.

  • The best results were obtained with the fine-tuned XLM-RoBERTa model, which achieved an Accuracy of 96.24% and an ROC-AUC Score of 93.92%.

Data

It only takes one toxic comment to sour an online discussion. Toxicity is defined as anything rude, disrespectful, or otherwise likely to make someone leave a discussion. If such toxic contributions can be identified, online discussions can be made safer and more collaborative.

The goal is to predict the probability that a comment is toxic.

Columns in the dataset:

  • id - identifier within each file.
  • comment_text - the text of the comment to be classified.
  • lang - the language of the comment.
  • toxic - whether the comment is toxic (the binary target).

The comments span multiple non-English languages and come from either Civil Comments or Wikipedia talk page edits.

The dataset can be downloaded from the Jigsaw Multilingual Toxic Comment Classification competition on Kaggle.
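
The snippet below is a minimal sketch of loading and inspecting the data with pandas; the filename train.csv is an assumption and should be replaced with the actual file from the download.

    import pandas as pd

    # "train.csv" is a hypothetical filename for the downloaded data.
    df = pd.read_csv("train.csv")

    print(df.columns.tolist())        # ['id', 'comment_text', 'lang', 'toxic']
    print(df["lang"].value_counts())  # comment counts per language
    print(df["toxic"].mean())         # fraction of comments labelled toxic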

Experiments:

RNN:

  • A baseline was created using an RNN model with an embedding layer of size 64. Training with the Adam optimizer (learning rate 0.001) for 5 epochs yielded an Accuracy of 83.68% and an ROC-AUC Score of 55.72%.
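
The following is a minimal PyTorch sketch of such a baseline. Only the embedding size (64), optimizer, and learning rate are stated above; the vocabulary size, hidden size, and single-layer architecture are assumptions.

    import torch
    import torch.nn as nn

    class ToxicRNN(nn.Module):
        def __init__(self, vocab_size, embed_dim=64, hidden_dim=64):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)  # embedding layer of size 64
            self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, 1)                    # single logit: P(toxic)

        def forward(self, token_ids):
            embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
            _, hidden = self.rnn(embedded)         # hidden: (1, batch, hidden_dim)
            return self.fc(hidden.squeeze(0))      # (batch, 1) raw logit

    model = ToxicRNN(vocab_size=50_000)  # vocab_size is a placeholder
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)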

BERT-Multilingual-base:

  • The BERT-Multilingual-base model was fine-tuned on the data, with a hidden layer of 1024 neurons added on top. Training with the Adam optimizer (learning rate 0.001, weight decay 1e-6) for 10 epochs yielded an Accuracy of 93.92% and an ROC-AUC Score of 89.55%.
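
A minimal sketch of this setup is shown below. The checkpoint name bert-base-multilingual-cased and the placement of the 1024-neuron layer between the pooled output and the classification logit are assumptions.

    import torch
    import torch.nn as nn
    from transformers import AutoModel

    class BertToxicClassifier(nn.Module):
        def __init__(self):
            super().__init__()
            # Checkpoint name is an assumption (the cased variant of multilingual BERT).
            self.bert = AutoModel.from_pretrained("bert-base-multilingual-cased")
            self.hidden = nn.Linear(self.bert.config.hidden_size, 1024)  # added 1024-neuron layer
            self.out = nn.Linear(1024, 1)                                # single toxicity logit

        def forward(self, input_ids, attention_mask):
            pooled = self.bert(input_ids=input_ids,
                               attention_mask=attention_mask).pooler_output
            return self.out(torch.relu(self.hidden(pooled)))

    model = BertToxicClassifier()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-6)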

XLM RoBERTa:

  • The XLM-RoBERTa model was fine-tuned on the data. Training with the AdamW optimizer (learning rate 1e-5, weight decay 1e-5) for 7 epochs yielded an Accuracy of 96.24% and an ROC-AUC Score of 93.92%.
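
A sketch of this setup follows; the checkpoint name xlm-roberta-base and the use of the stock sequence-classification head are assumptions (the repo may use the large variant or a custom head).

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Checkpoint name is an assumption; xlm-roberta-large is equally plausible.
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "xlm-roberta-base", num_labels=1)  # one logit: P(toxic)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-5)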

For all the models that were fine-tuned:

  • A batch size of 64 was used for training.
  • Binary Cross-Entropy was used as the loss function.
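
Put together, one epoch of this shared training setup looks roughly like the sketch below; train_dataset, model, and optimizer stand in for the pieces defined above, and BCEWithLogitsLoss is the numerically stable logit form of binary cross-entropy.

    import torch.nn as nn
    from torch.utils.data import DataLoader

    loader = DataLoader(train_dataset, batch_size=64, shuffle=True)  # batch size 64
    criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on raw logits

    model.train()
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        logits = model(input_ids, attention_mask).squeeze(-1)
        loss = criterion(logits, labels.float())
        loss.backward()
        optimizer.step()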

Results:

The best results were obtained using the fine-tuned XLM-RoBERTa model, which was used to generate the final predictions. It achieved an Accuracy of 96.24% and an ROC-AUC Score of 93.92%.

The results from all the models have been summarized below:

Model                                 Accuracy (%)   ROC-AUC Score (%)
RNN                                   83.68          55.72
BERT-Multilingual-base (fine-tuned)   93.92          89.55
XLM-RoBERTa (fine-tuned)              96.24          93.92
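
For reference, both metrics can be computed with scikit-learn as in the sketch below (assumed evaluation code; y_true holds 0/1 labels and y_prob the predicted toxicity probabilities, with toy values here).

    from sklearn.metrics import accuracy_score, roc_auc_score

    y_true = [0, 1, 1, 0]              # toy ground-truth labels
    y_prob = [0.10, 0.80, 0.65, 0.30]  # toy predicted probabilities

    y_pred = [int(p >= 0.5) for p in y_prob]  # threshold at 0.5 for accuracy
    print(accuracy_score(y_true, y_pred))     # fraction of correct predictions
    print(roc_auc_score(y_true, y_prob))      # threshold-free ranking metric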

Run Locally

  1. Install required libraries:
      pip install -r requirements.txt
  2. Baseline model:
      python toxic-baseline-rnn.py
  3. Fine-tune models:
      python toxic-bertm-base.py
      python toxic-xlm-roberta.py

License: MIT

Author: @awinml

Feedback

If you have any feedback, please reach out to me on LinkedIn.
