quancore/toxic-comment

This repo includes a single model (no ensemble) for the toxic comment classification challenge. The public LB score is ~0.93.

The backbone model from Hugging Face is fine-tuned in the notebook pytorch_model.ipynb.

The model was trained using Google Cloud: https://www.kaggle.com/quanncore/xlm-roberta-large

The notebook for model inference on the toxic classification test data: https://www.kaggle.com/quanncore/pytorch-tpu-inference

Presentation for the project: https://docs.google.com/presentation/d/1Xq4q_AWzQQY08NwAKbgzBaVmv4a_iquNufSz5TTqUys/edit?usp=sharing

Several methods applied in the model:

  • Translated training data (from English to other languages) is used.
  • Further fine-tuning is performed on the validation dataset.
  • Data augmentation is applied to the training set.
  • Class balancing is performed in a distributed environment.
  • TPU training is supported.
  • A multisample dropout network is used in the classifier head.
  • The transformer hidden states are aggregated by attention weights instead of using only the last layer.
  • Different learning rates are used for the backbone model and the classifier head.
  • Stochastic Weight Averaging (SWA) is used.
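Two of the points above (the multisample dropout head and the attention-weighted aggregation of hidden states) can be combined in one classifier head. The sketch below is illustrative, not the repo's actual code: it assumes PyTorch, per-layer hidden states such as those returned by a Hugging Face model with output_hidden_states=True, and made-up hyperparameter values.

```python
import torch
import torch.nn as nn

class AttentionPoolHead(nn.Module):
    """Aggregate all transformer layers with learned attention weights,
    then classify with multisample dropout (hypothetical sketch)."""

    def __init__(self, hidden_size, num_layers, num_labels, n_dropout=5, p=0.3):
        super().__init__()
        # One learnable weight per transformer layer.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        # Several independent dropout masks for multisample dropout.
        self.dropouts = nn.ModuleList([nn.Dropout(p) for _ in range(n_dropout)])
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):
        # hidden_states: list of num_layers tensors, each (batch, seq, hidden)
        stacked = torch.stack(list(hidden_states), dim=0)      # (L, B, S, H)
        w = torch.softmax(self.layer_weights, dim=0)           # (L,)
        pooled = (w[:, None, None, None] * stacked).sum(dim=0) # (B, S, H)
        cls = pooled[:, 0]                                     # first-token embedding
        # Multisample dropout: average logits over several dropout masks.
        logits = torch.stack([self.classifier(d(cls)) for d in self.dropouts])
        return logits.mean(dim=0)                              # (B, num_labels)

head = AttentionPoolHead(hidden_size=16, num_layers=4, num_labels=1)
out = head([torch.randn(2, 5, 16) for _ in range(4)])  # out.shape == (2, 1)
```

The differential learning rates from the list above would then typically be set with two optimizer parameter groups (a small rate for the backbone, a larger one for this head), and SWA would wrap the full model; the exact values used in the notebook are not shown here.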

Steps to reproduce training:

  • Create a credential folder and put the GitHub and Kaggle credential files (JSON format) in it.
  • Create a data folder and put the toxic dataset in it. You can change the data paths in the training notebook.
  • Create a model folder for storing different trained model versions.
  • Create an output folder for storing submission files, if needed.
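The folder setup above can be sketched as follows (the folder names come from the steps; the exact credential file names are assumptions, not prescribed by the repo):

```shell
# Create the folders the training notebook expects.
mkdir -p credential data model output

# Place the credential JSON files in credential/, e.g. (names assumed):
#   credential/github.json
#   credential/kaggle.json
# Place the toxic dataset files in data/, matching the paths in the notebook.
```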
