
sequence-tagging

(image: one sentence from the JNLPBA dataset, visualized with doccano)

setup

pip install -r requirements.txt
python -m spacy download en_core_web_sm
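
quick sanity check that the model was actually downloaded (optional, not part of the original setup):

python -c "import spacy; spacy.load('en_core_web_sm')"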

data

scierc-data

python -c "from util.data_io import download_data; download_data('http://nlp.cs.washington.edu/sciIE/data','sciERC_processed.tar.gz','data',unzip_it=True)"
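
download_data is a helper pulled in via requirements.txt; its implementation is not shown here, but presumably it does something like the following sketch (fetch <base_url>/<file_name> into data_dir, then unpack the tar.gz):

import os
import tarfile
import urllib.request

def download_data(base_url, file_name, data_dir, unzip_it=False):
    # fetch <base_url>/<file_name> into data_dir unless it is already there
    os.makedirs(data_dir, exist_ok=True)
    file_path = os.path.join(data_dir, file_name)
    if not os.path.isfile(file_path):
        urllib.request.urlretrieve(base_url + "/" + file_name, file_path)
    # optionally unpack tar.gz archives next to the download
    if unzip_it and file_name.endswith((".tar.gz", ".tgz")):
        with tarfile.open(file_path, "r:gz") as tar:
            tar.extractall(data_dir)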

JNLPBA

git clone https://github.com/allenai/scibert.git

see scibert/data/ner/JNLPBA
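
assuming the files there use the usual CoNLL-style layout (one "token tag" pair per line, blank lines between sentences), a minimal reader sketch:

def read_conll(path):
    # returns a list of (tokens, tags) pairs, one per sentence
    sentences, tokens, tags = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
            else:
                parts = line.split()
                tokens.append(parts[0])
                tags.append(parts[-1])
    if tokens:
        sentences.append((tokens, tags))
    return sentences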

learning-curve on scierc-data

(figures: learning curves on train and test data)

learning-curves on JNLPBA-data

(figures: learning curves on train and test data)

active learning curves

uncertainty sampling vs. random sampling

  • sequence-tagger: spaCy features + crfsuite (see the sketch below)
  • 5 runs of 10 active-learning "steps" each
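
a minimal sketch of such a tagger, assuming per-token spaCy features fed into sklearn-crfsuite (the exact feature set used here may differ):

import spacy
import sklearn_crfsuite
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

def featurize(tokens):
    # run the spaCy pipeline over pre-tokenized text, emit one feature-dict per token
    doc = Doc(nlp.vocab, words=tokens)
    for _, component in nlp.pipeline:
        doc = component(doc)
    return [{"lower": t.lower_, "pos": t.pos_, "shape": t.shape_,
             "is_title": t.is_title, "is_digit": t.is_digit} for t in doc]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
# X = [featurize(tokens) for tokens, _ in train_sentences]
# y = [tags for _, tags in train_sentences]
# crf.fit(X, y)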

steps of 10% of train-set size

(figure: 0.1 steps)

steps of 1% of train-set size

(figure: 0.01 steps)

result

  • entropy/uncertainty-based sampling does not seem beneficial when the model is weak (too little training data, or too shallow a model?); see the sampling sketch below
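
a sketch of the uncertainty sampling compared above, assuming a sklearn-crfsuite model: score each unlabeled sentence by the mean token-level entropy of the CRF marginals and pick the most uncertain ones:

import math

def mean_token_entropy(marginals):
    # marginals: one {tag: prob} dict per token, as returned by crf.predict_marginals
    entropies = [-sum(p * math.log(p + 1e-12) for p in dist.values())
                 for dist in marginals]
    return sum(entropies) / len(entropies)

def uncertainty_sample(crf, X_unlabeled, k):
    # highest mean entropy = model is least certain -> select for labeling
    scores = [mean_token_entropy(m) for m in crf.predict_marginals(X_unlabeled)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]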

3-fold shuffle split on the JNLPBA-dataset

  • 20% of the train data, evaluated on the test set (which is not split); (figure: 20percent)
  • why is FARM so bad? where is the bug?

sequence tagging with transformers + lightning

setup on HPC

  1. git clone https://github.com/dertilo/transformers.git
  2. cd transformers && git checkout lightning_examples
  3. cd examples && pip install -r requirements.txt
  4. on frontend: OMP_NUM_THREADS=2 wandb init
  5. on frontend: OMP_NUM_THREADS=8 bash download_data.sh
  6. on node: python preprocess.py --model_name_or_path bert-base-multilingual-cased --max_seq_length 128
  7. on node: export PYTHONPATH=~/transformers/examples
  8. on frontend, to download the pretrained model: OMP_NUM_THREADS=8 python3 run_pl_ner.py --data_dir ./ --labels ./labels.txt --model_name_or_path $BERT_MODEL --do_train

train & evaluate

PYTHONPATH=~/transformers/examples WANDB_MODE=dryrun python ~/transformers/examples/token-classification/run_pl_ner.py --data_dir ./ \
--labels ./labels.txt \
--model_name_or_path bert-base-multilingual-cased  \
--output_dir germeval2014 \
--max_seq_length  128 \
--num_train_epochs 3 \
--train_batch_size 32 \
--seed 1 \
--do_train \
--do_predict
  • sync with wandb: OMP_NUM_THREADS=2 wandb sync wandb/dryrun-...
  • results after 3 epochs in ~20 minutes:
TEST RESULTS
{'avg_test_loss': tensor(0.0733),
 'f1': 0.8625160051216388,
 'precision': 0.8529597974042419,
 'recall': 0.8722887665911299,
 'val_loss': tensor(0.0733)}
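
as a quick consistency check, the reported f1 is the harmonic mean of precision and recall:

p, r = 0.8529597974042419, 0.8722887665911299
print(2 * p * r / (p + r))  # ~0.8625, matches the reported f1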
