Repository for NNOSE

This repository accompanies the paper:

NNOSE: Nearest Neighbor Occupational Skill Extraction

Mike Zhang, Rob van der Goot, Min-Yen Kan, and Barbara Plank. To appear at EACL 2024.

Getting Started

Requirements

Clone the repository. If you use conda, create the accompanying environment as follows:

# create the environment
conda env create -f environment.yml

# activate environment
conda activate nnose

# install torch separately
pip3 install torch==1.10.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html

There is a separate environment for generating the UMAP plot.

# create the environment
conda env create -f environment_umap.yml

# activate environment
conda activate nnose_umap

# install torch separately
pip3 install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html

Note: The UMAP plot can only be created after the embeddings have been obtained with run_scripts/get_representations.sh.

All experiments were run with python=3.9 and torch==1.10.1.
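
A quick way to confirm that the active environment matches these versions is sketched below. It is a minimal sketch and not part of the repository: the helper names `base_version`, `matches`, and `check_environment` are hypothetical, and the check only compares version prefixes after dropping a local build suffix such as `+cu113`.

```python
import sys


def base_version(version: str) -> str:
    """Strip a local build suffix such as '+cu113' from a version string."""
    return version.split("+", 1)[0]


def matches(installed: str, expected: str) -> bool:
    """True if `installed` starts with the version components of `expected`."""
    got = base_version(installed).split(".")
    want = expected.split(".")
    return got[: len(want)] == want


def check_environment() -> dict:
    """Report whether Python and torch match the versions named in this README."""
    report = {"python": matches("%d.%d" % sys.version_info[:2], "3.9")}
    try:
        import torch  # only importable once the separate pip3 step has been run

        report["torch"] = matches(str(torch.__version__), "1.10.1")
    except ImportError:
        report["torch"] = False
    return report
```

Calling `check_environment()` inside the `nnose` environment should return `True` for both entries. In the UMAP environment the torch entry is expected to be `False`, since that environment pins torch 1.7.1.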

Getting JobBERTa

JobBERTa will be released when the paper is accepted. To see how JobBERTa is trained, check run_scripts/run_mlm.sh.

The MLM script is derived from HuggingFace and can be found in src/utils/run_mlm.py.

Running Experiments

‼️ It is extremely important that the experiments are run in the right order.

1. Training the Language Models

To fine-tune the models used in the paper, run the following script:

bash run_scripts/run_trainer.sh

2. Obtaining Embeddings + Creating the Datastore

Each script below both extracts the embeddings from the training datasets and builds the datastore. Our experiments use two types of datastore: an in-dataset datastore ({D}) and an 'all' datastore ($\forall$D).

To create the in-dataset datastore:

bash run_scripts/get_representations_dataset.sh

To create the 'all' ($\forall$D) datastore:

bash run_scripts/get_representations.sh
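
To make the difference between the two datastore types concrete, here is a minimal sketch. It assumes a datastore is a list of (token embedding, tag) pairs, with one store per dataset for {D} and a single merged store for $\forall$D. The function names and the toy layout are hypothetical; the repository scripts may store and index the data differently.

```python
from typing import Dict, List, Sequence, Tuple

# One datastore entry: (token embedding, gold tag). This layout is an assumption.
Entry = Tuple[Sequence[float], str]


def build_datastore(examples: List[Entry]) -> Dict[str, list]:
    """Split (embedding, tag) pairs into parallel key and value lists."""
    keys = [list(vec) for vec, _ in examples]
    values = [tag for _, tag in examples]
    return {"keys": keys, "values": values}


def build_both(per_dataset: Dict[str, List[Entry]]):
    """Return the in-dataset datastores ({D}) and the merged 'all' datastore."""
    in_dataset = {name: build_datastore(ex) for name, ex in per_dataset.items()}
    merged = [entry for ex in per_dataset.values() for entry in ex]
    return in_dataset, build_datastore(merged)
```

With this layout, an in-dataset lookup only searches the keys of the dataset the model was trained on, while the 'all' datastore searches the keys of every dataset at once.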

3. Sweeping Hyperparameters

To sweep the NNOSE hyperparameters and find the best-performing number of neighbors k, interpolation weight lambda, and temperature, run:

bash run_scripts/run_inference_sweep.sh
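
The role of the three hyperparameters can be illustrated with a small sketch. It assumes the usual kNN-LM-style formulation, in which the k nearest datastore entries are turned into a tag distribution via a softmax over negative distances scaled by the temperature, and that distribution is mixed with the model's distribution using lambda. The function names, the squared Euclidean distance, and any grid values passed to `sweep_grid` are assumptions made for illustration, not the repository's implementation or the values used in the paper.

```python
import math
from itertools import product


def knn_distribution(query, keys, values, labels, k, temperature):
    """Tag distribution from the k nearest keys (squared Euclidean distance)."""
    nearest = sorted(
        (sum((q - x) ** 2 for q, x in zip(query, key)), tag)
        for key, tag in zip(keys, values)
    )[:k]
    weights = {label: 0.0 for label in labels}
    for dist, tag in nearest:
        # Closer neighbors get larger weight; temperature flattens or sharpens.
        weights[tag] += math.exp(-dist / temperature)
    total = sum(weights.values())
    if total == 0.0:  # all weights underflowed: fall back to a uniform distribution
        return {label: 1.0 / len(labels) for label in labels}
    return {label: w / total for label, w in weights.items()}


def interpolate(p_model, p_knn, lam):
    """Mix model and kNN distributions: lam * p_knn + (1 - lam) * p_model."""
    return {y: lam * p_knn.get(y, 0.0) + (1.0 - lam) * p_model[y] for y in p_model}


def sweep_grid(ks, lambdas, temperatures):
    """Enumerate every (k, lambda, temperature) combination to evaluate."""
    return list(product(ks, lambdas, temperatures))
```

A sweep then amounts to evaluating the interpolated predictions on the development set for each tuple returned by `sweep_grid` and keeping the best-scoring combination.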

4. Running Inference

Two scripts run test/inference and also write the predictions to a separate output file. For test/inference with the fine-tuned models alone, see:

run_scripts/run_test.sh

For test/inference with NNOSE, see:

run_scripts/run_test_index.sh

5. Doing the Analysis

Skill Distribution (Figure 8) + Jaccard Overlap

This analysis does not depend on any previously run experiments. To run it:

python3 src/analysis/skill_distribution.py
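
For reference, the Jaccard overlap between two sets is the size of their intersection divided by the size of their union. The sketch below computes it pairwise over sets of skill spans. That the overlap is measured over skill spans per dataset is an assumption, and the function names are hypothetical rather than taken from the script.

```python
from itertools import combinations


def jaccard(a, b) -> float:
    """Jaccard overlap |A & B| / |A | B|; defined as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(skills_per_dataset: dict) -> dict:
    """Jaccard overlap for every unordered pair of datasets."""
    return {
        (x, y): jaccard(skills_per_dataset[x], skills_per_dataset[y])
        for x, y in combinations(sorted(skills_per_dataset), 2)
    }
```

A value close to 0 means two datasets share few skills, which is the situation where a datastore built from other datasets could add the most new information.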

Long Tail Analysis (Figure 2 and Figure 4)

This analysis requires the output of the models from step (4) above:

Example

python3 src/analysis/get_long_tail.py \
    --train_dir data/skillspan/train.json \
    --prediction_dir results/<name_of_file>
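
The idea behind a long-tail analysis can be sketched as follows: count how often each gold skill span occurs in the training data, then measure how well spans of each frequency are recovered. This is a minimal sketch assuming BIO-style tags and exact string matching of lowercased spans. The function names are hypothetical, and get_long_tail.py may group or score spans differently.

```python
from collections import Counter


def extract_spans(tokens, tags):
    """Collect skill spans from BIO tags as lowercased strings."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B"):
            if current:
                spans.append(" ".join(current))
            current = [token.lower()]
        elif tag.startswith("I") and current:
            current.append(token.lower())
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


def recall_by_train_frequency(train_spans, gold_spans, predicted_spans):
    """Recall of gold spans, grouped by how often each span occurs in training."""
    counts = Counter(train_spans)
    predicted = set(predicted_spans)
    hits, totals = Counter(), Counter()
    for span in gold_spans:
        freq = counts[span]  # 0 for spans never seen during training
        totals[freq] += 1
        hits[freq] += span in predicted
    return {freq: hits[freq] / totals[freq] for freq in totals}
```

Low-frequency groups (including frequency 0) correspond to the long tail, so comparing their recall with and without NNOSE shows where retrieval helps.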

False Positive Analysis (Table 11 + Table 12)

This analysis also requires the output of the models from step (4) above:

python3 src/analysis/skill_distribution.py \
    --prediction_dir results/<name_of_file_predictions> \
    --prediction_dir_knn results/<name_of_file_predictions_knn>
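
Comparing the two prediction files comes down to set operations on false positives, meaning predicted spans that are not in the gold annotation. The sketch below is an illustration under that assumption. The function names are hypothetical, and the script's actual matching criteria (for example exact versus partial span match) are not specified here.

```python
def false_positives(gold, predicted):
    """Spans that were predicted but do not appear in the gold annotation."""
    return set(predicted) - set(gold)


def compare_false_positives(gold, predicted_base, predicted_knn):
    """Split false positives by whether the kNN variant removes, adds, or keeps them."""
    fp_base = false_positives(gold, predicted_base)
    fp_knn = false_positives(gold, predicted_knn)
    return {
        "removed_by_knn": fp_base - fp_knn,
        "introduced_by_knn": fp_knn - fp_base,
        "shared": fp_base & fp_knn,
    }
```

Inspecting the spans in `introduced_by_knn` by hand is useful, since a false positive may be a plausible skill that the annotators missed.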

Cross-dataset Analysis (Table 3)

This analysis requires you to have trained the models. Note that there is also an "all" model, which is the concatenation of all datasets.

This analysis is run manually, using the run_scripts/run_test.sh script:

Example

MODEL="jobbert"     # "roberta" "jobberta"
DATASET="skillspan" # "sayfullina" "green"
TIMESTAMP=$(date +%F_%T)

python3 src/run_inference.py \
  --model_name_or_path "tmp-5e-5/$DATASET/$MODEL/"* \
  --train_file "data/$DATASET/train.json" \
  --validation_file data/sayfullina/dev.json \
  --text_column_name tokens \
  --label_column_name tags_skill \
  --seed 113412 \
  --write_output "results/run_test_$TIMESTAMP/"

Set the --validation_file flag to the dataset you want to evaluate on. In this example, that is sayfullina.
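
Since the cross-dataset table needs one run per train/evaluation pair, the commands can be generated rather than edited by hand. The sketch below only builds the argument lists and does not execute anything. The flags and dataset names are copied from the example above, while the function names and the output directory naming are hypothetical. It does not cover the "all" model, and the checkpoint directory that the shell example resolves with `*` must still be resolved by the caller (for instance with `glob.glob`).

```python
from itertools import product

DATASETS = ["skillspan", "sayfullina", "green"]  # names from the example above


def build_command(model_dir, train_dataset, eval_dataset, out_dir):
    """Return the argument list for one cross-dataset evaluation run."""
    return [
        "python3", "src/run_inference.py",
        "--model_name_or_path", model_dir,
        "--train_file", f"data/{train_dataset}/train.json",
        "--validation_file", f"data/{eval_dataset}/dev.json",
        "--text_column_name", "tokens",
        "--label_column_name", "tags_skill",
        "--seed", "113412",
        "--write_output", out_dir,
    ]


def cross_dataset_commands(model, model_root="tmp-5e-5", out_root="results"):
    """One command per (train dataset, evaluation dataset) pair."""
    commands = []
    for train_ds, eval_ds in product(DATASETS, repeat=2):
        # Parent directory only; the checkpoint folder inside it is not resolved here.
        model_dir = f"{model_root}/{train_ds}/{model}/"
        # Hypothetical naming scheme that keeps each pair's output separate.
        out_dir = f"{out_root}/run_test_{model}_{train_ds}_on_{eval_ds}/"
        commands.append(build_command(model_dir, train_ds, eval_ds, out_dir))
    return commands
```

Each returned list can be passed to `subprocess.run` once the model directory has been replaced by the actual checkpoint path.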

UMAP plot (Figure 3)

Please switch to the UMAP environment described in the Getting Started section. The only difference is the PyTorch version.

This analysis also requires that you have run step (2): Obtaining Embeddings + Creating the Datastore.

Example:

python3 src/analysis/plot_umap.py --output_dir plots/

WARNING: The first run of this script takes a long time (around 45-60 minutes on a good machine).

Questions

If you have any questions, please reach out to <email>.
