
Double-Hard Debias

In this repo, we attempt to reproduce the results reported in Double-Hard Debias: Tailoring Word Embeddings for Gender Bias Mitigation (ACL 2020), as part of the ML Reproducibility Challenge 2020 hosted by Papers with Code.

You can find our report here. Unfortunately, it was rejected by the ReScience journal.

Abstract of the Paper

Word embeddings derived from human-generated corpora inherit strong gender bias which can be further amplified by downstream models. Some commonly adopted debiasing approaches, including the seminal Hard Debias algorithm, apply post-processing procedures that project pre-trained word embeddings into a subspace orthogonal to an inferred gender subspace. We discover that semantic-agnostic corpus regularities such as word frequency captured by the word embeddings negatively impact the performance of these algorithms. We propose a simple but effective technique, Double Hard Debias, which purifies the word embeddings against such corpus regularities prior to inferring and removing the gender subspace. Experiments on three bias mitigation benchmarks show that our approach preserves the distributional semantics of the pre-trained word embeddings while reducing gender bias to a significantly larger degree than prior approaches.
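To make the projection step in the abstract concrete, here is a minimal NumPy sketch of the Hard Debias neutralization it refers to. The definitional pairs and the single-component gender subspace are simplifying assumptions; the full algorithm also equalizes definitional pairs, and the Double-Hard variant adds a frequency-purification step (sketched further below).

import numpy as np

def gender_direction(emb, pairs=(("he", "she"), ("man", "woman"), ("his", "her"))):
    """Estimate a 1-D gender subspace as the top principal component
    of the per-pair-centered differences of definitional pairs."""
    diffs = []
    for a, b in pairs:
        center = (emb[a] + emb[b]) / 2
        diffs.extend([emb[a] - center, emb[b] - center])
    # The stacked differences have zero mean, so the first right-singular
    # vector is the first principal component.
    _, _, vt = np.linalg.svd(np.stack(diffs))
    return vt[0]

def hard_debias(vec, g):
    """Project `vec` onto the subspace orthogonal to gender direction `g`."""
    g = g / np.linalg.norm(g)
    return vec - np.dot(vec, g) * g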

Motivation

Despite their widespread use in natural language processing (NLP) tasks, word embeddings have been criticized for inheriting unintended gender bias from their training corpora. Bolukbasi et al. (2016) highlight that in word2vec embeddings trained on the Google News dataset (Mikolov et al., 2013a), programmer is more closely associated with man and homemaker with woman. Such gender bias also propagates to downstream tasks: studies have shown that coreference resolution systems exhibit gender bias in their predictions due to the use of biased word embeddings (Zhao et al., 2018a; Rudinger et al., 2018).


Usage

Requirements

  • Python >= 3.6.
  • Word Embeddings Benchmarks. Install them by following the instructions here.

Installation

Clone the repo:

git clone https://github.com/hassiahk/Double-Hard-Debias.git

Install the dependencies:

pip install -r requirements.txt

Install the package in develop mode. This is needed if you are just running our notebooks without changing anything:

python setup.py develop

Data

Please download the data below and place it in the data folder.

  • Word Embeddings - You can find the authors' debiased embeddings and ours here.
  • Special Word Lists - You can find them in the data folder.
  • Google Word Analogy - Word Analogy dataset by Google. You can find it here.
  • MSR Word Analogy - Word Analogy dataset by Microsoft Research. You can find it here.

You can find all the external data used in our experiments here.
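For reference, below is a minimal loading and evaluation sketch, assuming a GloVe-style plain-text embedding file. The filename data/glove_debiased.txt is a placeholder for whichever file you downloaded, and the analogy function is the standard 3CosAdd query used with the Google and MSR analogy datasets, not necessarily the exact evaluation code in our notebooks.

import numpy as np

def load_glove(path):
    """Load a whitespace-separated GloVe-style text file into a dict."""
    emb = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            emb[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return emb

def analogy(emb, a, b, c, topn=1):
    """3CosAdd: return the words d maximizing cos(d, b - a + c),
    e.g. analogy(emb, "man", "king", "woman") should rank "queen" first."""
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    words = [w for w in emb if w not in (a, b, c)]
    mat = np.stack([emb[w] / np.linalg.norm(emb[w]) for w in words])
    scores = mat @ target
    return [words[i] for i in np.argsort(-scores)[:topn]]

emb = load_glove("data/glove_debiased.txt")  # placeholder filename
print(analogy(emb, "man", "king", "woman"))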

Double-Hard Debias

You can find the detailed procedure for implementing Double-Hard Debias in GloVe_Double_Hard_Debias.ipynb. (PyPI package coming soon.)

We had to make minor changes, as the authors' code did not include functionality to Double-Hard Debias the original GloVe embeddings and save them to a file.
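As a rough sketch of the core idea behind the notebook (not the authors' exact code): decenter the embedding matrix, project out a dominant frequency-related principal component, then apply the Hard Debias projection from the sketch above. The paper selects which principal component to remove by testing which removal most degrades a KMeans clustering of the most gender-biased words; for illustration, this sketch simply drops the first k components.

import numpy as np

def double_hard_debias(emb_matrix, gender_dir, k=1):
    """Sketch of Double-Hard Debias on a (vocab_size, dim) matrix:
    1) decenter, 2) remove the top-k principal (frequency) components,
    3) remove the gender direction (the Hard Debias step)."""
    centered = emb_matrix - emb_matrix.mean(axis=0)
    # Principal components of the centered embeddings.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    purified = centered.copy()
    for u in vt[:k]:
        # Illustrative choice: drop the first k components. The paper
        # instead picks the component via a clustering test.
        purified = purified - np.outer(purified @ u, u)
    g = gender_dir / np.linalg.norm(gender_dir)
    return purified - np.outer(purified @ g, g)

See the notebook for the full component-selection procedure.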

Reproducibility Results

Our detailed reproducibility results are in the report linked above.
