
Double-Hard Debias

In this repo, we attempt to reproduce the results reported in Double-Hard Debias: Tailoring Word Embeddings for Gender Bias Mitigation (ACL 2020), as part of the ML Reproducibility Challenge 2020 hosted by Papers with Code.

You can find our report here. Unfortunately, it was rejected by the ReScience journal.

Abstract of the Paper

Word embeddings derived from human-generated corpora inherit strong gender bias which can be further amplified by downstream models. Some commonly adopted debiasing approaches, including the seminal Hard Debias algorithm, apply post-processing procedures that project pre-trained word embeddings into a subspace orthogonal to an inferred gender subspace. We discover that semantic-agnostic corpus regularities such as word frequency captured by the word embeddings negatively impact the performance of these algorithms. We propose a simple but effective technique, Double Hard Debias, which purifies the word embeddings against such corpus regularities prior to inferring and removing the gender subspace. Experiments on three bias mitigation benchmarks show that our approach preserves the distributional semantics of the pre-trained word embeddings while reducing gender bias to a significantly larger degree than prior approaches.
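To make the projection step in the abstract concrete, here is a minimal NumPy sketch of the Hard Debias neutralization it refers to. The definitional pairs and the single-component gender subspace are simplifying assumptions; the full algorithm also equalizes definitional pairs, and the Double-Hard variant adds a frequency-purification step (sketched further below).

import numpy as np

def gender_direction(emb, pairs=(("he", "she"), ("man", "woman"), ("his", "her"))):
    """Estimate a 1-D gender subspace as the top principal component
    of the per-pair-centered differences of definitional pairs."""
    diffs = []
    for a, b in pairs:
        center = (emb[a] + emb[b]) / 2
        diffs.extend([emb[a] - center, emb[b] - center])
    # The stacked differences have zero mean, so the first right-singular
    # vector is the first principal component.
    _, _, vt = np.linalg.svd(np.stack(diffs))
    return vt[0]

def hard_debias(vec, g):
    """Project `vec` onto the subspace orthogonal to gender direction `g`."""
    g = g / np.linalg.norm(g)
    return vec - np.dot(vec, g) * g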

Motivation

Despite their widespread use in natural language processing (NLP) tasks, word embeddings have been criticized for inheriting unintended gender bias from their training corpora. Bolukbasi et al. (2016) highlight that in word2vec embeddings trained on the Google News dataset (Mikolov et al., 2013a), programmer is more closely associated with man and homemaker with woman. Such gender bias also propagates to downstream tasks: studies have shown that coreference resolution systems exhibit gender bias in their predictions due to the use of biased word embeddings (Zhao et al., 2018a; Rudinger et al., 2018).


Usage

Requirements

  • Python >= 3.6.
  • Word Embeddings Benchmarks. Install them by following the instructions here.

Installation

Clone the repo:

git clone https://github.com/hassiahk/Double-Hard-Debias.git

Install the dependencies:

pip install -r requirements.txt

Install the package in develop mode. This is needed if you are just running our notebooks without changing anything:

python setup.py develop

Data

Please download the data below and place it in the data folder.

  • Word Embeddings - You can find the authors' debiased embeddings and ours here.
  • Special Word Lists - You can find them in the data folder.
  • Google Word Analogy - Word Analogy dataset by Google. You can find it here.
  • MSR Word Analogy - Word Analogy dataset by Microsoft Research. You can find it here.

You can find all the external data used in our experiments here.
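For reference, below is a minimal loading and evaluation sketch, assuming a GloVe-style plain-text embedding file. The filename data/glove_debiased.txt is a placeholder for whichever file you downloaded, and the analogy function is the standard 3CosAdd query used with the Google and MSR analogy datasets, not necessarily the exact evaluation code in our notebooks.

import numpy as np

def load_glove(path):
    """Load a whitespace-separated GloVe-style text file into a dict."""
    emb = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            emb[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return emb

def analogy(emb, a, b, c, topn=1):
    """3CosAdd: return the words d maximizing cos(d, b - a + c),
    e.g. analogy(emb, "man", "king", "woman") should rank "queen" first."""
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    words = [w for w in emb if w not in (a, b, c)]
    mat = np.stack([emb[w] / np.linalg.norm(emb[w]) for w in words])
    scores = mat @ target
    return [words[i] for i in np.argsort(-scores)[:topn]]

emb = load_glove("data/glove_debiased.txt")  # placeholder filename
print(analogy(emb, "man", "king", "woman"))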

Double-Hard Debias

You can find the detailed procedure for implementing Double-Hard Debias in GloVe_Double_Hard_Debias.ipynb. (PyPI package coming soon.)

We had to make minor changes, as the authors' code did not include functionality to Double-Hard Debias the original GloVe embeddings and save them to a file.
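As a rough sketch of the core idea behind the notebook (not the authors' exact code): decenter the embedding matrix, project out a dominant frequency-related principal component, then apply the Hard Debias projection from the sketch above. The paper selects which principal component to remove by testing which removal most degrades a KMeans clustering of the most gender-biased words; for illustration, this sketch simply drops the first k components.

import numpy as np

def double_hard_debias(emb_matrix, gender_dir, k=1):
    """Sketch of Double-Hard Debias on a (vocab_size, dim) matrix:
    1) decenter, 2) remove the top-k principal (frequency) components,
    3) remove the gender direction (the Hard Debias step)."""
    centered = emb_matrix - emb_matrix.mean(axis=0)
    # Principal components of the centered embeddings.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    purified = centered.copy()
    for u in vt[:k]:
        # Illustrative choice: drop the first k components. The paper
        # instead picks the component via a clustering test.
        purified = purified - np.outer(purified @ u, u)
    g = gender_dir / np.linalg.norm(gender_dir)
    return purified - np.outer(purified @ g, g)

See the notebook for the full component-selection procedure.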

Reproducibility Results

Our detailed reproducibility results are in the report linked above.
