zentrum-lexikographie/eval-de-lemma

eval-de-lemmatise

An evaluation study of lemmatizers on different German language corpora. Branch ba-lk contains the code for the Bachelor's thesis of Lydia Körber.

Usage

  1. Download the datasets.
bash dataset-download.sh
  2. To avoid Python dependency conflicts, each lemmatizer is installed in a separate virtual environment.
for dir in algorithms/*; do
    bash "${dir}/install.sh"
done
  3. If you wish to track the CO2 emissions during the computation, first grant read access to the Intel RAPL power counters:
sudo chmod -R a+r /sys/class/powercap/intel-rapl

Then start the computations with the following command.

bash run.sh

The study was conducted on a Debian GNU/Linux 10 machine with 72 CPUs and 188 GB RAM using Python 3.7.3.

  4. To run the evaluation scripts in Jupyter Notebook, execute the following commands:

cd nbs
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
jupyter lab

Data sets

| Data set (paper) | Format | Era | Genre | Language area | Guidelines | Annotation | Pre-processing |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Empirist 2019 | tab-separated | 21st c | a) dialogue (CMC): chat (social, professional), tweets, WhatsApp chats, blog comments, Wikipedia threads; b) web articles (Web) | DE | link, based on TIGER | manual | Normalized and original tokens used as input. |
| GerManC-GS | XML | 1650 - 1800 (Early Modern German) | drama, humanities, legal texts, letters, narrative prose, newspapers, scientific texts, sermons | DE, AT, CH | link | manual | Normalized and original tokens used as input. Captions and stage directions ignored. |
| NoSta-D | TCF, XML | 14th - 21st c | historical (anselm), chat (unicum), spoken (bematac), learner (falko), literary prose (kafka), newspaper (tueba-dz) | DE | | semi-automatic (TreeTagger) | Normalized and original tokens used as input. |
| RUB 2019, balanced | CoNLL-U | 20th - 21st c | Novelette, movie subtitles, sermon, TED talks, Wikipedia | DE | TIGER with some modifications | manual | UPOS tags are not available and need to be converted from XPOS tags (STTS). |
| TGermaCorp | CoNLL-U | 16th - 21st c | literature, Wikipedia | DE | | semi-automatic (TreeTagger) | |
| UD GSD, v2.10 (TIGER Korpus) | CoNLL-U | 21st c | daily newspaper (Frankfurter Rundschau) | DE | link | manual | |
| UD HDT, v2.10 (Hamburg Treebank) | CoNLL-U | 20th c | IT magazine (Heise) | DE | link | manual | |
| UD PUD, v2.10 | CoNLL-U | 21st c | Wikipedia articles | DE | | manual | |
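
The RUB 2019 entry notes that UPOS tags have to be derived from the STTS XPOS tags. The sketch below illustrates what such a conversion over CoNLL-U input might look like. The mapping is a coarse, partial illustration and an assumption: it is not necessarily the table used in this repository, and the function names `stts_to_upos` and `fill_upos` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: partial STTS -> UPOS mapping chosen for illustration,
# not the conversion table used by the repository.
stts_to_upos() {
    case "$1" in
        NN)                 echo "NOUN" ;;
        NE)                 echo "PROPN" ;;
        ADJA|ADJD)          echo "ADJ" ;;
        ADV)                echo "ADV" ;;
        ART)                echo "DET" ;;
        CARD)               echo "NUM" ;;
        APPR|APPRART|APPO)  echo "ADP" ;;
        KON)                echo "CCONJ" ;;
        VV*)                echo "VERB" ;;
        VA*)                echo "AUX" ;;   # coarse: ignores main-verb uses
        '$.'|'$,'|'$(')     echo "PUNCT" ;;
        *)                  echo "X" ;;     # anything not covered above
    esac
}

# Read CoNLL-U from stdin and overwrite column 4 (UPOS) with a value
# derived from column 5 (XPOS). Comment and blank lines pass through.
fill_upos() {
    local line id form lemma upos xpos rest
    while IFS= read -r line || [[ -n "$line" ]]; do
        if [[ -z "$line" || "$line" == \#* ]]; then
            printf '%s\n' "$line"
            continue
        fi
        IFS=$'\t' read -r id form lemma upos xpos rest <<< "$line"
        printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
            "$id" "$form" "$lemma" "$(stts_to_upos "$xpos")" "$xpos" "$rest"
    done
}

# Possible usage (file names are placeholders):
#   fill_upos < input.conllu > output.conllu
```

Unknown tags fall back to `X` here so that gaps in the mapping stay visible in the output instead of being silently dropped.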

Repository Structure