Reichsanzeiger NLP

This is a work in progress. The goal is to create an NLP ground-truth corpus based on the OCR ground-truth data for the historical newspaper Deutscher Reichsanzeiger und Preußischer Staatsanzeiger (1819-1945). The newspaper was scanned and OCR-ed at UB Mannheim.

Ongoing work

✅ Convert the unprocessed text lines from Reichsanzeiger PAGE XML files to separate lines in TXT files [via blatt to_txt]. See data/text_raw/.
✅ Remove hyphens and line breaks from the Reichsanzeiger text lines and save the result as plain text in TXT files [via blatt to_txt]. See data/text_unhyphenated/.
✅ Split the unhyphenated plain text (no line breaks) into sentences and save them in TSV files, one sentence per line [via blatt to_tsv]. See data/sentences_raw/.
✅ Correct sentence splitting manually and remove "noisy data" (e.g., tables). See data/sentences_checked/.
✅ Import plain text (one sentence per line) to INCEpTION
✅ Create the annotation guidelines
✅ Create a tagset and annotation layer in INCEpTION according to the annotation guidelines. See inception/tagsets/ and inception/layers.
✅ Annotate plain text according to the annotation guidelines
✅ Export the annotations in INCEpTION formats (e.g., UIMA CAS XMI). See data/.
✅ Create a converter from XMI to IOB format (cas2iob) and convert the XMI files into IOB files
⏳ Curate the annotations from two annotators
🔜 Train baseline models for NER/NEL
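
The hyphen-removal step in the list above can be illustrated with a minimal sketch. This is not the code behind blatt to_txt. The function name unhyphenate, the set of hyphen characters, and the lowercase heuristic are all assumptions made for illustration.

```python
# Hypothetical helper, not the blatt implementation.
# Assumed hyphen characters: ASCII hyphen and the double oblique hyphen.
HYPHENS = ("-", "⸗")


def unhyphenate(lines):
    """Join text lines into one string, merging words split across line breaks."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if parts and parts[-1].endswith(HYPHENS) and line[0].islower():
            # Assumed heuristic: a lowercase continuation means the hyphen
            # was only a line-break hyphen, so drop it and glue the lines.
            parts[-1] = parts[-1][:-1] + line
        else:
            parts.append(line)
    return " ".join(parts)


print(unhyphenate(["Die Bekannt-", "machung wurde ver-", "öffentlicht."]))
```

The heuristic is deliberately simple. It would wrongly merge constructions such as "Ein- und Ausfuhr" when "Ein-" ends a line, and it keeps a space after a hyphen that is followed by a capitalized word, so real data needs manual checking or a more careful rule.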

Annotation Software: INCEpTION

We tested INCEpTION, neat and MedTator. We chose INCEpTION as the most advanced of the three.

When we annotate old German plain text in INCEpTION or MedTator and export the annotations in IOB format, the tokenization is often incorrect. In these cases neat can be used to correct the tokenization.

If we import the plain text into INCEpTION with one sentence per line instead of as unsegmented plain text, the tokenization in the exported IOB annotations is of decent quality.
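
To show why tokenization matters for the IOB export, here is a minimal sketch that turns character-offset span annotations into IOB tags for one whitespace-tokenized sentence. It is not the cas2iob converter. The function spans_to_iob, the labels PER and LOC, and the example sentence are hypothetical.

```python
import re


def spans_to_iob(sentence, spans):
    """Return (token, tag) pairs; spans are (begin, end, label) character offsets."""
    rows = []
    for match in re.finditer(r"\S+", sentence):  # naive whitespace tokenization
        tag = "O"
        for begin, end, label in spans:
            if match.start() >= begin and match.end() <= end:
                # The first token of a span gets B-, the following ones I-.
                prefix = "B" if match.start() == begin else "I"
                tag = f"{prefix}-{label}"
                break
        rows.append((match.group(), tag))
    return rows


sentence = "Karl Müller wohnt in Berlin"
spans = [(0, 11, "PER"), (21, 27, "LOC")]  # hypothetical annotations
for token, tag in spans_to_iob(sentence, spans):
    print(f"{token}\t{tag}")
```

In this sketch a token is tagged only if it lies fully inside a span. A token such as "Berlin." with attached punctuation would extend past the annotated span and be tagged O, which is the kind of tokenization error described above.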

Annotation Guidelines

We decided to develop the annotation guidelines iteratively, based on existing annotation guidelines for historical German texts and on an analysis of sample pages from the Reichsanzeiger.

Related work

HIPE competition on historical texts – Identifying Historical People, Places and other Entities

CLEF-HIPE-2020 [datasets] [guidelines]
HIPE-2022 [tasks & data] [paper]

Existing NER/NEL corpora for historical German

Dataset	Text type	Century	Project	Annotation Guidelines	Annotation Tool	Tasks	Tagset
AjMC	Commentaries	XIX	Ajax MultiCommentary	Zenodo	INCEpTION	NER, NEL	pers, work, loc, object, date, scope
HIPE-2020	Newspaper	mid XIX - mid XX	CLEF-HIPE-2020	Zenodo	INCEpTION	NER, NEL	pers, org, prod, time, loc
Newseye	Newspaper	mid XIX - mid XX	Newseye	Zenodo	Transkribus	NER, NEL	PER, LOC, ORG, HumanProd
SoNAR	Newspaper	mid XIX - mid XX	SoNAR	Zenodo	neat	NER, NEL	PER, LOC, ORG
