DataCollection

Collecting data for machine translation training from CommonCrawl is a two-phase process illustrated in the following diagram:

Phase 1: Language annotation, building a meta-data database and monolingual data extraction

The first phase detects the language of each web page in the crawl and collects other meta-data. This data is stored in a database that can be queried through a RESTful web API.

The metadata documentation describes phase 1 step-by-step.

In this phase, monolingual data for language model training can also be extracted. The data for some of the CommonCrawl crawls and some languages can be found on:

For more details on the monolingual data see ModernMT Deliverable 2.1.

Phase 2: Extracting parallel data and optional cleaning

In the second phase, the meta-data collected in phase 1 is used to extract parallel data from CommonCrawl based on URL pattern matching. Phase 2 is documented step-by-step in the baseline documentation.
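To illustrate the general idea behind URL pattern matching, here is a minimal sketch. It is not the code used in this repository: the function names, the placeholder character and the single "standalone language code" pattern are all assumptions made for illustration. The sketch treats two pages as a candidate pair when their URLs become identical once the language identifier is replaced by a placeholder.

```python
import re
from collections import defaultdict


def language_neutral_key(url, lang):
    """Replace standalone occurrences of a language code in a URL with '*'.

    Hypothetical helper: a real pipeline would likely use a larger set of
    patterns (language names, locale codes, and so on).
    """
    pattern = re.compile(
        rf"(?<![A-Za-z]){re.escape(lang)}(?![A-Za-z])", re.IGNORECASE
    )
    return pattern.sub("*", url)


def match_url_pairs(pages, source_lang, target_lang):
    """Return sorted (source_url, target_url) candidate pairs.

    `pages` is an iterable of (url, detected_language) tuples, standing in
    for the meta-data produced in phase 1.
    """
    pages = list(pages)  # allow generators, since we iterate twice

    # Index source-language pages by their language-neutral URL.
    by_key = defaultdict(list)
    for url, lang in pages:
        if lang == source_lang:
            by_key[language_neutral_key(url, source_lang)].append(url)

    # Look up each target-language page under the same key.
    pairs = []
    for url, lang in pages:
        if lang == target_lang:
            key = language_neutral_key(url, target_lang)
            for source_url in by_key.get(key, []):
                pairs.append((source_url, url))
    return sorted(pairs)


if __name__ == "__main__":
    example_pages = [
        ("http://example.com/en/about.html", "en"),
        ("http://example.com/it/about.html", "it"),
        ("http://example.com/it/contatti.html", "it"),
    ]
    for source_url, target_url in match_url_pairs(example_pages, "en", "it"):
        print(source_url, "<->", target_url)
```

Pairs found this way are only candidates: matching URLs do not guarantee that the page contents are translations of each other, which is one reason a later cleaning step can be useful.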

For the language pairs en↔it and en↔fr, matched URL data is available for quick data extraction in release 0.1.0.

