DataCollection

Collecting data for machine translation training from CommonCrawl is a two-phase process illustrated in the following diagram:

Installation

Hardware requirements and installation instructions can be found in INSTALL.md.

Phase 1: Language annotation, building a meta-data file and monolingual data extraction

The first phase detects the language of each web page contained in the crawl and collects other meta-data about it. A meta-data file is built from this analysis.

The metadata documentation describes phase 1 step-by-step.
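As a rough illustration of what phase 1 produces, the sketch below annotates pages with a language and writes one meta-data record per page. It is not the repository's implementation: the stopword-based `guess_language` is a toy stand-in for a real language identifier, and the record fields (`url`, `language`, `length`) and the function names are hypothetical.

```python
import json

# Toy stopword lists (an assumption for illustration only); a real pipeline
# would use a trained language identifier instead.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to"},
    "de": {"der", "die", "und", "ist", "das"},
    "fr": {"le", "la", "et", "est", "les"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or 'unknown'."""
    tokens = text.lower().split()
    counts = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unknown"


def build_metadata(pages):
    """Yield one JSON line per (url, text) pair.

    The record layout is hypothetical, not the format used by this project.
    """
    for url, text in pages:
        record = {
            "url": url,
            "language": guess_language(text),
            "length": len(text),
        }
        yield json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    pages = [
        ("http://example.com/en/about.html", "The history of the company is long"),
        ("http://example.com/de/about.html", "Der Hund und die Katze"),
    ]
    for line in build_metadata(pages):
        print(line)
```

Keeping one record per line means the meta-data file can be filtered by language without loading it into memory, which is the kind of access the later extraction steps need.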

Monolingual data for language model training can be extracted from the output of this phase. The data for most of the CommonCrawl crawls and many languages can be found on:

Phase 2: Extracting parallel data and optional cleaning

In the second phase, the meta-data collected in phase 1 is used to extract parallel data from CommonCrawl based on URL pattern matching. Phase 2 is documented step-by-step in the baseline documentation.
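The idea behind URL pattern matching can be sketched as follows: replace the language identifier in each URL with a placeholder, then treat URLs that become identical as candidate translations of each other. This is a minimal sketch under stated assumptions, not the code in this repository: it only recognises a language code that forms a whole path segment (such as `/en/`), and the names `strip_language` and `match_urls` are hypothetical.

```python
import re
from collections import defaultdict


def strip_language(url, lang):
    """Replace a path segment equal to `lang` with '*'.

    Assumption: the language code appears as a whole path segment,
    e.g. http://example.com/en/about.html. Real sites use many more patterns.
    """
    pattern = r"(?<=/)" + re.escape(lang) + r"(?=/|$)"
    return re.sub(pattern, "*", url)


def match_urls(urls, source, target):
    """Pair URLs that differ only in their language segment.

    `urls` is an iterable of (url, language) tuples, where the language
    comes from the phase 1 meta-data. Returns a sorted list of
    (source_url, target_url) pairs.
    """
    by_key = defaultdict(dict)
    for url, lang in urls:
        if lang in (source, target):
            key = strip_language(url, lang)
            # Keep the first URL seen for each language under a given key.
            by_key[key].setdefault(lang, url)
    return sorted(
        (found[source], found[target])
        for found in by_key.values()
        if source in found and target in found
    )


if __name__ == "__main__":
    urls = [
        ("http://example.com/en/about.html", "en"),
        ("http://example.com/de/about.html", "de"),
        ("http://example.com/en/only-english.html", "en"),
    ]
    for pair in match_urls(urls, "en", "de"):
        print(pair)
```

Pages without a counterpart in the other language, like the third URL in the usage example, are simply left unpaired. Pairs found this way are only candidates, which is why an optional cleaning step follows extraction.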

For the language pairs en↔de, en↔fr, en↔es, en↔it, en↔pt, en↔nl and en↔ru, matched URL data for CommonCrawl 2015_32 is available for data extraction in release 0.1.0.
