SENTENCES.md

Sentences on Common Voice

Because Common Voice is a dataset of read speech, sentences are our currency. You can help by adding new sentences for other contributors to read, helping with bulk sentence extractions, or reporting problematic sentences.

In a few words

📝 Sentence Collector is the sentence-writing part of the Common Voice website. Before contributors can record their voices, the Common Voice project needs sentences for them to read. This is a good place for newcomers to the project to start.

📘 Contributors who want to upload sentences in bulk, for example from books, should check out the Bulk Submission guidelines.

🖥️ For automatic extraction, the Sentence Extractor pulls sentences from data sources such as Wikipedia, Wikisource or raw files.

Sentence Collector

The Sentence Collector is the "write" section of Common Voice. You can either:

  • Add sentences for your language
  • Validate sentences that other contributors have added

Each sentence needs at least two upvotes from human validators to be considered valid.
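The two-upvote rule can be sketched in a few lines of Python. This is only an illustration of the threshold described above, not the Sentence Collector's actual code: the `Sentence` class, the `is_valid` function and the `REQUIRED_UPVOTES` constant are hypothetical names chosen for this example.

```python
from dataclasses import dataclass

# From the text above: a sentence needs at least two upvotes.
REQUIRED_UPVOTES = 2


@dataclass
class Sentence:
    """Hypothetical record for a submitted sentence and its review votes."""
    text: str
    upvotes: int = 0


def is_valid(sentence: Sentence, required: int = REQUIRED_UPVOTES) -> bool:
    """Return True once the sentence has collected enough upvotes."""
    return sentence.upvotes >= required


if __name__ == "__main__":
    pending = Sentence("The weather is nice today.", upvotes=1)
    approved = Sentence("She reads a book every week.", upvotes=2)
    for s in (pending, approved):
        print(s.text, "->", "valid" if is_valid(s) else "needs more votes")
```

How downvotes or conflicting votes are handled is not covered here, since the text above only states the upvote requirement.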

Automatic extraction

The Sentence Extractor is a tool that can scrape public-domain data sources for sentences. Several sources are integrated into it, such as Wikipedia and Wikisource. Please see this post for detailed guidance on how to use the Sentence Extractor.
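To give a rough idea of what rule-based extraction from a raw text file involves, here is a minimal Python sketch. It is not the Sentence Extractor itself and does not reflect its real rules or configuration: the function name `extract_sentences`, the word-count limits and the "no digits" rule are all assumptions made up for illustration.

```python
import re


def extract_sentences(raw_text: str, min_words: int = 3, max_words: int = 14) -> list[str]:
    """Split raw text into candidate sentences and keep those passing simple rules.

    The rules here (word-count bounds, no digits) are hypothetical examples.
    """
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    candidates = re.split(r"(?<=[.!?])\s+", raw_text.strip())

    kept = []
    for candidate in candidates:
        sentence = candidate.strip()
        word_count = len(sentence.split())
        # Hypothetical rule: drop sentences that are too short or too long.
        if not (min_words <= word_count <= max_words):
            continue
        # Hypothetical rule: drop sentences containing digits, which are
        # ambiguous to read aloud.
        if any(ch.isdigit() for ch in sentence):
            continue
        kept.append(sentence)
    return kept


if __name__ == "__main__":
    sample = (
        "The cat sat on the mat. In 1999 it rained a lot. "
        "Hi! Birds sing in the early morning."
    )
    for line in extract_sentences(sample):
        print(line)
```

Real sources need language-specific rules and careful handling of abbreviations, so treat this only as a starting point for understanding the idea.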

Correcting existing data

Some submission methods skip the automated cleanup, validation and rule checks, and the methods are not unified. As a result, there is a process for removing existing data that needs to be discarded.

If you notice sentences that need to be deleted, you can create a migration and submit it as a Pull Request. Alternatively, you can open an issue in this repository and we will take care of it.