TatianaShavrina/taiga_site

Taiga corpus

Welcome to the taiga site repository!

Here, as well as on our website, you can explore our documentation, leave feedback, open issues, and create pull requests.

About the project

Taiga corpus is an ambitious project that aims to become the largest fully available web corpus built from open text sources. Taiga corpus is:

  • open source
  • big - about 6 billion words so far
  • organized into datasets suited to different machine learning tasks
  • made by linguists experienced in text crawling, parsing, and filtering
  • rich in meta-information
  • POS-tagged and syntactically annotated in Universal Dependencies
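
To make the last point concrete, here is a minimal sketch of reading Universal Dependencies annotation in Python. It assumes the annotation is stored in CoNLL-U, the standard ten-column UD exchange format; this README does not state the actual file layout of the corpus. The sample sentence, the `FIELDS` tuple, and the `parse_conllu` helper are hypothetical illustrations, not part of any Taiga tooling.

```python
from collections import Counter

# The ten tab-separated columns defined by the CoNLL-U format.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

# Hypothetical sample sentence, NOT taken from the Taiga corpus.
SAMPLE = """\
# sent_id = demo-1
# text = Мама мыла раму.
1\tМама\tмама\tNOUN\t_\tAnimacy=Anim|Case=Nom|Gender=Fem|Number=Sing\t2\tnsubj\t_\t_
2\tмыла\tмыть\tVERB\t_\tAspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act\t0\troot\t_\t_
3\tраму\tрама\tNOUN\t_\tAnimacy=Inan|Case=Acc|Gender=Fem|Number=Sing\t2\tobj\t_\tSpaceAfter=No
4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conllu(SAMPLE)

# Count universal POS tags across all parsed tokens.
upos_counts = Counter(tok["upos"] for sent in sentences for tok in sent)

# The syntactic root is the token whose head is 0.
root_form = next(tok["form"] for tok in sentences[0] if tok["head"] == "0")

print(upos_counts)
print(root_form)
```

The same loop works on a file by passing its contents to `parse_conllu`; for very large files, a streaming variant that yields one sentence at a time would be a better fit.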

Our motivation

A wisely constructed web corpus has far more potential applications than it is usually credited with. The “web as corpus” paradigm has recently found a natural continuation in the formulation “web as training set”. Open-source websites offer ample opportunities for NLP developers and computational linguists, who nevertheless have to gather all the data themselves and repeat the same cleaning and de-duplication steps, because traditional web corpora provide only a search interface and give no access to the full data. The "Taiga" corpus project brings together the needs of developers, machine learning practitioners, and computational linguists: it is a web corpus for large-scale linguistic data analysis and for modeling real-world NLP and NLU systems. Its main aim is to influence the culture of corpus research on the Russian language and to reflect the paradigm shift in linguistic methodology.

Project creators

Under the inspiring supervision of Olga Lyashevskaya

References:

  1. Shavrina T., Shapovalova O. (2017) To the Methodology of Corpus Construction for Machine Learning: «Taiga» Syntax Tree Corpus and Parser. In Proc. of the international conference "CORPORA2017", Saint Petersburg.
  2. Shavrina T. (2018) Differential Approach to Webcorpus Construction. In Dialogue, Russian International Conference on Computational Linguistics, RSUH, Moscow.
