Here you can explore information about our corpus sources and download them.
Genre | Tokens, millions | Share, % |
---|---|---|
News | 92 | 1.5 |
Literary texts | 4605 | 76 |
Special datasets | 2.5 | 0.5 |
Social media | 80 | 1.5 |
Subtitles | 101 | 1.5 |
Poems | 1130 | 19 |
- 'textrubric' – genre of the poem
- 'textid' – unique ID
- 'textname' – poem title
- 'author' – author(s)
- 'authortexts' – number of poems written by the author
- 'authorreaders' – number of visitors who read the poem
- 'date' – date of publication
- 'time' – time of publication
- 'source' – reference to the original source (sometimes unavailable)
![alt text]({{ site.baseurl }}/assets/images/stihi_ru_rubrics.png "corpus segments")
Click here for more info.
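
The poem metadata fields listed above map naturally onto a small record type. The sketch below assumes, purely for illustration, that the metadata is available as tab-separated text with a header row using those field names; the actual distribution format may differ. `PoemMeta`, `read_poem_meta`, and the sample row are hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class PoemMeta:
    textid: str
    textname: str
    textrubric: str
    author: str
    authortexts: int
    authorreaders: int
    date: str   # kept as text: the date format is not specified here
    time: str
    source: str  # may be empty, since the source is sometimes unavailable


def read_poem_meta(stream):
    """Yield PoemMeta records from a tab-separated stream with a header row."""
    reader = csv.DictReader(stream, delimiter="\t")
    for row in reader:
        yield PoemMeta(
            textid=row["textid"],
            textname=row["textname"],
            textrubric=row["textrubric"],
            author=row["author"],
            # Assumption: counts are plain integers; empty cells become 0.
            authortexts=int(row["authortexts"] or 0),
            authorreaders=int(row["authorreaders"] or 0),
            date=row["date"],
            time=row["time"],
            source=row.get("source") or "",
        )


# Invented sample data, not taken from the corpus.
SAMPLE = (
    "textid\ttextname\ttextrubric\tauthor\tauthortexts\tauthorreaders"
    "\tdate\ttime\tsource\n"
    "p1\tExample poem\tlyrics\tExample Author\t12\t340\t2015-01-01\t12:00\t\n"
)

poems = list(read_poem_meta(io.StringIO(SAMPLE)))
print(poems[0].textname, poems[0].authortexts)
```

Keeping `date` and `time` as strings avoids guessing a format; convert them only after checking real values.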
- 'textrubric' – text genre
- 'textid' – unique ID
- 'textname' – title
- 'author' – author(s)
- 'authortexts' – number of texts written by the author
- 'authorreaders' – number of visitors who read the text
- 'date' – date of publication
- 'time' – time of publication
- 'source' – reference to the original source (sometimes unavailable)
![alt text]({{ site.baseurl }}/assets/images/proza_ru_textrubric.png "corpus segments")
Click here for more info.
- 'textid' – unique ID
- 'textname' – article title
- 'textrubric' – article category
- 'date' – date of publication
- 'time' – time of publication
- 'tags' – article tags
- 'source' – reference to the original source (sometimes unavailable)
![alt text]({{ site.baseurl }}/assets/images/lenta_rubrics.png "corpus segments")
Click here for more info.
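
As a rough illustration of how the 'textrubric' field of the news metadata above can be used, the following sketch counts articles per category. It assumes a tab-separated metadata table with a header row, which is an assumption about the file layout rather than a documented format; `rubric_counts` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def rubric_counts(stream):
    """Count articles per 'textrubric' in a tab-separated metadata stream."""
    reader = csv.DictReader(stream, delimiter="\t")
    # Rows without a category are grouped under a placeholder label.
    return Counter(row["textrubric"] or "(none)" for row in reader)


# Invented sample rows; real category names and values will differ.
SAMPLE = (
    "textid\ttextname\ttextrubric\tdate\ttime\ttags\tsource\n"
    "n1\tFirst headline\tEconomy\t2016-03-01\t09:15\tbanks\t\n"
    "n2\tSecond headline\tSport\t2016-03-01\t10:40\tfootball\t\n"
    "n3\tThird headline\tEconomy\t2016-03-02\t08:05\toil\t\n"
)

counts = rubric_counts(io.StringIO(SAMPLE))
for rubric, n in counts.most_common():
    print(rubric, n)
```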
- 'textid' – unique ID
- 'textname' – title
- 'textrubric' – article category
- 'date' – date of publication
- 'time' – time of publication
- 'tags' – article tags
- 'source' – reference to the original source (sometimes unavailable)
![alt text]({{ site.baseurl }}/assets/images/interfax_tags.png "corpus segments")
Click here for more info.
- 'textid' – unique ID
- 'textname' – title
- 'textdiff' – text difficulty
- 'author' – author(s)
- 'textrubric' – article category
- 'date' – date of publication
- 'time' – time of publication
- 'tags' – article tags
- 'source' – reference to the original source (sometimes unavailable)
![alt text]({{ site.baseurl }}/assets/images/nplus1_diff.png "corpus segments")
Click here for more info.
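
The 'textdiff' field above lends itself to simple aggregation. The sketch below averages difficulty per article category, assuming that 'textdiff' holds a numeric value and that the metadata is a tab-separated table with a header row; both are assumptions, and `mean_difficulty_by_rubric` and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def mean_difficulty_by_rubric(stream):
    """Return {textrubric: mean textdiff} for a tab-separated metadata stream."""
    reader = csv.DictReader(stream, delimiter="\t")
    values = defaultdict(list)
    for row in reader:
        try:
            diff = float(row["textdiff"])
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric difficulty
        values[row["textrubric"]].append(diff)
    return {rubric: sum(v) / len(v) for rubric, v in values.items()}


# Invented sample rows, not taken from the corpus.
SAMPLE = (
    "textid\ttextname\ttextdiff\tauthor\ttextrubric\tdate\ttime\ttags\tsource\n"
    "a1\tFirst article\t2.0\tAuthor A\tPhysics\t2017-01-10\t11:00\tlasers\t\n"
    "a2\tSecond article\t4.0\tAuthor B\tPhysics\t2017-01-11\t12:30\tquarks\t\n"
    "a3\tThird article\t\tAuthor C\tBiology\t2017-01-12\t09:45\tcells\t\n"
    "a4\tFourth article\t3.5\tAuthor A\tBiology\t2017-01-13\t14:20\tgenes\t\n"
)

means = mean_difficulty_by_rubric(io.StringIO(SAMPLE))
print(means)
```

If the difficulty values turn out to be categorical rather than numeric, a `Counter` over the raw strings would be the more suitable summary.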
- 'textid' – unique ID
- 'textname' – title
- 'textregion' – region the news item covers
- 'textrubric' – article category
- 'date' – date of publication
- 'time' – time of publication
- 'tags' – article tags
- 'source' – reference to the original source (sometimes unavailable)
![alt text]({{ site.baseurl }}/assets/images/kp_regions.png "corpus segments")
Click here for more info.
- 'textid' – unique ID
- 'textname' – title
- 'magazine' – magazine title
- 'author' – author(s)
- 'date' – date of publication
- 'time' – time of publication
- 'tags' – tags
- 'source' – reference to the original source (sometimes unavailable)
Click here for more info.
- 'textid' – unique ID
- 'textname' – title
- 'textregion' – region the news item covers
- 'textrubric' – article category
- 'date' – date of publication
- 'time' – time of publication
- 'tags' – article tags
- 'source' – reference to the original source (sometimes unavailable)
![alt text]({{ site.baseurl }}/assets/images/fontanka_years.png "corpus segments")
Click here for more info.
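
A per-year breakdown of texts can be derived from the 'date' field above. Since the date format is not stated, this sketch only assumes that each date contains a four-digit year somewhere in the string and that the metadata is a tab-separated table with a header row; `texts_per_year` and the sample rows are hypothetical.

```python
import csv
import io
import re
from collections import Counter

# A four-digit run not adjacent to other digits, e.g. 2014 in "17.05.2014".
YEAR_RE = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def texts_per_year(stream):
    """Count texts per publication year in a tab-separated metadata stream."""
    reader = csv.DictReader(stream, delimiter="\t")
    years = Counter()
    for row in reader:
        match = YEAR_RE.search(row["date"] or "")
        if match:  # rows without a recognizable year are ignored
            years[int(match.group(1))] += 1
    return years


# Invented sample rows with deliberately mixed date styles.
SAMPLE = (
    "textid\ttextname\ttextregion\ttextrubric\tdate\ttime\ttags\tsource\n"
    "f1\tFirst item\tRegion A\tCity\t2014-05-17\t10:00\ttransport\t\n"
    "f2\tSecond item\tRegion A\tCity\t17.05.2014\t11:30\tweather\t\n"
    "f3\tThird item\tRegion B\tBusiness\t2015-01-02\t08:15\tretail\t\n"
    "f4\tFourth item\tRegion B\tBusiness\t\t\t\t\n"
)

per_year = texts_per_year(io.StringIO(SAMPLE))
for year in sorted(per_year):
    print(year, per_year[year])
```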
- 'textid' – unique ID
- 'textname' – title
- 'authors' – author(s)
- 'authorprofession' – author's profession
- 'about_author' – short author bio
- 'textrubric' – article category
- 'date' – date of publication
- 'time' – time of publication
- 'tags' – article tags
- 'source' – reference to the original source (sometimes unavailable)
![alt text]({{ site.baseurl }}/assets/images/arzamas_rubrics.png "corpus segments")
Click here for more info.
- 'textid' – unique ID
- 'title' – film title
- 'language' – language
- 'filepath' – file path
![alt text]({{ site.baseurl }}/assets/images/tvsubtitles_langs.png "corpus segments")
Click here for more info.
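
The subtitle metadata above links each film to a language and a file path, so subtitle files can be grouped by language before processing. The sketch assumes a tab-separated metadata table with a header row; the layout, the `paths_by_language` helper, the language codes, and the sample paths are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def paths_by_language(stream):
    """Return {language: [filepath, ...]} for a tab-separated metadata stream."""
    reader = csv.DictReader(stream, delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["language"]].append(row["filepath"])
    return dict(grouped)


# Invented sample rows; real language labels and paths will differ.
SAMPLE = (
    "textid\ttitle\tlanguage\tfilepath\n"
    "s1\tExample Show\tru\tsubs/ru/example_show_1.txt\n"
    "s2\tExample Show\ten\tsubs/en/example_show_1.txt\n"
    "s3\tAnother Show\tru\tsubs/ru/another_show_1.txt\n"
)

by_lang = paths_by_language(io.StringIO(SAMPLE))
for lang, paths in sorted(by_lang.items()):
    print(lang, len(paths))
```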