Corpus segments

Here you can explore information about the sources that make up our corpus and download them.

Segment information

| Genre | Tokens, millions | Share, % |
|:---|---:|---:|
| News | 92 | 1.5 |
| Literary Texts | 4605 | 76 |
| Special datasets | 2.5 | 0.5 |
| Social media | 80 | 1.5 |
| Subtitles | 101 | 1.5 |
| Poems | 1130 | 19 |
<iframe src="https://cdn.datamatic.io/runtime/echarts/3.7.2_230/embedded/index.html#id=115038797393892898117/1XxvinvhVz-Gh0WJzjQ_0sD5_f7coQueI" frameborder="0" width="687" height="493" allowtransparency="true"></iframe>

Token distribution per segment
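
The share column can be recomputed from the token counts in the table. The snippet below is a minimal sketch of that arithmetic; the function name `segment_shares` is illustrative and not part of any corpus tooling. The published percentages appear to be rounded (they sum to exactly 100), so the recomputed values will not match them exactly, most visibly for the smallest segment.

```python
# Token counts in millions, copied from the segment table.
TOKENS_MILLIONS = {
    "News": 92,
    "Literary Texts": 4605,
    "Special datasets": 2.5,
    "Social media": 80,
    "Subtitles": 101,
    "Poems": 1130,
}


def segment_shares(tokens):
    """Return each segment's share of the total token count, in percent."""
    total = sum(tokens.values())
    return {genre: 100 * count / total for genre, count in tokens.items()}


shares = segment_shares(TOKENS_MILLIONS)

# Print the segments from largest to smallest share.
for genre, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{genre:<18}{share:6.2f}%")
```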

Stihi.ru

Meta-attributes:

  • 'textrubric' – genre of the poem
  • 'textid' – unique ID
  • 'textname' – poem title
  • 'author' – author(s)
  • 'authortexts' – number of poems written by the author
  • 'authorreaders' – number of visitors who read the poem
  • 'date' – date of publication
  • 'time' – time of publication
  • 'source' – reference to the original source (sometimes unavailable)
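
The attribute list above can be mirrored as a typed record. This is only a sketch: it assumes the metadata reaches your code as a mapping of string keys to string values, which this page does not state, and the names `PoemMeta` and `parse_poem_meta` are hypothetical. Dates and times are kept as plain strings because their format is not specified here.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PoemMeta:
    textid: str
    textname: str
    textrubric: str
    author: str
    authortexts: int
    authorreaders: int
    date: str  # format not specified on this page, so left unparsed
    time: str
    source: Optional[str]  # sometimes unavailable


def parse_poem_meta(raw: Mapping[str, str]) -> PoemMeta:
    """Build a PoemMeta from a string-valued mapping (assumed input shape)."""
    return PoemMeta(
        textid=raw["textid"],
        textname=raw["textname"],
        textrubric=raw["textrubric"],
        author=raw["author"],
        authortexts=int(raw["authortexts"]),
        authorreaders=int(raw["authorreaders"]),
        date=raw["date"],
        time=raw["time"],
        source=raw.get("source") or None,  # missing or empty becomes None
    )


# Made-up values for illustration only.
example = parse_poem_meta({
    "textid": "0001",
    "textname": "Example title",
    "textrubric": "example rubric",
    "author": "Example Author",
    "authortexts": "12",
    "authorreaders": "340",
    "date": "example date",
    "time": "example time",
})
print(example)
```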

Distribution of Poems by Genre

![Distribution of Stihi.ru poems by genre]({{ site.baseurl }}/assets/images/stihi_ru_rubrics.png "corpus segments")

Click here for more info.

Proza.ru

Meta-attributes:

  • 'textrubric' – text genre
  • 'textid' – unique ID
  • 'textname' – title
  • 'author' – author(s)
  • 'authortexts' – number of texts written by the author
  • 'authorreaders' – number of visitors who read the text
  • 'date' – date of publication
  • 'time' – time of publication
  • 'source' – reference to the original source (sometimes unavailable)

Distribution of Texts by Genre

![Distribution of Proza.ru texts by genre]({{ site.baseurl }}/assets/images/proza_ru_textrubric.png "corpus segments")

Click here for more info.

Lenta.ru

Meta-attributes:

  • 'textid' – unique ID
  • 'textname' – article title
  • 'textrubric' – article category
  • 'date' – date of publication
  • 'time' – time of publication
  • 'tags' – article tags
  • 'source' – reference to the original source (sometimes unavailable)

Distribution of Articles by Category

![Distribution of Lenta.ru articles by category]({{ site.baseurl }}/assets/images/lenta_rubrics.png "corpus segments")

Click here for more info.

Interfax

Meta-attributes:

  • 'textid' – unique ID
  • 'textname' – title
  • 'textrubric' – article category
  • 'date' – date of publication
  • 'time' – time of publication
  • 'tags' – article tags
  • 'source' – reference to the original source (sometimes unavailable)

Distribution of Articles by Tag

![Distribution of Interfax articles by tag]({{ site.baseurl }}/assets/images/interfax_tags.png "corpus segments")

Click here for more info.

NPlus1

Meta-attributes:

  • 'textid' – unique ID
  • 'textname' – title
  • 'textdiff' – text difficulty
  • 'author' – author(s)
  • 'textrubric' – article category
  • 'date' – date of publication
  • 'time' – time of publication
  • 'tags' – article tags
  • 'source' – reference to the original source (sometimes unavailable)
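
A distribution over any single meta-attribute, such as 'textdiff' here, can be tallied with a counter. The sketch below assumes each text's metadata is available as a mapping of attribute names to string values, which this page does not specify; `attribute_distribution` is a hypothetical helper, and the sample values are invented.

```python
from collections import Counter
from typing import Iterable, Mapping


def attribute_distribution(
    records: Iterable[Mapping[str, str]], attribute: str
) -> Counter:
    """Count how many records carry each value of the given meta-attribute."""
    return Counter(rec[attribute] for rec in records if attribute in rec)


# Made-up records for illustration only; real 'textdiff' values may look different.
records = [
    {"textid": "1", "textdiff": "3.1"},
    {"textid": "2", "textdiff": "5.4"},
    {"textid": "3", "textdiff": "3.1"},
    {"textid": "4"},  # a record without the attribute is skipped
]

distribution = attribute_distribution(records, "textdiff")
for value, count in distribution.most_common():
    print(value, count)
```

The same helper works for the other segments by passing a different attribute name, for example 'textrubric' or 'tags', provided the attribute holds a single value per text.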

Distribution of Texts by Difficulty

![Distribution of NPlus1 texts by difficulty]({{ site.baseurl }}/assets/images/nplus1_diff.png "corpus segments")

Click here for more info.

Komsomolskaya Pravda

Meta-attributes:

  • 'textid' – unique ID
  • 'textname' – title
  • 'textregion' – region the article relates to
  • 'textrubric' – article category
  • 'date' – date of publication
  • 'time' – time of publication
  • 'tags' – article tags
  • 'source' – reference to the original source (sometimes unavailable)

Distribution of Articles by Region

![Distribution of Komsomolskaya Pravda articles by region]({{ site.baseurl }}/assets/images/kp_regions.png "corpus segments")

Click here for more info.

Russian Magazines Hall

Meta-attributes:

  • 'textid' – unique ID
  • 'textname' – title
  • 'magazine' – magazine title
  • 'author' – author(s)
  • 'date' – date of publication
  • 'time' – time of publication
  • 'tags' – tags
  • 'source' – reference to the original source (sometimes unavailable)

Click here for more info.

Fontanka.ru

Meta-attributes:

  • 'textid' – unique ID
  • 'textname' – title
  • 'textregion' – region the article relates to
  • 'textrubric' – article category
  • 'date' – date of publication
  • 'time' – time of publication
  • 'tags' – article tags
  • 'source' – reference to the original source (sometimes unavailable)

Distribution of Articles by Year

![Distribution of Fontanka.ru articles by year]({{ site.baseurl }}/assets/images/fontanka_years.png "corpus segments")

Click here for more info.

Arzamas

Meta-attributes:

  • 'textid' – unique ID
  • 'textname' – title
  • 'authors' – author(s)
  • 'authorprofession' – author's profession
  • 'about_author' – short author bio
  • 'textrubric' – article category
  • 'date' – date of publication
  • 'time' – time of publication
  • 'tags' – article tags
  • 'source' – reference to the original source (sometimes unavailable)

Distribution of Articles by Category

![Distribution of Arzamas articles by category]({{ site.baseurl }}/assets/images/arzamas_rubrics.png "corpus segments")

Click here for more info.

TV Subtitles

Meta-attributes:

  • 'textid' – unique ID
  • 'title' – film title
  • 'language' – language
  • 'filepath' – file path

Distribution of Texts by Language

![Distribution of TV subtitle texts by language]({{ site.baseurl }}/assets/images/tvsubtitles_langs.png "corpus segments")

Click here for more info.