Want more data in Russian? 33 billion words available at Omnia Russica
Taiga corpus + Aranea + Wikipedia + Common Crawl, all together
Universal language model for Russian - see on github!
Taiga corpus is now open for downloading!
Spark in me has released a fastText model trained on a random mix of Russian Wikipedia, Taiga and Common Crawl
Params: standard settings - (3,6) character n-grams, vector dimensionality 300.
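To make the (3,6) n-gram setting concrete, here is a minimal sketch of how a word can be split into character n-grams of length 3 to 6 after wrapping it in the `<` and `>` boundary markers that fastText uses. The function name `char_ngrams` is made up for illustration, and this is a simplified picture rather than the library's actual implementation (which also hashes n-grams into buckets).

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return character n-grams of a word wrapped in '<' and '>' markers.

    Simplified illustration only; not fastText's own code.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Trigrams only, to keep the output short.
print(char_ngrams("тайга", 3, 3))
```

Because a word vector is built from its n-gram vectors, a model trained this way can also produce vectors for words it has never seen.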
Usage:
import fastText as ft
ft_model_big = ft.load_model('model')
For the rest of the API, refer to https://github.com/facebookresearch/fastText/blob/master/python/fastText/FastText.py
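As one possible next step after loading, the sketch below compares two words by the cosine similarity of their vectors. It assumes only that the loaded model exposes `get_word_vector`, which is part of the fastText Python API; the helper names `cosine_similarity` and `word_similarity` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction; treat similarity as 0.
        return 0.0
    return dot / (norm_u * norm_v)


def word_similarity(model, word_a, word_b):
    """Compare two words using any model that has get_word_vector()."""
    return cosine_similarity(
        model.get_word_vector(word_a),
        model.get_word_vector(word_b),
    )
```

With the model loaded as above, a call would look like `word_similarity(ft_model_big, "тайга", "лес")`.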
We have added a fake news dataset to our corpus collection: 150k tokens from panorama.pub (a Russian fake news site, analogous to The Onion) - tagged and added to the "News" segment.
Other news sources can be treated as a "reliable news" class in a binary classification task.
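A minimal sketch of how such a binary labelling could be prepared, assuming each text comes with a source identifier. The function name, the label names `fake` and `reliable`, and the way sources are identified are all assumptions for illustration; only the `__label__` prefix is the standard fastText supervised input format.

```python
def to_fasttext_line(text, source, fake_sources=("panorama.pub",)):
    """Format one news text as a fastText supervised training line.

    Texts from sources listed in fake_sources get the 'fake' label,
    everything else gets 'reliable' (label names are hypothetical).
    """
    label = "fake" if source in fake_sources else "reliable"
    # fastText expects one example per line, so collapse all whitespace.
    flat = " ".join(text.split())
    return f"__label__{label} {flat}"


# 'other-news-source' is a placeholder, not a real segment name.
print(to_fasttext_line("Some headline\nand body text", "panorama.pub"))
print(to_fasttext_line("Another headline", "other-news-source"))
```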
Thanks to Andrey Kutuzov, a skipgram model for Taiga is now available on the RusVectores project page
On the RusVectores page you can also find an online semantic similarity calculator for different models, including Taiga
A Taiga subcorpus with manual annotation is now available on universaldependencies.org (Russian)
- training: 50% (10K tokens, 880 sentences)
- test: 50% (10K tokens, 884 sentences)
Subcorpus repository
All credits go to Olga Lyashevskaya and students