vanangamudi/awesome-resources-for-indic-nlp

Awesome Resources for IndicNLP

Common Resources

OPUS, the open parallel corpus

A Dravidian Etymological Dictionary

Byte Pair Encoding - Pretrained for 275 languages

FastText word vectors for 157 languages

Indian Language Technology Proliferation and Deployment Center

Center For Indian Language Technology - CFILT FB page

Indian Institute of Language Studies (IILS)

Central Institute of Indian Languages

OpenSLR Speech datasets

Research Papers

Survey: Natural Language Parsing for Indian Languages

Language Specific

Malayalam

mlmorph - Malayalam Morphological Analyzer using Finite State Transducer

Tamil

Datasets

Tamil text datasets

Other projects

Open Tamil: a suite of tools for operating on Tamil text.

Tokenizer, Language model and Classifier for Tamil language by Ravi Annaswamy

Scrapers

  1. Tamil Etymological Dictionary
  2. Newspaper Crawlers

ML models

Text classification model in PyTorch: can be easily applied to other datasets. In fact, the linked repository also contains a dataset of film reviews in Tamil.

Bengali

Bangla2Vec

Bengali News Classification

NLP for Bengali

  • Contains a Wikipedia articles dataset (72,374 articles) and the scripts used to scrape Wikipedia and clean the dataset
  • Contains a language model with perplexity ~41
  • Contains a Bengali news classification model with 94% accuracy

Scrapers

Bengali News Channel Scraper

Telugu

Telugu-NLP - Contains NLP tools developed for Telugu

Research Papers and Data

Research Papers in Bengali NLP

Collection of Repositories

| Language | Repository | Perplexity of language model | Wikipedia articles dataset | Classification accuracy (%) | Classification Kappa score |
|---|---|---|---|---|---|
| Hindi | NLP for Hindi | ~36 | 55,000 articles | ~79 (News Classification) | ~30 (Movie Review Classification) |
| Punjabi | NLP for Punjabi | ~13 | 44,000 articles | ~89 (News Classification) | ~60 (News Classification) |
| Sanskrit | NLP for Sanskrit | ~6 | 22,273 articles | ~70 (Shloka Classification) | ~56 (Shloka Classification) |
| Gujarati | NLP for Gujarati | ~34 | 31,913 articles | ~91 (News Classification) | ~85 (News Classification) |
| Kannada | NLP for Kannada | ~70 | 32,997 articles | ~94 (News Classification) | ~90 (News Classification) |
| Malayalam | NLP for Malyalam | ~26 | 12,388 articles | ~94 (News Classification) | ~91 (News Classification) |
| Nepali | NLP for Nepali | ~32 | 38,757 articles | ~97 (News Classification) | ~96 (News Classification) |
| Odia | NLP for Odia | ~27 | 17,781 articles | ~95 (News Classification) | ~92 (News Classification) |
| Marathi | NLP for Marathi | ~18 | 85,537 articles | ~91 (News Classification) | ~84 (News Classification) |
| Bengali | NLP for Bengali | ~41 | 72,374 articles | ~94 (News Classification) | ~92 (News Classification) |
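
For readers comparing the numbers in the list above, here is a minimal Python sketch of how a language-model perplexity and a classification Kappa score can be computed. It assumes the Kappa column refers to Cohen's kappa and that perplexity is the exponential of the mean negative log-probability per token. The linked repositories may compute these differently, and the function names `perplexity` and `cohen_kappa` are illustrative, not taken from any of them.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities the model assigned to each token.

    Computed as exp of the mean negative log-probability (lower is better).
    """
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels and predictions, corrected for chance.

    Returns a value in [-1, 1]; tables often report it multiplied by 100.
    Undefined (division by zero) when chance agreement is exactly 1.
    """
    n = len(y_true)
    # Fraction of examples where prediction and label agree (plain accuracy).
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Agreement expected if predictions were independent of the labels,
    # given how often each class appears on each side.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # A model that spreads probability uniformly over 41 choices at every step.
    print(perplexity([math.log(1 / 41)] * 5))
    # Toy binary labels, purely illustrative.
    print(cohen_kappa([0, 0, 1, 1], [0, 0, 1, 0]))
```

Kappa sits below accuracy whenever some agreement would be expected by chance, which is one possible reason the two classification columns differ for the same model.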
