A text PREprocessor for TWeets in the ITAlian language

andreafailla/pretwita

PreTwITA – DISCONTINUED

PreTwITA is an open-source preprocessor for tweets in the Italian language, written in Python. The library provides language-specific tools for text cleaning (i.e. the process of preparing raw text for Natural Language Processing).

Included features

  • correction of the most common Italian abbreviations (e.g. xk is replaced with perché)
  • remove URLs
  • remove emojis
  • remove emoticons
  • remove mentions
  • remove hashtags
  • remove Twitter reserved words (i.e. 'rt' and 'fav')
  • remove stopwords
    • an option to define additional stopwords
  • remove punctuation
  • remove numbers
    • an option to avoid removing dates in yyyy format
  • remove multiple spaces
  • tokenization
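
To make the features above concrete, here is a minimal, standard-library sketch of what such a cleaning pipeline might look like. It is not PreTwITA's actual API or implementation: the function name `clean_tweet`, its parameters, the regular expressions, and the tiny abbreviation and stopword tables are all hypothetical and chosen for illustration only (only the xk to perché mapping and the 'rt'/'fav' reserved words come from the list above). It also assumes that "dates in yyyy format" means four-digit years starting with 19 or 20.

```python
import re

# Hypothetical tables for illustration; a real library would ship larger ones.
ABBREVIATIONS = {"xk": "perché"}            # "xk" -> "perché" is from the feature list
STOPWORDS = {"il", "la", "di", "e", "che", "a"}
RESERVED = {"rt", "fav"}                    # Twitter reserved words

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Anything that is neither a word character nor whitespace: this covers
# punctuation, emojis, and purely symbolic emoticons such as ":)".
# Emoticons that contain letters (e.g. ":D") are NOT fully handled here.
SYMBOL_RE = re.compile(r"[^\w\s]")
YEAR_RE = re.compile(r"(?:19|20)\d{2}")     # assumption: yyyy means 19xx or 20xx


def clean_tweet(text, extra_stopwords=(), keep_years=True):
    """Return a list of cleaned, lowercased tokens (illustrative sketch)."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)

    stop = STOPWORDS | RESERVED | set(extra_stopwords)
    tokens = []
    # str.split() with no arguments also collapses multiple spaces.
    for tok in text.split():
        tok = ABBREVIATIONS.get(tok, tok)   # expand known abbreviations
        if tok in stop:
            continue
        if tok.isdigit() and not (keep_years and YEAR_RE.fullmatch(tok)):
            continue                        # drop numbers, optionally keep years
        tokens.append(tok)
    return tokens


print(clean_tweet("RT @mario: xk non vieni nel 2020? http://t.co/abc #roma 123"))
```

Passing `keep_years=False` drops four-digit years along with other numbers, and `extra_stopwords` lets the caller extend the stopword list, mirroring the two options mentioned in the list.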

Installing PreTwITA

You can install PreTwITA via pip:
$ python -m pip install pretwita

Alternatively, install directly from the git repository if you want to be sure to get the latest updates:
$ pip install git+https://github.com/andreafailla/pretwita.git
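
After either install command, one way to confirm that the package is present is to query its installed version with the standard library. This is a small sketch rather than anything PreTwITA provides: the helper name `installed_version` is hypothetical, and it assumes the distribution is registered under the name `pretwita`, as in the pip command above.

```python
from importlib import metadata


def installed_version(dist_name="pretwita"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the install succeeded, otherwise None.
print(installed_version())
```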

Usage

For usage and tips, please refer to the demo.ipynb file.
