PreTwITA is an open-source Preprocessor for Tweets in the ITAlian language, written in Python. The purpose of this library is to provide the user with language-specific tools for text cleaning (i.e. the process of preparing raw text for Natural Language Processing). Its features include:
- correction of the most common Italian abbreviations (e.g. xk replaced with perché)
- remove URLs
- remove emojis
- remove emoticons
- remove mentions
- remove hashtags
- remove Twitter reserved words (i.e. 'rt' and 'fav')
- remove stopwords
- an option to define additional stopwords
- remove punctuation
- remove numbers
- an option to avoid removing dates in yyyy format
- remove multiple spaces
- tokenization
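
To make a few of the steps listed above concrete, here is a minimal standalone sketch built only on Python's standard `re` module. It is not PreTwITA's own API (see demo.ipynb for that): the function name `clean_tweet`, its parameters `keep_years` and `extra_stopwords`, and the regular expressions are all assumptions made for illustration. Emoji, emoticon, and Italian stopword handling are left out, and only tokens made up entirely of digits are treated as numbers.

```python
import re

# Hypothetical abbreviation map: only "xk" -> "perché" comes from the README.
ABBREVIATIONS = {"xk": "perché"}
# Twitter reserved words named in the feature list.
RESERVED_WORDS = {"rt", "fav"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
PUNCT_RE = re.compile(r"[^\w\s]")
# Assumption: a "date in yyyy format" is a four-digit year starting with 19 or 20.
YEAR_RE = re.compile(r"(?:19|20)\d{2}")


def clean_tweet(text, keep_years=True, extra_stopwords=()):
    """Illustrative cleaner: returns a list of lowercase tokens."""
    text = text.lower()
    # Strip URLs first so their punctuation does not leave fragments behind.
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)

    blocked = RESERVED_WORDS | set(extra_stopwords)
    tokens = []
    # str.split() with no arguments also collapses multiple spaces.
    for token in text.split():
        token = ABBREVIATIONS.get(token, token)
        if token in blocked:
            continue
        # Drop all-digit tokens unless they look like a year we were asked to keep.
        if token.isdigit() and not (keep_years and YEAR_RE.fullmatch(token)):
            continue
        tokens.append(token)
    return tokens


print(clean_tweet("RT @mario: xk nel 2020 ho visto 3 film? https://t.co/abc #cinema"))
# ['perché', 'nel', '2020', 'ho', 'visto', 'film']
```

Passing `keep_years=False` also drops `2020`, and `extra_stopwords=("nel",)` removes `nel` in addition to the reserved words.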
You can install PreTwITA via pip:
$ python -m pip install pretwita
Otherwise, install directly from the git repository if you want to be sure to get the latest updates:
$ pip install git+https://github.com/andreafailla/pretwita.git
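
After either command, a quick way to confirm that the package landed in the active environment is to query its metadata. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+) and assumes the installed distribution is named `pretwita`, as in the pip command; the helper `installed_version` is a hypothetical name, not part of PreTwITA.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="pretwita"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints a version string if the install succeeded, otherwise None.
print(installed_version())
```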
For usage and tips, please refer to the demo.ipynb file.