-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nltk.tokenize.casual.TweetTokenizer: Add support to tokenize emoji flag sequences #3034
Conversation
409f3f6
to
3aa1147
Compare
See #2578 (comment) for some related discussion. In short, I am in favour of adding this behaviour, but I would like to start a discussion on whether it would be wise to include an option to turn it off in After all, as you are aware, flags are made up of regional indicator symbols. These symbols are also frequently used outside of the context of flags, like to spell out words: 🇦 🇵 🇵 🇱 🇪. However, this is commonly done with spaces between the tokens, as otherwise that spelling of "APPLE" would have had a Polish flag in the middle. In short, it may not be an issue. Beyond that discussion, would you be able to write a few tests to back up that your implementation works as intended? For example in here: https://github.com/nltk/nltk/blob/develop/nltk/test/unit/test_tokenize.py Also, I'm uncertain why the CI is failing. It seems like a third party download of a file used in one of the tests has failed:
It's unrelated to your PR, though, so no stress. |
Yes, I'll modify that file as part of this PR. Given your comments, what do you think of the decision use a regex that encodes all pairs of enclosed letters as tokens? Under this behavior, your example is tokenized as follow. Is this "correct"? Or should this repo have a record of all ISO country code letter pairs and treat these separately in a much more complicated regex? My preference is to avoid this.
|
I also prefer your current implementation, i.e. any two directly adjacent regional indicator symbols. This is also more future-proof. The behaviour from your little snippet seems like the expected output as well. 👍 |
3aa1147
to
c6504c6
Compare
Great, implemented with a test in test_tokenize.py. |
c6504c6
to
b16e4bf
Compare
…ag sequences * TweetTokenizer currently splits emoji flags into individual tokens of enclosed letters, e.g. 🇨🇦 -> '🇨 ', '🇦 ' * This patch keeps emoji flag sequences intact * Without this PR: > ```python > nltk.tokenize.casual.TweetTokenizer().tokenize(text='Hi 🇨🇦, 😍!!') > ['Hi', '🇨', '🇦', ',', '😍', '!', '!'] > ``` * With this PR: > ```python > nltk.tokenize.casual.TweetTokenizer().tokenize(text='Hi 🇨🇦, 😍!!') > ['Hi', '🇨🇦', ',', '😍', '!', '!'] > ```
71e97de
to
0770bf9
Compare
I've expanded on your test case a little bit, and fixed a merge conflict on the AUTHORS.md file. Should be all set now! |
nltk.tokenize.casual.TweetTokenizer: Add support to tokenize emoji flag sequences