Equivalent to nltk.corpus stopwords #947

Open
Utopiah opened this issue Aug 12, 2022 · 1 comment

Comments

@Utopiah

Utopiah commented Aug 12, 2022

Hi, I'm just learning about the project and it's pretty amazing. I tinkered with NLTK and Gensim before, but this is so convenient to explore and embed on a page. Learning with Observable notebooks is also great!

That being said, I end up with a lot of noise in my selection. I tried a bit of normalize() and remove() with encouraging results. Still, I'm quite surprised that when I search this repository I can't seem to find any stop words.

This made me wonder, is this the "wrong" way in this context? Is the philosophy of compromise not to rely on such lists?

PS: I apologize for hijacking issues, but is there a forum/chat/platform for discussions on using compromise that would be a better place? I have other questions, like using .tfidf() on .ngrams(), but I don't want to create noise here.

@spencermountain
Owner

hey Fabien, you're talking about the results of the wikipedia plugin right?

Yeah, super noisy. It really needs a lot of work. I was using a stop-list here, but that was just me eyeballing it. It could really use a PR, if you want to take a swing at it.

To do it properly, we should also add (some!) Wikipedia redirects. I held off because the results were still so rowdy.
cheers
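
A minimal sketch of the stop-list idea discussed above, in plain JavaScript. It does not use compromise's API, and it is not the stop-list from the wikipedia plugin: the `STOP_WORDS` set, `tokenize`, and `countTerms` are hypothetical names chosen for illustration, and the word list is a tiny hand-picked sample rather than NLTK's.

```javascript
// Hypothetical, hand-picked stop list (an assumption for illustration;
// not taken from compromise, its wikipedia plugin, or NLTK).
const STOP_WORDS = new Set([
  'a', 'an', 'and', 'the', 'of', 'in', 'on', 'to', 'is', 'it', 'for', 'with'
])

// Split text into lowercase word tokens (letters, digits, apostrophes).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || []
}

// Count how often each term occurs, skipping tokens in the stop list.
// Returns [term, count] pairs sorted by descending count.
function countTerms(text, stopWords = STOP_WORDS) {
  const counts = new Map()
  for (const word of tokenize(text)) {
    if (stopWords.has(word)) continue
    counts.set(word, (counts.get(word) || 0) + 1)
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1])
}

console.log(countTerms('The cat and the hat of the cat'))
```

Passing a different `Set` as the second argument swaps in another list, so a longer or domain-specific stop list could be tried without touching the counting logic.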
