Equivalent to nltk.corpus stopwords #947

Utopiah · 2022-08-12T14:15:50Z

Hi, I'm just learning about the project and it's pretty amazing. I tinkered with NLTK and Gensim before, but this is so convenient to explore and embed on a page. Learning with Observable notebooks is also great!

That being said, I end up with a lot of noise in my selection. I tried a bit of normalize() and remove() with encouraging results. Still, I'm quite surprised that when I search this repository I can't find any stop words.

This made me wonder, is this the "wrong" way in this context? Is the philosophy of compromise not to rely on such lists?
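To make "such lists" concrete, here is a minimal sketch of the kind of stop-word filtering I have in mind, in plain JavaScript with no dependencies. The `STOP_WORDS` sample and the `removeStopWords` helper are hypothetical names made up for illustration; they are not part of compromise or NLTK, and the word list is a tiny hand-picked sample rather than a real corpus.

```javascript
// Hypothetical stop-word list: a tiny hand-picked sample, not an official list.
const STOP_WORDS = new Set(['a', 'an', 'and', 'the', 'of', 'in', 'is', 'to']);

// Hypothetical helper: lowercases the text, splits it on anything that is not
// a letter, digit, or apostrophe, and drops empty tokens and stop words.
function removeStopWords(text, stopWords = STOP_WORDS) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9']+/)
    .filter((word) => word.length > 0 && !stopWords.has(word));
}

const terms = removeStopWords('The history of the city is in a book');
console.log(terms); // [ 'history', 'city', 'book' ]
```

Something along these lines could presumably be applied to the extracted terms before counting them, but I don't know whether that fits the way the library is meant to be used.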

PS: I apologize for hijacking issues, but is there a forum/chat/platform for discussions on using compromise that would be a better place? I have other questions, like using .tfidf() on .ngrams(), but I don't want to create noise here.


spencermountain · 2022-08-12T18:04:15Z

hey Fabien, you're talking about the results of the wikipedia plugin right?

Yeah, super noisy. it really needs a lot of work. Yeah, i was using a stop-list here but that was just me eyeballing it. It could really use a PR, if you want to take a swing at it.

To do it properly, we should also add (some!) wikipedia redirects. I held off because the results were still so rowdy.
cheers

spencermountain added the Discussion label Aug 12, 2022
