
added TfidfTransformer and TfidfVectorizer to feature_extraction.text #869

Open · wants to merge 8 commits into base: main
Conversation

ParticularMiner

Hi,

Thanks to all dask-developers for your outstanding work!

In this PR, I have applied my rudimentary knowledge of Dask to add Dask implementations of TfidfTransformer and TfidfVectorizer (from the sklearn.feature_extraction.text module) to the dask_ml.feature_extraction.text module. For now, only minimal working code is available (no unit tests yet), though the examples hard-coded into the docstrings should run without incident.

I think someone with proper Dask expertise should inspect it and give me some pointers. In the meantime, I'll draw up the tests.

Hopefully this will prove to be a useful extension to dask-ml. It certainly would be for me, if it is eventually merged upstream.
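A hedged sketch of the intended usage, modeled on dask-ml's existing CountVectorizer API (the class name and module path follow this PR's description; the exact docstring examples are not reproduced here):

import dask.bag as db
from dask_ml.feature_extraction.text import TfidfVectorizer

corpus = db.from_sequence(
    ["this is the first document",
     "this document is the second document"],
    npartitions=2,
)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # a dask array of TF-IDF weights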

@@ -12,9 +12,13 @@
import scipy.sparse
import sklearn.base
import sklearn.feature_extraction.text
import sklearn.preprocessing
Author

This import is needed for its normalize() function.

Member

Does sklearn.preprocessing.normalize eagerly return a NumPy array? Or does it operate lazily on Dask Arrays?

If it's eager, we would need to reimplement it.

Author

Since I use sklearn.preprocessing.normalize() only through dask.array.Array.map_blocks(), the operation is lazy even though sklearn.preprocessing.normalize() is itself not lazy.

I hope this is acceptable.
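For illustration, a minimal sketch of this pattern (an editor's example, not the PR's code): map_blocks records the eager function in the task graph, and it only runs when the graph is computed.

import dask.array as da
from sklearn.preprocessing import normalize

# chunk only along rows, so each block holds complete rows
X = da.random.random((1000, 20), chunks=(250, 20))

# normalize() runs eagerly on each NumPy block, but only at compute time;
# row-wise L2 normalization is chunk-local, so mapping it per block is safe
X_norm = X.map_blocks(normalize, norm="l2")

X_norm.compute()  # the computation happens here, not above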

params = self.get_params()
subclass_instance_params = self.get_params()
excluded_keys = getattr(self, '_non_CountVectorizer_params', [])
params = {key: subclass_instance_params[key]
Author

This is my "patch-up" solution to get params to hold only parameters from CountVectorizer and not its subclasses.

Member

I'm a bit worried about a parent class needing to know about the details of its subclasses.

Is it possible for each subclass to override get_params to do the right thing?

--------
sklearn.feature_extraction.text.TfidfTransformer

Examples
Author

These examples have worked for me.

--------
sklearn.feature_extraction.text.TfidfVectorizer

Examples
Author

These examples have worked for me.

@ParticularMiner (Author)

Tests have now been added.

@ParticularMiner (Author)

I have also just now added support for dask.dataframe.Series to CountVectorizer and TfidfVectorizer.

@TomAugspurger (Member) left a comment

I've started to review; it'll be a while before I can finish.

Can you share a bit about:

  1. How does this scale for large inputs?
  2. Where does computation occur during initialization, fitting, and transforming, and why can't it be done lazily?

.gitignore Outdated
@@ -122,3 +122,5 @@ docs/source/auto_examples/
docs/source/examples/mydask.png

dask-worker-space
/.project
Member

I'd recommend putting this in a global gitignore file: https://stackoverflow.com/questions/7335420/global-git-ignore

Author

Many thanks for the recommendation. I was unaware of this trick.


aggregate=np.sum,
axis=0,
concatenate=False,
dtype=dtype).compute().astype(dtype)
Member

Why is the astype needed? Shouldn't passing dtype to reduction ensure it's already the right type?

Also, do we need to compute in this function, or can it be done lazily? (I haven't looked at how this is used yet.)

Author

You're right, I should have removed those.
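For the record, a sketch of the lazy alternative (an editor's illustration, not the PR's code): passing dtype to da.reduction already fixes the output dtype, and dropping .compute() keeps the result lazy.

import dask.array as da
import numpy as np

X = da.random.random((100, 8), chunks=(25, 8)) > 0.5  # toy boolean term matrix

df = da.reduction(
    X,
    chunk=lambda x, axis, keepdims: x.sum(axis=axis, keepdims=keepdims),
    aggregate=np.sum,
    axis=0,
    dtype=np.int64,  # output dtype is fixed here; no trailing astype needed
)
# df stays lazy; df.compute() triggers the work only when required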


*vocabularies.to_delayed()
)
vocabulary = vocabulary_for_transform = (
_merge_vocabulary( *vocabularies.to_delayed() ))
Member

This seems like it'll cause a linting error. The contributing docs should have some info about setting up pre-commit.

Author

I'll check the contributing docs. Does this repo run GitHub Actions workflows that also lint PRs? If so, that would make it easier to standardize the coding style.

result = raw_documents.map_partitions(
_count_vectorizer_transform, vocabulary_for_transform, params)
result = build_array(result, n_features, meta)
result.compute_chunk_sizes()
Member

Why is this necessary? Ideally we avoid all unnecessary computation.

@ParticularMiner (Author)

Hi @TomAugspurger

Thanks for your review.

From your comments I realize that the Dask programming paradigm is to delay all computations until the user calls compute() outside of the class, right? I guess that's the challenge for me right now. I'll see what I can do to achieve this.

@TomAugspurger (Member)

> …is to delay all computations until the user calls compute() outside of the class, right? I guess that's the challenge for me right now. I'll see what I can do to achieve this.

If possible, yes. But sometimes intermediate computation is inevitable. .fit will often require a compute to learn the parameters (e.g. StandardScaler); we just want to make sure it's actually required.
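A quick sketch of that point (an editor's illustration): fitting learns concrete parameters and so forces one compute, while the transform itself stays lazy.

import dask
import dask.array as da

X = da.random.random((1000, 3), chunks=(250, 3))

# fit: one unavoidable compute to learn the parameters
mean, std = dask.compute(X.mean(axis=0), X.std(axis=0))

# transform: purely lazy arithmetic on the dask array
X_scaled = (X - mean) / std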

@ParticularMiner (Author)

Hi @TomAugspurger

> If possible, yes. But sometimes intermediate computation is inevitable. .fit will often require a compute to learn the parameters (e.g. StandardScaler); we just want to make sure it's actually required.

True.

I've cleaned things up a bit now — all unnecessary calls to compute() have been removed — and TfidfTransformer's fit() function is now lazy, that is, it does not learn the parameters until the first call to TfidfTransformer's transform() function is made. Thereafter, all learned data remains in memory and so does not need to be computed again.

Also, TfidfTransformer's transform() function, and TfidfVectorizer's fit(), fit_transform(), and transform() functions are all lazy.

Currently all tests are passing.

The one outstanding issue concerns the 'worrying' side effect of using self.get_params() in CountVectorizer. I'm not sure why the original developers chose to use this function, since sklearn.feature_extraction.text.CountVectorizer's own fit_transform() and transform() do not use it. So there might be a way to circumvent it; I'll take a closer look.
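The lazy-fit behaviour described above can be sketched roughly as follows (a hypothetical minimal analogue, not the PR's code):

import numpy as np

class LazyIdf:
    def fit(self, X):
        # X is a dask array; everything here only builds a lazy expression
        n_samples = X.shape[0]
        df = (X > 0).sum(axis=0)  # lazy document frequencies
        self.idf_ = np.log((1 + n_samples) / (1 + df)) + 1  # still lazy
        return self

    def transform(self, X):
        if not isinstance(self.idf_, np.ndarray):
            self.idf_ = self.idf_.compute()  # first call: materialize once
        return X * self.idf_  # the scaling itself stays lazy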

@ParticularMiner (Author)

@TomAugspurger

By the way, I do not have access to a cluster, so I'm not sure how the code scales with cluster size. I merely presumed that if I wrote code similar to dask-ml's existing CountVectorizer, things would be fine.

If you know of any way I can test the code in a truly distributed environment, kindly let me know.
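One common option (an editor's suggestion, not from the thread): dask.distributed's LocalCluster simulates a multi-worker cluster on a single machine, which at least exercises the distributed scheduler and inter-worker communication paths.

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)
# run the fit/transform tests here; tasks now go through the distributed
# scheduler and are spread across the four local workers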

@@ -166,10 +215,35 @@ class CountVectorizer(sklearn.feature_extraction.text.CountVectorizer):
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
"""

def get_CountVectorizer_params(self, deep=True):
Author

@TomAugspurger

> I'm a bit worried about a parent class needing to know about the details of its subclasses.
> Is it possible for each subclass to override get_params to do the right thing?

How about this? I've instead added a new method to CountVectorizer called .get_CountVectorizer_params(), whose implementation is a slight modification of the original .get_params() from the sklearn.base.BaseEstimator class but which does what is expected. Subclasses do not need to override it, and CountVectorizer never gets to "know" the parameters of its subclasses. I hope this is acceptable.
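Roughly, the idea can be sketched like this (a hypothetical illustration mirroring BaseEstimator's parameter introspection, not the PR's exact code):

import inspect
import sklearn.feature_extraction.text

def get_countvectorizer_params(estimator):
    # collect only the argument names declared by CountVectorizer.__init__,
    # so a subclass instance never leaks its extra parameters
    cls = sklearn.feature_extraction.text.CountVectorizer
    names = [p for p in inspect.signature(cls.__init__).parameters
             if p != "self"]
    return {name: getattr(estimator, name) for name in names}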
