Fix simple impute #788

abduhbm · 2021-02-04T17:04:29Z

Fix #787.
Also related to #779

abduhbm · 2021-02-04T23:12:01Z

CI is failing because of linting issues.

TomAugspurger

Thanks. A few questions / comments.

TomAugspurger · 2021-02-28T13:27:21Z

dask_ml/impute.py

@@ -70,12 +70,18 @@ def _fit_frame(self, X):
        if self.strategy == "mean":
            avg = X.mean(axis=0).values
        elif self.strategy == "median":
-            avg = X.quantile().values
+            avg = [np.median(X[col].dropna()) for col in X.columns]


I believe this will eagerly compute the values because of np.median. Since that happens inside a list comprehension, we'd end up executing the graph for X once per column. We want to delay computation until the end.

I also think this will pull all the data for a column into a single ndarray to compute the median, which we also want to avoid.

abduhbm · 2021-03-25

How about using delayed here?

avg = [dask.delayed(np.median)(X[col].dropna()) for col in X.columns]

TomAugspurger · 2021-02-28T13:27:47Z

dask_ml/impute.py

+            for col in X.columns:
+                val_counts = X[col].value_counts().reset_index()
+                if isinstance(X, dd.DataFrame):
+                    x = val_counts.to_dask_array(lengths=True)


Do we need lengths here? This also triggers a computation.

abduhbm · 2021-03-25

This is needed to compute chunk sizes ... any suggestions on how to avoid it? Thanks.

abduhbm added 4 commits February 4, 2021 19:28

Fix median and most_frequent strategies in SimpleImpute._fit_frame

a8c228c

Lint

3c2831c

compat

a92bfd5

Fix compat for finding smallest most_frequent

ffaeb80

Merge branch 'main' of github.com:dask/dask-ml into fix-simple-impute

b15ef37

TomAugspurger reviewed Feb 28, 2021

abduhbm commented Feb 4, 2021

abduhbm commented Feb 4, 2021

TomAugspurger left a comment

TomAugspurger Feb 28, 2021

abduhbm Mar 25, 2021

TomAugspurger Feb 28, 2021

abduhbm Mar 25, 2021