Fix simple impute #788

Open
wants to merge 5 commits into
base: main
Conversation

abduhbm (Contributor) commented Feb 4, 2021

Fix #787.
Also related to #779

abduhbm (Contributor, Author) commented Feb 4, 2021

CI is failing because of linting issues.

TomAugspurger (Member) left a comment

Thanks. A few questions / comments.

@@ -70,12 +70,18 @@ def _fit_frame(self, X):
         if self.strategy == "mean":
             avg = X.mean(axis=0).values
         elif self.strategy == "median":
-            avg = X.quantile().values
+            avg = [np.median(X[col].dropna()) for col in X.columns]

Member

I believe this will eagerly compute the values, thanks to np.median. Since that's done in a list comprehension, we'd end up executing the graph for X once per column. We want to delay computation till the end.

I also think this will end up pulling all the data for a column into a single ndarray, to do the median, which we also want to avoid.


Contributor Author

How about using delayed here?

avg = [dask.delayed(np.median(X[col].dropna())) for col in X.columns]
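One caveat about the snippet above, shown as a small sketch: `dask.delayed(f(x))` calls `f` immediately and wraps only its result, while `dask.delayed(f)(x)` defers the call until `compute`. The helper `tracked_median` and the sample data are hypothetical and exist only to count when the function actually runs:

```python
import dask

calls = []  # records every real invocation of the function


def tracked_median(values):
    """Plain-Python median that logs each time it is executed."""
    calls.append(len(values))
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2


data = [3.0, 1.0, 2.0]

# Wrapping the *result*: the function has already run by the time delayed sees it.
eager = dask.delayed(tracked_median(data))
calls_after_eager = len(calls)

# Wrapping the *function*: the call is only recorded in the task graph.
lazy = dask.delayed(tracked_median)(data)
calls_after_lazy = len(calls)

# The deferred call runs here.
result = lazy.compute()
calls_after_compute = len(calls)
```

Even in the deferred form, each task would still receive a whole column at once, so the memory concern raised earlier in the thread is not addressed by `delayed` alone.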

for col in X.columns:
    val_counts = X[col].value_counts().reset_index()
    if isinstance(X, dd.DataFrame):
        x = val_counts.to_dask_array(lengths=True)

Member

Do we need lengths here? This also triggers a computation.


Contributor Author

This is needed to compute the chunk sizes. Any suggestions on how to avoid it? Thanks.

Successfully merging this pull request may close these issues.

SimpleImputer.fit strange behavior with median and most_frequent strategies
2 participants