Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Fix empty partition prediction with ParallelPostFit #912

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

VibhuJawa
Copy link
Collaborator

@VibhuJawa VibhuJawa commented Mar 25, 2022

Thiis PR fixes #911

@TomAugspurger
Copy link
Member

Thanks @VibhuJawa. When looking at the traceback in #911, I see that skearn's check_array takes an ensure_min_samples parameter. If we pass ensure_min_samples=False there, does stuff go through properly?

@VibhuJawa
Copy link
Collaborator Author

Thanks for reviewing the issue tom.

When looking at the traceback in #911, I see that skearn's check_array takes an ensure_min_samples parameter. If we pass ensure_min_samples=False there, does stuff go through properly?

So i tried exploring a clean way to expose the self._validate_data parameter but came up with nothing. Open to any ideas on that front.

I think the problem is that each family of models in sklearn calls it with different parameters, (See below) and I am also not sure if all models will just work even if we can some how coerce it to accept them. (See related discussion here) .

  1. _base_chain

  2. naive_bayes

  3. kmeans

  4. _kernel_pca

Please let me know if the approach i am taking in this PR is not feasible. I will try to explore other ways to go about solving this problem.

@mmccarty
Copy link
Member

@TomAugspurger - Any further feedback or guidance here? I don't see a way to expose check_array's kwargs without changes to sklearn. IMO, it seems reasonable to handle this case in dask-ml.

cc @jrbourbeau @betatim for vis

@betatim
Copy link

betatim commented Sep 21, 2022

I think what is happening is that predict() (and co) are being called with an empty input that contains zero samples. It seems sensibly for the scikit-learn estimators to consider that an error. So I agree with Mike that this is something that dask-ml should handle. Probably by not calling predict(), transform() etc when a partition is empty and instead returning what ever is the expected result of making a prediction on an empty array (I'm not sure what this should be, None also an empty array?).

@VibhuJawa
Copy link
Collaborator Author

VibhuJawa commented Sep 21, 2022

instead returning what ever is the expected result of making a prediction on an empty array (I'm not sure what this should be, None also an empty array?).

In this PR,

  1. If the output is supposed to be arrays , I return empty arrays (for both sparse and dense arrays)

  2. If the output is supposed to be dataframe (or dataframe like objec), I return an empty dataframe like objects

dask-ml/dask_ml/wrappers.py

Lines 661 to 688 in 28b97e0

if hasattr(output_meta, "__array_function__"):
if len(output_meta.shape) == 1:
shape = 0
else:
shape = list(output_meta.shape)
shape[0] = 0
ar = np.zeros(
shape=shape,
dtype=output_meta.dtype,
like=output_meta,
)
return ar
elif "scipy.sparse" in type(output_meta).__module__:
# sparse matrices dont support
# `like` due to non implimented __array_function__
# Refer https://github.com/scipy/scipy/issues/10362
# Note below works for both cupy and scipy sparse matrices
# TODO: REMOVE code duplication
if len(ar.shape) == 1:
shape = 0
else:
shape = list(ar.shape)
shape[0] = 0
ar = type(output_meta)(shape, dtype=output_meta.dtype)
return ar
elif hasattr(output_meta, "iloc"):
return output_meta.iloc[:0, :]

This also matches what cuML does (which returns an empty series) . See below:

>>> type(reg)
<class 'cuml.linear_model.logistic_regression.LogisticRegression'>
>>> reg.predict(X_new.iloc[:0])
Series([], dtype: float32)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Dask-ML ParallelPostFit prediction fails on empty partitions
4 participants