Support for non-dask arrays for HyperbandSearchCV #751
Conversation
Changed the async def _fit function to allow Hyperband to work with non-dask arrays
Thanks for the PR!

Could you also implement a short test for this? The custom data/model doesn't have to be complex, something like:

import numpy as np
from sklearn.base import BaseEstimator

class CustomDataLoader:
    def __len__(self):
        return 100
    ...

class CustomModel(BaseEstimator):
    def _partial_fit(self, X, y=None, **kwargs):
        assert isinstance(X, CustomDataLoader)
        return self

    fit = partial_fit = _partial_fit

    def score(self, X, y=None):
        assert isinstance(X, CustomDataLoader)
        return np.random.uniform()
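For reference, a rough sketch of how such a test might be wired together. This is not from the thread: the test name is hypothetical, CustomModel is assumed to take an alpha hyperparameter in its __init__ so there is something to search over, and the gen_cluster harness from distributed is assumed, as elsewhere in the test suite.

import numpy as np
from distributed.utils_test import gen_cluster
from dask_ml.model_selection import HyperbandSearchCV

@gen_cluster(client=True)
async def test_non_dask_data(c, s, a, b):
    # `alpha` is an assumed hyperparameter of CustomModel, added so
    # HyperbandSearchCV has a parameter space to sample from.
    params = {"alpha": np.logspace(-3, 0, num=10)}
    search = HyperbandSearchCV(CustomModel(), params, max_iter=9)
    await search.fit(CustomDataLoader(), CustomDataLoader())
    assert isinstance(search.best_estimator_, CustomModel)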
train_eg = await client.gather(client.map(len, y_train))
msg = "[CV%s] For training there are between %d and %d examples in each chunk"
logger.info(msg, prefix, min(train_eg), max(train_eg))
Why won't this code work?

train_eg = ...
msg = ...
if len(train_eg):
    logger.info(msg, prefix, min(train_eg), max(train_eg))

This avoids min([]) as mentioned in #748 (comment). Then, I think the if-statement could be moved inside get_futures:

def get_futures(partial_fit_calls):
    if not isinstance(X_train, da.Array):
        return X_train, y_train
    ...  # existing implementation
You are right, I was not careful when I read your comment; that would work.

The reason I did not implement it that way is that checking if len(train_eg): (or, even shorter, if train_eg:) is equivalent to checking isinstance(X_train, da.Array), because train_eg will be empty only if X_train is not a da.Array. This means that with your implementation these lines still execute even when they are not needed:

X_train = sorted(futures_of(X_train), key=lambda f: f.key)
y_train = sorted(futures_of(y_train), key=lambda f: f.key)
assert len(X_train) == len(y_train)
train_eg = await client.gather(client.map(len, y_train))

Finally, the reason it might be better not to move the if-statement inside the get_futures function is that doing so forces the condition to be checked on every call, which is not ideal for dask collections with a large number of partitions.
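For what it's worth, a condensed sketch of the shape being argued for here, pieced together from the quoted lines (it is not the PR's actual diff): hoist the type check out of get_futures so it runs exactly once.

# Decide once, up front, whether the input is a dask array.
if isinstance(X_train, da.Array):
    X_train = sorted(futures_of(X_train), key=lambda f: f.key)
    y_train = sorted(futures_of(y_train), key=lambda f: f.key)
    assert len(X_train) == len(y_train)
    train_eg = await client.gather(client.map(len, y_train))
else:
    train_eg = []

def get_futures(partial_fit_calls):
    # No per-call isinstance check needed: X_train/y_train were
    # already converted to futures above when appropriate.
    ...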
I think the logging message should be issued if the dataset object has an implementation of __len__. The computation of train_eg only happens once (train_eg is a list of ints).

I don't like two separate definitions of get_futures, especially if they both return training data.
train_eg = await client.gather(client.map(len, y_train))
msg = "[CV%s] For training there are between %d and %d examples in each chunk"
logger.info(msg, prefix, min(train_eg), max(train_eg))
if hasattr(X_train, 'npartitions'):
Can this if-statement be changed to if isinstance(X_train, da.Array)?
The reason I don't want to restrict the if-statement to isinstance(X_train, da.Array) is that, potentially, any custom data structure like the hypothetical CustomFrame could have a dask-like API by implementing npartitions and __dask_graph__.

In that case, for example, I (or any other user) could later extend CustomFrame to work with data larger than memory, and it would still work with Hyperband even though it is not a da.Array, because the API is compatible.

With this in mind, maybe a good compromise could be if dask.is_dask_collection(X_train)?
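As a quick illustration of why this check is broader (a sketch, not from the PR; CustomFrame is the hypothetical structure mentioned above): in dask versions of this era, is_dask_collection essentially asks whether __dask_graph__() returns a non-None graph, so any object implementing the protocol passes.

import dask
import dask.array as da
import numpy as np

class CustomFrame:
    # Minimal stand-in: any non-None task graph satisfies the check.
    def __dask_graph__(self):
        return {"x": 1}

print(dask.is_dask_collection(da.ones(10, chunks=5)))  # True
print(dask.is_dask_collection(CustomFrame()))          # True
print(dask.is_dask_collection(np.ones(10)))            # False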
if dask.is_dask_collection(X_train)
I think that's a better choice (though @TomAugspurger might have more input).
Yeah, I think is_dask_collection is better.
Even just testing with a NumPy array or pandas DataFrame would be good.
- changed the array check to use dask.is_dask_collection
- added a test in tests/model_selection/test_hyperband_non_daskarray.py
@stsievert I implemented the changes we discussed. Cheers
@stsievert do you have another chance to look at this?
I'm surprised this Hyperband/Incremental didn't support pandas DataFrames already. Then again, that shouldn't be surprising: dask DataFrames weren't supported until #701, and I have verified that the test in this PR fails on v1.7.0.

This PR LGTM, apart from a couple of style nits and one implementation detail.
# Shuffle blocks going forward to get uniform-but-random access
while partial_fit_calls >= len(order):
    L = list(range(len(X_train)))
    rng.shuffle(L)
    order.extend(L)
j = order[partial_fit_calls]
return X_train[j], y_train[j]
### end addition ###
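A toy, standalone illustration of the shuffling policy in the block above (my own sketch; random.Random stands in for the rng in the real code): order grows one full shuffled round at a time, so every block is visited once per round, but in random order within each round.

import random

rng = random.Random(0)
order = []      # order in which training blocks are visited
n_blocks = 3    # stand-in for len(X_train)

for partial_fit_calls in range(7):
    while partial_fit_calls >= len(order):
        round_ = list(range(n_blocks))
        rng.shuffle(round_)
        order.extend(round_)
    print(partial_fit_calls, "->", order[partial_fit_calls])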
Style nit: could this comment be removed?
@@ -218,13 +220,20 @@ def get_futures(partial_fit_calls):
    This function handles that policy internally, and also controls random
    access to training data.
    """
    if dask.is_dask_collection(y_train):
I think this should only check X_train, because y_train is an optional argument.
I've done some tracing and it appears that passing y=None into search.fit isn't supported? That's unrelated to this PR.
# Order by which we process training data futures
order = []
### start addition ###
Style nit: can this comment be removed?
Hi @stsievert, I agree with you; in principle the check should be done on X_train. EDIT: Done
@gen_cluster(client=True)
def test_pandas(c, s, a, b):
    X, y = make_classification(chunks=100)
    X, y = pd.DataFrame(X.compute()), pd.Series(y.compute())
Looks like the test is failing in the .compute function. Why not use from sklearn.datasets import make_classification?
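For comparison, a sketch of the suggested alternative: sklearn's make_classification returns plain NumPy arrays, so no .compute() call (and no dask collection in the test data) is needed.

import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=20, random_state=0)
X, y = pd.DataFrame(X), pd.Series(y)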
It's strange, it was working on my local machine. Let me try to change it. Maybe it's something to do with how I wrote the test function; I just tried to emulate the structure I found in test_hyperband.py.
@gen_cluster(client=True)
def test_pandas(c, s, a, b):
Could this test go in test_hyperband.py?
Yes, let me move it there.
…d dask make_classification to sklearn make_classification
Does anyone have any idea of what happened with the tests? Many thanks
I'm not sure, but it does look like there's a linting issue (both black and isort fail). Try making those lint changes and pushing; that will (likely) resolve the issue.
I've fixed the linting issues locally (and substantially reduced the computational size of the test), but the …
The main changes in this PR are:
- a change to the async def _fit function in dask_ml.model_selection._incremental.py to allow Hyperband to work with non-dask arrays
- a fix to the default test_size, which didn't work with pandas DataFrames