
typo in DaskScikitLearnBase and confusion regarding sample_weight #6237

Closed
pseudotensor opened this issue Oct 14, 2020 · 17 comments · Fixed by #6240
@pseudotensor
Contributor

There is some confusion in naming: the non-Dask API uses "sample_weight", but the Dask API uses "sample_weights", even though the docstring in the Dask code shows "sample_weight" as in the non-Dask case.

class DaskScikitLearnBase(XGBModel):
    '''Base class for implementing scikit-learn interface with Dask'''

    _client = None

    # pylint: disable=arguments-differ
    def fit(self, X, y,
            sample_weights=None,
            eval_set=None,
            sample_weight_eval_set=None,
            verbose=True):
        '''Fit the regressor.

        Parameters
        ----------
        X : array_like
            Feature matrix
        y : array_like
            Labels
        sample_weight : array_like
            instance weights
        eval_set : list, optional
            A list of (X, y) tuple pairs to use as validation sets, for which
            metrics will be computed.
            Validation metrics will help us track the performance of the model.
        sample_weight_eval_set : list, optional
            A list of the form [L_1, L_2, ..., L_n], where each L_i is a list
            of group weights on the i-th validation set.
        verbose : bool
            If `verbose` and an evaluation set is used, writes the evaluation
            metric measured on the validation set to stderr.'''
        raise NotImplementedError

In the same file one also has:

sample_weights: list of arrays
        The weight vector for validation data.

So it reads as if it is a list of arrays, but what would that mean? I think it should be a single array, just like in the non-Dask case.

Perhaps there is some confusion between the fact that the sample_weight for X, y is a single array, while the sample weights for eval_set are one array per validation set provided.
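To illustrate the distinction the last paragraph draws, here is a minimal sketch (`check_fit_weights` is a hypothetical helper, not xgboost API) of the shapes one would expect: `sample_weight` is a single vector aligned with `X`, while `sample_weight_eval_set` carries one vector per `(X, y)` pair in `eval_set`:

```python
def check_fit_weights(X, y, sample_weight=None, eval_set=None,
                      sample_weight_eval_set=None):
    # Hypothetical helper, not part of xgboost: encodes the semantics the
    # docstring describes. sample_weight is ONE vector aligned with X;
    # sample_weight_eval_set holds one vector per (X, y) pair in eval_set.
    if sample_weight is not None and len(sample_weight) != len(X):
        raise ValueError("sample_weight must align with X")
    if sample_weight_eval_set is not None:
        if eval_set is None or len(sample_weight_eval_set) != len(eval_set):
            raise ValueError("need one weight vector per eval_set pair")
        for (Xv, _), w in zip(eval_set, sample_weight_eval_set):
            if len(w) != len(Xv):
                raise ValueError("eval weights must align with their X")
    return True

X, y = [[0.0]] * 10, [0] * 10
Xv, yv = [[0.0]] * 4, [0] * 4
check_fit_weights(X, y,
                  sample_weight=[1.0] * 10,            # single array for X
                  eval_set=[(Xv, yv)],
                  sample_weight_eval_set=[[1.0] * 4])  # list: one per set
```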

@pseudotensor pseudotensor changed the title typo in DaskScikitLearnBase typo in DaskScikitLearnBase and confusion regarding sample_weight Oct 14, 2020
@pseudotensor
Contributor Author

pseudotensor commented Oct 14, 2020

A related confusion, is that the normal xgboost scikit-learn API is: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.fit

So fit takes early_stopping_rounds, etc.

Is it correct that the Dask API has no such option and so cannot do early stopping? That would be a critical missing feature.

I understand the docs say dask has no callback testing: https://xgboost.readthedocs.io/en/latest/tutorials/dask.html#limitations

But does this mean no eval_metric or early stopping is possible?

Are we supposed to use Dask for distributed multi-GPU xgboost training? Or is that all deprecated, and one can just pass a dask_cudf frame into the normal model?

Also, in general, it seems like needless overhead to maintain separate APIs for Dask and non-Dask. You know whether the incoming frame is dask_cudf, dask, or neither, so you should be able to use a single API. This would make things a lot easier.

Related: dask/dask-xgboost#38

@hcho3
Collaborator

hcho3 commented Oct 14, 2020

@pseudotensor The early stopping support for the Dask API is currently in progress. We had to redesign the callback mechanism to accommodate the Dask API (#6199).

it seems like API overhead to maintain separate APIs for dask and non-dask. You know if the frame incoming is dask_cudf, dask, or not. So you should be able to use a single API. This would make using it a lot easier.

Not quite, since the user has to pass the Dask client object to the Dask API. This is because the user has wide latitude in configuring a Dask cluster.

@pseudotensor
Contributor Author

pseudotensor commented Oct 14, 2020

@pseudotensor The early stopping support for the Dask API is currently in progress. We had to redesign the callback mechanism to accommodate the Dask API (#6199).

it seems like API overhead to maintain separate APIs for dask and non-dask. You know if the frame incoming is dask_cudf, dask, or not. So you should be able to use a single API. This would make using it a lot easier.

Not quite, since the user has to pass the Dask client object to the Dask API. This is because the user has wide latitude in configuring a Dask cluster.

I would still say that a single API, just calling model.fit()/model.predict() either way, is highly beneficial. If "client" must be passed, it can be an extra optional kwarg (default None) that is checked whenever a Dask frame is passed.
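The suggestion above could be sketched roughly as follows (`fit_single_api` is a hypothetical free function, not xgboost code): detect Dask collections by their defining module and require the optional client kwarg only on that path:

```python
def fit_single_api(X, y, client=None):
    # Hypothetical sketch of a single fit entry point, not xgboost code:
    # Dask collections (dask.array, dask.dataframe, dask_cudf wrappers)
    # live under the "dask" package, so dispatch on the type's module and
    # demand a client only when the distributed path is taken.
    if type(X).__module__.split(".")[0] == "dask":
        if client is None:
            raise ValueError("a dask.distributed Client is required "
                             "when a Dask collection is passed")
        return "distributed fit"
    return "local fit"

# Plain in-memory data takes the local path; no client needed.
assert fit_single_api([[1.0]], [0]) == "local fit"
```

Whether module-based sniffing is robust enough (dask_cudf frames subclass dask.dataframe types, for example) is exactly the kind of design question a unified API would have to settle.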

@pseudotensor
Contributor Author

@pseudotensor The early stopping support for the Dask API is currently in progress. We had to redesign the callback mechanism to accommodate the Dask API (#6199).

That's great. I see it's merged, so it will be in 1.3.0?

@hcho3
Collaborator

hcho3 commented Oct 14, 2020

@pseudotensor So DaskXGBRegressor can automatically fetch the client via get_client() from dask.distributed if none is given by the user.

In general, I agree that users should be able to use DaskXGBRegressor just like XGBRegressor. Any major API mismatch should be avoided.

That's great. I see it's merged, so it will be in 1.3.0?

Yes, we aim to have a working callback mechanism for Dask in 1.3.0 release.
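The lazy fallback described above can be sketched like this (a minimal sketch; `resolve_client` and `get_default` are hypothetical names, with `get_default` standing in for `distributed.get_client()` so the snippet runs without a Dask cluster):

```python
def resolve_client(client=None, get_default=None):
    # Hypothetical sketch, not xgboost code: prefer an explicitly passed
    # client; otherwise ask the environment for the current one, the way
    # distributed.get_client() finds the client when running inside a
    # cluster. get_default is injected here so this runs without Dask.
    if client is not None:
        return client
    if get_default is None:
        raise RuntimeError("no client passed and no scheduler found")
    return get_default()

# With an explicit client, the environment lookup is never consulted.
assert resolve_client(client="explicit") == "explicit"
# Without one, fall back to whatever the environment provides.
assert resolve_client(get_default=lambda: "default") == "default"
```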

@pseudotensor
Contributor Author

pseudotensor commented Oct 14, 2020

@pseudotensor So DaskXGBRegressor can automatically fetch the client via get_client() from dask.distributed if none is given by the user.

In general, I agree that users should be able to use DaskXGBRegressor just like XGBRegressor. Any major API mismatch should be avoided.

That's great. I see it's merged, so it will be in 1.3.0?

Yes, we aim to have a working callback mechanism for Dask in 1.3.0 release.

Ya, and I mean even further: there really only needs to be XGBRegressor/XGBClassifier. I can't see why we need Dask-specialized versions; from a usability perspective it's extra confusion. Setting up the Dask client and the frames involves specialized options, but fit/predict/etc. do not need those, I think.

@pseudotensor
Contributor Author

pseudotensor commented Oct 14, 2020

(For context the primary "sample_weight(s)" issue is still there.)

@hcho3
Collaborator

hcho3 commented Oct 14, 2020

We can consider merging DaskXGBRegressor with XGBRegressor once we have feature parity between the two. The Dask API must support the callback mechanism first, and that's currently in progress.

@hcho3 hcho3 self-assigned this Oct 14, 2020
@hcho3
Collaborator

hcho3 commented Oct 14, 2020

Assigning myself to this issue, to address any mismatch between the DaskXGBRegressor and XGBRegressor APIs.

@pseudotensor
Contributor Author

Thanks. I'm not even sure things function as they are. At least PyCharm complains when I pass sample_weights, probably because one of the base classes uses sample_weight. There is no way to get PyCharm to accept it, plural or not.

@pseudotensor
Contributor Author

pseudotensor commented Oct 15, 2020

@hcho3 BTW, one reason it's not a good idea to have the model accept "client" is that the client is not serializable, so it will break anything that relies on pickle. Using the context-manager approach as above, or a lazy getter from Dask, is probably best.

That is, I experienced this just now: following https://github.com/dmlc/xgboost/blob/master/demo/dask/sklearn_gpu_training.py#L22 and adding a client attribute leads to a model that cannot be pickled.

@trivialfis
Member

For the skl interface, accepting an optional client parameter might be necessary, as the client object carries out most of the async switches. If a wrong client is returned by get_client, the rest will be completely broken.

@trivialfis
Member

The typo, however, should be fixed. @hcho3 I will submit a PR for that after getting past the overflow issue.

@hcho3
Collaborator

hcho3 commented Oct 15, 2020

@trivialfis What should we do about client not being serializable?

@trivialfis
Member

If "client" must be passed, it can be an extra optional kwarg (default None) that is asserted on in case dask frame is passed.

I tried that before settling on the current interface and went through some discussions; it is not possible at the moment. But I agree that's something we can work toward in the future: a unified interface with different backends. Right now, getting the Dask interface to feature parity with single-node is the priority. Sorry for the inconvenience.

@trivialfis
Member

trivialfis commented Oct 15, 2020

What should we do about client not being serializable?

  • Use train and predict functions. Or
  • Don't set the optional client object for skl interface when not needed. Or
  • Use get_booster to obtain the real model. Or
  • Set it to None before serialization.
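The last option can be sketched with __getstate__ (`_UnpicklableClient` stands in for a real dask.distributed.Client, and `DaskModelSketch` is a hypothetical estimator, not xgboost code):

```python
import pickle

class _UnpicklableClient:
    # Stand-in for dask.distributed.Client (illustrative only).
    def __reduce__(self):
        raise TypeError("cannot pickle a Client")

class DaskModelSketch:
    # Hypothetical estimator showing "set it to None before
    # serialization": the client is dropped in __getstate__, so the
    # rest of the model round-trips through pickle.
    def __init__(self):
        self.client = _UnpicklableClient()
        self.booster = "trained-booster"

    def __getstate__(self):
        state = self.__dict__.copy()
        state["client"] = None   # drop the unpicklable handle
        return state

restored = pickle.loads(pickle.dumps(DaskModelSketch()))
assert restored.client is None
assert restored.booster == "trained-booster"
```

After unpickling, the client would need to be re-supplied (or looked up lazily), which is why this pairs naturally with the get_client fallback discussed earlier in the thread.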
