
typo in DaskScikitLearnBase and confusion regarding sample_weight #6237

Closed
pseudotensor opened this issue Oct 14, 2020 · 17 comments · Fixed by #6240
@pseudotensor
Contributor

There is some confusion in naming: the non-Dask API uses "sample_weight", but the Dask API uses "sample_weights", even though the docstring in the Dask code shows "sample_weight" as in the non-Dask case.

class DaskScikitLearnBase(XGBModel):
    '''Base class for implementing scikit-learn interface with Dask'''

    _client = None

    # pylint: disable=arguments-differ
    def fit(self, X, y,
            sample_weights=None,
            eval_set=None,
            sample_weight_eval_set=None,
            verbose=True):
        '''Fit the regressor.

        Parameters
        ----------
        X : array_like
            Feature matrix
        y : array_like
            Labels
        sample_weight : array_like
            instance weights
        eval_set : list, optional
            A list of (X, y) tuple pairs to use as validation sets, for which
            metrics will be computed.
            Validation metrics will help us track the performance of the model.
        sample_weight_eval_set : list, optional
            A list of the form [L_1, L_2, ..., L_n], where each L_i is a list
            of group weights on the i-th validation set.
        verbose : bool
            If `verbose` and an evaluation set is used, writes the evaluation
            metric measured on the validation set to stderr.'''
        raise NotImplementedError

In the same file one also has:

sample_weights: list of arrays
        The weight vector for validation data.

So it reads as if it is a list of arrays, but what would that mean? I think it should be a single array, just like in the non-Dask case.

Perhaps there is some confusion between the fact that the sample_weight for X, y is a single array, while the sample weights for eval_set are one array per validation set provided.
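To illustrate the distinction the last paragraph draws, here is a minimal sketch (`check_fit_weights` is a hypothetical helper, not xgboost API) of the shapes one would expect: `sample_weight` is a single vector aligned with `X`, while `sample_weight_eval_set` carries one vector per `(X, y)` pair in `eval_set`:

```python
def check_fit_weights(X, y, sample_weight=None, eval_set=None,
                      sample_weight_eval_set=None):
    # Hypothetical helper, not part of xgboost: encodes the semantics the
    # docstring describes. sample_weight is ONE vector aligned with X;
    # sample_weight_eval_set holds one vector per (X, y) pair in eval_set.
    if sample_weight is not None and len(sample_weight) != len(X):
        raise ValueError("sample_weight must align with X")
    if sample_weight_eval_set is not None:
        if eval_set is None or len(sample_weight_eval_set) != len(eval_set):
            raise ValueError("need one weight vector per eval_set pair")
        for (Xv, _), w in zip(eval_set, sample_weight_eval_set):
            if len(w) != len(Xv):
                raise ValueError("eval weights must align with their X")
    return True

X, y = [[0.0]] * 10, [0] * 10
Xv, yv = [[0.0]] * 4, [0] * 4
check_fit_weights(X, y,
                  sample_weight=[1.0] * 10,            # single array for X
                  eval_set=[(Xv, yv)],
                  sample_weight_eval_set=[[1.0] * 4])  # list: one per set
```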

@pseudotensor pseudotensor changed the title typo in DaskScikitLearnBase typo in DaskScikitLearnBase and confusion regarding sample_weight Oct 14, 2020
@pseudotensor
Contributor Author

pseudotensor commented Oct 14, 2020

A related confusion, is that the normal xgboost scikit-learn API is: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.fit

So fit takes early_stopping_rounds, etc.

Is it correct that the Dask API has no such option and so cannot do early stopping? That would be a critical missing feature.

I understand the docs say dask has no callback testing: https://xgboost.readthedocs.io/en/latest/tutorials/dask.html#limitations

But does this mean no eval_metric or early stopping is possible?

Are we supposed to use Dask for distributed multi-GPU xgboost training? Or is that all deprecated, and one can just pass a dask_cudf frame into the normal model?

Also, in general, it seems like needless overhead to maintain separate APIs for Dask and non-Dask. You know whether the incoming frame is dask_cudf, dask, or neither, so you should be able to use a single API. This would make things a lot easier.

Related: dask/dask-xgboost#38

@hcho3
Collaborator

hcho3 commented Oct 14, 2020

@pseudotensor The early stopping support for the Dask API is currently in progress. We had to redesign the callback mechanism to accommodate the Dask API (#6199).

it seems like API overhead to maintain separate APIs for dask and non-dask. You know if the frame incoming is dask_cudf, dask, or not. So you should be able to use a single API. This would make using it a lot easier.

Not quite, since the user has to pass the Dask client object to the Dask API. This is because the user has wide latitude in configuring a Dask cluster.

@pseudotensor
Contributor Author

pseudotensor commented Oct 14, 2020

@pseudotensor The early stopping support for the Dask API is currently in progress. We had to redesign the callback mechanism to accommodate the Dask API (#6199).

it seems like API overhead to maintain separate APIs for dask and non-dask. You know if the frame incoming is dask_cudf, dask, or not. So you should be able to use a single API. This would make using it a lot easier.

Not quite, since the user has to pass the Dask client object to the Dask API. This is because the user has wide latitude in configuring a Dask cluster.

I would still say that a single API, just calling model.fit()/model.predict() either way, is highly beneficial. If "client" must be passed, it can be an extra optional kwarg (default None) that is checked whenever a Dask frame is passed.
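The suggestion above could be sketched roughly as follows (`fit_single_api` is a hypothetical free function, not xgboost code): detect Dask collections by their defining module and require the optional client kwarg only on that path:

```python
def fit_single_api(X, y, client=None):
    # Hypothetical sketch of a single fit entry point, not xgboost code:
    # Dask collections (dask.array, dask.dataframe, dask_cudf wrappers)
    # live under the "dask" package, so dispatch on the type's module and
    # demand a client only when the distributed path is taken.
    if type(X).__module__.split(".")[0] == "dask":
        if client is None:
            raise ValueError("a dask.distributed Client is required "
                             "when a Dask collection is passed")
        return "distributed fit"
    return "local fit"

# Plain in-memory data takes the local path; no client needed.
assert fit_single_api([[1.0]], [0]) == "local fit"
```

Whether module-based sniffing is robust enough (dask_cudf frames subclass dask.dataframe types, for example) is exactly the kind of design question a unified API would have to settle.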

@pseudotensor
Contributor Author

@pseudotensor The early stopping support for the Dask API is currently in progress. We had to redesign the callback mechanism to accommodate the Dask API (#6199).

That's great. I see it's merged, so it will be in 1.3.0?

@hcho3
Collaborator

hcho3 commented Oct 14, 2020

@pseudotensor So DaskXGBRegressor can automatically fetch the client via get_client() from dask.distributed if none is given by the user.

In general, I agree that users should be able to use DaskXGBRegressor just like XGBRegressor. Any major API mismatch should be avoided.

That's great. I see it's merged, so it will be in 1.3.0?

Yes, we aim to have a working callback mechanism for Dask in 1.3.0 release.
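The lazy fallback described above can be sketched like this (a minimal sketch; `resolve_client` and `get_default` are hypothetical names, with `get_default` standing in for `distributed.get_client()` so the snippet runs without a Dask cluster):

```python
def resolve_client(client=None, get_default=None):
    # Hypothetical sketch, not xgboost code: prefer an explicitly passed
    # client; otherwise ask the environment for the current one, the way
    # distributed.get_client() finds the client when running inside a
    # cluster. get_default is injected here so this runs without Dask.
    if client is not None:
        return client
    if get_default is None:
        raise RuntimeError("no client passed and no scheduler found")
    return get_default()

# With an explicit client, the environment lookup is never consulted.
assert resolve_client(client="explicit") == "explicit"
# Without one, fall back to whatever the environment provides.
assert resolve_client(get_default=lambda: "default") == "default"
```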

@pseudotensor
Contributor Author

pseudotensor commented Oct 14, 2020

@pseudotensor So DaskXGBRegressor can automatically fetch the client via get_client() from dask.distributed if none is given by the user.

In general, I agree that users should be able to use DaskXGBRegressor just like XGBRegressor. Any major API mismatch should be avoided.

That's great. I see it's merged, so it will be in 1.3.0?

Yes, we aim to have a working callback mechanism for Dask in 1.3.0 release.

Ya, and I mean even further: there really only needs to be XGBRegressor/XGBClassifier. I can't see why we need Dask-specialized versions; from a usability perspective it's extra confusion. Setting up the Dask client and the frames involves specialized options, but fit/predict/etc. do not need those, I think.

@pseudotensor
Contributor Author

pseudotensor commented Oct 14, 2020

(For context the primary "sample_weight(s)" issue is still there.)

@hcho3
Collaborator

hcho3 commented Oct 14, 2020

We can consider merging DaskXGBRegressor with XGBRegressor once we have feature parity between the two. The Dask API must support the callback mechanism first, and that's currently in progress.

@hcho3 hcho3 self-assigned this Oct 14, 2020
@hcho3
Collaborator

hcho3 commented Oct 14, 2020

Assigning myself to this issue, to address any mismatch between the DaskXGBRegressor and XGBRegressor APIs.

@pseudotensor
Contributor Author

Thanks. I'm not even sure things function as they are. At least PyCharm complains when I pass sample_weights, probably because one of the base classes uses sample_weight. There is no way to get PyCharm to accept it, plural or not.

@pseudotensor
Contributor Author

pseudotensor commented Oct 15, 2020

@hcho3 BTW, one reason it's not a good idea to have the model accept "client" is that the client is not serializable, so it will break anything that relies on pickle. Using the context-manager approach as above, or a lazy getter from Dask, is probably best.

That is, I experienced this just now: following https://github.com/dmlc/xgboost/blob/master/demo/dask/sklearn_gpu_training.py#L22 and adding a client attribute leads to a model that cannot be pickled.

@trivialfis
Member

For the skl interface, accepting an optional client parameter might be necessary, as the client object carries out most of the async switches. If a wrong client is returned by get_client, the rest will be completely broken.

@trivialfis
Member

The typo, however, should be fixed. @hcho3 I will submit a PR for that after getting past the overflow issue.

@hcho3
Collaborator

hcho3 commented Oct 15, 2020

@trivialfis What should we do about client not being serializable?

@trivialfis
Member

If "client" must be passed, it can be an extra optional kwarg (default None) that is asserted on in case dask frame is passed.

I tried that before settling on the current interface and went through some discussions; it is not possible at the moment. But I agree that's something we can work toward in the future: a unified interface with different backends. Right now, getting the Dask interface to feature parity with single-node is the priority. Sorry for the inconvenience.

@trivialfis
Member

trivialfis commented Oct 15, 2020

What should we do about client not being serializable?

  • Use train and predict functions. Or
  • Don't set the optional client object for skl interface when not needed. Or
  • Use get_booster to obtain the real model. Or
  • Set it to None before serialization.
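The last option can be sketched with __getstate__ (`_UnpicklableClient` stands in for a real dask.distributed.Client, and `DaskModelSketch` is a hypothetical estimator, not xgboost code):

```python
import pickle

class _UnpicklableClient:
    # Stand-in for dask.distributed.Client (illustrative only).
    def __reduce__(self):
        raise TypeError("cannot pickle a Client")

class DaskModelSketch:
    # Hypothetical estimator showing "set it to None before
    # serialization": the client is dropped in __getstate__, so the
    # rest of the model round-trips through pickle.
    def __init__(self):
        self.client = _UnpicklableClient()
        self.booster = "trained-booster"

    def __getstate__(self):
        state = self.__dict__.copy()
        state["client"] = None   # drop the unpicklable handle
        return state

restored = pickle.loads(pickle.dumps(DaskModelSketch()))
assert restored.client is None
assert restored.booster == "trained-booster"
```

After unpickling, the client would need to be re-supplied (or looked up lazily), which is why this pairs naturally with the get_client fallback discussed earlier in the thread.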
