typo in DaskScikitLearnBase and confusion regarding sample_weight #6237
Comments
A related confusion: the normal xgboost scikit-learn API (https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.fit) has `fit` take `early_stopping_rounds`, etc. Is it correct that the Dask API has no such option and so cannot do early stopping? That would be a critical feature. I understand the docs say the Dask API has no callback support: https://xgboost.readthedocs.io/en/latest/tutorials/dask.html#limitations But does this mean no `eval_metric` or early stopping is possible? Are we supposed to use Dask for distributed multi-GPU xgboost training? Or is it all deprecated and one can just pass `dask_cudf` into the normal model?

Also, in general, it seems like API overhead to maintain separate APIs for Dask and non-Dask. You know whether the incoming frame is `dask_cudf`, `dask`, or neither, so you should be able to use a single API. This would make using it a lot easier.

Related: dask/dask-xgboost#38
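The parameter asymmetry described above can be sketched with two stub signatures. These are illustrative only, mirroring the linked docs at the time of this issue, not the actual xgboost implementation:

```python
# Illustrative stubs only: these mirror the parameter asymmetry described
# in the comment above; they are not the real xgboost code.

def sklearn_fit(X, y, sample_weight=None, eval_set=None,
                eval_metric=None, early_stopping_rounds=None, verbose=True):
    """Single-node XGBClassifier.fit() accepts early stopping directly."""
    return "fitted with early stopping" if early_stopping_rounds else "fitted"

def dask_fit(X, y, sample_weights=None, eval_set=None, verbose=True):
    """The Dask wrapper (as reported in this issue) lacked
    early_stopping_rounds and used the plural name 'sample_weights'."""
    return "fitted"
```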
@pseudotensor The early stopping support for the Dask API is currently in progress. We had to redesign the callback mechanism to accommodate the Dask API (#6199).
Not quite, since the user has to pass in the Dask client object into the Dask API. This is because the user has a wide latitude in configuring a Dask cluster.
Hmm, this suggests it's optional: https://github.com/dmlc/xgboost/blob/master/demo/dask/sklearn_gpu_training.py#L22
I would still say that a single API, just calling `model.fit()`/`model.predict()` no matter what, is highly beneficial. If "client" must be passed, it can be an extra optional kwarg (default `None`) that is checked in case a Dask frame is passed.
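A minimal sketch of the single-API idea above: one `fit` that dispatches on the input type, with `client` as an optional kwarg required only for Dask inputs. All names here are invented for illustration, not part of xgboost:

```python
# Hypothetical sketch only: a unified fit() that dispatches on input type.
# The function name and dispatch rule are invented for this example.

def unified_fit(X, y, client=None):
    # Detect a Dask collection by its module name rather than importing dask,
    # so the sketch stays dependency-free.
    is_dask = type(X).__module__.startswith("dask")
    if is_dask and client is None:
        raise ValueError("a Dask client is required for Dask inputs")
    backend = "dask" if is_dask else "local"
    return f"trained via {backend} backend"
```

The point is that the user-facing call site stays identical; only the backend selection changes.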
That's great. I see it's merged, so it will be in 1.3.0?
@pseudotensor So in general, I agree that users should be able to use
Yes, we aim to have a working callback mechanism for Dask in 1.3.0 release. |
Ya, and I mean even further: there really only needs to be XGBRegressor/XGBClassifier. I can't see why we need Dask-specialized versions; from a usability perspective it's extra confusion. Setting up the Dask client and the frames has specialized options, but `fit`/`predict`/etc. do not need those, I think.
(For context the primary "sample_weight(s)" issue is still there.) |
We can consider merging |
Assigning myself to this issue, to address any mismatch in |
Thanks. I'm not even sure things function as they are. At least PyCharm does not like me passing `sample_weights`, probably because one of the base classes uses `sample_weight`. There is no way to get PyCharm to accept it, plural or not.
@hcho3 BTW, a reason why it's not a good idea to have the model accept "client" is that the client is not serializable, so this will break anything that relies on pickle. Using the context-manager approach as above, or a lazy getter from Dask, is probably best. I experienced this just now: following https://github.com/dmlc/xgboost/blob/master/demo/dask/sklearn_gpu_training.py#L22 and adding a client attribute leads to a model that cannot be pickled.
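The pickling pitfall described above can be sketched without Dask at all, using a `threading.Lock` as a stand-in for the unpicklable client handle. One common workaround (an assumption here, not necessarily what xgboost does) is to drop the live handle in `__getstate__`:

```python
# Sketch of the pickling pitfall, with threading.Lock standing in for the
# unpicklable Dask Client. The Model class is invented for illustration.
import pickle
import threading

class Model:
    def __init__(self):
        self.client = None  # live handle; storing it naively breaks pickle

    def __getstate__(self):
        # Drop the live handle so the fitted model stays picklable; the
        # caller reattaches (or lazily looks up) a client after unpickling.
        state = self.__dict__.copy()
        state["client"] = None
        return state

m = Model()
m.client = threading.Lock()  # unpicklable, like a dask.distributed.Client
restored = pickle.loads(pickle.dumps(m))  # succeeds: __getstate__ dropped it
```

Without the `__getstate__` hook, `pickle.dumps(m)` raises a `TypeError` because the lock (like a live client connection) cannot be serialized.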
For the skl interface, accepting an optional client parameter might be necessary, as the |
The typo, however, should be fixed. @hcho3 I will submit a PR for that after getting around the overflow issue.
@trivialfis What should we do about |
I tried that before arriving at the current interface and went through some discussions. The result: it's not possible at the moment. But I agree that's something we can work toward in the future: a unified interface with different backends. Right now, getting the Dask interface to feature parity with single-node training is the priority. Sorry for the inconvenience.
|
There is some confusion in naming: the non-Dask API uses "sample_weight" but the Dask API uses "sample_weights", even though the docstrings for the Dask API show "sample_weight" like the non-Dask case.
In same file one also has:
As if it were a list of arrays, but what is the meaning of that? I think it should be just like the non-Dask case: a single array.
Perhaps there is some confusion between the fact that the sample_weight for (X, y) is a single array, while the sample weights for eval_set are one array per eval set provided.
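The distinction described above can be sketched as a shape check: one weight array for the training data, but one weight array per `(X, y)` pair in `eval_set`. The function and the `sample_weight_eval_set` name are used here for illustration of the intended semantics, not as the confirmed Dask API:

```python
# Illustrative sketch of the intended semantics: sample_weight pairs with
# (X, y), while eval-set weights are a list with one entry per eval pair.
# Plain lists stand in for arrays; names are illustrative.

def check_fit_args(X, y, sample_weight=None, eval_set=None,
                   sample_weight_eval_set=None):
    if sample_weight is not None:
        # one weight per training row
        assert len(sample_weight) == len(y)
    if eval_set is not None and sample_weight_eval_set is not None:
        # one weight array per (X_eval, y_eval) pair
        assert len(sample_weight_eval_set) == len(eval_set)
        for (X_eval, y_eval), weights in zip(eval_set, sample_weight_eval_set):
            assert len(weights) == len(y_eval)
    return True
```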