[jvm-packages] [pyspark] Make cuDF optional in PySpark package #8469
Conversation
@WeichenXu123 Would you like to test this patch?
@hcho3 Btw, for the released xgboost 1.7.1, which CUDA version is it built with? Is it compatible with all of CUDA 11.3 / 11.4 / 11.5?
Yes. We build our binaries using CUDA 11.0.
awesome!
@WeichenXu123 Have you had a chance to test this patch? Did it work?
works
Great! Let's merge this after @trivialfis approves.
Interesting, I suppose the "vector" features and use_gpu will not reach that piece of code. @trivialfis
Hold on.. with your PR change, although GPU training works without cudf installed, there's a performance regression compared with the code when I contributed the xgboost spark estimator.
Right, just checked the code: use_gpu forces use_qdm=True, and use_qdm constructs a QuantileDMatrix instead of a DMatrix. So if there is no cudf, the issue happens. Looks like we need to check whether cudf is installed from here and decide whether to use QuantileDMatrix.
I am also concerned about the case where one worker has cudf installed while others don't. It will be messed up.
Sounds good!
Hold on.. with your PR change, although GPU training works without cudf installed, but there's a performance regression (in my test on taxi dataset about 2x slower) compared with the code when I contributed the xgboost spark estimator.
This will depend on the size of the dataset and generated model. For a small model, the overhead might be significant due to expensive initialization using QuantileDMatrix on CPU. For a larger model, the cost can be amortized.
I think the overhead is fine. If you want to use the GPU pipeline fully, you need GPU-based data storage like cudf. There's little benefit to working around that. Based on our discussion with @wbo4958 and other stakeholders, we might make cuDF a hard requirement for pyspark-rapidsai in the future anyway, with the help of CUDA IPC.
We can move ahead with this PR and let users know there's overhead between CPU/GPU data conversion.
I am also concerned about what if one worker has installed cudf while others don't install cudf. It will be messed up.
This is a bug in the cluster. We can set a check before the construction of DMatrix to make sure all workers are doing the same thing.
I hope we can make it easier to install cuDF with pip.
It requires an additional pypi index, but shouldn't be too difficult.
@WeichenXu123 had some trouble installing cuDF using pip. See #8467
I already use the NVIDIA index for pip install cuDF, but the installed version only supports cuda >= 11.5; see my comments in #8467
Currently cuDF cannot support Databricks runtime 2.0 (to be released soon), so we need a version that works well without cuDF and does not introduce a performance regression, i.e. in this case, for GPU training, we should still use a plain DMatrix instead of QuantileDMatrix.
Hi all, I'm in favor of merging this PR with an additional check that every worker has the same required package, or performs the same steps, based on #8469 (review). What do you think?
Can we make xgboost use
The QDM doesn't depend on cuDF. With cuDF the performance can be better, but it's not a hard requirement. We can continue to use QDM even if cuDF is missing.
We can, but then we have more inconsistencies in the code base. sklearn is already using QDM by default when it's appropriate.
Also, DMatrix consumes more memory.
But QDM without cuDF performs slower than DMatrix when the model is small (which should be a common case).
I guess @WeichenXu123's PR is OK to introduce a parameter use_qdm (defaulting to true). But I still don't want to check for cudf on the driver side. Meanwhile, I'd also like xgboost to throw an exception with a message (to set use_qdm to false) when no cudf is installed. Does that make sense?
I don't prefer the extra parameter, as I don't think it's necessary. Allow me to summarize the issues and conflicts here.
Can we meet at a middle ground? We make QDM optional by checking whether cuDF is available to workers and use a rabit allreduce to check that all workers share the same result from the predicate. After the work on pyspark with CUDA IPC and an upgrade to Databricks' runtime, we revert the change and proceed with using cuDF as the default. This way we have a hot fix without introducing an extra parameter that we need to maintain.
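A minimal sketch of that agreement check, under stated assumptions: the `allreduce_min` and `allreduce_max` callables below stand in for a rabit min/max allreduce over all workers' flags (the real implementation would go through XGBoost's collective communicator), and `check_cudf_agreement` is a made-up name:

```python
def check_cudf_agreement(local_has_cudf, allreduce_min, allreduce_max):
    """Verify every worker reached the same conclusion about cuDF.

    `local_has_cudf` is this worker's result; the two callables model
    a min/max allreduce over all workers' 0/1 flags. If min != max,
    the cluster is inconsistent.
    """
    flag = int(local_has_cudf)
    lo, hi = allreduce_min(flag), allreduce_max(flag)
    if lo != hi:
        raise RuntimeError(
            "cuDF is installed on some workers but not others; "
            "please make the cluster environment consistent."
        )
    return bool(lo)


# Simulated cluster where all three workers agree that cuDF is present:
flags = [1, 1, 1]
result = check_cudf_agreement(
    flags[0],
    allreduce_min=lambda _: min(flags),
    allreduce_max=lambda _: max(flags),
)
```

If the flags disagree, the check raises instead of silently letting some workers build a QuantileDMatrix while others build a DMatrix.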
My lack of support for the additional parameter is its combinatorial increase in complexity. We have
Then what about adding special handling for Databricks: if on Databricks runtime (the DATABRICKS_RUNTIME_VERSION environment variable exists), then use DMatrix
@WeichenXu123 Do you prefer XGBoost checking the databricks env over checking the availability of cuDF?
It's OK too. :)
Can we continue this PR? Checking cuDF seems to be simpler, and users have some control.
@trivialfis What's the final decision? Based on this PR, for Databricks runtime, add an additional check and use DMatrix instead?
Let's make QDM optional based on the cuDF check.
Got it. I will update my PR.
@trivialfis
Closes #8467
Not using cuDF may have adverse performance implications, but at least it's better not to crash due to missing cudf.