
[jvm-packages] [pyspark] Make cuDF optional in PySpark package #8469

Closed · wants to merge 1 commit

Conversation

@hcho3 (Collaborator) commented Nov 16, 2022

Closes #8467

Not using cuDF may have adverse performance implications, but it is better than crashing due to a missing cudf package.

@hcho3 (Collaborator Author) commented Nov 16, 2022

@WeichenXu123 Would you like to test this patch?

@WeichenXu123 (Contributor) commented Nov 16, 2022

@hcho3 By the way, which CUDA version is the released xgboost 1.7.1 built with? Is it compatible with all of CUDA 11.3 / 11.4 / 11.5?

@hcho3 (Collaborator Author) commented Nov 16, 2022

Yes. We build our binaries using CUDA 11.0.

@WeichenXu123 (Contributor) commented Nov 16, 2022

> Yes. We build our binaries using CUDA 11.0.

Awesome!
We hope to have a patch release 1.7.2 that includes this patch as soon as possible.

@hcho3 (Collaborator Author) commented Nov 16, 2022

@WeichenXu123 Have you had a chance to test this patch? Did it work?

@WeichenXu123 (Contributor):

> @WeichenXu123 Have you had a chance to test this patch? Did it work?

It works.

@hcho3 (Collaborator Author) commented Nov 16, 2022

Great! Let's merge this after @trivialfis approves.

@wbo4958 (Contributor) commented Nov 16, 2022

Interesting. I suppose the "vector" features and use_gpu will not reach that piece of code. @trivialfis

@WeichenXu123 (Contributor) commented Nov 16, 2022

Hold on: with your PR change, GPU training works without cudf installed, but there's a performance regression (about 2x slower in my test on the taxi dataset) compared with the code as it was when I contributed the xgboost spark estimator.
@hcho3 @wbo4958

I think that when cudf is not installed, we should set use_qdm=False.

@wbo4958 (Contributor) commented Nov 16, 2022

Right, I just checked the code: use_gpu forces use_qdm=True, and use_qdm constructs a QuantileDMatrix instead of a DMatrix. So if there is no cudf, the issue happens.

It looks like we need to check whether cudf is installed from here and decide whether to use QuantileDMatrix.
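
For illustration, here is a minimal sketch of that kind of check; the helper names are hypothetical and this is not the actual xgboost-spark code:

```python
# Hypothetical sketch: probe for cuDF on the worker and fall back to a plain
# DMatrix when it is missing. This mirrors the check being discussed, not the
# code that was merged.
import importlib.util

import xgboost as xgb


def _is_cudf_available() -> bool:
    """Return True if the cudf package can be imported on this worker."""
    return importlib.util.find_spec("cudf") is not None


def make_dmatrix(features, label, use_qdm: bool):
    # QuantileDMatrix pre-bins the data and saves memory, but the CPU path
    # without cuDF was reported above to be noticeably slower for small models.
    if use_qdm:
        return xgb.QuantileDMatrix(features, label=label)
    return xgb.DMatrix(features, label=label)


# The decision discussed here: only enable QDM when cuDF is present.
use_qdm = _is_cudf_available()
```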

@wbo4958 (Contributor) commented Nov 16, 2022

I am also concerned about what happens if one worker has cudf installed while others do not; things will get messed up.
Would it be reasonable to define a parameter to control use_qdm?

@WeichenXu123 (Contributor):

> I am also concerned about what happens if one worker has cudf installed while others do not; things will get messed up. Would it be reasonable to define a parameter to control use_qdm?

Sounds good!
I will create a PR.

@WeichenXu123 (Contributor):

@wbo4958 New PR created: #8471

@trivialfis (Member) left a review comment:

> Hold on: with your PR change, GPU training works without cudf installed, but there's a performance regression (about 2x slower in my test on the taxi dataset) compared with the code as it was when I contributed the xgboost spark estimator.

This will depend on the size of the dataset and generated model. For a small model, the overhead might be significant due to expensive initialization using QuantileDMatrix on CPU. For a larger model, the cost can be amortized.

I think the overhead is fine. If you want to use the GPU pipeline fully, you need GPU-based data storage like cudf; there's little benefit to working around that. Based on our discussion with @wbo4958 and other stakeholders, we might make cuDF a hard requirement for pyspark-rapidsai in the future anyway, with the help of CUDA IPC.

We can move ahead with this PR and let users know there's overhead from CPU/GPU data conversion.

> I am also concerned about what happens if one worker has cudf installed while others do not; things will get messed up.

That would be a bug in the cluster setup. We can add a check before the construction of the DMatrix to make sure all workers are doing the same thing.
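
One way such a check could look (a sketch only; it uses a driver-side Spark probe, and `num_workers` plus the helper name are assumptions, not the check that ended up in the estimator):

```python
# Sketch of a driver-side sanity check, assuming a SparkSession `spark` and a
# known worker count `num_workers`. It is best-effort: parallelize() does not
# guarantee that every executor is probed, so treat this as illustrative only.
import importlib.util


def _cudf_available(_) -> bool:
    return importlib.util.find_spec("cudf") is not None


def check_cudf_on_workers(spark, num_workers: int) -> bool:
    flags = (
        spark.sparkContext
        .parallelize(range(num_workers), num_workers)
        .map(_cudf_available)
        .collect()
    )
    if len(set(flags)) > 1:
        raise RuntimeError(
            "cuDF is installed on some workers but not on others; "
            "install it on all workers (or none) before training."
        )
    return flags[0]
```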

@hcho3 (Collaborator Author) commented Nov 17, 2022

> we might make cuDF a hard requirement for pyspark-rapidsai in the future

I hope we can make it easier to install cuDF with pip.

@trivialfis (Member):

> I hope we can make it easier to install cuDF with pip.

It requires an additional PyPI index, but it shouldn't be too difficult.

@hcho3 (Collaborator Author) commented Nov 17, 2022

@trivialfis

> It requires an additional PyPI index, but it shouldn't be too difficult.

@WeichenXu123 had some trouble installing cuDF using pip. See #8467

@WeichenXu123 (Contributor):

> @trivialfis
>
> > It requires an additional PyPI index, but it shouldn't be too difficult.
>
> @WeichenXu123 had some trouble installing cuDF using pip. See #8467

I already use the NVIDIA index to pip install cuDF, but the installed version only supports CUDA >= 11.5; see my comments in #8467.

@WeichenXu123 (Contributor):

@trivialfis

> This will depend on the size of the dataset and generated model. For a small model, the overhead might be significant due to expensive initialization using QuantileDMatrix on CPU. For a larger model, the cost can be amortized.
>
> I think the overhead is fine. If you want to use the GPU pipeline fully, you need GPU-based data storage like cudf; there's little benefit to working around that. Based on our discussion with @wbo4958 and other stakeholders, we might make cuDF a hard requirement for pyspark-rapidsai in the future anyway, with the help of CUDA IPC.
>
> We can move ahead with this PR and let users know there's overhead from CPU/GPU data conversion.

Currently cuDF does not support Databricks runtime 2.0 (to be released soon), so we need a version that works well without cuDF and does not introduce a performance regression, i.e., in this case, for GPU training we should still use a plain DMatrix instead of QuantileDMatrix.

@trivialfis (Member):

Hi all, I'm in favor of merging this PR, with an additional check that every worker has the same required packages or performs the same steps, based on #8469 (review). What do you think?

@WeichenXu123 (Contributor):

@trivialfis

> Hi all, I'm in favor of merging this PR, with an additional check that every worker has the same required packages or performs the same steps, based on #8469 (review). What do you think?

Can we make xgboost use DMatrix when cuDF is not installed, to avoid the performance regression described in #8469 (comment)? :)

@trivialfis (Member) commented Nov 18, 2022

QDM doesn't depend on cuDF. With cuDF the performance can be better, but it's not a hard requirement.

We can continue to use QDM even if cuDF is missing.
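
To illustrate that point, QuantileDMatrix accepts plain CPU inputs such as NumPy arrays, so it can be constructed without cuDF (a cuDF DataFrame simply keeps the data on the GPU and avoids an extra host-to-device copy). A small, self-contained example:

```python
import numpy as np
import xgboost as xgb

# Build a QuantileDMatrix from plain NumPy data: no cuDF involved.
rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 16))
y = rng.standard_normal(1024)

qdm = xgb.QuantileDMatrix(X, label=y)
booster = xgb.train({"tree_method": "hist"}, qdm, num_boost_round=10)
```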

@trivialfis (Member):

We can, but then we would have more inconsistencies in the code base. sklearn is already using QDM by default when it's appropriate.

@trivialfis (Member):

Also, DMatrix consumes more memory.

@WeichenXu123 (Contributor):

But QDM without cuDF performs more slowly than DMatrix when the model is small (which should be a common case).

@wbo4958 (Contributor) commented Nov 18, 2022

I guess it's okay for @WeichenXu123's PR to introduce a parameter use_qdm (defaulting to true). But I still don't want to check for cudf on the driver side, and meanwhile I'd also like xgboost to throw an exception with a message (telling the user to set use_qdm to false) when cudf is not installed. Does that make sense?

@trivialfis (Member) commented Nov 18, 2022

I'd prefer not to add the extra parameter, as I don't think it's necessary.

Allow me to summarize the issues and conflicts here.

  • When cuDF is not available, QDM slows down training for small models.
  • cuDF's PyPI packages only support the latest CUDA versions with enhanced compatibility; as a result, we prefer not to make it a hard dependency for the pyspark-xgboost GPU pipeline. Databricks' current runtime doesn't have the latest CUDA version.
  • Making QDM optional may introduce inconsistency between pyspark and other Python interfaces, including sklearn and dask.
  • DMatrix performs better with CPU input and a GPU model, but uses more memory.

Can we meet in the middle? We make QDM optional by checking whether cuDF is available to the workers, and use a rabit allreduce to check that all workers share the same result from that predicate. After the work on pyspark with CUDA IPC and an upgrade to Databricks' runtime, we revert the change and proceed with cuDF as the default. This way we have a hotfix without introducing an extra parameter that we need to maintain.
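
A hedged sketch of that middle ground; it assumes xgboost's collective communicator (available since 1.7) has already been initialized inside the training barrier, and it is not the code that was eventually merged:

```python
# Sketch only: each worker evaluates the cuDF predicate locally, then an
# allreduce verifies that every worker got the same answer.
import importlib.util

import numpy as np
from xgboost import collective


def decide_use_qdm() -> bool:
    has_cudf = importlib.util.find_spec("cudf") is not None
    # Sum the per-worker flags: the total must be 0 (no worker has cuDF) or
    # equal to the world size (all workers have it); anything else means the
    # cluster is misconfigured.
    total = collective.allreduce(
        np.array([1.0 if has_cudf else 0.0]), collective.Op.SUM
    )[0]
    if total not in (0.0, float(collective.get_world_size())):
        raise RuntimeError("cuDF must be installed on all workers or on none.")
    return has_cudf
```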

@trivialfis trivialfis added this to To Do in PySpark Support via automation Nov 18, 2022
@trivialfis (Member) commented Nov 18, 2022

The reason I don't support the additional parameter is the combinatorial increase in complexity. We would have use_gpu, tree_method, use_qdm, feature_cols, and the availability of cuDF all affecting the choice. The logic becomes unnecessarily complex for long-term maintenance. One could try to document and explain the exact behavior for each combination of these parameters and environment conditions. ;-)

@WeichenXu123 (Contributor):

Then what about adding special handling for Databricks: if running on the Databricks runtime (the DATABRICKS_RUNTIME_VERSION environment variable exists), then use DMatrix?

@trivialfis (Member):

@WeichenXu123 Do you prefer XGBoost checking the Databricks environment over checking the availability of cuDF?

@WeichenXu123 (Contributor):

> @WeichenXu123 Do you prefer XGBoost checking the Databricks environment over checking the availability of cuDF?

That's OK too. :)

@trivialfis (Member):

Can we continue with this PR? Checking for cuDF seems simpler, and it gives users some control.

@WeichenXu123 (Contributor) commented Nov 21, 2022

> Can we continue with this PR? Checking for cuDF seems simpler, and it gives users some control.

@trivialfis What's the final decision? Based on this PR, for the Databricks runtime, do we add an additional check and use DMatrix instead?

@trivialfis (Member):

Let's make QDM optional based on the cuDF check.

@WeichenXu123 (Contributor):

@trivialfis

> Let's make QDM optional based on the cuDF check.

Got it. I will update my PR.

@WeichenXu123 (Contributor):

@trivialfis
My PR has been updated according to your latest suggestion: #8471

@hcho3 hcho3 deleted the pyspark_cudf_optional branch July 17, 2023 17:32

Successfully merging this pull request may close these issues.

Xgboost regressor training with GPU does not work when python environment does not have "cudf" package