[jvm-packages] [pyspark] Make cuDF optional in PySpark package #8469
Conversation
@WeichenXu123 Would you like to test this patch?
@hcho3 Btw, for the released xgboost 1.7.1, which CUDA version is it built with? Is it compatible with all of CUDA 11.3 / 11.4 / 11.5?
Yes. We build our binaries using CUDA 11.0.
awesome!
@WeichenXu123 Have you had a chance to test this patch? Did it work?
works
Great! Let's merge this after @trivialfis approves.
Interesting, I suppose the "vector" features and use_gpu will not reach that piece of code. @trivialfis
Hold on.. with your PR change, although GPU training works without cudf installed, there's a performance regression compared with the code when I contributed the xgboost spark estimator.
Right, just checked the code: use_gpu forces use_qdm=True, and use_qdm constructs a QuantileDMatrix instead of a DMatrix. So if there is no cudf, the issue happens. Looks like we need to check whether cudf is installed from here and decide whether to use QuantileDMatrix.
I am also concerned about the case where one worker has cudf installed while others don't. It will be messed up.
Sounds good!
Hold on.. with your PR change, although GPU training works without cudf installed, but there's a performance regression (in my test on taxi dataset about 2x slower) compared with the code when I contributed the xgboost spark estimator.
This will depend on the size of the dataset and generated model. For a small model, the overhead might be significant due to expensive initialization using QuantileDMatrix on CPU. For a larger model, the cost can be amortized.
I think the overhead is fine. If you want to use the GPU pipeline fully, you need GPU-based data storage like cudf. There's little benefit to working around that. Based on our discussion with @wbo4958 and other stakeholders, we might make cuDF a hard requirement for pyspark-rapidsai in the future anyway, with the help of CUDA IPC.
We can move ahead with this PR and let users know there's overhead between CPU/GPU data conversion.
I am also concerned about what if one worker has installed cudf while others don't install cudf. It will be messed up.
This is a bug in the cluster. We can set a check before the construction of DMatrix to make sure all workers are doing the same thing.
I hope we can make it easier to install cuDF with pip.
It requires an additional pypi index, but shouldn't be too difficult.
@WeichenXu123 had some trouble installing cuDF using pip. See #8467
I already use the NVIDIA index for pip install cuDF, but the installed version only supports cuda >= 11.5; see my comments in #8467
Currently cuDF cannot support Databricks runtime 2.0 (to be released soon), so we need a version that works well without cuDF and does not introduce a performance regression, i.e. in this case, for GPU training, we should still use a plain DMatrix instead of QuantileDMatrix.
Hi all, I'm in favor of merging this PR with an additional check that every worker has the same required package, or performs the same steps, based on #8469 (review). What do you think?
Can we make xgboost use
The QDM doesn't depend on cuDF. With cuDF the performance can be better, but it's not a hard requirement. We can continue to use QDM even if cuDF is missing.
We can, but then we have more inconsistencies in the code base. sklearn is already using QDM by default when it's appropriate.
Also, DMatrix consumes more memory.
But QDM without cuDF performs slower than DMatrix when the model is small (which should be a common case).
I guess @WeichenXu123's PR is OK to introduce a parameter use_qdm (defaulting to true). But I still don't want to check for cudf on the driver side. Meanwhile, I'd also like xgboost to throw an exception with a message (to set use_qdm to false) when no cudf is installed. Does that make sense?
I don't prefer the extra parameter, as I don't think it's necessary. Allow me to summarize the issues and conflicts here.
Can we meet at a middle ground? We make QDM optional by checking whether cuDF is available to workers and use a rabit allreduce to check that all workers share the same result from the predicate. After the work on pyspark with CUDA IPC and an upgrade to Databricks' runtime, we revert the change and proceed with using cuDF as the default. This way we have a hot fix without introducing an extra parameter that we need to maintain.
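A minimal sketch of that agreement check, under stated assumptions: the `allreduce_min` and `allreduce_max` callables below stand in for a rabit min/max allreduce over all workers' flags (the real implementation would go through XGBoost's collective communicator), and `check_cudf_agreement` is a made-up name:

```python
def check_cudf_agreement(local_has_cudf, allreduce_min, allreduce_max):
    """Verify every worker reached the same conclusion about cuDF.

    `local_has_cudf` is this worker's result; the two callables model
    a min/max allreduce over all workers' 0/1 flags. If min != max,
    the cluster is inconsistent.
    """
    flag = int(local_has_cudf)
    lo, hi = allreduce_min(flag), allreduce_max(flag)
    if lo != hi:
        raise RuntimeError(
            "cuDF is installed on some workers but not others; "
            "please make the cluster environment consistent."
        )
    return bool(lo)


# Simulated cluster where all three workers agree that cuDF is present:
flags = [1, 1, 1]
result = check_cudf_agreement(
    flags[0],
    allreduce_min=lambda _: min(flags),
    allreduce_max=lambda _: max(flags),
)
```

If the flags disagree, the check raises instead of silently letting some workers build a QuantileDMatrix while others build a DMatrix.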
My lack of support for the additional parameter is its combinatorial increase in complexity. We have
Then what about adding special handling for Databricks: if on Databricks runtime (the DATABRICKS_RUNTIME_VERSION environment variable exists), then use DMatrix
@WeichenXu123 Do you prefer XGBoost checking the databricks env over checking the availability of cuDF?
It's OK too. :)
Can we continue this PR? Checking cuDF seems to be simpler, and users have some control.
@trivialfis What's the final decision? Based on this PR, for Databricks runtime, add an additional check and use DMatrix instead?
Let's make QDM optional based on the cuDF check.
Got it. I will update my PR.
@trivialfis
Closes #8467
Not using cuDF may have adverse performance implications, but at least it's better not to crash due to missing cudf.