Xgboost regressor training with GPU does not work when python environment does not have "cudf" package #8467

Closed
WeichenXu123 opened this issue Nov 15, 2022 · 14 comments · Fixed by #8471


@WeichenXu123
Contributor

WeichenXu123 commented Nov 15, 2022

Testing on xgboost==1.7.1

Reproducing code:

# Run on a Spark cluster configured with GPUs; assumes an active SparkSession named `spark`.
from pyspark.ml.linalg import Vectors

df_train = spark.createDataFrame([
    (Vectors.dense(1.0, 2.0, 3.0), 0, False, 1.0),
    (Vectors.sparse(3, {1: 1.0, 2: 5.5}), 1, False, 2.0),
    (Vectors.dense(4.0, 5.0, 6.0), 0, True, 1.0),
    (Vectors.sparse(3, {1: 6.0, 2: 7.5}), 1, True, 2.0),
] * 100, ["features", "label", "isVal", "weight"])
df_test = spark.createDataFrame([
    (Vectors.dense(1.5, 2.0, 3.0), ),
    (Vectors.sparse(3, {1: -1.0, 2: 5.5}), ),
] * 100, ["features"])

from xgboost.spark import SparkXGBRegressor
xgb_regressor = SparkXGBRegressor(
    num_workers=2,
    max_depth=5, missing=0.0, use_gpu=True,
    validation_indicator_col='isVal', weight_col='weight',
    early_stopping_rounds=1, eval_metric='rmse'
)
xgb_reg_model = xgb_regressor.fit(df_train)
xgb_reg_model.transform(df_test).collect()

Error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(71, 0) finished unsuccessfully.
org.apache.spark.api.python.PythonException: 'ModuleNotFoundError: No module named 'cudf''. Full traceback below:
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-824c7a6f-bc83-452f-baf0-520e7f6f9371/lib/python3.9/site-packages/xgboost/spark/core.py", line 809, in _train_booster
    booster = worker_train(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-824c7a6f-bc83-452f-baf0-520e7f6f9371/lib/python3.9/site-packages/xgboost/spark/data.py", line 309, in create_dmatrix_from_partitions
    dtrain = make_qdm(train_data, gpu_id, meta, None, params)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-824c7a6f-bc83-452f-baf0-520e7f6f9371/lib/python3.9/site-packages/xgboost/spark/data.py", line 167, in make_qdm
    m = QuantileDMatrix(it, **params, ref=ref)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-824c7a6f-bc83-452f-baf0-520e7f6f9371/lib/python3.9/site-packages/xgboost/core.py", line 620, in inner_f
    return func(**kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-824c7a6f-bc83-452f-baf0-520e7f6f9371/lib/python3.9/site-packages/xgboost/core.py", line 1386, in __init__
    self._init(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-824c7a6f-bc83-452f-baf0-520e7f6f9371/lib/python3.9/site-packages/xgboost/core.py", line 1445, in _init
    it.reraise()
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-824c7a6f-bc83-452f-baf0-520e7f6f9371/lib/python3.9/site-packages/xgboost/core.py", line 488, in reraise
    raise exc  # pylint: disable=raising-bad-type
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-824c7a6f-bc83-452f-baf0-520e7f6f9371/lib/python3.9/site-packages/xgboost/core.py", line 478, in _handle_exception
    return dft_ret
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-824c7a6f-bc83-452f-baf0-520e7f6f9371/lib/python3.9/site-packages/xgboost/core.py", line 534, in <lambda>
    return self._handle_exception(lambda: self.next(input_data), 0)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-824c7a6f-bc83-452f-baf0-520e7f6f9371/lib/python3.9/site-packages/xgboost/spark/data.py", line 99, in next
    data=self._fetch(self._data[alias.data]),
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-824c7a6f-bc83-452f-baf0-520e7f6f9371/lib/python3.9/site-packages/xgboost/spark/data.py", line 85, in _fetch
    import cudf  # pylint: disable=import-error
ModuleNotFoundError: No module named 'cudf'
@WeichenXu123
Contributor Author

It seems I need to build cudf manually if I don't use a conda environment.

@WeichenXu123
Contributor Author

@trivialfis Could you release a PyPI xgboost package that contains the cudf library?
cudf cannot be installed correctly via pip, and building it is hard.

@hcho3
Collaborator

hcho3 commented Nov 16, 2022

@WeichenXu123 It is now possible to install cudf using pip. See https://rapids.ai/pip.html

@WeichenXu123
Contributor Author

@hcho3

I tried:

pip install cudf-cu11==22.10.0 --extra-index-url=https://pypi.ngc.nvidia.com

This installs a dependency, cupy-cuda115, which is only compatible with CUDA 11.5.

I cannot find a version that is compatible with CUDA 11.3 (the version we currently use).

@hcho3 Can we make xgboost-spark use "cudf" optionally? If cudf is not installed, it could fall back to a normal pandas DataFrame.
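
A minimal sketch of the fallback described here, assuming the data arrives as a pandas DataFrame. The names `has_module` and `to_device_frame` are hypothetical and are not part of xgboost; this only illustrates the optional-import pattern, not how xgboost.spark implements it.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


def to_device_frame(frame, use_gpu: bool):
    """Hypothetical helper: convert a pandas DataFrame to cudf only when cudf exists."""
    if use_gpu and has_module("cudf"):
        # Imported lazily so environments without cudf never hit the import.
        import cudf

        return cudf.DataFrame.from_pandas(frame)
    # Fallback: keep the data as the pandas DataFrame that was passed in.
    return frame
```

With this pattern, a missing cudf package only disables the GPU-side conversion instead of raising `ModuleNotFoundError` inside the training task.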

@hcho3
Collaborator

hcho3 commented Nov 16, 2022

@WeichenXu123 https://rapids.ai/pip.html says that you can install cuPy version for CUDA 11.3:

When installing these packages with CUDA 11.2, 11.3, or 11.4, you may experience a “Failed to import CuPy” error. To resolve this error, please uninstall cupy-cuda115 and install cupy-cuda11x:

pip uninstall cupy-cuda115; pip install cupy-cuda11x

Have you tried this?

@hcho3
Collaborator

hcho3 commented Nov 16, 2022

And I'm not sure if cuDF can be made optional. I'll defer to @trivialfis

@WeichenXu123
Contributor Author

> @WeichenXu123 https://rapids.ai/pip.html says that you can install cuPy version for CUDA 11.3:
>
> When installing these packages with CUDA 11.2, 11.3, or 11.4, you may experience a “Failed to import CuPy” error. To resolve this error, please uninstall cupy-cuda115 and install cupy-cuda11x:
>
> pip uninstall cupy-cuda115; pip install cupy-cuda11x
>
> Have you tried this?

Yes, I tried that and then got this error:

>>> import cudf
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/cudf/__init__.py", line 22, in <module>
    from cudf.core.dataframe import DataFrame, from_dataframe, from_pandas, merge
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/cudf/core/dataframe.py", line 57, in <module>
    from cudf.core import column, df_protocol, reshape
  File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/cudf/core/df_protocol.py", line 645, in <module>
    _INTS = {8: cp.int8, 16: cp.int16, 32: cp.int32, 64: cp.int64}
AttributeError: module 'cupy' has no attribute 'int8'

@hcho3
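
An `AttributeError` like this can appear when `import cupy` resolves to an incomplete package, for example after swapping one CuPy wheel for another. That cause is an assumption and is not confirmed in this thread. A small sketch for listing which CuPy distributions pip has installed (`installed_cupy_wheels` is a hypothetical helper name):

```python
from importlib import metadata


def installed_cupy_wheels():
    """Return sorted (name, version) pairs for installed distributions named cupy*."""
    found = []
    for dist in metadata.distributions():
        # Broken installs can lack a Name field, so fall back to an empty string.
        name = dist.metadata["Name"] or ""
        if name.lower().startswith("cupy"):
            found.append((name, str(dist.version)))
    return sorted(found)


if __name__ == "__main__":
    for name, version in installed_cupy_wheels():
        print(f"{name}=={version}")
```

If this prints more than one entry, or none at all while `import cupy` still succeeds, the environment probably holds leftover files from a previous wheel.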

@WeichenXu123
Contributor Author

WeichenXu123 commented Nov 16, 2022

> And I'm not sure if cuDF can be made optional. I'll defer to @trivialfis

I think it is doable: when I contributed the xgboost.spark module it worked without cudf, but now it does not.

The cudf dependency means that xgboost-spark GPU mode does not work on CUDA < 11.5, which is very bad.

Also, the xgboost package does not install cudf automatically, and most users do not know how to install this dependency, since it cannot be installed from the official PyPI repository. This is also quite bad.

@hcho3
Collaborator

hcho3 commented Nov 16, 2022

@WeichenXu123 The import cudf statement was introduced in #8088, which you approved. It seems we accidentally introduced a cudf dependency there. (I suppose that you didn't intend this outcome.)

At any rate, we should try to make cudf optional.

@hcho3
Collaborator

hcho3 commented Nov 16, 2022

I just opened #8469 as a first attempt at a fix.

@okrave

okrave commented Dec 1, 2022

Hi everybody, is there a solution yet, or do we have to wait for the 1.7.2 patch release?

Thanks, guys.

@trivialfis
Member

You can use the nightly build if you don't want to wait for a patch release.

@okrave

okrave commented Dec 1, 2022

nightly build?

@trivialfis
Member

https://xgboost.readthedocs.io/en/stable/install.html#nightly-build
