[bug] Python - Cuda error (without using Cuda) #10171

Open
stavoltafunzia opened this issue Apr 8, 2024 · 5 comments

@stavoltafunzia

stavoltafunzia commented Apr 8, 2024

I recently upgraded to xgboost version 2.0.3 (Python), and since then I can no longer use it because it keeps crashing. The following simple code fails to run:

import xgboost as xgb
import numpy as np

train = xgb.DMatrix(np.array([1,2,3]).reshape((-1, 1)), label=np.array([2,3,4]))  # Oddly, the error does not appear if the label is omitted

And the error message shows the following traceback:

Traceback (most recent call last):
  File "/home/nicola/mega_workspace/trader/debug_5.py", line 4, in <module>
    train = xgb.DMatrix(np.array([1,2,3]).reshape((-1, 1)), label=np.array([2,3,4]))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/site-packages/xgboost/core.py", line 730, in inner_f
    return func(**kwargs)
           ^^^^^^^^^^^^^^
  File "/home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/site-packages/xgboost/core.py", line 869, in __init__
    self.set_info(
  File "/home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/site-packages/xgboost/core.py", line 730, in inner_f
    return func(**kwargs)
           ^^^^^^^^^^^^^^
  File "/home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/site-packages/xgboost/core.py", line 932, in set_info
    self.set_label(label)
  File "/home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/site-packages/xgboost/core.py", line 1070, in set_label
    dispatch_meta_backend(self, label, "label", "float")
  File "/home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/site-packages/xgboost/data.py", line 1218, in dispatch_meta_backend
    _meta_from_numpy(data, name, dtype, handle)
  File "/home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/site-packages/xgboost/data.py", line 1159, in _meta_from_numpy
    _check_call(_LIB.XGDMatrixSetInfoFromInterface(handle, c_str(field), interface_str))
  File "/home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/site-packages/xgboost/core.py", line 282, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [20:49:47] /home/conda/feedstock_root/build_artifacts/xgboost-split_1712072663242/work/src/data/array_interface.cu:44: Check failed: err == cudaGetLastError() (0 vs. 2) : 
Stack trace:
  [bt] (0) /home/nicola/Software/miniconda3/envs/text_xgb/lib/libxgboost.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6e) [0x7f12a9d2164e]
  [bt] (1) /home/nicola/Software/miniconda3/envs/text_xgb/lib/libxgboost.so(xgboost::ArrayInterfaceHandler::IsCudaPtr(void const*)+0xdb) [0x7f12aa3801fb]
  [bt] (2) /home/nicola/Software/miniconda3/envs/text_xgb/lib/libxgboost.so(xgboost::MetaInfo::SetInfo(xgboost::Context const&, xgboost::StringView, xgboost::StringView)+0x126) [0x7f12a9f0f426]
  [bt] (3) /home/nicola/Software/miniconda3/envs/text_xgb/lib/libxgboost.so(XGDMatrixSetInfoFromInterface+0xf7) [0x7f12a9d02927]
  [bt] (4) /home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/lib-dynload/../../libffi.so.8(+0xa052) [0x7f12c86a5052]
  [bt] (5) /home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/lib-dynload/../../libffi.so.8(+0x8925) [0x7f12c86a3925]
  [bt] (6) /home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/lib-dynload/../../libffi.so.8(ffi_call+0xde) [0x7f12c86a406e]
  [bt] (7) /home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0x92e5) [0x7f12c87bc2e5]
  [bt] (8) /home/nicola/Software/miniconda3/envs/text_xgb/lib/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so(+0x8837) [0x7f12c87bb837]

I'm surprised that it raises a CUDA-related error, even though I'm only trying to use the classic CPU version of xgboost.
My configuration is as follows:

xgboost version: 2.0.3 (Python 3.11, clean anaconda environment with only xgboost installed)
OS: Debian 12, with Nvidia drivers 550.54.15 and CUDA 12.4
Hardware: RTX 4000 series card present

The code above used to run flawlessly in Python xgboost 1.7.x.
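The traceback points into `array_interface.cu`, which suggests the installed package is a CUDA-enabled build even though no GPU features were requested. A minimal sketch for checking this, assuming `xgboost.build_info()` is available in the installed version and that it exposes `USE_CUDA` and `CUDA_VERSION` keys (the key names may differ between releases; `describe_build` and `report` are hypothetical helper names):

```python
def describe_build(info):
    """Summarise a build-info dict; the key names are assumptions."""
    use_cuda = bool(info.get("USE_CUDA", False))
    if use_cuda:
        return f"CUDA-enabled build (CUDA_VERSION={info.get('CUDA_VERSION')})"
    return "CPU-only build"


def report():
    """Import xgboost lazily and describe how the installed library was built."""
    import xgboost as xgb  # imported here so the helper above works without it

    return f"xgboost {xgb.__version__}: {describe_build(xgb.build_info())}"
```

Calling `print(report())` in the affected environment should show whether the CUDA code path is compiled in at all.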


2024-04-09 update: it turns out that another process was using my GPU and occupying almost all of its VRAM. After closing that application, the example above works. Still, I'm not sure whether it should be considered a bug that any xgboost application, even a CPU-only one, crashes because of problems in the CUDA layer. I leave that decision to the developers, though I personally think it should not happen.
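For anyone hitting the same error, a quick way to see whether another process has filled the GPU is to query `nvidia-smi`. The following is a rough sketch, not part of xgboost: the helper names are hypothetical, the 0.95 threshold is an arbitrary assumption, and it expects `nvidia-smi` to report plain integer MiB values for every GPU.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' MiB pairs, one GPU per line, into a list of tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory use; needs the NVIDIA driver tools."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_csv(out)


def nearly_full(rows, threshold=0.95):
    """Return indices of GPUs whose used/total ratio is at or above threshold."""
    return [i for i, (used, total) in enumerate(rows)
            if total and used / total >= threshold]
```

Running `nearly_full(query_gpu_memory())` before creating the `DMatrix` lists the GPUs that are close to full. Plain `nvidia-smi` also shows which processes hold the memory.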

@trivialfis
Member

Haven't been able to reproduce with CUDA 12.3, trying 12.4 now.

@trivialfis
Member

Still haven't reproduced it.

@trivialfis
Member

trivialfis commented Apr 9, 2024

That's odd. Why would getting the last error return cudaErrorMemoryAllocation?

Apart from Debian vs. Ubuntu, I have pretty much the same configuration:

OS: Debian 12, with Nvidia drivers 550.54.15 and CUDA 12.4
Hardware: RTX 4000 series card present

@stavoltafunzia
Author

stavoltafunzia commented Apr 9, 2024

Apparently, another process was using my GPU and occupying almost all of its VRAM. After closing that application, the example above works.
Still, I'm not sure whether it should be considered a bug that any xgboost application, even a CPU-only one, crashes because of problems in the CUDA layer.

@trivialfis
Member

That makes sense. I will open a PR to work around that. XGBoost needs to know whether the input data lives on the GPU or the CPU, and we use the CUDA runtime to obtain this information. As a result, a CUDA error can surface while checking the input data.
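One plausible reading of the `(0 vs. 2)` in the failed check is that the pointer query itself succeeded (0) while the runtime still remembered an earlier out-of-memory error (2, `cudaErrorMemoryAllocation`). The toy model below illustrates that reading in pure Python. It is not XGBoost's code and does not call CUDA: `ToyRuntime` and both checker functions are hypothetical, and the tolerant variant is only one possible shape of a workaround, not necessarily what the PR does.

```python
SUCCESS = 0
ERROR_MEMORY_ALLOCATION = 2  # the "2" in "(0 vs. 2)"


class ToyRuntime:
    """Hypothetical stand-in for a runtime that keeps a 'last error' slot."""

    def __init__(self):
        self._last_error = SUCCESS

    def allocate(self, free_bytes, wanted_bytes):
        # A failed allocation is remembered in the last-error slot.
        if wanted_bytes > free_bytes:
            self._last_error = ERROR_MEMORY_ALLOCATION
            return ERROR_MEMORY_ALLOCATION
        return SUCCESS

    def query_pointer(self):
        # Succeeds, and a success does not clear an older recorded error.
        return SUCCESS

    def get_last_error(self):
        # Return the recorded error and reset the slot.
        err, self._last_error = self._last_error, SUCCESS
        return err


def strict_is_device_pointer(rt):
    """Fail if the query status and the last recorded error disagree."""
    err = rt.query_pointer()
    last = rt.get_last_error()
    if err != last:
        raise RuntimeError(f"Check failed: err == last_error ({err} vs. {last})")
    return False  # every pointer is host memory in this toy


def tolerant_is_device_pointer(rt):
    """Clear any stale error instead of asserting on it."""
    rt.query_pointer()
    rt.get_last_error()
    return False
```

With a fresh `ToyRuntime`, both checkers return `False`. After a failed `allocate`, the strict one raises with `(0 vs. 2)`, while the tolerant one still returns `False`.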
