[BUG] List index out of range when mixing categorical and non-categorical series with XGBoost #11788
Labels: bug (Something isn't working)

miguelusque added the Needs Triage (Need team to review and classify) and bug (Something isn't working) labels on Sep 27, 2022.
Here is the repro from the attached .txt file:

```python
import cudf
import pandas as pd
import numpy as np
import xgboost as xgb
from typing import Tuple


def make_categorical(
    n_samples: int, n_features: int, n_categories: int, onehot: bool
) -> Tuple[pd.DataFrame, pd.Series]:
    """Make some random data for the demo."""
    rng = np.random.RandomState(1994)
    pd_dict = {}
    for i in range(n_features + 1):
        c = rng.randint(low=0, high=n_categories, size=n_samples)
        pd_dict[str(i)] = pd.Series(c, dtype=np.int64)
    df = pd.DataFrame(pd_dict)
    label = df.iloc[:, 0]
    df = df.iloc[:, 1:]
    for i in range(0, n_features):
        label += df.iloc[:, i]
    label += 1
    df = df.astype("category")
    categories = np.arange(0, n_categories)
    for col in df.columns:
        df[col] = df[col].cat.set_categories(categories)
    if onehot:
        return pd.get_dummies(df), label
    return df, label


def main(X, y, gpu_training=False, cudf_dataframe=False, mix_datatypes=False) -> None:
    if cudf_dataframe:
        # Copy data from host to GPU memory
        X = cudf.DataFrame.from_pandas(X)
        y = cudf.Series.from_pandas(y)
    if mix_datatypes:
        # Convert the first series from category to int64
        X[X.columns.values[0]] = X[X.columns.values[0]].astype("int64")
    print(list(X.dtypes))
    # Set `enable_categorical=True`; one-hot-encoding-based splits are used
    # here for demonstration. For details, see the documentation of
    # `max_cat_to_onehot`.
    tree_method = "gpu_hist" if gpu_training else "hist"
    reg = xgb.XGBRegressor(
        tree_method=tree_method,
        enable_categorical=True,
        max_cat_to_onehot=5,
        n_estimators=2,
    )
    reg.fit(X, y, eval_set=[(X, y)])


if __name__ == "__main__":
    # Print library versions
    print("Pandas version: " + pd.__version__)
    print("NumPy version: " + np.__version__)
    print("cuDF version: " + cudf.__version__)
    print("XGBoost version: " + xgb.__version__)

    # Use the built-in categorical data support.
    # For the scikit-learn interface, the input data must be a pandas DataFrame
    # or a cuDF DataFrame with categorical features.
    X, y = make_categorical(n_samples=100, n_features=2, n_categories=4, onehot=False)

    # The following call works fine.
    print("\n**************")
    print("Test #1: GPU training, Pandas dataframe, categorical and integer data.")
    main(X.copy(), y.copy(), gpu_training=True, cudf_dataframe=False, mix_datatypes=True)

    # The following call fails.
    print("\n**************")
    print("Test #2: GPU training, cuDF dataframe, categorical and integer data.")
    main(X.copy(), y.copy(), gpu_training=True, cudf_dataframe=True, mix_datatypes=True)
```
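Until a release containing the fix is available, one possible workaround (a sketch of my own, not taken from the thread, shown with plain pandas; the helper name `unify_to_categorical` is hypothetical) is to cast any non-categorical columns back to `category` before calling `fit`, so the frame's dtypes are uniform again:

```python
import pandas as pd


def unify_to_categorical(X: pd.DataFrame) -> pd.DataFrame:
    """Cast every non-categorical column to `category` so all dtypes match."""
    X = X.copy()
    for col in X.columns:
        if X[col].dtype.name != "category":
            X[col] = X[col].astype("category")
    return X


# A frame mixing one int64 column with a categorical one, as in the repro.
X = pd.DataFrame({
    "a": pd.Series([0, 1, 2], dtype="int64"),
    "b": pd.Series([0, 1, 0]).astype("category"),
})
X = unify_to_categorical(X)
print([d.name for d in X.dtypes])  # ['category', 'category']
```

Note this changes the model's view of the integer column (it is treated as categorical rather than numeric), so it is only a stopgap, not equivalent to the intended mixed-dtype behavior.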
Hi, the fix is merged in XGBoost. We will have a new release next month. dmlc/xgboost#8282

That is outstanding. Thank you!!!!

With dmlc/xgboost#8280 merged in, I'm going to close this issue. Thanks @trivialfis for the quick response.
Describe the bug
Hi! I am facing an error when mixing categorical and non-categorical series while training an XGBRegressor.
The error only happens when using cuDF dataframes + gpu_hist. If I use Pandas dataframes + gpu_hist, the issue doesn't happen.
Thanks!!!
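For context, the mixed-dtype layout that triggers the failure can be illustrated with plain pandas (a minimal sketch; cuDF and gpu_hist, which are required to actually hit the error, are omitted here): every column is categorical except the first, which has been cast back to int64.

```python
import numpy as np
import pandas as pd

# Two categorical columns, then the first cast back to int64 -- the
# mixed-dtype layout that the reproducer feeds to XGBRegressor.fit().
rng = np.random.RandomState(1994)
df = pd.DataFrame({str(i): rng.randint(0, 4, size=10) for i in range(2)})
df = df.astype("category")
df["0"] = df["0"].astype("int64")

print(list(df.dtypes))
```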
Steps/Code to reproduce bug
I have attached a reproducer in both .pdf and .txt formats. Both the reproducer and the error can be observed in the pdf file.
Expected behavior
No errors.
Environment overview (please complete the following information)
DGX-1. rapidsai/rapidsai-core/22.08-cuda11.5-runtime-ubuntu20.04-py3.9 container
categorical.pdf
categorical.py.txt