[BUG] List index out of range when mixing categorical and non-categorical series with XGBoost #11788

miguelusque · 2022-09-27T16:12:33Z

Describe the bug
Hi! I am getting an error when mixing categorical and non-categorical series while training an XGBRegressor.

The error only happens when using cuDF dataframes + gpu_hist. If I use Pandas dataframes + gpu_hist, the issue doesn't happen.

Thanks!!!

Steps/Code to reproduce bug
I have attached a reproducer in both .pdf and .txt formats. The PDF file shows both the reproducer and the error.

Expected behavior
No errors.

Environment overview (please complete the following information)
DGX-1. rapidsai/rapidsai-core/22.08-cuda11.5-runtime-ubuntu20.04-py3.9 container
categorical.pdf
categorical.py.txt

quasiben · 2022-09-27T16:45:30Z

Here is the repro in the txt file:

import cudf
import pandas as pd
import numpy as np
import xgboost as xgb
from typing import Tuple

def make_categorical(n_samples: int, n_features: int, n_categories: int, onehot: bool) -> Tuple[pd.DataFrame, pd.Series]:
    """Make some random data for demo."""
    rng = np.random.RandomState(1994)

    pd_dict = {}
    for i in range(n_features + 1):
        c = rng.randint(low=0, high=n_categories, size=n_samples)
        pd_dict[str(i)] = pd.Series(c, dtype=np.int64)

    df = pd.DataFrame(pd_dict)
    label = df.iloc[:, 0]
    df = df.iloc[:, 1:]
    for i in range(0, n_features):
        label += df.iloc[:, i]
    label += 1

    df = df.astype("category")
    categories = np.arange(0, n_categories)
    for col in df.columns:
        df[col] = df[col].cat.set_categories(categories)
        
    if onehot:
        return pd.get_dummies(df), label
    return df, label


def main(X, y, gpu_training=False, cudf_dataframe=False, mix_datatypes=False) -> None:    
    if cudf_dataframe:
        # Copy data from host to GPU memory
        X = cudf.DataFrame.from_pandas(X)
        y = cudf.Series.from_pandas(y)

    if mix_datatypes:
        # Convert the first series from category to int64
        X[X.columns.values[0]] = X[X.columns.values[0]].astype('int64')
        
    print(list(X.dtypes))

    # Specify `enable_categorical` to True, also we use onehot encoding based split
    # here for demonstration. For details see the document of `max_cat_to_onehot`.
    tree_method="gpu_hist" if gpu_training else "hist"
    reg = xgb.XGBRegressor(tree_method=tree_method, enable_categorical=True, max_cat_to_onehot=5, n_estimators=2)
    reg.fit(X, y, eval_set=[(X, y)])

if __name__ == "__main__":
    # Print libraries versions
    print("Pandas version: " + pd.__version__)
    print("NumPy version: " + np.__version__)
    print("cuDF version: " + cudf.__version__)
    print("XGBoost version: " + xgb.__version__)

    # Use builtin categorical data support
    # For scikit-learn interface, the input data must be pandas DataFrame or cudf
    # DataFrame with categorical features
    X, y = make_categorical(n_samples=100, n_features=2, n_categories=4, onehot=False)

    # The following code works fine.
    print("\n**************")
    print("Test #1: GPU training, Pandas dataframe, categorical and integer data.")
    main(X.copy(), y.copy(), gpu_training=True, cudf_dataframe=False, mix_datatypes=True)

    # The following code fails.
    print("\n**************")
    print("Test #2: GPU training, cuDF dataframe, categorical and integer data.")
    main(X.copy(), y.copy(), gpu_training=True, cudf_dataframe=True, mix_datatypes=True)
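
Since the report states that Pandas dataframes + gpu_hist do not trigger the error, one possible stopgap on an affected version is to fall back to a host copy only when the column types are mixed. The sketch below is an illustration, not something proposed in this thread: `has_mixed_categorical` and `to_host_if_mixed` are hypothetical helper names, and it assumes that categorical dtypes expose the name `"category"` (as pandas does) and that the dataframe has a `to_pandas()` method (as cuDF dataframes do).

```python
from typing import Any, Iterable


def has_mixed_categorical(dtypes: Iterable[Any]) -> bool:
    """Return True if some, but not all, of the given dtypes are categorical."""
    # Assumption: categorical dtypes report the name "category"; plain strings
    # are accepted too, which keeps the helper easy to test.
    names = [getattr(dt, "name", str(dt)) for dt in dtypes]
    n_categorical = sum(name == "category" for name in names)
    return 0 < n_categorical < len(names)


def to_host_if_mixed(X: Any) -> Any:
    """Hypothetical workaround: copy to a pandas dataframe only when needed."""
    # Objects without `to_pandas` (e.g. pandas dataframes) are returned as-is.
    if hasattr(X, "to_pandas") and has_mixed_categorical(X.dtypes):
        return X.to_pandas()
    return X
```

In the reproducer this would wrap `X` just before `reg.fit(...)`; `y` would probably need the same conversion so that features and labels live in the same place. The trade-off is an extra device-to-host copy, so it only makes sense until an XGBoost build with the fix is installed.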

trivialfis · 2022-09-28T17:57:38Z

Hi, the fix is merged in XGBoost. We will have a new release next month. dmlc/xgboost#8280
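
To decide at runtime whether an installed XGBoost is likely to include this fix, a small version guard can help. This is a minimal sketch under a loud assumption: the thread only says "a new release next month", so the threshold `ASSUMED_FIXED_IN = (1, 7, 0)` is a guess and should be checked against the XGBoost release notes. The helper names are hypothetical.

```python
from typing import Tuple

# ASSUMPTION: first release containing the fix. The thread does not name it.
ASSUMED_FIXED_IN = (1, 7, 0)


def version_tuple(version: str) -> Tuple[int, int, int]:
    """Parse the leading 'major.minor.patch' digits of a version string."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc1' or '-dev'
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return (parts[0], parts[1], parts[2])


def likely_has_fix(xgb_version: str) -> bool:
    """True if the version is at or above the assumed fixed release."""
    # Pre-release suffixes are ignored, so '1.7.0rc1' counts as 1.7.0.
    return version_tuple(xgb_version) >= ASSUMED_FIXED_IN
```

With the reproducer's imports, `likely_has_fix(xgb.__version__)` could gate the cuDF code path.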

miguelusque · 2022-09-29T09:46:24Z

That is outstanding. Thank you!!!!

quasiben · 2022-09-30T14:15:52Z

With dmlc/xgboost#8280 merged in I'm going to close this issue. Thanks @trivialfis for the quick response

miguelusque added the Needs Triage and bug labels Sep 27, 2022

github-actions bot added this to Needs prioritizing in Bug Squashing Sep 27, 2022

trivialfis mentioned this issue Sep 28, 2022

Fix mixed types with cuDF. dmlc/xgboost#8280

Merged

quasiben closed this as completed Sep 30, 2022

Bug Squashing automation moved this from Needs prioritizing to Closed Sep 30, 2022

bdice removed the Needs Triage label Mar 4, 2024
