[BUG] List index out of range when mixing categorical and non-categorical series with XGBoost #11788

Closed
miguelusque opened this issue Sep 27, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@miguelusque
Member

miguelusque commented Sep 27, 2022

Describe the bug
Hi! I get an error when training an XGBRegressor on a dataframe that mixes categorical and non-categorical series.

The error only happens when using cuDF dataframes + gpu_hist. If I use Pandas dataframes + gpu_hist, the issue doesn't happen.

Thanks!!!
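Since the pandas path reportedly trains without error, one possible stopgap is to copy cuDF inputs back to host memory before calling `fit`. The sketch below is only an illustration: `to_host` and `fit_on_host_frames` are hypothetical helper names, and the only library call assumed is the `to_pandas()` method that cuDF DataFrames and Series expose. It costs an extra device-to-host copy, so it is a workaround, not a fix.

```python
def to_host(obj):
    """Return obj.to_pandas() when available (as on cuDF objects), else obj unchanged."""
    convert = getattr(obj, "to_pandas", None)
    return convert() if callable(convert) else obj


def fit_on_host_frames(model, X, y):
    """Hypothetical helper: move X and y to host memory, then fit the model."""
    X_host, y_host = to_host(X), to_host(y)
    # Mirror the reproducer's call, which evaluates on the training data.
    model.fit(X_host, y_host, eval_set=[(X_host, y_host)])
    return model
```

With the reproducer's objects this would be called as `fit_on_host_frames(reg, X, y)`; pandas inputs pass through untouched.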

Steps/Code to reproduce bug
I have attached a reproducer in both .pdf and .txt formats. The PDF shows both the reproducer and the error.

Expected behavior
No errors.

Environment overview (please complete the following information)
DGX-1. rapidsai/rapidsai-core/22.08-cuda11.5-runtime-ubuntu20.04-py3.9 container
categorical.pdf
categorical.py.txt

@miguelusque miguelusque added Needs Triage Need team to review and classify bug Something isn't working labels Sep 27, 2022
@github-actions github-actions bot added this to Needs prioritizing in Bug Squashing Sep 27, 2022
@quasiben
Member

Here is the repro in the txt file:

import cudf
import pandas as pd
import numpy as np
import xgboost as xgb
from typing import Tuple

def make_categorical(n_samples: int, n_features: int, n_categories: int, onehot: bool) -> Tuple[pd.DataFrame, pd.Series]:
    """Make some random data for demo."""
    rng = np.random.RandomState(1994)

    pd_dict = {}
    for i in range(n_features + 1):
        c = rng.randint(low=0, high=n_categories, size=n_samples)
        pd_dict[str(i)] = pd.Series(c, dtype=np.int64)

    df = pd.DataFrame(pd_dict)
    label = df.iloc[:, 0]
    df = df.iloc[:, 1:]
    for i in range(0, n_features):
        label += df.iloc[:, i]
    label += 1

    df = df.astype("category")
    categories = np.arange(0, n_categories)
    for col in df.columns:
        df[col] = df[col].cat.set_categories(categories)
        
    if onehot:
        return pd.get_dummies(df), label
    return df, label


def main(X, y, gpu_training=False, cudf_dataframe=False, mix_datatypes=False) -> None:    
    if cudf_dataframe:
        # Copy data from host to GPU memory
        X = cudf.DataFrame.from_pandas(X)
        y = cudf.Series.from_pandas(y)

    if mix_datatypes:
        # Convert first series from category to int64
        X[X.columns.values[0]] = X[X.columns.values[0]].astype('int64')
        
    print(list(X.dtypes))

    # Specify `enable_categorical` to True, also we use onehot encoding based split
    # here for demonstration. For details see the document of `max_cat_to_onehot`.
    tree_method = "gpu_hist" if gpu_training else "hist"
    reg = xgb.XGBRegressor(tree_method=tree_method, enable_categorical=True, max_cat_to_onehot=5, n_estimators=2)
    reg.fit(X, y, eval_set=[(X, y)])

if __name__ == "__main__":
    # Print libraries versions
    print("Pandas version: " + pd.__version__)
    print("NumPy version: " + np.__version__)
    print("cuDF version: " + cudf.__version__)
    print("XGBoost version: " + xgb.__version__)

    # Use builtin categorical data support
    # For scikit-learn interface, the input data must be pandas DataFrame or cudf
    # DataFrame with categorical features
    X, y = make_categorical(n_samples=100, n_features=2, n_categories=4, onehot=False)

    # The following code works fine.
    print("\n**************")
    print("Test #1: GPU training, Pandas dataframe, categorical and integer data.")
    main(X.copy(), y.copy(), gpu_training=True, cudf_dataframe=False, mix_datatypes=True)

    # The following code fails.
    print("\n**************")
    print("Test #2: GPU training, cuDF dataframe, categorical and integer data.")
    main(X.copy(), y.copy(), gpu_training=True, cudf_dataframe=True, mix_datatypes=True)
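To make it easier to spot the condition that triggers the failure, a small check like the one below could report which columns are categorical and which are not. This is a sketch, not part of the original reproducer: `split_by_categorical` and `has_mixed_dtypes` are hypothetical names, and it assumes each dtype either exposes a `name` attribute equal to `"category"` for categorical columns (as pandas dtypes do, and as cuDF dtypes are assumed to) or is already a plain string.

```python
from typing import Dict, List, Mapping


def split_by_categorical(dtypes: Mapping[str, object]) -> Dict[str, List[str]]:
    """Group column names by whether their dtype name is 'category'."""
    groups: Dict[str, List[str]] = {"categorical": [], "other": []}
    for column, dtype in dtypes.items():
        # Fall back to str(dtype) for inputs that are already dtype names.
        dtype_name = getattr(dtype, "name", str(dtype))
        key = "categorical" if dtype_name == "category" else "other"
        groups[key].append(column)
    return groups


def has_mixed_dtypes(dtypes: Mapping[str, object]) -> bool:
    """True when both categorical and non-categorical columns are present."""
    groups = split_by_categorical(dtypes)
    return bool(groups["categorical"]) and bool(groups["other"])
```

Inside `main`, something like `has_mixed_dtypes(dict(X.dtypes))` would flag the mixed case before `fit` is called; in the reproducer with `mix_datatypes=True`, column "1" is int64 and column "2" stays categorical.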

@trivialfis
Member

Hi, the fix is merged in XGBoost. We will have a new release next month. dmlc/xgboost#8282
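Until the release is out, a script could guard the mixed-dtype cuDF path behind a version check. The sketch below is an illustration only: `release_tuple` and `is_at_least` are hypothetical helpers, and this thread does not name the release that carries the fix, so the minimum version is left as a parameter for the caller to supply.

```python
import re
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, ...]:
    """Parse leading numeric components, e.g. '1.7.0rc1' -> (1, 7, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def is_at_least(installed: str, minimum: str) -> bool:
    """Compare numeric release components; pre-release suffixes are ignored."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so '2.0' compares like '2.0.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

A call would look like `is_at_least(xgb.__version__, minimum)`, where `minimum` is whichever release the XGBoost changelog lists as containing the fix. Because suffixes such as `rc1` are ignored, a release candidate compares equal to its final release.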

@miguelusque
Member Author

That is outstanding. Thank you!!!!

@quasiben
Member

With dmlc/xgboost#8280 merged in, I'm going to close this issue. Thanks @trivialfis for the quick response.

Bug Squashing automation moved this from Needs prioritizing to Closed Sep 30, 2022
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024