
CPU predictor should throw an error when categorical splits are present #6488

Closed
hcho3 opened this issue Dec 10, 2020 · 4 comments

@hcho3
Collaborator

hcho3 commented Dec 10, 2020

Reproducer:

import pandas as pd
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(seed=0)
x0 = rng.integers(low=0, high=3, size=20)
x1 = rng.integers(low=0, high=5, size=20)
noise = rng.normal(loc=0, scale=0.1, size=20)

df = pd.DataFrame({'x0': x0, 'x1': x1}).astype('category')
X = np.column_stack((x0, x1))
y = (x0 * 10 - 20) + (x1 - 2) + noise

params = {'tree_method': 'gpu_hist',
          'predictor': 'gpu_predictor',
          'enable_experimental_json_serialization': True,
          'max_depth': 6,
          'learning_rate': 1.0}

dtrain = xgb.DMatrix(df, label=y, enable_categorical=True)

bst = xgb.train(params, dtrain, num_boost_round=5, evals=[(dtrain, 'train')])
pred = bst.predict(dtrain)

bst.save_model('serialized.json')

bst2 = xgb.Booster(model_file='./serialized.json')
pred2 = bst2.predict(dtrain)

np.testing.assert_almost_equal(pred, pred2)

Log:

[0]	train-rmse:2.24335
[1]	train-rmse:0.67111
[2]	train-rmse:0.24109
[3]	train-rmse:0.09666
[4]	train-rmse:0.05782
Traceback (most recent call last):
  File "test.py", line 30, in <module>
    np.testing.assert_almost_equal(pred, pred2)
  File "/home/phcho/miniconda3/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 579, in assert_almost_equal
    return assert_array_almost_equal(actual, desired, decimal, err_msg)
  File "/home/phcho/miniconda3/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1042, in assert_array_almost_equal
    assert_array_compare(compare, x, y, err_msg=err_msg, verbose=verbose,
  File "/home/phcho/miniconda3/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 7 decimals

Mismatched elements: 20 / 20 (100%)
Max absolute difference: 22.214174
Max relative difference: 67.735565
 x: array([ -1.2340746,  -8.012611 ,  -8.714991 , -21.886219 , -19.201965 ,
       -17.83752  , -18.708538 , -21.886219 , -17.53006  ,   0.6563143,
        -8.012611 ,  -2.5466423, -12.684027 ,  -8.012611 ,  -2.5466423,...
 y: array([0.3279544, 0.3279544, 0.3279544, 0.3279544, 0.3279544, 0.3279544,
       0.3279544, 0.3279544, 0.3279544, 0.3279544, 0.3279544, 0.3279544,
       0.3279544, 0.3279544, 0.3279544, 0.3279544, 0.3279544, 0.3279544,
       0.3279544, 0.3279544], dtype=float32)

Note to others: categorical split support is currently experimental.

@hcho3
Collaborator Author

hcho3 commented Dec 10, 2020

The example script can be salvaged by saving a memory snapshot of the Booster with pickle:

import pandas as pd
import numpy as np
import xgboost as xgb
import pickle

rng = np.random.default_rng(seed=0)
x0 = rng.integers(low=0, high=3, size=20)
x1 = rng.integers(low=0, high=5, size=20)
noise = rng.normal(loc=0, scale=0.1, size=20)

df = pd.DataFrame({'x0': x0, 'x1': x1}).astype('category')
X = np.column_stack((x0, x1))
y = (x0 * 10 - 20) + (x1 - 2) + noise

params = {'tree_method': 'gpu_hist',
          'predictor': 'gpu_predictor',
          'enable_experimental_json_serialization': True,
          'max_depth': 6,
          'learning_rate': 1.0}

dtrain = xgb.DMatrix(df, label=y, enable_categorical=True)

bst = xgb.train(params, dtrain, num_boost_round=5, evals=[(dtrain, 'train')])
pred = bst.predict(dtrain)

with open('serialized.pkl', 'wb') as f:
    pickle.dump(bst, f)

with open('serialized.pkl', 'rb') as f:
    bst2 = pickle.load(f)
pred2 = bst2.predict(dtrain)

np.testing.assert_almost_equal(pred, pred2)  # this passes

In addition, manually saving the configuration with save_config() and restoring it with load_config() also works:

import pandas as pd
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(seed=0)
x0 = rng.integers(low=0, high=3, size=20)
x1 = rng.integers(low=0, high=5, size=20)
noise = rng.normal(loc=0, scale=0.1, size=20)

df = pd.DataFrame({'x0': x0, 'x1': x1}).astype('category')
X = np.column_stack((x0, x1))
y = (x0 * 10 - 20) + (x1 - 2) + noise

params = {'tree_method': 'gpu_hist',
          'predictor': 'gpu_predictor',
          'enable_experimental_json_serialization': True,
          'max_depth': 6,
          'learning_rate': 1.0}

dtrain = xgb.DMatrix(df, label=y, enable_categorical=True)

bst = xgb.train(params, dtrain, num_boost_round=5, evals=[(dtrain, 'train')])
pred = bst.predict(dtrain)

bst.save_model('serialized.json')
with open('config.json', 'w') as f:
    f.write(bst.save_config())

bst2 = xgb.Booster(model_file='./serialized.json')
with open('config.json', 'r') as f:
    config = f.read()
bst2.load_config(config)

pred2 = bst2.predict(dtrain)

np.testing.assert_almost_equal(pred, pred2)  # this passes

@hcho3
Collaborator Author

hcho3 commented Dec 10, 2020

It also suffices to set predictor='gpu_predictor' on the loaded Booster to obtain the correct predictions:

import pandas as pd
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(seed=0)
x0 = rng.integers(low=0, high=3, size=20)
x1 = rng.integers(low=0, high=5, size=20)
noise = rng.normal(loc=0, scale=0.1, size=20)

df = pd.DataFrame({'x0': x0, 'x1': x1}).astype('category')
X = np.column_stack((x0, x1))
y = (x0 * 10 - 20) + (x1 - 2) + noise

params = {'tree_method': 'gpu_hist',
          'predictor': 'gpu_predictor',
          'enable_experimental_json_serialization': True,
          'max_depth': 6,
          'learning_rate': 1.0}

dtrain = xgb.DMatrix(df, label=y, enable_categorical=True)

bst = xgb.train(params, dtrain, num_boost_round=5, evals=[(dtrain, 'train')])
pred = bst.predict(dtrain)

bst.save_model('serialized.json')

bst2 = xgb.Booster(model_file='./serialized.json')
bst2.set_param({'predictor': 'gpu_predictor'})
pred2 = bst2.predict(dtrain)

np.testing.assert_almost_equal(pred, pred2)

Given a model with categorical splits, we should throw an error when the predictor is not gpu_predictor.
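The proposed check could look roughly like the following plain-Python sketch. The real guard would live in the C++ CPU predictor; the function name and arguments here are illustrative, not XGBoost's actual API.

```python
def check_predictor(predictor: str, has_categorical_splits: bool) -> None:
    # Refuse to run a predictor that cannot handle categorical splits,
    # instead of silently returning wrong predictions.
    if has_categorical_splits and predictor != 'gpu_predictor':
        raise ValueError(
            "This model contains categorical splits, which only "
            "gpu_predictor currently supports. Set "
            "predictor='gpu_predictor' before calling predict().")

check_predictor('gpu_predictor', has_categorical_splits=True)  # OK
try:
    check_predictor('cpu_predictor', has_categorical_splits=True)
except ValueError:
    pass  # error raised, as proposed
```

Failing loudly here would have turned the silent mismatch above into an immediate, actionable error.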

@hcho3 hcho3 changed the title Model with categorical splits fail to preserve after round-trip serialization CPU predictor should throw an error when categorical splits are present Dec 10, 2020
@trivialfis
Member

I will add categorical split support to the CPU predictor.

@trivialfis
Member

Closing in favor of #6503 .
