
CPU predictor should throw an error when categorical splits are present #6488

Closed
hcho3 opened this issue Dec 10, 2020 · 4 comments

@hcho3
Collaborator

hcho3 commented Dec 10, 2020

Reproducer:

import pandas as pd
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(seed=0)
x0 = rng.integers(low=0, high=3, size=20)
x1 = rng.integers(low=0, high=5, size=20)
noise = rng.normal(loc=0, scale=0.1, size=20)

df = pd.DataFrame({'x0': x0, 'x1': x1}).astype('category')
X = np.column_stack((x0, x1))
y = (x0 * 10 - 20) + (x1 - 2) + noise

params = {'tree_method': 'gpu_hist',
          'predictor': 'gpu_predictor',
          'enable_experimental_json_serialization': True,
          'max_depth': 6,
          'learning_rate': 1.0}

dtrain = xgb.DMatrix(df, label=y, enable_categorical=True)

bst = xgb.train(params, dtrain, num_boost_round=5, evals=[(dtrain, 'train')])
pred = bst.predict(dtrain)

bst.save_model('serialized.json')

bst2 = xgb.Booster(model_file='./serialized.json')
pred2 = bst2.predict(dtrain)

np.testing.assert_almost_equal(pred, pred2)

Log:

[0]	train-rmse:2.24335
[1]	train-rmse:0.67111
[2]	train-rmse:0.24109
[3]	train-rmse:0.09666
[4]	train-rmse:0.05782
Traceback (most recent call last):
  File "test.py", line 30, in <module>
    np.testing.assert_almost_equal(pred, pred2)
  File "/home/phcho/miniconda3/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 579, in assert_almost_equal
    return assert_array_almost_equal(actual, desired, decimal, err_msg)
  File "/home/phcho/miniconda3/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1042, in assert_array_almost_equal
    assert_array_compare(compare, x, y, err_msg=err_msg, verbose=verbose,
  File "/home/phcho/miniconda3/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Arrays are not almost equal to 7 decimals

Mismatched elements: 20 / 20 (100%)
Max absolute difference: 22.214174
Max relative difference: 67.735565
 x: array([ -1.2340746,  -8.012611 ,  -8.714991 , -21.886219 , -19.201965 ,
       -17.83752  , -18.708538 , -21.886219 , -17.53006  ,   0.6563143,
        -8.012611 ,  -2.5466423, -12.684027 ,  -8.012611 ,  -2.5466423,...
 y: array([0.3279544, 0.3279544, 0.3279544, 0.3279544, 0.3279544, 0.3279544,
       0.3279544, 0.3279544, 0.3279544, 0.3279544, 0.3279544, 0.3279544,
       0.3279544, 0.3279544, 0.3279544, 0.3279544, 0.3279544, 0.3279544,
       0.3279544, 0.3279544], dtype=float32)

Note to others: categorical split support is currently experimental.

@hcho3
Collaborator Author

hcho3 commented Dec 10, 2020

The example script can be salvaged by saving a memory snapshot of the Booster with pickle:

import pandas as pd
import numpy as np
import xgboost as xgb
import pickle

rng = np.random.default_rng(seed=0)
x0 = rng.integers(low=0, high=3, size=20)
x1 = rng.integers(low=0, high=5, size=20)
noise = rng.normal(loc=0, scale=0.1, size=20)

df = pd.DataFrame({'x0': x0, 'x1': x1}).astype('category')
X = np.column_stack((x0, x1))
y = (x0 * 10 - 20) + (x1 - 2) + noise

params = {'tree_method': 'gpu_hist',
          'predictor': 'gpu_predictor',
          'enable_experimental_json_serialization': True,
          'max_depth': 6,
          'learning_rate': 1.0}

dtrain = xgb.DMatrix(df, label=y, enable_categorical=True)

bst = xgb.train(params, dtrain, num_boost_round=5, evals=[(dtrain, 'train')])
pred = bst.predict(dtrain)

with open('serialized.pkl', 'wb') as f:
    pickle.dump(bst, f)

with open('serialized.pkl', 'rb') as f:
    bst2 = pickle.load(f)
pred2 = bst2.predict(dtrain)

np.testing.assert_almost_equal(pred, pred2)  # this passes

In addition, manually saving the configuration with save_config() and restoring it with load_config() also works:

import pandas as pd
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(seed=0)
x0 = rng.integers(low=0, high=3, size=20)
x1 = rng.integers(low=0, high=5, size=20)
noise = rng.normal(loc=0, scale=0.1, size=20)

df = pd.DataFrame({'x0': x0, 'x1': x1}).astype('category')
X = np.column_stack((x0, x1))
y = (x0 * 10 - 20) + (x1 - 2) + noise

params = {'tree_method': 'gpu_hist',
          'predictor': 'gpu_predictor',
          'enable_experimental_json_serialization': True,
          'max_depth': 6,
          'learning_rate': 1.0}

dtrain = xgb.DMatrix(df, label=y, enable_categorical=True)

bst = xgb.train(params, dtrain, num_boost_round=5, evals=[(dtrain, 'train')])
pred = bst.predict(dtrain)

bst.save_model('serialized.json')
with open('config.json', 'w') as f:
    f.write(bst.save_config())

bst2 = xgb.Booster(model_file='./serialized.json')
with open('config.json', 'r') as f:
    config = f.read()
bst2.load_config(config)

pred2 = bst2.predict(dtrain)

np.testing.assert_almost_equal(pred, pred2)  # this passes

@hcho3
Collaborator Author

hcho3 commented Dec 10, 2020

It also suffices to set predictor='gpu_predictor' on the loaded Booster to obtain the correct predictions:

import pandas as pd
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(seed=0)
x0 = rng.integers(low=0, high=3, size=20)
x1 = rng.integers(low=0, high=5, size=20)
noise = rng.normal(loc=0, scale=0.1, size=20)

df = pd.DataFrame({'x0': x0, 'x1': x1}).astype('category')
X = np.column_stack((x0, x1))
y = (x0 * 10 - 20) + (x1 - 2) + noise

params = {'tree_method': 'gpu_hist',
          'predictor': 'gpu_predictor',
          'enable_experimental_json_serialization': True,
          'max_depth': 6,
          'learning_rate': 1.0}

dtrain = xgb.DMatrix(df, label=y, enable_categorical=True)

bst = xgb.train(params, dtrain, num_boost_round=5, evals=[(dtrain, 'train')])
pred = bst.predict(dtrain)

bst.save_model('serialized.json')

bst2 = xgb.Booster(model_file='./serialized.json')
bst2.set_param({'predictor': 'gpu_predictor'})
pred2 = bst2.predict(dtrain)

np.testing.assert_almost_equal(pred, pred2)

Given a model with categorical splits, we should throw an error when the predictor is not gpu_predictor.
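The proposed check could look roughly like the following plain-Python sketch. The real guard would live in the C++ CPU predictor; the function name and arguments here are illustrative, not XGBoost's actual API.

```python
def check_predictor(predictor: str, has_categorical_splits: bool) -> None:
    # Refuse to run a predictor that cannot handle categorical splits,
    # instead of silently returning wrong predictions.
    if has_categorical_splits and predictor != 'gpu_predictor':
        raise ValueError(
            "This model contains categorical splits, which only "
            "gpu_predictor currently supports. Set "
            "predictor='gpu_predictor' before calling predict().")

check_predictor('gpu_predictor', has_categorical_splits=True)  # OK
try:
    check_predictor('cpu_predictor', has_categorical_splits=True)
except ValueError:
    pass  # error raised, as proposed
```

Failing loudly here would have turned the silent mismatch above into an immediate, actionable error.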

@hcho3 hcho3 changed the title Model with categorical splits fail to preserve after round-trip serialization CPU predictor should throw an error when categorical splits are present Dec 10, 2020
@trivialfis
Member

I will add categorical split support to the CPU predictor.

@trivialfis
Member

Closing in favor of #6503 .
