support sequence interface like lightgbm.Sequence #10091
Is there any use case where numpy/pandas and the like are not a better alternative?
For time-series data such as stock exchange data, the goal is to predict the next several days' return. Say there are 100 features and the data is rolled with a 20-day window. To fit a `DMatrix`, we have to shift the features 20 times, so memory usage becomes 20x even though most of the data is duplicated. If I could define a custom sequence, that duplication could be avoided. BTW, please do me a favor and check #9625.
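The memory blow-up described above can be sketched in plain numpy (the sizes here are hypothetical, just to match the 100-feature / 20-day example): materializing one shifted copy per lag costs roughly `window` times the original memory, whereas `np.lib.stride_tricks.sliding_window_view` exposes the same rolled rows as a zero-copy view over the original buffer.

```python
import numpy as np

# Hypothetical sizes matching the example above: 100 features, 20-day window.
n_days, n_features, window = 1000, 100, 20
data = np.random.default_rng(0).standard_normal((n_days, n_features)).astype("float32")

# Naive approach: materialize one shifted copy per lag -> ~20x the memory.
shifted = np.concatenate(
    [data[i : n_days - window + 1 + i] for i in range(window)], axis=1
)

# View-based alternative: a sliding window over the same buffer, no copy made.
views = np.lib.stride_tricks.sliding_window_view(data, window, axis=0)

print(shifted.nbytes / data.nbytes)  # roughly 20x
```

The view holds the same values as the shifted copies (`views[:, :, i]` equals the i-th shifted block) while sharing memory with `data`, which is exactly the property a sequence-style `DMatrix` input would need.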
Currently, one can consume data in batches by using the callback-based iterator. I took a quick look into LGB, which implements the `Sequence` interface. See https://github.com/dmlc/xgboost/blob/master/demo/guide-python/quantile_data_iterator.py
Sure, will look into it.
I have looked into this demo code. It looks like predictions from a `QuantileDMatrix` built on new data do not match those from a `DMatrix`. Reproduction:

```python
# run in version 1.7.6
import numpy as np
import pandas as pd
import xgboost as xgb
np.random.seed(42)
n_groups = 100
group_size = 2000
n_features = 10
n_levels = 20
rows = n_groups * group_size
features = pd.DataFrame(np.random.randn(rows, n_features).astype('float32'), columns=[f'f{i:03d}' for i in range(n_features)])
qids = pd.Series(np.arange(rows, dtype='int') // group_size)
labels = pd.Series(np.random.randn(rows).astype('float32')).groupby(qids).rank(method='first').sub(1) // (group_size // n_levels)
weights = np.arange(1, 101)
# dmatrix = xgb.DMatrix(features, label=labels, qid=qids)
qmatrix = xgb.QuantileDMatrix(features, label=labels, qid=qids)
sub_rows = 10000
sub_qmatrix = xgb.QuantileDMatrix(features.tail(sub_rows))
sub_dmatrix = xgb.DMatrix(features.tail(sub_rows))
params = {
'objective': 'rank:pairwise',
# 'objective': 'multi:softprob',
# 'num_class': n_levels,
'base_score': 0.5,
# 'lambdarank_pair_method': 'mean',
# 'lambdarank_num_pair_per_sample': 1,
'booster': 'gbtree',
'tree_method': 'hist',
'verbosity': 1,
# 'seed': 42,
'learning_rate': 0.1,
'max_depth': 6,
'gamma': 1,
'min_child_weight': 4,
'subsample': 0.9,
'colsample_bytree': 0.7,
'nthread': 20,
'reg_lambda': 1,
'reg_alpha': 1,
'eval_metric': ['ndcg@100', 'ndcg@500', 'ndcg@1000'],
}
booster = xgb.train(params, qmatrix, 100, verbose_eval=10, evals=[(qmatrix, 'train')])
preds_d = booster.predict(sub_dmatrix)
preds_q = booster.predict(sub_qmatrix)
preds_o = booster.predict(qmatrix)[-sub_rows:]
assert np.allclose(preds_d, preds_q)  # False: this assert fails
assert np.allclose(preds_o, preds_q)  # False
assert np.allclose(preds_o, preds_d)  # True
```

The script above raises an assertion error. So if one trains a booster with a `QuantileDMatrix`, predicting on a `QuantileDMatrix` built from a subset of the same data does not match predicting on a plain `DMatrix` of those rows.
Is it possible to support a sequence interface (an object with `__getitem__` and `__len__`) in `DMatrix` without copying the data?