Initial support for one hot split. #5949

trivialfis · 2020-07-28T06:58:55Z

This PR aims to provide a working pipeline for categorical data, but support is very limited in its current form. To run the tests, one needs to:

Use the Python interface.
Use gbtree.
Use the gpu_hist tree method. Other tree methods are coming.
Use DMatrix; DeviceQuantileDMatrix is not yet supported.
Specify enable_categorical for DMatrix.
Do not use weights.
Use pandas with the categorical feature type.
Set gpu_predictor explicitly.
Use JSON for model persistence.
Specify enable_experimental_json_serialization even if you don't use pickle.
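
Put together, the checklist might look roughly like this from Python. This is only a sketch based on the list above: the helper names `one_hot_split_params` and `train_with_categorical` are made up for illustration, and running the training function assumes a GPU build of XGBoost that includes this PR.

```python
def one_hot_split_params():
    """Training parameters implied by the checklist above."""
    return {
        "booster": "gbtree",
        "tree_method": "gpu_hist",
        "predictor": "gpu_predictor",
        "enable_experimental_json_serialization": True,
    }


def train_with_categorical(df, label, model_path="model.json", num_boost_round=10):
    """Train on a pandas DataFrame whose categorical columns use the 'category' dtype.

    Hypothetical helper: not part of the PR.
    """
    import xgboost as xgb  # imported lazily; needs a GPU build with this PR

    # Plain DMatrix (not DeviceQuantileDMatrix), no weights, enable_categorical set.
    dtrain = xgb.DMatrix(df, label=label, enable_categorical=True)
    booster = xgb.train(one_hot_split_params(), dtrain, num_boost_round=num_boost_round)
    # A .json suffix makes save_model write the JSON model format.
    booster.save_model(model_path)
    return booster
```

The parameters are kept in a separate function so they can be inspected without a GPU.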

Limitations

Support is limited to one-vs-rest categorical splits. Other categorical specializations are coming.
There's no mapping between categorical values and histogram bins, so memory usage might be sub-optimal when categories are sparse.
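
To illustrate the second limitation, here is a small sketch of why sparse category codes can waste histogram memory. It assumes that, without a mapping, the category code itself is used as the bin index; the function names are hypothetical and this is not the code in the PR.

```python
def bins_without_mapping(category_codes):
    """Bins needed if each category code is used directly as a bin index (assumption)."""
    return max(category_codes) + 1 if category_codes else 0


def bins_with_mapping(category_codes):
    """Bins needed if each distinct category were mapped to its own compact bin."""
    return len(set(category_codes))


def one_vs_rest(values, category):
    """Partition row indices into those matching `category` and all the rest."""
    matching = [i for i, v in enumerate(values) if v == category]
    rest = [i for i, v in enumerate(values) if v != category]
    return matching, rest


codes = [0, 3, 1000, 3]
print(bins_without_mapping(codes), bins_with_mapping(codes))
print(one_vs_rest(codes, 3))
```

With only three distinct categories but a largest code of 1000, the unmapped layout reserves far more bins than are actually populated.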

include/xgboost/feature_map.h

python-package/xgboost/data.py

src/common/categorical.h

src/common/quantile.cu

src/tree/gpu_hist/histogram.cu

src/tree/updater_gpu_hist.cu

tests/cpp/tree/gpu_hist/test_evaluate_splits.cu

tests/cpp/predictor/test_predictor.cc

tests/python/testing.py

trivialfis · 2020-07-28T07:46:18Z

@hcho3 Right now I'm reusing the split cond in RegTree for categorical splits. But once we go beyond one-hot splits, the split condition can be a vector containing multiple categories, so a better structure with a JSON schema is required. We need more discussion around this, otherwise the model format might be subject to change.

hcho3 · 2020-07-28T08:01:06Z

@trivialfis We will want to add additional fields to the JSON schema to indicate categorical splits. For example, LightGBM stores decision_type, cat_threshold and cat_boundaries fields. The decision_type[i] tells us whether the i-th internal node is a categorical or a numerical split. The cat_threshold and cat_boundaries together store categories associated with the left child of each categorical split.
https://github.com/dmlc/treelite/blob/7f01e631da8687189473ad6b177ba0615b19496b/src/frontend/lightgbm.cc#L516-L528
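
As a rough illustration of how two such arrays can work together, the sketch below decodes the categories of one split. It assumes that `cat_boundaries` holds offsets into `cat_threshold` and that each `cat_threshold` entry is a 32-bit word of a bitset; the function name is hypothetical, and the linked Treelite code remains the reference for the exact layout.

```python
def categories_for_split(cat_idx, cat_boundaries, cat_threshold):
    """Return the categories stored for the cat_idx-th categorical split.

    Assumes words cat_threshold[cat_boundaries[cat_idx]:cat_boundaries[cat_idx + 1]]
    form a bitset in which bit b of word w stands for category w * 32 + b.
    """
    begin, end = cat_boundaries[cat_idx], cat_boundaries[cat_idx + 1]
    categories = []
    for word_index, word in enumerate(cat_threshold[begin:end]):
        for bit in range(32):
            if (word >> bit) & 1:
                categories.append(word_index * 32 + bit)
    return categories


# Two categorical splits: the first uses one word, the second uses two.
cat_boundaries = [0, 1, 3]
cat_threshold = [0b1010, 0b1, 0b100]
print(categories_for_split(0, cat_boundaries, cat_threshold))
print(categories_for_split(1, cat_boundaries, cat_threshold))
```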

It should not be too difficult to add new fields to the current JSON schema. (I'm assuming that vector with multiple categories will be JSON only and won't support legacy binary serialization.)

trivialfis · 2020-07-28T08:10:59Z

@hcho3

> will be JSON only and won't support legacy binary serialization

You are right. But at the same time we can't break the binary format. For example, we can't add anything to RegTree::Node. Also, before this PR is merged, I think we need to set JSON as the default pickle format, as the Python interface goes through serialization at the end of training to release GPU memory.

> It should not be too difficult to add new fields to the current JSON schema.

I think so. I just want to make sure that we have considered enough different cases.

hcho3 · 2020-07-28T08:26:47Z

Got it. Let's discuss how to add the necessary fields without touching RegTree::Node. One idea is to relegate RegTree::Node to an external-facing interface and convert from RegTree::Node to the real node structure that includes the new fields. This will slow down deserializing binary models.

Another idea is to set the info_ field of RegTree::Node to NaN to indicate a categorical split, and use the payload bits of that NaN to indicate where the extra information can be looked up.
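
A minimal sketch of the NaN-payload idea, working on raw 32-bit patterns. The layout (a quiet NaN with the low 22 mantissa bits used as a lookup index) and all names here are assumptions for illustration, not a proposal for the actual RegTree::Node encoding.

```python
import math
import struct

QUIET_NAN_BITS = 0x7FC00000  # exponent all ones plus the quiet bit
PAYLOAD_MASK = 0x003FFFFF    # remaining 22 mantissa bits carry the index


def encode_index(index):
    """Pack a lookup index into the payload of a float32 quiet NaN."""
    if not 0 <= index <= PAYLOAD_MASK:
        raise ValueError("index does not fit in the NaN payload")
    return QUIET_NAN_BITS | index


def is_categorical(bits):
    """A node is treated as categorical when its float32 bit pattern is a NaN."""
    return (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0


def decode_index(bits):
    """Recover the lookup index from the NaN payload."""
    return bits & PAYLOAD_MASK


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


bits = encode_index(5)
print(math.isnan(bits_to_float(bits)), is_categorical(bits), decode_index(bits))
```

The sketch stays on integer bit patterns because passing a NaN through ordinary floating-point conversions is not guaranteed to preserve its payload.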

tests/cpp/common/test_hist_util.h

codecov-commenter · 2020-07-28T12:01:54Z

Codecov Report

Merging #5949 into master will increase coverage by 0.00%.
The diff coverage is 84.61%.

@@           Coverage Diff           @@
##           master    #5949   +/-   ##
=======================================
  Coverage   78.49%   78.49%           
=======================================
  Files          12       12           
  Lines        3013     3018    +5     
=======================================
+ Hits         2365     2369    +4     
- Misses        648      649    +1

| Impacted Files | Coverage Δ | |
|---|---|---|
| python-package/xgboost/core.py | `77.73% <ø> (ø)` | |
| python-package/xgboost/data.py | `58.56% <84.61%> (+0.25%)` | ⬆️ |
| python-package/xgboost/dask.py | `76.38% <0.00%> (ø)` | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8599f87...a4795d1.

trivialfis · 2020-07-30T04:27:39Z

@hcho3 I changed the categories in the tree into a bitfield.
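
For readers unfamiliar with the representation, a bitfield of categories can be sketched as below. The 32-bit word size, the bit ordering and the function names are assumptions chosen for illustration; the layout used in the PR may differ.

```python
BITS_PER_WORD = 32


def make_bitfield(categories):
    """Build a list of 32-bit words with one bit set per matching category."""
    n_words = max(categories) // BITS_PER_WORD + 1 if categories else 0
    words = [0] * n_words
    for cat in categories:
        words[cat // BITS_PER_WORD] |= 1 << (cat % BITS_PER_WORD)
    return words


def contains(words, category):
    """Check whether `category` is one of the split's matching categories."""
    word_index = category // BITS_PER_WORD
    if category < 0 or word_index >= len(words):
        return False
    return bool((words[word_index] >> (category % BITS_PER_WORD)) & 1)


one_hot = make_bitfield([7])      # a one-hot split sets exactly one bit
multi = make_bitfield([2, 35])    # a later multi-category split sets several
print(one_hot, multi, contains(multi, 35), contains(multi, 3))
```

A one-hot split is then just the special case where a single bit is set, so the same structure can later hold several categories per split.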

trivialfis · 2020-08-09T21:11:35Z

I reverted the change to the min value to reduce the size of this PR.

hcho3

The general approach looks good. I am looking forward to reviewing specifics once this PR gets broken up into smaller PRs.

include/xgboost/span.h

src/tree/tree_model.cc

hcho3 · 2020-08-19T01:01:47Z

src/tree/updater_gpu_hist.cu

+
+    auto is_cat = candidate.split.is_cat;
+    if (is_cat) {
+      auto cat = common::AsCat(candidate.split.fvalue);


Note to self: in the one-hot encoded setting, there is only one matching category in every categorical split. However, the split_categories_ structure can later store multiple matching categories per split.

Another reminder to self: Treelite must support the JSON format of XGBoost.

trivialfis · 2020-10-10T09:34:04Z

All merged.


trivialfis force-pushed the categorical-split branch from 81e2a0e to 49bc2ab Compare July 29, 2020 18:16

trivialfis force-pushed the categorical-split branch 2 times, most recently from bbb81dd to d8ac122 Compare August 6, 2020 14:53

trivialfis self-assigned this Aug 9, 2020

trivialfis added the status: need review label Aug 9, 2020

trivialfis mentioned this pull request Aug 18, 2020

Expand categorical node. #6028

Merged

trivialfis removed the status: need review label Aug 18, 2020

trivialfis mentioned this pull request Aug 18, 2020

[Roadmap] 1.3.0 Roadmap #6031

Closed

14 tasks

hcho3 approved these changes Aug 19, 2020

trivialfis force-pushed the categorical-split branch from a4795d1 to d86f917 Compare August 25, 2020 18:53

trivialfis force-pushed the categorical-split branch from d86f917 to cb77e3a Compare September 20, 2020 11:25

Initial support for one hot categorical split.

18f734a

trivialfis force-pushed the categorical-split branch from cb77e3a to 18f734a Compare September 24, 2020 19:33

trivialfis closed this Oct 10, 2020

trivialfis deleted the categorical-split branch October 10, 2020 09:34
