Categorical data support. #6503

Closed
62 of 67 tasks
trivialfis opened this issue Dec 15, 2020 · 5 comments

@trivialfis
Member

trivialfis commented Dec 15, 2020

We started initial experimental support for categorical data in xgboost 1.3. The initial target is to make the one-hot-encoding-based tree split available for all xgboost components. Here is a list of to-do items:

One Hot

Partitioning Based (LGB)

Feature items can only be marked as completed if there's a corresponding (unit) test. Please let me know if there are missing items or if you want to help accelerate progress. ;-)
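For readers unfamiliar with the term, here is a minimal sketch of what a one-hot-based categorical split means: each candidate split sends one category to the left child and every other category to the right. The function names, the gain formula (the usual second-order gain with an L2 term `lam`), and the data are assumptions made for illustration; this is not the xgboost implementation.

```python
from collections import defaultdict


def leaf_score(g, h, lam):
    # Second-order score of a leaf with gradient sum g and hessian sum h.
    return g * g / (h + lam)


def best_onehot_split(categories, grads, hess, lam=1.0):
    """Try every 'x == c' vs 'x != c' split; return (best_category, gain)."""
    g_sum = defaultdict(float)
    h_sum = defaultdict(float)
    for c, g, h in zip(categories, grads, hess):
        g_sum[c] += g
        h_sum[c] += h

    g_total = sum(g_sum.values())
    h_total = sum(h_sum.values())
    parent = leaf_score(g_total, h_total, lam)

    best_cat, best_gain = None, 0.0
    for c in g_sum:
        g_left, h_left = g_sum[c], h_sum[c]
        gain = 0.5 * (
            leaf_score(g_left, h_left, lam)
            + leaf_score(g_total - g_left, h_total - h_left, lam)
            - parent
        )
        if gain > best_gain:
            best_cat, best_gain = c, gain
    return best_cat, best_gain


# Hypothetical toy data: category "a" has gradients of the opposite sign.
cats = ["a", "a", "b", "b", "c", "c"]
grads = [-1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
hess = [1.0] * 6
best_cat, best_gain = best_onehot_split(cats, grads, hess)
```

With many categories this evaluates one candidate per category, which is cheap but can only isolate a single category per split.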

@trivialfis
Member Author

As of posting, I have prototypes of a new GPU evaluation function and of refactored CPU hist/approx/local tree methods. Dart support is on the way.

@mayer79
Contributor

mayer79 commented Jan 21, 2021

Great idea! Will this be compatible with

  • interaction constraints?
  • monotonic constraints (probably not)?
  • SHAP contribution calculations?

@hcho3
Collaborator

hcho3 commented Jan 21, 2021

interaction constraints?

Yes, since we only care whether two features appear together or not.

monotonic constraints

No, since categorical values cannot be sorted in increasing order.

SHAP contribution calculations

Probably, given that LightGBM also supports computing SHAP with categorical features. But we need to test it.
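To illustrate the point about interaction constraints, here is a rough sketch of a constraint check that looks only at feature indices, so it behaves the same whether a feature is numerical or categorical. The function name and the simplified semantics (a candidate is allowed if some constraint group contains it together with every feature already used on the path) are assumptions for illustration, not xgboost's actual code.

```python
def allowed_features(constraints, path_features, n_features):
    """Return the feature indices permitted for the next split on this path.

    constraints: list of lists of feature indices allowed to interact.
    path_features: feature indices already used between the root and this node.
    """
    used = set(path_features)
    if not used:
        # Nothing has been split on yet, so any feature may start the path.
        return set(range(n_features))

    allowed = set(used)  # Re-splitting on an already used feature is fine.
    for group in constraints:
        group = set(group)
        if used <= group:
            allowed |= group
    return allowed


# Hypothetical constraint groups over four features; feature 1 could be
# categorical and the check would not change.
groups = [[0, 1], [1, 2, 3]]
after_feature_1 = allowed_features(groups, [1], 4)
after_feature_0 = allowed_features(groups, [0], 4)
```

Nothing in the check inspects feature values, which is why the feature type does not matter here.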

@trivialfis
Member Author

Status update:

Here are the few big items remaining for feature completeness:

  • GPU evaluation function rewrite to have better performance.
  • Add partition-based categorical split for the rewritten evaluation function.
  • Extract the partitioner from hist and implement categorical data support in it.
  • Add categorical feature specific regularization parameters.
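As a rough illustration of the partition-based split mentioned in the list above, the sketch below follows the approach commonly attributed to LightGBM: order the categories by a gradient statistic, then scan prefixes of that order as if the feature were numerical. The function names, the ordering key `g / (h + lam)`, and the gain formula are assumptions for illustration, not the code that went into xgboost.

```python
from collections import defaultdict


def leaf_score(g, h, lam):
    # Second-order score of a leaf with gradient sum g and hessian sum h.
    return g * g / (h + lam)


def best_partition_split(categories, grads, hess, lam=1.0):
    """Return (left_categories, gain) for the best prefix of the sorted order."""
    g_sum = defaultdict(float)
    h_sum = defaultdict(float)
    for c, g, h in zip(categories, grads, hess):
        g_sum[c] += g
        h_sum[c] += h

    # Sort categories by their regularized gradient/hessian ratio.
    order = sorted(g_sum, key=lambda c: g_sum[c] / (h_sum[c] + lam))

    g_total = sum(g_sum.values())
    h_total = sum(h_sum.values())
    parent = leaf_score(g_total, h_total, lam)

    best_left, best_gain = frozenset(), 0.0
    g_left = h_left = 0.0
    # The last prefix would send everything left, so it is skipped.
    for i, c in enumerate(order[:-1]):
        g_left += g_sum[c]
        h_left += h_sum[c]
        gain = 0.5 * (
            leaf_score(g_left, h_left, lam)
            + leaf_score(g_total - g_left, h_total - h_left, lam)
            - parent
        )
        if gain > best_gain:
            best_left, best_gain = frozenset(order[: i + 1]), gain
    return best_left, best_gain


# Hypothetical toy data: "a" and "b" pull one way, "c" and "d" the other.
cats = ["a", "c", "b", "d", "a", "c", "b", "d"]
grads = [-1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0]
hess = [1.0] * 8
left_set, gain = best_partition_split(cats, grads, hess)
```

Unlike a one-hot split, a single split here can group several categories on one side, at the cost of a sort per feature per node.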

@trivialfis
Member Author

Closing; initial support is complete. We will continue to add optimizations and new features in the future.

trivialfis unpinned this issue Mar 14, 2022
trivialfis moved this from 1.6 In Progress to 1.6 Done in 2.0 Roadmap Mar 14, 2022
vruusmann added a commit to jpmml/jpmml-xgboost that referenced this issue Mar 27, 2022