New feature? Java binding for categorical feature support #8727

Open
shadyelgewily-slimstock opened this issue Jan 26, 2023 · 2 comments

shadyelgewily-slimstock commented Jan 26, 2023

We use XGBoost through the Java binding (outside of Spark) and have a strong appetite for categorical feature support, where splits partition the categories of a feature into subsets, as opposed to one-hot encoding, where XGBoost considers each category separately. The release notes for v1.6 state:

"In the future, we will continue to improve categorical data support with new features and optimizations. Also, we are looking forward to bringing the feature beyond Python binding, contributions and feedback are welcomed! Lastly, as a result of experimental status, the behavior might be subject to change, especially the default value of related hyper-parameters."
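
To make the distinction between partition-based splits and one-hot encoding concrete, here is a minimal Java sketch. It is not XGBoost's split-finding code; it only shows the shape of the two kinds of split condition and how many candidate splits each allows for a feature with k categories. The class and method names (`CategoricalSplitSketch`, `goesLeft`, and so on) are hypothetical and exist only for this illustration.

```java
import java.util.BitSet;

// Illustration only: hypothetical names, not part of XGBoost or xgboost4j.
public class CategoricalSplitSketch {

    /** A row goes left when its category code is in the chosen subset. */
    public static boolean goesLeft(BitSet leftCategories, int categoryCode) {
        return leftCategories.get(categoryCode);
    }

    /** One-hot style split: a single category on the left, all others on the right. */
    public static BitSet oneHotSplit(int category) {
        BitSet left = new BitSet();
        left.set(category);
        return left;
    }

    /** Partition style split: an arbitrary subset of categories on the left. */
    public static BitSet partitionSplit(int... categories) {
        BitSet left = new BitSet();
        for (int c : categories) {
            left.set(c);
        }
        return left;
    }

    /** Candidate one-vs-rest splits for k categories (one per category). */
    public static long oneHotCandidates(int k) {
        return k;
    }

    /** Distinct ways to divide k categories into two non-empty groups: 2^(k-1) - 1. */
    public static long partitionCandidates(int k) {
        return (1L << (k - 1)) - 1;
    }

    public static void main(String[] args) {
        // Categories are assumed to be encoded as integer codes 0..k-1.
        BitSet oneHot = oneHotSplit(1);           // {1} vs. everything else
        BitSet partition = partitionSplit(0, 2);  // {0, 2} vs. everything else

        System.out.println("one-hot split, category 2 goes left:   " + goesLeft(oneHot, 2));
        System.out.println("partition split, category 2 goes left: " + goesLeft(partition, 2));
        System.out.println("k=4 one-hot candidates:   " + oneHotCandidates(4));
        System.out.println("k=4 partition candidates: " + partitionCandidates(4));
    }
}
```

A one-hot split is the special case of a partition split whose left subset holds exactly one category, which is why a single partition split can express groupings that would otherwise take several one-hot splits.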

I'm raising this issue because I'm wondering what the status is of the Java binding for the experimental parameters related to categorical features. Concretely:

  • Is there already a way to communicate to the native C code which columns in the DMatrix should be considered categorical and which numeric?
  • Provided that we have some way to encode the feature type in the DMatrix or elsewhere, how do we communicate that to the C binding? (There has to be some way to achieve this, since the Python binding already exists.)
  • Is there an appetite among the XGBoost maintainers to release such a Java binding in a stable version within, say, the next 3-6 months, provided we contribute a PR that satisfies the general requirements for an XGBoost PR?

It seems that some work has already been done on the first two items (#7966), so perhaps the more general question is: which components are still required to start using categorical features?

I see that this feature request is on the roadmap, and we could contribute to help the process move forward.

@trivialfis
Member

> Is there already a way to communicate to the native C code which columns in the DMatrix should be considered as categorical, and which as numeric?

Yes, as referenced in your description: #7966.

> Is there an appetite at XGboost maintainers to release such a Java binding in a stable version any time in the next 3-6 months say?

Yes, that would be 2.0 if all goes well.

> Which components are still required to start using categorical features?

For the Java interface, I think we can already get some small examples running, but we haven't been able to prioritize it yet. Setting the feature_type and using one of the supported tree_methods are all it needs. However, my understanding is that most users prefer the Scala binding over the Java binding, and we need to extend the feature info setter/getter to Scala and have appropriate integration with the Spark estimator interface.
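
As a rough sketch of what "feature_type plus a supported tree_method" could look like on the Java side, the snippet below builds one type code per column and a parameter map. Several things here are assumptions, not confirmed by this thread: the codes `"c"` (categorical) and `"q"` (numeric) are the ones the Python binding uses and are assumed to carry over; `hist` is assumed to be among the supported tree methods; and the xgboost4j calls in the trailing comment (`setFeatureTypes`, `XGBoost.train`) are shown as comments only, on the assumption that #7966 exposes such a setter. The class `FeatureTypeSketch` and its methods are hypothetical helpers.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustration only: hypothetical helper, depends on nothing but the JDK.
public class FeatureTypeSketch {

    /** Builds one type code per column: "c" for categorical, "q" for numeric (assumed codes). */
    public static String[] featureTypes(int numColumns, Set<Integer> categoricalColumns) {
        String[] types = new String[numColumns];
        for (int i = 0; i < numColumns; i++) {
            types[i] = categoricalColumns.contains(i) ? "c" : "q";
        }
        return types;
    }

    /** Training parameters with a tree method assumed to support categorical splits. */
    public static Map<String, Object> trainingParams() {
        Map<String, Object> params = new HashMap<>();
        params.put("tree_method", "hist");
        params.put("objective", "reg:squarederror");
        return params;
    }

    public static void main(String[] args) {
        // Columns 1 and 3 are categorical; their values are assumed to be integer codes.
        String[] types = featureTypes(4, Set.of(1, 3));
        Map<String, Object> params = trainingParams();

        System.out.println(Arrays.toString(types)); // prints [q, c, q, c]
        System.out.println(params.get("tree_method")); // prints hist

        // Assumed hand-off to xgboost4j (not executed here, signatures unverified):
        //   dtrain.setFeatureTypes(types);
        //   Booster booster = XGBoost.train(dtrain, params, 10, new HashMap<>(), null, null);
    }
}
```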

@wbo4958
Contributor

wbo4958 commented Jan 30, 2023

Please see this comment: #7802 (comment)
