Support more input types for categorical data. #7220

Merged
trivialfis merged 6 commits into dmlc:master from the cat-array-inputs branch on Sep 16, 2021

trivialfis
Member

@trivialfis trivialfis commented Sep 12, 2021

  • Shorten the type name from "categorical" to "c".
  • Tests for np/cp array, scipy csr/csc/coo and uri.
  • Specify the type for feature info.

Remaining type:

  • pydatatable: XGBoost throws an error when categorical data is specified for this input type.
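
As a rough illustration of the points above, here is a minimal sketch of marking one column of a plain array as categorical with the short type name "c". The helper names (`encode_categories`, `make_feature_types`, `to_dmatrix`) and the sample data are hypothetical and not part of this PR. The sketch assumes an XGBoost build that includes this change, and it assumes categories are passed as non-negative integer codes. The DMatrix step is skipped when numpy or xgboost is not installed.

```python
def encode_categories(values):
    """Map category labels to integer codes 0..n-1 in order of first appearance."""
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


def make_feature_types(n_features, categorical_columns):
    """Return "c" for categorical columns and "q" (quantitative) for the rest."""
    categorical = set(categorical_columns)
    return ["c" if i in categorical else "q" for i in range(n_features)]


def to_dmatrix(rows, feature_types):
    """Build a DMatrix from a dense array; imports are local so the helpers
    above stay usable without numpy/xgboost."""
    import numpy as np
    import xgboost as xgb

    data = np.asarray(rows, dtype=np.float32)
    return xgb.DMatrix(data, feature_types=feature_types, enable_categorical=True)


# Hypothetical sample data: one categorical column and one numerical column.
colors = ["red", "green", "red", "blue"]
weights = [1.5, 2.0, 0.5, 3.25]

codes, mapping = encode_categories(colors)
rows = [[code, weight] for code, weight in zip(codes, weights)]
feature_types = make_feature_types(2, categorical_columns=[0])

try:
    dtrain = to_dmatrix(rows, feature_types)
    print(dtrain.feature_types)
except ImportError:
    print("numpy/xgboost not installed; feature_types would be", feature_types)
```

According to the list above, the same `feature_types` argument should also apply when the data is a cupy array or a scipy csr/csc/coo matrix; that path is not exercised in this sketch.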

codecov-commenter commented Sep 13, 2021

Codecov Report

Merging #7220 (7b629f3) into master (804b2ac) will decrease coverage by 0.02%.
The diff coverage is 87.87%.

@@            Coverage Diff             @@
##           master    #7220      +/-   ##
==========================================
- Coverage   82.63%   82.60%   -0.03%     
==========================================
  Files          13       13              
  Lines        4019     4024       +5     
==========================================
+ Hits         3321     3324       +3     
- Misses        698      700       +2     

Impacted Files                    Coverage Δ
python-package/xgboost/data.py    69.64% <86.20%> (-0.10%) ⬇️
python-package/xgboost/core.py    84.35% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Set types for features. When `enable_categorical` is set to `True`, string
"c" represents categorical data type. For numerical data, it can be one of
the following:

Member

I don't understand what happens when I change the numerical types. Is there more documentation somewhere?

Member Author

Not at the moment; it's only used for the text model dump. Actually, I have been wondering whether it's possible to remove them along with the fmap parameter (they are the same thing).
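
For readers unfamiliar with the fmap parameter mentioned here, the following is a small sketch of writing a feature map file of the kind a text model dump can consume. It assumes the commonly documented layout of one whitespace-separated `<feature id> <feature name> <type>` entry per line, with numerical types such as "q", "i", or "int". The helper name `write_fmap` and the feature names are hypothetical.

```python
import io


def write_fmap(names, types, out):
    """Write one '<id>\t<name>\t<type>' line per feature to a text stream."""
    if len(names) != len(types):
        raise ValueError("names and types must have the same length")
    for feature_id, (name, feature_type) in enumerate(zip(names, types)):
        # Fields are whitespace-separated, so a name must not contain whitespace.
        if any(ch.isspace() for ch in name):
            raise ValueError(f"feature name contains whitespace: {name!r}")
        out.write(f"{feature_id}\t{name}\t{feature_type}\n")


# Hypothetical features: quantitative, indicator, and integer.
buffer = io.StringIO()
write_fmap(["age", "is_member", "visits"], ["q", "i", "int"], buffer)
fmap_text = buffer.getvalue()
print(fmap_text, end="")
```

Saved to a file, this text could then be passed as the `fmap` argument of `Booster.dump_model` or `Booster.get_dump`, where the type only affects how splits are printed in the dump.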

Member Author

I removed the doc for numerical data. Let's just keep it as it is for now.

* Shorten the type name from "categorical" to "c".
* Tests for np/cp array and scipy csr/csc/coo.
* Specify the type for feature info.
@trivialfis
Member Author

That's interesting (and unrelated to this PR):

[2021-09-15T11:52:48.240Z] tests/python-gpu/test_from_cudf.py::TestFromColumnar::test_cudf_training_simple_dmatrix terminate called after throwing an instance of 'thrust::system::system_error'

[2021-09-15T11:52:48.240Z]   what():  device free failed: cudaErrorLaunchTimeout: the launch timed out and was terminated

[2021-09-15T11:52:48.240Z] Fatal Python error: Aborted

@trivialfis trivialfis merged commit 0ed979b into dmlc:master Sep 16, 2021
@trivialfis trivialfis deleted the cat-array-inputs branch September 16, 2021 12:39