Handle missing values in dataframe with category dtype. #7331

Merged
trivialfis merged 16 commits into dmlc:master on Oct 27, 2021

trivialfis
Member

  • Replace -1 in pandas/cudf initializer.
  • Unify IsValid functor.
  • Mimic pandas data handling in cuDF glue code.
  • Check invalid categories.
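
As a rough illustration of the first bullet: pandas encodes a missing entry in a categorical column with the sentinel code -1. The sketch below is not the code from this PR. `codes_with_nan` is a hypothetical helper name, and it only shows the general idea of turning that sentinel into NaN so it can be treated as a missing value rather than as an invalid category.

```python
import numpy as np
import pandas as pd


def codes_with_nan(column: pd.Series) -> np.ndarray:
    """Hypothetical helper (not XGBoost's actual code)."""
    if not isinstance(column.dtype, pd.CategoricalDtype):
        raise TypeError("expected a column with category dtype")
    # pandas marks a missing category with the sentinel code -1.
    codes = column.cat.codes.to_numpy().astype(np.float32)
    # Replace the sentinel with NaN so that it reads as a missing value
    # instead of an invalid (negative) category index.
    codes[codes == -1] = np.nan
    return codes


column = pd.Series(["a", None, "b", "a"], dtype="category")
encoded = codes_with_nan(column)
```

Casting to a floating-point type first is what makes NaN representable; the integer codes returned by pandas cannot hold it.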

Closes #7329.

Depending on the difficulty of backporting, this can be part of the next patch release (1.5.1).

@trivialfis
Member Author

trivialfis commented Oct 21, 2021

This PR adds some more tests. We have a number of different cases:

  • with and without missing values.
  • with cuDF and pandas.
  • with and without weights.
  • with and without quantized DMatrix.
  • with and without external memory.

The C++ tests now cover both DMatrix and DeviceQuantileDMatrix (DDM), weighted and unweighted. Missing values with cuDF and pandas are tested in Python. I didn't expect this complexity when I created the interface; I should have been more thorough.
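
To give a sense of how quickly these cases multiply, here is a minimal sketch that enumerates the five axes listed above as a full cross product. The axis names and the `enumerate_cases` helper are made up for illustration. The actual tests in this PR are split between C++ and Python and do not necessarily cover every combination.

```python
from itertools import product

# Hypothetical axis names mirroring the list of cases above.
AXES = {
    "missing": (False, True),
    "frame": ("pandas", "cudf"),
    "weighted": (False, True),
    "quantized": (False, True),
    "external_memory": (False, True),
}


def enumerate_cases() -> list:
    """Return one dict per combination of the axes in AXES."""
    keys = list(AXES)
    return [dict(zip(keys, combo)) for combo in product(*AXES.values())]


cases = enumerate_cases()
print(len(cases))  # 2 ** 5 combinations
```

In a pytest suite, the same axes could instead be expressed with stacked `pytest.mark.parametrize` decorators, skipping combinations that are not supported.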

Also, I should expose the nnz of DMatrix publicly in upcoming PRs to allow better testing in Python.

@trivialfis trivialfis added this to 1.5.1 in 2.0 Roadmap Oct 21, 2021
@trivialfis
Member Author

I will test Dask in a separate PR.

Review thread on python-package/xgboost/core.py (outdated, resolved)
Review thread on tests/cpp/common/test_hist_util.cu (outdated, resolved)
@codecov-commenter

codecov-commenter commented Oct 22, 2021

Codecov Report

Merging #7331 (1e9fd1c) into master (fbb0dc4) will decrease coverage by 0.23%.
The diff coverage is 44.28%.

@@            Coverage Diff             @@
##           master    #7331      +/-   ##
==========================================
- Coverage   83.68%   83.44%   -0.24%     
==========================================
  Files          13       13              
  Lines        3885     3920      +35     
==========================================
+ Hits         3251     3271      +20     
- Misses        634      649      +15     
Impacted Files                       Coverage Δ
python-package/xgboost/data.py       68.18% <39.65%> (-1.10%) ⬇️
python-package/xgboost/core.py       84.48% <66.66%> (-0.08%) ⬇️
python-package/xgboost/dask.py       82.79% <0.00%> (+0.04%) ⬆️
python-package/xgboost/tracker.py    86.55% <0.00%> (+0.22%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@trivialfis
Member Author

@hcho3 Could you please take another look?

@trivialfis trivialfis merged commit ac9bfaa into dmlc:master Oct 27, 2021
2.0 Roadmap automation moved this from 1.5.1 to 1.6 Done Oct 27, 2021
@trivialfis trivialfis deleted the cat-invalid-values branch October 27, 2021 19:33
@trivialfis trivialfis moved this from 1.6 Done to 1.5.1 in 2.0 Roadmap Oct 27, 2021
trivialfis added a commit to trivialfis/xgboost that referenced this pull request Nov 10, 2021
* Replace -1 in pandas initializer.
* Unify `IsValid` functor.
* Mimic pandas data handling in cuDF glue code.
* Check invalid categories.
* Fix DDM sketching.
trivialfis added a commit that referenced this pull request Nov 10, 2021
…7331) (#7413)

* Handle missing values in dataframe with category dtype. (#7331)

* Replace -1 in pandas initializer.
* Unify `IsValid` functor.
* Mimic pandas data handling in cuDF glue code.
* Check invalid categories.
* Fix DDM sketching.

* Fix pick error.
Successfully merging this pull request may close these issues.

"Missing" category not supported in pandas Dataframe
3 participants