Add `max_cat_threshold` to GPU and handle missing cat values. #8212

trivialfis · 2022-08-30T20:29:35Z

related: #8193

Port the max_cat_threshold parameter to GPU.
Use a 2-pass scan on the GPU instead of a pseudo-bin, with missing values handled.
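
For readers unfamiliar with the approach, here is a minimal CPU-side sketch of a two-pass scan over categories that are assumed to be already sorted by their gradient statistics. It is not the kernel added in this PR: `GradStats`, `SplitCandidate`, and `TwoPassScan` are hypothetical names, the score omits regularisation, and stopping each pass one bin short of the end is an assumption based on the edge case discussed later in this thread. The unscanned side is computed as parent minus scanned, so missing values land on that side in each pass, and `max_cat_threshold` caps how many categories a pass may consume.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical gradient/hessian pair for illustration; not XGBoost's own type.
struct GradStats {
  double grad;
  double hess;
};

inline GradStats operator+(GradStats a, GradStats b) { return {a.grad + b.grad, a.hess + b.hess}; }
inline GradStats operator-(GradStats a, GradStats b) { return {a.grad - b.grad, a.hess - b.hess}; }

// Simplified score G^2 / H; regularisation terms are deliberately left out.
inline double Score(GradStats s) { return s.hess > 0.0 ? s.grad * s.grad / s.hess : 0.0; }

struct SplitCandidate {
  double loss_chg;        // gain over the unsplit parent
  std::size_t n_scanned;  // categories consumed by the winning pass
  bool forward;           // true if the winning pass ran front to back
  GradStats scanned;      // sum over the scanned categories
  GradStats rest;         // everything else, including missing values
};

// sorted_bins: per-category sums, assumed pre-sorted; parent: node total including missing.
SplitCandidate TwoPassScan(std::vector<GradStats> const& sorted_bins, GradStats parent,
                           std::size_t max_cat_threshold) {
  assert(!sorted_bins.empty());
  std::size_t const n = sorted_bins.size();
  // Assumption: never consume the final bin of a pass, so the other side
  // always holds at least one observed category besides the missing values.
  std::size_t const limit = std::min(max_cat_threshold, n - 1);
  SplitCandidate best{0.0, 0, true, GradStats{0.0, 0.0}, parent};
  double const parent_score = Score(parent);

  for (int pass = 0; pass < 2; ++pass) {
    bool const forward = (pass == 0);
    GradStats acc{0.0, 0.0};
    for (std::size_t i = 0; i < limit; ++i) {
      acc = acc + sorted_bins[forward ? i : n - 1 - i];
      GradStats const rest = parent - acc;  // remaining categories plus missing
      double const chg = Score(acc) + Score(rest) - parent_score;
      if (chg > best.loss_chg) {
        best = SplitCandidate{chg, i + 1, forward, acc, rest};
      }
    }
  }
  return best;
}
```

Whichever pass wins, `scanned` and `rest` add up to the parent sum, which is the same kind of invariant the hessian check in the test snippet further down asserts.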

RAMitchell

This PR is redoing #8193?

One problem here is that the missing value reduction is still being calculated in the constructor of `EvaluateSplitAgent` but not used. Ideally I think we want numerical and one-hot splits all doing the same thing (although not necessary for this PR). This evaluate split kernel is quite delicate in terms of performance. If you change everything to do forward and backward passes you might find it slows down significantly or even speeds up. Epsilon is a good test case for numerical splits.

If we merge this I think you should take the edge case tests from #8193, although I reversed the split direction in that PR so they need to be inverted. Those test cases guard against the bugs I saw in the categorical airline dataset.

trivialfis · 2022-09-05T09:26:34Z

> This PR is redoing #8193?

In a slightly different way.

> One problem here is that the missing value reduction is still being calculated in the constructor of `EvaluateSplitAgent` but not used

Yes, that's a problem. I can skip that by splitting up the Agent class.

> If we merge this I think you should take the edge case tests from

The two approaches are not entirely the same. Using a 2-pass scan helps with an edge case where we need to avoid including the last bin in the split calculation. We would like to avoid splitting only on missing values.

RAMitchell · 2022-09-05T09:44:43Z

Splitting only on missing values seems valid.
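
To make the two positions concrete, the sketch below compares a split that isolates only the missing values with one that stops a category short. Everything here is illustrative: `GradStats`, `MissingSum`, and `SplitGain` are hypothetical helpers, the gain is the unregularised G^2 / H form, and the numbers in the note that follows are made up rather than taken from the PR's tests or the airline dataset.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical gradient/hessian pair for illustration; not XGBoost's own type.
struct GradStats {
  double grad;
  double hess;
};

// Simplified score G^2 / H; regularisation terms are deliberately left out.
inline double Score(GradStats s) { return s.hess > 0.0 ? s.grad * s.grad / s.hess : 0.0; }

// Missing-value statistics are whatever the parent holds beyond the observed bins.
GradStats MissingSum(GradStats parent, std::vector<GradStats> const& bins) {
  GradStats missing = parent;
  for (GradStats const& b : bins) {
    missing.grad -= b.grad;
    missing.hess -= b.hess;
  }
  return missing;
}

// Gain when the first n_scanned bins form one side and the remaining bins
// plus the missing values form the other. n_scanned == bins.size() is the
// missing-only split.
double SplitGain(std::vector<GradStats> const& bins, GradStats parent, std::size_t n_scanned) {
  assert(n_scanned <= bins.size());
  GradStats scanned{0.0, 0.0};
  for (std::size_t i = 0; i < n_scanned; ++i) {
    scanned.grad += bins[i].grad;
    scanned.hess += bins[i].hess;
  }
  GradStats const rest{parent.grad - scanned.grad, parent.hess - scanned.hess};
  return Score(scanned) + Score(rest) - Score(parent);
}
```

With three categories of (grad, hess) = (1, 1) each and a parent sum of (0, 6), `MissingSum` gives (-3, 3). `SplitGain(bins, parent, 3)`, which leaves the missing values alone on one side, evaluates to 6, while `SplitGain(bins, parent, 2)`, which keeps one category with the missing values, evaluates to 3. In this constructed case the missing-only split has the larger gain; whether it is numerically safe on the GPU is a separate question.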

trivialfis · 2022-09-06T09:19:39Z

Copied the tests from #8193 and added some checks for default direction.

RAMitchell · 2022-09-06T10:07:59Z

tests/cpp/tree/gpu_hist/test_evaluate_splits.cu

+    EXPECT_FLOAT_EQ(result.left_sum.GetHess() + result.right_sum.GetHess(), parent_sum.GetHess());
+  }
+  // With 3.0/3.0 missing values
+  // Forward, first 2 categories are selected, while the last one go to left along with missing value


It definitely seems weird to me to arbitrarily keep one category along with the missing values and end up with a worse loss.

If we can solve the numerical issues I definitely think missing-only splits should be allowed.

trivialfis · 2022-09-06

Sounds good. Looking forward to your findings!

trivialfis added 10 commits August 30, 2022 03:36

Initial commit. (9f68393)

Small cleanup to CPU. (c33d9fc)

Specialize on cat update. (8397bb0)

Dense, tie breaking. (dcfdbca)

Scan. (0e4489a)

Revert debug. (b1ada59)

Cleanup. (2951586)

fix. (6c3de5e)

Support max_cat_thresh. (37c79cb)

Tests. (e52b4e5)

trivialfis mentioned this pull request Aug 30, 2022

Categorical data support (part 2) #7899

Open

16 tasks

trivialfis added 3 commits August 31, 2022 04:32

Cleanup. (408f9bc)

Size. (2732c6b)

lint. (3d87790)

trivialfis marked this pull request as ready for review August 30, 2022 20:44

trivialfis added 2 commits August 31, 2022 16:17

Fix test. (93aab02)

Add quick test. (19a8fbd)

RAMitchell reviewed Sep 5, 2022

trivialfis added 2 commits September 6, 2022 17:12

Add tests from Rory. (fb4bba4)

add check for direction. (76ac92d)

comment. (4acc9b4)

RAMitchell approved these changes Sep 6, 2022

floating point. (8e43aa3)

trivialfis merged commit b5eb36f into dmlc:master Sep 6, 2022

trivialfis deleted the cat-gpu-evaluation-missing branch September 6, 2022 16:57
