Add max_cat_threshold to GPU and handle missing cat values. #8212

Merged
trivialfis merged 19 commits into dmlc:master from cat-gpu-evaluation-missing on Sep 6, 2022

trivialfis
Member

related: #8193

  • Port the max_cat_threshold parameter to GPU.
  • Use a 2-pass scan on GPU instead of a pseudo-bin, which also handles missing values.
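
To make the two bullet points concrete, here is a minimal CPU-side sketch of a two-pass scan over sorted category bins. It is not the GPU kernel from this PR: the names `GradStats`, `SplitCandidate`, `Score`, and `TwoPassScan` are hypothetical, the gain formula is the usual second-order one with an L2 term `lambda`, and the sketch assumes that `max_cat_threshold` caps how many categories may be selected in a single split. Rows with a missing value are never scanned; they end up on the "rest" side because that side is computed as the parent sum minus the scanned sum.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-bin gradient statistics (not XGBoost's own type).
struct GradStats {
  double grad;
  double hess;
};

inline GradStats operator+(GradStats a, GradStats b) { return {a.grad + b.grad, a.hess + b.hess}; }
inline GradStats operator-(GradStats a, GradStats b) { return {a.grad - b.grad, a.hess - b.hess}; }

struct SplitCandidate {
  double gain{0.0};
  std::size_t n_selected{0};     // number of categories in the selected set
  bool forward{true};            // which pass produced the candidate
  GradStats selected{0.0, 0.0};  // sum over the selected categories
  GradStats rest{0.0, 0.0};      // remaining categories plus missing values
};

// Second-order score G^2 / (H + lambda).
inline double Score(GradStats s, double lambda) { return s.grad * s.grad / (s.hess + lambda); }

// bins: per-category sums, assumed to be sorted already (e.g. by grad / hess).
// parent: sum over all rows in the node, including rows with a missing value.
SplitCandidate TwoPassScan(std::vector<GradStats> const& bins, GradStats parent,
                           double lambda, std::size_t max_cat_threshold) {
  assert(max_cat_threshold > 0);
  SplitCandidate best;
  if (bins.size() < 2) { return best; }
  double const parent_score = Score(parent, lambda);
  // Never scan the last bin, and cap the size of the selected set.
  std::size_t const n_scan = std::min(bins.size() - 1, max_cat_threshold);

  auto update = [&](GradStats selected, std::size_t n, bool forward) {
    GradStats rest = parent - selected;  // missing values land here
    double gain = Score(selected, lambda) + Score(rest, lambda) - parent_score;
    if (gain > best.gain) {
      best.gain = gain;
      best.n_selected = n;
      best.forward = forward;
      best.selected = selected;
      best.rest = rest;
    }
  };

  GradStats acc{0.0, 0.0};
  for (std::size_t i = 0; i < n_scan; ++i) {  // forward pass: prefixes
    acc = acc + bins[i];
    update(acc, i + 1, true);
  }
  acc = GradStats{0.0, 0.0};
  for (std::size_t i = 0; i < n_scan; ++i) {  // backward pass: suffixes
    acc = acc + bins[bins.size() - 1 - i];
    update(acc, i + 1, false);
  }
  return best;
}
```

Because the "rest" side is always derived from the parent sum, the two sides add back up to the parent statistics, which is the same invariant the test in this PR checks for the hessian.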

@trivialfis trivialfis marked this pull request as ready for review August 30, 2022 20:44
Member

@RAMitchell RAMitchell left a comment

This PR is redoing #8193?

One problem here is that the missing value reduction is still being calculated in the constructor of EvaluateSplitAgent but is not used. Ideally, I think we want numerical and one-hot splits all doing the same thing (although that is not necessary for this PR). This evaluate split kernel is quite delicate in terms of performance. If you change everything to do forward and backward passes, you might find it slows down significantly or even speeds up. Epsilon is a good test case for numerical splits.

If we merge this I think you should take the edge case tests from #8193, although I reversed the split direction in that PR so they need to be inverted. Those test cases guard against the bugs I saw in the categorical airline dataset.

@trivialfis
Member Author

> This PR is redoing #8193?

In a slightly different way.

> One problem here is that the missing value reduction is still being calculated in the constructor of EvaluateSplitAgent but not used

Yes, that's a problem. I can skip that by splitting up the Agent class.

> If we merge this I think you should take the edge case tests from

The 2 approaches are not entirely the same. Using a 2-pass scan helps with an edge case where we need to avoid including the last bin in the split calculation. We would like to avoid splitting only on missing values.
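
A small sketch of that edge case, assuming a forward scan in which the scanned categories go to one side and everything else, missing values included, goes to the other. The helper `NonMissingRestHess` is hypothetical and only tracks hessian mass; it reports how much non-missing weight remains on the "rest" side for each candidate.

```cpp
#include <cstddef>
#include <vector>

// For each forward-scan candidate, return the hessian mass of the
// non-missing rows that stay on the "rest" side. A value of zero means
// the rest side holds nothing but missing values.
std::vector<double> NonMissingRestHess(std::vector<double> const& bin_hess,
                                       bool include_last_bin) {
  std::vector<double> out;
  if (bin_hess.empty()) { return out; }
  double total = 0.0;
  for (double h : bin_hess) { total += h; }
  std::size_t const n_scan = include_last_bin ? bin_hess.size() : bin_hess.size() - 1;
  double selected = 0.0;
  for (std::size_t i = 0; i < n_scan; ++i) {
    selected += bin_hess[i];
    out.push_back(total - selected);
  }
  return out;
}
```

With three bins of equal weight, scanning all of them produces a final candidate whose rest side has zero non-missing weight, i.e. a split on missing values only. Stopping one bin early never produces that candidate.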

@RAMitchell
Member

Splitting only on missing values seems valid.

@trivialfis
Member Author

Copied the tests from #8193 and added some checks for default direction.

EXPECT_FLOAT_EQ(result.left_sum.GetHess() + result.right_sum.GetHess(), parent_sum.GetHess());
}
// With 3.0/3.0 missing values
// Forward, first 2 categories are selected, while the last one goes to left along with the missing values
Member

It definitely seems weird to me to arbitrarily keep one category along with the missing values and have a worse loss.

If we can solve the numerical issues, I definitely think missing-only splits should be allowed.

Member Author

Sounds good. Looking forward to your findings!

@trivialfis trivialfis merged commit b5eb36f into dmlc:master Sep 6, 2022
@trivialfis trivialfis deleted the cat-gpu-evaluation-missing branch September 6, 2022 16:57