New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add max_cat_threshold
to GPU and handle missing cat values.
#8212
Add max_cat_threshold
to GPU and handle missing cat values.
#8212
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is redoing #8193?
One problem here is that the missing value reduction is still being calculated on the constructor of EvaluateSplitAgent but not used. Ideally I think we want numerical and one hot splits all doing the same thing (although not necessary for this PR). This evaluate split kernel is quite delicate in terms of performance. If you change everything to do forward and backwards passes you might find it slows down significantly or even speeds up. Epsilon is a good test case for numerical splits.
If we merge this I think you should take the edge case tests from #8193, although I reversed the split direction in that PR so they need to be inverted. Those test cases guard against the bugs I saw in the categorical airline dataset.
In a slightly different way.
Yes, that's a problem. I can skip that by spliting up the Agent class.
The 2 approaches are not entirely the same. Using 2-pass scan helps with an edge case where we need to avoid including the last bin into split calculation. We would like to avoid spliting only on missing value. |
Splitting only on missing values seems valid. |
Copied the tests from #8193 and added some checks for default direction. |
EXPECT_FLOAT_EQ(result.left_sum.GetHess() + result.right_sum.GetHess(), parent_sum.GetHess()); | ||
} | ||
// With 3.0/3.0 missing values | ||
// Forward, first 2 categories are selected, while the last one go to left along with missing value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This definitely seems weird to me to arbitrarily keep one category along with the missing values and have a worse loss.
If we can solve the numerical issues I definitely think missing only splits should be allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Looking forward to your findings!
related: #8193
max_cat_threshold
parameter to GPU.