Refactor split valuation kernel #8073
The performance improvement looks exciting! Some questions in the comments.
XGBOOST_DEVICE float
CalcGainGivenWeight(ParamT const &p, tree::GradStats const& stats, float w) const {
// Fast floating point division instruction on device
XGBOOST_DEVICE float Divide(float a, float b) const {
Can we extract this as an independent function?
I'm not expecting to use it anywhere else at the moment, so I think it should stay unless you have something specific in mind. A kernel needs to be heavily bottlenecked by arithmetic before this makes a difference, and I can't think of other places in xgboost where that is the case.
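For readers following the thread, here is a minimal sketch of what a device-side fast division helper of this kind might look like. `FastDivide` and `HOST_DEVICE` are hypothetical names chosen for illustration and are not taken from the PR. The sketch assumes the CUDA intrinsic `__fdividef`, which trades some accuracy for speed, and falls back to ordinary division when compiled for the host.

```cpp
// Illustrative sketch only: FastDivide and HOST_DEVICE are hypothetical
// names, not identifiers from xgboost.
#if defined(__CUDACC__)
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Divide a by b, using the approximate hardware path on the GPU.
HOST_DEVICE inline float FastDivide(float a, float b) {
#if defined(__CUDA_ARCH__)
  // Device code path: __fdividef is a CUDA intrinsic that is faster than
  // the '/' operator but less accurate.
  return __fdividef(a, b);
#else
  // Host code path: plain IEEE division.
  return a / b;
#endif
}
```

Because the accuracy trade-off only pays off when a kernel is limited by arithmetic rather than memory traffic, keeping such a helper local to the one kernel that needs it is a reasonable design choice.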
Depth 8 benchmarks:
Increases split evaluation kernel throughput on my V100 from 150 GB/s to 400 GB/s.
These gains come from:
The fastest throughput I have achieved is 700 GB/s, but the code becomes a little complicated to get that fast. I think the current version is fast enough: it places the bottleneck entirely on histogram computation.