Refactor split valuation kernel #8073

Merged
merged 12 commits into from Jul 21, 2022

RAMitchell
Member

Increases split evaluation kernel throughput on my V100 from 150 GB/s to 400 GB/s.

These gains come from:

  • using CUDA fast division operations
  • avoiding shared memory broadcasts by using warp shuffles instead
  • not computing the parent's gain for every possible split (it is constant per node)

The fastest throughput I have achieved is 700 GB/s; however, getting there makes the code a little complicated. I think the current version is fast enough, since it places the bottleneck entirely on histogram computation.
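To illustrate the third point, here is a rough host-side sketch of hoisting the parent gain out of the per-split loop. It is not the kernel from this PR: `GradPair`, `Gain`, `BestLossChangeNaive` and `BestLossChangeHoisted` are hypothetical names, and the gain formula is a simplified G^2 / (H + lambda) with L2 regularisation only.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical, simplified statistics: sums of gradients and hessians.
struct GradPair {
  double grad;
  double hess;
};

// Simplified gain (assumption): G^2 / (H + lambda), L2 regularisation only.
inline double Gain(GradPair s, double lambda) {
  return (s.grad * s.grad) / (s.hess + lambda);
}

// Naive version: the parent gain is recomputed for every candidate split.
inline double BestLossChangeNaive(std::vector<GradPair> const& left_sums,
                                  GradPair parent, double lambda) {
  double best = 0.0;
  for (auto const& left : left_sums) {
    GradPair right{parent.grad - left.grad, parent.hess - left.hess};
    double change = Gain(left, lambda) + Gain(right, lambda) - Gain(parent, lambda);
    best = std::max(best, change);
  }
  return best;
}

// Hoisted version: the loop only tracks the best children gain, and the
// parent gain (constant per node) is subtracted once at the end.
inline double BestLossChangeHoisted(std::vector<GradPair> const& left_sums,
                                    GradPair parent, double lambda) {
  double best_children = -std::numeric_limits<double>::infinity();
  for (auto const& left : left_sums) {
    GradPair right{parent.grad - left.grad, parent.hess - left.hess};
    best_children = std::max(best_children, Gain(left, lambda) + Gain(right, lambda));
  }
  if (left_sums.empty()) {
    return 0.0;
  }
  return std::max(0.0, best_children - Gain(parent, lambda));
}
```

Both functions pick the same split, because subtracting a per-node constant does not change which candidate is largest. The hoisted form saves one division per candidate.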

Member

trivialfis left a comment

The performance improvement looks exciting! Some questions in the comments.

XGBOOST_DEVICE float
CalcGainGivenWeight(ParamT const &p, tree::GradStats const& stats, float w) const {
// Fast floating point division instruction on device
XGBOOST_DEVICE float Divide(float a, float b) const {
Member

Can we extract this as an independent function?

Member Author

I'm not expecting to use it anywhere else at the moment, so I think it should stay where it is unless you have something specific in mind. A kernel needs to be heavily bottlenecked by arithmetic before this makes a difference, and I can't think of other places in XGBoost where that is the case.
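For reference, a standalone version of such a helper might look roughly like the sketch below. This is an assumption about its shape, not the code in this PR: `SKETCH_DEVICE` and `FastDivide` are hypothetical names. `__fdividef` is the CUDA fast single-precision division intrinsic; it trades some accuracy for speed, so the sketch falls back to ordinary division when compiled for the host.

```cpp
#include <cassert>

// Hypothetical stand-in for a host/device qualifier macro (assumption).
#if defined(__CUDACC__)
#define SKETCH_DEVICE __host__ __device__
#else
#define SKETCH_DEVICE
#endif

// Uses the fast, lower-accuracy division intrinsic in device code and
// plain division everywhere else.
SKETCH_DEVICE inline float FastDivide(float a, float b) {
#if defined(__CUDA_ARCH__)
  return __fdividef(a, b);
#else
  return a / b;
#endif
}
```

On the host this is just `a / b`, so the difference only shows up in kernels that are bound by arithmetic rather than memory bandwidth.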

@RAMitchell
Member Author

Depth 8 benchmarks:

dataset   master        eval
airline   90.88661192   89.93374479
bosch     12.88504644   12.46627029
covtype   18.01187677   17.60098921
epsilon   46.48386218   43.91058178
fraud     1.315704659   1.237399099
higgs     17.19260674   17.27932671
year      7.047273015   6.841409724

RAMitchell merged commit 1be0984 into dmlc:master on Jul 21, 2022