
Optimise histogram kernels #8118

Merged: 15 commits into dmlc:master on Aug 18, 2022

Conversation

RAMitchell (Member)

Performed loop unrolling and changed the compressed iterator to use byte-aligned sizes, increasing global memory read throughput.

max_depth=8

| dataset | master | hist |
| ------- | ------ | ---- |
| airline | 89.51209751 | 83.09268917 |
| bosch | 12.62905315 | 13.52083097 |
| covtype | 17.99281998 | 15.88525812 |
| epsilon | 44.71274849 | 39.46799638 |
| fraud | 1.29335506 | 1.161479132 |
| higgs | 17.27792022 | 15.09929334 |
| year | 6.953637654 | 4.075826511 |
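
For context, here is a minimal sketch of the kind of unrolled histogram accumulation loop described above. It is illustrative only, not this PR's actual kernel: the `GradientPair` struct, the `kUnroll` factor, and the indexing scheme are all assumptions.

```cuda
#include <cstddef>
#include <cstdint>

// Illustrative gradient pair; the real type is xgboost::GradientPair.
struct GradientPair {
  float grad;
  float hess;
};

constexpr int kUnroll = 4;  // unroll factor (assumed for illustration)

// Each thread handles kUnroll consecutive elements per grid stride. The
// index/gradient loads are issued together before the atomic accumulation,
// so independent global memory reads can be in flight simultaneously.
__global__ void HistogramKernel(const std::uint32_t* __restrict__ bin_idx,
                                const GradientPair* __restrict__ gpair,
                                std::size_t n,
                                GradientPair* __restrict__ hist) {
  std::size_t tid =
      static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  std::size_t stride =
      static_cast<std::size_t>(gridDim.x) * blockDim.x * kUnroll;
  for (std::size_t base = tid * kUnroll; base < n; base += stride) {
    std::uint32_t idx[kUnroll];
    GradientPair g[kUnroll];
#pragma unroll
    for (int u = 0; u < kUnroll; ++u) {  // batched loads
      if (base + u < n) {
        idx[u] = bin_idx[base + u];
        g[u] = gpair[base + u];
      }
    }
#pragma unroll
    for (int u = 0; u < kUnroll; ++u) {  // accumulation
      if (base + u < n) {
        atomicAdd(&hist[idx[u]].grad, g[u].grad);
        atomicAdd(&hist[idx[u]].hess, g[u].hess);
      }
    }
  }
}
```

Separating the batched loads from the atomics is what lets the compiler issue independent global reads back to back, which is where an unrolled kernel typically picks up read throughput.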

@trivialfis (Member)

There was a discussion about the block/kernel size being too large, with many threads being wasted in the histogram kernel on the latest architectures. Did you get a chance to look into that?

@RAMitchell (Member, Author)

Thanks for the reminder. Maybe I should test on Ampere to check that I haven't reintroduced that issue. I think the number of blocks launched should be even smaller in this PR, but I should check.
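
As a point of reference on block counts, a grid size can be derived from occupancy rather than from the data size, so that a bounded number of resident blocks is launched regardless of input. This is a sketch assuming a grid-stride kernel, not the launch logic actually used in this PR:

```cuda
#include <cuda_runtime.h>

// Illustrative only: size the grid from occupancy so a grid-stride kernel
// launches a bounded number of resident blocks, independent of data size.
template <typename KernelT>
int GridSizeFromOccupancy(KernelT kernel, int block_threads) {
  int device = 0;
  cudaGetDevice(&device);
  int num_sms = 0;
  cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);
  int blocks_per_sm = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, kernel,
                                                block_threads,
                                                /*dynamic smem=*/0);
  return num_sms * blocks_per_sm;
}
```

With this scheme, an A100 (108 SMs) would launch at most 108 × blocks_per_sm blocks however large the dataset is, so no threads sit idle waiting for a slot.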

@RAMitchell (Member, Author)

Here is the A100 benchmark. Everything looks good.

| dataset | master | hist |
| ------- | ------ | ---- |
| airline | 65.77564727 | 60.79124835 |
| bosch | 13.05801762 | 13.36868745 |
| covtype | 20.95157623 | 14.26051986 |
| epsilon | 47.79153186 | 48.37207412 |
| fraud | 1.514388888 | 1.128341728 |
| higgs | 14.98636844 | 10.8116073 |
| year | 4.462064292 | 4.655418076 |

@trivialfis (Member)

Please convert it to non-draft so that we can run tests on Jenkins.

@RAMitchell RAMitchell marked this pull request as ready for review August 1, 2022 09:03
@RAMitchell RAMitchell closed this Aug 1, 2022
@RAMitchell RAMitchell reopened this Aug 1, 2022
@RAMitchell (Member, Author)

Unfortunately, using byte-aligned sizes in the compressed iterator increased the memory usage of the large-sizes test by about 1 GB, and I think it now just barely no longer fits on the T4 we use in CI.

The memory used by DeviceQuantileDMatrix in the test went from ~12 GB to ~13 GB, which I think is acceptable; it's just slightly annoying that the test can't run on these machines.

@trivialfis (Member)

Seems odd, though; I think the memory usage bottleneck is in sketching rather than in ellpack.

@RAMitchell RAMitchell closed this Aug 10, 2022
@RAMitchell RAMitchell reopened this Aug 10, 2022
@RAMitchell (Member, Author)

I reverted the changes to the compressed iterator. In the large-sizes test, the bit-packed version is able to use 10 bits per symbol, where the byte-aligned version uses 16. The page size is 2484 MB vs 4294 MB.

Speed actually seems better in some cases with bit-packing compression.

Benchmarking results:

| dataset | Without compression | With compression |
| ------- | ------------------- | ---------------- |
| airline | 60.79124835 | 59.46090026 |
| bosch | 13.36868745 | 13.11017834 |
| covtype | 14.26051986 | 14.26651251 |
| epsilon | 48.37207412 | 37.74572848 |
| fraud | 1.128341728 | 1.119378205 |
| higgs | 10.8116073 | 10.70780578 |
| year | 4.655418076 | 4.122201498 |
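
For reference, a minimal sketch of the kind of bit-packed read involved, assuming little-endian bit order and a buffer padded with a few trailing bytes; this is illustrative only, not XGBoost's actual `CompressedIterator`. At 10 bits per symbol versus a byte-aligned 16, storage shrinks roughly in proportion (10/16 of 4294 MB would be about 2684 MB, in the same ballpark as the 2484 MB reported above), at the cost of shift-and-mask work on every read:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative bit-packed read, not XGBoost's actual CompressedIterator.
// Assumes little-endian bit packing and that `buffer` is padded with at
// least 4 extra bytes so the 5-byte window never reads out of bounds.
inline std::uint32_t ReadSymbol(const std::uint8_t* buffer, std::size_t i,
                                int symbol_bits) {
  std::size_t bit_offset = i * static_cast<std::size_t>(symbol_bits);
  std::size_t byte = bit_offset / 8;
  int shift = static_cast<int>(bit_offset % 8);
  // Gather a 40-bit window: enough for any symbol of up to 32 bits
  // starting at any bit offset within the first byte.
  std::uint64_t window = 0;
  for (int b = 0; b < 5; ++b) {
    window |= static_cast<std::uint64_t>(buffer[byte + b]) << (8 * b);
  }
  std::uint64_t mask = (std::uint64_t{1} << symbol_bits) - 1;
  return static_cast<std::uint32_t>((window >> shift) & mask);
}
```

A call such as `ReadSymbol(buffer, i, 10)` would then recover the i-th 10-bit symbol, where a byte-aligned layout would instead do a single aligned 16-bit load.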

This reverts commit e622026.
This reverts commit 08cc82d.
@RAMitchell RAMitchell merged commit 1703dc3 into dmlc:master Aug 18, 2022