Optimise histogram kernels #8118
Conversation
This reverts commit c93641a.
There was a discussion about the block size/kernel size being too large, with many threads wasted in the histogram kernel on the latest architectures. Did you get a chance to look into that?
Thanks for the reminder. Maybe I should test on Ampere to check that I haven't reintroduced that issue. I think the number of blocks launched should be even smaller in this PR, but I should check.
Here is the A100 benchmark. Everything looks good.
Please convert it to non-draft so that we can run tests on Jenkins.
Unfortunately, using aligned byte sizes in the compressed iterator increased the memory usage of the large-sizes test by 1 GB, and it now barely no longer fits on the T4 we use in CI. The memory used by DeviceQuantileDMatrix in the test went from ~12 GB to ~13 GB, which I think is acceptable; it's just slightly annoying that the test can't run on these machines.
Seems odd, though; I think the memory usage bottleneck is in sketching rather than in ellpack.
This reverts commit 4058e78.
I reverted the changes to the compressed iterator. In the test for large sizes, the bit-packed version is able to use 10 bits per symbol, where the aligned version uses 16. The page size is 2484 MB vs 4294 MB. Speed actually seems better in some cases with bit-packed compression. Benchmarking results:
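To illustrate the trade-off being discussed, here is a minimal CPU sketch of bit-packed symbol storage versus byte-aligned storage. This is not xgboost's actual `CompressedIterator` implementation; the function names (`SymbolBits`, `PackSymbol`, `UnpackSymbol`) are made up for the example. With 1024 gradient-histogram bins, packing needs only ceil(log2(1024)) = 10 bits per symbol, while rounding up to whole bytes would cost 16 bits, roughly matching the 2484 MB vs 4294 MB page sizes mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimum number of bits needed to represent n_symbols distinct values.
std::size_t SymbolBits(std::size_t n_symbols) {
  std::size_t bits = 1;
  while ((std::size_t{1} << bits) < n_symbols) ++bits;
  return bits;
}

// Write the i-th symbol into the buffer using `bits` bits per symbol,
// one bit at a time (simple but not fast; a real iterator would read
// whole words and shift/mask).
void PackSymbol(std::vector<uint8_t>& buf, std::size_t i, std::size_t bits,
                uint32_t symbol) {
  std::size_t bit_pos = i * bits;
  for (std::size_t b = 0; b < bits; ++b, ++bit_pos) {
    if (symbol & (1u << b)) {
      buf[bit_pos / 8] |= static_cast<uint8_t>(1u << (bit_pos % 8));
    }
  }
}

// Read the i-th symbol back out of the packed buffer.
uint32_t UnpackSymbol(const std::vector<uint8_t>& buf, std::size_t i,
                      std::size_t bits) {
  uint32_t out = 0;
  std::size_t bit_pos = i * bits;
  for (std::size_t b = 0; b < bits; ++b, ++bit_pos) {
    if (buf[bit_pos / 8] & (1u << (bit_pos % 8))) out |= (1u << b);
  }
  return out;
}
```

The byte-aligned variant trades this space for aligned, coalesced-friendly loads; the benchmark above suggests the extra memory traffic can outweigh that benefit.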
Performed loop unrolling and changed the compressed iterator to use byte-aligned sizes, increasing global memory read throughput.
max_depth=8
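As a rough sketch of the loop-unrolling part of the change (the real change is inside a CUDA histogram kernel; this CPU version with hypothetical names only shows the shape of the transformation): process a fixed number of elements per outer iteration so the inner loop can be fully unrolled, with a scalar remainder loop for the tail. In CUDA the inner loop would carry `#pragma unroll`, letting more independent global loads be in flight at once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kUnroll = 4;  // elements handled per unrolled iteration

// Accumulate bin counts into *hist. Assumes every value in `bins` is a
// valid index into *hist.
void BuildHistogram(const std::vector<uint32_t>& bins,
                    std::vector<uint64_t>* hist) {
  std::size_t i = 0;
  std::size_t n = bins.size();
  // Main loop: kUnroll independent accumulations per iteration; the
  // fixed trip count lets the compiler unroll the inner loop.
  for (; i + kUnroll <= n; i += kUnroll) {
    for (int j = 0; j < kUnroll; ++j) {  // '#pragma unroll' in CUDA
      ++(*hist)[bins[i + j]];
    }
  }
  // Remainder loop for the last n % kUnroll elements.
  for (; i < n; ++i) {
    ++(*hist)[bins[i]];
  }
}
```

In the actual kernel the accumulations would be atomic adds into shared or global memory rather than plain increments, but the unroll-plus-remainder structure is the same.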