
Optimise histogram kernels #8118

Merged: 15 commits into dmlc:master on Aug 18, 2022

Conversation

RAMitchell (Member)

Performed loop unrolling and changed the compressed iterator to use byte-aligned sizes, increasing global memory read throughput.

max_depth=8

| dataset | master | hist |
| ------- | ------ | ---- |
| airline | 89.51209751 | 83.09268917 |
| bosch | 12.62905315 | 13.52083097 |
| covtype | 17.99281998 | 15.88525812 |
| epsilon | 44.71274849 | 39.46799638 |
| fraud | 1.29335506 | 1.161479132 |
| higgs | 17.27792022 | 15.09929334 |
| year | 6.953637654 | 4.075826511 |
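
For context, here is a minimal sketch of the kind of unrolled histogram accumulation loop described above. It is illustrative only, not this PR's actual kernel: the `GradientPair` struct, the `kUnroll` factor, and the indexing scheme are all assumptions.

```cuda
#include <cstddef>
#include <cstdint>

// Illustrative gradient pair; the real type is xgboost::GradientPair.
struct GradientPair {
  float grad;
  float hess;
};

constexpr int kUnroll = 4;  // unroll factor (assumed for illustration)

// Each thread handles kUnroll consecutive elements per grid stride. The
// index/gradient loads are issued together before the atomic accumulation,
// so independent global memory reads can be in flight simultaneously.
__global__ void HistogramKernel(const std::uint32_t* __restrict__ bin_idx,
                                const GradientPair* __restrict__ gpair,
                                std::size_t n,
                                GradientPair* __restrict__ hist) {
  std::size_t tid =
      static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  std::size_t stride =
      static_cast<std::size_t>(gridDim.x) * blockDim.x * kUnroll;
  for (std::size_t base = tid * kUnroll; base < n; base += stride) {
    std::uint32_t idx[kUnroll];
    GradientPair g[kUnroll];
#pragma unroll
    for (int u = 0; u < kUnroll; ++u) {  // batched loads
      if (base + u < n) {
        idx[u] = bin_idx[base + u];
        g[u] = gpair[base + u];
      }
    }
#pragma unroll
    for (int u = 0; u < kUnroll; ++u) {  // accumulation
      if (base + u < n) {
        atomicAdd(&hist[idx[u]].grad, g[u].grad);
        atomicAdd(&hist[idx[u]].hess, g[u].hess);
      }
    }
  }
}
```

Separating the batched loads from the atomics is what lets the compiler issue independent global reads back to back, which is where an unrolled kernel typically picks up read throughput.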

@trivialfis (Member)

There was a discussion about the block/kernel size being too large, with many threads being wasted in the histogram kernel on the latest architectures. Did you get a chance to look into that?

@RAMitchell (Member, Author)

Thanks for the reminder. Maybe I should test on Ampere to check that I haven't reintroduced that issue. I think the number of blocks launched should be even smaller in this PR, but I should check.
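
As a point of reference on block counts, a grid size can be derived from occupancy rather than from the data size, so that a bounded number of resident blocks is launched regardless of input. This is a sketch assuming a grid-stride kernel, not the launch logic actually used in this PR:

```cuda
#include <cuda_runtime.h>

// Illustrative only: size the grid from occupancy so a grid-stride kernel
// launches a bounded number of resident blocks, independent of data size.
template <typename KernelT>
int GridSizeFromOccupancy(KernelT kernel, int block_threads) {
  int device = 0;
  cudaGetDevice(&device);
  int num_sms = 0;
  cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);
  int blocks_per_sm = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, kernel,
                                                block_threads,
                                                /*dynamic smem=*/0);
  return num_sms * blocks_per_sm;
}
```

With this scheme, an A100 (108 SMs) would launch at most 108 × blocks_per_sm blocks however large the dataset is, so no threads sit idle waiting for a slot.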

@RAMitchell (Member, Author)

Here is the A100 benchmark. Everything looks good.

| dataset | master | hist |
| ------- | ------ | ---- |
| airline | 65.77564727 | 60.79124835 |
| bosch | 13.05801762 | 13.36868745 |
| covtype | 20.95157623 | 14.26051986 |
| epsilon | 47.79153186 | 48.37207412 |
| fraud | 1.514388888 | 1.128341728 |
| higgs | 14.98636844 | 10.8116073 |
| year | 4.462064292 | 4.655418076 |

@trivialfis (Member)

Please convert it to non-draft so that we can run tests on Jenkins.

@RAMitchell RAMitchell marked this pull request as ready for review August 1, 2022 09:03
@RAMitchell RAMitchell closed this Aug 1, 2022
@RAMitchell RAMitchell reopened this Aug 1, 2022
@RAMitchell (Member, Author)

Unfortunately, using byte-aligned sizes in the compressed iterator increased the memory usage of the large-sizes test by about 1 GB, and I think it now just barely no longer fits on the T4 we use in CI.

The memory used by DeviceQuantileDMatrix in the test went from ~12 GB to ~13 GB, which I think is acceptable; it's just slightly annoying that the test can't run on these machines.

@trivialfis (Member)

Seems odd, though; I think the memory usage bottleneck is in sketching rather than in ellpack.

@RAMitchell RAMitchell closed this Aug 10, 2022
@RAMitchell RAMitchell reopened this Aug 10, 2022
@RAMitchell (Member, Author)

I reverted the changes to the compressed iterator. In the large-sizes test, the bit-packed version is able to use 10 bits per symbol, where the byte-aligned version uses 16. The page size is 2484 MB vs 4294 MB.

Speed actually seems better in some cases with bit-packing compression.

Benchmarking results:

| dataset | Without compression | With compression |
| ------- | ------------------- | ---------------- |
| airline | 60.79124835 | 59.46090026 |
| bosch | 13.36868745 | 13.11017834 |
| covtype | 14.26051986 | 14.26651251 |
| epsilon | 48.37207412 | 37.74572848 |
| fraud | 1.128341728 | 1.119378205 |
| higgs | 10.8116073 | 10.70780578 |
| year | 4.655418076 | 4.122201498 |
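
For reference, a minimal sketch of the kind of bit-packed read involved, assuming little-endian bit order and a buffer padded with a few trailing bytes; this is illustrative only, not XGBoost's actual `CompressedIterator`. At 10 bits per symbol versus a byte-aligned 16, storage shrinks roughly in proportion (10/16 of 4294 MB would be about 2684 MB, in the same ballpark as the 2484 MB reported above), at the cost of shift-and-mask work on every read:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative bit-packed read, not XGBoost's actual CompressedIterator.
// Assumes little-endian bit packing and that `buffer` is padded with at
// least 4 extra bytes so the 5-byte window never reads out of bounds.
inline std::uint32_t ReadSymbol(const std::uint8_t* buffer, std::size_t i,
                                int symbol_bits) {
  std::size_t bit_offset = i * static_cast<std::size_t>(symbol_bits);
  std::size_t byte = bit_offset / 8;
  int shift = static_cast<int>(bit_offset % 8);
  // Gather a 40-bit window: enough for any symbol of up to 32 bits
  // starting at any bit offset within the first byte.
  std::uint64_t window = 0;
  for (int b = 0; b < 5; ++b) {
    window |= static_cast<std::uint64_t>(buffer[byte + b]) << (8 * b);
  }
  std::uint64_t mask = (std::uint64_t{1} << symbol_bits) - 1;
  return static_cast<std::uint32_t>((window >> shift) & mask);
}
```

A call such as `ReadSymbol(buffer, i, 10)` would then recover the i-th 10-bit symbol, where a byte-aligned layout would instead do a single aligned 16-bit load.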

This reverts commit e622026.
This reverts commit 08cc82d.
@RAMitchell RAMitchell merged commit 1703dc3 into dmlc:master Aug 18, 2022