Fuse gpu_hist all-reduce calls where possible #7867

Merged: 19 commits merged into dmlc:master on May 17, 2022

Conversation

@RAMitchell

This PR changes the driver class slightly to handle identification of invalid nodes.

'gpu_hist' is changed to iterate over batches of nodes for each operation (e.g. apply split, update position).

All-reduce may then be called on batches of nodes at the layer level.

The histogram memory allocator is redesigned to allocate contiguous memory for the current batch of nodes. It can no longer recycle old memory; instead, it simply stops caching allocations once it reaches a maximum size (finding and reusing memory blocks is too difficult now that they have variable sizes).

Future optimisations can be made to fuse more node operations together.
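
(For context, a rough sketch of the contiguous batch-allocation idea. The names below are hypothetical and host-side `std::vector` stands in for device memory, so this is not the actual XGBoost implementation.)

```cpp
// Sketch only: histograms for the current batch of nodes live in one contiguous
// buffer, so the whole layer can be zeroed and all-reduced with single calls.
#include <cstddef>
#include <unordered_map>
#include <vector>

struct GradientPairSum { double grad = 0.0; double hess = 0.0; };

class BatchHistogramAllocator {
 public:
  BatchHistogramAllocator(std::size_t bins_per_node, std::size_t max_cached_bytes)
      : bins_per_node_(bins_per_node), max_cached_bytes_(max_cached_bytes) {}

  // Allocate one contiguous block covering every node in the current batch.
  GradientPairSum* AllocateBatch(const std::vector<int>& node_ids) {
    buffers_.emplace_back(node_ids.size() * bins_per_node_);
    GradientPairSum* base = buffers_.back().data();
    for (std::size_t i = 0; i < node_ids.size(); ++i) {
      node_histogram_[node_ids[i]] = base + i * bins_per_node_;
    }
    cached_bytes_ += buffers_.back().size() * sizeof(GradientPairSum);
    // No recycling of old blocks; once the cache limit is reached, further
    // batch allocations would simply not be kept around after use.
    return base;
  }

  GradientPairSum* NodeHistogram(int node_id) { return node_histogram_.at(node_id); }
  bool CacheFull() const { return cached_bytes_ > max_cached_bytes_; }

 private:
  std::size_t bins_per_node_;
  std::size_t max_cached_bytes_;
  std::size_t cached_bytes_ = 0;
  std::vector<std::vector<GradientPairSum>> buffers_;
  std::unordered_map<int, GradientPairSum*> node_histogram_;
};
```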

@trivialfis left a comment

Out of curiosity, why is a contiguous histogram required? Couldn't we simply use an NCCL group call?
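
(For reference, the alternative being suggested would look roughly like the following; `comm`, `stream`, and the per-node histogram pointers are assumed to already exist, so this is only a sketch of the NCCL group-call pattern, not code from this PR.)

```cpp
// Sketch of the NCCL group-call alternative: one ncclAllReduce per node
// histogram, batched into a single group so the buffers need not be contiguous.
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

void AllReduceNodeHistograms(const std::vector<double*>& node_histograms,
                             std::size_t bins_per_node, ncclComm_t comm,
                             cudaStream_t stream) {
  ncclGroupStart();
  for (double* hist : node_histograms) {
    // In-place sum-reduction of each node's histogram across all workers.
    ncclAllReduce(hist, hist, bins_per_node, ncclDouble, ncclSum, comm, stream);
  }
  ncclGroupEnd();
}
```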

@RAMitchell

I think it's possible, but the number of nodes can grow very large, and that seems likely to break NCCL.

@trivialfis

The upper limit is 2048, as documented in https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/groups.html (last paragraph).

I think one of the problems we are trying to solve by fusing nodes is deep trees, so it might be desirable to reuse histograms to avoid allocation.

@RAMitchell

If I try to reuse memory, I need to write some kind of locking so that histograms can't be recycled while they are still in use, and the memory management gets more complicated. We would also have to zero the memory block by block with separate kernels if we are working with non-contiguous memory.

I will think about it. I think there are still benefits to doing it this way.
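
(Illustrative sketch of the point about zeroing: with a contiguous batch the reset is a single call, whereas scattered recycled blocks would each need their own launch. `GradientPairSum` is the same hypothetical struct as in the earlier sketch.)

```cpp
#include <cstddef>
#include <cuda_runtime.h>

struct GradientPairSum { double grad; double hess; };

// One memset clears the histograms for an entire layer of nodes because they
// occupy a single contiguous allocation.
void ZeroBatchHistograms(GradientPairSum* batch, std::size_t num_nodes,
                         std::size_t bins_per_node, cudaStream_t stream) {
  cudaMemsetAsync(batch, 0, num_nodes * bins_per_node * sizeof(GradientPairSum),
                  stream);
}
```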

@RAMitchell commented May 9, 2022

EDIT: The results below are actually for 1 GPU. The performance regression for epsilon also reversed when run again, so it is just noise.

Gbm-bench results below: minor improvements in a few places, some regression for epsilon/bosch.

max_depth=8

| dataset | master | fuse |
| --- | --- | --- |
| airline | 103.3786661 | 103.3746259 |
| bosch | 18.15458408 | 18.01013005 |
| covtype | 28.41138208 | 27.3639999 |
| epsilon | 71.15875021 | 69.51017013 |
| fraud | 1.432621169 | 1.390665454 |
| higgs | 21.07405361 | 20.73446601 |
| year | 11.10905975 | 10.36314736 |

max_depth=16

| dataset | master | fuse |
| --- | --- | --- |
| airline | 1002.253579 | 951.7462879 |
| bosch | 58.92085508 | 62.31542211 |
| covtype | 146.2093249 | 141.8389019 |
| epsilon | 838.7910689 | 893.5267582 |
| fraud | 1.856290607 | 1.843482201 |
| higgs | 547.7855797 | 490.7695937 |
| year | 1225.145735 | 1064.302769 |

@RAMitchell

Any regression might be a result of no longer recycling memory and of fewer uses of the subtraction trick. Investigating further.
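
(For readers unfamiliar with it, the subtraction trick builds only one child's histogram from the data and derives its sibling as parent minus built child, halving the work and the data to be all-reduced. A host-side sketch with hypothetical types; the real version runs on the GPU.)

```cpp
#include <cstddef>

struct GradientPairSum { double grad; double hess; };

// Derive the sibling histogram as parent - built child, bin by bin, instead of
// accumulating it from the training rows a second time.
void SubtractionTrick(const GradientPairSum* parent, const GradientPairSum* built_child,
                      GradientPairSum* other_child, std::size_t num_bins) {
  for (std::size_t i = 0; i < num_bins; ++i) {
    other_child[i].grad = parent[i].grad - built_child[i].grad;
    other_child[i].hess = parent[i].hess - built_child[i].hess;
  }
}
```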

@RAMitchell

Results for max_depth=16 with 8 GPUs this time.

| dataset | master | fuse |
| --- | --- | --- |
| airline | 1233.365728 | 906.3104833 |
| bosch | 73.33866537 | 60.87019034 |
| covtype | 204.4933698 | 160.8755513 |
| epsilon | 1432.8182 | 1367.160304 |
| fraud | 10.78470769 | 10.36000521 |
| higgs | 770.2519306 | 525.754074 |
| year | 1733.778181 | 1342.36472 |

Examining the output in debug mode, we can see that the number of all-reduce calls for bosch has decreased by roughly 10x.

old bosch:
```
[06:03:43] ======== NCCL Statistics ========
[06:03:43] AllReduce calls: 142535
[06:03:43] AllReduce total MiB communicated: 270572
```

new bosch:
```
[06:13:13] AllReduce calls: 12314
[06:13:13] AllReduce total MiB communicated: 275222
```

@trivialfis

The result looks exciting! Please let me know your plan regarding this optimization. Do we want to merge it even if there's a regression? Do we want a better design for the histogram allocator? Do we want to implement a static fuse size (8 or 16, using stack memory) or use the heap for bookkeeping? Just curious; no need to provide a concrete plan if it hasn't been decided yet. Please let me know if it's ready for review, or mark it as draft/WIP.

@RAMitchell

> The result looks exciting! Please let me know your plan regarding this optimization. Do we want to merge it even if there's a regression? Do we want a better design for the histogram allocator? Do we want to implement a static fuse size (8 or 16, using stack memory) or use the heap for bookkeeping? Just curious; no need to provide a concrete plan if it hasn't been decided yet. Please let me know if it's ready for review, or mark it as draft/WIP.

There is no actual regression for a single GPU; when I ran it again it turned out to be just variance. I actually prefer the current design for its simplicity. I don't think much is gained by recycling nodes, and it becomes more complicated to track the status of each memory location when dealing with batches. So my preference is to go ahead with this; it is ready for review.

I think the static fusing can work very well, in particular for updating position and evaluating splits, so those are my next goals.

@trivialfis left a comment

The change looks good to me. It would be great if you could investigate the perf change of epsilon at depth 16 in future PRs. The variance seems significant, but I assume it's a special case and overall performance is more important.

@RAMitchell merged commit 71d3b2e into dmlc:master on May 17, 2022