Fuse gpu_hist all-reduce calls where possible #7867

Merged: 19 commits merged into dmlc:master on May 17, 2022

Conversation

@RAMitchell

This PR changes the driver class slightly to handle identification of invalid nodes.

'gpu_hist' is changed to iterate over batches of nodes for each operation (e.g. apply split, update position).

All-reduce may then be called on batches of nodes at the layer level.

The histogram memory allocator is redesigned to allocate contiguous memory for the current batch of nodes. It can no longer recycle old memory; instead, it simply stops caching allocations once it reaches a maximum size (finding and reusing memory blocks is too difficult now that they have variable sizes).

Future optimisations can be made to fuse more node operations together.
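
(For context, a rough sketch of the contiguous batch-allocation idea. The names below are hypothetical and host-side `std::vector` stands in for device memory, so this is not the actual XGBoost implementation.)

```cpp
// Sketch only: histograms for the current batch of nodes live in one contiguous
// buffer, so the whole layer can be zeroed and all-reduced with single calls.
#include <cstddef>
#include <unordered_map>
#include <vector>

struct GradientPairSum { double grad = 0.0; double hess = 0.0; };

class BatchHistogramAllocator {
 public:
  BatchHistogramAllocator(std::size_t bins_per_node, std::size_t max_cached_bytes)
      : bins_per_node_(bins_per_node), max_cached_bytes_(max_cached_bytes) {}

  // Allocate one contiguous block covering every node in the current batch.
  GradientPairSum* AllocateBatch(const std::vector<int>& node_ids) {
    buffers_.emplace_back(node_ids.size() * bins_per_node_);
    GradientPairSum* base = buffers_.back().data();
    for (std::size_t i = 0; i < node_ids.size(); ++i) {
      node_histogram_[node_ids[i]] = base + i * bins_per_node_;
    }
    cached_bytes_ += buffers_.back().size() * sizeof(GradientPairSum);
    // No recycling of old blocks; once the cache limit is reached, further
    // batch allocations would simply not be kept around after use.
    return base;
  }

  GradientPairSum* NodeHistogram(int node_id) { return node_histogram_.at(node_id); }
  bool CacheFull() const { return cached_bytes_ > max_cached_bytes_; }

 private:
  std::size_t bins_per_node_;
  std::size_t max_cached_bytes_;
  std::size_t cached_bytes_ = 0;
  std::vector<std::vector<GradientPairSum>> buffers_;
  std::unordered_map<int, GradientPairSum*> node_histogram_;
};
```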

@trivialfis left a comment

Out of curiosity, why is a contiguous histogram required? Couldn't we simply use an NCCL group call?
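
(For reference, the alternative being suggested would look roughly like the following; `comm`, `stream`, and the per-node histogram pointers are assumed to already exist, so this is only a sketch of the NCCL group-call pattern, not code from this PR.)

```cpp
// Sketch of the NCCL group-call alternative: one ncclAllReduce per node
// histogram, batched into a single group so the buffers need not be contiguous.
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

void AllReduceNodeHistograms(const std::vector<double*>& node_histograms,
                             std::size_t bins_per_node, ncclComm_t comm,
                             cudaStream_t stream) {
  ncclGroupStart();
  for (double* hist : node_histograms) {
    // In-place sum-reduction of each node's histogram across all workers.
    ncclAllReduce(hist, hist, bins_per_node, ncclDouble, ncclSum, comm, stream);
  }
  ncclGroupEnd();
}
```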

@RAMitchell

I think it's possible, but the number of nodes can grow very large, and that seems likely to break NCCL.

@trivialfis

The upper limit is 2048, as documented in https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/groups.html (last paragraph).

I think one of the problems we are trying to solve by fusing nodes is deep trees, so it might be desirable to reuse histograms to avoid allocation.

@RAMitchell

If I try to reuse memory, I need to write some kind of locking so that histograms can't be recycled while they are still in use, and the memory management gets more complicated. We would also have to zero the memory block by block with separate kernels if we are working with non-contiguous memory.

I will think about it. I think there are still benefits to doing it this way.
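
(Illustrative sketch of the point about zeroing: with a contiguous batch the reset is a single call, whereas scattered recycled blocks would each need their own launch. `GradientPairSum` is the same hypothetical struct as in the earlier sketch.)

```cpp
#include <cstddef>
#include <cuda_runtime.h>

struct GradientPairSum { double grad; double hess; };

// One memset clears the histograms for an entire layer of nodes because they
// occupy a single contiguous allocation.
void ZeroBatchHistograms(GradientPairSum* batch, std::size_t num_nodes,
                         std::size_t bins_per_node, cudaStream_t stream) {
  cudaMemsetAsync(batch, 0, num_nodes * bins_per_node * sizeof(GradientPairSum),
                  stream);
}
```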

@RAMitchell commented May 9, 2022

EDIT: The results below are actually for 1 GPU. The performance regression for epsilon also reversed when run again, so it is just noise.

Gbm-bench results below: minor improvements in a few places, some regression for epsilon/bosch.

max_depth=8

| dataset | master | fuse |
| --- | --- | --- |
| airline | 103.3786661 | 103.3746259 |
| bosch | 18.15458408 | 18.01013005 |
| covtype | 28.41138208 | 27.3639999 |
| epsilon | 71.15875021 | 69.51017013 |
| fraud | 1.432621169 | 1.390665454 |
| higgs | 21.07405361 | 20.73446601 |
| year | 11.10905975 | 10.36314736 |

max_depth=16

| dataset | master | fuse |
| --- | --- | --- |
| airline | 1002.253579 | 951.7462879 |
| bosch | 58.92085508 | 62.31542211 |
| covtype | 146.2093249 | 141.8389019 |
| epsilon | 838.7910689 | 893.5267582 |
| fraud | 1.856290607 | 1.843482201 |
| higgs | 547.7855797 | 490.7695937 |
| year | 1225.145735 | 1064.302769 |

@RAMitchell

Any regression might be a result of no longer recycling memory and of fewer uses of the subtraction trick. Investigating further.
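
(For readers unfamiliar with it, the subtraction trick builds only one child's histogram from the data and derives its sibling as parent minus built child, halving the work and the data to be all-reduced. A host-side sketch with hypothetical types; the real version runs on the GPU.)

```cpp
#include <cstddef>

struct GradientPairSum { double grad; double hess; };

// Derive the sibling histogram as parent - built child, bin by bin, instead of
// accumulating it from the training rows a second time.
void SubtractionTrick(const GradientPairSum* parent, const GradientPairSum* built_child,
                      GradientPairSum* other_child, std::size_t num_bins) {
  for (std::size_t i = 0; i < num_bins; ++i) {
    other_child[i].grad = parent[i].grad - built_child[i].grad;
    other_child[i].hess = parent[i].hess - built_child[i].hess;
  }
}
```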

@RAMitchell

Results for max_depth=16 with 8 GPUs this time.

| dataset | master | fuse |
| --- | --- | --- |
| airline | 1233.365728 | 906.3104833 |
| bosch | 73.33866537 | 60.87019034 |
| covtype | 204.4933698 | 160.8755513 |
| epsilon | 1432.8182 | 1367.160304 |
| fraud | 10.78470769 | 10.36000521 |
| higgs | 770.2519306 | 525.754074 |
| year | 1733.778181 | 1342.36472 |

Examining the output in debug mode, we can see that the number of all-reduce calls for bosch has decreased by roughly 10x.

old bosch:
```
[06:03:43] ======== NCCL Statistics ========
[06:03:43] AllReduce calls: 142535
[06:03:43] AllReduce total MiB communicated: 270572
```

new bosch:
```
[06:13:13] AllReduce calls: 12314
[06:13:13] AllReduce total MiB communicated: 275222
```

@trivialfis

The result looks exciting! Please let me know your plan regarding this optimization. Do we want to merge it even if there's a regression? Do we want a better design for the histogram allocator? Do we want to implement a static fuse size (8 or 16, using stack memory) or use the heap for bookkeeping? Just curious; no need to provide a concrete plan if it hasn't been decided yet. Please let me know if it's ready for review, or mark it as draft/WIP.

@RAMitchell

> The result looks exciting! Please let me know your plan regarding this optimization. Do we want to merge it even if there's a regression? Do we want a better design for the histogram allocator? Do we want to implement a static fuse size (8 or 16, using stack memory) or use the heap for bookkeeping? Just curious; no need to provide a concrete plan if it hasn't been decided yet. Please let me know if it's ready for review, or mark it as draft/WIP.

There is no actual regression for a single GPU; when I ran it again it turned out to be just variance. I actually prefer the current design for its simplicity. I don't think much is gained by recycling nodes, and it becomes more complicated to track the status of each memory location when dealing with batches. So my preference is to go ahead with this; it is ready for review.

I think the static fusing can work very well, in particular for updating position and evaluating splits, so those are my next goals.

@trivialfis left a comment

The change looks good to me. It would be great if you could investigate the perf change of epsilon at depth 16 in future PRs. The variance seems significant, but I assume it's a special case and overall performance is more important.

@RAMitchell merged commit 71d3b2e into dmlc:master on May 17, 2022