
Enable distributed GPU training over Rabit #7930

Merged: 8 commits merged into dmlc:master on May 30, 2022

Conversation

@rongou (Contributor) commented May 23, 2022:

Currently, distributed GPU training is only supported when NCCL is enabled. This PR provides a fallback to Rabit when NCCL is not available (for example, on Windows). A side effect is that it also allows federated learning with the gpu_hist method.
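
For illustration, a minimal sketch of what such a fallback can look like; the helper name and the host-staging strategy are assumptions for the sketch, not necessarily what this PR implements:

```cpp
// Illustrative sketch only; names and the staging strategy are assumptions.
#include <cstddef>
#include <vector>

#include <cuda_runtime.h>
#include <rabit/rabit.h>

void AllReduceSumOnDevice(double* device_buf, std::size_t count) {
#if defined(XGBOOST_USE_NCCL)
  // Builds with NCCL would call ncclAllReduce directly on the device buffer.
#else
  // Fallback: copy to host, allreduce with rabit, copy the result back.
  std::vector<double> host(count);
  cudaMemcpy(host.data(), device_buf, count * sizeof(double),
             cudaMemcpyDeviceToHost);
  rabit::Allreduce<rabit::op::Sum>(host.data(), count);
  cudaMemcpy(device_buf, host.data(), count * sizeof(double),
             cudaMemcpyHostToDevice);
#endif
}
```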

Part of #7778

@rongou (Contributor, Author) commented May 23, 2022:

@RAMitchell @trivialfis

I think this actually works out pretty well: it allows people to run multi-GPU training on platforms without NCCL. For example:

cmake .. -GNinja -DUSE_CUDA=ON -DUSE_NCCL=OFF
ninja
cd ../tests/distributed
./runtests-gpu.sh

which generates this output:

 ====== 1. Basic distributed-gpu test with Python: 4 workers; 1 GPU per worker ====== 

2022-05-23 12:23:30,651 INFO start listen on 192.168.1.82:9091
[12:23:30] task 0 got new rank 0
[12:23:30] task 1 got new rank 1
2022-05-23 12:23:30,746 INFO @tracker All of 2 nodes getting started
[12:23:30] XGBoost distributed mode detected, will split data among workers
[12:23:30] Load part of data 1 of 2 parts
[12:23:30] XGBoost distributed mode detected, will split data among workers
[12:23:30] Load part of data 0 of 2 parts
[12:23:30] XGBoost distributed mode detected, will split data among workers
[12:23:30] XGBoost distributed mode detected, will split data among workers
[12:23:30] Load part of data 1 of 2 parts
[12:23:30] Load part of data 0 of 2 parts
/home/rou/src/xgboost-cpp/python-package/xgboost/core.py:577: FutureWarning: Pass `evals` as keyword args.  Passing these as positional arguments will be considered as error in future releases.
  warnings.warn(
/home/rou/src/xgboost-cpp/python-package/xgboost/core.py:577: FutureWarning: Pass `evals` as keyword args.  Passing these as positional arguments will be considered as error in future releases.
  warnings.warn(
[12:23:31] XGBoost is not compiled with NCCL, falling back to Rabit.
[12:23:31] XGBoost is not compiled with NCCL, falling back to Rabit.
[12:23:31] XGBoost is not compiled with NCCL, falling back to Rabit.
[12:23:31] XGBoost is not compiled with NCCL, falling back to Rabit.
2022-05-23 12:23:31,139 INFO [0]	eval-logloss:0.22669	train-logloss:0.23338
2022-05-23 12:23:31,182 INFO [1]	eval-logloss:0.13787	train-logloss:0.13666
2022-05-23 12:23:31,226 INFO [2]	eval-logloss:0.08046	train-logloss:0.08253
2022-05-23 12:23:31,270 INFO [3]	eval-logloss:0.05833	train-logloss:0.05647
2022-05-23 12:23:31,314 INFO [4]	eval-logloss:0.03829	train-logloss:0.04151
2022-05-23 12:23:31,358 INFO [5]	eval-logloss:0.02663	train-logloss:0.02961
2022-05-23 12:23:31,403 INFO [6]	eval-logloss:0.01388	train-logloss:0.01919
2022-05-23 12:23:31,447 INFO [7]	eval-logloss:0.01020	train-logloss:0.01332
2022-05-23 12:23:31,490 INFO [8]	eval-logloss:0.00848	train-logloss:0.01113
2022-05-23 12:23:31,534 INFO [9]	eval-logloss:0.00692	train-logloss:0.00663
2022-05-23 12:23:31,578 INFO [10]	eval-logloss:0.00544	train-logloss:0.00504
2022-05-23 12:23:31,623 INFO [11]	eval-logloss:0.00445	train-logloss:0.00420
2022-05-23 12:23:31,667 INFO [12]	eval-logloss:0.00336	train-logloss:0.00356
2022-05-23 12:23:31,710 INFO [13]	eval-logloss:0.00277	train-logloss:0.00281
2022-05-23 12:23:31,755 INFO [14]	eval-logloss:0.00252	train-logloss:0.00244
2022-05-23 12:23:31,799 INFO [15]	eval-logloss:0.00177	train-logloss:0.00194
2022-05-23 12:23:31,843 INFO [16]	eval-logloss:0.00157	train-logloss:0.00161
2022-05-23 12:23:31,887 INFO [17]	eval-logloss:0.00135	train-logloss:0.00142
2022-05-23 12:23:31,935 INFO [18]	eval-logloss:0.00123	train-logloss:0.00125
2022-05-23 12:23:31,979 INFO [19]	eval-logloss:0.00107	train-logloss:0.00107
2022-05-23 12:23:32,006 INFO Finished training
2022-05-23 12:23:32,006 INFO Finished training
2022-05-23 12:23:32,011 INFO @tracker All nodes finishes job
2022-05-23 12:23:32,011 INFO @tracker 1.2640743255615234 secs between node start and job finish

 ====== 2. RF distributed-gpu test with Python: 4 workers; 1 GPU per worker ====== 

2022-05-23 12:23:32,264 INFO start listen on 192.168.1.82:9092
[12:23:32] task 0 got new rank 0
[12:23:32] task 1 got new rank 1
2022-05-23 12:23:32,365 INFO @tracker All of 2 nodes getting started
[12:23:32] XGBoost distributed mode detected, will split data among workers
[12:23:32] Load part of data 0 of 2 parts
[12:23:32] XGBoost distributed mode detected, will split data among workers
[12:23:32] Load part of data 1 of 2 parts
[12:23:32] XGBoost distributed mode detected, will split data among workers
[12:23:32] Load part of data 1 of 2 parts
[12:23:32] XGBoost distributed mode detected, will split data among workers
[12:23:32] Load part of data 0 of 2 parts
/home/rou/src/xgboost-cpp/python-package/xgboost/core.py:577: FutureWarning: Pass `evals` as keyword args.  Passing these as positional arguments will be considered as error in future releases.
  warnings.warn(
/home/rou/src/xgboost-cpp/python-package/xgboost/core.py:577: FutureWarning: Pass `evals` as keyword args.  Passing these as positional arguments will be considered as error in future releases.
  warnings.warn(
[12:23:32] XGBoost is not compiled with NCCL, falling back to Rabit.
[12:23:32] XGBoost is not compiled with NCCL, falling back to Rabit.
[12:23:32] XGBoost is not compiled with NCCL, falling back to Rabit.
[12:23:32] XGBoost is not compiled with NCCL, falling back to Rabit.
2022-05-23 12:23:33,561 INFO [0]	eval-logloss:0.22100	train-logloss:0.21673
2022-05-23 12:23:33,607 INFO Finished training
2022-05-23 12:23:33,607 INFO Finished training
2022-05-23 12:23:33,609 INFO @tracker All nodes finishes job
2022-05-23 12:23:33,609 INFO @tracker 1.2437098026275635 secs between node start and job finish

It also enables federated learning on GPUs:

cmake .. -GNinja -DUSE_CUDA=ON -DUSE_NCCL=OFF -DPLUGIN_FEDERATED=ON
ninja
cd ../tests/distributed
./runtests-federated.sh

which generates this output:

Generating a RSA private key
.....................+++++
.....................................................................................+++++
writing new private key to 'server-key.pem'
-----
Generating a RSA private key
.....................+++++
............................................................................................................................................................+++++
writing new private key to 'client-key.pem'
-----
[12:27:08] Federated server listening on 0.0.0.0:9091, world size 2
[12:27:09] Connecting to federated server localhost:9091, world size 2, rank 0
[12:27:09] Connecting to federated server localhost:9091, world size 2, rank 1
[12:27:09] XGBoost federated mode detected, not splitting data among workers
[12:27:09] XGBoost federated mode detected, not splitting data among workers
[12:27:09] XGBoost federated mode detected, not splitting data among workers
[12:27:09] XGBoost federated mode detected, not splitting data among workers
[12:27:10] [0]	eval-logloss:0.22669	train-logloss:0.23338

[12:27:10] [1]	eval-logloss:0.13787	train-logloss:0.13666

[12:27:10] [2]	eval-logloss:0.08046	train-logloss:0.08253

[12:27:11] [3]	eval-logloss:0.05833	train-logloss:0.05647

[12:27:11] [4]	eval-logloss:0.03829	train-logloss:0.04151

[12:27:11] [5]	eval-logloss:0.02663	train-logloss:0.02961

[12:27:12] [6]	eval-logloss:0.01388	train-logloss:0.01919

[12:27:12] [7]	eval-logloss:0.01020	train-logloss:0.01332

[12:27:12] [8]	eval-logloss:0.00848	train-logloss:0.01113

[12:27:13] [9]	eval-logloss:0.00692	train-logloss:0.00663

[12:27:13] [10]	eval-logloss:0.00544	train-logloss:0.00504

[12:27:13] [11]	eval-logloss:0.00445	train-logloss:0.00420

[12:27:14] [12]	eval-logloss:0.00336	train-logloss:0.00356

[12:27:14] [13]	eval-logloss:0.00277	train-logloss:0.00281

[12:27:14] [14]	eval-logloss:0.00252	train-logloss:0.00244

[12:27:14] [15]	eval-logloss:0.00177	train-logloss:0.00194

[12:27:15] [16]	eval-logloss:0.00157	train-logloss:0.00161

[12:27:15] [17]	eval-logloss:0.00135	train-logloss:0.00142

[12:27:15] [18]	eval-logloss:0.00123	train-logloss:0.00125

[12:27:16] [19]	eval-logloss:0.00107	train-logloss:0.00107

[12:27:16] Finished training

[12:27:16] Federated server listening on 0.0.0.0:9091, world size 2
[12:27:17] Connecting to federated server localhost:9091, world size 2, rank 0
[12:27:17] Connecting to federated server localhost:9091, world size 2, rank 1
[12:27:17] XGBoost federated mode detected, not splitting data among workers
[12:27:17] XGBoost federated mode detected, not splitting data among workers
[12:27:17] XGBoost federated mode detected, not splitting data among workers
[12:27:17] XGBoost federated mode detected, not splitting data among workers
[12:27:17] [0]	eval-logloss:0.22669	train-logloss:0.23338

[12:27:17] [1]	eval-logloss:0.13787	train-logloss:0.13666

[12:27:17] [2]	eval-logloss:0.08046	train-logloss:0.08253

[12:27:17] [3]	eval-logloss:0.05833	train-logloss:0.05647

[12:27:17] [4]	eval-logloss:0.03829	train-logloss:0.04151

[12:27:17] [5]	eval-logloss:0.02663	train-logloss:0.02961

[12:27:17] [6]	eval-logloss:0.01388	train-logloss:0.01919

[12:27:17] [7]	eval-logloss:0.01020	train-logloss:0.01332

[12:27:17] [8]	eval-logloss:0.00848	train-logloss:0.01113

[12:27:17] [9]	eval-logloss:0.00692	train-logloss:0.00663

[12:27:17] [10]	eval-logloss:0.00544	train-logloss:0.00504

[12:27:17] [11]	eval-logloss:0.00445	train-logloss:0.00420

[12:27:17] [12]	eval-logloss:0.00336	train-logloss:0.00356

[12:27:17] [13]	eval-logloss:0.00277	train-logloss:0.00281

[12:27:17] [14]	eval-logloss:0.00252	train-logloss:0.00244

[12:27:17] [15]	eval-logloss:0.00177	train-logloss:0.00194

[12:27:17] [16]	eval-logloss:0.00157	train-logloss:0.00161

[12:27:17] [17]	eval-logloss:0.00135	train-logloss:0.00142

[12:27:17] [18]	eval-logloss:0.00123	train-logloss:0.00125

[12:27:17] [19]	eval-logloss:0.00107	train-logloss:0.00107

[12:27:17] Finished training

@trivialfis (Member) left a comment:

Thank you for working on the new option. Is there any chance of moving the NCCL comm under the umbrella of rabit instead of making an abstraction over rabit? For early development it's fine, but for the longer term I think we would like to have one communicator that can be used by all XGBoost algorithms.

@trivialfis (Member):

@RAMitchell it would be great if you could take a look at this.

@RAMitchell (Member):

I will take a good look tomorrow.

@rongou (Contributor, Author) commented May 24, 2022:

@trivialfis agreed that we need to do more refactoring on rabit. At a minimum we should be able to build with both the default engine and the federated engine enabled (perhaps also the MPI engine) so that we can include them in the release, and a user can pick which one to use at runtime. This is something I'm planning to work on next.

As for NCCL, one issue is that the send and receive buffers are on the device, so the API contract is different from rabit's. We can't really make NCCL a rabit engine without violating the Liskov Substitution Principle. We might be able to make it work with some adapters, but we'd have to change all the calling code, so it's not a trivial change. I think the current PR is a necessary stepping stone towards that. What do you think?
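
To make the contract mismatch concrete, here is an illustrative sketch with hypothetical interfaces (not XGBoost's actual classes): a rabit-style allreduce operates on a host buffer and blocks until the result is ready, while a NCCL-style allreduce takes a device pointer and a CUDA stream, so one cannot stand in for the other without changing what the caller must provide.

```cpp
// Hypothetical interfaces, for illustrating the contract mismatch only.
#include <cstddef>

// Rabit-style: sendrecvbuf lives in host memory and the call is blocking.
struct HostAllReducer {
  virtual void AllReduceSum(double* sendrecvbuf, std::size_t count) = 0;
  virtual ~HostAllReducer() = default;
};

// NCCL-style: the buffer lives on the device and the operation is enqueued
// on a stream (void* stands in for cudaStream_t here), so a NCCL reducer
// cannot implement HostAllReducer without changing the meaning of the call.
struct DeviceAllReducer {
  virtual void AllReduceSum(double* device_sendrecvbuf, std::size_t count,
                            void* cuda_stream) = 0;
  virtual ~DeviceAllReducer() = default;
};
```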

@RAMitchell (Member) left a comment:

Very nice to see the distributed version work without NCCL!

What is the purpose of CRTP here? Is it to share the code which monitors the number of bytes/calls?

We perhaps need to extend rabit's API to have knowledge of memory buffers on alternative devices.

Deeper refactor/attention to rabit is very welcome.

If we are to have different engines at runtime, does this imply runtime polymorphism in the rabit interface, or some other method? I guess rabit will now have some state concerning which engine is active. This can create hazards that didn't exist before, for example accidentally changing the engine during training.

What is your opinion on rabit's fault-recovery mechanisms? From the project README:

Reliable: rabit dig burrows to avoid disasters
Rabit programs can recover the model and results using synchronous function calls.
Rabit programs can set rabit_boostrap_cache=1 to support allreduce/broadcast operations before loadcheckpoint:
rabit::Init(); -> rabit::AllReduce(); -> rabit::loadCheckpoint(); -> for () { rabit::AllReduce(); rabit::Checkpoint(); } -> rabit::Shutdown();

Do these fault tolerance mechanisms actually work in the existing rabit engine? Do we intend to support them in future engines?

@trivialfis (Member):

The fault tolerance has been removed; the description is outdated.

@rongou (Contributor, Author) commented May 25, 2022:

@RAMitchell CRTP provides a static interface for both the NCCL and rabit allreducers, and there is a bit of code reuse. I'm actually on the fence about this; we could probably just have two separate implementations if you think that's simpler.
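
For context, a minimal sketch of the CRTP shape being described (hypothetical names, not the PR's actual classes): the base class holds the shared bookkeeping, such as counting bytes and calls, and dispatches statically to the concrete reducer.

```cpp
// Hypothetical names; a sketch of the CRTP pattern, not the PR's actual code.
#include <cstddef>
#include <iostream>

template <typename Derived>
class AllReducerBase {
 public:
  // Shared bookkeeping lives in the base; the actual reduction is dispatched
  // statically to the derived class, with no virtual call.
  void AllReduceSum(double* sendrecvbuf, std::size_t count) {
    allreduce_bytes_ += count * sizeof(double);
    ++allreduce_calls_;
    static_cast<Derived*>(this)->DoAllReduceSum(sendrecvbuf, count);
  }
  void PrintStats() const {
    std::cout << allreduce_bytes_ << " bytes over " << allreduce_calls_
              << " allreduce calls\n";
  }

 private:
  std::size_t allreduce_bytes_{0};
  std::size_t allreduce_calls_{0};
};

class RabitAllReducer : public AllReducerBase<RabitAllReducer> {
 public:
  void DoAllReduceSum(double* sendrecvbuf, std::size_t count) {
    // Would call rabit::Allreduce<rabit::op::Sum>(sendrecvbuf, count) here.
  }
};

class NcclAllReducer : public AllReducerBase<NcclAllReducer> {
 public:
  void DoAllReduceSum(double* device_buf, std::size_t count) {
    // Would call ncclAllReduce on a device buffer and a CUDA stream here.
  }
};
```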

If we don't care about fault tolerance, we can probably rewrite rabit in, say, a few hundred lines of code. As far as I can tell, we only really need broadcast and allreduce, and it would be nice to have allgatherv (the variable-length version of allgather, which is currently simulated with broadcast and allreduce by the MPI and NCCL implementations).
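
As an aside, one way a variable-length allgather can be simulated with a sum-allreduce (illustrative only; the helper names are made up, and this is not necessarily how the MPI/NCCL implementations do it): each rank writes its segment at its offset in a zero-filled buffer of the total length, and summing across ranks yields the concatenation.

```cpp
// Illustrative only; `allreduce_sum` stands in for the communicator's
// sum-allreduce (e.g. rabit::Allreduce<rabit::op::Sum> on a host buffer).
#include <algorithm>
#include <cstddef>
#include <vector>

// segment_sizes[r] must already be known on every rank (e.g. from a prior
// exchange of the per-rank sizes).
void AllGatherV(const std::vector<char>& my_segment, int my_rank,
                const std::vector<std::size_t>& segment_sizes,
                std::vector<char>* out,
                void (*allreduce_sum)(char*, std::size_t)) {
  std::size_t total = 0;
  std::size_t offset = 0;
  for (std::size_t r = 0; r < segment_sizes.size(); ++r) {
    if (static_cast<int>(r) < my_rank) offset += segment_sizes[r];
    total += segment_sizes[r];
  }
  out->assign(total, 0);  // zeros everywhere except this rank's slot
  std::copy(my_segment.begin(), my_segment.end(), out->begin() + offset);
  allreduce_sum(out->data(), out->size());  // summing zeros == concatenation
}
```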

But it is a pretty major refactoring. I'm hoping we can tackle it after ticking off the box "federated learning supports GPU training". :)

@RAMitchell (Member):

I actually don't mind about CRTP; I'm just curious.

A rabit rewrite could be very nice.

Happy to go ahead with this PR when you are ready.

@trivialfis (Member):

> But it is a pretty major refactoring. I'm hoping we can tackle it after ticking off the box "federated learning supports GPU training". :)

Yup, that sounds good. I looked into some implementations recently. If we were to rewrite rabit, we must make sure the new one is better than the current one (more robust and more efficient). We don't have to rush into this.

@rongou (Contributor, Author) commented May 27, 2022:

Can we get this PR merged? Thanks!

@trivialfis trivialfis merged commit 80339c3 into dmlc:master May 30, 2022
@rongou rongou deleted the gpu-rabit branch November 18, 2022 19:01