
Enable distributed GPU training over Rabit #7930

Merged: 8 commits merged into dmlc:master on May 30, 2022

Conversation

@rongou (Contributor) commented May 23, 2022:

Currently, distributed GPU training is only supported when NCCL is enabled. This PR provides a fallback to Rabit when NCCL is not available (for example, on Windows). A side effect is that it also allows federated learning with the gpu_hist method.
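
For illustration, a minimal sketch of what such a fallback can look like; the helper name and the host-staging strategy are assumptions for the sketch, not necessarily what this PR implements:

```cpp
// Illustrative sketch only; names and the staging strategy are assumptions.
#include <cstddef>
#include <vector>

#include <cuda_runtime.h>
#include <rabit/rabit.h>

void AllReduceSumOnDevice(double* device_buf, std::size_t count) {
#if defined(XGBOOST_USE_NCCL)
  // Builds with NCCL would call ncclAllReduce directly on the device buffer.
#else
  // Fallback: copy to host, allreduce with rabit, copy the result back.
  std::vector<double> host(count);
  cudaMemcpy(host.data(), device_buf, count * sizeof(double),
             cudaMemcpyDeviceToHost);
  rabit::Allreduce<rabit::op::Sum>(host.data(), count);
  cudaMemcpy(device_buf, host.data(), count * sizeof(double),
             cudaMemcpyHostToDevice);
#endif
}
```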

Part of #7778

@rongou (Contributor, Author) commented May 23, 2022:

@RAMitchell @trivialfis

I think this actually works out pretty well: it allows people to run multi-GPU training on platforms without NCCL. For example:

cmake .. -GNinja -DUSE_CUDA=ON -DUSE_NCCL=OFF
ninja
cd ../tests/distributed
./runtests-gpu.sh

which generates this output:

 ====== 1. Basic distributed-gpu test with Python: 4 workers; 1 GPU per worker ====== 

2022-05-23 12:23:30,651 INFO start listen on 192.168.1.82:9091
[12:23:30] task 0 got new rank 0
[12:23:30] task 1 got new rank 1
2022-05-23 12:23:30,746 INFO @tracker All of 2 nodes getting started
[12:23:30] XGBoost distributed mode detected, will split data among workers
[12:23:30] Load part of data 1 of 2 parts
[12:23:30] XGBoost distributed mode detected, will split data among workers
[12:23:30] Load part of data 0 of 2 parts
[12:23:30] XGBoost distributed mode detected, will split data among workers
[12:23:30] XGBoost distributed mode detected, will split data among workers
[12:23:30] Load part of data 1 of 2 parts
[12:23:30] Load part of data 0 of 2 parts
/home/rou/src/xgboost-cpp/python-package/xgboost/core.py:577: FutureWarning: Pass `evals` as keyword args.  Passing these as positional arguments will be considered as error in future releases.
  warnings.warn(
/home/rou/src/xgboost-cpp/python-package/xgboost/core.py:577: FutureWarning: Pass `evals` as keyword args.  Passing these as positional arguments will be considered as error in future releases.
  warnings.warn(
[12:23:31] XGBoost is not compiled with NCCL, falling back to Rabit.
[12:23:31] XGBoost is not compiled with NCCL, falling back to Rabit.
[12:23:31] XGBoost is not compiled with NCCL, falling back to Rabit.
[12:23:31] XGBoost is not compiled with NCCL, falling back to Rabit.
2022-05-23 12:23:31,139 INFO [0]	eval-logloss:0.22669	train-logloss:0.23338
2022-05-23 12:23:31,182 INFO [1]	eval-logloss:0.13787	train-logloss:0.13666
2022-05-23 12:23:31,226 INFO [2]	eval-logloss:0.08046	train-logloss:0.08253
2022-05-23 12:23:31,270 INFO [3]	eval-logloss:0.05833	train-logloss:0.05647
2022-05-23 12:23:31,314 INFO [4]	eval-logloss:0.03829	train-logloss:0.04151
2022-05-23 12:23:31,358 INFO [5]	eval-logloss:0.02663	train-logloss:0.02961
2022-05-23 12:23:31,403 INFO [6]	eval-logloss:0.01388	train-logloss:0.01919
2022-05-23 12:23:31,447 INFO [7]	eval-logloss:0.01020	train-logloss:0.01332
2022-05-23 12:23:31,490 INFO [8]	eval-logloss:0.00848	train-logloss:0.01113
2022-05-23 12:23:31,534 INFO [9]	eval-logloss:0.00692	train-logloss:0.00663
2022-05-23 12:23:31,578 INFO [10]	eval-logloss:0.00544	train-logloss:0.00504
2022-05-23 12:23:31,623 INFO [11]	eval-logloss:0.00445	train-logloss:0.00420
2022-05-23 12:23:31,667 INFO [12]	eval-logloss:0.00336	train-logloss:0.00356
2022-05-23 12:23:31,710 INFO [13]	eval-logloss:0.00277	train-logloss:0.00281
2022-05-23 12:23:31,755 INFO [14]	eval-logloss:0.00252	train-logloss:0.00244
2022-05-23 12:23:31,799 INFO [15]	eval-logloss:0.00177	train-logloss:0.00194
2022-05-23 12:23:31,843 INFO [16]	eval-logloss:0.00157	train-logloss:0.00161
2022-05-23 12:23:31,887 INFO [17]	eval-logloss:0.00135	train-logloss:0.00142
2022-05-23 12:23:31,935 INFO [18]	eval-logloss:0.00123	train-logloss:0.00125
2022-05-23 12:23:31,979 INFO [19]	eval-logloss:0.00107	train-logloss:0.00107
2022-05-23 12:23:32,006 INFO Finished training
2022-05-23 12:23:32,006 INFO Finished training
2022-05-23 12:23:32,011 INFO @tracker All nodes finishes job
2022-05-23 12:23:32,011 INFO @tracker 1.2640743255615234 secs between node start and job finish

 ====== 2. RF distributed-gpu test with Python: 4 workers; 1 GPU per worker ====== 

2022-05-23 12:23:32,264 INFO start listen on 192.168.1.82:9092
[12:23:32] task 0 got new rank 0
[12:23:32] task 1 got new rank 1
2022-05-23 12:23:32,365 INFO @tracker All of 2 nodes getting started
[12:23:32] XGBoost distributed mode detected, will split data among workers
[12:23:32] Load part of data 0 of 2 parts
[12:23:32] XGBoost distributed mode detected, will split data among workers
[12:23:32] Load part of data 1 of 2 parts
[12:23:32] XGBoost distributed mode detected, will split data among workers
[12:23:32] Load part of data 1 of 2 parts
[12:23:32] XGBoost distributed mode detected, will split data among workers
[12:23:32] Load part of data 0 of 2 parts
/home/rou/src/xgboost-cpp/python-package/xgboost/core.py:577: FutureWarning: Pass `evals` as keyword args.  Passing these as positional arguments will be considered as error in future releases.
  warnings.warn(
/home/rou/src/xgboost-cpp/python-package/xgboost/core.py:577: FutureWarning: Pass `evals` as keyword args.  Passing these as positional arguments will be considered as error in future releases.
  warnings.warn(
[12:23:32] XGBoost is not compiled with NCCL, falling back to Rabit.
[12:23:32] XGBoost is not compiled with NCCL, falling back to Rabit.
[12:23:32] XGBoost is not compiled with NCCL, falling back to Rabit.
[12:23:32] XGBoost is not compiled with NCCL, falling back to Rabit.
2022-05-23 12:23:33,561 INFO [0]	eval-logloss:0.22100	train-logloss:0.21673
2022-05-23 12:23:33,607 INFO Finished training
2022-05-23 12:23:33,607 INFO Finished training
2022-05-23 12:23:33,609 INFO @tracker All nodes finishes job
2022-05-23 12:23:33,609 INFO @tracker 1.2437098026275635 secs between node start and job finish

It also enables federated learning on GPUs:

cmake .. -GNinja -DUSE_CUDA=ON -DUSE_NCCL=OFF -DPLUGIN_FEDERATED=ON
ninja
cd ../tests/distributed
./runtests-federated.sh

which generates this output:

Generating a RSA private key
.....................+++++
.....................................................................................+++++
writing new private key to 'server-key.pem'
-----
Generating a RSA private key
.....................+++++
............................................................................................................................................................+++++
writing new private key to 'client-key.pem'
-----
[12:27:08] Federated server listening on 0.0.0.0:9091, world size 2
[12:27:09] Connecting to federated server localhost:9091, world size 2, rank 0
[12:27:09] Connecting to federated server localhost:9091, world size 2, rank 1
[12:27:09] XGBoost federated mode detected, not splitting data among workers
[12:27:09] XGBoost federated mode detected, not splitting data among workers
[12:27:09] XGBoost federated mode detected, not splitting data among workers
[12:27:09] XGBoost federated mode detected, not splitting data among workers
[12:27:10] [0]	eval-logloss:0.22669	train-logloss:0.23338

[12:27:10] [1]	eval-logloss:0.13787	train-logloss:0.13666

[12:27:10] [2]	eval-logloss:0.08046	train-logloss:0.08253

[12:27:11] [3]	eval-logloss:0.05833	train-logloss:0.05647

[12:27:11] [4]	eval-logloss:0.03829	train-logloss:0.04151

[12:27:11] [5]	eval-logloss:0.02663	train-logloss:0.02961

[12:27:12] [6]	eval-logloss:0.01388	train-logloss:0.01919

[12:27:12] [7]	eval-logloss:0.01020	train-logloss:0.01332

[12:27:12] [8]	eval-logloss:0.00848	train-logloss:0.01113

[12:27:13] [9]	eval-logloss:0.00692	train-logloss:0.00663

[12:27:13] [10]	eval-logloss:0.00544	train-logloss:0.00504

[12:27:13] [11]	eval-logloss:0.00445	train-logloss:0.00420

[12:27:14] [12]	eval-logloss:0.00336	train-logloss:0.00356

[12:27:14] [13]	eval-logloss:0.00277	train-logloss:0.00281

[12:27:14] [14]	eval-logloss:0.00252	train-logloss:0.00244

[12:27:14] [15]	eval-logloss:0.00177	train-logloss:0.00194

[12:27:15] [16]	eval-logloss:0.00157	train-logloss:0.00161

[12:27:15] [17]	eval-logloss:0.00135	train-logloss:0.00142

[12:27:15] [18]	eval-logloss:0.00123	train-logloss:0.00125

[12:27:16] [19]	eval-logloss:0.00107	train-logloss:0.00107

[12:27:16] Finished training

[12:27:16] Federated server listening on 0.0.0.0:9091, world size 2
[12:27:17] Connecting to federated server localhost:9091, world size 2, rank 0
[12:27:17] Connecting to federated server localhost:9091, world size 2, rank 1
[12:27:17] XGBoost federated mode detected, not splitting data among workers
[12:27:17] XGBoost federated mode detected, not splitting data among workers
[12:27:17] XGBoost federated mode detected, not splitting data among workers
[12:27:17] XGBoost federated mode detected, not splitting data among workers
[12:27:17] [0]	eval-logloss:0.22669	train-logloss:0.23338

[12:27:17] [1]	eval-logloss:0.13787	train-logloss:0.13666

[12:27:17] [2]	eval-logloss:0.08046	train-logloss:0.08253

[12:27:17] [3]	eval-logloss:0.05833	train-logloss:0.05647

[12:27:17] [4]	eval-logloss:0.03829	train-logloss:0.04151

[12:27:17] [5]	eval-logloss:0.02663	train-logloss:0.02961

[12:27:17] [6]	eval-logloss:0.01388	train-logloss:0.01919

[12:27:17] [7]	eval-logloss:0.01020	train-logloss:0.01332

[12:27:17] [8]	eval-logloss:0.00848	train-logloss:0.01113

[12:27:17] [9]	eval-logloss:0.00692	train-logloss:0.00663

[12:27:17] [10]	eval-logloss:0.00544	train-logloss:0.00504

[12:27:17] [11]	eval-logloss:0.00445	train-logloss:0.00420

[12:27:17] [12]	eval-logloss:0.00336	train-logloss:0.00356

[12:27:17] [13]	eval-logloss:0.00277	train-logloss:0.00281

[12:27:17] [14]	eval-logloss:0.00252	train-logloss:0.00244

[12:27:17] [15]	eval-logloss:0.00177	train-logloss:0.00194

[12:27:17] [16]	eval-logloss:0.00157	train-logloss:0.00161

[12:27:17] [17]	eval-logloss:0.00135	train-logloss:0.00142

[12:27:17] [18]	eval-logloss:0.00123	train-logloss:0.00125

[12:27:17] [19]	eval-logloss:0.00107	train-logloss:0.00107

[12:27:17] Finished training

@trivialfis (Member) left a comment:

Thank you for working on the new option. Is there any chance of moving the NCCL comm under the umbrella of rabit instead of making an abstraction over rabit? For early development it's fine, but for the longer term I think we would like to have one communicator that can be used by all XGBoost algorithms.

@trivialfis (Member):

@RAMitchell it would be great if you could take a look at this.

@RAMitchell (Member):

I will take a good look tomorrow.

@rongou (Contributor, Author) commented May 24, 2022:

@trivialfis agreed that we need to do more refactoring on rabit. At a minimum we should be able to build with both the default engine and the federated engine enabled (perhaps also the MPI engine) so that we can include them in the release, and a user can pick which one to use at runtime. This is something I'm planning to work on next.

As for NCCL, one issue is that the send and receive buffers are on the device, so the API contract is different from rabit's. We can't really make NCCL a rabit engine without violating the Liskov Substitution Principle. We might be able to make it work with some adapters, but we'd have to change all the calling code, so it's not a trivial change. I think the current PR is a necessary stepping stone towards that. What do you think?
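
To make the contract mismatch concrete, here is an illustrative sketch with hypothetical interfaces (not XGBoost's actual classes): a rabit-style allreduce operates on a host buffer and blocks until the result is ready, while a NCCL-style allreduce takes a device pointer and a CUDA stream, so one cannot stand in for the other without changing what the caller must provide.

```cpp
// Hypothetical interfaces, for illustrating the contract mismatch only.
#include <cstddef>

// Rabit-style: sendrecvbuf lives in host memory and the call is blocking.
struct HostAllReducer {
  virtual void AllReduceSum(double* sendrecvbuf, std::size_t count) = 0;
  virtual ~HostAllReducer() = default;
};

// NCCL-style: the buffer lives on the device and the operation is enqueued
// on a stream (void* stands in for cudaStream_t here), so a NCCL reducer
// cannot implement HostAllReducer without changing the meaning of the call.
struct DeviceAllReducer {
  virtual void AllReduceSum(double* device_sendrecvbuf, std::size_t count,
                            void* cuda_stream) = 0;
  virtual ~DeviceAllReducer() = default;
};
```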

@RAMitchell (Member) left a comment:

Very nice to see the distributed version work without NCCL!

What is the purpose of CRTP here? Is it to share the code which monitors the number of bytes/calls?

We perhaps need to extend rabit's API to have knowledge of memory buffers on alternative devices.

Deeper refactor/attention to rabit is very welcome.

If we are to have different engines at runtime, does this imply runtime polymorphism in the rabit interface, or some other method? I guess rabit will now have some state concerning which engine is active. This can create hazards that didn't exist before, for example accidentally changing the engine during training.

What is your opinion on rabit's fault-recovery mechanisms? From the project README:

Reliable: rabit dig burrows to avoid disasters
Rabit programs can recover the model and results using synchronous function calls.
Rabit programs can set rabit_boostrap_cache=1 to support allreduce/broadcast operations before loadcheckpoint:
rabit::Init(); -> rabit::AllReduce(); -> rabit::loadCheckpoint(); -> for () { rabit::AllReduce(); rabit::Checkpoint(); } -> rabit::Shutdown();

Do these fault tolerance mechanisms actually work in the existing rabit engine? Do we intend to support them in future engines?

@trivialfis (Member):

The fault tolerance has been removed; the description is outdated.

@rongou (Contributor, Author) commented May 25, 2022:

@RAMitchell CRTP provides a static interface for both the NCCL and rabit allreducers, and there is a bit of code reuse. I'm actually on the fence about this; we could probably just have two separate implementations if you think that's simpler.
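
For context, a minimal sketch of the CRTP shape being described (hypothetical names, not the PR's actual classes): the base class holds the shared bookkeeping, such as counting bytes and calls, and dispatches statically to the concrete reducer.

```cpp
// Hypothetical names; a sketch of the CRTP pattern, not the PR's actual code.
#include <cstddef>
#include <iostream>

template <typename Derived>
class AllReducerBase {
 public:
  // Shared bookkeeping lives in the base; the actual reduction is dispatched
  // statically to the derived class, with no virtual call.
  void AllReduceSum(double* sendrecvbuf, std::size_t count) {
    allreduce_bytes_ += count * sizeof(double);
    ++allreduce_calls_;
    static_cast<Derived*>(this)->DoAllReduceSum(sendrecvbuf, count);
  }
  void PrintStats() const {
    std::cout << allreduce_bytes_ << " bytes over " << allreduce_calls_
              << " allreduce calls\n";
  }

 private:
  std::size_t allreduce_bytes_{0};
  std::size_t allreduce_calls_{0};
};

class RabitAllReducer : public AllReducerBase<RabitAllReducer> {
 public:
  void DoAllReduceSum(double* sendrecvbuf, std::size_t count) {
    // Would call rabit::Allreduce<rabit::op::Sum>(sendrecvbuf, count) here.
  }
};

class NcclAllReducer : public AllReducerBase<NcclAllReducer> {
 public:
  void DoAllReduceSum(double* device_buf, std::size_t count) {
    // Would call ncclAllReduce on a device buffer and a CUDA stream here.
  }
};
```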

If we don't care about fault tolerance, we can probably rewrite rabit in, say, a few hundred lines of code. As far as I can tell, we only really need broadcast and allreduce, and it would be nice to have allgatherv (the variable-length version of allgather, which is currently simulated with broadcast and allreduce by the MPI and NCCL implementations).
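
As an aside, one way a variable-length allgather can be simulated with a sum-allreduce (illustrative only; the helper names are made up, and this is not necessarily how the MPI/NCCL implementations do it): each rank writes its segment at its offset in a zero-filled buffer of the total length, and summing across ranks yields the concatenation.

```cpp
// Illustrative only; `allreduce_sum` stands in for the communicator's
// sum-allreduce (e.g. rabit::Allreduce<rabit::op::Sum> on a host buffer).
#include <algorithm>
#include <cstddef>
#include <vector>

// segment_sizes[r] must already be known on every rank (e.g. from a prior
// exchange of the per-rank sizes).
void AllGatherV(const std::vector<char>& my_segment, int my_rank,
                const std::vector<std::size_t>& segment_sizes,
                std::vector<char>* out,
                void (*allreduce_sum)(char*, std::size_t)) {
  std::size_t total = 0;
  std::size_t offset = 0;
  for (std::size_t r = 0; r < segment_sizes.size(); ++r) {
    if (static_cast<int>(r) < my_rank) offset += segment_sizes[r];
    total += segment_sizes[r];
  }
  out->assign(total, 0);  // zeros everywhere except this rank's slot
  std::copy(my_segment.begin(), my_segment.end(), out->begin() + offset);
  allreduce_sum(out->data(), out->size());  // summing zeros == concatenation
}
```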

But it is a pretty major refactoring. I'm hoping we can tackle it after ticking off the box "federated learning supports GPU training". :)

@RAMitchell (Member):

I actually don't mind about CRTP; I'm just curious.

A rabit rewrite could be very nice.

Happy to go ahead with this PR when you are ready.

@trivialfis (Member):

> But it is a pretty major refactoring. I'm hoping we can tackle it after ticking off the box "federated learning supports GPU training". :)

Yup, that sounds good. I looked into some implementations recently. If we were to rewrite rabit, we must make sure the new one is better than the current one (more robust and more efficient). We don't have to rush into this.

@rongou (Contributor, Author) commented May 27, 2022:

Can we get this PR merged? Thanks!

@trivialfis trivialfis merged commit 80339c3 into dmlc:master May 30, 2022
@rongou rongou deleted the gpu-rabit branch November 18, 2022 19:01