Enable distributed GPU training over Rabit #7930
Conversation
I think this actually works out pretty well; it allows people to run multi-GPU training on platforms without NCCL:

```
cmake .. -GNinja -DUSE_CUDA=ON -DUSE_NCCL=OFF
ninja
cd ../tests/distributed
./runtests-gpu.sh
```

This generates the following output:

```
====== 1. Basic distributed-gpu test with Python: 4 workers; 1 GPU per worker ======
2022-05-23 12:23:30,651 INFO start listen on 192.168.1.82:9091
[12:23:30] task 0 got new rank 0
[12:23:30] task 1 got new rank 1
2022-05-23 12:23:30,746 INFO @tracker All of 2 nodes getting started
[12:23:30] XGBoost distributed mode detected, will split data among workers
[12:23:30] Load part of data 1 of 2 parts
[12:23:30] XGBoost distributed mode detected, will split data among workers
[12:23:30] Load part of data 0 of 2 parts
[12:23:30] XGBoost distributed mode detected, will split data among workers
[12:23:30] XGBoost distributed mode detected, will split data among workers
[12:23:30] Load part of data 1 of 2 parts
[12:23:30] Load part of data 0 of 2 parts
/home/rou/src/xgboost-cpp/python-package/xgboost/core.py:577: FutureWarning: Pass `evals` as keyword args. Passing these as positional arguments will be considered as error in future releases.
warnings.warn(
/home/rou/src/xgboost-cpp/python-package/xgboost/core.py:577: FutureWarning: Pass `evals` as keyword args. Passing these as positional arguments will be considered as error in future releases.
warnings.warn(
[12:23:31] XGBoost is not compiled with NCCL, falling back to Rabit.
[12:23:31] XGBoost is not compiled with NCCL, falling back to Rabit.
[12:23:31] XGBoost is not compiled with NCCL, falling back to Rabit.
[12:23:31] XGBoost is not compiled with NCCL, falling back to Rabit.
2022-05-23 12:23:31,139 INFO [0] eval-logloss:0.22669 train-logloss:0.23338
2022-05-23 12:23:31,182 INFO [1] eval-logloss:0.13787 train-logloss:0.13666
2022-05-23 12:23:31,226 INFO [2] eval-logloss:0.08046 train-logloss:0.08253
2022-05-23 12:23:31,270 INFO [3] eval-logloss:0.05833 train-logloss:0.05647
2022-05-23 12:23:31,314 INFO [4] eval-logloss:0.03829 train-logloss:0.04151
2022-05-23 12:23:31,358 INFO [5] eval-logloss:0.02663 train-logloss:0.02961
2022-05-23 12:23:31,403 INFO [6] eval-logloss:0.01388 train-logloss:0.01919
2022-05-23 12:23:31,447 INFO [7] eval-logloss:0.01020 train-logloss:0.01332
2022-05-23 12:23:31,490 INFO [8] eval-logloss:0.00848 train-logloss:0.01113
2022-05-23 12:23:31,534 INFO [9] eval-logloss:0.00692 train-logloss:0.00663
2022-05-23 12:23:31,578 INFO [10] eval-logloss:0.00544 train-logloss:0.00504
2022-05-23 12:23:31,623 INFO [11] eval-logloss:0.00445 train-logloss:0.00420
2022-05-23 12:23:31,667 INFO [12] eval-logloss:0.00336 train-logloss:0.00356
2022-05-23 12:23:31,710 INFO [13] eval-logloss:0.00277 train-logloss:0.00281
2022-05-23 12:23:31,755 INFO [14] eval-logloss:0.00252 train-logloss:0.00244
2022-05-23 12:23:31,799 INFO [15] eval-logloss:0.00177 train-logloss:0.00194
2022-05-23 12:23:31,843 INFO [16] eval-logloss:0.00157 train-logloss:0.00161
2022-05-23 12:23:31,887 INFO [17] eval-logloss:0.00135 train-logloss:0.00142
2022-05-23 12:23:31,935 INFO [18] eval-logloss:0.00123 train-logloss:0.00125
2022-05-23 12:23:31,979 INFO [19] eval-logloss:0.00107 train-logloss:0.00107
2022-05-23 12:23:32,006 INFO Finished training
2022-05-23 12:23:32,006 INFO Finished training
2022-05-23 12:23:32,011 INFO @tracker All nodes finishes job
2022-05-23 12:23:32,011 INFO @tracker 1.2640743255615234 secs between node start and job finish
====== 2. RF distributed-gpu test with Python: 4 workers; 1 GPU per worker ======
2022-05-23 12:23:32,264 INFO start listen on 192.168.1.82:9092
[12:23:32] task 0 got new rank 0
[12:23:32] task 1 got new rank 1
2022-05-23 12:23:32,365 INFO @tracker All of 2 nodes getting started
[12:23:32] XGBoost distributed mode detected, will split data among workers
[12:23:32] Load part of data 0 of 2 parts
[12:23:32] XGBoost distributed mode detected, will split data among workers
[12:23:32] Load part of data 1 of 2 parts
[12:23:32] XGBoost distributed mode detected, will split data among workers
[12:23:32] Load part of data 1 of 2 parts
[12:23:32] XGBoost distributed mode detected, will split data among workers
[12:23:32] Load part of data 0 of 2 parts
/home/rou/src/xgboost-cpp/python-package/xgboost/core.py:577: FutureWarning: Pass `evals` as keyword args. Passing these as positional arguments will be considered as error in future releases.
warnings.warn(
/home/rou/src/xgboost-cpp/python-package/xgboost/core.py:577: FutureWarning: Pass `evals` as keyword args. Passing these as positional arguments will be considered as error in future releases.
warnings.warn(
[12:23:32] XGBoost is not compiled with NCCL, falling back to Rabit.
[12:23:32] XGBoost is not compiled with NCCL, falling back to Rabit.
[12:23:32] XGBoost is not compiled with NCCL, falling back to Rabit.
[12:23:32] XGBoost is not compiled with NCCL, falling back to Rabit.
2022-05-23 12:23:33,561 INFO [0] eval-logloss:0.22100 train-logloss:0.21673
2022-05-23 12:23:33,607 INFO Finished training
2022-05-23 12:23:33,607 INFO Finished training
2022-05-23 12:23:33,609 INFO @tracker All nodes finishes job
2022-05-23 12:23:33,609 INFO @tracker 1.2437098026275635 secs between node start and job finish
```

This also enables federated learning on GPUs:

```
cmake .. -GNinja -DUSE_CUDA=ON -DUSE_NCCL=OFF -DPLUGIN_FEDERATED=ON
ninja
cd ../tests/distributed
./runtests-federated.sh
```

which generates this output:

```
Generating a RSA private key
.....................+++++
.....................................................................................+++++
writing new private key to 'server-key.pem'
-----
Generating a RSA private key
.....................+++++
............................................................................................................................................................+++++
writing new private key to 'client-key.pem'
-----
[12:27:08] Federated server listening on 0.0.0.0:9091, world size 2
[12:27:09] Connecting to federated server localhost:9091, world size 2, rank 0
[12:27:09] Connecting to federated server localhost:9091, world size 2, rank 1
[12:27:09] XGBoost federated mode detected, not splitting data among workers
[12:27:09] XGBoost federated mode detected, not splitting data among workers
[12:27:09] XGBoost federated mode detected, not splitting data among workers
[12:27:09] XGBoost federated mode detected, not splitting data among workers
[12:27:10] [0] eval-logloss:0.22669 train-logloss:0.23338
[12:27:10] [1] eval-logloss:0.13787 train-logloss:0.13666
[12:27:10] [2] eval-logloss:0.08046 train-logloss:0.08253
[12:27:11] [3] eval-logloss:0.05833 train-logloss:0.05647
[12:27:11] [4] eval-logloss:0.03829 train-logloss:0.04151
[12:27:11] [5] eval-logloss:0.02663 train-logloss:0.02961
[12:27:12] [6] eval-logloss:0.01388 train-logloss:0.01919
[12:27:12] [7] eval-logloss:0.01020 train-logloss:0.01332
[12:27:12] [8] eval-logloss:0.00848 train-logloss:0.01113
[12:27:13] [9] eval-logloss:0.00692 train-logloss:0.00663
[12:27:13] [10] eval-logloss:0.00544 train-logloss:0.00504
[12:27:13] [11] eval-logloss:0.00445 train-logloss:0.00420
[12:27:14] [12] eval-logloss:0.00336 train-logloss:0.00356
[12:27:14] [13] eval-logloss:0.00277 train-logloss:0.00281
[12:27:14] [14] eval-logloss:0.00252 train-logloss:0.00244
[12:27:14] [15] eval-logloss:0.00177 train-logloss:0.00194
[12:27:15] [16] eval-logloss:0.00157 train-logloss:0.00161
[12:27:15] [17] eval-logloss:0.00135 train-logloss:0.00142
[12:27:15] [18] eval-logloss:0.00123 train-logloss:0.00125
[12:27:16] [19] eval-logloss:0.00107 train-logloss:0.00107
[12:27:16] Finished training
[12:27:16] Federated server listening on 0.0.0.0:9091, world size 2
[12:27:17] Connecting to federated server localhost:9091, world size 2, rank 0
[12:27:17] Connecting to federated server localhost:9091, world size 2, rank 1
[12:27:17] XGBoost federated mode detected, not splitting data among workers
[12:27:17] XGBoost federated mode detected, not splitting data among workers
[12:27:17] XGBoost federated mode detected, not splitting data among workers
[12:27:17] XGBoost federated mode detected, not splitting data among workers
[12:27:17] [0] eval-logloss:0.22669 train-logloss:0.23338
[12:27:17] [1] eval-logloss:0.13787 train-logloss:0.13666
[12:27:17] [2] eval-logloss:0.08046 train-logloss:0.08253
[12:27:17] [3] eval-logloss:0.05833 train-logloss:0.05647
[12:27:17] [4] eval-logloss:0.03829 train-logloss:0.04151
[12:27:17] [5] eval-logloss:0.02663 train-logloss:0.02961
[12:27:17] [6] eval-logloss:0.01388 train-logloss:0.01919
[12:27:17] [7] eval-logloss:0.01020 train-logloss:0.01332
[12:27:17] [8] eval-logloss:0.00848 train-logloss:0.01113
[12:27:17] [9] eval-logloss:0.00692 train-logloss:0.00663
[12:27:17] [10] eval-logloss:0.00544 train-logloss:0.00504
[12:27:17] [11] eval-logloss:0.00445 train-logloss:0.00420
[12:27:17] [12] eval-logloss:0.00336 train-logloss:0.00356
[12:27:17] [13] eval-logloss:0.00277 train-logloss:0.00281
[12:27:17] [14] eval-logloss:0.00252 train-logloss:0.00244
[12:27:17] [15] eval-logloss:0.00177 train-logloss:0.00194
[12:27:17] [16] eval-logloss:0.00157 train-logloss:0.00161
[12:27:17] [17] eval-logloss:0.00135 train-logloss:0.00142
[12:27:17] [18] eval-logloss:0.00123 train-logloss:0.00125
[12:27:17] [19] eval-logloss:0.00107 train-logloss:0.00107
[12:27:17] Finished training
```
Thank you for working on the new option. Is there any chance to move the NCCL comm under the umbrella of rabit instead of making an abstraction over rabit? For early development it's fine, but for the longer term I think we would like to have one communicator that can be used by all XGBoost algorithms.
@RAMitchell it would be great if you could take a look at this.
I will take a good look tomorrow.
@trivialfis agreed that we need to do more refactoring on rabit. At a minimum we should be able to build with both the default engine and the federated engine enabled (perhaps also the MPI engine) so that we can include them in the release, and a user can pick which one to use at runtime. This is something I'm planning to work on next.

As for NCCL, one issue is that the send and receive buffers are on device, so the API contract is different from that of rabit. We can't really make NCCL a rabit engine without violating the Liskov Substitution Principle. We might be able to make it work with some adapters, but we'd have to change all the calling code, so it's not a trivial change. I think the current PR is a necessary stepping stone towards that. What do you think?
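For readers following along, here is a minimal, hypothetical sketch of the kind of fallback path being discussed: staging device buffers through host memory so that rabit can reduce them. The function name and structure are illustrative only, not the code in this PR.

```cpp
// Hypothetical adapter sketch: rabit::Allreduce operates in place on host
// pointers, while device data has to be staged through host memory first.
#include <cstddef>
#include <vector>

#include <cuda_runtime.h>
#include <rabit/rabit.h>

// Copy the device buffer to the host, run the in-place rabit allreduce there,
// then copy the reduced result back to the device.
void AllReduceSumViaRabit(float *device_buf, std::size_t count) {
  std::vector<float> host_buf(count);
  cudaMemcpy(host_buf.data(), device_buf, count * sizeof(float),
             cudaMemcpyDeviceToHost);
  rabit::Allreduce<rabit::op::Sum>(host_buf.data(), count);
  cudaMemcpy(device_buf, host_buf.data(), count * sizeof(float),
             cudaMemcpyHostToDevice);
}

// By contrast, NCCL works directly on device memory and also needs a
// communicator and a CUDA stream, e.g.:
//   ncclAllReduce(device_buf, device_buf, count, ncclFloat, ncclSum, comm, stream);
```

The extra host round trip is what makes the two contracts differ, which is the LSP concern mentioned above.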
Very nice to see the distributed version work without nccl!
What is the purpose of CRTP here? Is it to share the code which monitors the number of bytes/calls?
We perhaps need to extend rabit's API to have knowledge of memory buffers on alternative devices.
Deeper refactor/attention to rabit is very welcome.
If we are to have different engines at runtime does this imply runtime polymorphism in the rabit interface or some other method? I guess rabit will now have a state concerning which engine is active. This can create some hazards that didn't exist before, for example, accidentally changing the engine during training.
What is your opinion on rabit's fault recovery mechanisms? From the project readme:
Reliable: rabit dig burrows to avoid disasters
Rabit programs can recover the model and results using synchronous function calls.
Rabit programs can set rabit_bootstrap_cache=1 to support allreduce/broadcast operations before loadcheckpoint:
rabit::Init(); -> rabit::AllReduce(); -> rabit::loadCheckpoint(); -> for () { rabit::AllReduce(); rabit::Checkpoint(); } -> rabit::Shutdown();
Do these fault tolerance mechanisms actually work in the existing rabit engine? Do we intend to support them in future engines?
The fault tolerance is removed. The description is outdated.
@RAMitchell CRTP provides a static interface for both the NCCL and rabit allreducers, and there is a bit of code reuse. I'm actually on the fence about this; we can probably just have two separate implementations if you think that's simpler. If we don't care about fault tolerance, we can probably just rewrite rabit in, say, a few hundred lines of code. As far as I can tell, we only really need Allreduce and Broadcast, but it is a pretty major refactoring. I'm hoping we can tackle it after ticking off the box "federated learning supports GPU training". :)
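For context on the CRTP question above, a minimal sketch of the pattern (class and member names are made up for illustration, not taken from the PR): the base class holds the shared bookkeeping, such as counting bytes and calls, and dispatches statically to whichever backend it is instantiated with.

```cpp
#include <cstddef>
#include <iostream>

// CRTP base: shared monitoring code, no virtual dispatch.
template <typename Backend>
class AllReducerBase {
 public:
  // Shared code path: record bytes/calls, then dispatch statically to the backend.
  void AllReduceSum(double *data, std::size_t count) {
    bytes_ += count * sizeof(double);
    ++calls_;
    static_cast<Backend *>(this)->DoAllReduceSum(data, count);
  }
  void PrintStats() const {
    std::cout << calls_ << " calls, " << bytes_ << " bytes reduced\n";
  }

 private:
  std::size_t bytes_{0};
  std::size_t calls_{0};
};

// A host-memory backend (stands in for a rabit-based allreducer).
class HostAllReducer : public AllReducerBase<HostAllReducer> {
 public:
  void DoAllReduceSum(double *data, std::size_t count) {
    // e.g. rabit::Allreduce<rabit::op::Sum>(data, count);
    (void)data;
    (void)count;
  }
};

// A device-memory backend (standing in for an NCCL-based allreducer) would
// expose the same static interface while operating on device pointers internally.
```

Both backends then share the byte/call monitoring without any virtual calls, which is the code-reuse point above.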
I actually don't mind about CRTP, I'm just curious. A rabit rewrite could be very nice. Happy to go ahead with this PR when you are ready.
Yup, that sounds good. I looked into some implementations recently. If we were to rewrite rabit, we must make sure the new one is better than the current one (more robust and efficient). We don't have to rush into this.
Can we get this PR merged? Thanks!
Currently distributed GPU training is only supported with NCCL enabled. This PR provides a fallback to Rabit when NCCL is not available (for example, on Windows). A side effect is to allow federated learning using the `gpu_hist` method.

Part of #7778