Revamp the rabit implementation. #10112
base: master
@rongou Is using |
Yes, in a federated setting, a participant may not want to expose the host name to the rest of the group.
Tracking apache/arrow#41058. We need to remove the workaround in CI once a new snappy release is published.
@wbo4958 Please help look into the changes to the JVM packages when you are available.
XGBoostJNI.checkCall(XGBoostJNI.TrackerRun(this.handle));
this.tracker_daemon = new Thread(() -> {
  try {
    XGBoostJNI.checkCall(XGBoostJNI.TrackerWaitFor(this.handle, 0));
May I ask why tracker_daemon is needed here?
I kept it as a handle for the implementation of uncaughtException. Do you have suggestions for handling it differently?
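To make the point about keeping a thread handle concrete, here is a minimal sketch of running a blocking wait on a daemon thread and capturing its failure through an uncaught-exception handler. It is not the code from this PR: `TrackerDaemonSketch`, `startDaemon`, and the `blockingWait` argument are hypothetical names, with `blockingWait` standing in for a blocking native call such as the `TrackerWaitFor` call quoted above.

```java
import java.util.concurrent.atomic.AtomicReference;

public class TrackerDaemonSketch {
    /**
     * Starts blockingWait on a daemon thread and records any uncaught
     * exception into failure. blockingWait is a hypothetical stand-in for
     * a blocking native call; it is not part of the XGBoost API.
     */
    public static Thread startDaemon(Runnable blockingWait,
                                     AtomicReference<Throwable> failure) {
        Thread daemon = new Thread(blockingWait, "tracker-daemon-sketch");
        // A daemon thread does not keep the JVM alive on its own.
        daemon.setDaemon(true);
        // The handler runs on the daemon thread if blockingWait throws.
        daemon.setUncaughtExceptionHandler((thread, error) -> failure.set(error));
        daemon.start();
        return daemon;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread daemon = startDaemon(() -> {
            throw new IllegalStateException("simulated tracker failure");
        }, failure);
        // The handler has already run by the time join() returns.
        daemon.join();
        System.out.println("captured: " + failure.get().getMessage());
    }
}
```

Returning the `Thread` lets the caller join or interrupt it later, which is one reason to keep a field like `tracker_daemon` around.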
@rongou The PR is ready for an initial review. I can't extract more small PRs since the global communicator is swapped. I still need to investigate the JVM dependency and the flaky error.
get host name. send port. all gather. assert. op. begin work on bootstrap. utilities. work on bootstrap. move listener. catch. comm. async send. block. block. Start work on async. batch poll. tests. move. Start working on tracker. better tests. work on tracker. bind. work on accepting workers. complete allgather. Move. Send with JSON. work on shutdown. msg. compare task. rename. Move. cleanup. Start work on broadcast. Work comm. Cleanup bootstrap. hide. Move to bootstrap. non blocking. cleanup. any op. shift. cleanup. Cleanup. per-thread. checks. start working on nccl. backend. Get the prototype compile. log. test print. timeout on connection. get nccl allreduce basic. look into federated. proto. scatter reduce. allreduce prototype. Work on tests. cleanup. Initialization. Init. Work on Python. get args. Start working on allgatherv. convert some allreduce. remove some old use. remove cpu impl. work on gpu. play with dlopen. convert. convert. convert. placeholder. backend. work on federated. remove. Move. Federated tracker. Move. move into comm. GPU variant. not just nccl. fix. fix. Convert. bitwise. stream. copying allgather. replace. Remove. remove. Remove device. Remove rabit. Remove rabit. cmake. tests. use gmock. Move. Split. init. Extract. compiler. test timeout. exc. comments. Tests for federated. Remove. remove. Split up. refactor tests. format. extract magic number. Extract more commands. refactor. Remove. Reduce dependency on c api. remove old code. throw. coll error. indirect. look into dask module. parameters. command. probing. listen for error. debug. host. cleanup. dask. loop. working basic. header. guard. test. type. socket. cleanup & notes. use a state machine. work on tests. header. test channel. cleanup. cleanup broadcast. unneeded changes. allgather string. Fixes. cleanup rebase. fixes after rebase. split up nccl comm. Move data copying. allgatherv test. Extract. tests. test allreduce. remove the use of ctx. tests. rebase. work on fed. 
work on allgatherv. name. lint. Split. split. remove gmock. move. CPU. CUDA. compile. Cleanup. header. Work on tests. checks. fixes. tests. work on CUDA test. comm. Share the implementation. tests. cleanup. cleanup. cleanup cleanup. set device. cleanup. cleanup. more. cleanup. Get it work. wait. revert dask changes. time. remove reference to encoder. extract. extract. split up the training function. Fix. deterministic. Fix. debug. Fixes. remove. cleanup. fix. Move worker env. cleanup. cleanup. wait. cleanup. extract error handling. get abort to work as well. Move. policy. cleanups. cleanup. Split up. doc. Cleanup ctor. tests. tests. tests. configuration. tests. task id. start working on metric tests. Remove. type. agg. fix seq. tests. start working on cuda test. type. fixes. tests. Use device ord. Remove auc. remove elementwise. remove multi-class cleanup aft. cleanup ranking. remove old tests. headers. Move. move. single gpu tests. Cleanup C API. unknown. C API. Small cleanup. cleanup. Fix. cleanup. work on async queue. work on sync. Use blocking op. result. Fuzzing. result. Remove coll error. Move. cleanup. cleanup. cleanup. cleanup. cleanup. cleanup. cleanup. Fix removal. test lint. remove. cleanup. cleanup. cleanup. invoke result. note. Fix rebase. Fix rebase. fix & cleanup. Fix. remove coll error for now. cleanup. replace. replace. replace. test basic. cleanup, fix. deduced size. cleanup. Convert to new routines. Fixes. Fix. Add test. use vector. cleanup. safe coll. Fix. Fix. Fix. Fix. cleanup. cleanup. Don't throw. Fixes. v6 timeout. Cleanups. Remove error handling for now. Timeout. syc. mac. cli. remove. types. federated. build. build. sortby. build. windows, macos. macos. federated. macos. lint. skip finalize. Forbid empty data. Shutdown before dtor. small allreduce. rounddown. empty input. remove warning. remove get host IP. take down the jvm package for now. stop early. annotation. lint. debug check. windows. np.bool. Work on shutdown. test blocking. 
remove dask error. Detach. comments. blocking. display timeout. debug github error. checks. Switch the order. revert debug log. fix tests. delete. reverse. don't block. improved error. clear; shutdown the tracker. release lock early. comments. macos. windows. Move. const. Unify ctor. remove exceptions. lint, comment. freeze pyarrow. windows. r package. Fix CI. Start looking into jvm chpk jni. c test. remove extra argument. tracker. interrupt. cleanup. Fix spark profiling. Compile. Start convert the scala package. Log init. cleanup test. communicator. alive. tests log. Revert "log." This reverts commit 3bc6d82. Shutdown when exit. remove tracker return code. windows build. shutdown only if not closed. lint. protect the listener. concat. Debug log. detect EOF. Revert "Debug log." This reverts commit a3e0bd9. Cleanup. Fixes. lint. don't omit frame pointer. Refactor tests. Fix minimum build. Fix distributed tests on single GPU. cleanup & win build. MacOS compilation. typo. macos, jvm. Windows socket. Ignore POLLHUP Handle shutdown. unix socket fix MSVC Sock error Restore the shutdown signal. states windows sock enable win tests lint. update. Skip tests. skip only for gpu. cleanup. cleanup. remove error code for now. Documents. lower case. rename. Fix. Fix.
@rongou Hi, the PR is ready for review. It's quite large, so please let me know if you would like to have an online review.
motivation
With the increasing complexity of the networking module due to support for vertical and horizontal federated learning and GPU-based training for scaling, the existing rabit module is no longer sufficient. We have multiple dimensions of features to support:
Among the above features, GPU acceleration and federated learning require loading optional external libraries. Lastly, we are trying to support resilience.
features
This PR replaces the original RABIT implementation with a new one, which has already been partially merged into XGBoost. The new one features:
todos:
work in progress
Retry is still in progress. It will provide essential support for handling exceptions (e.g., a network error or an OOM). Segfault handling requires additional cooperation with the distributed framework and is out of scope for this work.
note for review
n_workers parameter.