Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revamp the rabit implementation. #10112

Open
wants to merge 24 commits into
base: master
Choose a base branch
from

Conversation

trivialfis
Copy link
Member

@trivialfis trivialfis commented Mar 11, 2024

motivation

With the increasing complexity of the networking module due to support for vertical and horizontal federated learning and GPU-based training for scaling, the existing rabit module is no longer sufficient. We have multiple dimensions of features to support:

  • data split.
  • federated learning.
  • GPU acceleration.

Among the above features, GPU acceleration and federated learning require loading optional external libraries. Lastly, we are trying to support resilience.

features

This PR replaces the original RABIT implementation with a new one, which has already been partially merged into XGBoost. The new one features:

  • Federated learning for both CPU and GPU.
  • NCCL.
  • More data types.
  • A unified interface for all the underlying implementations.
  • Improved timeout handling for both tracker and workers.
  • Exhausted tests with metrics (fixed a couple of bugs along the way).
  • A reusable tracker for Python and JVM packages.

todos:

  • JVM
  • Standardize the naming of worker parameters and tracker methods.
  • Worker sortby.

working in progress

Retry is still in progress. This is to provide essential support for handling exception (e.g., a network error or an OOM). Segfault handling has to be done with additional cooperation with the distributed framework and is out of scope for this work.

note for review

  • Breaking changes are made to the tracker in Python and JVM interfaces.
  • communicator parameters are reworked. For a list of parameters, see the document in the C header.
  • Federated tracker requires an additional n_workers parameter.

@trivialfis
Copy link
Member Author

@rongou Is using rank{} as process name instead of host name a deliberate choice for federated learning?

@rongou
Copy link
Contributor

rongou commented Mar 12, 2024

@rongou Is using rank{} as process name instead of host name a deliberate choice for federated learning?

Yes, in a federated setting, a participant may not want to expose the host name to the rest of the group.

@trivialfis
Copy link
Member Author

Tracking apache/arrow#41058 Need to remove the war in CI once a new snappy is published.

@trivialfis
Copy link
Member Author

@wbo4958 Please help look into the changes to JVM packages when you are available.

XGBoostJNI.checkCall(XGBoostJNI.TrackerRun(this.handle));
this.tracker_daemon = new Thread(() -> {
try {
XGBoostJNI.checkCall(XGBoostJNI.TrackerWaitFor(this.handle, 0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I ask why tracker_daemon is needed here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept it as a handle for the implementation of uncaughtException, do you have suggestions for handling it differently?

@trivialfis trivialfis marked this pull request as ready for review April 29, 2024 03:25
@trivialfis trivialfis changed the title [WIP] Revamp the rabit implementation. Revamp the rabit implementation. Apr 29, 2024
@trivialfis
Copy link
Member Author

trivialfis commented Apr 29, 2024

@rongou The PR is ready for an initial review. I can't extract more small PRs since the global communicator is swapped.

Need to investigate the JVM dependency and the flaky error.

get host name.

send port.

all gather.

assert.

op.

begin work on bootstrap.

utilities.

work on bootstrap.

move listener.

catch.

comm.

async send.

block.

block.

Start work on async.

batch poll.

tests.

move.

Start working on tracker.

better tests.

work on tracker.

bind.

work on accepting workers.

complete allgather.

Move.

Send with JSON.

work on shutdown.

msg.

compare task.

rename.

Move.

cleanup.

Start work on broadcast.

Work comm.

Cleanup bootstrap.

hide.

Move to bootstrap.

non blocking.

cleanup.

any op.

shift.

cleanup.

Cleanup.

per-thread.

checks.

start working on nccl.

backend.

Get the prototype compile.

log.

test print.

timeout on connection.

get nccl allreduce basic.

look into federated.

proto.

scatter reduce.

allreduce prototype.

Work on tests.

cleanup.

Initialization.

Init.

Work on Python.

get args.

Start working on allgatherv.

convert some allreduce.

remove some old use.

remove cpu impl.

work on gpu.

play with dlopen.

convert.

convert.

convert.

placeholder.

backend.

work on federated.

remove.

Move.

Federated tracker.

Move.

move into comm.

GPU variant.

not just nccl.

fix.

fix.

Convert.

bitwise.

stream.

copying allgather.

replace.

Remove.

remove.

Remove device.

Remove rabit.

Remove rabit.

cmake.

tests.

use gmock.

Move.

Split.

init.

Extract.

compiler.

test timeout.

exc.

comments.

Tests for federated.

Remove.

remove.

Split up.

refactor tests.

format.

extract magic number.

Extract more commands.

refactor.

Remove.

Reduce dependency on c api.

remove old code.

throw.

coll error.

indirect.

look into dask module.

parameters.

command.

probing.

listen for error.

debug.

host.

cleanup.

dask.

loop.

working basic.

header.

guard.

test.

type.

socket.

cleanup & notes.

use a state machine.

work on tests.

header.

test channel.

cleanup.

cleanup broadcast.

unneeded changes.

allgather string.

Fixes.

cleanup rebase.

fixes after rebase.

split up nccl comm.

Move data copying.

allgatherv test.

Extract.

tests.

test allreduce.

remove the use of ctx.

tests.

rebase.

work on fed.

work on allgatherv.

name.

lint.

Split.

split.

remove gmock.

move.

CPU.

CUDA.

compile.

Cleanup.

header.

Work on tests.

checks.

fixes.

tests.

work on CUDA test.

comm.

Share the implementation.

tests.

cleanup.

cleanup.

cleanup

cleanup.

set device.

cleanup.

cleanup.

more.

cleanup.

Get it work.

wait.

revert dask changes.

time.

remove reference to encoder.

extract.

extract.

split up the training function.

Fix.

deterministic.

Fix.

debug.

Fixes.

remove.

cleanup.

fix.

Move worker env.

cleanup.

cleanup.

wait.

cleanup.

extract error handling.

get abort to work as well.

Move.

policy.

cleanups.

cleanup.

Split up.

doc.

Cleanup ctor.

tests.

tests.

tests.

configuration.

tests.

task id.

start working on metric tests.

Remove.

type.

agg.

fix seq.

tests.

start working on cuda test.

type.

fixes.

tests.

Use device ord.

Remove auc.

remove elementwise.

remove multi-class

cleanup aft.

cleanup ranking.

remove old tests.

headers.

Move.

move.

single gpu tests.

Cleanup C API.

unknown.

C API.

Small cleanup.

cleanup.

Fix.

cleanup.

work on async queue.

work on sync.

Use blocking op.

result.

Fuzzing.

result.

Remove coll error.

Move.

cleanup.

cleanup.

cleanup.

cleanup.

cleanup.

cleanup.

cleanup.

Fix removal.

test

lint.

remove.

cleanup.

cleanup.

cleanup.

invoke result.

note.

Fix rebase.

Fix rebase.

fix & cleanup.

Fix.

remove coll error for now.

cleanup.

replace.

replace.

replace.

test basic.

cleanup, fix.

deduced size.

cleanup.

Convert to new routines.

Fixes.

Fix.

Add test.

use vector.

cleanup.

safe coll.

Fix.

Fix.

Fix.

Fix.

cleanup.

cleanup.

Don't throw.

Fixes.

v6

timeout.

Cleanups.

Remove error handling for now.

Timeout.

syc.

mac.

cli.

remove.

types.

federated.

build.

build.

sortby.

build.

windows, macos.

macos.

federated.

macos.

lint.

skip finalize.

Forbid empty data.

Shutdown before dtor.

small allreduce.

rounddown.

empty input.

remove warning.

remove get host IP.

take down the jvm package for now.

stop early.

annotation.

lint.

debug check.

windows.

np.bool.

Work on shutdown.

test blocking.

remove dask error.

Detach.

comments.

blocking.

display timeout.

debug github error.

checks.

Switch the order.

revert debug log.

fix tests.

delete.

reverse.

don't block.

improved error.

clear;

shutdown the tracker.

release lock early.

comments.

macos.

windows.

Move.

const.

Unify ctor.

remove exceptions.

lint, comment.

freeze pyarrow.

windows.

r package.

Fix CI.

Start looking into jvm

chpk

jni.

c test.

remove extra argument.

tracker.

interrupt.

cleanup.

Fix spark profiling.

Compile.

Start convert the scala package.

Log init.

cleanup test.

communicator.

alive.

tests

log.

Revert "log."

This reverts commit 3bc6d82.

Shutdown when exit.

remove tracker return code.

windows build.

shutdown only if not closed.

lint.

protect the listener.

concat.

Debug log.

detect EOF.

Revert "Debug log."

This reverts commit a3e0bd9.

Cleanup.

Fixes.

lint.

don't omit frame pointer.

Refactor tests.

Fix minimum build.

Fix distributed tests on single GPU.

cleanup & win build.

MacOS compilation.

typo.

macos, jvm.

Windows socket.

Ignore POLLHUP

Handle shutdown.

unix socket

fix

MSVC

Sock error

Restore the shutdown signal.

states

windows sock

enable win tests

lint.

update.

Skip tests.

skip only for gpu.

cleanup.

cleanup.

remove error code for now.

Documents.

lower case.

rename.

Fix.

Fix.
@trivialfis
Copy link
Member Author

@rongou Hi, the PR is ready for review. It's quite large, if you need to have an online review please let me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants