[Breaking] Switch from rabit to the collective communicator #8257

Merged — 29 commits merged into dmlc:master from the switch-to-communicator branch on Oct 5, 2022

Conversation

@rongou (Contributor) commented on Sep 21, 2022

This PR switches the Rabit API to the Communicator, which gives us more flexibility in the collective communication implementation. It's a breaking change, but for most uses it's a straightforward swap.
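
For callers of the C++ API, the change is essentially a rename of the entry points. Below is a hedged sketch of the swap using a stand-in namespace rather than the real xgboost headers; the actual functions and signatures touched by this PR may differ.

#include <iostream>
#include <vector>

// Stand-in for the new communicator API. In the real tree the calls live in
// the xgboost::collective namespace (see the diffs below); the old ones lived
// in rabit:: (rabit::Init, rabit::Allreduce, rabit::GetRank, rabit::Finalize).
namespace collective_like {
void Init() {}
void Finalize() {}
int GetRank() { return 0; }
void AllreduceSum(std::vector<double>* /*values*/) { /* single-process no-op */ }
}  // namespace collective_like

int main() {
  std::vector<double> gradients{1.0, 2.0, 3.0};

  // The call sites stay structurally identical; only the namespace/entry
  // points change from rabit to the collective communicator.
  collective_like::Init();
  collective_like::AllreduceSum(&gradients);
  std::cout << "rank " << collective_like::GetRank() << std::endl;
  collective_like::Finalize();
  return 0;
}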

@trivialfis (Member) left a comment

Thank you for the work on swapping out rabit. When it's ready, could you please split up the PR into smaller ones and start with internal C++ changes?

@rongou (Contributor, Author) commented on Sep 22, 2022

@trivialfis The issue is that this is a breaking change. Once we change the C++ portion, we have to change the Python and Java APIs too; otherwise the communicator would be uninitialized.

@rongou (Contributor, Author) commented on Sep 23, 2022

@wbo4958 any ideas about the JVM test failures? They seem to pass on my local desktop.

@wbo4958 (Contributor) commented on Sep 26, 2022

I will check it today.

@wbo4958 (Contributor) commented on Sep 26, 2022

I can repro it locally, which will make life easier.

@wbo4958 (Contributor) commented on Sep 26, 2022

It seems the "test rabit timeout fail handle" test has affected the others.

After replacing it with the code below, it worked for me.

  test("test rabit timeout fail handle") {
    val training = buildDataFrame(Classification.train)

    try {
      // mock rank 0 failure during 8th allreduce synchronization
      Communicator.mockList = Array("0,8,0,0").toList.asJava
      intercept[SparkException] {
        new XGBoostClassifier(Map(
          "eta" -> "0.1",
          "max_depth" -> "10",
          "verbosity" -> "1",
          "objective" -> "binary:logistic",
          "num_round" -> 5,
          "num_workers" -> numWorkers,
          "rabit_timeout" -> 0))
          .fit(training)
      }
    } finally {
      // Reset the mock list so subsequent tests are not affected by the mocked failure.
      Communicator.mockList = Array.empty.toList.asJava
    }
  }

@rongou changed the title from "[WIP] Switch from rabit to the collective communicator" to "[Breaking] Switch from rabit to the collective communicator" on Sep 26, 2022
@rongou (Contributor, Author) commented on Sep 26, 2022

@wbo4958 Thanks for the help with the debugging. Resetting the mockList seems to have fixed it.

@trivialfis I think this PR is ready for review. It touches a lot of files, but mostly it's a one-to-one swap from rabit to communicator. Thanks!

@trivialfis (Member) left a comment

Thank you for the great work on swapping out rabit. Could you please lay out the plan for future PRs for the 1.7 release? I'm trying to estimate an ETA.

@@ -16,7 +17,7 @@ thread_local std::unique_ptr<DeviceCommunicator> Communicator::device_communicat

void Communicator::Finalize() {
  communicator_->Shutdown();
  communicator_.reset(nullptr);
@trivialfis (Member):

Could you please share under which case the communicator can still be called after being shut down? I think a nullptr can be a guard for unintentional calls.

@rongou (Contributor, Author):

I think we have some tests that mix distributed training and local training.

/**
 * A no-op communicator, used for non-distributed training.
 */
class NoOpCommunicator : public Communicator {
@trivialfis (Member):

I see that you have already added checks for non-distributed env in various communicator implementations. Is this still necessary?

@rongou (Contributor, Author):

As mentioned above, this is needed to replicate the existing rabit behavior.
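
To make that behavior concrete, here is a minimal sketch of the no-op idea: every collective operation does nothing and the "world" looks like a single worker. This is an illustration only; the actual Communicator interface in this PR has more methods and different signatures.

#include <cstddef>

// Illustrative base interface (not the real xgboost Communicator).
class CommunicatorLike {
 public:
  virtual ~CommunicatorLike() = default;
  virtual int GetRank() const = 0;
  virtual int GetWorldSize() const = 0;
  virtual bool IsDistributed() const = 0;
  virtual void AllreduceSum(double* data, std::size_t count) = 0;
  virtual void Broadcast(void* data, std::size_t size, int root) = 0;
};

// With a single worker there is nothing to reduce or broadcast, so each
// operation is a no-op and the local buffer already holds the global result.
class NoOpCommunicatorLike : public CommunicatorLike {
 public:
  int GetRank() const override { return 0; }
  int GetWorldSize() const override { return 1; }
  bool IsDistributed() const override { return false; }
  void AllreduceSum(double*, std::size_t) override {}
  void Broadcast(void*, std::size_t, int) override {}
};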

@@ -46,7 +45,7 @@ struct DeviceAUCCache {
   dh::device_vector<size_t> unique_idx;
   // p^T: transposed prediction matrix, used by MultiClassAUC
   dh::device_vector<float> predts_t;
-  std::unique_ptr<dh::AllReducer> reducer;
+  collective::DeviceCommunicator* communicator;
@trivialfis (Member):

If this is now a global instance, we don't have to maintain a pointer to it.

@rongou (Contributor, Author):

We still need the device id to get the communicator, which in this class is only passed in during Init and not saved. We can probably clean this up once we have better device id management.

@rongou (Contributor, Author):

Ended up removing these pointers.

@@ -158,7 +158,7 @@ def _try_start_tracker(
     if isinstance(addrs[0], tuple):
         host_ip = addrs[0][0]
         port = addrs[0][1]
-        rabit_context = RabitTracker(
+        rabit_tracker = RabitTracker(
@trivialfis (Member):

Hmm... so we still need the rabit tracker for downstream projects. Could you please share how federated learning communicates the worker addresses across all workers?

@rongou (Contributor, Author):

Yes, if we use the rabit communicator, which is the default, we still need to start a tracker. I imagine if we switch to something like gloo, then we can get rid of rabit completely.

For federated learning, since we have to start a gRPC server first, we just pass the server address (host:port) to each client.
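
To illustrate the address-distribution point: each federated client only needs the server's host:port (plus its own rank), rather than a tracker-style worker discovery step. The sketch below is a hypothetical configuration; only the "xgboost_communicator" key is mentioned elsewhere in this thread, and the remaining key names are assumptions for illustration.

#include <map>
#include <string>

int main() {
  // Hypothetical per-client configuration handed to the communicator's Init
  // (key names other than "xgboost_communicator" are assumed, not taken from this PR).
  std::map<std::string, std::string> config{
      {"xgboost_communicator", "federated"},                       // select the federated backend
      {"federated_server_address", "fl-server.example.com:9091"},  // host:port of the gRPC server
      {"federated_world_size", "3"},                               // total number of clients
      {"federated_rank", "0"},                                     // this client's id
  };
  (void)config;  // each client would pass this to Init before training starts
  return 0;
}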

@@ -12,14 +13,10 @@
 namespace xgboost {
 namespace collective {

-thread_local std::unique_ptr<Communicator> Communicator::communicator_{};
+thread_local std::unique_ptr<Communicator> Communicator::communicator_{new NoOpCommunicator()};
@trivialfis (Member):

See the other comments on the no-op; I have some concerns that we will use it accidentally. We have issues where XGBoost failed to establish a working communicator group but proceeded with distributed training without an explicit error. I haven't been able to tackle those issues due to the complicated network setups others use. Still, it's something we should keep in mind.

@rongou (Contributor, Author):

On this line https://github.com/dmlc/xgboost/blob/master/rabit/src/engine.cc#L73, rabit provides a default engine if it's not initialized. We have some code that depends on this behavior. The NoOpCommunicator is one way to replicate this behavior.

I think the user has to tell us if they are doing distributed training, either by entering the CommunicatorContext or by calling Init directly. Otherwise we'd have no way of knowing, and we can't just return a nullptr when they are not in distributed mode.
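
A hedged sketch of that design: a thread_local singleton that starts as a no-op, gets swapped out only when the user opts into distributed training, and (in one option discussed here) falls back to the no-op again on Finalize so that later single-process calls remain valid. The names are illustrative, not the actual classes in this PR.

#include <memory>

struct CommunicatorLike {
  virtual ~CommunicatorLike() = default;
  virtual bool IsDistributed() const { return false; }
};

struct NoOpCommunicatorLike : CommunicatorLike {};

class CommunicatorSingleton {
 public:
  // Defaults to a no-op, mirroring rabit's "default engine when uninitialized".
  static CommunicatorLike* Get() { return instance_.get(); }

  // Called only when the user explicitly opts into distributed training,
  // e.g. via a context manager or a direct Init call.
  static void Init(std::unique_ptr<CommunicatorLike> real) { instance_ = std::move(real); }

  // Resetting back to a no-op (instead of nullptr) keeps later single-process
  // calls valid, which tests that mix distributed and local training rely on.
  static void Finalize() { instance_ = std::make_unique<NoOpCommunicatorLike>(); }

 private:
  static thread_local std::unique_ptr<CommunicatorLike> instance_;
};

thread_local std::unique_ptr<CommunicatorLike> CommunicatorSingleton::instance_{
    new NoOpCommunicatorLike()};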

@rongou (Contributor, Author) left a comment

After this PR, I think the only code change needed is #8279, and then we just need to tweak the CI to build with federated learning enabled.

@trivialfis added this to In progress in the 1.7 Roadmap via automation on Sep 29, 2022
1.7 Roadmap automation moved this from In progress to Reviewer approved on Sep 29, 2022
@trivialfis (Member) commented:

Could you please merge master again?

@rongou (Contributor, Author) commented on Sep 30, 2022

It's up to date.

@rongou (Contributor, Author) commented on Oct 3, 2022

@trivialfis can this be merged? Thanks!

@hcho3 (Collaborator) commented on Oct 3, 2022

@rongou We'll have to first merge #8298 to fix the CI.

@rongou (Contributor, Author) commented on Oct 5, 2022

@trivialfis @hcho3 can this be merged now? Thanks!

@hcho3 merged commit 668b8a0 into dmlc:master on Oct 5, 2022
1.7 Roadmap automation moved this from Reviewer approved to Done on Oct 5, 2022
@gnaggnoyil commented:

I noticed that the 1.7.0 release note still indicates that users "can choose between rabit and federated". Did someone forget to change the wording in the release note, or am I misunderstanding something?

@hcho3 (Collaborator) commented on Nov 3, 2022

@gnaggnoyil You can choose between Rabit and Federated by passing {"xgboost_communicator": "[tracker type]"} to xgboost.collective.init().

As for xgboost.rabit, we got rid of that API in 1.7.0, but it broke some downstream projects, so we plan to issue a 1.7.1 patch release to restore xgboost.rabit.

@rongou deleted the switch-to-communicator branch on November 18, 2022
NikhilRaverkar pushed a commit to NikhilRaverkar/sagemaker-xgboost-container that referenced this pull request Mar 17, 2023
NikhilRaverkar added a commit to aws/sagemaker-xgboost-container that referenced this pull request Mar 17, 2023
…ning (#384)

Addressing NCCL issue with binary classification for distributed training.
dmlc/xgboost#7982 (comment) dmlc/xgboost#8257

Co-authored-by: Nikhil Raverkar <nraverka@amazon.com>