
Initial support for federated learning #7831

Merged: 28 commits into dmlc:master from the federated branch, May 5, 2022

Conversation

@rongou (Contributor) commented Apr 21, 2022

Federated learning plugin for xgboost:

  • A gRPC server to aggregate MPI-style requests (allgather, allreduce, broadcast) from federated workers (see the sketch below).
  • A Rabit engine for the federated environment.
  • An integration test to simulate federated learning.

Additional followups are needed to address GPU support, better security and privacy, etc.

Part of #7778
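
A minimal sketch of the allreduce round-trip mentioned in the first bullet (hypothetical C++, not the plugin's actual code): the server collects one buffer per worker, reduces element-wise, and hands the result back to every worker.

#include <cstddef>
#include <vector>

// Sum-reduce one buffer per federated worker into a single result that the
// server would then send back to all workers. Types and names are illustrative.
std::vector<double> AllreduceSum(
    std::vector<std::vector<double>> const& worker_buffers) {
  std::vector<double> result(worker_buffers.front().size(), 0.0);
  for (auto const& buf : worker_buffers) {
    for (std::size_t i = 0; i < buf.size(); ++i) {
      result[i] += buf[i];  // element-wise sum reduction
    }
  }
  return result;  // broadcast back to every worker
}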

@rongou (Contributor Author) commented Apr 21, 2022

Here is the output from running the integration test:

(venv) rou@rou:~/src/xgboost/tests/distributed$ ./runtests-federated.sh
[15:30:05] Connecting to federated server localhost:9091, world size 3, rank 0
[15:30:05] Connecting to federated server localhost:9091, world size 3, rank 1
[15:30:05] XGBoost federated mode detected, not splitting data among workers
[15:30:05] XGBoost federated mode detected, not splitting data among workers
[15:30:05] Connecting to federated server localhost:9091, world size 3, rank 2
[15:30:05] XGBoost federated mode detected, not splitting data among workers
[15:30:05] XGBoost federated mode detected, not splitting data among workers
[15:30:05] XGBoost federated mode detected, not splitting data among workers
[15:30:05] XGBoost federated mode detected, not splitting data among workers
[15:30:06] [0]	eval-logloss:0.22669	train-logloss:0.23338
[15:30:07] [1]	eval-logloss:0.13787	train-logloss:0.13666
[15:30:07] [2]	eval-logloss:0.08046	train-logloss:0.08253
[15:30:08] [3]	eval-logloss:0.05833	train-logloss:0.05647
[15:30:08] [4]	eval-logloss:0.03829	train-logloss:0.04151
[15:30:09] [5]	eval-logloss:0.02663	train-logloss:0.02961
[15:30:09] [6]	eval-logloss:0.01388	train-logloss:0.01919
[15:30:10] [7]	eval-logloss:0.01020	train-logloss:0.01332
[15:30:10] [8]	eval-logloss:0.00848	train-logloss:0.01113
[15:30:11] [9]	eval-logloss:0.00692	train-logloss:0.00663
[15:30:11] [10]	eval-logloss:0.00544	train-logloss:0.00504
[15:30:12] [11]	eval-logloss:0.00445	train-logloss:0.00420
[15:30:12] [12]	eval-logloss:0.00336	train-logloss:0.00356
[15:30:13] [13]	eval-logloss:0.00277	train-logloss:0.00281
[15:30:13] [14]	eval-logloss:0.00252	train-logloss:0.00244
[15:30:14] [15]	eval-logloss:0.00177	train-logloss:0.00194
[15:30:15] [16]	eval-logloss:0.00157	train-logloss:0.00161
[15:30:15] [17]	eval-logloss:0.00135	train-logloss:0.00142
[15:30:16] [18]	eval-logloss:0.00123	train-logloss:0.00125
[15:30:16] [19]	eval-logloss:0.00107	train-logloss:0.00107
[15:30:16] Finished training

@RAMitchell (Member) left a comment

Implementing this as a plug-in works well for now. We don't want dependencies on protobuf etc. in main xgboost.

I assume gRPC is a placeholder here and we want something encrypted.

I'm hoping this work also eventually leads to refactoring, improvements, and a better understanding of the underlying rabit code.

Looks good as a first attempt. Are you wanting us to merge this and do the next steps in stages, or is this just for feedback? I would probably not want to merge it, because I don't want to suggest to users that there is a functional federated learning plug-in yet.


void Accumulate(std::string& buffer, std::string const& input, DataType data_type,
                ReduceOperation reduce_operation) const {
  switch (data_type) {
Member:

Note to self: If we update xgboost to c++17 we can do this type of switch with variant/visit in 3 lines.
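
A minimal C++17 sketch of that variant/visit idea (hypothetical, not the code in this PR): wrap the buffers in a typed view once, then let one generic lambda replace the per-DataType switch.

#include <cstddef>
#include <cstdint>
#include <variant>

// Hypothetical typed view over the destination and source buffers.
template <typename T>
struct View {
  T* dest;
  T const* src;
  std::size_t n;
};

using TypedView = std::variant<View<float>, View<double>,
                               View<std::int32_t>, View<std::int64_t>>;

// One generic lambda handles every data type; only constructing the
// TypedView from a runtime DataType would still need a switch.
void AccumulateSum(TypedView view) {
  std::visit([](auto v) {
    for (std::size_t i = 0; i < v.n; ++i) v.dest[i] += v.src[i];
  }, view);
}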

}

int const world_size_;
AllgatherHandler allgather_handler_;
Member:

These handlers don't really have members, can they just be functions?

Author:

Changed them to functors.
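
For readers following along, here is a sketch of the functor style referred to above (hypothetical; the PR's actual handlers differ): a stateless callable object that can still be stored as a member such as allgather_handler_ and grow state later if needed.

#include <string>
#include <vector>

// Hypothetical allgather functor: concatenates every worker's buffer.
struct AllgatherHandler {
  std::string operator()(std::vector<std::string> const& buffers) const {
    std::string out;
    for (auto const& b : buffers) out += b;
    return out;
  }
};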

@@ -198,11 +198,15 @@ XGB_DLL int XGDMatrixCreateFromFile(const char *fname,
DMatrixHandle *out) {
API_BEGIN();
bool load_row_split = false;
#if defined(XGBOOST_USE_FEDERATED)
Member:

So each worker needs to call the c_api with manually specified file locations?

Author:

Yeah, in a federated environment, presumably all the local data on each federated worker is used for training, so it doesn't make sense to split it further.

@trivialfis (Member) left a comment

Thank you for the exciting feature!

Out of curiosity, is it preferred to launch a CLI application instead of exposing a C function (along with a Python API) that lets users launch it from within their own program?

@trivialfis trivialfis added this to 2.0 In Progress in 2.0 Roadmap via automation Apr 22, 2022
@rongou (Contributor Author) left a comment

@RAMitchell I added SSL/TLS encryption (server and clients are mutually authenticated). I'm hoping we can merge this as a bare-bones implementation of federated learning, and improve on it with followup PRs.
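
As a rough illustration of mutually authenticated SSL/TLS in a gRPC C++ server (a sketch under assumptions: the ReadFile helper and the PEM file names are hypothetical, and the PR may wire this differently):

#include <grpcpp/security/server_credentials.h>

#include <fstream>
#include <memory>
#include <sstream>
#include <string>

// Hypothetical helper: slurp a PEM file into a string.
std::string ReadFile(std::string const& path) {
  std::ifstream in(path);
  std::ostringstream ss;
  ss << in.rdbuf();
  return ss.str();
}

std::shared_ptr<grpc::ServerCredentials> MakeMutualTlsCredentials() {
  // Require and verify a client certificate, so the server and the
  // federated workers authenticate each other.
  grpc::SslServerCredentialsOptions options(
      GRPC_SSL_REQUEST_AND_REQUIRE_CLIENT_CERTIFICATE_AND_VERIFY);
  options.pem_root_certs = ReadFile("client-cert.pem");  // trust anchor for workers
  options.pem_key_cert_pairs.push_back(
      {ReadFile("server-key.pem"), ReadFile("server-cert.pem")});
  return grpc::SslServerCredentials(options);
}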

@rongou (Contributor Author) commented Apr 25, 2022

@trivialfis I did the CLI because it was easier. :) We can certainly add a C API/Python wrapper if needed. Perhaps as a followup?

@trivialfis (Member) left a comment

> @trivialfis I did the CLI because it was easier. :) We can certainly add a C API/Python wrapper if needed. Perhaps as a followup?

I think it would be better to avoid adding an executable by replacing the main function with a C API and integrating it into libxgboost.so. But it's fine if you want to do that as a followup.

Could you please enable the tests on GitHub Actions? The rest looks fine to me as a bare-bones implementation.
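
The C API suggestion amounts to something like the following stub (the function name XGBRunFederatedServer and its parameters are illustrative only, not necessarily what was merged):

#include <cstdio>

// Hypothetical C API entry point compiled into libxgboost.so in place of a
// standalone executable; a real implementation would start the gRPC service
// here and block until shutdown.
extern "C" int XGBRunFederatedServer(int port, int world_size) {
  std::printf("federated server listening on port %d for %d workers\n",
              port, world_size);
  return 0;  // 0 on success, matching the xgboost C API convention
}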

@RAMitchell (Member) commented

I'm okay to merge the prototype. Any ideas on how to solve the quantile issue?

@rongou (Contributor Author) commented Apr 29, 2022

@trivialfis I added some unit tests along with the integration test, but the federated learning plugin is disabled by default, so they are not run by the CI; I need to send a followup PR to tweak the CI pipelines to add them. I also added the C API and the Python wrapper as you suggested.

@RAMitchell what do you mean by the quantile issue? For now the quantiles are still constructed globally using allreduce. We need to do some followup work to enhance the privacy.

Resolved review threads (outdated): plugin/federated/CMakeLists.txt, src/c_api/c_api.cc
@RAMitchell (Member) commented

> @RAMitchell what do you mean by the quantile issue? For now the quantiles are still constructed globally using allreduce. We need to do some followup work to enhance the privacy.

I think we need a plan for how to solve distributed quantiles while preserving privacy. It's hard for me to see how this can be possible with any reasonable guarantees. For example, in small datasets or datasets with few unique values, the quantiles could capture all of the data, so even sharing the final quantiles among workers would represent a significant leakage.

@rongou (Contributor Author) commented May 3, 2022

As I mentioned in the RFC, this first iteration is really about putting the basic framework in place so that federated learning can be done in a somewhat high trust, "enterprise" environment. We can then incrementally add more security and privacy features to widen the use case.

For the quantile leakage issue, one possibility is to have each party compute a histogram whose bin count depends on the size of its local data, then fuse the histograms at the server, something like https://arxiv.org/abs/2012.06670 (sketched below). This doesn't rely on homomorphic encryption or differential privacy, but of course there are other approaches we can also consider.
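
A rough sketch of that fusion idea (illustrative code in the spirit of the cited paper, not this PR's implementation): each party bins its local feature values with its own bin count, and the server rebins every histogram onto a common grid before summing.

#include <cstddef>
#include <vector>

// Rebin one party's histogram (counts over [lo, hi)) onto a global grid with
// global_bins bins over the same range. Mass goes to the global bin holding
// the source bin's midpoint; a real implementation would split mass
// proportionally across overlapping bins.
std::vector<double> Rebin(std::vector<double> const& counts, double lo,
                          double hi, std::size_t global_bins) {
  std::vector<double> out(global_bins, 0.0);
  double const src_width = (hi - lo) / counts.size();
  double const dst_width = (hi - lo) / global_bins;
  for (std::size_t i = 0; i < counts.size(); ++i) {
    double const mid = lo + (static_cast<double>(i) + 0.5) * src_width;
    auto j = static_cast<std::size_t>((mid - lo) / dst_width);
    if (j >= global_bins) j = global_bins - 1;  // clamp the top edge
    out[j] += counts[i];
  }
  return out;
}

// Server side: fuse by summing the rebinned per-party histograms.
std::vector<double> Fuse(std::vector<std::vector<double>> const& rebinned) {
  std::vector<double> total(rebinned.front().size(), 0.0);
  for (auto const& h : rebinned)
    for (std::size_t j = 0; j < h.size(); ++j) total[j] += h[j];
  return total;
}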

@rongou rongou requested a review from RAMitchell May 4, 2022 21:25
@trivialfis trivialfis merged commit 14ef38b into dmlc:master May 5, 2022
2.0 Roadmap automation moved this from 2.0 In Progress to 2.0 Done May 5, 2022
@trivialfis trivialfis removed this from 2.0 Done in 2.0 Roadmap Sep 28, 2022
@trivialfis trivialfis added this to In progress in 1.7 Roadmap via automation Sep 28, 2022
@trivialfis trivialfis moved this from In progress to Done in 1.7 Roadmap Sep 28, 2022
@rongou rongou deleted the federated branch November 18, 2022 19:01