Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Test federated learning plugin in the CI #8325

Merged
merged 27 commits into from Oct 12, 2022
Merged

Conversation

hcho3
Copy link
Collaborator

@hcho3 hcho3 commented Oct 9, 2022

Close #8301
Close #8296

Notes:

  • We test federated learning only in the BuildKite CI pipelines. We don't test federated learning in GitHub Actions, as it only provide 2-core CPUs, so it will take a long time to build gRPC from the source. On the other hand, we have access to a beefy machine (16-core CPUs) on BuildKite.
  • gRPC is installed into Docker containers. We don't want to build gRPC from the source over and over again, as it consists of 2000 files (!). Build once and cache it.
  • We build both CPU and GPU binaries with federated learning support.
  • We do not enable federated learning for Windows platform or ARM64 arch.

CMakeLists.txt Outdated Show resolved Hide resolved
tests/cpp/plugin/test_federated_adapter.cu Show resolved Hide resolved
CMakeLists.txt Outdated Show resolved Hide resolved
@hcho3

This comment was marked as outdated.

@hcho3
Copy link
Collaborator Author

hcho3 commented Oct 9, 2022

Nevermind, FederatedCommunicatorTest.Allreduce is failing with

terminate called after throwing an instance of 'std::runtime_error'
what():  Allreduce RPC failed

I have no idea why this is happening.

@trivialfis
Copy link
Member

@rongou Hi, is this the last PR for federated learning planned for 1.7?

@trivialfis trivialfis added this to In progress in 1.7 Roadmap via automation Oct 10, 2022
@trivialfis
Copy link
Member

trivialfis commented Oct 10, 2022

@hcho3 I couldn't reproduce the error with a bare metal machine, maybe it's a docker-specific constraint on the network?

Some additional notes:

  • If we need to link protobuf statically, Protobuf_USE_STATIC_LIBS should be set. Also, it would be great to have readelf check there's no unexpected dynamic dependency. (ci_build/verify_link.sh)
  • We should change find_package(Protobuf CONFIG REQUIRED) to find_package(Protobuf REQUIRED). Since FindProtobuf is built into cmake, there's no need to find config file nor rollout our own find script.
  • We should replace the set target property call with xgboost_target_properties(federated_proto), which ensures the correct language version among other properties.

@rongou
Copy link
Contributor

rongou commented Oct 10, 2022

@trivialfis yes this should be the last PR for 1.7. #8316 is a nice to have.

As for find_package(Protobuf CONFIG REQUIRED), I think this is needed for the grpc plugin to work properly (see https://www.f-ax.de/dev/2020/11/08/grpc-plugin-cmake-support.html).

CMakeLists.txt Show resolved Hide resolved
CMakeLists.txt Show resolved Hide resolved
@hcho3
Copy link
Collaborator Author

hcho3 commented Oct 11, 2022

I fixed #8325 (comment) by installing gRPC in a Conda environment. My guess is that I accidentally overwrote some headers in the system when I installed gRPC globally.

plugin/federated/CMakeLists.txt Outdated Show resolved Hide resolved
plugin/federated/README.md Outdated Show resolved Hide resolved
tests/ci_build/Dockerfile.cpu Outdated Show resolved Hide resolved
@hcho3
Copy link
Collaborator Author

hcho3 commented Oct 11, 2022

@rongou Can you review the changes I made to gtest? I had to randomize the server port, because reusing the same port twice often resulted into a connection failure.

@hcho3
Copy link
Collaborator Author

hcho3 commented Oct 11, 2022

GTest is finally fixed 🎉

1.7 Roadmap automation moved this from In progress to Reviewer approved Oct 12, 2022
@hcho3 hcho3 merged commit 2faa744 into dmlc:master Oct 12, 2022
1.7 Roadmap automation moved this from Reviewer approved to Done Oct 12, 2022
@hcho3 hcho3 deleted the setup_grpc branch October 12, 2022 21:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

3 participants