[ROCm] r1.15 rccl upstream patch #34532
This patch is a backport of current RCCL support in master for the r1.15 branch.
Thank you @mihaimaruseac for approving. I was concerned about the number of failing checks, even though the build was successful for both the ROCm and non-ROCm paths on our test systems. I tried reproducing the sanity test failure locally, but the indicated failures were unrelated to this PR. I hope you were otherwise able to test this satisfactorily.
We have some issues with the tests on the old release branches. We're working on fixes.
Hi @mihaimaruseac, could you give an updated ETA for merging this PR into the r1.15 release branch?
I think first week of December?
@mihaimaruseac gentle ping, thanks!
@mihaimaruseac gentle ping, thanks.
Apologies. I haven't yet had a chance to investigate why the CI fails on the release branches. It seems to be picking up some configuration from master, but I haven't had time to dig into this further.
@mihaimaruseac gentle ping, thanks. This is also blocking the subsequent PR #34769.
Apologies for the delay. I tried now to get #33981 merged so that the builds would run against the I'll try over the holidays and bring @gunan in too and see what we can do.
Actually, remote builds do not care about which branch we are running from.
@mihaimaruseac @gunan, gentle ping. Is there anything we can or need to do on our end to help out?
#33981 was merged less than an hour ago. This means we can attempt running presubmits on the branch. Some will probably fail at this moment because the VMs changed, but I will run presubmits again later tonight, when the VMs for this branch can be reused.
Windows Bazel GPU and Ubuntu Sanity failures are expected at this time; I will try them again later.
@jeffdaily @deven-amd can you please check the Linux GPU build? I'm still going to run it on the 1.15 VMs around 8 hours from now, just in case the failure is not related to the PR but to the VMs. Once this is merged, I think next is #34769 and #35230. Both of them are gated on this one, right?
According to https://source.cloud.google.com/results/invocations/efa3a582-4b1e-472a-af44-1bad8131553e/targets/%2F%2Ftensorflow%2Fcore%2Fkernels:collective_nccl_test_gpu/log (the only failure on the proper VMs), we have some failures introduced by this PR (other PRs on the branch are completely green).
@mihaimaruseac thank you for bringing this test to my attention. I am trying to reproduce now on our platform.
All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the ℹ️ Googlers: Go here for more info.
Force-pushed from 0f9dc72 to c954406.
CLAs look good, thanks! ℹ️ Googlers: Go here for more info.
@mihaimaruseac I was able to reproduce the failing test more than once. After this change, c954406, I was no longer able to reproduce it. Let's watch CI now. Thanks.
Thank you.
This patch is a backport of current RCCL support in master for the r1.15 branch. RCCL support was not complete in the r1.15 branch, and since this is the last V1 release branch, it is important to have this feature here.
Further, without this PR, the r1.15 branch will not build for the latest ROCm release due to missing clang 10-based header files. See #31849 for the same change to master.