Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ROCm] r1.15 rccl upstream patch #34532

Merged
merged 3 commits into from Jan 21, 2020

Conversation

jeffdaily
Copy link
Contributor

@jeffdaily jeffdaily commented Nov 22, 2019

This patch is a backport of current RCCL support in master for the r1.15 branch. RCCL support was not complete in the r1.15 branch, and since this is the last V1 release branch, it is important to have this feature here.

Further, without this PR, the r1.15 branch will not build for the latest ROCm release due to missing clang 10-based header files. See #31849 for the same change to master.

This patch is a backport of current RCCL support in master
for the r1.15 branch.
@tensorflow-bot tensorflow-bot bot added the size:L CL Change Size: Large label Nov 22, 2019
@jeffdaily jeffdaily changed the title [ROCm] R1.15 rccl upstream patch [ROCm] r1.15 rccl upstream patch Nov 22, 2019
@whchung whchung added the kokoro:force-run Tests on submitted change label Nov 22, 2019
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Nov 22, 2019
mihaimaruseac
mihaimaruseac previously approved these changes Nov 25, 2019
@jeffdaily
Copy link
Contributor Author

Thank you @mihaimaruseac for approving. I was concerned with the number of failing checks even though the build was successful for ROCm and non-ROCm paths on our test systems. I tried reproducing the sanity test failure locally but the indicated failures were unrelated to this PR. I hope you were able to otherwise test this satisfactorily.

@mihaimaruseac
Copy link
Collaborator

We have some issues with the tests on the old release branches. We're working on fixes.

@sunway513
Copy link
Contributor

Hi @mihaimaruseac , could you help update the ETA to have this PR merged to r1.15 release branch?

@mihaimaruseac
Copy link
Collaborator

I think first week of December?

@sunway513
Copy link
Contributor

@mihaimaruseac gentle ping, thanks!

@jeffdaily
Copy link
Contributor Author

@mihaimaruseac gentle ping, thanks.

@mihaimaruseac
Copy link
Collaborator

Apologies. I didn't yet get a chance to investigate why the CI fails on the release branches. It seems to be picking some configuration from master but didn't yet have time to dig more into this.

@jeffdaily
Copy link
Contributor Author

@mihaimaruseac gentle ping, thanks. This is also blocking the subsequent PR #34769.

@mihaimaruseac
Copy link
Collaborator

Apologies for the delay. I tried now to get #33981 merged so that the builds would run against the r1.15 branch instead of master. However, the builds use remote execution and at the moment our remote execution is only configured from HEAD.

I'll try over the holidays and bring @gunan in too and see what we can do.

@mihaimaruseac
Copy link
Collaborator

So I got the linux builds and one Windows one to pass on the 2.0 equivalent of #33981 (#33982). I'll try to get the others too but by start of the new year I will cherry-pick these changes to #33981 with as many presubmits passing as possible.

@gunan
Copy link
Contributor

gunan commented Dec 27, 2019

Actually, remote builds do not care about which branch we are running from.
All remote build parameters, docker containers and the compiler options are baked into the branch, under third_party/toolchains/preconfig folder.

@deven-amd
Copy link
Contributor

@mihaimaruseac @gunan, gentle ping

anything we can / need to do on our end to help out?

@mihaimaruseac mihaimaruseac added the kokoro:force-run Tests on submitted change label Jan 15, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Jan 15, 2020
@mihaimaruseac
Copy link
Collaborator

mihaimaruseac commented Jan 15, 2020

#33981 was merged less than an hour ago. This means we can attempt running presubmits on the branch.

Probably some will fail at this moment as VMs changed but I will run presubmits again later in the night, when the VMs for this branch can be reused.

@mihaimaruseac
Copy link
Collaborator

Windows Bazel GPU and Ubuntu Sanity failures are expected at this time, trying them again later.

@mihaimaruseac
Copy link
Collaborator

@jeffdaily @deven-amd can you please check the Linux GPU build? I'm still going to run it on the 1.15 VMs around 8 hours from now, just in case the failure is not related to the PR but to the VMs.

Once this is merged, I think next is #34769 and #35230. Both of them are gated on this one, right?

@mihaimaruseac mihaimaruseac added the kokoro:force-run Tests on submitted change label Jan 16, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Jan 16, 2020
@mihaimaruseac
Copy link
Collaborator

According to https://source.cloud.google.com/results/invocations/efa3a582-4b1e-472a-af44-1bad8131553e/targets/%2F%2Ftensorflow%2Fcore%2Fkernels:collective_nccl_test_gpu/log (the only failure on the proper VMs), we have some failures introduced by this PR (other PRs on the branch are completely green)

2020-01-16 06:03:24.050562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:0 with 1628 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)
2020-01-16 06:03:24.154879: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-01-16 06:03:24.154914: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
*** Received signal 6 ***
*** BEGIN MANGLED STACK TRACE ***
/b/f/w/bazel-out/k8-opt/bin/tensorflow/core/kernels/collective_nccl_test_gpu.runfiles/org_tensorflow/tensorflow/core/kernels/../../../_solib_local/_U_S_Stensorflow_Score_Skernels_Ccollective_Unccl_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.1(+0x10462bd)[0x7f0ca2e412bd]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f0ca1bef390]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7f0ca0b9c428]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f0ca0b9e02a]
/b/f/w/bazel-out/k8-opt/bin/tensorflow/core/kernels/collective_nccl_test_gpu.runfiles/org_tensorflow/tensorflow/core/kernels/../../../_solib_local/_U_S_Stensorflow_Score_Skernels_Ccollective_Unccl_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.1(+0x15df094)[0x7f0ca33da094]
/b/f/w/bazel-out/k8-opt/bin/tensorflow/core/kernels/collective_nccl_test_gpu.runfiles/org_tensorflow/tensorflow/core/kernels/../../../_solib_local/_U_S_Stensorflow_Score_Skernels_Ccollective_Unccl_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.1(_ZN10tensorflow8EventMgr10PollEventsEbPN4absl13InlinedVectorINS0_5InUseELm4ESaIS3_EEE+0x207)[0x7f0ca2e1ef27]
/b/f/w/bazel-out/k8-opt/bin/tensorflow/core/kernels/collective_nccl_test_gpu.runfiles/org_tensorflow/tensorflow/core/kernels/../../../_solib_local/_U_S_Stensorflow_Score_Skernels_Ccollective_Unccl_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.1(_ZN10tensorflow8EventMgr8PollLoopEv+0x9f)[0x7f0ca2e1f7bf]
/b/f/w/bazel-out/k8-opt/bin/tensorflow/core/kernels/collective_nccl_test_gpu.runfiles/org_tensorflow/tensorflow/core/kernels/../../../_solib_local/_U_S_Stensorflow_Score_Skernels_Ccollective_Unccl_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7f0ca2e26e81]
/b/f/w/bazel-out/k8-opt/bin/tensorflow/core/kernels/collective_nccl_test_gpu.runfiles/org_tensorflow/tensorflow/core/kernels/../../../_solib_local/_U_S_Stensorflow_Score_Skernels_Ccollective_Unccl_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f0ca2e24578]
/b/f/w/bazel-out/k8-opt/bin/tensorflow/core/kernels/collective_nccl_test_gpu.runfiles/org_tensorflow/tensorflow/core/kernels/../../../_solib_local/_U_S_Stensorflow_Score_Skernels_Ccollective_Unccl_Utest_Ugpu___Utensorflow/libtensorflow_framework.so.1(+0x1685a8f)[0x7f0ca3480a8f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f0ca1be56ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f0ca0c6e41d]
*** END MANGLED STACK TRACE ***

*** Begin stack trace ***
	tensorflow::CurrentStackTrace()


	gsignal
	abort

	tensorflow::EventMgr::PollEvents(bool, absl::InlinedVector<tensorflow::EventMgr::InUse, 4ul, std::allocator<tensorflow::EventMgr::InUse> >*)
	tensorflow::EventMgr::PollLoop()
	Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int)
	std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&)


	clone
*** End stack trace ***
external/bazel_tools/tools/test/test-setup.sh: line 310:    16 Aborted                 (core dumped) "${TEST_PATH}" "$@" 2>&1

@jeffdaily
Copy link
Contributor Author

@mihaimaruseac thank you for bringing this test to my attention. I am trying to reproduce now on our platform.

@googlebot
Copy link

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@googlebot
Copy link

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@jeffdaily
Copy link
Contributor Author

@mihaimaruseac I was able to reproduce the failing test more than once. After this change, c954406 , I was no longer able to reproduce. Let's watch CI now. Thanks.

@mihaimaruseac mihaimaruseac added the kokoro:force-run Tests on submitted change label Jan 21, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Jan 21, 2020
@mihaimaruseac mihaimaruseac added the kokoro:force-run Tests on submitted change label Jan 21, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Jan 21, 2020
@mihaimaruseac mihaimaruseac added the kokoro:force-run Tests on submitted change label Jan 21, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Jan 21, 2020
@mihaimaruseac mihaimaruseac added the kokoro:force-run Tests on submitted change label Jan 21, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Jan 21, 2020
@mihaimaruseac mihaimaruseac merged commit 360b2e3 into tensorflow:r1.15 Jan 21, 2020
@mihaimaruseac
Copy link
Collaborator

Thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes size:L CL Change Size: Large
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants