Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing GPU and CPU tf head failures #3431

Merged
merged 1 commit into from Feb 28, 2022
Merged

Fixing GPU and CPU tf head failures #3431

merged 1 commit into from Feb 28, 2022

Conversation

Tixxx
Copy link
Collaborator

@Tixxx Tixxx commented Feb 28, 2022

Signed-off-by: TJ tix@uber.com

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

The tf head gpu and cpu CIs have been failing for a while. These are caused by 2 issues:

  1. CPU is failing because keras.layer removed .apply method, we can just use the call convention to pass inputs to the layer
  2. GPU is failing due to a recent change to kernel registration logic(which I haven't completely figured out). It looks like under some conditions, the GPU kernel is replaced by CPU kernel of the Join op which result in "fake" allreduces from the joined rank being allocated on the wrong device. This will cause a deadlock since other non-joined ranks are waiting indefinitely for the joined ranks to finish allreduce that's not happening on the same device. For Join op, I skip registering the cpu op when allreduce device is set to gpu.

Fixes # (issue).
#3422

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

Signed-off-by: TJ <tix@uber.com>
@github-actions
Copy link

Unit Test Results

     802 files  +  29       802 suites  +29   9h 40m 30s ⏱️ + 5m 16s
     722 tests ±    0       679 ✔️ +    4       43 💤  -     4  0 ±0 
17 373 runs  +680  12 381 ✔️ +516  4 992 💤 +164  0 ±0 

Results for commit 213c9f7. ± Comparison against base commit 642a6b3.

@github-actions
Copy link

Unit Test Results (with flaky tests)

     926 files  +     59       926 suites  +59   10h 15m 44s ⏱️ + 17m 36s
     722 tests ±       0       677 ✔️ +    3       43 💤  -     4  2 +1 
19 969 runs  +1 133  14 161 ✔️ +917  5 804 💤 +213  4 +3 

For more details on these failures, see this check.

Results for commit 213c9f7. ± Comparison against base commit 642a6b3.

Copy link
Collaborator

@EnricoMi EnricoMi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for fixing!

@Tixxx Tixxx merged commit 71e10b4 into master Feb 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

TF Nightly breaks unit tests test_horovod_syncbn_gpu() and test_horovod_syncbn_cpu()
3 participants