TorchTests::test_delta_optimizer causes deadlocks in later tests #3314

Closed
maxhgerlach opened this issue Dec 13, 2021 · 1 comment · Fixed by #3379

@maxhgerlach
Collaborator

Environment:

  1. Framework: PyTorch
  2. Framework version: 1.10
  3. Horovod version: master
  4. MPI version: OpenMPI

Bug report:
With PyTorch 1.10, we occasionally observed deadlocks in TorchTests::test_dynamic_requires_grad. These went away once test_delta_optimizer was skipped (it runs two tests earlier in alphabetical order), see #3291 (comment). Since test_delta_optimizer only applies with MPI and when run on GPUs, only a few CI configurations can encounter the problem.

The skip is only a workaround for a more fundamental problem that is not yet understood. That problem should be fixed, and test_delta_optimizer should then be re-enabled.
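
For illustration, a temporary skip of this kind could look like the minimal sketch below. It uses only the standard `unittest` module. The class name `ExampleTorchTests`, the helpers `mpi_enabled` and `gpu_available`, and the `EXAMPLE_USE_MPI` / `EXAMPLE_USE_GPU` environment variables are hypothetical stand-ins, not Horovod's actual test helpers, and the real skip in the test suite may be implemented differently.

```python
import os
import unittest


# Hypothetical stand-ins for however the real suite detects MPI and GPUs.
def mpi_enabled() -> bool:
    return os.environ.get("EXAMPLE_USE_MPI", "0") == "1"


def gpu_available() -> bool:
    return os.environ.get("EXAMPLE_USE_GPU", "0") == "1"


SKIP_REASON = "temporarily skipped: can cause deadlocks in later tests (#3314)"


class ExampleTorchTests(unittest.TestCase):
    def test_delta_optimizer(self):
        # The test only applies with MPI and on GPUs.
        if not (mpi_enabled() and gpu_available()):
            self.skipTest("only applies with MPI on GPUs")
        # Workaround: skip unconditionally until the root cause is fixed.
        self.skipTest(SKIP_REASON)


def run_example() -> unittest.TestResult:
    # Run the example case programmatically and return the collected result.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTorchTests)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Keeping the issue number in the skip reason makes the workaround easy to find and remove once the underlying problem is resolved.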

@maxhgerlach maxhgerlach changed the title TorchTests::test_delta_optimizer is being skipped to prevent deadlocks TorchTests::test_delta_optimizer causes deadlocks in later tests Dec 13, 2021
@Tixxx
Collaborator

Tixxx commented Dec 20, 2021

This is an interesting issue. I think it might be related to how the in-place tensor computations are done in the function hooks in the delta optimizer. I can take a look if no one has started.
