`TorchTests::test_delta_optimizer` causes deadlocks in later tests #3314

maxhgerlach · 2021-12-13T11:09:06Z

Environment:

Framework: PyTorch
Framework version: 1.10
Horovod version: master
MPI version: OpenMPI

Bug report:
With PyTorch 1.10 we occasionally observed deadlocks in `TorchTests::test_dynamic_requires_grad`. These went away after skipping `test_delta_optimizer` (which runs two tests earlier in alphabetical order), see #3291 (comment). Since that test only runs with MPI and on GPUs, only a few CI configurations can encounter the problem.

The skip is a workaround for a more fundamental problem that is not yet understood. That problem should be fixed, and `test_delta_optimizer` should then be re-enabled.
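For illustration, a skip of this kind could look roughly like the minimal sketch below, using Python's standard `unittest` module. The test bodies, the skip reason string, and the `run_suite` helper are placeholders assumed for this example; this is not the actual Horovod test code, and the real skip may be implemented differently.

```python
import unittest


class TorchTests(unittest.TestCase):
    """Placeholder test class; the real Horovod tests are not reproduced here."""

    @unittest.skip("workaround: may cause deadlocks in later tests, see #3314")
    def test_delta_optimizer(self):
        # Never executed while the skip decorator is in place.
        raise AssertionError("should not run")

    def test_dynamic_requires_grad(self):
        # Stand-in for the later test that occasionally deadlocked.
        self.assertTrue(True)


def run_suite():
    """Run the placeholder tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TorchTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_suite()` reports `test_delta_optimizer` as skipped rather than failed, so the remaining tests still run and the suite stays green while the underlying problem is investigated.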


Tixxx · 2021-12-20T22:40:29Z

This is an interesting issue. I think it might be related to how the in-place tensor computations are done in the function hooks in the delta optimizer. I can take a look if no one has started.

maxhgerlach added the bug label Dec 13, 2021

maxhgerlach changed the title ~~TorchTests::test_delta_optimizer is being skipped to prevent deadlocks~~ TorchTests::test_delta_optimizer causes deadlocks in later tests Dec 13, 2021

maxhgerlach mentioned this issue Dec 20, 2021

Support resurrecting blacklisted hosts #3319

Merged

3 tasks

Tixxx self-assigned this Dec 20, 2021

Tixxx mentioned this issue Jan 22, 2022

re-enable delta optimizer test and fix a bug in adasum communicator init logic #3379

Merged

4 tasks

Tixxx linked a pull request Jan 22, 2022 that will close this issue

re-enable delta optimizer test and fix a bug in adasum communicator init logic #3379

Merged

4 tasks

Tixxx closed this as completed in #3379 Jan 24, 2022
