Add pytorch 1.10.0 to test space, remove 1.6.0 #3291

Merged: 6 commits into master from branch-ci-move-to-pytorch-1.10 on Dec 11, 2021

Conversation

@EnricoMi (Collaborator) commented Nov 24, 2021

PyTorch 1.10.0 was released a month ago, 1.9.1 two months ago. Adding these versions to our test space, removing 1.6.0.

@EnricoMi EnricoMi changed the title from "Add pytorch 1.10.0 to test space" to "Add pytorch 1.10.0 to test space, remove 1.6.0" on Nov 24, 2021
@EnricoMi EnricoMi force-pushed the branch-ci-move-to-pytorch-1.10 branch from 1a3ad98 to 946081e on November 24, 2021 at 22:24
@github-actions (bot) commented Nov 25, 2021

Unit Test Results

758 files (-40), 758 suites (-40), 8h 5m 9s ⏱️ (-32m 34s)
716 tests (±0): 667 ✔️ (-5), 49 💤 (+5), 0 failed (±0)
16 322 runs (-850): 11 466 ✔️ (-656), 4 856 💤 (-194), 0 failed (±0)

Results for commit 39a03af. ± Comparison against base commit be3b72d.

This pull request skips 5 tests.
test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
test.parallel.test_mxnet2.MX2Tests ‑ test_gpu_required
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error
test.parallel.test_torch.TorchTests ‑ test_delta_optimizer

♻️ This comment has been updated with latest results.

@github-actions (bot) commented Nov 25, 2021

Unit Test Results (with flaky tests)

886 files (-88), 886 suites (-88), 9h 35m 59s ⏱️ (-3m 48s)
716 tests (±0): 664 ✔️ (-4), 49 💤 (+5), 3 failed (-1)
19 426 runs (-1 946): 13 478 ✔️ (-1 448), 5 942 💤 (-499), 6 failed (+1)

For more details on these failures, see this check.

Results for commit 39a03af. ± Comparison against base commit be3b72d.

This pull request skips 5 tests.
test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
test.parallel.test_mxnet2.MX2Tests ‑ test_gpu_required
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error
test.parallel.test_torch.TorchTests ‑ test_delta_optimizer

♻️ This comment has been updated with latest results.

@EnricoMi EnricoMi force-pushed the branch-ci-move-to-pytorch-1.10 branch 2 times, most recently from 950a801 to 952d17a on December 4, 2021 at 13:20
@EnricoMi (Collaborator, Author) commented Dec 4, 2021

@maxhgerlach much better, but now there is an error in an image that did not fail before.

@maxhgerlach (Collaborator) commented Dec 4, 2021

I see, in test-cpu-openmpi-py3_8-tf2_6_0-keras2_6_0-torch1_10_0-mxnet1_8_0_p0-pyspark3_2_0:

test_torch.py::TorchTests::test_horovod_allgather_duplicate_name_error 
test_torch.py::TorchTests::test_horovod_allgather_duplicate_name_error PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allgather_error 
124
test_torch.py::TorchTests::test_horovod_allgather_error 
Error: Process completed with exit code 124.

or

test_torch.py::TorchTests::test_horovod_allreduce_duplicate_name_error 
test_torch.py::TorchTests::test_horovod_allreduce_cpu_gpu_error SKIPPED
test_torch.py::TorchTests::test_horovod_allreduce_duplicate_name_error PASSED
test_torch.py::TorchTests::test_horovod_allreduce_error PASSED
124
test_torch.py::TorchTests::test_horovod_allreduce_error 
Error: Process completed with exit code 124.

or

test_torch.py::TorchTests::test_horovod_broadcast_duplicate_name_error PASSED
test_torch.py::TorchTests::test_horovod_broadcast_duplicate_name_error PASSED
test_torch.py::TorchTests::test_horovod_broadcast_error PASSED
124
test_torch.py::TorchTests::test_horovod_broadcast_error 
Error: Process completed with exit code 124.

Those could be related to my changes (which introduced barriers in the duplicate_name_error tests)... One would have to reproduce and investigate interactively.
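
For readers following along, here is a minimal sketch of the test pattern under discussion. It is hypothetical rather than the actual Horovod test code: the tensor shape, the caught exception types, and the trailing `hvd.barrier()` are assumptions based on this thread.

```python
# Hypothetical sketch of a *_duplicate_name_error test with a trailing
# barrier, as introduced by the changes referenced above; not the actual
# Horovod test code.
import torch
import horovod.torch as hvd


def test_allreduce_duplicate_name_error_sketch():
    hvd.init()
    if hvd.size() < 2:
        return  # the duplicate-name error needs at least two workers

    tensor = torch.zeros(17)

    # The first submission under an explicit name is fine on every rank.
    hvd.allreduce_async(tensor, name="duplicate_name")

    # Re-submitting the same name while the first op is still pending is
    # rejected by Horovod; retry a few times in case the first op already
    # finished on a fast rank.  The exact exception type depends on the
    # Horovod/PyTorch versions.
    try:
        for _ in range(10):
            hvd.allreduce_async(tensor, name="duplicate_name")
        assert False, "expected a duplicate-name error"
    except (ValueError, RuntimeError):
        pass

    # The synchronization added at the end of the test: keep any rank from
    # racing ahead into the next test while another rank is still on the
    # error path.
    hvd.barrier()
```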

@maxhgerlach (Collaborator) commented Dec 5, 2021

When testing locally, an extra synchronization point at the end of test_*_duplicate_name_error seems to have fixed this deadlock: 6593da5. Let's see how this plays out in the CI.

Edit: I had accidentally included some junk in an intermediate commit; it's fixed now.

@EnricoMi (Collaborator, Author) commented Dec 5, 2021

Excellent job, almost there! Now macOS is failing. I guess once that is fixed, Buildkite GPU tests are next to fail 😆.

@maxhgerlach (Collaborator) commented:

It's never as easy as one thinks, is it?

CI / Build and Test macOS (test-openmpi-cpu-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_5_0-macos) (pull_request) Failing after 47m:

2021-12-05T17:51:20.0416760Z [1,1]<stdout>:test_torch.py::TorchTests::test_horovod_broadcast [1,0]<stdout>:
2021-12-05T17:51:20.6840670Z [1,0]<stdout>:test_torch.py::TorchTests::test_horovod_broadcast [1,0]<stdout>:PASSED[1,1]<stdout>:PASSED[1,1]<stdout>:
2021-12-05T17:51:20.6842890Z [1,1]<stdout>:test_torch.py::TorchTests::test_horovod_broadcast_duplicate_name_error [1,0]<stdout>:
2021-12-05T17:51:21.0206380Z [1,0]<stdout>:test_torch.py::TorchTests::test_horovod_broadcast_duplicate_name_error [1,1]<stdout>:FAILED[1,1]<stdout>:
2021-12-05T17:52:49.0035070Z [1,0]<stderr>:[2021-12-05 17:52:49.  2999: W /Users/runner/work/horovod/horovod/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
2021-12-05T17:52:49.0036780Z [1,0]<stderr>:Missing ranks:
2021-12-05T17:52:49.0037560Z [1,0]<stderr>:0: [broadcast.duplicate_name, broadcast.noname.2714]
2021-12-05T17:53:49.0054560Z [1,0]<stderr>:[2021-12-05 17:53:49.  4917: W /Users/runner/work/horovod/horovod/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
2021-12-05T17:53:49.0056300Z [1,0]<stderr>:Missing ranks:
2021-12-05T17:53:49.0056940Z [1,0]<stderr>:0: [broadcast.duplicate_name, broadcast.noname.2714]
2021-12-05T17:53:49.0057620Z [1,0]<stderr>:1: [allreduce.synch]

This time all three builds failed in test_horovod_broadcast_duplicate_name_error specifically.

@maxhgerlach (Collaborator) commented:

@EnricoMi, the macOS tests on this branch still run with PyTorch 1.9 and 1.6, not yet with 1.10 -- is that intentional?

Quite curious that test_horovod_broadcast_duplicate_name_error failed three times in a row here, but not a single time on #3300 with the same build configuration. :-|

@maxhgerlach (Collaborator) commented:

Under the assumption that remaining undefined behavior with hvd.barrier might still trigger problems under certain conditions, I've replaced it at the remaining three synchronization points with hvd.allreduce as well for now. Hopefully that turns out to be stable.
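
As a rough sketch of that swap (illustrative, not the exact change; it assumes a small dummy allreduce is used purely as a rendezvous, matching the `allreduce.synch` name visible in the stalled-ranks log quoted earlier):

```python
# Illustrative only: replace a barrier-based synchronization point with a
# dummy allreduce that every rank must join before any rank continues.
import torch
import horovod.torch as hvd

hvd.init()

# Before: hvd.barrier() at the synchronization point, suspected of
# residual undefined behavior in this scenario.
# After: a throwaway allreduce serves the same purpose; every rank must
# contribute before any rank can move on.
hvd.allreduce(torch.zeros(1), name="allreduce.synch")
```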

@maxhgerlach (Collaborator) commented:

> I guess once that is fixed, Buildkite GPU tests are next to fail 😆.

Excellent predictive powers! https://buildkite.com/horovod/horovod/builds/6878#d726fe50-6ade-4bbf-8e68-6e1efc8e8268

Three timeouts in test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_10_0-mxnet1_8_0_p0-pyspark3_2_0, each time the script reached test_torch.py::TorchTests::test_dynamic_requires_grad.

@EnricoMi (Collaborator, Author) commented Dec 6, 2021

> the macOS tests on this branch still run with PyTorch 1.9 and 1.6, not yet with 1.10 -- is that intentional?

Nope, not intentional, that always slips my attention. I will move macOS to 1.10 once the current issues are fixed for 1.9. We will see what falls out of that next.

@maxhgerlach (Collaborator) commented:

I could reproduce the problem in a locally built container test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_10_0-mxnet1_8_0_p0-pyspark3_2_0. Often, but not on every trial, test_dynamic_requires_grad would hang in a deadlock, where one process was somehow still at hvd.init(), while the other was already executing an optimizer step().

Curiously, the problem seems to go away entirely when I skip test_delta_optimizer (the test before the previous test in alphabetical order). Let's see how this pans out on a different system. Conversely, extra synchronization barriers (like those I put into test_horovod_allreduce_duplicate_name_error and friends) did not help; somehow they would hang themselves.

Obviously, this is not more than a crude workaround for some more fundamental problem.

(test_delta_optimizer is only run with MPI and GPUs, so only a few CI configs can encounter this.)
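
The workaround might look roughly like the sketch below. It is hypothetical: the `unittest.skipIf` condition and message are assumptions pieced together from this thread rather than the exact change in the PR.

```python
# Hypothetical sketch of the workaround: skip test_delta_optimizer under
# PyTorch 1.10, where running it was observed to deadlock a later test in
# MPI + GPU configurations.  Not the exact change made in this PR.
import unittest

import torch


class TorchTests(unittest.TestCase):

    @unittest.skipIf(
        torch.__version__.startswith("1.10."),
        "running this test before test_dynamic_requires_grad was observed "
        "to deadlock with PyTorch 1.10 in MPI + GPU configurations",
    )
    def test_delta_optimizer(self):
        # ... original Adasum / delta-optimizer test body ...
        pass
```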

@maxhgerlach (Collaborator) commented Dec 8, 2021

@EnricoMi, all the configurations with PyTorch 1.10 ran fine twice now. Twice, because in the first go I had an intermediate bug in the test script that failed on two of the macOS configs with torch 1.9.

Of course, something remains fishy. Somebody still needs to understand why we need to skip test_delta_optimizer (related to Adasum) here to avoid deadlocks.

@EnricoMi (Collaborator, Author) commented Dec 9, 2021

@maxhgerlach I am rebasing this ...

EnricoMi and others added 3 commits December 9, 2021 10:33
…g 1.6.0

Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
@EnricoMi EnricoMi force-pushed the branch-ci-move-to-pytorch-1.10 branch from 82612b5 to cc96124 on December 9, 2021 at 09:35
@EnricoMi EnricoMi marked this pull request as ready for review December 9, 2021 09:36
@EnricoMi (Collaborator, Author) commented Dec 9, 2021

As promised I have also moved macOS tests to 1.10.0 and cleaned up other framework versions (f98a1c3), harmonizing with docker-compose.test.yml. I guess this will break macOS in many ways. If that is the case I will put this in a separate PR.

Signed-off-by: Enrico Minack <github@enrico.minack.dev>
@EnricoMi EnricoMi force-pushed the branch-ci-move-to-pytorch-1.10 branch from fcce163 to f98a1c3 on December 9, 2021 at 13:59
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
@EnricoMi EnricoMi force-pushed the branch-ci-move-to-pytorch-1.10 branch 3 times, most recently from d98791f to 7fdba8e on December 10, 2021 at 16:04
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
@EnricoMi EnricoMi force-pushed the branch-ci-move-to-pytorch-1.10 branch from 7fdba8e to 39a03af on December 10, 2021 at 16:46
@EnricoMi (Collaborator, Author) commented Dec 10, 2021

Interestingly, Horovod does not compile on macOS with mxnet 1.8.0, so I went for the latest mxnet versions before that. The macOS package of mxnet 1.8.0.post0 does not contain the mxnet/include/mkldnn/dnnl_config.h file.

@EnricoMi EnricoMi merged commit df18797 into master Dec 11, 2021
@EnricoMi EnricoMi deleted the branch-ci-move-to-pytorch-1.10 branch December 11, 2021 08:20
@EnricoMi (Collaborator, Author) commented:

@maxhgerlach looks like the macOS tests are flaky; sometimes they fail. Here it looks like TorchTests::test_horovod_join_allgather hangs for five minutes before the tests time out:

Sun, 12 Dec 2021 22:26:38 GMT [1,0]<stdout>:test_torch.py::TorchTests::test_horovod_join_allgather [1,1]<stdout>:SKIPPED (Gloo...)[1,1]<stdout>:
Sun, 12 Dec 2021 22:31:33 GMT [1,1]<stdout>:test_torch.py::TorchTests::test_horovod_join_allgather 
Sun, 12 Dec 2021 22:31:33 GMT Error: Process completed with exit code 124.

@maxhgerlach (Collaborator) commented:

Hmm, @Tixxx has posted PR #3313 that should fix aspects of the interplay between join and allgather. Maybe that one will help here?

@EnricoMi (Collaborator, Author) commented:

Thanks for the pointer; I'll test this in #3301, where I see this issue consistently.

tkhanna1996 pushed a commit to tkhanna1996/horovod that referenced this pull request Dec 16, 2021
* Adding PyTorch 1.10.0 to test space, upgrading to 1.9.1 while removing 1.6.0
* Skip test_delta_optimizer with PyTorch 1.10
* Harmonize macOS tests with docker-compose.test.yml
* Latest mxnet does not compile for macOS

Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Co-authored-by: Max H. Gerlach <git@maxgerlach.de>