Add pytorch 1.10.0 to test space, remove 1.6.0 #3291

Merged: 6 commits into master from branch-ci-move-to-pytorch-1.10 on Dec 11, 2021

Conversation

@EnricoMi (Collaborator) commented Nov 24, 2021

PyTorch 1.10.0 was released a month ago, 1.9.1 two months ago. Adding these versions to our test space, removing 1.6.0.

@EnricoMi EnricoMi changed the title from "Add pytorch 1.10.0 to test space" to "Add pytorch 1.10.0 to test space, remove 1.6.0" on Nov 24, 2021
@EnricoMi EnricoMi force-pushed the branch-ci-move-to-pytorch-1.10 branch from 1a3ad98 to 946081e on November 24, 2021 at 22:24
@github-actions (bot) commented Nov 25, 2021

Unit Test Results

758 files (-40), 758 suites (-40), 8h 5m 9s ⏱️ (-32m 34s)
716 tests (±0): 667 ✔️ (-5), 49 💤 (+5), 0 failed (±0)
16 322 runs (-850): 11 466 ✔️ (-656), 4 856 💤 (-194), 0 failed (±0)

Results for commit 39a03af. ± Comparison against base commit be3b72d.

This pull request skips 5 tests.
test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
test.parallel.test_mxnet2.MX2Tests ‑ test_gpu_required
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error
test.parallel.test_torch.TorchTests ‑ test_delta_optimizer

♻️ This comment has been updated with latest results.

@github-actions (bot) commented Nov 25, 2021

Unit Test Results (with flaky tests)

886 files (-88), 886 suites (-88), 9h 35m 59s ⏱️ (-3m 48s)
716 tests (±0): 664 ✔️ (-4), 49 💤 (+5), 3 failed (-1)
19 426 runs (-1 946): 13 478 ✔️ (-1 448), 5 942 💤 (-499), 6 failed (+1)

For more details on these failures, see this check.

Results for commit 39a03af. ± Comparison against base commit be3b72d.

This pull request skips 5 tests.
test.parallel.test_mxnet2.MX2Tests ‑ test_gluon_trainer
test.parallel.test_mxnet2.MX2Tests ‑ test_gpu_required
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet2.MX2Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error
test.parallel.test_torch.TorchTests ‑ test_delta_optimizer

♻️ This comment has been updated with latest results.

@EnricoMi EnricoMi force-pushed the branch-ci-move-to-pytorch-1.10 branch 2 times, most recently from 950a801 to 952d17a on December 4, 2021 at 13:20
@EnricoMi (Collaborator, Author) commented Dec 4, 2021

@maxhgerlach much better, but now there is an error in an image that did not fail before.

@maxhgerlach (Collaborator) commented Dec 4, 2021

I see, in test-cpu-openmpi-py3_8-tf2_6_0-keras2_6_0-torch1_10_0-mxnet1_8_0_p0-pyspark3_2_0:

test_torch.py::TorchTests::test_horovod_allgather_duplicate_name_error 
test_torch.py::TorchTests::test_horovod_allgather_duplicate_name_error PASSEDPASSED
test_torch.py::TorchTests::test_horovod_allgather_error 
124
test_torch.py::TorchTests::test_horovod_allgather_error 
Error: Process completed with exit code 124.

or

test_torch.py::TorchTests::test_horovod_allreduce_duplicate_name_error 
test_torch.py::TorchTests::test_horovod_allreduce_cpu_gpu_error SKIPPED
test_torch.py::TorchTests::test_horovod_allreduce_duplicate_name_error PASSED
test_torch.py::TorchTests::test_horovod_allreduce_error PASSED
124
test_torch.py::TorchTests::test_horovod_allreduce_error 
Error: Process completed with exit code 124.

or

test_torch.py::TorchTests::test_horovod_broadcast_duplicate_name_error PASSED
test_torch.py::TorchTests::test_horovod_broadcast_duplicate_name_error PASSED
test_torch.py::TorchTests::test_horovod_broadcast_error PASSED
124
test_torch.py::TorchTests::test_horovod_broadcast_error 
Error: Process completed with exit code 124.

Those could be related to my changes (which introduced barriers in the duplicate_name_error tests)... One would have to reproduce and investigate interactively.
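
For readers following along, here is a minimal sketch of the test pattern under discussion. It is hypothetical rather than the actual Horovod test code: the tensor shape, the caught exception types, and the trailing `hvd.barrier()` are assumptions based on this thread.

```python
# Hypothetical sketch of a *_duplicate_name_error test with a trailing
# barrier, as introduced by the changes referenced above; not the actual
# Horovod test code.
import torch
import horovod.torch as hvd


def test_allreduce_duplicate_name_error_sketch():
    hvd.init()
    if hvd.size() < 2:
        return  # the duplicate-name error needs at least two workers

    tensor = torch.zeros(17)

    # The first submission under an explicit name is fine on every rank.
    hvd.allreduce_async(tensor, name="duplicate_name")

    # Re-submitting the same name while the first op is still pending is
    # rejected by Horovod; retry a few times in case the first op already
    # finished on a fast rank.  The exact exception type depends on the
    # Horovod/PyTorch versions.
    try:
        for _ in range(10):
            hvd.allreduce_async(tensor, name="duplicate_name")
        assert False, "expected a duplicate-name error"
    except (ValueError, RuntimeError):
        pass

    # The synchronization added at the end of the test: keep any rank from
    # racing ahead into the next test while another rank is still on the
    # error path.
    hvd.barrier()
```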

@maxhgerlach (Collaborator) commented Dec 5, 2021

When testing locally, an extra synchronization point at the end of test_*_duplicate_name_error seems to have fixed this deadlock: 6593da5. Let's see how this plays out in the CI.

Edit: I had accidentally included some junk in an intermediate commit; it's fixed now.

@EnricoMi (Collaborator, Author) commented Dec 5, 2021

Excellent job, almost there! Now macOS is failing. I guess once that is fixed, Buildkite GPU tests are next to fail 😆.

@maxhgerlach (Collaborator) commented:

It's never as easy as one thinks, is it?

CI / Build and Test macOS (test-openmpi-cpu-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_5_0-macos) (pull_request) Failing after 47m:

2021-12-05T17:51:20.0416760Z [1,1]<stdout>:test_torch.py::TorchTests::test_horovod_broadcast [1,0]<stdout>:
2021-12-05T17:51:20.6840670Z [1,0]<stdout>:test_torch.py::TorchTests::test_horovod_broadcast [1,0]<stdout>:PASSED[1,1]<stdout>:PASSED[1,1]<stdout>:
2021-12-05T17:51:20.6842890Z [1,1]<stdout>:test_torch.py::TorchTests::test_horovod_broadcast_duplicate_name_error [1,0]<stdout>:
2021-12-05T17:51:21.0206380Z [1,0]<stdout>:test_torch.py::TorchTests::test_horovod_broadcast_duplicate_name_error [1,1]<stdout>:FAILED[1,1]<stdout>:
2021-12-05T17:52:49.0035070Z [1,0]<stderr>:[2021-12-05 17:52:49.  2999: W /Users/runner/work/horovod/horovod/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
2021-12-05T17:52:49.0036780Z [1,0]<stderr>:Missing ranks:
2021-12-05T17:52:49.0037560Z [1,0]<stderr>:0: [broadcast.duplicate_name, broadcast.noname.2714]
2021-12-05T17:53:49.0054560Z [1,0]<stderr>:[2021-12-05 17:53:49.  4917: W /Users/runner/work/horovod/horovod/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
2021-12-05T17:53:49.0056300Z [1,0]<stderr>:Missing ranks:
2021-12-05T17:53:49.0056940Z [1,0]<stderr>:0: [broadcast.duplicate_name, broadcast.noname.2714]
2021-12-05T17:53:49.0057620Z [1,0]<stderr>:1: [allreduce.synch]

This time all three builds failed in test_horovod_broadcast_duplicate_name_error specifically.

@maxhgerlach (Collaborator) commented:

@EnricoMi, the macOS tests on this branch still run with PyTorch 1.9 and 1.6, not yet with 1.10 -- is that intentional?

Quite curious that test_horovod_broadcast_duplicate_name_error failed three times in a row here, but not a single time on #3300 with the same build configuration. :-|

@maxhgerlach (Collaborator) commented:

Under the assumption that remaining undefined behavior with hvd.barrier might still trigger problems under certain conditions, I've replaced it at the remaining three synchronization points with hvd.allreduce as well for now. Hopefully that turns out to be stable.
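
As a rough sketch of that swap (illustrative, not the exact change; it assumes a small dummy allreduce is used purely as a rendezvous, matching the `allreduce.synch` name visible in the stalled-ranks log quoted earlier):

```python
# Illustrative only: replace a barrier-based synchronization point with a
# dummy allreduce that every rank must join before any rank continues.
import torch
import horovod.torch as hvd

hvd.init()

# Before: hvd.barrier() at the synchronization point, suspected of
# residual undefined behavior in this scenario.
# After: a throwaway allreduce serves the same purpose; every rank must
# contribute before any rank can move on.
hvd.allreduce(torch.zeros(1), name="allreduce.synch")
```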

@maxhgerlach (Collaborator) commented:

> I guess once that is fixed, Buildkite GPU tests are next to fail 😆.

Excellent predictive powers! https://buildkite.com/horovod/horovod/builds/6878#d726fe50-6ade-4bbf-8e68-6e1efc8e8268

Three timeouts in test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_10_0-mxnet1_8_0_p0-pyspark3_2_0, each time the script reached test_torch.py::TorchTests::test_dynamic_requires_grad.

@EnricoMi (Collaborator, Author) commented Dec 6, 2021

> the macOS tests on this branch still run with PyTorch 1.9 and 1.6, not yet with 1.10 -- is that intentional?

Nope, not intentional, that always slips my attention. I will move macOS to 1.10 once the current issues are fixed for 1.9. We will see what falls out of that next.

@maxhgerlach (Collaborator) commented:

I could reproduce the problem in a locally built container test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_10_0-mxnet1_8_0_p0-pyspark3_2_0. Often, but not on every trial, test_dynamic_requires_grad would hang in a deadlock, where one process was somehow still at hvd.init(), while the other was already executing an optimizer step().

Curiously, the problem seems to go away entirely when I skip test_delta_optimizer (the test before the previous test in alphabetical order). Let's see how this pans out on a different system. Conversely, extra synchronization barriers (like those I put into test_horovod_allreduce_duplicate_name_error and friends) did not help; somehow they would hang themselves.

Obviously, this is not more than a crude workaround for some more fundamental problem.

(test_delta_optimizer is only run with MPI and GPUs, so only a few CI configs can encounter this.)
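
The workaround might look roughly like the sketch below. It is hypothetical: the `unittest.skipIf` condition and message are assumptions pieced together from this thread rather than the exact change in the PR.

```python
# Hypothetical sketch of the workaround: skip test_delta_optimizer under
# PyTorch 1.10, where running it was observed to deadlock a later test in
# MPI + GPU configurations.  Not the exact change made in this PR.
import unittest

import torch


class TorchTests(unittest.TestCase):

    @unittest.skipIf(
        torch.__version__.startswith("1.10."),
        "running this test before test_dynamic_requires_grad was observed "
        "to deadlock with PyTorch 1.10 in MPI + GPU configurations",
    )
    def test_delta_optimizer(self):
        # ... original Adasum / delta-optimizer test body ...
        pass
```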

@maxhgerlach (Collaborator) commented Dec 8, 2021

@EnricoMi, all the configurations with PyTorch 1.10 ran fine twice now. Twice, because in the first go I had an intermediate bug in the test script that failed on two of the macOS configs with torch 1.9.

Of course, something remains fishy. Somebody still needs to understand why we need to skip test_delta_optimizer (related to Adasum) here to avoid deadlocks.

@EnricoMi (Collaborator, Author) commented Dec 9, 2021

@maxhgerlach I am rebasing this ...

EnricoMi and others added 3 commits December 9, 2021 10:33
…g 1.6.0

Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
@EnricoMi EnricoMi force-pushed the branch-ci-move-to-pytorch-1.10 branch from 82612b5 to cc96124 on December 9, 2021 at 09:35
@EnricoMi EnricoMi marked this pull request as ready for review December 9, 2021 09:36
@EnricoMi (Collaborator, Author) commented Dec 9, 2021

As promised I have also moved macOS tests to 1.10.0 and cleaned up other framework versions (f98a1c3), harmonizing with docker-compose.test.yml. I guess this will break macOS in many ways. If that is the case I will put this in a separate PR.

Signed-off-by: Enrico Minack <github@enrico.minack.dev>
@EnricoMi EnricoMi force-pushed the branch-ci-move-to-pytorch-1.10 branch from fcce163 to f98a1c3 on December 9, 2021 at 13:59
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
@EnricoMi EnricoMi force-pushed the branch-ci-move-to-pytorch-1.10 branch 3 times, most recently from d98791f to 7fdba8e on December 10, 2021 at 16:04
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
@EnricoMi EnricoMi force-pushed the branch-ci-move-to-pytorch-1.10 branch from 7fdba8e to 39a03af on December 10, 2021 at 16:46
@EnricoMi (Collaborator, Author) commented Dec 10, 2021

Interestingly, Horovod does not compile on macOS with mxnet 1.8.0, so I went for the latest mxnet versions before that. The macOS package of mxnet 1.8.0.post0 does not contain the mxnet/include/mkldnn/dnnl_config.h file.

@EnricoMi EnricoMi merged commit df18797 into master Dec 11, 2021
@EnricoMi EnricoMi deleted the branch-ci-move-to-pytorch-1.10 branch December 11, 2021 08:20
@EnricoMi (Collaborator, Author) commented:

@maxhgerlach looks like the macOS tests are flaky; sometimes they fail. Here it looks like TorchTests::test_horovod_join_allgather hangs for five minutes before the tests time out:

Sun, 12 Dec 2021 22:26:38 GMT [1,0]<stdout>:test_torch.py::TorchTests::test_horovod_join_allgather [1,1]<stdout>:SKIPPED (Gloo...)[1,1]<stdout>:
Sun, 12 Dec 2021 22:31:33 GMT [1,1]<stdout>:test_torch.py::TorchTests::test_horovod_join_allgather 
Sun, 12 Dec 2021 22:31:33 GMT Error: Process completed with exit code 124.

@maxhgerlach (Collaborator) commented:

Hmm, @Tixxx has posted PR #3313 that should fix aspects of the interplay between join and allgather. Maybe that one will help here?

@EnricoMi (Collaborator, Author) commented:

Thanks for the pointer; I'll test this in #3301, where I see this issue consistently.

tkhanna1996 pushed a commit to tkhanna1996/horovod that referenced this pull request Dec 16, 2021
* Adding PyTorch 1.10.0 to test space, upgrading to 1.9.1 while removing 1.6.0
* Skip test_delta_optimizer with PyTorch 1.10
* Harmonize macOS tests with docker-compose.test.yml
* Latest mxnet does not compile for macOS

Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Co-authored-by: Max H. Gerlach <git@maxgerlach.de>