Add pytorch 1.10.0 to test space, remove 1.6.0 #3291
1a3ad98 to 946081e
Unit Test Results: 758 files (-40), 758 suites (-40), 8h 5m 9s ⏱️ (-32m 34s).
Results for commit 39a03af. ± Comparison against base commit be3b72d.
This pull request skips 5 tests.
♻️ This comment has been updated with latest results.
Unit Test Results (with flaky tests): 886 files (-88), 886 suites (-88), 9h 35m 59s ⏱️ (-3m 48s).
For more details on these failures, see this check.
Results for commit 39a03af. ± Comparison against base commit be3b72d.
This pull request skips 5 tests.
♻️ This comment has been updated with latest results.
950a801 to 952d17a
@maxhgerlach much better, but there is an error in an image that did not fail before.
I see, in test-cpu-openmpi-py3_8-tf2_6_0-keras2_6_0-torch1_10_0-mxnet1_8_0_p0-pyspark3_2_0:
or
or
Those could be related to my changes (which introduced barriers in the duplicate_name_error tests)... One would have to reproduce and investigate interactively.
When testing locally, an extra synchronization point at the end of test_*_duplicate_name_error seems to have fixed this deadlock: 6593da5. Let's see how this plays out in the CI.

Edit: I accidentally included some junk in an intermediate commit; it's fixed now.
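To illustrate the idea of an extra synchronization point at the end of a test, here is a minimal sketch. It is not the code from 6593da5 and does not use Horovod at all: plain threads stand in for ranks, `threading.Barrier` stands in for whatever collective the tests use to synchronize, and `run_workers`, `NUM_WORKERS`, and the event names are hypothetical.

```python
import threading

NUM_WORKERS = 2  # hypothetical: stands in for the number of ranks


def run_workers(final_sync: bool) -> list:
    """Run a toy test body on NUM_WORKERS threads and return the event log."""
    barrier = threading.Barrier(NUM_WORKERS)
    lock = threading.Lock()
    events = []

    def worker(rank: int) -> None:
        # All workers enter the test body together.
        barrier.wait()
        with lock:
            events.append(("test_body", rank))
        if final_sync:
            # Extra synchronization point at the end of the test: nobody
            # moves on until every worker has finished the test body.
            barrier.wait()
        with lock:
            events.append(("next_test", rank))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


if __name__ == "__main__":
    for event in run_workers(final_sync=True):
        print(event)
```

With `final_sync=True`, every `test_body` event is logged before any `next_test` event. Without it, a fast worker can start the next test while a slow one is still inside the current one, which is the kind of overlap that might leave collective operations mismatched across workers.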
faa1c93 to 6593da5
Excellent job, almost there! Now macOS is failing. I guess once that is fixed, Buildkite GPU tests are next to fail 😆.
It's never as easy as one thinks, is it? CI / Build and Test macOS (test-openmpi-cpu-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_9_0-mxnet1_5_0-macos) (pull_request) Failing after 47m:
This time all three builds failed in |
Under the assumption that remaining undefined behavior with |
Excellent predictive powers! https://buildkite.com/horovod/horovod/builds/6878#d726fe50-6ade-4bbf-8e68-6e1efc8e8268 Three timeouts in test-mixed-openmpi-gloo-py3_8-tf2_6_0-keras2_6_0-torch1_10_0-mxnet1_8_0_p0-pyspark3_2_0, each time the script reached |
Nope, not intentional, that always slips my attention. I will move macOS to 1.10 once the current issues are fixed for 1.9. We will see what falls out of that next.
I could reproduce the problem in a locally built container Curiously, the problem seems to go away entirely when I skip Obviously, this is not more than a crude workaround for some more fundamental problem. ( |
@EnricoMi, all the configurations with PyTorch 1.10 ran fine twice now. Twice, because I had an intermediate bug in the test script that failed on two of the MacOS configs with torch 1.9 in the first go. Of course, something remains fishy. Somebody still needs to understand why we need to skip |
@maxhgerlach I am rebasing this ... |
…g 1.6.0 Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
82612b5 to cc96124
As promised I have also moved macOS tests to 1.10.0 and cleaned up other framework versions (f98a1c3), harmonizing with |
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
fcce163 to f98a1c3
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
d98791f to 7fdba8e
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
7fdba8e to 39a03af
Interestingly, Horovod does not compile on macOS with mxnet 1.8.0, so I went for the latest mxnet versions before that. The macOS package of mxnet 1.8.0.post0 does not contain the |
@maxhgerlach looks like the macOS tests are flaky. Sometimes, they fail. Here it looks like
|
Thanks for the pointer, I'll test this in #3301 where I see this issue consistently.
* Adding PyTorch 1.10.0 to test space, upgrading to 1.9.1 while removing 1.6.0
* Skip test_delta_optimizer with PyTorch 1.10
* Harmonize macOS tests with docker-compose.test.yml
* Latest mxnet does not compile for macOS

Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Co-authored-by: Max H. Gerlach <git@maxgerlach.de>
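For readers wondering what a version-gated skip like "Skip test_delta_optimizer with PyTorch 1.10" can look like, here is a rough sketch using only the standard library. It is not the code from this PR: `parse_version`, `TORCH_VERSION`, and `OptimizerTests` are hypothetical, the version string is hard-coded so the snippet runs without PyTorch installed, and gating on exactly major.minor 1.10 is an assumption.

```python
import re
import unittest


def parse_version(version: str) -> tuple:
    """Turn a string like '1.10.0+cu113' into (1, 10, 0)."""
    parts = []
    # Drop a local build suffix such as '+cu113', keep major.minor.patch.
    for piece in version.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


# Assumption: a real test module would read this from torch.__version__.
TORCH_VERSION = "1.10.0"


class OptimizerTests(unittest.TestCase):
    @unittest.skipIf(
        parse_version(TORCH_VERSION)[:2] == (1, 10),
        "skipped with PyTorch 1.10, see the discussion in this PR",
    )
    def test_delta_optimizer(self):
        # Placeholder body; the real test lives in the project's test suite.
        pass


if __name__ == "__main__":
    unittest.main()
```

Comparing parsed tuples rather than raw strings matters here, because as strings `"1.10.0"` sorts before `"1.9.1"`.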
PyTorch 1.10.0 was released a month ago, 1.9.1 two months ago. This adds these versions to our test space and removes 1.6.0.