Reenable test_ucx_config_w_env_var
#8272
Some time ago `test_ucx_config_w_env_var` started failing intermittently, and the cause remained unknown. After some investigation, it seems that in certain cases exchanging UCX-Py peer information causes some of the underlying communication calls to never complete, resulting in a hang that Distributed cannot recover from. With rapidsai/ucx-py#994, UCX-Py now has a timeout on those calls that allows Distributed to catch the failure and retry establishing the connection, which seems to resolve the problem.
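For illustration only, the timeout-then-retry pattern described above can be sketched with plain `asyncio`. This is not the code from rapidsai/ucx-py#994 or from Distributed; `connect_with_retry`, `make_flaky_connect`, and the `attempts`/`timeout` parameters are hypothetical names chosen for this sketch, and the "hang" is simulated with a long sleep.

```python
import asyncio


async def connect_with_retry(connect, attempts=3, timeout=0.05):
    """Await connect() up to `attempts` times, bounding each try by `timeout` seconds.

    Hypothetical helper: it only illustrates the idea of turning a hang into a
    catchable timeout so the caller can retry.
    """
    last_exc = None
    for _ in range(attempts):
        try:
            # wait_for cancels the pending call and raises on timeout,
            # so a call that never completes no longer blocks forever.
            return await asyncio.wait_for(connect(), timeout=timeout)
        except asyncio.TimeoutError as exc:
            last_exc = exc  # remember the failure and try again
    raise ConnectionError(f"no connection after {attempts} attempts") from last_exc


def make_flaky_connect(hangs):
    """Hypothetical stand-in for a peer-info exchange that hangs on the first `hangs` calls."""
    calls = {"n": 0}

    async def connect():
        calls["n"] += 1
        if calls["n"] <= hangs:
            await asyncio.sleep(3600)  # simulate a call that never completes
        return f"connected on attempt {calls['n']}"

    return connect
```

Without the per-attempt timeout, the first hanging call would block the caller indefinitely; with it, the hang surfaces as an exception that the retry loop can handle.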
rerun tests
I think this is now resolved for good. As you can see, this PR neither xfails the test nor marks it as flaky to trigger reruns, and after gpuCI ran 5 times in total, no failures occurred. If there are no objections, this is probably good to merge from the gpuCI side.
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
25 files ±0, 25 suites ±0, 14h 39m 11s ⏱️ (-21m 58s)
For more details on these failures, see this check.
Results for commit 5ce2c91. ± Comparison against base commit 5cedc47.
Planning to merge this afternoon if there is no further feedback.
Merging in |
Closes #5229
pre-commit run --all-files