Reenable test_ucx_config_w_env_var #8272

Merged: 1 commit, Nov 9, 2023

pentschev
Member

Some time ago test_ucx_config_w_env_var started failing intermittently, and the cause remained unknown. After some investigation, it seems that in certain cases exchanging UCX-Py peer information causes some of the underlying communication calls to never complete, resulting in a hang that Distributed cannot recover from. With rapidsai/ucx-py#994, UCX-Py now has a timeout on those calls, which allows Distributed to catch the failure and retry establishing the connection. This seems to resolve the problem.
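
As a rough illustration of the timeout-and-retry idea described above, here is a minimal asyncio sketch. It does not use the UCX-Py or Distributed APIs: `exchange_peer_info` and `connect_with_retry` are hypothetical names, and the timeout and attempt count are arbitrary values chosen for the example. The first attempt simulates a call that never completes. Bounding it with a timeout turns the hang into an exception that the caller can catch and retry.

```python
import asyncio


async def exchange_peer_info(attempt: int) -> str:
    # Hypothetical stand-in for a peer-info exchange: the first attempt
    # simulates a call that never completes, later attempts succeed.
    if attempt == 0:
        await asyncio.Event().wait()  # the event is never set, so this hangs
    return "connected"


async def connect_with_retry(timeout: float = 0.05, max_attempts: int = 3):
    # Bound every attempt with a timeout so that a hang surfaces as a
    # catchable TimeoutError instead of blocking the caller forever.
    for attempt in range(max_attempts):
        try:
            status = await asyncio.wait_for(exchange_peer_info(attempt), timeout)
            return status, attempt + 1
        except asyncio.TimeoutError:
            continue  # the hung call was cancelled; try again
    raise ConnectionError(f"no connection after {max_attempts} attempts")


status, attempts = asyncio.run(connect_with_retry())
print(status, attempts)  # prints "connected 2"
```

Without the timeout, the first attempt would block indefinitely and the retry logic would never get a chance to run, which matches the unrecoverable hang described above.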

Closes #5229

  • Tests added / passed
  • Passes pre-commit run --all-files

@pentschev
Member Author

rerun tests

3 similar comments

@pentschev
Member Author

I think this is now resolved for good. As you can see, this PR neither xfails the test nor marks it as flaky to trigger reruns, and after running gpuCI 5 times in total, no failures have occurred. If there are no objections, this is probably good to merge from the gpuCI side.

cc @jrbourbeau @crusaderky @quasiben @charlesbluca

@github-actions
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

25 files ±0, 25 suites ±0, duration 14h 39m 11s (-21m 58s)
3 860 tests ±0: 3 738 passed (+2), 117 skipped (±0), 5 failed (-2)
44 949 runs ±0: 42 801 passed (+1), 2 121 skipped (±0), 27 failed (-1)

For more details on these failures, see this check.

Results for commit 5ce2c91. ± Comparison against base commit 5cedc47.

@quasiben
Member

quasiben commented Nov 9, 2023

planning to merge this afternoon if there is no further feedback

@quasiben
Member

quasiben commented Nov 9, 2023

Merging in

@quasiben quasiben merged commit e98dcb1 into dask:main Nov 9, 2023
17 of 33 checks passed
@pentschev pentschev deleted the reenable-test_ucx_config_w_env_var branch November 13, 2023 16:03
Successfully merging this pull request may close these issues.

distributed.comm.tests.test_ucx_config.test_ucx_config_w_env_var flaky
3 participants