Timeout for `exchange_peer_info` and fix for AM tests #994
Conversation
After some debugging of Distributed tests with UCX it was observed that sometimes `exchange_peer_info` hangs indefinitely, specifically when executing `stream_recv` on the client side. The cause is unknown, but it is believed that messages are being lost, either when multiple stream messages are transferred simultaneously among various endpoints or when the receiving end takes too long to launch `stream_recv`; see rapidsai#509 for a similar issue related to the stream API. Adding a timeout doesn't allow recovery, but it at least allows a UCX-Py client to retry upon failure to establish the endpoint.
@charlesbluca could you plan on rebuilding gpuCI's docker image once this is merged in? It should fix dask/distributed#5229, or it seems to do so at least.
Yup, can rebuild once this is reflected in ucx-py nightlies.
This should resolve rapidsai#797, which seems to be caused by checking received messages too early, before they are actually received by the listener. A roundtrip ensures the client receives the reply and thus prevents us from checking for a transfer that hasn't completed yet.
This change should prevent `DeprecationWarning`s from pytest-asyncio, which currently closes unclosed event loops but will stop doing so in future releases.
@rapidsai/ucxx-python-codeowners could I please get your review here ASAP? I have the expectation that this may improve the situation with Dask-CUDA tests, thus getting it merged soon will let us verify that.
LGTM; small non-blocking suggestion
ci/test_wheel.sh (Outdated)

```diff
@@ -14,4 +14,4 @@ cd tests
 python -m pytest --cache-clear -vs .
 cd ../ucp
 # skipped test context: https://github.com/rapidsai/ucx-py/issues/797
```
Should be good to remove this if this closes out #797:
# skipped test context: https://github.com/rapidsai/ucx-py/issues/797
Thanks @charlesbluca , fixed in 6a8ebf2. It seems that you're not in the ucxpy-python-codeowners group though, was this always the case? I really thought you were in that group.
Makes sense
/merge
Some time ago `test_ucx_config_w_env_var` started failing intermittently, and the cause was unknown. After some investigation it seems that in certain cases exchanging UCX-Py peer information causes some of the underlying communication calls to never complete, resulting in a hang that Distributed can't recover from. With rapidsai/ucx-py#994, UCX-Py now has a timeout on those calls that allows Distributed to catch the failure and retry establishing the connection, which seems to resolve the problem.
After some debugging of Distributed tests with UCX it was observed that sometimes `exchange_peer_info` hangs indefinitely, specifically when executing `stream_recv` on the client side. The cause is unknown, but it is believed that messages are being lost, either when multiple stream messages are transferred simultaneously among various endpoints or when the receiving end takes too long to launch `stream_recv`; see #509 for a similar issue related to the stream API. Adding a timeout doesn't allow recovery, but it at least allows a UCX-Py client to retry upon failure to establish the endpoint. This change seems to resolve dask/distributed#5229; at least it isn't reproducible locally with this change.

Additionally, do a roundtrip message transfer for `test_send_recv_am`, which should resolve #797 and seems to be caused by checking received messages too early, before they are actually received by the listener. A roundtrip ensures the client receives the reply and thus prevents us from checking for a transfer that hasn't completed yet.

Ensure now also that the listener is closed before validating `test_close_callback` conditions, which was also flaky.

Finally, ensure we close the loop in the test fixture, preventing `DeprecationWarning`s from pytest-asyncio, which currently closes unclosed event loops but will stop doing so in future releases.

Closes #797