Raise UCXNotConnected on stream receive callback #799

Merged
merged 1 commit into from Oct 21, 2021

pentschev
Member

This error occurs non-deterministically in Distributed, generally during client._reconnect at shutdown, when peers have already begun closing. Distributed therefore needs to catch this exception and raise CommClosedError to prevent unhandled errors.
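
A minimal sketch of the handling described above, as it might look on the Distributed side. The exception classes here are local stand-ins so the snippet runs without ucx-py or distributed installed, and `recv_from_closed_peer` and `read` are hypothetical names for illustration, not the actual Distributed code.

```python
import asyncio


class UCXNotConnected(Exception):
    """Local stand-in for the exception this PR raises."""


class CommClosedError(IOError):
    """Local stand-in for Distributed's CommClosedError."""


async def recv_from_closed_peer(nbytes):
    # Hypothetical receive: fails the way a stream receive might
    # once the other end has already closed.
    raise UCXNotConnected("<stream_recv>: ")


async def read(recv, nbytes):
    # Translate the transport-level error into the comm-level error,
    # keeping the original exception as the cause for debugging.
    try:
        return await recv(nbytes)
    except UCXNotConnected as e:
        raise CommClosedError("Connection closed by peer") from e
```

Callers that already handle CommClosedError then need no UCX-specific branch; the original exception stays reachable through `__cause__`.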

@@ -113,6 +119,10 @@ cdef void _stream_recv_callback(
name = req_info["name"]
msg = "<%s>: " % name
exception = UCXCanceled(msg)
if status == UCS_ERR_NOT_CONNECTED:
Member

Should we add any debug logging here? If it's intermittent, this might help.

Member Author

I mean the way Dask uses this is intermittent, not the code itself. The problem seems to be that this error is raised when Dask tries to reconnect, but the reconnect happens after the other end has already closed. The exception can be seen clearly when it happens, and this is what dask/distributed#5449 is catching now, so logging doesn't seem useful to me at this point.
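
For reference, the status check in the hunk above can be sketched in plain Python. The hunk shows only the condition, so the body of the branch is an assumption based on the PR title; `Status` and `exception_for` are hypothetical names, the enum values are placeholders rather than the real ucs_status_t codes, and the exception classes are local stand-ins.

```python
from enum import Enum, auto


class Status(Enum):
    # Placeholder members; these are NOT the real ucs_status_t values.
    UCS_ERR_CANCELED = auto()
    UCS_ERR_NOT_CONNECTED = auto()


class UCXCanceled(Exception):
    """Local stand-in for the UCX-Py exception of the same name."""


class UCXNotConnected(Exception):
    """Local stand-in for the exception named in the PR title."""


def exception_for(status, name):
    # Same message format as in the hunk.
    msg = "<%s>: " % name
    # Default exception, as in the existing code.
    exception = UCXCanceled(msg)
    if status == Status.UCS_ERR_NOT_CONNECTED:
        # Assumed branch body: swap in the more specific exception.
        exception = UCXNotConnected(msg)
    return exception
```

With this mapping, a caller can tell a cancelled request apart from a peer that has already disconnected and handle only the latter as a closed comm.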

@quasiben quasiben merged commit f970ab6 into rapidsai:branch-0.23 Oct 21, 2021
@pentschev pentschev deleted the raise-stream-not-connected branch October 21, 2021 19:18