Snooze stability #1643

Open
sk1p opened this issue May 16, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@sk1p
Member

sk1p commented May 16, 2024

While testing for 0.14 (#1623), we hit an issue where a dask worker tried to send a heartbeat a very short time (~10 ms) after snoozing - it would be good to write a reproducer and report this upstream. The impact was mostly an error printed to the log; the executor unsnoozed without issue afterwards.
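
A minimal reproducer sketch along those lines, assuming snoozing amounts to tearing down the LocalCluster while workers may still be mid-heartbeat (hypothetical, not verified to trigger the error reliably):

    # Hypothetical reproducer sketch (not from the issue): repeatedly "snooze" by
    # shutting down a LocalCluster while its workers are still heartbeating.
    import time

    from dask.distributed import Client, LocalCluster

    for attempt in range(20):
        # Small cluster, roughly what the executor would start
        cluster = LocalCluster(n_workers=2, threads_per_worker=1)
        client = Client(cluster)

        # Let the workers connect and start heartbeating
        client.wait_for_workers(2)
        time.sleep(0.5)

        # "Snooze": close everything; a heartbeat that races with the scheduler
        # shutdown should log the CommClosedError seen below
        client.close()
        cluster.close()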

@sk1p sk1p added the bug Something isn't working label May 16, 2024
@matbryan52
Member

Log from this error (hard to reproduce!):

[2024-05-21 11:13:55,880] INFO [libertem.web.state.snooze:106] Snoozing...
2024-05-21 11:13:55,894 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/worker.py", line 1252, in heartbeat
    response = await retry_operation(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/utils_comm.py", line 452, in retry_operation
    return await retry(
           ^^^^^^^^^^^^
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/utils_comm.py", line 431, in retry
    return await coro()
           ^^^^^^^^^^^^
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/core.py", line 1395, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/core.py", line 1154, in send_recv
    response = await comm.read(deserializers=deserializers)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/comm/tcp.py", line 236, in read
    convert_stream_closed_error(self, e)
  File "/nobackup/mb265392/.pyenv/versions/3.11.6/envs/dask_heartbeat/lib/python3.11/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://10.8.164.164:56438 remote=tcp://10.8.164.164:42579>: Stream is closed

@matbryan52
Member

matbryan52 commented May 21, 2024

This appears to be the same issue as dask/distributed#7891, although there it occurs in an interactive context.

These issues may also be relevant: dask/distributed#6384, dask/distributed#6354.

And more recently: dask/distributed#8522.
