CI failures and ConnectionPool errors #1236

Open
pentschev opened this issue Sep 25, 2023 · 5 comments

@pentschev
Member

We continue to have lots of tests failing in CI (see, for example, yesterday's and today's nightly runs). They occur more commonly in nightly builds, but they also happen in PR builds, although less often.

After a bit of further investigation, one of the issues I see is that we get lots of errors such as the one below when a cluster is shutting down.

2023-09-24 05:36:14,555 - distributed.worker - ERROR - Unexpected exception during heartbeat. Closing worker.
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.10/site-packages/distributed/worker.py", line 1253, in heartbeat
    response = await retry_operation(
  File "/opt/conda/envs/test/lib/python3.10/site-packages/distributed/utils_comm.py", line 454, in retry_operation
    return await retry(
  File "/opt/conda/envs/test/lib/python3.10/site-packages/distributed/utils_comm.py", line 433, in retry
    return await coro()
  File "/opt/conda/envs/test/lib/python3.10/site-packages/distributed/core.py", line 1344, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/envs/test/lib/python3.10/site-packages/distributed/core.py", line 1543, in connect
    raise RuntimeError("ConnectionPool is closed")
RuntimeError: ConnectionPool is closed

In most cases those errors seem harmless, but I can't yet say whether they are at least partly responsible for the failing tests. I was able to get one test to reproduce the error above locally in a consistent manner; see below.

# pytest -vs dask_cuda/tests/test_proxy.py -k test_communicating_proxy_objects[ucx-None]
==================================================== test session starts ====================================================
platform linux -- Python 3.9.18, pytest-7.4.2, pluggy-1.3.0 -- /opt/conda/envs/test/bin/python3.9
cachedir: .pytest_cache
rootdir: /repo
configfile: pyproject.toml
plugins: cov-4.1.0
collected 1175 items / 1174 deselected / 1 skipped / 1 selected

tests/test_proxy.py::test_communicating_proxy_objects[ucx-None] 2023-09-25 12:47:31,680 - distributed.worker - ERROR - Unexpected exception during heartbeat. Closing worker.
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/worker.py", line 1253, in heartbeat
    response = await retry_operation(
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/utils_comm.py", line 454, in retry_operation
    return await retry(
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/utils_comm.py", line 433, in retry
    return await coro()
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/core.py", line 1344, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/core.py", line 1543, in connect
    raise RuntimeError("ConnectionPool is closed")
RuntimeError: ConnectionPool is closed
2023-09-25 12:47:31,692 - tornado.application - ERROR - Exception in callback <bound method Worker.heartbeat of <Worker 'ucx://127.0.0.1:54751', name: 0, status: closed, stored: 0, running: 0/1, ready: 0, comm: 0, waiting: 0>>
Traceback (most recent call last):
  File "/opt/conda/envs/test/lib/python3.9/site-packages/tornado/ioloop.py", line 921, in _run
    await val
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/worker.py", line 1253, in heartbeat
    response = await retry_operation(
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/utils_comm.py", line 454, in retry_operation
    return await retry(
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/utils_comm.py", line 433, in retry
    return await coro()
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/core.py", line 1344, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/envs/test/lib/python3.9/site-packages/distributed/core.py", line 1543, in connect
    raise RuntimeError("ConnectionPool is closed")
RuntimeError: ConnectionPool is closed
PASSED

I was able to work around that error with the patch below, which essentially forcefully disables cuDF spilling.

diff --git a/dask_cuda/is_spillable_object.py b/dask_cuda/is_spillable_object.py
index cb85248..f30800b 100644
--- a/dask_cuda/is_spillable_object.py
+++ b/dask_cuda/is_spillable_object.py
@@ -48,6 +48,7 @@ def cudf_spilling_status() -> Optional[bool]:
         - None if the current version of cudf doesn't support spilling, or
         - None if cudf isn't available.
     """
+    return False
     try:
         from cudf.core.buffer.spill_manager import get_global_manager
     except ImportError:

It looks to me like from cudf.core.buffer.spill_manager import get_global_manager (simply importing it, without even calling get_global_manager() before returning, is already enough to fail) is again somehow battling Distributed to destroy resources. I've spent literally no time looking at the cuDF code to verify whether there's something obvious there, but I was hoping @madsbk would be able to tell whether there's something we should be concerned about there when working with Distributed's finalizers, or that @wence-, who has been to the depths of Distributed's finalizers and back, might have the magic power to pinpoint the problem right away.

@madsbk
Member

madsbk commented Sep 26, 2023

The only side-effect of get_global_manager(), when spilling is disabled, is the import of cudf.

Sure enough, this also triggers the issue:

diff --git a/dask_cuda/is_spillable_object.py b/dask_cuda/is_spillable_object.py
index cb85248..959d9f3 100644
--- a/dask_cuda/is_spillable_object.py
+++ b/dask_cuda/is_spillable_object.py
@@ -48,6 +48,8 @@ def cudf_spilling_status() -> Optional[bool]:
         - None if the current version of cudf doesn't support spilling, or
         - None if cudf isn't available.
     """
+    import cudf
+    return False
     try:
         from cudf.core.buffer.spill_manager import get_global_manager
     except ImportError:

@pentschev
Member Author

You're right, @madsbk. This is essentially being triggered by having

# Make the "disk" serializer available and use a directory that are
# remove on exit.
if ProxifyHostFile._spill_to_disk is None:
    tmpdir = tempfile.TemporaryDirectory()
    ProxifyHostFile(
        worker_local_directory=tmpdir.name,
        device_memory_limit=1024,
        memory_limit=1024,
    )
in the context of the test file; commenting out the ProxifyHostFile instantiation also prevents this from happening. IOW, it seems this is another instance of global state leaking into undesirable places.

I'm thinking instantiating ProxifyHostFile should be a fixture that is only set for tests that actually need it. Are there any reasons why we shouldn't be doing that?
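
For concreteness, something along these lines is what I have in mind (untested sketch; the fixture name is arbitrary, the import path is from memory, and the parameters are just copied from the snippet above):

import tempfile

import pytest

from dask_cuda.proxify_host_file import ProxifyHostFile


@pytest.fixture
def proxify_host_file_spill_to_disk():
    # Same setup as the module-level code above, but scoped to tests that
    # request the fixture: make the "disk" serializer available and use a
    # temporary directory that is cleaned up on teardown.
    tmpdir = tempfile.TemporaryDirectory()
    if ProxifyHostFile._spill_to_disk is None:
        ProxifyHostFile(
            worker_local_directory=tmpdir.name,
            device_memory_limit=1024,
            memory_limit=1024,
        )
    yield
    tmpdir.cleanup()

Tests that actually need spill-to-disk would then request the fixture explicitly, and everything else would run without the global ProxifyHostFile state being set up at import time.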

@madsbk
Member

madsbk commented Sep 26, 2023

> I'm thinking instantiating ProxifyHostFile should be a fixture that is only set for tests that actually need it. Are there any reasons why we shouldn't be doing that?

I think that is a good idea!

@wence-
Contributor

wence- commented Sep 26, 2023

I guess the problem is somehow that there are things in the ProxifyHostFile that are keeping objects alive, and since it is a module-level variable it never gets cleaned up until too late.

@wence-
Contributor

wence- commented Sep 26, 2023

Ah wait, something strange. So ProxifyHostFile._spill_to_disk is a singleton object that just sets up some parameters? The ProxifyHostFile object that is being constructed here is thrown away, so there is something strange.
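
For anyone following along, a minimal sketch of the pattern as I understand it (illustrative class names only, not the actual dask_cuda code):

# Illustrative only: the constructor populates a class-level singleton, so the
# instance created at module import time is discarded immediately while the
# state it set up stays alive for the rest of the process.
class SpillToDisk:
    def __init__(self, local_directory):
        self.local_directory = local_directory  # plus finalizers, shared state, etc.


class HostFile:
    _spill_to_disk = None  # class attribute, shared by every instance

    def __init__(self, worker_local_directory, device_memory_limit=None, memory_limit=None):
        if HostFile._spill_to_disk is None:
            HostFile._spill_to_disk = SpillToDisk(worker_local_directory)


HostFile(worker_local_directory="/tmp/spill", device_memory_limit=1024, memory_limit=1024)  # instance discarded...
assert HostFile._spill_to_disk is not None  # ...but the singleton persists

So throwing away the constructed object doesn't undo anything; whatever the singleton ends up holding (directories, finalizers, references) stays around until interpreter shutdown, which fits the module-level-state theory above.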
