Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make BatchedSend restartable #6329

Closed
wants to merge 27 commits into from

Conversation

gjoseph92
Copy link
Collaborator

This is #6272, with my PR comments addressed.

Closes #5481, #6228, #5480

  • Tests added / passed
  • Passes pre-commit run --all-files

@@ -4550,6 +4567,10 @@ async def add_client(
We listen to all future messages from this Comm.
"""
assert client is not None
assert client not in self.clients, f"Client {client!r} already exists"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes from fb8f1c9 are a driveby (@graingert and I discussed how the current logic is messed up in potentially allowing the wrong client batchedsend to be deleted and closed). I'm happy to pull them out to a separate PR.

def teardown(self):
self.bcomm.close()
async def close(self) -> None:
await self.bcomm.close()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do anything about my concern in #6272 (comment) because I don't know if it's valid or not

@github-actions
Copy link
Contributor

github-actions bot commented May 12, 2022

Unit Test Results

       15 files   -        1         15 suites   - 1   7h 5m 29s ⏱️ - 18m 38s
  2 797 tests +     26    2 710 ✔️ +     17    79 💤 +    1    8 +  8 
20 743 runs   - 1 387  19 806 ✔️  - 1 303  910 💤  - 111  27 +27 

For more details on these failures, see this check.

Results for commit d49d5ca. ± Comparison against base commit 9f02e7a.

♻️ This comment has been updated with latest results.

@gjoseph92
Copy link
Collaborator Author

This is currently plagued by a few reliably failing tests:

distributed/tests/test_client.py::test_retire_workers
distributed/tests/test_scheduler.py::test_retire_workers_close
distributed/tests/test_client.py::test_reconnect_timeout
distributed/tests/test_tls_functional.py::test_retire_workers
distributed/tests/test_stress.py::test_stress_scatter_death

I can't reproduce these failures locally. Additionally, when I restrict CI to only run these tests, they pass: https://github.com/dask/distributed/runs/6428523276?check_suite_focus=true

Interestingly, test_retire_workers seems to be leaking file descriptors:

================================ RESOURCE LEAKS ================================
distributed/tests/test_client.py::test_retire_workers leaking fds: leaked 3 file descriptor(s) (16->19)
distributed/tests/test_tls_functional.py::test_retire_workers leaking fds: leaked 1 file descriptor(s) (19->20)

@gjoseph92
Copy link
Collaborator Author

Update: the retire_workers tests do fail on Windows, even when run in isolation. test_reconnect_timeout (which is the one that fails most reliably normally) is still not failing in isolation.

@gjoseph92
Copy link
Collaborator Author

Thanks to 40e37fa, we can now at least see that the asyncio.Task being leaked from test_reconnect_timeout is coming from the Client: https://github.com/dask/distributed/runs/6429598771?check_suite_focus=true#step:11:1324

@fjetter
Copy link
Member

fjetter commented May 25, 2022

This should be no longer relevant after #6361

@fjetter fjetter closed this May 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Properly support restarting BatchedSend
2 participants