Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP/POC] ordered RPC #8430

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from
Draft

[WIP/POC] ordered RPC #8430

wants to merge 7 commits into from

Conversation

fjetter
Copy link
Member

@fjetter fjetter commented Dec 21, 2023

Closes #7480

cc @hendrikmakait

@fjetter
Copy link
Member Author

fjetter commented Dec 21, 2023

After implementing this, I'm not entirely sure if this is what we actually need/want. I'll have to think a little more about it. Either way, this doesn't handle many cases.

Copy link
Contributor

github-actions bot commented Dec 21, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       27 files  ±  0         27 suites  ±0   9h 39m 36s ⏱️ - 11m 22s
  3 953 tests +  3    3 843 ✔️ +  4     109 💤 ±0  1  - 1 
49 722 runs  +39  47 432 ✔️ +40  2 289 💤 ±0  1  - 1 

For more details on these failures, see this check.

Results for commit 570af51. ± Comparison against base commit 1c74474.

This pull request removes 2 and adds 5 tests. Note that renamed tests count towards both.
distributed.tests.test_worker ‑ test_heartbeat_comm_closed
distributed.tests.test_worker ‑ test_heartbeat_missing
distributed.shuffle.tests.test_limiter ‑ test_limiter_no_limit_no_statistics
distributed.tests.test_core ‑ test_ordered_rpc[False]
distributed.tests.test_core ‑ test_ordered_rpc[True]
distributed.tests.test_core ‑ test_ordered_rpc_comm_closed[False]
distributed.tests.test_core ‑ test_ordered_rpc_comm_closed[True]

♻️ This comment has been updated with latest results.

@fjetter
Copy link
Member Author

fjetter commented Dec 22, 2023

I cleaned this up and took it a step further. This now moves the worker heartbeat to the scheduler stream. This was not possible earlier because the heartbeat expected a scheduler response to adjust its interval. This is the only reason why the heartbeat used a dedicated connection pool.

@fjetter
Copy link
Member Author

fjetter commented Dec 22, 2023

I'm starting to like this after all. Moving the heartbeat was seamless and I believe this avoids a lot of anti patterns. The transfer of actual payload data should not be done over this interface so we can't get rid of the old PooledRPC thing just yet. I also believe that stuff like gather-data should not necessarily obey any ordering since this could cause unnecessary delays.

@fjetter
Copy link
Member Author

fjetter commented Dec 22, 2023

Well, thinking about the above statement again, I think there is nothing wrong with payload data on that stream as long as we're not re-using/abusing the primary administrative stream between scheduler/worker.

However, to use this interface for smth like P2P where we would want to use the same stream that is also used for all the task-finished/erred/etc. messages we'll likely need to extend this multiplexing to move data off-band after all and just transmit administrative metadata over the stream.

@@ -597,7 +597,7 @@ async def test_new_metrics_during_heartbeat(c, s, a):
a.digest_metric(("execute", span.id, "x", "test", "test"), 1)
await asyncio.sleep(0)
await hb_task
assert n > 9
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This 9 feels very magical. since the heartbeat doesn't have to open a new comm, we need fewer ticks until it completes and this is not guaranteed to be past 9.

I'm not entirely convinced this test still works/make sense/is relevant.

@fjetter
Copy link
Member Author

fjetter commented Dec 22, 2023

@hendrikmakait this now moved the P2P logic to the ordered PRC. Do you recall which logic was put in P2P to ensure ordering? I would hope that we could remove some logic to test this (and ultimately reduce complexity)

@fjetter
Copy link
Member Author

fjetter commented Dec 22, 2023

The one test failure right now is test_restarting_during_unpack_raises_killed_worker which appears to be a timing / poor test logic thing that I can also trigger on main in ~1% of cases under load.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ordered send_recv pattern for RPCs
1 participant