
Do not filter tasks before gathering data #6371

Merged
merged 7 commits into dask:main from WSMR/gather_dep_preamble on May 20, 2022

Conversation


@crusaderky crusaderky commented May 18, 2022

  • Partially closes Migrate ensure_communicating transitions to new WorkerState event mechanism #5896
  • Follow-up to Deadlock - Ensure resumed flight tasks are still fetched #5426
  • Remove all sophisticated and time-sensitive logic from the section of gather_dep leading to the async call to the other worker. This is a prerequisite to the refactoring of the function.
  • Remove exceptional case handling for a task that transitions from flight to cancelled while the event loop has finished _handle_instructions but has not yet reached gather_dep. I will call this period the handle_instructions-to-gather_dep interstice from now on (see the toy sketch after this list). The logic is now the same whether the task transitions to cancelled in that tiny window or at any moment during the network comms.
  • Remove an incorrect (and, to my understanding, untested) transition from memory to fetch when a task transitions from flight to memory during the handle_instructions-to-gather_dep interstice. The behaviour is now the same as when the task transitions from flight to memory at any moment during the comms (the comms output is discarded). Added a test for it.
  • Remove inconsequential performance optimizations that would skip the comms entirely if a task transitions from flight to cancelled or memory during the interstice.
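For context, here is a toy sketch of the interstice mentioned above (plain asyncio, not distributed code; every name in it is made up for illustration): _handle_instructions merely schedules gather_dep as a separate task, so the task's state can change before gather_dep ever runs.

import asyncio

state = {"x": "flight"}

async def gather_dep_toy(key: str) -> None:
    # By the time this coroutine runs, another handler may already have flipped
    # the state to "cancelled" or "memory" -- that scheduling gap is the "interstice".
    print(f"gather_dep sees {key} in state {state[key]!r}")

async def main() -> None:
    task = asyncio.create_task(gather_dep_toy("x"))  # roughly what _handle_instructions does
    state["x"] = "cancelled"                         # a cancellation lands in the interstice
    await task                                       # only now does gather_dep observe "cancelled"

asyncio.run(main())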

@crusaderky crusaderky self-assigned this May 18, 2022
@pytest.mark.parametrize("close_worker", [False, True])
@pytest.mark.parametrize(
"close_worker", [False, pytest.param(True, marks=pytest.mark.slow)]
)
Collaborator Author

(cancelled, True) now takes 5s instead of 100ms, as the network comms are now fired blindly.
(resumed, True) was already taking 5s before this PR.

Collaborator

Is that related to #6354 at all? Because Scheduler.remove_worker doesn't flush or await the BatchedSend, so after remove_worker returns, there's still some delay until it receives the message and actually shuts down? 5s seems longer than I'd expect.

Collaborator Author

I'm getting this traceback:

  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4575, in _get_data
    comm = await rpc.connect(worker)
  File "/home/crusaderky/github/distributed/distributed/core.py", line 1184, in connect
    return await connect_attempt
  File "/home/crusaderky/github/distributed/distributed/core.py", line 1120, in _connect
    comm = await connect(
  File "/home/crusaderky/github/distributed/distributed/comm/core.py", line 315, in connect
    raise OSError(
OSError: Timed out trying to connect to tcp://127.0.0.1:34011 after 5 s

When Worker.close() is invoked, nothing seems to explicitly shut down the RPC channel.

Collaborator

await self.rpc.close()

Maybe this just happens too late?

Collaborator Author

Moving to #6409

for key in to_gather_keys:
    ts = self.tasks.get(key)
    if ts is None:
        continue
Collaborator Author

This should never happen. The finally clause of gather_dep asserts as much by using an unguarded access, self.tasks[key].
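For illustration, a minimal standalone sketch (not the actual worker code; the names here are invented) of the two access styles being contrasted: the guarded lookup silently tolerates a missing key, while the unguarded lookup raises a KeyError and surfaces a broken invariant immediately.

tasks: dict[str, object] = {}

def guarded(key: str) -> None:
    ts = tasks.get(key)
    if ts is None:
        return  # silently skips -- and hides the fact that an invariant was violated

def unguarded(key: str) -> None:
    ts = tasks[key]  # raises KeyError if the task was forgotten, making the bug visible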

Member

Yes, we worked very hard to ensure tasks are not accidentally forgotten. I encourage being as strict as possible with this. A KeyError is often a sign of a messed-up transition somewhere else.

    stop: float,
    data: dict[str, Any],
    cause: TaskState,
    worker: str,
Collaborator Author

Off topic: post-refactor, should this method move to the state machine class, or stay in Worker proper?

Member

It's network- and diagnostics-related, so I'm inclined to say this does not belong in the state machine class.

if ts.state == "cancelled":
recommendations[ts] = "released"
else:
recommendations[ts] = "fetch"
Collaborator Author

ts.state == "memory"

recommendations[ts] = "released"
else:
recommendations[ts] = "fetch"
if ts.state == "cancelled":
Collaborator Author

This tests ts.state a lot later than before. There's a new test in this PR to verify this works for tasks that transition during the comms.
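To illustrate the new ordering, here is a self-contained sketch (hypothetical names and a simplified model, not the real gather_dep or TaskState) of the decision made once the network response is back: the request has already been fired unconditionally, and only now is ts.state consulted.

from __future__ import annotations

from dataclasses import dataclass
from typing import Any

@dataclass(eq=False)
class TS:  # toy stand-in for the worker's TaskState
    key: str
    state: str
    done: bool = False

def recommend_after_comms(tasks: dict[str, TS], keys: set[str], data: dict[str, Any]) -> dict[TS, Any]:
    # Decide what to do with each requested key after gather_dep's comms have returned.
    recommendations: dict[TS, Any] = {}
    for key in keys:
        ts = tasks[key]
        ts.done = True
        if ts.state == "cancelled":
            # cancelled in the interstice or during the comms: handled identically
            recommendations[ts] = "released"
        elif key in data:
            recommendations[ts] = ("memory", data[key])
        else:
            # the peer did not return the key (e.g. it was busy): retry later
            recommendations[ts] = "fetch"
    return recommendations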

Collaborator

Was the prior behavior just an optimization to avoid fetching keys that were cancelled in the handle_instructions to gather_dep interstice? Before, we avoided fetching them; now we don't? Seems like a nice thing to add back eventually, but the simplification here is nice.

I'm curious how much a blocked event loop #6325 would make this scenario more likely.

Collaborator Author

The prior behaviour was introduced by #5426 as a response to a deadlock.

Before #5426 you had two use cases:
a1. cancelled during comms
a2. cancelled task is received and implicitly transitioned to released

b1. cancelled in the interstice
b2. cancelled task is not fetched
b3. deadlock

After #5426:
a1. cancelled during comms
a2. cancelled task is received and implicitly transitioned to released

b1. cancelled in the interstice
b2. cancelled task is explicitly transitioned to released

After this PR:

  1. cancelled whenever
  2. cancelled task is explicitly transitioned to released

Collaborator Author

I really don't think we should care about performance optimizations in this case. Transitions from flight to cancelled should not be that frequent to begin with?

Member

I agree that we should remove this optimization if possible. It's not worth it and it didn't feel great to introduce it in the first place.
By now, I trust tests around these edge cases enough that if all is green after removal, we're good to go

@crusaderky crusaderky marked this pull request as ready for review May 18, 2022 19:49
Collaborator

@gjoseph92 gjoseph92 left a comment

Overall seems good, I appreciate the simplification.

recommendations[ts] = "released"
else:
recommendations[ts] = "fetch"
if ts.state == "cancelled":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was the prior behavior just an optimization to avoid fetching keys that were cancelled in the handle_instructions to gather_dep interstice? Before, we avoided fetching them; now we don't? Seems like a nice thing to add back eventually, but the simplification here is nice.

I'm curious how much a blocked event loop #6325 would make this scenario more likely.

@pytest.mark.parametrize("close_worker", [False, True])
@pytest.mark.parametrize(
"close_worker", [False, pytest.param(True, marks=pytest.mark.slow)]
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that related to #6354 at all? Because Scheduler.remove_worker doesn't flush or await the BatchedSend, so after remove_worker returns, there's still some delay until it receives the message and actually shuts down? 5s seems longer than I'd expect.

distributed/worker.py Outdated Show resolved Hide resolved
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
@github-actions
Copy link
Contributor

github-actions bot commented May 19, 2022

Unit Test Results

    15 files ±0      15 suites ±0      7h 14m 38s ⏱️ +9m 49s
 2 806 tests    +2      2 725 passed ✔️    +2       78 skipped 💤  -1      3 failed +1
20 808 runs  -386     19 886 passed ✔️  -323      919 skipped 💤 -64      3 failed +1

For more details on these failures, see this check.

Results for commit 6f9caed. ± Comparison against base commit 33fc50c.

♻️ This comment has been updated with latest results.

Comment on lines 3161 to 3162
typically the next to be executed but since we're fetching tasks for potentially
many dependents, an exact match is not possible.
Member

FYI this entire "get_cause" thing is necessary for acquire_replica where there is not necessarily a dependent known to the worker. It's not about the ambiguity of having multiple dependents

Collaborator Author

Added a note about acquire-replicas

recommendations[ts] = "released"
else:
recommendations[ts] = "fetch"
if ts.state == "cancelled":
Member

@fjetter fjetter May 19, 2022

I believe we should remove this special treatment. The bigger point of the transition system was to simplify these kinds of clauses and allow us to make a recommendation without investigating start states. This did not work well all the time, but in this case it works flawlessly and reduces complexity as intended.

I also like the original transition log better, because a successful fetch should recommend a transition to memory. However, the state machine decides to forget instead, because it knows the history and knows that the key was cancelled. This is much more in line with how I envision this system working.

diff --git a/distributed/tests/test_cancelled_state.py b/distributed/tests/test_cancelled_state.py
index cab21a5c..74a039b7 100644
--- a/distributed/tests/test_cancelled_state.py
+++ b/distributed/tests/test_cancelled_state.py
@@ -322,10 +322,7 @@ async def test_in_flight_lost_after_resumed(c, s, b):
             ("free-keys", (fut1.key,)),
             (fut1.key, "resumed", "released", "cancelled", {}),
             # After gather_dep receives the data, the task is forgotten
-            ("receive-dep", a.address, {fut1.key}),
-            (fut1.key, "release-key"),
-            (fut1.key, "cancelled", "released", "released", {fut1.key: "forgotten"}),
-            (fut1.key, "released", "forgotten", "forgotten", {}),
+            (fut1.key, "cancelled", "memory", "released", {fut1.key: "forgotten"}),
         ],
     )

diff --git a/distributed/worker.py b/distributed/worker.py
index cc2ea229..3f6319fa 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -3333,9 +3333,7 @@ class Worker(ServerNode):
             for d in self.in_flight_workers.pop(worker):
                 ts = self.tasks[d]
                 ts.done = True
-                if ts.state == "cancelled":
-                    recommendations[ts] = "released"
-                elif d in data:
+                if d in data:
                     recommendations[ts] = ("memory", data[d])
                 elif busy:
                     recommendations[ts] = "fetch"

Collaborator Author

Very happy to apply the patch if it doesn't deadlock elsewhere 😛

Collaborator Author

To clarify: if worker a asks worker b for x, but b either responds that it doesn't have a replica or doesn't respond at all, and in the meantime the scheduler cancels x on a, this will trigger a cancelled->fetch transition. Is this the desired behaviour?

Member

Desired behavior is probably a bit much. It will do the right thing because we'll have

ts.done = True
cancelled -> fetch

which will recommend a cancelled->release so we're good.

I fully admit that the ts.done attribute in this case is very awkward. It basically encodes that this fetch transition originates from either the gather_dep result or the execute result. Therefore, we could just as well remove the ts.done attribute and deal with these transitions in the gather_dep/execute result the way you are proposing in this PR. When I introduced this (many months ago) I felt it would reduce code complexity (as in having fewer conditionals).
Given that we still have the ts.done attribute, I believe the patch I am proposing is the more idiomatic way but I'm happy to revisit this in a later iteration.
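A hedged sketch (hypothetical handler name and signature, not the actual distributed code) of the behaviour described above: once ts.done is set, a cancelled task that would otherwise be re-fetched is recommended for release instead.

from typing import Optional

def cancelled_fetch_recommendation(ts) -> Optional[str]:
    # ts.done means the gather_dep/execute coroutine for this key has already
    # returned, so nothing is running for it anymore and it is safe to drop it.
    if ts.done:
        return "released"
    # Otherwise a coroutine is still in flight; keep waiting for its result.
    return None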

Collaborator

I would also be in favor of removing ts.done eventually and having the logic in gather_dep and execute like here. Or maybe it's just a naming issue—ts.done is a pretty generic/ambiguous term. But I think from a #5736 perspective, having this extra piece of state (done) that affects the behavior of transitions makes things harder to reason about. Though I do appreciate that it protects you from forgetting about these edge cases and having to check whether ts.state == "cancelled".

Maybe this is over the top, but what if done was a state? Call it fetched and executed, since they might need different logic and I don't like overlapping the states of execution vs fetching anyway. Then you'd have different transition handlers for flight->fetched vs cancelled->fetched. Forgetting to handle the cancelled possibility would be an impossible transition error, instead of a bug and maybe deadlock.

Collaborator Author

I don't think "fetched" or "executed" is a good idea - I'd rather look into moving away from intermediate states, not adding more.

recommendations[ts] = "released"
else:
recommendations[ts] = "fetch"
if ts.state == "cancelled":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should remove this optimization if possible. It's not worth it and it didn't feel great to introduce it in the first place.
By now, I trust tests around these edge cases enough that if all is green after removal, we're good to go

crusaderky added a commit to crusaderky/distributed that referenced this pull request May 20, 2022
@crusaderky crusaderky mentioned this pull request May 20, 2022
@fjetter fjetter changed the title Simplify preamble of gather_dep Do not filter tasks before gathering data May 20, 2022

fjetter commented May 20, 2022

I changed the title of the PR to get a better changelog entry, since this is not just a simplification but also a behavioral change.


Comment on lines +332 to +333
await wait_for_state("x", "flight", a)
a.update_data({"x": 3})
Member

if x is already in flight, how could the data end up in memory without us doing it explicitly like in this unit test?
acquire_replica and "fetch_dependency" should not fetch this key a second time.

From my reading of the code, the only way this could happen is via Client.scatter. I would argue a user should not be allowed to scatter a key that is already known to the cluster and due to be computed.

I don't want to block this PR for this test but if the above outline is how we end up in this situation, I think we should prohibit scattering such keys and shrink the space of possible/allowed transitions.

Specifically, I'm inclined to say a.update_data({"x": 3}) should raise an exception if x is in flight.

Thoughts?

Member

Specifically, I'm inclined to say a.update_data({"x": 3}) should raise an exception if x is in flight.

That might translate to something like

def transition_flight_memory(...):
    if not ts.done:
        raise ImpossibleTransition("A nice exception that tells us that we cannot move data to memory while in flight but coro/task still running")

(Where the exception is supposed to be raised is not the point of my argument. It may not be feasible to raise in the transition itself, idk)

Member

The more I think about this, the stronger I feel about it, because these kinds of race conditions are part of why I introduced cancelled/resumed: to avoid us needing to deal with these transitions.
If the fetch task were to finish successfully, this would cause a memory->memory transition. Since that is not allowed/possible, it would instead cause a

memory->released
(possibly the released transition would cancel some follow up tasks)
released->memory

or as a concrete story

[
    (ts.key, "flight", "memory", "memory", {dependent: "executing"})
    (dependent.key, "waiting", "executing", "executing", {}),
    # A bit later after gather_dep returns
    (ts.key, "memory", "memory", "released", {dependent: "released"}),
    (dependent.key, "executing", "released", "cancelled", {}),
    (ts.key, "released", "memory", "memory", {dependent: "waiting"}),
    (dependent.key, "cancelled", "waiting", "executing", {}),
]

Writing down the expected story made me realize that our transition flow should heal us here, but we'd be performing a lot of unnecessary transitions that could expose us to problems.

Collaborator Author

I think that 80% of the problem is caused by the non-sequentiality of RPC calls vs. bulk comms.

  1. client scatters to a
  2. the scheduler does not know about scattered keys until the three-way round-trip between client, workers, and scheduler has been completed:
    keys, who_has, nbytes = await scatter_to_workers(
        nthreads, data, rpc=self.rpc, report=False
    )
    self.update_data(who_has=who_has, nbytes=nbytes, client=client)
  3. in the middle of that handshake, a client (not necessarily the same client) calls compute on b and then gather_dep to copy the key from b to a
  4. while the flight from b to a is in progress, the scatter finishes, which triggers update_data as shown in the test.

The only way to avoid this would be to fundamentally rewrite the scatter implementation. Which, for the record, I think is long overdue.

Collaborator Author

I'll explain the above in a comment in the test

Member

My point is only partially about technical correctness of race conditions but also about whether this is even a sane operation. How can a user know the value of x if x is supposed to be computed on the cluster?

Member

@fjetter fjetter left a comment

I think this is good to go. The question about the test case is something that should inform a possible follow-up and should not block this PR, imo.

@crusaderky crusaderky merged commit 4420644 into dask:main May 20, 2022
@crusaderky crusaderky deleted the WSMR/gather_dep_preamble branch May 20, 2022 13:11