AMM: Graceful Worker Retirement #5381
Conversation
distributed/active_memory_manager.py
Outdated
be retired will be bombarded with ``Worker.get_data`` calls from the rest of the
cluster. This can be a problem if most of the managed memory of the worker has been
This actually should not be a problem, since the worker limits concurrent incoming get_data requests with `distributed.worker.connections.incoming` (default 10). Of course, this might not be sufficient...
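For context, a minimal sketch of how that limit could be raised via Dask's configuration (the key name is the one mentioned above; the value 20 is an arbitrary example, and it needs to be set before the workers start to take effect):

```python
import dask

# Allow more concurrent incoming get_data requests per worker.
dask.config.set({"distributed.worker.connections.incoming": 20})
```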
distributed/worker.py
Outdated
@@ -1434,6 +1432,9 @@ async def close_gracefully(self, restart=None):
            restart = self.lifetime_restart

        logger.info("Closing worker gracefully: %s", self.address)
        # Scheduler.retire_workers will set the status to closing_gracefully and push it
        # back to this worker. We're additionally setting it here immediately to prevent
        # race conditions during the round trip.
        self.status = Status.closing_gracefully
This will trigger another message to the scheduler, won't it? How is this protecting from race conditions?
There are two use cases:

Use case 1: graceful shutdown initiated by the nanny (see the sketch after this list)
1. The worker sets Worker.status = closing_gracefully and sends the updated status to the scheduler.
2. Immediately afterwards, the worker invokes Scheduler.retire_workers.
3. retire_workers sets WorkerState.status = closing_gracefully. It's unknown whether the status update from (1) or the one from (2) arrives first, and we shouldn't care.
4. retire_workers sends a notification of the updated status to the worker, which is superfluous as its status is already closing_gracefully.
5. The scheduler stops (a) assigning new tasks to the worker and (b) rebalancing in-memory tasks onto the worker.

Use case 2: graceful shutdown initiated by the client or the adaptive scaler
1. retire_workers sets WorkerState.status = closing_gracefully and immediately stops (a) assigning new tasks to the worker and (b) rebalancing in-memory tasks onto the worker. This is crucial to avoid accidentally closing a worker with unique in-memory tasks on it.
2. retire_workers sends a notification of the updated status to the worker. This in turn causes the worker to stop processing queued tasks.

In both cases, queued tasks on the workers still need to be stolen (missing in this PR).
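A condensed sketch of the worker-side ordering in use case 1, based on the close_gracefully hunk quoted above (the `remove=False` argument and the free-standing function form are assumptions for illustration, not verbatim PR code):

```python
from distributed.core import Status

async def close_gracefully_sketch(worker):
    # Flip the local status first, so the worker already behaves as retiring
    # before the scheduler round trip completes (this is the race-condition
    # protection discussed above).
    worker.status = Status.closing_gracefully
    # Ask the scheduler to retire this address. The scheduler sets
    # WorkerState.status = closing_gracefully on its side and echoes the
    # status back to the worker, where it is now a no-op.
    await worker.scheduler.retire_workers(workers=[worker.address], remove=False)
```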
# FIXME an unresponsive worker takes ~88s to be removed from the scheduler.
# The below setting seems to do nothing.
"distributed.scheduler.worker-ttl": "5s",
FIXME Can't find the setting that governs this timeout
I observed a random failure in test_adapt_quickly. Those are present in main too; it remains to be seen if this PR makes them more frequent.
ALL GREEN except unrelated failures!
This PR is in draft status only because it incorporates #5653; it is otherwise ready for final review.
class RetireWorker(ActiveMemoryManagerPolicy):
    """Replicate somewhere else all unique in-memory tasks on a worker, preparing for
Just want to say this is excellent high-level documentation. I have a very good sense of the purpose of this and how it works without reading any code.
distributed/active_memory_manager.py
Outdated
- At every iteration, this policy drops all tasks that have already been replicated
  somewhere else. This makes room for further tasks to be moved out of the spill
  file in order to be replicated onto another worker.
I find myself wondering how much of a race condition there is here between the get_data requests from peers and the drop signals from the scheduler. Might be good to discuss this in more detail?
It should be safe. get_data will fail and the requesting worker will try again on the next worker that owns the data. ActiveMemoryManagerExtension._find_dropper makes sure that you don't accidentally drop the last replica of a key.
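A minimal sketch of the invariant being described (hypothetical helper name; the real _find_dropper lives in the AMM extension and weighs more criteria than just the replica count):

```python
def pick_dropper(ts, candidates):
    """Pick a worker that may safely drop task `ts`, or return None.

    Purely illustrative: never drop the last in-memory replica of a key.
    `ts.who_has` is assumed to be the set of workers currently holding the key.
    """
    if len(ts.who_has) < 2:
        return None  # refuse to drop the only copy
    # Only consider candidate workers that actually hold the key.
    viable = [ws for ws in candidates if ws in ts.who_has]
    return viable[0] if viable else None
```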
@gen_cluster(
    client=True,
    Worker=Nanny,
    config={
Would it be helpful to use a similar config as #5697 and only look at managed memory? (I know AMM currently is hardcoded to use optimistic memory, but that seems like an easy thing to make configurable. Just thinking about any future test flakiness due to unmanaged memory changes.)
Yes, something for a future PR indeed
If ReduceReplicas is enabled,

1. On the first AMM iteration, either ReduceReplicas or RetireWorker (arbitrarily
   depending on which comes first in the iteration of
For this particular test, won't ReduceReplicas always come first, since RetireWorker isn't added until c.retire_workers gets called? Maybe you want three possibilities for active-memory-manager.policies:
- [ReduceReplicas, RetireWorker]
- [RetireWorker, ReduceReplicas]
- [RetireWorker]
ActiveMemoryManager.policies is a set, so the order of the policies is pseudo-random and not controllable from a unit test.
# the time to migrate. This is OK if it happens occasionally, but if this
# setting is too aggressive the cluster will get flooded with repeated comm
# requests.
Is this a problem in real-world clusters too?
Yes. As the comment says, it's OK if once in a while a task takes more than 2 seconds to migrate; it could be problematic if many tasks take longer than 2 seconds over and over. I don't really have a feel for how long it typically takes on a big, busy cluster between get_data and the actual data arriving.
On a busy cluster, it wouldn't surprise me if the full roundtrip (including the scheduler finding out the task had migrated) could take longer than 2s, especially when data is spilled to disk, and especially if multiple workers are retiring (scale-down).
I suppose we'll just have to try it and find out. I could see it becoming necessary to have some memory on the AMM and not drop self.pending every cycle (or store pending as prev for the next cycle, and remove any current suggestions that were in prev, or something), but that will be a future PR.
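A rough sketch of that "remember prev" idea (not part of this PR; suggestions are assumed to be hashable (key, recipient) pairs):

```python
class PreviousCycleMemory:
    """Skip re-issuing suggestions already sent last cycle and possibly still in flight."""

    def __init__(self):
        self._prev: set = set()

    def filter(self, suggestions: set) -> set:
        # Only emit suggestions that were not already issued on the previous
        # AMM cycle, then remember the current batch for the next cycle.
        fresh = suggestions - self._prev
        self._prev = set(suggestions)
        return fresh
```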
Can we see this when it happens? Are there logs or other traces or would we be blind if that happened?
To clarify: this is exclusively an issue if
- a substantial amount of keys are not replicated within 2 seconds, AND
- something other than the AMM, such as a computation or a new worker coming online, tilted the memory balance of the cluster enough that the AMM will take a substantially different decision from the previous one
In all other cases, including when some/most of the keys that the AMM asked to be replicated have already moved and increased the memory usage in the recipient, the AMM will just take the same decision as before.
Again, it is OK if this happens occasionally - you'll just end up with a bunch of superfluous replications which will be reverted shortly by ReduceReplicas.
Can we see this when it happens?
At every AMM iteration, in the debug log you can read how many keys each worker is going to receive and drop. You may witness a worker that was previously unaffected suddenly loaded with replication requests.
e.g.
FIRST ITERATION:
distributed.active_memory_manager - DEBUG - Running policy: ReduceReplicas()
distributed.active_memory_manager - DEBUG - Running policy: RetireWorker('tcp://127.0.0.1:33379')
distributed.active_memory_manager - DEBUG - Enacting suggestions for 143 tasks:
distributed.active_memory_manager - DEBUG - - <WorkerState 'tcp://127.0.0.1:34809', name: tcp://127.0.0.1:34809, status: running, memory: 1, processing: 0> to acquire 95 replicas
distributed.active_memory_manager - DEBUG - - <WorkerState 'tcp://127.0.0.1:34810', name: tcp://127.0.0.1:34810, status: running, memory: 1, processing: 0> to acquire 48 replicas
distributed.active_memory_manager - DEBUG - Active Memory Manager run in 1ms
HEALTHY second iteration: most of the tasks have been replicated within 2s
distributed.active_memory_manager - DEBUG - Running policy: ReduceReplicas()
distributed.active_memory_manager - DEBUG - Running policy: RetireWorker('tcp://127.0.0.1:33379')
distributed.active_memory_manager - DEBUG - Enacting suggestions for 12 tasks:
distributed.active_memory_manager - DEBUG - - <WorkerState 'tcp://127.0.0.1:34809', name: tcp://127.0.0.1:34809, status: running, memory: 1, processing: 0> to acquire 12 replicas
distributed.active_memory_manager - DEBUG - Active Memory Manager run in 1ms
HEALTHY second iteration: network and/or the workers' event loop are slow, but nothing besides the AMM itself substantially changed the cluster balance
distributed.active_memory_manager - DEBUG - Running policy: ReduceReplicas()
distributed.active_memory_manager - DEBUG - Running policy: RetireWorker('tcp://127.0.0.1:33379')
distributed.active_memory_manager - DEBUG - Enacting suggestions for 135 tasks:
distributed.active_memory_manager - DEBUG - - <WorkerState 'tcp://127.0.0.1:34809', name: tcp://127.0.0.1:34809, status: running, memory: 1, processing: 0> to acquire 92 replicas
distributed.active_memory_manager - DEBUG - - <WorkerState 'tcp://127.0.0.1:34810', name: tcp://127.0.0.1:34810, status: running, memory: 1, processing: 0> to acquire 43 replicas
distributed.active_memory_manager - DEBUG - Active Memory Manager run in 1ms
UNHEALTHY second iteration: network and/or the workers' event loop are slow, and a computation suddenly saturated the first recipient, so the AMM now decides to load the second recipient a lot more. When (if) eventually the replicas on the first recipient finally go through, we'll end up with a bunch of duplicates which will be removed by ReduceReplicas.
distributed.active_memory_manager - DEBUG - Running policy: ReduceReplicas()
distributed.active_memory_manager - DEBUG - Running policy: RetireWorker('tcp://127.0.0.1:33379')
distributed.active_memory_manager - DEBUG - Enacting suggestions for 135 tasks:
distributed.active_memory_manager - DEBUG - - <WorkerState 'tcp://127.0.0.1:34809', name: tcp://127.0.0.1:34809, status: running, memory: 1, processing: 0> to acquire 20 replicas
distributed.active_memory_manager - DEBUG - - <WorkerState 'tcp://127.0.0.1:34810', name: tcp://127.0.0.1:34810, status: running, memory: 1, processing: 0> to acquire 115 replicas
distributed.active_memory_manager - DEBUG - Active Memory Manager run in 1ms
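For reference, one way to surface those messages in practice (an assumption about a typical logging setup, not something introduced by this PR):

```python
import logging

# The logger name matches the DEBUG lines shown above.
logging.getLogger("distributed.active_memory_manager").setLevel(logging.DEBUG)
```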
The only thing I'm worried about is that 2s is very small for key replication, imo. Network latencies, network bandwidth, (de-)serialization and disk IO all play together here. Especially if a single worker is required to offload a few GB of data, that will very likely take more than 2s, wouldn't it?
Debug logs are pretty hidden, but I'm not familiar enough with the system yet to make a better suggestion. I was wondering if there is a condition that would allow us to detect such a situation and then log a meaningful warning.
Either way, we can defer this to another PR/issue. I'm mostly wondering if this is something we should follow up on
I could keep track of incomplete replications from previous iterations and not re-issue them for, e.g., 30 seconds.
Discussion continues at #5759
All review comments have been addressed.
Nice work @crusaderky! 👍
# Sleep 0.01s when there are 4 tasks or less
# Sleep 0.5s when there are 200 or more
poll_interval = max(0.01, min(0.5, len(ws.has_what) / 400))
await asyncio.sleep(poll_interval)
Ah I see. I forgot that done depended on general scheduler state, not just the state of the RetireWorker instance. So the done state can flip even when the policy isn't actively doing anything.
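Putting the quoted snippet and this observation together, the waiting loop is roughly as follows (a paraphrase, not a copy of the PR code; `policy.done` and the function name are assumptions):

```python
import asyncio

async def wait_until_retired(policy, ws):
    # `done` depends on live scheduler state, not just on the RetireWorker
    # instance, so it can flip even while the policy itself is idle; poll it.
    while not policy.done:
        # Scale the polling interval with the number of keys still held by the
        # retiring worker: 0.01s at 4 keys or fewer, capped at 0.5s from 200 up.
        poll_interval = max(0.01, min(0.5, len(ws.has_what) / 400))
        await asyncio.sleep(poll_interval)
```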
Unit Test Results: 15 files (−3), 15 suites (−3), 8h 25m 3s ⏱️ (−1h 6m 43s). For more details on these failures, see this check. Results for commit 53ae7bc. ± Comparison against base commit 30ffa9c. ♻️ This comment has been updated with latest results.
All unit test failures are unrelated.
There is a merge conflict. I'll have a very brief look over the code; unless anything big jumps out, we're good to go.
**Failure condition**

There may not be any suitable workers to receive the tasks from the retiring worker.
This happens in two use cases:
Does this respect resources or host restrictions?
It doesn't. The current implementation of retire_workers (which in turn uses replicate) doesn't, either.
It would be good to hear from the GPU people who use a hybrid cluster if they have a use case of GPU data that will fail to unpickle on non-GPU workers (different answers for different libraries I guess?), but then again, nobody complained until today so I guess the answer is no? At any rate this PR does not introduce a regression.
Yes, I was worried about GPUs as well but I agree this is not our concern right now.
cc @quasiben for visibility.
If we need "Respect resources and restrictions in retirement" we should file an extra ticket.
I don't think the current resources design is fit for purpose.
Take, for example, a typical use of resources, which is to run a task that requires a huge heap and/or dependencies. This says nothing about the size of the output.
E.g. the task could be annotated with resources={"MEM": 20e9}, but its output could be a handful of MB.
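A hedged illustration of that point (arbitrary example, not from the PR): the annotation constrains where the task runs, but says nothing about the size of its output, which is what matters when picking a recipient for replication.

```python
import dask
import dask.array as da

# The task is declared to need ~20 GB of memory to compute...
with dask.annotate(resources={"MEM": 20e9}):
    result = da.random.random((50_000, 50_000)).sum()
# ...but its output is a single float, so the "MEM" annotation tells us nothing
# about which workers could safely receive a replica of that output.
```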
Yes, I'm not entirely sure resources are a perfect fit; host and worker restrictions, maybe.
LGTM. Remaining questions should not stop merging.
In it goes!
Scheduler.retire_workers(), either when called from the client or from the adaptive scaler, can now be invoked safely in the middle of a computation.
Done
Out of scope
CC @mrocklin @jrbourbeau @fjetter