More discriminating `RESTART` shutdown logic #107909
Labels
:Distributed/Allocation
All issues relating to the decision making around placing a shard (both master logic & on the nodes)
>enhancement
Team:Distributed
Meta label for distributed team
In a rolling restart we recommend users wait for the cluster health to reach `green` in between node restarts, and some users will also wait for rebalancing to complete each time. This is unnecessarily conservative: it's safe to restart a node while the cluster health is still `yellow` after the previous restart as long as the initializing shards are unrelated to the shards on the node that is to be restarted next.

It's not reasonable to ask users to compute when it's safe to restart a node themselves, but nor is it especially reasonable to wait for `green` health after each node since this may extend the restart time by hours or even days in a large cluster. I believe the shutdown API should be able to solve this by reporting `shardMigrationStatus == COMPLETE` on a `RESTART` shutdown when all the shards on the target node are fully replicated. That's different from today's behaviour in which a `RESTART` shutdown has `shardMigrationStatus == COMPLETE` immediately, forcing users to use other APIs (e.g. cluster health) to wait as necessary.
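
To make the proposed condition concrete, here is a minimal sketch of the check in Python. It is not Elasticsearch code: the `ShardCopy` model, the `restart_migration_status` function and the state strings are hypothetical, and "fully replicated" is interpreted as "every copy of every shard ID hosted on the target node is `STARTED`". A copy in any other state, including `RELOCATING`, is conservatively treated as blocking.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ShardCopy:
    """One copy (primary or replica) of a shard. Hypothetical model, not an Elasticsearch type."""
    index: str
    shard: int
    node: Optional[str]  # None when the copy is unassigned
    state: str           # assumed values: "STARTED", "INITIALIZING", "RELOCATING", "UNASSIGNED"


def restart_migration_status(copies: Iterable[ShardCopy], target_node: str) -> str:
    """Return "COMPLETE" if restarting target_node looks safe under the assumption above."""
    copies = list(copies)
    # Shard IDs that have a copy on the node we want to restart.
    on_target = {(c.index, c.shard) for c in copies if c.node == target_node}
    # Any non-started copy of one of those shard IDs blocks the restart.
    # Non-started copies of unrelated shards are ignored, so a yellow cluster can still pass.
    blocked = any(
        (c.index, c.shard) in on_target and c.state != "STARTED"
        for c in copies
    )
    return "IN_PROGRESS" if blocked else "COMPLETE"


copies = [
    ShardCopy("a", 0, "n1", "STARTED"),
    ShardCopy("a", 0, "n2", "STARTED"),
    ShardCopy("b", 0, "n2", "STARTED"),
    ShardCopy("b", 0, "n3", "INITIALIZING"),  # still recovering after n3's restart
]
print(restart_migration_status(copies, "n1"))  # COMPLETE: n1 only holds a[0], which is fully started
print(restart_migration_status(copies, "n2"))  # IN_PROGRESS: n2 holds b[0], whose other copy is initializing
```

In this example the cluster would be `yellow` because of `b[0]`, yet `n1` can be restarted immediately, while `n2` has to wait until the copy on `n3` has started.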