More discriminating RESTART shutdown logic #107909

Open
DaveCTurner opened this issue Apr 25, 2024 · 3 comments
Labels
:Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >enhancement Team:Distributed Meta label for distributed team

Comments

@DaveCTurner
Contributor

In a rolling restart we recommend users wait for the cluster health to reach green in between node restarts, and some users will also wait for rebalancing to complete each time. This is unnecessarily conservative: it's safe to restart a node while the cluster health is still yellow after the previous restart as long as the initializing shards are unrelated to the shards on the node that is to be restarted next.

It's not reasonable to ask users to compute when it's safe to restart a node themselves, but nor is it especially reasonable to wait for green health after each node since this may extend the restart time by hours or even days in a large cluster. I believe the shutdown API should be able to solve this by reporting shardMigrationStatus == COMPLETE on a RESTART shutdown when all the shards on the target node are fully replicated. That's different from today's behaviour in which a RESTART shutdown has shardMigrationStatus == COMPLETE immediately, forcing users to use other APIs (e.g. cluster health) to wait as necessary.
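To make the proposed condition concrete, here is a minimal sketch of the check described above. It is not Elasticsearch code: the `ShardCopy` record, the `State` enum and `safeToRestart` are hypothetical names invented for illustration, and "fully replicated" is assumed to mean that every copy of every shard hosted on the target node is in the STARTED state.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative only: these types are hypothetical and do not mirror Elasticsearch's routing classes.
public class RestartSafetySketch {
    enum State { STARTED, INITIALIZING, UNASSIGNED }

    // One copy (primary or replica) of a shard; nodeId is null when the copy is unassigned.
    record ShardCopy(String index, int shardId, String nodeId, State state) {}

    // Assumption: a node may restart once every copy of each shard it hosts is STARTED.
    // Shards that have no copy on the target node are ignored, even if they are still initializing.
    static boolean safeToRestart(String targetNode, List<ShardCopy> routingTable) {
        // Collect the identities of the shards that have a copy on the target node.
        Set<String> shardsOnTarget = routingTable.stream()
            .filter(copy -> targetNode.equals(copy.nodeId()))
            .map(copy -> copy.index() + "/" + copy.shardId())
            .collect(Collectors.toSet());
        // Every copy of those shards, on any node, must be started.
        return routingTable.stream()
            .filter(copy -> shardsOnTarget.contains(copy.index() + "/" + copy.shardId()))
            .allMatch(copy -> copy.state() == State.STARTED);
    }

    public static void main(String[] args) {
        // Cluster health is yellow here because a replica of metrics/0 is still initializing on node-3.
        List<ShardCopy> table = List.of(
            new ShardCopy("logs", 0, "node-1", State.STARTED),
            new ShardCopy("logs", 0, "node-2", State.STARTED),
            new ShardCopy("metrics", 0, "node-2", State.STARTED),
            new ShardCopy("metrics", 0, "node-3", State.INITIALIZING)
        );
        System.out.println("node-1: " + safeToRestart("node-1", table)); // prints "node-1: true"
        System.out.println("node-2: " + safeToRestart("node-2", table)); // prints "node-2: false"
    }
}
```

In this example node-1 could be restarted while the cluster is still yellow, because the initializing replica belongs to a shard that node-1 does not host, whereas node-2 would have to wait. A shard configured with no replicas passes this check trivially, just as it does not prevent green health today.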

@DaveCTurner DaveCTurner added >enhancement :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) labels Apr 25, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Apr 25, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@prathm3

prathm3 commented May 3, 2024

Hey @DaveCTurner, as I understand it, we want shardMigrationStatus to reflect the actual condition (NOT_STARTED, IN_PROGRESS, STALLED, COMPLETE) rather than have it set unconditionally when a RESTART shutdown (shutdownType) is triggered.
Looking at the code:

if (SingleNodeShutdownMetadata.Type.RESTART.equals(shutdownType)) {
    return new ShutdownShardMigrationStatus(
        SingleNodeShutdownMetadata.Status.COMPLETE,
        0,
        "no shard relocation is necessary for a node restart",
        null
    );
}

Here the status is marked COMPLETE whenever shutdownType is RESTART. If this condition were removed, I think the code would behave as we want, updating the status according to the other conditions (correct me if I am wrong here).
Am I missing something? Any pointers? TIA.

@DaveCTurner
Contributor Author

Hi @prathm3, thanks for your interest here. I'm not sure I understand your question, but are you asking because you're interested in contributing a solution? This is quite a subtle issue and needs some discussion by the team before we decide on a path forwards. I wouldn't recommend working on this area for now.
