
fix: use slot node id to detect node re-configuration #1913

Closed
wants to merge 1 commit

Conversation

vmihailenco
Collaborator

Fixes #1910

@pfreixes what do you think?

@pfreixes
Contributor

Seems OK to me. I can give it a try in an environment where I can reproduce the issue.

@vmihailenco
Collaborator Author

Okay, I've added the missing part - please give it a try.

@pfreixes
Contributor

OK, awesome. I'll try it today and let you know!

@pfreixes
Contributor

Tested and it's still failing. I'll keep working on this and see if there is a non-intrusive way of fixing it.

@vmihailenco
Collaborator Author

Okay, let me know if you find out why this approach does not work - it looked promising...

@pfreixes
Contributor

pfreixes commented Oct 2, 2021

Hi @vmihailenco, let me summarize why the fix proposed in this PR still has issues. But first, a recap of how AWS performs the rollout when replacing master instances (a rough timeline is sketched after the list):

  • First, the new instances appear as slaves, using a different ID but the same address as the master node they are meant to replace.
  • The new slaves sync all of the data.
  • Once the slaves have caught up with the masters, they are promoted to be the new masters and the former masters are demoted to slaves.
  • The DNS record for the address shared by the new master and the former one is updated; from then on it resolves to the new master, whereas before the change it resolved to the former master.
  • The former master is kept alive for a few minutes to redirect traffic to the new master.
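
For illustration, here is a rough timeline of what a client watching the cluster topology and DNS would observe during that rollout. This is only a sketch of the sequence described above; the phase names and struct are hypothetical, not part of go-redis or the AWS tooling.

```go
package main

import "fmt"

// phase summarizes one step of the AWS master-replacement rollout described above.
type phase struct {
	desc      string // what AWS is doing at this point
	newNodeID string // ID the replacement instance reports in CLUSTER SLOTS / CLUSTER NODES
	role      string // role of the replacement instance
	dnsTarget string // IP the shared hostname currently resolves to
}

func main() {
	rollout := []phase{
		{"replacement instance appears", "new-id", "slave", "old-master-ip"},
		{"replacement syncs the data", "new-id", "slave", "old-master-ip"},
		{"replacement promoted, former master demoted", "new-id", "master", "old-master-ip"},
		{"DNS record updated", "new-id", "master", "new-master-ip"},
		{"former master kept alive for a few minutes", "new-id", "master", "new-master-ip"},
	}
	for _, p := range rollout {
		fmt.Printf("%-45s id=%s role=%-6s dns->%s\n", p.desc, p.newNodeID, p.role, p.dnsTarget)
	}
}
```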

There are at least two issues with the solution proposed:

  • The new master, presented as a slave, joins the group with a new ID but the same addr. With the proposed change, this triggers the creation of a new Node from the very beginning, since a node with the same addr but a different ID appears from day zero while DNS still resolves to the former master. This prevents us from recycling the pool later on, because the new ID has already been used.

  • Even after fixing the previous issue, which might be considered cosmetic, there is still another problem. If we keep using the master that is about to be replaced until it starts returning MOVED X to Y, and we force a reload of the state at that moment, we recreate the node with the new ID, but nothing guarantees that DNS already points to the new IP address. If we are unlucky (AWS quotes up to 30 seconds for full DNS propagation, though most of the time it is considerably less) and the addr still resolves to the bad IP, i.e. the former master, the new node (new ID) ends up with a pool that keeps using the invalid IP address, and any further call to the lazy reload has no effect since no new node IDs appear. Eventually, once the former master is down, the connections in the pool get recycled and new DNS queries return the right IP address. (The sketch after this list illustrates both points.)
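
A minimal sketch of the failure mode described in these two points, assuming hypothetical names (node, nodeRegistry, getOrCreateNode); this is not the go-redis implementation, it only illustrates how keying nodes by addr plus ID interacts badly with delayed DNS propagation:

```go
package main

import (
	"fmt"
	"net"
)

// node stands in for a cluster node entry; in the real client each node owns a
// connection pool, here we only record the IP that pool would have dialed.
type node struct {
	id       string
	addr     string // hostname:port shared by the old and the new master
	pooledIP string // IP resolved when the node (and its pool) was created
}

type nodeRegistry struct {
	// Keying by addr+ID (the idea in this PR) means the replacement instance,
	// which shows up as a slave with a new ID but the same addr, already forces
	// a brand new node while DNS may still point at the former master.
	nodes map[string]*node
}

func resolve(host string) string {
	ips, err := net.LookupHost(host)
	if err != nil || len(ips) == 0 {
		return ""
	}
	return ips[0]
}

func (r *nodeRegistry) getOrCreateNode(addr, id string) *node {
	key := addr + "/" + id
	if n, ok := r.nodes[key]; ok {
		return n
	}
	host, _, _ := net.SplitHostPort(addr)
	// The pool is built from whatever DNS returns *now*. If the record has not
	// propagated yet, the "new" node keeps dialing the former master, and later
	// lazy reloads change nothing because the node ID never changes again.
	n := &node{id: id, addr: addr, pooledIP: resolve(host)}
	r.nodes[key] = n
	return n
}

func main() {
	r := &nodeRegistry{nodes: map[string]*node{}}
	n := r.getOrCreateNode("shard-1.example.com:6379", "new-node-id")
	fmt.Printf("node %s dials %s until its pool is recycled\n", n.id, n.pooledIP)
}
```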

To address this scenario, I've created PR #1914, which follows the original idea that was discussed. I've tested it and it mostly addresses the issue, but due to DNS propagation roughly half of the requests fail for about 20 seconds. Still, it's a considerable improvement, considering that without the change the downtime is total (all requests affected) for more than 10 minutes.

@vmihailenco
Collaborator Author

vmihailenco commented Oct 3, 2021

Thanks for the explanation and for the new PR.

Do you think we should still merge this PR, or does it make the situation worse?

@pfreixes
Contributor

pfreixes commented Oct 4, 2021

I don't think this PR provides value; indeed, it surfaces the error in a different way, reporting pool closed, which is harder to understand.

IMO the way to go (though not a must) for adding the ID as part of the state would be to index the nodes by ID instead of by addr, but in any case this would be more of a "cosmetic" change that would not help with the issue presented. A rough sketch of what that could look like follows.
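
Purely for illustration, a minimal sketch of indexing nodes by ID, with hypothetical types and names that are not the go-redis implementation; as said above, this would be cosmetic and would not fix the DNS propagation problem:

```go
package main

import "fmt"

type clusterNode struct {
	id   string
	addr string
}

type clusterNodes struct {
	byID map[string]*clusterNode // keyed by node ID rather than by addr
}

// upsert reuses the existing node for a known ID (even if its addr changed) and
// creates a fresh node, with a fresh pool, only when a genuinely new ID appears.
func (c *clusterNodes) upsert(id, addr string) *clusterNode {
	if n, ok := c.byID[id]; ok {
		n.addr = addr // an addr change is just a metadata update
		return n
	}
	n := &clusterNode{id: id, addr: addr}
	c.byID[id] = n
	return n
}

func main() {
	c := &clusterNodes{byID: map[string]*clusterNode{}}
	c.upsert("old-master-id", "shard-1.example.com:6379")
	// The replacement master keeps the same addr but has a new ID, so it gets
	// its own entry; the former master's entry can be dropped later.
	c.upsert("new-master-id", "shard-1.example.com:6379")
	fmt.Println(len(c.byID), "nodes tracked")
}
```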

So, to sum up, I would close this PR.

@vmihailenco
Collaborator Author

Thanks, makes sense.

@vmihailenco vmihailenco closed this Oct 4, 2021
@monkey92t monkey92t deleted the fix/moved-node-id branch March 6, 2023 12:52