
fix: use slot node id to detect node re-configuration #1913

Closed
wants to merge 1 commit

Conversation

vmihailenco
Collaborator

Fixes #1910

@pfreixes what do you think?

@pfreixes
Contributor

Seems OK to me. I can give it a try in an environment where I can reproduce the issue.

@vmihailenco
Collaborator Author

Okay, I've added the missing part - please give it a try.

@pfreixes
Contributor

OK, awesome. I'll try it today and let you know!

@pfreixes
Contributor

Tested and it's still failing. I'll keep working on this and see if there is a non-intrusive way of fixing it.

@vmihailenco
Collaborator Author

Okay, let me know if you find out why this approach does not work - it looked promising...

@pfreixes
Contributor

pfreixes commented Oct 2, 2021

Hi @vmihailenco, let me summarize why the fix proposed in this PR still has issues. But first, a recap of how AWS performs the rollout when replacing master instances (a rough timeline is sketched after the list):

  • First, the new instances appear as slaves, using a different ID but the same address as the master node they are meant to replace.
  • The new slaves sync all of the data.
  • Once the slaves have caught up with the masters, they are promoted to be the new masters and the former masters are demoted to slaves.
  • The DNS record for the address shared by the new master and the former one is updated; from then on it resolves to the new master, whereas before the change it resolved to the former master.
  • The former master is kept alive for a few minutes to redirect traffic to the new master.
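
For illustration, here is a rough timeline of what a client watching the cluster topology and DNS would observe during that rollout. This is only a sketch of the sequence described above; the phase names and struct are hypothetical, not part of go-redis or the AWS tooling.

```go
package main

import "fmt"

// phase summarizes one step of the AWS master-replacement rollout described above.
type phase struct {
	desc      string // what AWS is doing at this point
	newNodeID string // ID the replacement instance reports in CLUSTER SLOTS / CLUSTER NODES
	role      string // role of the replacement instance
	dnsTarget string // IP the shared hostname currently resolves to
}

func main() {
	rollout := []phase{
		{"replacement instance appears", "new-id", "slave", "old-master-ip"},
		{"replacement syncs the data", "new-id", "slave", "old-master-ip"},
		{"replacement promoted, former master demoted", "new-id", "master", "old-master-ip"},
		{"DNS record updated", "new-id", "master", "new-master-ip"},
		{"former master kept alive for a few minutes", "new-id", "master", "new-master-ip"},
	}
	for _, p := range rollout {
		fmt.Printf("%-45s id=%s role=%-6s dns->%s\n", p.desc, p.newNodeID, p.role, p.dnsTarget)
	}
}
```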

There are at least two issues with the solution proposed:

  • The new master, presented as a slave, joins the group with a new ID but the same addr. With the proposed change, this triggers the creation of a new Node from the very beginning, since a node with the same addr but a different ID appears from day zero while DNS still resolves to the former master. This prevents us from recycling the pool later on, because the new ID has already been used.

  • Even after fixing the previous issue, which might be considered cosmetic, there is still another problem. If we keep using the master that is about to be replaced until it starts returning MOVED X to Y, and we force a reload of the state at that moment, we recreate the node with the new ID, but nothing guarantees that DNS already points to the new IP address. If we are unlucky (AWS quotes up to 30 seconds for full DNS propagation, though most of the time it is considerably less) and the addr still resolves to the bad IP, i.e. the former master, the new node (new ID) ends up with a pool that keeps using the invalid IP address, and any further call to the lazy reload has no effect since no new node IDs appear. Eventually, once the former master is down, the connections in the pool get recycled and new DNS queries return the right IP address. (The sketch after this list illustrates both points.)
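
A minimal sketch of the failure mode described in these two points, assuming hypothetical names (node, nodeRegistry, getOrCreateNode); this is not the go-redis implementation, it only illustrates how keying nodes by addr plus ID interacts badly with delayed DNS propagation:

```go
package main

import (
	"fmt"
	"net"
)

// node stands in for a cluster node entry; in the real client each node owns a
// connection pool, here we only record the IP that pool would have dialed.
type node struct {
	id       string
	addr     string // hostname:port shared by the old and the new master
	pooledIP string // IP resolved when the node (and its pool) was created
}

type nodeRegistry struct {
	// Keying by addr+ID (the idea in this PR) means the replacement instance,
	// which shows up as a slave with a new ID but the same addr, already forces
	// a brand new node while DNS may still point at the former master.
	nodes map[string]*node
}

func resolve(host string) string {
	ips, err := net.LookupHost(host)
	if err != nil || len(ips) == 0 {
		return ""
	}
	return ips[0]
}

func (r *nodeRegistry) getOrCreateNode(addr, id string) *node {
	key := addr + "/" + id
	if n, ok := r.nodes[key]; ok {
		return n
	}
	host, _, _ := net.SplitHostPort(addr)
	// The pool is built from whatever DNS returns *now*. If the record has not
	// propagated yet, the "new" node keeps dialing the former master, and later
	// lazy reloads change nothing because the node ID never changes again.
	n := &node{id: id, addr: addr, pooledIP: resolve(host)}
	r.nodes[key] = n
	return n
}

func main() {
	r := &nodeRegistry{nodes: map[string]*node{}}
	n := r.getOrCreateNode("shard-1.example.com:6379", "new-node-id")
	fmt.Printf("node %s dials %s until its pool is recycled\n", n.id, n.pooledIP)
}
```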

To address this scenario, I've created PR #1914, which follows the original idea that was discussed. I've tested it and it mostly addresses the issue, but due to DNS propagation roughly half of the requests fail for about 20 seconds. Still, it's a considerable improvement, considering that without the change the downtime is total (all requests affected) for more than 10 minutes.

@vmihailenco
Collaborator Author

vmihailenco commented Oct 3, 2021

Thanks for the explanation and for the new PR.

Do you think we should still merge this PR, or does it make the situation worse?

@pfreixes
Contributor

pfreixes commented Oct 4, 2021

I don't think this PR provides value; indeed, it surfaces the error in a different way, reporting pool closed, which is harder to understand.

IMO the way to go (though not a must) for adding the ID as part of the state would be to index the nodes by ID instead of by addr, but in any case this would be more of a "cosmetic" change that would not help with the issue presented. A rough sketch of what that could look like follows.
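
Purely for illustration, a minimal sketch of indexing nodes by ID, with hypothetical types and names that are not the go-redis implementation; as said above, this would be cosmetic and would not fix the DNS propagation problem:

```go
package main

import "fmt"

type clusterNode struct {
	id   string
	addr string
}

type clusterNodes struct {
	byID map[string]*clusterNode // keyed by node ID rather than by addr
}

// upsert reuses the existing node for a known ID (even if its addr changed) and
// creates a fresh node, with a fresh pool, only when a genuinely new ID appears.
func (c *clusterNodes) upsert(id, addr string) *clusterNode {
	if n, ok := c.byID[id]; ok {
		n.addr = addr // an addr change is just a metadata update
		return n
	}
	n := &clusterNode{id: id, addr: addr}
	c.byID[id] = n
	return n
}

func main() {
	c := &clusterNodes{byID: map[string]*clusterNode{}}
	c.upsert("old-master-id", "shard-1.example.com:6379")
	// The replacement master keeps the same addr but has a new ID, so it gets
	// its own entry; the former master's entry can be dropped later.
	c.upsert("new-master-id", "shard-1.example.com:6379")
	fmt.Println(len(c.byID), "nodes tracked")
}
```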

So, to sum up, I would close this PR.

@vmihailenco
Collaborator Author

Thanks, makes sense.

@vmihailenco vmihailenco closed this Oct 4, 2021
@monkey92t monkey92t deleted the fix/moved-node-id branch March 6, 2023 12:52