Question: is it possible to replace other node in a list by a node with the same IP? #241

b10s · 2021-08-23T17:08:02Z

Hi,

I have a question about the gossip implementation used in Prometheus alertmanager.

We currently have an alertmanager HA cluster with three nodes configured at startup in the config file, yet it is running with four nodes. The fourth node joined the cluster some time later from another k8s namespace.

I suspect the gossip protocol. If the list of available members is just a list of IPs, and keeping in mind that in a k8s cluster a new Pod can get the IP of some Pod that has just died, then (e.g. during cluster maintenance, when many Pods are restarted) some Pod can hijack the IP of a legitimate alertmanager Pod.
The new Pod is part of another alertmanager setup which has only one peer.

In other words, the gossip implementation may trust a new peer that has the same IP. (?)

My question is: can this library be used in such a way that a new peer ends up in the gossip table of available peers even though it was not there at the beginning?

E.g.:
alertmanagerA at start time:
peer1 => peer1.a.com (1.1.1.1)
peers: 1.1.1.1

alertmanagerB at start time:
peer1 => peer1.b.com (2.2.2.2)
peer2 => peer2.b.com (3.3.3.3)
peer3 => peer3.b.com (4.4.4.4)
peers: 2.2.2.2, 3.3.3.3, 4.4.4.4

At some point the Pods are restarted, not all at the same time, so that:
alertmanagerA:
peer1 => peer1.a.com (2.2.2.2)
peers: 2.2.2.2, 22.22.22.22, 33.33.33.33, 44.44.44.44

alertmanagerB:
peer1 => peer1.b.com (22.22.22.22)
peer2 => peer2.b.com (33.33.33.33)
peer3 => peer3.b.com (44.44.44.44)
peers: 2.2.2.2, 22.22.22.22, 33.33.33.33, 44.44.44.44

peer1 of alertmanagerA accidentally got the former IP of peer1 of alertmanagerB.
Therefore the peers of alertmanagerB are able to share the list of active peers with the peer at IP 2.2.2.2, and the peer at IP 2.2.2.2 in its turn sends liveness pings to the extended list of nodes.
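
The scenario above can be sketched as a toy model. This is only an illustration under simplifying assumptions: `Node`, `restart`, `gossip_round` and `simulate_ip_reuse` are hypothetical names, peers are keyed by IP only, there is no failure detection, and only two of the Pods restart. It is not how the gossip library is actually implemented.

```python
class Node:
    """One gossip participant that tracks peers purely by IP address."""

    def __init__(self, name, addr, seeds=()):
        self.name = name
        self.addr = addr
        self.known = {addr, *seeds}  # addresses only, no cluster identity


def restart(node, new_addr, seeds=()):
    """A restarted Pod comes back with a new IP and only its configured seeds."""
    node.addr = new_addr
    node.known = {new_addr, *seeds}


def gossip_round(nodes):
    """Each node exchanges tables with whoever currently answers at a known IP."""
    by_addr = {n.addr: n for n in nodes}
    for node in nodes:
        for addr in list(node.known):
            target = by_addr.get(addr)
            if target is not None and target is not node:
                merged = node.known | target.known
                node.known, target.known = set(merged), set(merged)


def simulate_ip_reuse():
    """Replay a reduced version of the example: A's only peer reuses 2.2.2.2."""
    a1 = Node("a-peer1", "1.1.1.1")
    b1 = Node("b-peer1", "2.2.2.2", seeds=("3.3.3.3", "4.4.4.4"))
    b2 = Node("b-peer2", "3.3.3.3", seeds=("2.2.2.2", "4.4.4.4"))
    b3 = Node("b-peer3", "4.4.4.4", seeds=("2.2.2.2", "3.3.3.3"))
    nodes = [a1, b1, b2, b3]
    gossip_round(nodes)
    before = set(a1.known)  # cluster A is still isolated here

    # b-peer1 restarts and gets a fresh IP; a-peer1 restarts and inherits 2.2.2.2.
    restart(b1, "22.22.22.22", seeds=("3.3.3.3", "4.4.4.4"))
    restart(a1, "2.2.2.2")
    gossip_round(nodes)  # b-peer2 and b-peer3 still hold the stale 2.2.2.2 entry
    return before, {n.name: sorted(n.known) for n in nodes}


if __name__ == "__main__":
    before, tables = simulate_ip_reuse()
    print("a-peer1 before:", sorted(before))
    for name, known in tables.items():
        print(name, known)
```

In this model the stale 2.2.2.2 entry held by the non-restarted peers is enough for a-peer1 to receive the full table of cluster B.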

Is such a situation possible in theory?

I'll try to reproduce it in a small k8s cluster with a reduced pool of available Pod IPs.

b10s · 2021-09-03T08:04:19Z

It seems it is! I can reproduce it in a kind cluster.

I have two alertmanager clusters with the same config. One of them is:

 Args:
      --storage.path=/alertmanager
      --config.file=/config_out/alertmanager.yml
      --cluster.advertise-address=$(POD_IP):9094
      --cluster.listen-address=0.0.0.0:9094
      --cluster.peer=my-release-alertmanager-0.my-release-alertmanager-headless:9094
      --cluster.peer=my-release-alertmanager-1.my-release-alertmanager-headless:9094
      --cluster.peer=my-release-alertmanager-2.my-release-alertmanager-headless:9094

As you can see, there are only three peers.

Before making them switch IPs, the IP assignment was:

After restarting a few Pods a few times, I can make them reuse IPs:

Since the other Pods were not restarted, they still keep the old IPs in their gossip table of available peers. Therefore the two clusters merge into one:

UPD: steps to reproduce

start your kind cluster:

$ kind create cluster
...

deploy two alertmanager clusters into it:

$ helm install my-release foo/bar
$ helm install my-bad-release foo/bar

find the container running your kind k8s cluster and enter it:

$ docker exec -it 942e41a1c6e6 bash

inside the container, change the CNI settings and restart kubelet:

# sed -i 's/"subnet": "10.244.0.0\/24"/"subnet": "10.244.0.0\/28"/g' /etc/cni/net.d/10-kindnet.conflist
# systemctl restart kubelet

create a few more Pods with nginx to make sure there are no more available IPs
delete one alertmanager Pod from one cluster and one from the other in the same command, so there is a chance they will reuse each other's IPs
enjoy the merged alertmanager cluster
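
To show why the subnet change in the `sed` step makes IP reuse likely, here is a small sketch that counts the addresses in each range. `pool_size` is a hypothetical helper; the number of IPs the CNI plugin actually hands out to Pods is probably a little lower, since some addresses are typically reserved.

```python
import ipaddress


def pool_size(cidr):
    """Return the total number of addresses contained in a CIDR range."""
    return ipaddress.ip_network(cidr).num_addresses


if __name__ == "__main__":
    for cidr in ("10.244.0.0/24", "10.244.0.0/28"):
        print(cidr, pool_size(cidr))
```

With so few addresses left, a Pod that is deleted and recreated has a good chance of receiving an IP that another Pod just released.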

b10s · 2021-09-06T05:15:08Z

What I think is:

the gossip protocol itself is fine with such a situation because it doesn't care about security or about a list of desired peers.
alertmanager probably should compare the peers from its config with the peers obtained from the gossip library as a basic check (not sure what to do if the lists differ).
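
A minimal sketch of the basic check suggested above, assuming the configured peers are hostnames and the gossip table is a plain list of IPs. `resolve` and `unexpected_members` are hypothetical helpers for illustration, not part of alertmanager or of the gossip library.

```python
import socket


def resolve(host):
    """Resolve a hostname to the set of IPv4 addresses it currently points to."""
    infos = socket.getaddrinfo(host, None, family=socket.AF_INET)
    return {info[4][0] for info in infos}


def unexpected_members(configured_hosts, gossip_addrs, resolver=resolve):
    """Return gossip members whose IP does not belong to any configured peer."""
    expected = set()
    for host in configured_hosts:
        expected |= resolver(host)
    return sorted(set(gossip_addrs) - expected)


if __name__ == "__main__":
    # Fake DNS data taken from the alertmanagerB example after the restarts.
    dns = {
        "peer1.b.com": {"22.22.22.22"},
        "peer2.b.com": {"33.33.33.33"},
        "peer3.b.com": {"44.44.44.44"},
    }
    gossip = ["2.2.2.2", "22.22.22.22", "33.33.33.33", "44.44.44.44"]
    print(unexpected_members(dns, gossip, resolver=dns.__getitem__))
```

The resolver is passed in as a parameter so the check can be tried without real DNS. What to do with an unexpected member (log it, ignore it, or refuse it) remains an open question.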

b10s mentioned this issue Sep 6, 2021

[Kubernetes] Peers ip caching causes all clusters to degrade over time prometheus/alertmanager#2250

Closed
