Question: can a node with the same IP replace another node in the member list? #241

Open
b10s opened this issue Aug 23, 2021 · 2 comments


@b10s

b10s commented Aug 23, 2021

Hi,

I have a question about the gossip implementation used in Prometheus Alertmanager.

Currently we have an Alertmanager HA cluster that is configured with three nodes at startup, but it is running with 4 nodes: one node from another k8s namespace joined the cluster some time later.

I suspect the gossip protocol. If the list of available members is just a list of IPs, and keeping in mind that in a k8s cluster a new Pod can get the IP of some random Pod that just died, then (e.g. during cluster maintenance, when many Pods are restarted) some Pod can hijack the IP of a legitimate Alertmanager Pod.
The new Pod is part of another Alertmanager setup, which has only one peer.

In other words, the gossip implementation may trust a new peer just because it has the same IP. (?)

My question is: can this library end up being used in such a way that a new peer appears in the gossip table of available peers even though it was not there at the beginning?

E.g.:
alertmanagerA at start time:
peer1 => peer1.a.com (1.1.1.1)
peers: 1.1.1.1

alertmanagerB at start time:
peer1 => peer1.b.com (2.2.2.2)
peer2 => peer2.b.com (3.3.3.3)
peer3 => peer3.b.com (4.4.4.4)
peers: 2.2.2.2, 3.3.3.3, 4.4.4.4

At some point the Pods are restarted, not all at the same time, so that:
alertmanagerA:
peer1 => peer1.a.com (2.2.2.2)
peers: 2.2.2.2, 22.22.22.22, 33.33.33.33, 44.44.44.44

alertmanagerB:
peer1 => peer1.b.com (22.22.22.22)
peer2 => peer2.b.com (33.33.33.33)
peer3 => peer3.b.com (44.44.44.44)
peers: 2.2.2.2, 22.22.22.22, 33.33.33.33, 44.44.44.44

peer1 of alertmanagerA accidentally got the old IP of peer1 of alertmanagerB.
Therefore the peers of alertmanagerB are able to share their list of active peers with the peer at 2.2.2.2, and that peer in turn sends liveness pings to the extended list of nodes.
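To make the suspected mechanism concrete, here is a toy model of the example above. It assumes, as this issue hypothesizes, that a node's peer table is keyed only by IP and that whoever answers on a known IP is trusted. It is not how hashicorp/memberlist is implemented (it does not reproduce memberlist's data structures or message types), and `node`, `newNode`, `gossipRound` and `sortedKeys` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a hypothetical, simplified gossip member whose peer table is
// keyed only by IP address (the assumption made in this issue).
type node struct {
	name  string
	addr  string
	peers map[string]bool // known peer addresses, including its own
}

func newNode(name, addr string, peers ...string) *node {
	n := &node{name: name, addr: addr, peers: map[string]bool{addr: true}}
	for _, p := range peers {
		n.peers[p] = true
	}
	return n
}

func sortedKeys(m map[string]bool) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// gossipRound makes every node exchange full tables with whichever node
// currently answers on each address it knows. There is no identity check:
// any responder on a known IP is trusted. Reports whether a table grew.
func gossipRound(network map[string]*node) bool {
	changed := false
	for _, n := range network {
		for _, addr := range sortedKeys(n.peers) { // snapshot before mutating
			other, ok := network[addr]
			if !ok || other == n {
				continue // nobody answers on this IP, or it is ourselves
			}
			for p := range other.peers {
				if !n.peers[p] {
					n.peers[p], changed = true, true
				}
			}
			for p := range n.peers {
				if !other.peers[p] {
					other.peers[p], changed = true, true
				}
			}
		}
	}
	return changed
}

func main() {
	// State after the restarts: peer1.a.com now owns 2.2.2.2, and the
	// alertmanagerB peers still list 2.2.2.2 as a stale entry.
	a1 := newNode("peer1.a.com", "2.2.2.2")
	b1 := newNode("peer1.b.com", "22.22.22.22", "2.2.2.2", "33.33.33.33", "44.44.44.44")
	b2 := newNode("peer2.b.com", "33.33.33.33", "2.2.2.2", "22.22.22.22", "44.44.44.44")
	b3 := newNode("peer3.b.com", "44.44.44.44", "2.2.2.2", "22.22.22.22", "33.33.33.33")

	// network maps each IP to the Pod that currently answers on it.
	network := map[string]*node{}
	for _, n := range []*node{a1, b1, b2, b3} {
		network[n.addr] = n
	}
	for gossipRound(network) {
	}
	fmt.Println(a1.name, "knows", sortedKeys(a1.peers))
	// prints: peer1.a.com knows [2.2.2.2 22.22.22.22 33.33.33.33 44.44.44.44]
}
```

Under these assumptions, alertmanagerA's single peer ends up with the same table as alertmanagerB, matching the peer lists in the example.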

Is such a situation possible in theory?

I'll try to reproduce it in a small k8s cluster with a reduced pool of available Pod IPs.

@b10s
Author

b10s commented Sep 3, 2021

It seems it is! I can reproduce it in a kind cluster.

I have two Alertmanager clusters with the same config. The args of one of them are:

 Args:
      --storage.path=/alertmanager
      --config.file=/config_out/alertmanager.yml
      --cluster.advertise-address=$(POD_IP):9094
      --cluster.listen-address=0.0.0.0:9094
      --cluster.peer=my-release-alertmanager-0.my-release-alertmanager-headless:9094
      --cluster.peer=my-release-alertmanager-1.my-release-alertmanager-headless:9094
      --cluster.peer=my-release-alertmanager-2.my-release-alertmanager-headless:9094

As you can see, only three peers are listed here.

Before I made them switch IPs, the IP assignment was:

[Screenshot: Pod IP assignment before the restarts]

After restarting a few Pods a few times, I can make them reuse each other's IPs:

[Screenshot: Pod IP assignment after the restarts, with IPs reused]

Since the other Pods were not restarted, they still keep the old IPs in their gossip table of available peers. Therefore the two clusters merge into one:
[Screenshot: the two Alertmanager clusters merged into one]

UPD: steps to reproduce:

  1. start your kind cluster:
     $ kind create cluster
     ...
  2. deploy two Alertmanager clusters:
     $ helm install my-release foo/bar
     $ helm install my-bad-release foo/bar
  3. find your kind k8s cluster container and enter it:
     $ docker exec -it 942e41a1c6e6 bash
  4. inside the container, shrink the Pod subnet in the CNI settings and restart kubelet:
     # sed -i 's/"subnet": "10.244.0.0\/24"/"subnet": "10.244.0.0\/28"/g' /etc/cni/net.d/10-kindnet.conflist
     # systemctl restart kubelet
  5. create a few more nginx Pods to make sure no more IPs are available
  6. delete one Alertmanager Pod from each cluster using the same command, so there is a chance they reuse each other's IPs
  7. enjoy the merged Alertmanager cluster

@b10s
Author

b10s commented Sep 6, 2021

What I think:

  • the gossip protocol itself behaves as designed in this situation, because it is not concerned with security or with a list of desired peers.
  • Alertmanager should probably compare the peers from its config with the peers reported by the gossip library as a basic check (not sure what to do if the lists differ).
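A minimal sketch of that basic check, assuming the `--cluster.peer` names have already been re-resolved to their current "ip:port" addresses at check time (a stale resolution would still contain the reused IP and hide the problem). `unexpectedPeers` is a hypothetical helper, not an existing Alertmanager or memberlist function, and the addresses come from the example earlier in this issue.

```go
package main

import (
	"fmt"
	"sort"
)

// unexpectedPeers returns the gossip-learned addresses that are not among
// the configured peer addresses. Hypothetical helper: both inputs are
// plain "ip:port" strings supplied by the caller.
func unexpectedPeers(configured, gossiped []string) []string {
	want := make(map[string]bool, len(configured))
	for _, a := range configured {
		want[a] = true
	}
	var extra []string
	for _, a := range gossiped {
		if !want[a] {
			extra = append(extra, a)
		}
	}
	sort.Strings(extra)
	return extra
}

func main() {
	// alertmanagerB after the restarts: its configured peers now resolve
	// to new IPs, but gossip still reports the reused 2.2.2.2.
	configured := []string{"22.22.22.22:9094", "33.33.33.33:9094", "44.44.44.44:9094"}
	gossiped := []string{"2.2.2.2:9094", "22.22.22.22:9094", "33.33.33.33:9094", "44.44.44.44:9094"}
	fmt.Println(unexpectedPeers(configured, gossiped))
	// prints: [2.2.2.2:9094]
}
```

What to do with a flagged member (log it, drop it, or refuse to gossip with it) is left open here, as in the comment above.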
