
Cannot add new nodes to cluster after one node leaves #229

Open
danthegoodman1 opened this issue Nov 19, 2020 · 4 comments


It seems that whenever a node leaves the cluster and then rejoins, I get failed-ack and "handler queue full" logs from the node still in the cluster.

Is there any sort of clean-up I need to do on rejoin to clear the queue or the acks? Everything works fine with joining until one node leaves and then tries to rejoin...

Scenario:

  • Node A starts as the first node in the cluster.
  • Node B joins the cluster; the connection is fine.
  • Node B is killed and, after 3 failed acks, is marked as dead.
  • Node B is restarted, and the following messages are seen:

Node A:

A node has joined: m-127.0.0.1:8888
## NODE B KILLED
2020/11/19 12:14:33 [DEBUG] memberlist: Failed ping: m-127.0.0.1:8888 (timeout reached)
2020/11/19 12:14:34 [INFO] memberlist: Suspect m-127.0.0.1:8888 has failed, no acks received
2020/11/19 12:14:34 [INFO] memberlist: Suspect m-127.0.0.1:8888 has failed, no acks received
2020/11/19 12:14:36 [INFO] memberlist: Suspect m-127.0.0.1:8888 has failed, no acks received
2020/11/19 12:14:37 [INFO] memberlist: Suspect m-127.0.0.1:8888 has failed, no acks received
2020/11/19 12:14:37 [INFO] memberlist: Marking m-127.0.0.1:8888 as failed, suspect timeout reached (0 peer confirmations)
A node has left: m-127.0.0.1:8888
## NODE B RESTARTED
2020/11/19 12:14:39 [DEBUG] memberlist: Stream connection from=127.0.0.1:53980
2020/11/19 12:14:42 [WARN] memberlist: handler queue full, dropping message (3) from=127.0.0.1:8888
2020/11/19 12:14:43 [WARN] memberlist: handler queue full, dropping message (3) from=127.0.0.1:8888

Changing the port and name doesn't seem to make a difference; the same thing happens whether or not it is the same node with the same name.

Node B (on rejoin):

2020/11/19 12:14:39 [DEBUG] memberlist: Initiating push/pull sync with:  127.0.0.1:4444
2020/11/19 12:14:39 [WARN] memberlist: Refuting a suspect message (from: m-127.0.0.1:8888)
A node has joined: m-127.0.0.1:4444
2020/11/19 12:14:40 [INFO] memberlist: Suspect m-127.0.0.1:4444 has failed, no acks received
2020/11/19 12:14:42 [INFO] memberlist: Suspect m-127.0.0.1:4444 has failed, no acks received
2020/11/19 12:14:43 [INFO] memberlist: Marking m-127.0.0.1:4444 as failed, suspect timeout reached (0 peer confirmations)
A node has left: m-127.0.0.1:4444
2020/11/19 12:14:43 [INFO] memberlist: Suspect m-127.0.0.1:4444 has failed, no acks received

Example Code Snippet:

package main

import (
	"flag"
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/memberlist"
)

// Assumed declarations: the original snippet takes the interface and
// port as CLI flags and stores the list in package-level variables.
var (
	NodeInterface = flag.String("interface", "127.0.0.1", "address to bind to")
	NodePort      = flag.Int("port", 8888, "port to bind to")
	MemberName    string
	MemberList    *memberlist.Memberlist
)

type eventDelegate struct{}

func (ed *eventDelegate) NotifyJoin(node *memberlist.Node) {
	fmt.Println("A node has joined: " + node.String())
}

func (ed *eventDelegate) NotifyLeave(node *memberlist.Node) {
	fmt.Println("A node has left: " + node.String())
}

func (ed *eventDelegate) NotifyUpdate(node *memberlist.Node) {
	fmt.Println("A node was updated: " + node.String())
}

func BeginClusterDiscovery() {
	log.Println("Beginning cluster discovery...")
	log.Println(*NodeInterface, *NodePort) // These are taken as CLI flags
	MemberName = fmt.Sprintf("m-%s:%d", *NodeInterface, *NodePort)

	var err error
	MemberList, err = memberlist.Create(&memberlist.Config{
		ProtocolVersion:     5,
		BindAddr:            *NodeInterface,
		BindPort:            *NodePort,
		AdvertiseAddr:       *NodeInterface,
		AdvertisePort:       *NodePort,
		TCPTimeout:          time.Second,
		IndirectChecks:      1,
		RetransmitMult:      2,
		SuspicionMult:       3,
		PushPullInterval:    15 * time.Second,
		ProbeTimeout:        200 * time.Millisecond,
		ProbeInterval:       time.Second,
		GossipInterval:      100 * time.Millisecond,
		GossipToTheDeadTime: 15 * time.Second,
		Name:                MemberName,
		Events:              &eventDelegate{},
	})
	if err != nil { // the original discarded this error with `_`
		log.Fatalln("failed to create memberlist:", err)
	}
}
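
For completeness: the snippet above only shows Create; a restarted node also has to contact a surviving peer to rejoin. A minimal sketch of that step, assuming BeginClusterDiscovery has already run and that 127.0.0.1:4444 (Node A in the logs above) is the seed address; the helper name JoinCluster is hypothetical:

// Hypothetical rejoin helper: contact a seed node and perform a
// push/pull state sync. Assumes MemberList was set by
// BeginClusterDiscovery above.
func JoinCluster(seed string) error {
	// e.g. seed = "127.0.0.1:4444" (Node A above)
	n, err := MemberList.Join([]string{seed})
	if err != nil {
		return fmt.Errorf("failed to join cluster: %w", err)
	}
	log.Printf("successfully contacted %d node(s)", n)
	return nil
}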
sandyydk commented May 6, 2021

Did you try list.Leave(timeout)? For a node exiting the memberlist, it would be good practice to leave gracefully.
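
For reference, a minimal sketch of such a graceful exit, assuming MemberList is the *memberlist.Memberlist from the snippet above and that a 5-second broadcast timeout is acceptable (both are assumptions, not from the original code):

// Leave broadcasts an intentional leave so peers mark this node as
// having left rather than failed; Shutdown then stops the background
// listeners. The 5-second timeout is an assumed value.
if err := MemberList.Leave(5 * time.Second); err != nil {
	log.Println("graceful leave failed:", err)
}
if err := MemberList.Shutdown(); err != nil {
	log.Println("shutdown failed:", err)
}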

danthegoodman1 (Author) commented

> Did you try list.Leave(timeout)? For a node exiting the memberlist, it would be good practice to leave gracefully.

@sandyydk I had not; I went with another solution a while ago. But graceful shutdown wouldn't account for node crashes or network downtime, no?

sandyydk commented May 6, 2021

> @sandyydk I had not; I went with another solution a while ago. But graceful shutdown wouldn't account for node crashes or network downtime, no?

Right, it wouldn't. What fixed it for you?

danthegoodman1 (Author) commented

@sandyydk Nothing fixed it; I went with a custom gossip protocol implementation instead.
