
Cannot add new nodes to cluster after one node leaves #229

Open
danthegoodman1 opened this issue Nov 19, 2020 · 4 comments


It seems that whenever a node leaves the cluster and then rejoins, I get failed-ack and "handler queue full" logs from the node still in the cluster.

Is there any sort of clean-up I need to do on rejoin to clear the queue or the acks? Everything works fine with joining until one node leaves and then tries to rejoin...

Scenario:

  • Node A starts as the first node in the cluster.
  • Node B joins the cluster; the connection is fine.
  • Node B is killed and, after 3 failed acks, is marked as dead.
  • Node B is restarted, and the following messages are seen:

Node A:

A node has joined: m-127.0.0.1:8888
## NODE B KILLED
2020/11/19 12:14:33 [DEBUG] memberlist: Failed ping: m-127.0.0.1:8888 (timeout reached)
2020/11/19 12:14:34 [INFO] memberlist: Suspect m-127.0.0.1:8888 has failed, no acks received
2020/11/19 12:14:34 [INFO] memberlist: Suspect m-127.0.0.1:8888 has failed, no acks received
2020/11/19 12:14:36 [INFO] memberlist: Suspect m-127.0.0.1:8888 has failed, no acks received
2020/11/19 12:14:37 [INFO] memberlist: Suspect m-127.0.0.1:8888 has failed, no acks received
2020/11/19 12:14:37 [INFO] memberlist: Marking m-127.0.0.1:8888 as failed, suspect timeout reached (0 peer confirmations)
A node has left: m-127.0.0.1:8888
## NODE B RESTARTED
2020/11/19 12:14:39 [DEBUG] memberlist: Stream connection from=127.0.0.1:53980
2020/11/19 12:14:42 [WARN] memberlist: handler queue full, dropping message (3) from=127.0.0.1:8888
2020/11/19 12:14:43 [WARN] memberlist: handler queue full, dropping message (3) from=127.0.0.1:8888

Changing the port and name doesn't seem to make a difference; the same thing happens whether or not it is the same node with the same name.

Node B (on rejoin):

2020/11/19 12:14:39 [DEBUG] memberlist: Initiating push/pull sync with:  127.0.0.1:4444
2020/11/19 12:14:39 [WARN] memberlist: Refuting a suspect message (from: m-127.0.0.1:8888)
A node has joined: m-127.0.0.1:4444
2020/11/19 12:14:40 [INFO] memberlist: Suspect m-127.0.0.1:4444 has failed, no acks received
2020/11/19 12:14:42 [INFO] memberlist: Suspect m-127.0.0.1:4444 has failed, no acks received
2020/11/19 12:14:43 [INFO] memberlist: Marking m-127.0.0.1:4444 as failed, suspect timeout reached (0 peer confirmations)
A node has left: m-127.0.0.1:4444
2020/11/19 12:14:43 [INFO] memberlist: Suspect m-127.0.0.1:4444 has failed, no acks received

Example Code Snippet:

package main

import (
	"flag"
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/memberlist"
)

// Assumed declarations: the original snippet takes the interface and
// port as CLI flags and stores the list in package-level variables.
var (
	NodeInterface = flag.String("interface", "127.0.0.1", "address to bind to")
	NodePort      = flag.Int("port", 8888, "port to bind to")
	MemberName    string
	MemberList    *memberlist.Memberlist
)

type eventDelegate struct{}

func (ed *eventDelegate) NotifyJoin(node *memberlist.Node) {
	fmt.Println("A node has joined: " + node.String())
}

func (ed *eventDelegate) NotifyLeave(node *memberlist.Node) {
	fmt.Println("A node has left: " + node.String())
}

func (ed *eventDelegate) NotifyUpdate(node *memberlist.Node) {
	fmt.Println("A node was updated: " + node.String())
}

func BeginClusterDiscovery() {
	log.Println("Beginning cluster discovery...")
	log.Println(*NodeInterface, *NodePort) // These are taken as CLI flags
	MemberName = fmt.Sprintf("m-%s:%d", *NodeInterface, *NodePort)

	var err error
	MemberList, err = memberlist.Create(&memberlist.Config{
		ProtocolVersion:     5,
		BindAddr:            *NodeInterface,
		BindPort:            *NodePort,
		AdvertiseAddr:       *NodeInterface,
		AdvertisePort:       *NodePort,
		TCPTimeout:          time.Second,
		IndirectChecks:      1,
		RetransmitMult:      2,
		SuspicionMult:       3,
		PushPullInterval:    15 * time.Second,
		ProbeTimeout:        200 * time.Millisecond,
		ProbeInterval:       time.Second,
		GossipInterval:      100 * time.Millisecond,
		GossipToTheDeadTime: 15 * time.Second,
		Name:                MemberName,
		Events:              &eventDelegate{},
	})
	if err != nil { // the original discarded this error with `_`
		log.Fatalln("failed to create memberlist:", err)
	}
}
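
For completeness: the snippet above only shows Create; a restarted node also has to contact a surviving peer to rejoin. A minimal sketch of that step, assuming BeginClusterDiscovery has already run and that 127.0.0.1:4444 (Node A in the logs above) is the seed address; the helper name JoinCluster is hypothetical:

// Hypothetical rejoin helper: contact a seed node and perform a
// push/pull state sync. Assumes MemberList was set by
// BeginClusterDiscovery above.
func JoinCluster(seed string) error {
	// e.g. seed = "127.0.0.1:4444" (Node A above)
	n, err := MemberList.Join([]string{seed})
	if err != nil {
		return fmt.Errorf("failed to join cluster: %w", err)
	}
	log.Printf("successfully contacted %d node(s)", n)
	return nil
}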
sandyydk commented May 6, 2021

Did you try list.Leave(timeout)? For a node exiting the memberlist, it would be good practice to leave gracefully.
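
For reference, a minimal sketch of such a graceful exit, assuming MemberList is the *memberlist.Memberlist from the snippet above and that a 5-second broadcast timeout is acceptable (both are assumptions, not from the original code):

// Leave broadcasts an intentional leave so peers mark this node as
// having left rather than failed; Shutdown then stops the background
// listeners. The 5-second timeout is an assumed value.
if err := MemberList.Leave(5 * time.Second); err != nil {
	log.Println("graceful leave failed:", err)
}
if err := MemberList.Shutdown(); err != nil {
	log.Println("shutdown failed:", err)
}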

danthegoodman1 (Author) commented

> Did you try list.Leave(timeout)? For a node exiting the memberlist, it would be good practice to leave gracefully.

@sandyydk I had not; I went with another solution a while ago. But graceful shutdown wouldn't account for node crashes or network downtime, no?

sandyydk commented May 6, 2021

> @sandyydk I had not; I went with another solution a while ago. But graceful shutdown wouldn't account for node crashes or network downtime, no?

Right, it wouldn't. What fixed it for you?

danthegoodman1 (Author) commented

@sandyydk Nothing fixed it; I went with a custom gossip protocol implementation instead.
