
Memberlist.Members safe use docstring #250

Open
krajorama opened this issue Nov 22, 2021 · 2 comments · May be fixed by #251

Comments


krajorama commented Nov 22, 2021

func (m *Memberlist) Members() []*Node {

Hi! TL;DR: the Memberlist.Members function docstring is missing a warning that reading through the returned pointers can lead to a data race.
We've seen it trigger the race detector once in tests for Prometheus Alertmanager.
The issue surfaced with prometheus/alertmanager#1428, which introduced checking addresses via the Node pointers to verify that the expected members are present.
Would you accept a PR that mentions the potential for a data race in the docstring?
Or would a better solution make sense here, e.g. copying the Nodes/addresses under the mutex?

Thanks in advance, krajo

Info:

WARNING: DATA RACE
Write at 0x00c0000ea510 by goroutine 97:
  github.com/hashicorp/memberlist.(*Memberlist).aliveNode()
     github.com/hashicorp/memberlist/state.go:1108 +0x6fd
  github.com/hashicorp/memberlist.(*Memberlist).handleAlive()
     github.com/hashicorp/memberlist/net.go:673 +0x58e
  github.com/hashicorp/memberlist.(*Memberlist).packetHandler()
     github.com/hashicorp/memberlist/net.go:448 +0x236

Previous read at 0x00c0000ea510 by goroutine 103:
  github.com/hashicorp/memberlist.(*Node).Address()
     github.com/hashicorp/memberlist/state.go:43 +0x44
  github.com/prometheus/alertmanager/cluster.(*Peer).refresh()
     github.com/prometheus/alertmanager/cluster/cluster.go:452 +0x364
  github.com/prometheus/alertmanager/cluster.(*Peer).refresh-fm()
     github.com/prometheus/alertmanager/cluster/cluster.go:439 +0x4a
  github.com/prometheus/alertmanager/cluster.(*Peer).runPeriodicTask()
     github.com/prometheus/alertmanager/cluster/cluster.go:395 +0xa3
==================
--- FAIL: TestAlertmanager_ReplicasPosition (113.26s)

I could not reproduce the issue with this test, but it is possible to write a synthetic test to trigger the race detector:

// Note: this test lives in the memberlist package itself, since it uses
// unexported helpers such as testConfig, estNumNodes and yield.
func TestMemberlist_JoinDataRace(t *testing.T) {
	c1 := testConfig(t)
	c1.DeadNodeReclaimTime = 1 * time.Nanosecond
	m1, err := Create(c1)
	require.NoError(t, err)
	defer m1.Shutdown()

	bindPort := m1.config.BindPort

	// Create a second node
	c2 := testConfig(t)
	c2.BindPort = bindPort

	m2, err := Create(c2)
	require.NoError(t, err)
	//defer m2.Shutdown()

	num, err := m2.Join([]string{m1.config.Name + "/" + m1.config.BindAddr})
	if num != 1 {
		t.Fatalf("expected 1 node joined, got %d", num)
	}
	if err != nil {
		t.Fatalf("unexpected err: %s", err)
	}

	// Check the hosts
	if len(m2.Members()) != 2 {
		t.Fatalf("should have 2 nodes! %v", m2.Members())
	}
	if m2.estNumNodes() != 2 {
		t.Fatalf("should have 2 nodes! %v", m2.Members())
	}
	yield()
	time.Sleep(1 * time.Second) // time.After alone would not block here

	shutdownChannel := make(chan int)
	members := m1.Members()
	lister := func() int {
		for {
			select {
			case x := <-shutdownChannel:
				return x
			default:
				// members := m1.Members()
				for _, v := range members {
					yield()
					if v.Address() == "" {
						panic("cannot be")
					}
				}
			}
		}
	}
	shutdownLister := func() {
		shutdownChannel <- 0
	}
	defer shutdownLister()
	go lister()

	m2.Leave(1 * time.Second)
	m2.Shutdown()
	// Re-create the second node with a different address
	c3 := testConfig(t)
	c3.Name = c2.Name  // force reclaim
	c3.BindPort = bindPort

	m3, err := Create(c3)
	require.NoError(t, err)
	defer m3.Shutdown()

	numB, errB := m3.Join([]string{m1.config.Name + "/" + m1.config.BindAddr})
	if numB != 1 {
		t.Fatalf("expected 1 node joined, got %d", numB)
	}
	if errB != nil {
		t.Fatalf("unexpected err: %s", errB)
	}

	time.Sleep(5 * time.Second) // time.After alone would not block here
}
dnephin (Contributor) commented Nov 22, 2021

Thank you for opening this issue! A PR to document safe usage of this method would be great!

I fixed a related data race in #232, but I don't think it addresses this problem. While working on that PR I noticed that the problem you describe also exists with EventDelegate: if you access or modify the *Node passed to any of its methods, you will also end up with a data race. If you wanted to, it might make sense to document safe usage on that interface as well, but we can also do that in a separate PR.

As for a better solution: returning non-pointer values would probably be safer, but that would be a breaking change, so I think documenting it is best for now. We'll have to look at a longer-term fix for a v2.
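As a concrete illustration of the documenting option, a docstring along these lines could be added (purely illustrative wording, not the actual text of the linked PR):

```go
// Members returns a point-in-time snapshot of the members of this cluster.
// The returned slice is a copy, but the *Node values it contains are shared
// with memberlist's internal state and may be mutated concurrently as the
// cluster changes. Callers that read Node fields (or call Node.Address)
// after this method returns should copy the Node values they need first.
func (m *Memberlist) Members() []*Node {
	...
}
```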

krajorama added a commit to krajorama/memberlist that referenced this issue Nov 23, 2021
@krajorama krajorama linked a pull request Nov 23, 2021 that will close this issue
krajorama (Author) commented

Hi, any feedback is welcome on my PR documenting the issue, thanks.
