Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestSafeConfigSelector block issue #4942

Closed
charlesxsh opened this issue Nov 4, 2021 · 3 comments
Closed

TestSafeConfigSelector block issue #4942

charlesxsh opened this issue Nov 4, 2021 · 3 comments

Comments

@charlesxsh
Copy link
Contributor

charlesxsh commented Nov 4, 2021

What version of gRPC are you using?

commit 9280052

What version of Go are you using (go version)?

1.16

What operating system (Linux, Windows, …) and version?

Linux/Debian

What did you do?

If possible, provide a recipe for reproducing the error.
These potential blocking issues are found by our current WIP project. Some issues are hard to reproduce without our tool/instrumentation. We will release it as soon as it is ready.
This issue is part of #4672, I simply seperate them into individual issue so that we can track them one by one.

What did you expect to see?

no blocking issue

What did you see instead?

During the execution of following tests, the program might have blocking issues/deadlock at following positions if some selects/timeout are chosen/happened:

TestSafeConfigSelector (internal/resolver/config_selector_test.go)

func (scs *SafeConfigSelector) UpdateConfigSelector(cs ConfigSelector) {
scs.mu.Lock()
defer scs.mu.Unlock()
scs.cs = cs
}

cs1 := &fakeConfigSelector{
selectConfig: func(r RPCInfo) (*RPCConfig, error) {
cs1Called <- struct{}{}
if diff := cmp.Diff(r, testRPCInfo); diff != "" {
t.Errorf("SelectConfig(%v) called; want %v\n Diffs:\n%s", r, testRPCInfo, diff)

for dl := time.Now().Add(150 * time.Millisecond); !time.Now().After(dl); {
gotConfigChan := make(chan *RPCConfig)
go func() {
cfg, _ := scs.SelectConfig(testRPCInfo)
gotConfigChan <- cfg
}()
select {
case <-time.After(500 * time.Millisecond):
t.Fatalf("timed out waiting for cs1 or cs2 to be called")
case <-cs1Called:

If internal/resolver/config_selector_test line 121 is chosen, then g1 will be blocked at sending to cs1Called so


cannot be unlocked. This also lead UpdateConfigSelector, will be blocked at lock.

main g1( SelectConfig) g2 (UpdateConfigSelector)
mu.RLock()
mu.Lock() //blocked
cs1Called <- struct{}{} // blocked

// g1 blocked at sending to cs1Called, mu cannot be unlock, so g2 also blocked at mu.Lock()

@charlesxsh
Copy link
Contributor Author

charlesxsh commented Nov 12, 2021

@dfawley Hi there, below is more detail description about this issue:

Test Output:

=== RUN   TestSafeConfigSelector
    config_selector_test.go:112: timed out waiting for cs1 to be called

Golang stack trace

goroutine 9 [chan send]:
google.golang.org/grpc/internal/resolver.TestSafeConfigSelector.func1(0x0, 0x0, 0x717f66, 0xb, 0x1, 0x1, 0x726000)
	/fuzz/target/internal/resolver/config_selector_test.go:74 +0xbd
google.golang.org/grpc/internal/resolver.(*fakeConfigSelector).SelectConfig(0xc00000e078, 0x0, 0x0, 0x717f66, 0xb, 0x1, 0x0, 0x0)
	/fuzz/target/internal/resolver/config_selector_test.go:47 +0x7c
google.golang.org/grpc/internal/resolver.(*SafeConfigSelector).SelectConfig(0xc00017a000, 0x0, 0x0, 0x717f66, 0xb, 0x0, 0x0, 0x0)
	/fuzz/target/internal/resolver/config_selector.go:101 +0x128
google.golang.org/grpc/internal/resolver.TestSafeConfigSelector.func3(0xc00017a000, 0x0, 0x0, 0x717f66, 0xb, 0xc000034d80, 0xc00011b400, 0xc00006ae00)
	/fuzz/target/internal/resolver/config_selector_test.go:98 +0x74
created by google.golang.org/grpc/internal/resolver.TestSafeConfigSelector
	/fuzz/target/internal/resolver/config_selector_test.go:97 +0x47f

So according to the analysis above, if timeout happened first, goroutine 9 will be blocked forever.

(The line number could be mismatched since the code is instrumented by the bug finding tool. config_selector_test.go:74 infers

cs1 := &fakeConfigSelector{
		selectConfig: func(r RPCInfo) (*RPCConfig, error) {
			cs1Called <- struct{}{}   <--- line 74
...

@dfawley
Copy link
Member

dfawley commented Nov 12, 2021

But also in this case, the test has failed. I'm not super concerned about a secondary goroutine hanging in a test after it fails. It's not ideal, but it's also not worth over-complicating the code to handle more gracefully. Also, it seems the channel is buffered, so I'm not actually sure what the full set of circumstances that would lead to this would be. I could see this one hanging here:

Making that have a buffer size of 1 should fix it pretty simply, too.

Aaaaand, I just noticed that you're running some pretty old code. It seems someone else noticed this already, and that's why there is now a buffer on the channel when there was not one in the version of the code you are running. (#4570)

@charlesxsh
Copy link
Contributor Author

@dfawley Exactly. Like your mentioned, line 118 is also an potential place if fail in timeout. Same things for

retChan1 := make(chan *RPCConfig)
retChan2 := make(chan *RPCConfig)

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 12, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants