Segfault due to access of SafeConfigSelector.cs before initialization #4343

Closed
dfinkel opened this issue Apr 15, 2021 · 5 comments · Fixed by #4398

@dfinkel

dfinkel commented Apr 15, 2021

What version of gRPC are you using?

v1.37.0 (also observed with v1.36.1)
Not observed with v1.35.0

What version of Go are you using (go version)?

1.16.2

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

If possible, provide a recipe for reproducing the error.

It appears that we're seeing this during initialization of the Google Cloud Spanner client, but creating clients in a loop doesn't seem to reproduce it.
Currently, we hit the segfault (below) as a flaky failure in presubmit tests that interact with a real Cloud Spanner instance (and therefore go through Google's CFEs). We first encountered it after upgrading our go.mod.

If the cause of the race isn't clear to the package maintainers, I can try harder to create a repro. (I suspect this race was opened up during the xDS-related refactoring of the ServiceConfig and Resolver support.)

What did you expect to see?

No panics, just a connection/RPC invocation.

What did you see instead?

A segfault (nil-pointer dereference) in google.golang.org/grpc/internal/resolver.(*SafeConfigSelector).SelectConfig

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x7d6cfc]

goroutine 14208 [running]:
google.golang.org/grpc/internal/resolver.(*SafeConfigSelector).SelectConfig(0xc0063e37a8, 0x1f75b48, 0xc00a7544e0, 0x1cfc3aa, 0x2e, 0x0, 0x0, 0x0)
	/go/pkg/mod/google.golang.org/grpc@v1.37.0/internal/resolver/config_selector.go:163 +0x7c
google.golang.org/grpc.newClientStream(0x1f75b48, 0xc00a7544e0, 0x2b6cce0, 0xc0063e3500, 0x1cfc3aa, 0x2e, 0xc00a3e7560, 0x1, 0x1, 0x0, ...)
	/go/pkg/mod/google.golang.org/grpc@v1.37.0/stream.go:182 +0x230
google.golang.org/grpc.invoke(0x1f75b48, 0xc00a7544e0, 0x1cfc3aa, 0x2e, 0x1b66400, 0xc00a66c8c0, 0x1b01160, 0xc000859dc0, 0xc0063e3500, 0xc00a3e7560, ...)
	/go/pkg/mod/google.golang.org/grpc@v1.37.0/call.go:66 +0x99
google.golang.org/grpc.(*ClientConn).Invoke(0xc0063e3500, 0x1f75b48, 0xc00a7544e0, 0x1cfc3aa, 0x2e, 0x1b66400, 0xc00a66c8c0, 0x1b01160, 0xc000859dc0, 0x0, ...)
	/go/pkg/mod/google.golang.org/grpc@v1.37.0/call.go:37 +0x1b3
google.golang.org/genproto/googleapis/spanner/v1.(*spannerClient).BatchCreateSessions(0xc00691b840, 0x1f75b48, 0xc00a7544e0, 0xc00a66c8c0, 0x0, 0x0, 0x0, 0x1ac3f80, 0xc00a754300, 0x1a3f280)
	/go/pkg/mod/google.golang.org/genproto@v0.0.0-20210331142528-b7513248f0ba/googleapis/spanner/v1/spanner.pb.go:3541 +0xd4
cloud.google.com/go/spanner/apiv1.(*Client).BatchCreateSessions.func1(0x1f75b48, 0xc00a7544e0, 0x1d7d9a0, 0x0, 0x0, 0x0, 0xc000c70ce8, 0x410438)
	/go/pkg/mod/cloud.google.com/go/spanner@v1.17.0/apiv1/spanner_client.go:357 +0x84
github.com/googleapis/gax-go/v2.invoke(0x1f75b48, 0xc00a7544e0, 0xc000f77df0, 0x1d7d9a0, 0x0, 0x0, 0x0, 0x1d7eae8, 0x0, 0x0)
	/go/pkg/mod/github.com/googleapis/gax-go/v2@v2.0.5/invoke.go:70 +0x8f
github.com/googleapis/gax-go/v2.Invoke(0x1f75b48, 0xc00a7544e0, 0xc000c70df0, 0xc00691b780, 0x1, 0x1, 0xc00a7544e0, 0x40fcdb)
	/go/pkg/mod/github.com/googleapis/gax-go/v2@v2.0.5/invoke.go:48 +0xf6
cloud.google.com/go/spanner/apiv1.(*Client).BatchCreateSessions(0xc0008597c0, 0x1f75b48, 0xc00a7544e0, 0xc00a66c8c0, 0x0, 0x0, 0x1, 0x0, 0x0, 0x0)
	/go/pkg/mod/cloud.google.com/go/spanner@v1.17.0/apiv1/spanner_client.go:355 +0x376
cloud.google.com/go/spanner.(*sessionClient).executeBatchCreateSessions(0xc001e0e8c0, 0xc0008597c0, 0xc000000019, 0xc000f9dad0, 0xc000f9db00, 0x1f49df0, 0xc0002976c0)
	/go/pkg/mod/cloud.google.com/go/spanner@v1.17.0/sessionclient.go:233 +0x4f0
created by cloud.google.com/go/spanner.(*sessionClient).batchCreateSessions
	/go/pkg/mod/cloud.google.com/go/spanner@v1.17.0/sessionclient.go:200 +0x1ab

@menghanl
Contributor

Thanks for filing the issue.

Do you have the client side logs? Do you use xDS?
It would be helpful to see what resolver was used, and find out how ConfigSelector was not set.
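For readers following along, the failure mode under discussion (a mutex-guarded selector that is read before anything has been stored in it) can be sketched in a few lines of standalone Go. This is only an illustration under stated assumptions: `selector`, `staticSelector`, `safeSelector`, `SelectUnchecked`, `SelectChecked`, and `errNoSelector` are hypothetical names invented here, not gRPC's actual types, and the nil check shown is not necessarily how the library addresses the problem.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// selector is a hypothetical stand-in for a per-RPC config selector.
type selector interface {
	Select(method string) (string, error)
}

// staticSelector always returns the same config name.
type staticSelector struct{ name string }

func (s staticSelector) Select(string) (string, error) { return s.name, nil }

// safeSelector guards a selector with a mutex. Its zero value holds a nil
// selector until Update is called.
type safeSelector struct {
	mu sync.RWMutex
	cs selector
}

// Update installs a new selector.
func (s *safeSelector) Update(cs selector) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.cs = cs
}

// SelectUnchecked calls through s.cs without a nil check, so it panics with a
// nil-pointer dereference if Update has not run yet.
func (s *safeSelector) SelectUnchecked(method string) (string, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.cs.Select(method)
}

var errNoSelector = errors.New("no selector installed yet")

// SelectChecked returns an error instead of panicking when no selector is set.
func (s *safeSelector) SelectChecked(method string) (string, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if s.cs == nil {
		return "", errNoSelector
	}
	return s.cs.Select(method)
}

// panics reports whether f panicked.
func panics(f func()) (p bool) {
	defer func() { p = recover() != nil }()
	f()
	return false
}

func main() {
	var s safeSelector // zero value: nothing installed yet

	fmt.Println("unchecked call before Update panics:",
		panics(func() { s.SelectUnchecked("/svc/Method") }))

	_, err := s.SelectChecked("/svc/Method")
	fmt.Println("checked call before Update:", err)

	s.Update(staticSelector{name: "default"})
	cfg, _ := s.SelectUnchecked("/svc/Method")
	fmt.Println("after Update:", cfg)
}
```

The mutex protects against concurrent reads and writes, but it does nothing for a read that happens before the first write. Any code path that can reach the read without having gone through the initial update will hit the nil value.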

@dfinkel
Author

dfinkel commented Apr 15, 2021

Thanks for the quick response!

Sorry about the sparse initial report. No, we don't use xDS. In this case we appear to be using the dns resolver.

It hadn't occurred to me to enable verbose logging on the client before. In this case, it appears that the dns resolver fails to do a SRV record lookup for the client config because of a canceled context just before the segfault.

The relevant segment (I've attached a more extensive section of the logs):

INFO: 2021/04/15 18:53:13 [dns] dns: SRV record lookup error: lookup _grpclb._tcp.spanner.googleapis.com on 172.17.0.1:53: dial udp 172.17.0.1:53: operation was canceled
INFO: 2021/04/15 18:53:13 [dns] dns: A record lookup error: lookup spanner.googleapis.com on 172.17.0.1:53: dial udp 172.17.0.1:53: operation was canceled
WARNING: 2021/04/15 18:53:13 [core] ccResolverWrapper: reporting error to cc: dns: A record lookup error: lookup spanner.googleapis.com on 172.17.0.1:53: dial udp 172.17.0.1:53: operation was canceled
INFO: 2021/04/15 18:53:13 [core] Channel Connectivity change to SHUTDOWN
INFO: 2021/04/15 18:53:13 [core] Subchannel Connectivity change to SHUTDOWN
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x7a49fc]
goroutine 12991 [running]:
<snip>

spanner_clientconfig_resolver_panic.log

@dfawley
Member

dfawley commented May 7, 2021

Thank you for reporting the issue.

I believe I have found the root cause of this and sent a PR to fix it. @dfinkel are you able to run with the changes in #4398 and confirm this fixes the problem for you?

@dfinkel
Author

dfinkel commented May 7, 2021

Thank you!

I've added a replace directive to the go.mod on a clean branch and run the tests through our CI pipeline a few times (in parallel with another "control" branch to drive up the load).

I confirmed that the "control" branch failed with the segfault from this bug and the branch using the fix in #4398 did not (despite more runs on the fix branch).
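For anyone wanting to try an unreleased fix the same way, a replace directive in go.mod might look roughly like the fragment below. This is a sketch, not the exact directive used in this thread: the local path `../grpc-go` is a hypothetical checkout of the branch containing the fix, and the thread does not say which fork or revision was actually used.

```
// go.mod fragment (hypothetical): build against a local checkout of grpc-go
// instead of the released module.
require google.golang.org/grpc v1.37.0

replace google.golang.org/grpc => ../grpc-go
```

With a directive like this in place, `go build` and `go test` resolve `google.golang.org/grpc` from the replacement path. Removing the line restores the released version.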

@dfawley
Member

dfawley commented May 8, 2021

Great news, thank you for confirming!

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 4, 2021