
Deadlock on grpc transport #5644

Closed
wimdec opened this issue Sep 7, 2022 · 4 comments · Fixed by #5652

wimdec commented Sep 7, 2022

What version of gRPC are you using?

1.49

What version of Go are you using (go version)?

1.19

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

I do not have a reproducible scenario.
Since we switched our application to grpc-go 1.49, we sometimes see it hang on gRPC requests.
From the extracted stack traces, I can clearly see a deadlock in the transport, caused by the recent PR #5494.
That PR added this note on the http2_client mutex:

	// Do not access controlBuf with mu held.
	mu            sync.Mutex // guard the following variables

But that is exactly what happens, very infrequently, in our case.
The cause is:

t.closeStream(stream, errStreamDrain, false, http2.ErrCodeNo, statusGoAway, nil, false)

This closeStream call is made while holding the http2_client mutex, and the implementation of closeStream calls into controlBuf at the line:
t.controlBuf.executeAndPut(addBackStreamQuota, cleanup)

This violates the assumption stated in that comment: the http2_client mutex is held while trying to acquire the controlbuf mutex.

In parallel, another goroutine holds the controlbuf mutex while trying to acquire the http2_client mutex. The reason is the mutex lock added in the NewStream callback (http2_client.go:723 in the stack trace below).
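
To make the inverted ordering concrete, here is a minimal, self-contained sketch of the pattern with hypothetical names (transportMu standing in for the http2_client mutex, controlBufMu for the controlbuf mutex); it only illustrates the lock-order inversion and is not grpc-go code:

package main

import "sync"

var (
	transportMu  sync.Mutex // plays the role of the http2_client mutex
	controlBufMu sync.Mutex // plays the role of the controlbuf mutex
)

// Mirrors the handleGoAway path: closeStream is reached with transportMu
// held and then needs controlBufMu (via executeAndPut).
func goAwayPath() {
	transportMu.Lock()
	defer transportMu.Unlock()

	controlBufMu.Lock() // blocks if newStreamPath already holds controlBufMu
	defer controlBufMu.Unlock()
}

// Mirrors the NewStream path: executeAndPut holds controlBufMu while the
// callback tries to take transportMu.
func newStreamPath() {
	controlBufMu.Lock()
	defer controlBufMu.Unlock()

	transportMu.Lock() // blocks if goAwayPath already holds transportMu
	defer transportMu.Unlock()
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); goAwayPath() }()
	go func() { defer wg.Done(); newStreamPath() }()
	wg.Wait() // with unlucky interleaving, both goroutines block forever
}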

Relevant stack traces:

goroutine 4163 [semacquire, 437 minutes]:
runtime.gopark(0xc000a5ab58?, 0x470de0?, 0x40?, 0xcb?, 0xc000a5aa58?)
   /usr/local/go/src/runtime/proc.go:363 +0xd6 fp=0xc000a5aa50 sp=0xc000a5aa30 pc=0x4389f6
runtime.goparkunlock(...)
   /usr/local/go/src/runtime/proc.go:369
runtime.semacquire1(0xc000b7d3c4, 0xd8?, 0x3, 0x1)
   /usr/local/go/src/runtime/sema.go:150 +0x1fe fp=0xc000a5aab8 sp=0xc000a5aa50 pc=0x448efe
sync.runtime_SemacquireMutex(0x20?, 0xde?, 0x0?)
   /usr/local/go/src/runtime/sema.go:77 +0x25 fp=0xc000a5aae8 sp=0xc000a5aab8 pc=0x462ea5
sync.(*Mutex).lockSlow(0xc000b7d3c0)
   /usr/local/go/src/sync/mutex.go:171 +0x165 fp=0xc000a5ab38 sp=0xc000a5aae8 pc=0x4709c5
sync.(*Mutex).Lock(...)
   /usr/local/go/src/sync/mutex.go:90
google.golang.org/grpc/internal/transport.(*controlBuffer).executeAndPut(0xc000b7d3b0, 0xc000a5aba8?, {0x10e38c0, 0xc00076c3a8})
   /workspace/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:331 +0x52 fp=0xc000a5ab80 sp=0xc000a5ab38 pc=0x8e6db2
google.golang.org/grpc/internal/transport.(*http2Client).closeStream(0xc00038e780, 0xc000e7e000, {0x10e3840?, 0xc0000120f0?}, 0x0, 0x0, 0x8000?, 0xc000d537a0?, 0x0)
   /workspace/vendor/google.golang.org/grpc/internal/transport/http2_client.go:870 +0x1fc fp=0xc000a5abc8 sp=0xc000a5ab80 pc=0x8f629c
google.golang.org/grpc/internal/transport.(*http2Client).handleGoAway(0xc00038e780, 0xc000d0f530)
   /workspace/vendor/google.golang.org/grpc/internal/transport/http2_client.go:1237 +0x53d fp=0xc000a5ad90 sp=0xc000a5abc8 pc=0x8f883d
google.golang.org/grpc/internal/transport.(*http2Client).reader(0xc00038e780)
   /workspace/vendor/google.golang.org/grpc/internal/transport/http2_client.go:1546 +0x9cd fp=0xc000a5afc8 sp=0xc000a5ad90 pc=0x8fbbed
google.golang.org/grpc/internal/transport.newHTTP2Client.func7()
   /workspace/vendor/google.golang.org/grpc/internal/transport/http2_client.go:367 +0x26 fp=0xc000a5afe0 sp=0xc000a5afc8 pc=0x8f14e6
runtime.goexit()
   /usr/local/go/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc000a5afe8 sp=0xc000a5afe0 pc=0x466aa1
created by google.golang.org/grpc/internal/transport.newHTTP2Client
   /workspace/vendor/google.golang.org/grpc/internal/transport/http2_client.go:367 +0x18ff

goroutine 4934 [semacquire, 437 minutes]:
runtime.gopark(0xc000f89470?, 0xc000a986c0?, 0xc0?, 0x0?, 0xc00004517f?)
   /usr/local/go/src/runtime/proc.go:363 +0xd6 fp=0xc0006964a0 sp=0xc000696480 pc=0x4389f6
runtime.goparkunlock(...)
   /usr/local/go/src/runtime/proc.go:369
runtime.semacquire1(0xc00038e8e4, 0x28?, 0x3, 0x1)
   /usr/local/go/src/runtime/sema.go:150 +0x1fe fp=0xc000696508 sp=0xc0006964a0 pc=0x448efe
sync.runtime_SemacquireMutex(0xc0005e6390?, 0x0?, 0x203000?)
   /usr/local/go/src/runtime/sema.go:77 +0x25 fp=0xc000696538 sp=0xc000696508 pc=0x462ea5
sync.(*Mutex).lockSlow(0xc00038e8e0)
   /usr/local/go/src/sync/mutex.go:171 +0x165 fp=0xc000696588 sp=0xc000696538 pc=0x4709c5
sync.(*Mutex).Lock(...)
   /usr/local/go/src/sync/mutex.go:90
google.golang.org/grpc/internal/transport.(*http2Client).NewStream.func3({0xe0a980?, 0xc0006aec30?})
   /workspace/vendor/google.golang.org/grpc/internal/transport/http2_client.go:723 +0x12a fp=0xc0006965d0 sp=0xc000696588 pc=0x8f5bea
google.golang.org/grpc/internal/transport.(*http2Client).NewStream.func5({0xe0a980, 0xc0006aec30})
   /workspace/vendor/google.golang.org/grpc/internal/transport/http2_client.go:755 +0x4c fp=0xc0006965f8 sp=0xc0006965d0 pc=0x8f58cc
google.golang.org/grpc/internal/transport.(*controlBuffer).executeAndPut(0xc000b7d3b0, 0x10ec540?, {0x10e3940, 0xc0006aec30})
   /workspace/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:337 +0xc3 fp=0xc000696640 sp=0xc0006965f8 pc=0x8e6e23
google.golang.org/grpc/internal/transport.(*http2Client).NewStream(0xc00038e780, {0x10ec540, 0xc000f89350}, 0xc0010023c0)
   /workspace/vendor/google.golang.org/grpc/internal/transport/http2_client.go:754 +0x438 fp=0xc000696828 sp=0xc000696640 pc=0x8f51f8
google.golang.org/grpc.(*csAttempt).newStream(0xc0001a4c00)
   /workspace/vendor/google.golang.org/grpc/stream.go:441 +0x4a fp=0xc000696860 sp=0xc000696828 pc=0x983aca
google.golang.org/grpc.newClientStreamWithParams.func2(0xc0001a4c00)
   /workspace/vendor/google.golang.org/grpc/stream.go:312 +0x3a fp=0xc000696880 sp=0xc000696860 pc=0x982f9a
google.golang.org/grpc.(*clientStream).withRetry(0xc0005685a0, 0xc0007bf370, 0xc000696a08)
   /workspace/vendor/google.golang.org/grpc/stream.go:715 +0x144 fp=0xc0006968f0 sp=0xc000696880 pc=0x984a84
google.golang.org/grpc.newClientStreamWithParams({0x10ec540, 0xc000f89260}, 0x17b7ea0, 0xc000443500, {0xfa02df, 0x31}, {0x0, 0x0, 0x0, 0x0, ...}, ...)
   /workspace/vendor/google.golang.org/grpc/stream.go:321 +0x9ac fp=0xc000696a38 sp=0xc0006968f0 pc=0x9829ac
google.golang.org/grpc.newClientStream.func2({0x10ec540?, 0xc000f89260?}, 0xc000f89260?)
   /workspace/vendor/google.golang.org/grpc/stream.go:192 +0x9f fp=0xc000696ac8 sp=0xc000696a38 pc=0x981f7f
google.golang.org/grpc.newClientStream({0x10ec540, 0xc000f89260}, 0x17b7ea0, 0xc000443500, {0xfa02df, 0x31}, {0x0, 0x0, 0x0})
   /workspace/vendor/google.golang.org/grpc/stream.go:220 +0x4c2 fp=0xc000696ba0 sp=0xc000696ac8 pc=0x981b62
google.golang.org/grpc.invoke({0x10ec540?, 0xc000f89260?}, {0xfa02df?, 0x4?}, {0xf0b1a0, 0xc0006aea50}, {0xf41840, 0xc000c407e0}, 0x0?, {0x0, ...})
   /workspace/vendor/google.golang.org/grpc/call.go:66 +0x7d fp=0xc000696c08 sp=0xc000696ba0 pc=0x96473d

What did you expect to see?

no deadlock

What did you see instead?

deadlock

wimdec commented Sep 9, 2022

@zasweq Adding a defer to this line:

t.closeStream(stream, errStreamDrain, false, http2.ErrCodeNo, statusGoAway, nil, false)

solves the issue.
Is that a good solution?
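
For illustration only, here is a minimal standalone sketch (hypothetical names, not grpc-go code) of why the defer helps: the deferred call runs when the surrounding function returns, which is after the explicit Unlock in the body, so the second mutex is no longer requested while the first one is held.

package main

import (
	"fmt"
	"sync"
)

var (
	muA sync.Mutex // stands in for the http2_client mutex
	muB sync.Mutex // stands in for the controlbuf mutex
)

// cleanup stands in for closeStream, which needs muB internally.
func cleanup() {
	muB.Lock()
	defer muB.Unlock()
	fmt.Println("cleanup ran after muA was released")
}

func handle() {
	muA.Lock()
	// Deferring the call means it executes when handle returns,
	// i.e. after the explicit muA.Unlock() below, so muA and muB
	// are never held at the same time on this path.
	defer cleanup()
	muA.Unlock()
}

func main() {
	handle()
}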

dfawley commented Sep 13, 2022

@wimdec what is your server (or proxy) running? It doesn't seem like this condition should be possible if it is following HTTP/2 best practices:

https://httpwg.org/specs/rfc7540.html#GOAWAY

A server that is attempting to gracefully shut down a connection SHOULD send an initial GOAWAY frame with the last stream identifier set to 2^31-1 and a NO_ERROR code. This signals to the client that a shutdown is imminent and that initiating further requests is prohibited. After allowing time for any in-flight stream creation (at least one round-trip time), the server can send another GOAWAY frame with an updated last stream identifier. This ensures that a connection can be cleanly shut down without losing requests.
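
As a rough sketch of the two-step shutdown the RFC describes, using golang.org/x/net/http2 directly (the connection handling, stream bookkeeping, and the fixed one-second wait are illustrative assumptions, not grpc-go or nginx code):

package shutdown

import (
	"math"
	"net"
	"time"

	"golang.org/x/net/http2"
)

// gracefulGoAway sends the initial "shutdown is imminent" GOAWAY, waits for
// in-flight stream creation to arrive, then sends the final GOAWAY with the
// real last stream identifier.
func gracefulGoAway(conn net.Conn, lastProcessedStreamID uint32) error {
	fr := http2.NewFramer(conn, conn)

	// Step 1: last stream ID 2^31-1 with NO_ERROR tells the client to stop
	// creating new streams, while streams it has already started may still
	// be processed.
	if err := fr.WriteGoAway(math.MaxInt32, http2.ErrCodeNo, nil); err != nil {
		return err
	}

	// Step 2: allow at least one round-trip time for streams the client has
	// already initiated to reach us.
	time.Sleep(time.Second)

	// Step 3: the final GOAWAY carries the highest stream ID that was
	// actually processed.
	return fr.WriteGoAway(lastProcessedStreamID, http2.ErrCodeNo, nil)
}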

wimdec commented Sep 14, 2022

We are using Kubernetes ingress-nginx, version v1.2.0, as the proxy.
The server behind it uses the same grpc-go version as the client.

dfawley commented Sep 14, 2022

We are using Kubernetes ingress-nginx, version v1.2.0, as the proxy.

The behavior seen here would come from the proxy (not the backing grpc-go server) if that's an L7/HTTP/2 proxy, which I'm pretty sure it is. So they are most likely not following the recommended way of gracefully terminating the connection, which could be raised as an issue with them. If they were allowing enough time for us to stop creating streams on the connection, no new streams should need to be terminated this way, and we wouldn't hit this deadlock. (We still have a bug which will be fixed ASAP.)

github-actions bot locked as resolved and limited conversation to collaborators Mar 21, 2023