need guidance for detecting client stream close in interceptor #5324

Closed
andreimatei opened this issue Apr 28, 2022 · 8 comments

@andreimatei

Hello gRPC friends,

We (CockroachDB) are seeing rare nil-pointer crashes on calls to ClientStream.Context() from inside a streaming interceptor. I'd like to kindly ask for your guidance in dealing with it, because the situation is not clear to me and the struggle is real. The code in question is copied from an OpenTracing library, so the issue potentially affects others too.

The background is that we have a streaming client interceptor that creates tracing spans (think OpenTracing/OpenTelemetry) for every streaming RPC call. We want the span to live for "the life of the stream". The interceptor in question is here; it calls the wrapped Streamer to get a ClientStream, creates a tracing span, and then wraps the inner ClientStream in our own implementation that deals with signaling the end of the stream so that we can close the span.
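
For concreteness, the shape of the interceptor is roughly this. It's a simplified sketch: `Span`, `startSpan`, and `newTracingClientStream` stand in for our actual tracing types and helpers, and the wrapper itself is sketched further down.

```go
package tracingutil

import (
	"context"

	"google.golang.org/grpc"
)

// Span stands in for whatever span type the real tracing library provides;
// only Finish matters for this sketch.
type Span interface{ Finish() }

// startSpan stands in for the real span-creation helper.
func startSpan(ctx context.Context, op string) (Span, context.Context) {
	panic("placeholder for the real tracing helper")
}

// StreamClientInterceptor starts a span for each streaming RPC, obtains the
// real ClientStream from the wrapped streamer, and returns a wrapper that is
// responsible for finishing the span when the stream ends.
func StreamClientInterceptor(
	ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn,
	method string, streamer grpc.Streamer, opts ...grpc.CallOption,
) (grpc.ClientStream, error) {
	span, ctx := startSpan(ctx, method)
	cs, err := streamer(ctx, desc, cc, method, opts...)
	if err != nil {
		span.Finish()
		return nil, err
	}
	return newTracingClientStream(cs, span), nil
}
```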

The crash I'm talking about is here. This code waits for the "stream to end" through either Recv() returning io.EOF or the stream's ctx being canceled. And it's this cs.Context() call that seems to sometimes (rarely) crash with an NPE from gRPC (from here). I'm not sure why it crashes, and I'm also not sure that we're doing the right thing in the first place.
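
The wrapper and the waiting logic, continuing the sketch above, look roughly like this:

```go
// tracingClientStream wraps the real ClientStream and finishes the span when
// the stream ends: either RecvMsg returns an error (io.EOF included), or the
// stream's context is canceled.
type tracingClientStream struct {
	grpc.ClientStream
	span     Span
	finished chan struct{}
	ended    bool // gRPC forbids concurrent RecvMsg calls, so a plain bool is fine
}

func newTracingClientStream(cs grpc.ClientStream, span Span) grpc.ClientStream {
	finished := make(chan struct{})
	tcs := &tracingClientStream{ClientStream: cs, span: span, finished: finished}
	go func() {
		select {
		case <-finished:
			// RecvMsg observed the end of the stream.
		case <-cs.Context().Done():
			// This cs.Context() call is the one that rarely panics.
		}
		span.Finish()
	}()
	return tcs
}

func (s *tracingClientStream) RecvMsg(m interface{}) error {
	err := s.ClientStream.RecvMsg(m)
	if err != nil && !s.ended { // io.EOF or a real error: the stream is over
		s.ended = true
		close(s.finished)
	}
	return err
}
```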

The contract of ClientStream says:

// It should not be called until after Header or RecvMsg has returned. Once
// called, subsequent client-side retries are disabled.

I believe we're breaking this contract, as we're calling Context() before RecvMsg(). At least, before our code calls RecvMsg(); I also don't see gRPC itself making such a call at ClientStream creation time. So perhaps this is our problem. Except that, if the answer were this clear, I'd expect us to be crashing all the time, and we aren't.

I've looked a bit deeper into the gRPC code to try to understand where Context()'s contract comes from, but I couldn't figure it out. The only interesting thing I found was a comment on the cs.attempt field that says:

So, attempt can be nil only inside newClientStream function when clientStream is first created.

I believe the issue that this comment hints at is not relevant to us, because our Context() call comes after newClientStream(). The panicking line is cs.attempt.s.Context(), and attempt.s.ctx is initialized early, as far as I can tell. So I'm not sure why ClientStream.Context() seems to insist on not being called before RecvMsg(). Perhaps you could clarify?


The second part of my question is about whether our code is generally sane / what it should look like. What's the recommended way to be notified of a streaming RPC's end? We're doing two things that I find dubious:

  1. We're using this ClientStream.Context(), as opposed to the caller's context.
  2. We're installing a finalizer, as a catch-all (sketched below).
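
The finalizer bit amounts to a few lines added inside newTracingClientStream from the sketch above (again just a sketch):

```go
	// As a catch-all, finish the span at GC time if nothing else did.
	runtime.SetFinalizer(tcs, func(s *tracingClientStream) {
		s.span.Finish()
	})
```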

Do these look right to you?

We've copied this code from grpc-opentracing, which is, or at least used to be, a fairly high-profile library. In particular, they motivate the finalizer with a reference to a Stack Overflow topic that seems dubious to me. Do you agree with it?

I've checked how other similar libraries do this, and the answer always seems to be different:

  • the more modern gRPC OpenTracing code does not concern itself with either ctx cancellation or a finalizer.
  • the OpenTelemetry equivalent interceptor code does have a similar context cancellation provision, added explicitly when someone complained about a goroutine leak when the stream is canceled, but this code uses the user's ctx, not the stream's internal ctx. Is this better?

Thank you very much for your help! Please let me know if I should clarify anything.

@hexfusion
Contributor

Possibly related to #5323 ?

@andreimatei
Author

Thanks @hexfusion! It certainly seems to me like that patch could fix our panic. @dfawley what do you think?

@dfawley
Copy link
Member

dfawley commented Apr 29, 2022

#5323 should only fix cases related to retry and a race that happens in a very small window. Do you have retries enabled (through service config either specified by the client to Dial or by the service owner through the name resolver (e.g. DNS TXT records, xDS configuration))? I don't believe it's possible for the race #5323 should fix to happen without an explicit retry configuration.

@andreimatei
Author

andreimatei commented Apr 29, 2022

I do not believe we explicitly enable any sort of retries. But, from what I can gather, "transparent retries" are always enabled; can those not trigger the #5323 race? I'm also confused by this comment on the WithDisableRetry dial option, which claims that retries (I guess the non-transparent kind) are always enabled - but how can that be/which error codes would be automatically retried?
Anecdotally, the two crash reports that we've had are both potentially related to networking trouble: one of them was a laptop coming out of sleep, another one was an automated chaos test that kills processes.

Anyway, if you don't believe #5323 is my answer, I'm curious to understand if our use of ClientStream.Context() detailed above is legal according to Context()'s contract. And, more generally, if you've got the time, whether our interceptor (or, rather, OpenTracing's interceptor) is sane.

@dfawley
Member

dfawley commented Apr 29, 2022

> how can that be/which error codes would be automatically retried?

"Retry support" is always enabled unless that dial option is set (and that environment variable no longer does anything; sorry). However, retries are not configured to happen by default except for transparent retries. It requires explicit configuration to declare the retryable status codes.

"transparent retries" are always enabled; can those not trigger the #5323 race?

They are. And they could under very tight races (it might take 2 races hitting at once). Basically your first attempt has to create a transport stream successfully, then that stream ends up getting a REFUSED_STREAM response or a GOAWAY that was already in flight when the stream was created to enable a transparent retry. Then to get the nil pointer in the attempt, you would need to select the transport and then fail to create the stream but not transparently retry it. (Calling Context on the stream could do the second half of that reliably, since it commits the RPC.)

> Anyway, if you don't believe #5323 is my answer

This change should make this whole class of error impossible. So I think it will fix your panics. If you have a way to test that theory, it would be a nice data point to see if the PR works for you.

> I'm curious to understand if our use of ClientStream.Context() detailed above is legal according to Context()'s contract

It's okay to call Context early. I wouldn't recommend it, though, since it commits the current stream attempt and makes any kind of retrying impossible. (Actually, because of a bug fixed by #5323, it could retry currently when stream creation fails and lead to seeing the wrong context.) But nothing in our API should ever give you a panic unless you've done something very obviously wrong like (*grpc.ClientConn)(nil).NewStream or something.

> whether our interceptor (or, rather, OpenTracing's interceptor) is sane

I took a look at that stack overflow question and added a comment about clients not receiving the io.EOF (they are broken unless they cancel the RPC context or close the clientconn). OpenTelemetry seems fine to be using the user's context, as cancelation of that will cancel and clean up the stream correctly. The only reason to use the other context is to get the peer address / etc that grpc adds into the context.

Generally, interceptors should be able to assume clients are following the rules listed here: https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream

Not doing so leaks resources in gRPC-Go.
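
Concretely, a well-behaved caller looks roughly like this (the pb.ThingServiceClient / Watch / WatchRequest names are made up; the point is the defer cancel() and/or draining until a non-nil error):

```go
import (
	"context"
	"io"

	pb "example.com/thing/v1" // hypothetical generated package
)

// wellBehavedCall shows the pattern the NewStream docs require: cancel the
// RPC's context when done with the stream, and/or keep calling Recv until it
// returns a non-nil error (io.EOF on clean termination).
func wellBehavedCall(ctx context.Context, client pb.ThingServiceClient) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // guarantees cleanup even if we stop reading early

	stream, err := client.Watch(ctx, &pb.WatchRequest{})
	if err != nil {
		return err
	}
	for {
		msg, err := stream.Recv()
		if err == io.EOF {
			return nil // the server ended the stream cleanly
		}
		if err != nil {
			return err
		}
		_ = msg // process the message here
	}
}
```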

@andreimatei
Author

Thanks Doug!

>> how can that be/which error codes would be automatically retried?

"Retry support" is always enabled unless that dial option is set (and that environment variable no longer does anything; sorry). However, retries are not configured to happen by default except for transparent retries. It requires explicit configuration to declare the retryable status codes.

"transparent retries" are always enabled; can those not trigger the #5323 race?

> They are. And they could under very tight races (it might take 2 races hitting at once). Basically your first attempt has to create a transport stream successfully, then that stream ends up getting a REFUSED_STREAM response or a GOAWAY that was already in flight when the stream was created to enable a transparent retry. Then to get the nil pointer in the attempt, you would need to select the transport and then fail to create the stream but not transparently retry it. (Calling Context on the stream could do the second half of that reliably, since it commits the RPC.)

>> Anyway, if you don't believe #5323 is my answer

> This change should make this whole class of error impossible. So I think it will fix your panics. If you have a way to test that theory, it would be a nice data point to see if the PR works for you.

Great! I will do some experiments to see how easily I can reproduce the panic (it's definitely not frequent). If I can reproduce it, I will try your patch and report back.

>> I'm curious to understand if our use of ClientStream.Context() detailed above is legal according to Context()'s contract

> It's okay to call Context early. I wouldn't recommend it, though, since it commits the current stream attempt and makes any kind of retrying impossible. (Actually, because of a bug fixed by #5323, it could retry currently when stream creation fails and lead to seeing the wrong context.) But nothing in our API should ever give you a panic unless you've done something very obviously wrong like (*grpc.ClientConn)(nil).NewStream or something.

I see; so the comment on ClientStream.Context() that says to not call it until Recv() is referring to the "committing" of the ClientStream to a... lower-level stream(?), and thus the inhibiting of transparent retries? Even though the comment mentions the retries, and I stared at it for a while, it was not obvious to me that the "don't use it" part is simply about the retries.

>> whether our interceptor (or, rather, OpenTracing's interceptor) is sane

> I took a look at that stack overflow question and added a comment about clients not receiving the io.EOF (they are broken unless they cancel the RPC context or close the clientconn).

Thank you; that's what I thought. So the finalizer thing that we and OpenTracing do is nonsense (assuming that the RPC callers are well behaved), right?

> OpenTelemetry seems fine to be using the user's context, as cancelation of that will cancel and clean up the stream correctly. The only reason to use the other context is to get the peer address / etc that grpc adds into the context.

Ack. Then I will also switch our code to listen on the user's context, which should fix the panic independently of #5323.
Just to make sure - the ClientStream contexts inherit from the client ctx, right?
And you agree that indeed the right thing to do is to listen for context cancellation as one possible trigger for finishing the tracing span? So, basically, the way OpenTelemetry does its thing - providing its own wrapper implementation of ClientStream and also listening for ctx cancellation - is a good way of doing things?

> Generally, interceptors should be able to assume clients are following the rules listed here: https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream

> Not doing so leaks resources in gRPC-Go.

Ack.

andreimatei added a commit to andreimatei/cockroach that referenced this issue May 2, 2022
Before this patch, our client-side tracing interceptor for streaming RPC
calls was exposed to gRPC bugs currently being fixed in
github.com/grpc/grpc-go/pull/5323. This had to do with calls to
clientStream.Context() panicking with an NPE under certain races with
RPCs failing. We've recently gotten two crashes seemingly because of
this. It's unclear why this hasn't shown up before, as nothing seems new
(either on our side or on the grpc side). In 22.2 we do use more
streaming RPCs than before (for example for span configs), so maybe that
does it.

This patch eliminates the problem by removing the
problematic call into ClientStream.Context(). The background is that
our interceptor needs to watch for ctx cancelation and consider the RPC
done at that point. But there was no reason for that call; we can more
simply use the RPC caller's ctx for the purposes of figuring out if the
caller cancels the RPC. In fact, calling ClientStream.Context() is bad
for other reasons, beyond exposing us to the bug:
1) ClientStream.Context() pins the RPC attempt to a lower-level
connection, and inhibits gRPC's ability to sometimes transparently
retry failed calls. In fact, there's a comment on ClientStream.Context()
that tells you not to call it before using the stream through other
methods like Recv(), which imply that the RPC is already "pinned" and
transparent retries are no longer possible anyway. We were breaking
this.
2) One of the grpc-go maintainers suggested that, due to the bugs
referenced above, this call could actually sometimes give us "the
wrong context", although how wrong exactly is not clear to me (i.e.
could we have gotten a ctx that doesn't inherit from the caller's ctx?
Or a ctx that's canceled independently from the caller?)

This patch also removes a paranoid catch-all finalizer in the
interceptor that ensured that the RPC client span is always eventually
closed (at a random GC time), regardless of what the caller does - i.e.
even if the caller forgets about interacting with the response stream or
canceling the ctx.  This kind of paranoia is not needed. The code in
question was copied from grpc-opentracing[1], which quoted a
StackOverflow answer[2] about whether or not a client is allowed to
simply walk away from a streaming call. As a result of conversations
triggered by this patch [3], that SO answer was updated to reflect the fact
that it is not, in fact, OK for a client to do so, as it will leak gRPC
resources. The client's contract is specified in [4] (although arguably
that place is not the easiest to find by a casual gRPC user). In any
case, this patch gets rid of the finalizer. This could in theory result
in leaked spans if our own code is buggy in the way that the paranoia
prevented, but all our TestServers check that spans don't leak like that
so we are pretty protected. FWIW, a newer re-implementation of the
OpenTracing interceptor[5] doesn't have the finalizer (although it also
doesn't listen for ctx cancellation, so I think it's buggy), and neither
does the equivalent OpenTelemetry interceptor[6].

Fixes cockroachdb#80689

[1] https://github.com/grpc-ecosystem/grpc-opentracing/blob/8e809c8a86450a29b90dcc9efbf062d0fe6d9746/go/otgrpc/client.go#L174
[2] https://stackoverflow.com/questions/42915337/are-you-required-to-call-recv-until-you-get-io-eof-when-interacting-with-grpc-cl
[3] grpc/grpc-go#5324
[4] https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream
[5] https://github.com/grpc-ecosystem/go-grpc-middleware/blob/master/tracing/opentracing/client_interceptors.go#L37-L52
[6] https://github.com/open-telemetry/opentelemetry-go-contrib/blame/main/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go#L193

Release note: A rare crash indicating a nil-pointer dereference in
google.golang.org/grpc/internal/transport.(*Stream).Context(...)
was fixed.
andreimatei added a commit to andreimatei/cockroach that referenced this issue May 3, 2022
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue May 3, 2022
80476: security: update the TLS cipher suite list r=bdarnell a=knz

Fixes #80483 

This does not really change the list, it merely explains more clearly
how it was built.

Release note: None

80635: release: use preflight tool for RedHat images r=celiala a=rail

RedHat Connect has introduced a new certification workflow, which
requires running some tests locally and submitting the results back to
RedHat.

This patch runs the `preflight` tool to address the new workflow
changes.

Fixes #80633

Release note: None

80878: util/tracing: fix crash in StreamClientInterceptor r=andreimatei a=andreimatei


Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Rail Aliiev <rail@iqchoice.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@dfawley
Member

dfawley commented May 3, 2022

> I see; so the comment on ClientStream.Context() that says to not call it until Recv() is referring to the "committing" of the ClientStream to a... lower-level stream(?), and thus the inhibiting of transparent retries?

Yes, that sounds right.

> Thank you; that's what I thought. So the finalizer thing that we and OpenTracing do is nonsense (assuming that the RPC callers are well behaved), right?

Yes, I don't think that should be required, assuming your application is following the requirements that the gRPC documentation lists. The finalizer idea may originate from this comment.

> Just to make sure - the ClientStream contexts inherit from the client ctx, right?

Yes, it is a child of the application's context passed to NewStream or Invoke.

> And you agree that indeed the right thing to do is to listen for context cancellation as one possible trigger for finishing the tracing span?

If you are installing a stats.Handler, then looking for a stats.End event would be even simpler. This would be guaranteed to be called if the user is doing any of the required final steps.
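
A rough sketch of that approach (startSpan, contextWithSpan, and spanFromContext are placeholders for whatever span helpers your tracing library provides):

```go
import (
	"context"

	"google.golang.org/grpc/stats"
)

// spanStatsHandler starts a span when the RPC is tagged and finishes it when
// the *stats.End event arrives, which gRPC delivers once the RPC is over.
type spanStatsHandler struct{}

func (spanStatsHandler) TagRPC(ctx context.Context, info *stats.RPCTagInfo) context.Context {
	span, ctx := startSpan(ctx, info.FullMethodName)
	return contextWithSpan(ctx, span) // placeholder: stash the span in the ctx
}

func (spanStatsHandler) HandleRPC(ctx context.Context, s stats.RPCStats) {
	if _, ok := s.(*stats.End); ok {
		if span := spanFromContext(ctx); span != nil { // placeholder lookup
			span.Finish()
		}
	}
}

func (spanStatsHandler) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context {
	return ctx
}

func (spanStatsHandler) HandleConn(context.Context, stats.ConnStats) {}
```

It would be installed on the client with grpc.WithStatsHandler(spanStatsHandler{}) when dialing.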

> So, basically, the way OpenTelemetry does its thing - providing its own wrapper implementation of ClientStream and also listening for ctx cancellation - is a good way of doing things?

Yes, this would work if you have an interceptor and not a stats handler.

@dfawley
Member

dfawley commented May 10, 2022

Let us know if you have any more questions on this.

@dfawley dfawley closed this as completed May 10, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 7, 2022