need guidance for detecting client stream close in interceptor #5324

Closed
andreimatei opened this issue Apr 28, 2022 · 8 comments

@andreimatei

Hello gRPC friends,

We (CockroachDB) are seeing rare nil-pointer crashes on calls to ClientStream.Context() from inside a streaming interceptor. I'd like to kindly ask for your guidance in dealing with it, because the situation is not clear to me and the struggle is real. The code in question is copied from an OpenTracing library, so the issue potentially affects others too.

The background is that we have a streaming client interceptor that creates tracing spans (think OpenTracing/OpenTelemetry) for every streaming RPC call. We want the span to live for "the life of the stream". The interceptor in question is here; it calls the wrapped Streamer to get a ClientStream, creates a tracing span, and then wraps the inner ClientStream in our own implementation that deals with signaling the end of the stream so that we can close the span.
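
For concreteness, the shape of the interceptor is roughly this. It's a simplified sketch: `Span`, `startSpan`, and `newTracingClientStream` stand in for our actual tracing types and helpers, and the wrapper itself is sketched further down.

```go
package tracingutil

import (
	"context"

	"google.golang.org/grpc"
)

// Span stands in for whatever span type the real tracing library provides;
// only Finish matters for this sketch.
type Span interface{ Finish() }

// startSpan stands in for the real span-creation helper.
func startSpan(ctx context.Context, op string) (Span, context.Context) {
	panic("placeholder for the real tracing helper")
}

// StreamClientInterceptor starts a span for each streaming RPC, obtains the
// real ClientStream from the wrapped streamer, and returns a wrapper that is
// responsible for finishing the span when the stream ends.
func StreamClientInterceptor(
	ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn,
	method string, streamer grpc.Streamer, opts ...grpc.CallOption,
) (grpc.ClientStream, error) {
	span, ctx := startSpan(ctx, method)
	cs, err := streamer(ctx, desc, cc, method, opts...)
	if err != nil {
		span.Finish()
		return nil, err
	}
	return newTracingClientStream(cs, span), nil
}
```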

The crash I'm talking about is here. This code waits for the "stream to end" through either Recv() returning io.EOF or the stream's ctx being canceled. And it's this cs.Context() call that seems to sometimes (rarely) crash with an NPE from gRPC (from here). I'm not sure why it crashes, and I'm also not sure that we're doing the right thing in the first place.
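
The wrapper and the waiting logic, continuing the sketch above, look roughly like this:

```go
// tracingClientStream wraps the real ClientStream and finishes the span when
// the stream ends: either RecvMsg returns an error (io.EOF included), or the
// stream's context is canceled.
type tracingClientStream struct {
	grpc.ClientStream
	span     Span
	finished chan struct{}
	ended    bool // gRPC forbids concurrent RecvMsg calls, so a plain bool is fine
}

func newTracingClientStream(cs grpc.ClientStream, span Span) grpc.ClientStream {
	finished := make(chan struct{})
	tcs := &tracingClientStream{ClientStream: cs, span: span, finished: finished}
	go func() {
		select {
		case <-finished:
			// RecvMsg observed the end of the stream.
		case <-cs.Context().Done():
			// This cs.Context() call is the one that rarely panics.
		}
		span.Finish()
	}()
	return tcs
}

func (s *tracingClientStream) RecvMsg(m interface{}) error {
	err := s.ClientStream.RecvMsg(m)
	if err != nil && !s.ended { // io.EOF or a real error: the stream is over
		s.ended = true
		close(s.finished)
	}
	return err
}
```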

The contract of ClientStream says:

// It should not be called until after Header or RecvMsg has returned. Once
// called, subsequent client-side retries are disabled.

I believe we're breaking this contract, as we're calling Context() before RecvMsg(). At least, before our code calls RecvMsg(); I also don't see gRPC itself making such a call at ClientStream creation time. So perhaps this is our problem. Except that, if the answer were this clear, I'd expect us to be crashing all the time, and we aren't.

I've looked a bit deeper into the gRPC code to try to understand where Context()'s contract comes from, but I couldn't figure it out. The only interesting thing I found was a comment on the cs.attempt field that says:

So, attempt can be nil only inside newClientStream function when clientStream is first created.

I believe the issue that this comment hints at is not relevant to us, because our Context() call comes after newClientStream(). The panicking line is cs.attempt.s.Context(), and attempt.s.ctx is initialized early, as far as I can tell. So I'm not sure why ClientStream.Context() seems to insist on not being called before RecvMsg(). Perhaps you could clarify?


The second part of my question is about whether our code is generally sane / what it should look like. What's the recommended way to be notified of a streaming RPC's end? We're doing two things that I find dubious:

  1. We're using this ClientStream.Context(), as opposed to the caller's context.
  2. We're installing a finalizer, as a catch-all (sketched below).
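
The finalizer bit amounts to a few lines added inside newTracingClientStream from the sketch above (again just a sketch):

```go
	// As a catch-all, finish the span at GC time if nothing else did.
	runtime.SetFinalizer(tcs, func(s *tracingClientStream) {
		s.span.Finish()
	})
```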

Do these look right to you?

We've copied this code from grpc-opentracing, which is, or at least used to be, a fairly high-profile library. In particular, they motivate the finalizer with a reference to a Stack Overflow topic that seems dubious to me. Do you agree with it?

I've checked how other similar libraries do this, and the answer always seems to be different:

  • the more modern gRPC OpenTracing code does not concern itself with either ctx cancellation or a finalizer.
  • the OpenTelemetry equivalent interceptor code does have a similar context cancellation provision, added explicitly when someone complained about a goroutine leak when the stream is canceled, but this code uses the user's ctx, not the stream's internal ctx. Is this better?

Thank you very much for your help! Please let me know if I should clarify anything.

@hexfusion
Contributor

Possibly related to #5323 ?

@andreimatei
Author

Thanks @hexfusion! It certainly seems to me like that patch could fix our panic. @dfawley what do you think?

@dfawley
Copy link
Member

dfawley commented Apr 29, 2022

#5323 should only fix cases related to retry and a race that happens in a very small window. Do you have retries enabled (through service config either specified by the client to Dial or by the service owner through the name resolver (e.g. DNS TXT records, xDS configuration))? I don't believe it's possible for the race #5323 should fix to happen without an explicit retry configuration.

@andreimatei
Author

andreimatei commented Apr 29, 2022

I do not believe we explicitly enable any sort of retries. But, from what I can gather, "transparent retries" are always enabled; can those not trigger the #5323 race? I'm also confused by this comment on the WithDisableRetry dial option, which claims that retries (I guess the non-transparent kind) are always enabled - but how can that be/which error codes would be automatically retried?
Anecdotally, the two crash reports that we've had are both potentially related to networking trouble: one of them was a laptop coming out of sleep, another one was an automated chaos test that kills processes.

Anyway, if you don't believe #5323 is my answer, I'm curious to understand if our use of ClientStream.Context() detailed above is legal according to Context()'s contract. And, more generally, if you've got the time, whether our interceptor (or, rather, OpenTracing's interceptor) is sane.

@dfawley
Member

dfawley commented Apr 29, 2022

> how can that be/which error codes would be automatically retried?

"Retry support" is always enabled unless that dial option is set (and that environment variable no longer does anything; sorry). However, retries are not configured to happen by default except for transparent retries. It requires explicit configuration to declare the retryable status codes.

"transparent retries" are always enabled; can those not trigger the #5323 race?

They are. And they could under very tight races (it might take 2 races hitting at once). Basically your first attempt has to create a transport stream successfully, then that stream ends up getting a REFUSED_STREAM response or a GOAWAY that was already in flight when the stream was created to enable a transparent retry. Then to get the nil pointer in the attempt, you would need to select the transport and then fail to create the stream but not transparently retry it. (Calling Context on the stream could do the second half of that reliably, since it commits the RPC.)

> Anyway, if you don't believe #5323 is my answer

This change should make this whole class of error impossible. So I think it will fix your panics. If you have a way to test that theory, it would be a nice data point to see if the PR works for you.

> I'm curious to understand if our use of ClientStream.Context() detailed above is legal according to Context()'s contract

It's okay to call Context early. I wouldn't recommend it, though, since it commits the current stream attempt and makes any kind of retrying impossible. (Actually, because of a bug fixed by #5323, it could retry currently when stream creation fails and lead to seeing the wrong context.) But nothing in our API should ever give you a panic unless you've done something very obviously wrong like (*grpc.ClientConn)(nil).NewStream or something.

> whether our interceptor (or, rather, OpenTracing's interceptor) is sane

I took a look at that stack overflow question and added a comment about clients not receiving the io.EOF (they are broken unless they cancel the RPC context or close the clientconn). OpenTelemetry seems fine to be using the user's context, as cancelation of that will cancel and clean up the stream correctly. The only reason to use the other context is to get the peer address / etc that grpc adds into the context.

Generally, interceptors should be able to assume clients are following the rules listed here: https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream

Not doing so leaks resources in gRPC-Go.
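
Concretely, a well-behaved caller looks roughly like this (the pb.ThingServiceClient / Watch / WatchRequest names are made up; the point is the defer cancel() and/or draining until a non-nil error):

```go
import (
	"context"
	"io"

	pb "example.com/thing/v1" // hypothetical generated package
)

// wellBehavedCall shows the pattern the NewStream docs require: cancel the
// RPC's context when done with the stream, and/or keep calling Recv until it
// returns a non-nil error (io.EOF on clean termination).
func wellBehavedCall(ctx context.Context, client pb.ThingServiceClient) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // guarantees cleanup even if we stop reading early

	stream, err := client.Watch(ctx, &pb.WatchRequest{})
	if err != nil {
		return err
	}
	for {
		msg, err := stream.Recv()
		if err == io.EOF {
			return nil // the server ended the stream cleanly
		}
		if err != nil {
			return err
		}
		_ = msg // process the message here
	}
}
```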

@andreimatei
Author

Thanks Doug!

>> how can that be/which error codes would be automatically retried?

"Retry support" is always enabled unless that dial option is set (and that environment variable no longer does anything; sorry). However, retries are not configured to happen by default except for transparent retries. It requires explicit configuration to declare the retryable status codes.

"transparent retries" are always enabled; can those not trigger the #5323 race?

> They are. And they could under very tight races (it might take 2 races hitting at once). Basically your first attempt has to create a transport stream successfully, then that stream ends up getting a REFUSED_STREAM response or a GOAWAY that was already in flight when the stream was created to enable a transparent retry. Then to get the nil pointer in the attempt, you would need to select the transport and then fail to create the stream but not transparently retry it. (Calling Context on the stream could do the second half of that reliably, since it commits the RPC.)

>> Anyway, if you don't believe #5323 is my answer

> This change should make this whole class of error impossible. So I think it will fix your panics. If you have a way to test that theory, it would be a nice data point to see if the PR works for you.

Great! I will do some experiments to see how easily I can reproduce the panic (it's definitely not frequent). If I can reproduce it, I will try your patch and report back.

>> I'm curious to understand if our use of ClientStream.Context() detailed above is legal according to Context()'s contract

> It's okay to call Context early. I wouldn't recommend it, though, since it commits the current stream attempt and makes any kind of retrying impossible. (Actually, because of a bug fixed by #5323, it could retry currently when stream creation fails and lead to seeing the wrong context.) But nothing in our API should ever give you a panic unless you've done something very obviously wrong like (*grpc.ClientConn)(nil).NewStream or something.

I see; so the comment on ClientStream.Context() that says to not call it until Recv() is referring to the "committing" of the ClientStream to a... lower-level stream(?), and thus the inhibiting of transparent retries? Even though the comment mentions the retries, and I stared at it for a while, it was not obvious to me that the "don't use it" part is simply about the retries.

>> whether our interceptor (or, rather, OpenTracing's interceptor) is sane

> I took a look at that stack overflow question and added a comment about clients not receiving the io.EOF (they are broken unless they cancel the RPC context or close the clientconn).

Thank you; that's what I thought. So the finalizer thing that we and OpenTracing do is nonsense (assuming that the RPC callers are well behaved), right?

> OpenTelemetry seems fine to be using the user's context, as cancelation of that will cancel and clean up the stream correctly. The only reason to use the other context is to get the peer address / etc that grpc adds into the context.

Ack. Then I will also switch our code to listen on the user's context, which should fix the panic independently of #5323.
Just to make sure - the ClientStream contexts inherit from the client ctx, right?
And you agree that indeed the right thing to do is to listen for context cancellation as one possible trigger for finishing the tracing span? So, basically, the way OpenTelemetry does its thing - providing its own wrapper implementation of ClientStream and also listening for ctx cancellation - is a good way of doing things?

> Generally, interceptors should be able to assume clients are following the rules listed here: https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream

> Not doing so leaks resources in gRPC-Go.

Ack.

andreimatei added a commit to andreimatei/cockroach that referenced this issue May 2, 2022
Before this patch, our client-side tracing interceptor for streaming RPC
calls was exposed to gRPC bugs currently being fixed in
github.com/grpc/grpc-go/pull/5323. This had to do with calls to
clientStream.Context() panicking with an NPE under certain races with
RPCs failing. We've recently gotten two crashes seemingly because of
this. It's unclear why this hasn't shown up before, as nothing seems new
(either on our side or on the grpc side). In 22.2 we do use more
streaming RPCs than before (for example for span configs), so maybe that
does it.

This patch eliminates the problem by removing the
problematic call into ClientStream.Context(). The background is that
our interceptor needs to watch for ctx cancelation and consider the RPC
done at that point. But there was no reason for that call; we can more
simply use the RPC caller's ctx for the purposes of figuring out if the
caller cancels the RPC. In fact, calling ClientStream.Context() is bad
for other reasons, beyond exposing us to the bug:
1) ClientStream.Context() pins the RPC attempt to a lower-level
connection, and inhibits gRPC's ability to sometimes transparently
retry failed calls. In fact, there's a comment on ClientStream.Context()
that tells you not to call it before using the stream through other
methods like Recv(), which imply that the RPC is already "pinned" and
transparent retries are no longer possible anyway. We were breaking
this.
2) One of the grpc-go maintainers suggested that, due to the bugs
referenced above, this call could actually sometimes give us "the
wrong context", although how wrong exactly is not clear to me (i.e.
could we have gotten a ctx that doesn't inherit from the caller's ctx?
Or a ctx that's canceled independently from the caller?)

This patch also removes a paranoid catch-all finalizer in the
interceptor that ensured that the RPC client span is always eventually
closed (at a random GC time), regardless of what the caller does - i.e.
even if the caller forgets about interacting with the response stream or
canceling the ctx.  This kind of paranoia is not needed. The code in
question was copied from grpc-opentracing[1], which quoted a
StackOverflow answer[2] about whether or not a client is allowed to
simply walk away from a streaming call. As a result of conversations
triggered by this patch [3], that SO answer was updated to reflect the fact
that it is not, in fact, OK for a client to do so, as it will leak gRPC
resources. The client's contract is specified in [4] (although arguably
that place is not the easiest to find by a casual gRPC user). In any
case, this patch gets rid of the finalizer. This could in theory result
in leaked spans if our own code is buggy in the way that the paranoia
prevented, but all our TestServers check that spans don't leak like that
so we are pretty protected. FWIW, a newer re-implementation of the
OpenTracing interceptor[5] doesn't have the finalizer (although it also
doesn't listen for ctx cancellation, so I think it's buggy), and neither
does the equivalent OpenTelemetry interceptor[6].

Fixes cockroachdb#80689

[1] https://github.com/grpc-ecosystem/grpc-opentracing/blob/8e809c8a86450a29b90dcc9efbf062d0fe6d9746/go/otgrpc/client.go#L174
[2] https://stackoverflow.com/questions/42915337/are-you-required-to-call-recv-until-you-get-io-eof-when-interacting-with-grpc-cl
[3] grpc/grpc-go#5324
[4] https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream
[5] https://github.com/grpc-ecosystem/go-grpc-middleware/blob/master/tracing/opentracing/client_interceptors.go#L37-L52
[6] https://github.com/open-telemetry/opentelemetry-go-contrib/blame/main/instrumentation/google.golang.org/grpc/otelgrpc/interceptor.go#L193

Release note: A rare crash indicating a nil-pointer dereference in
google.golang.org/grpc/internal/transport.(*Stream).Context(...)
was fixed.
andreimatei added a commit to andreimatei/cockroach that referenced this issue May 3, 2022
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue May 3, 2022
80476: security: update the TLS cipher suite list r=bdarnell a=knz

Fixes #80483 

This does not really change the list, it merely explains more clearly
how it was built.

Release note: None

80635: release: use preflight tool for RedHat images r=celiala a=rail

RedHat Connect has introduced a new certification workflow, which
requires running some tests locally and submitting the results back to
RedHat.

This patch runs the `preflight` tool to address the new workflow
changes.

Fixes #80633

Release note: None

80878: util/tracing: fix crash in StreamClientInterceptor r=andreimatei a=andreimatei


Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Rail Aliiev <rail@iqchoice.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
@dfawley
Member

dfawley commented May 3, 2022

> I see; so the comment on ClientStream.Context() that says to not call it until Recv() is referring to the "committing" of the ClientStream to a... lower-level stream(?), and thus the inhibiting of transparent retries?

Yes, that sounds right.

> Thank you; that's what I thought. So the finalizer thing that we and OpenTracing do is nonsense (assuming that the RPC callers are well behaved), right?

Yes, I don't think that should be required, assuming your application is following the requirements that the gRPC documentation lists. The finalizer idea may originate from this comment.

> Just to make sure - the ClientStream contexts inherit from the client ctx, right?

Yes, it is a child of the application's context passed to NewStream or Invoke.

> And you agree that indeed the right thing to do is to listen for context cancellation as one possible trigger for finishing the tracing span?

If you are installing a stats.Handler, then looking for a stats.End event would be even simpler. This would be guaranteed to be called if the user is doing any of the required final steps.
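
A rough sketch of that approach (startSpan, contextWithSpan, and spanFromContext are placeholders for whatever span helpers your tracing library provides):

```go
import (
	"context"

	"google.golang.org/grpc/stats"
)

// spanStatsHandler starts a span when the RPC is tagged and finishes it when
// the *stats.End event arrives, which gRPC delivers once the RPC is over.
type spanStatsHandler struct{}

func (spanStatsHandler) TagRPC(ctx context.Context, info *stats.RPCTagInfo) context.Context {
	span, ctx := startSpan(ctx, info.FullMethodName)
	return contextWithSpan(ctx, span) // placeholder: stash the span in the ctx
}

func (spanStatsHandler) HandleRPC(ctx context.Context, s stats.RPCStats) {
	if _, ok := s.(*stats.End); ok {
		if span := spanFromContext(ctx); span != nil { // placeholder lookup
			span.Finish()
		}
	}
}

func (spanStatsHandler) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context {
	return ctx
}

func (spanStatsHandler) HandleConn(context.Context, stats.ConnStats) {}
```

It would be installed on the client with grpc.WithStatsHandler(spanStatsHandler{}) when dialing.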

> So, basically, the way OpenTelemetry does its thing - providing its own wrapper implementation of ClientStream and also listening for ctx cancellation - is a good way of doing things?

Yes, this would work if you have an interceptor and not a stats handler.

@dfawley
Member

dfawley commented May 10, 2022

Let us know if you have any more questions on this.

@dfawley dfawley closed this as completed May 10, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 7, 2022