Leaked ClientHttp2Stream when canceling duplex streams #2184

Closed
Matt-Esch opened this issue Aug 6, 2022 · 11 comments
Comments

@Matt-Esch

I am upgrading a codebase from native grpc to grpc-js. When using grpc-js I see a memory leak affecting duplex streams, caused by unclosed ClientHttp2Stream instances on the caller side. My best guess at the cause is that the client cancels the request and the underlying ClientHttp2Stream waits indefinitely (or until the socket is closed) for trailers that never arrive. So while the leak appears on the client side, it may well be caused by buggy server-side behavior.

  • Is there an option to configure how long to wait for trailers?
  • Any suggestions for debugging whether the server actually sent trailers, and if not, why not?

I don't have a deterministic reproduction; some fraction of calls leak when cancelled, so I don't know what state the connection needs to be in for the leak to occur.
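For context, this is roughly the client-side pattern involved (a minimal sketch; the service and method names here are made up for illustration and are not from the real codebase):

```js
// Hypothetical repro sketch: open a bidirectional (duplex) call with
// @grpc/grpc-js, write a message, then cancel from the client side.
// The Echo service and bidiEcho method are made-up names.
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

const packageDef = protoLoader.loadSync('example.proto');
const proto = grpc.loadPackageDefinition(packageDef);

const client = new proto.example.Echo(
  'localhost:50051',
  grpc.credentials.createInsecure()
);

const call = client.bidiEcho();
call.on('data', (msg) => { /* consume responses */ });
call.on('error', () => { /* a CANCELLED error is expected after cancel() */ });
call.write({ message: 'hello' });

// Cancelling should release the underlying ClientHttp2Stream, but some
// fraction of cancelled calls keep it (and its Http2CallStream) alive.
setTimeout(() => call.cancel(), 100);
```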

Environment

  • Ubuntu 20.04.4 LTS (GNU/Linux 5.13.0-1019-aws aarch64)
  • Node v12.22.6 and v16.16.0
  • @grpc/grpc-js@1.6.8

Additional context

Related to #1339

Heapdump shows leaked Http2CallStreams (shown with distance -):
[screenshot]

Http2CallStream instances are leaked through the bound this on the ClientHttp2Stream trailers event listener:
[screenshot]

Looking at the finalStatus, we can see that the calls were cancelled on the client side:
[screenshot]
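(As an aside, for anyone trying to reproduce this: one way to capture a comparable heap snapshot from a running Node process, not necessarily how the snapshots above were produced, is to trigger v8.writeHeapSnapshot() on a signal and open the resulting file in Chrome DevTools.)

```js
// Sketch: write a heap snapshot on demand so it can be inspected in the
// Chrome DevTools Memory tab. v8.writeHeapSnapshot() exists in Node >= 11.13.
const v8 = require('v8');

process.on('SIGUSR2', () => {
  // Writes a Heap-<timestamp>-<pid>.heapsnapshot file to the working
  // directory and returns its filename.
  const file = v8.writeHeapSnapshot();
  console.log(`heap snapshot written to ${file}`);
});
```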

@Matt-Esch
Author

@murgatroid99 I can't see how #2187 would stop the listener from holding a reference; what am I missing? I understand it's worthwhile to be consistent with the other listeners, but wouldn't it be better to replace .on with .once so that the listener is removed once the trailers are received? That would only go so far, though, because the retainer is ultimately the ClientHttp2Stream, which maintains a list of all of the open streams.
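To illustrate the .on versus .once point in plain Node http2 terms (a simplified sketch, not the actual grpc-js code):

```js
// Simplified illustration (not the grpc-js internals): an on() listener stays
// in the stream's _events list until the stream is destroyed, keeping its
// bound `this` reachable; a once() listener is removed after the first
// 'trailers' event. If trailers never arrive, neither variant helps.
const http2 = require('http2');

const session = http2.connect('http://localhost:50051');
const stream = session.request({ ':method': 'POST', ':path': '/example' });

class CallState {
  handleTrailers(trailers) {
    console.log('trailers received', trailers);
  }
}
const state = new CallState();

// Retains `state` for the stream's whole lifetime:
stream.on('trailers', state.handleTrailers.bind(state));

// Released as soon as trailers arrive (but not if they never do):
stream.once('trailers', state.handleTrailers.bind(state));
```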

If we look at the _events context we still see retainers in the events list; I just think that the heapdump visualisation doesn't make this obvious. We can also see that the ClientHttp2Stream is retained by the ClientHttp2Session, so even if we removed all references to the Http2CallStream, the ClientHttp2Stream would still leak.

[screenshot]

Ultimately the session is not closed for some reason. I could not correlate the session with an open session on the server side, but that is because I have no way of identifying which session belongs to which Http2CallStream.

@murgatroid99
Member

I made that change based on the information provided. You showed a screenshot with a single retainer, so I took action regarding that single retainer. Does your heapdump still show however many leaked Http2CallStream instances? If so, what retainers are listed for them now?

Changing .on to .once wouldn't affect anything because there isn't supposed to be a trailers event after the request is cancelled. And the ClientHttp2Session is not supposed to be closed at this point, because that's how HTTP/2 works.

grpc-js calls ClientHttp2Stream#close to close the connection when ending requests, including cancellation. The http2 documentation says in that case that a stream will be destroyed when "The http2stream.close() method is called, and (for client streams only) pending data has been read." So, it seems likely that the problem is that the stream is still alive because pending data has not yet been read. I will have to investigate how to handle that, but in the meantime, you may be able to avoid the leak simply by draining incoming data on calls that you have cancelled.
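A sketch of that suggested workaround, assuming the duplex call object returned by the grpc-js API (not verified against the library internals):

```js
// Possible workaround sketch: after cancelling, keep reading the response
// side of the call so any pending http2 data is consumed and the underlying
// ClientHttp2Stream can actually be destroyed.
function cancelAndDrain(call) {
  call.on('error', () => { /* a CANCELLED error is expected here */ });
  call.cancel();
  call.resume();              // put the stream (back) into flowing mode...
  call.on('data', () => {});  // ...and discard anything that still arrives
}
```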

@Matt-Esch
Author

> So, it seems likely that the problem is that the stream is still alive because pending data has not yet been read.

Which makes sense I guess, but I don't understand why data is not being read. The stream is in flowing mode so it should be emitting data events until the stream is drained, no?

@Matt-Esch
Author

It's possible that the session needs more tracking; this issue is interesting: nodejs/node#42710

@murgatroid99
Member

The stream flow is connected to the stream returned by the gRPC API. If you have that stream in flowing mode and you are still experiencing this leak, then there may be a disconnect somewhere internally.

It is unlikely that the linked Node issue is related. That one is about sessions leaking on the server, while this is about streams leaking on the client.

@murgatroid99
Member

I think I see the issue. Normally the API-level stream flow is connected to the http2 stream by a cycle: the API-level stream requests a message, the http2 stream outputs one, and that output triggers the API-level stream to request another. That cycle is broken when the code processing the http2 stream's output stops outputting messages after the call ends, leaving the stream in a paused state forever.
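Roughly the kind of pull cycle being described, as a generic Node streams sketch (an illustration of the pattern, not the actual grpc-js code; requestOneMessage is a made-up name):

```js
// Generic illustration of a pull-driven message cycle: _read() asks the
// underlying source for one message, and delivering that message (push)
// causes Node to call _read() again, continuing the cycle. If the source
// stops delivering after the call ends, _read() is never satisfied and the
// underlying http2 stream sits paused, pinning its resources.
const { Readable } = require('stream');

class MessageStream extends Readable {
  constructor(source) {
    super({ objectMode: true });
    this.source = source;
  }

  _read() {
    // Hypothetical source API: ask for exactly one decoded message.
    this.source.requestOneMessage((msg) => this.push(msg));
  }
}
```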

@Matt-Esch
Author

I did try resuming from the caller side and it didn't resolve the issue, but presumably this is because of a disconnect? What I was still seeing was:

[screenshot]

@Matt-Esch
Author

I have tested #2193 and so far it looks like it has plugged a leak, but I still have an issue with http2: #1652 is still causing my process to crash with node v16.16.0.

@murgatroid99
Member

Do you think #1652 is related to this leak, or is it just another problem you are having?

@Matt-Esch
Author

I suspect that the double free is triggered by cancel and that the fix put in place triggers it more frequently. This is still a bug with node, but my feeling is that grpc-js is not production ready unless node http2 is also production ready, and because grpc-js doesn't need to rely on node's http2 implementation, it should still be an open related issue that people who are looking to use this library should be aware of.

@murgatroid99
Member

First, I have released #2193 in grpc-js version 1.6.10.

> I suspect that the double free is triggered by cancel and that the fix put in place triggers it more frequently.

I can see now how these would be connected. Fixing a memory leak inherently causes resources to be freed more consistently, which would make it more likely for a double free bug to be triggered.

> grpc-js doesn't need to rely on node's http2 implementation

grpc-js absolutely does need to rely on Node's http2 implementation. That is one of the main reasons the library exists at all.
