netty: default grace time for RPCs can leak memory #5443

Merged
merged 2 commits into grpc:master from fabiokung:fix-rpc-graceful-leak on Mar 18, 2019

Conversation

fabiokung
Contributor

tl;dr

Using Long.MAX_VALUE as the delay for Netty's NioEventLoop#schedule can cause (long) deadlines for scheduled tasks to wrap into negative values, which is unspecified behavior. Recent versions of netty are guarding against overflows, but not all versions of grpc-java are using a recent
enough netty.
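
To illustrate the overflow with plain Java (a minimal sketch, not Netty code; the clock value is just a stand-in for the event loop's internal nanoTime):

```java
public class DeadlineOverflowDemo {
  public static void main(String[] args) {
    long now = 123_456_789L;               // stand-in for the scheduler's monotonic clock reading
    long delayNanos = Long.MAX_VALUE;      // "infinite" delay passed to schedule(...)
    long deadlineNanos = now + delayNanos; // long addition overflows
    System.out.println(deadlineNanos);     // prints a negative deadline: unspecified scheduler behavior
  }
}
```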

When connections are gracefully closed, Netty's Http2ConnectionHandler sets up a timeout task forcing resources to be freed after a grace period. When deadlines for these tasks wrap into negative values, a race was observed in production systems (with grpc-java <= 1.12.1) causing Netty
to never properly release buffers associated with active streams in the connection being (gracefully) closed.

Once I explicitly forced a reasonable value (5s) for Http2ConnectionHandler#gracefulShutdownTimeoutMillis (through NettyServerBuilder#maxConnectionAgeGrace), the memory leak went away.
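
For reference, a minimal sketch of that workaround (port and service registration are placeholders, not our actual configuration):

```java
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import java.util.concurrent.TimeUnit;

class BoundedGraceServer {
  static Server build() {
    return NettyServerBuilder.forPort(8980)
        // Bounded grace period instead of the infinite default, so the forced
        // shutdown timeout actually fires and stuck streams get reaped.
        .maxConnectionAgeGrace(5, TimeUnit.SECONDS)
        // .addService(...) // register gRPC services here
        .build();
  }
}
```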

Recent versions of grpc-java are potentially not impacted by this, or are less impacted, since Netty now protects against overflows.

Unfortunately, we are currently stuck with 1.10.x (and potentially 1.12.x in the short term), so it would be great to have the fix backported all the way back to these branches. The fix is very simple; let me know if I can help backport it.

Longer report

We are observing a (slow) memory leak with grpc-java 1.10.1 and 1.12.1 (which seems to be present in current master as well, under some conditions). Our gRPC servers see some number of channels being disconnected without a full HTTP/2 shutdown (properly closing both sides). We have no control over this, as clients can just terminate TCP connections without going through a full ManagedChannel#shutdown(). Our logs will occasionally show:

WARNING: Stream Error
io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2Exception$StreamException: Stream closed before write could take place

That should be harmless, but this fact may be relevant later on.

Looking at heap dumps from instances with 5-30GB of memory leaked, all of the extra memory is off-heap (DirectByteBufs). Unfortunately, I cannot share the heap dumps as they contain sensitive information.

The leaked DirectByteBufs are retained by grpc's netty PooledUnsafeDirectByteBuf objects:

[screenshot: heap dump showing the leaked DirectByteBufs retained by PooledUnsafeDirectByteBuf objects]

Looking at what's retaining these buffers ("shortest path to GC roots"), we can see that Http2ConnectionHandler$ClosingChannelFutureListener objects are part of the reference graph:

[screenshot: shortest path to GC roots, with Http2ConnectionHandler$ClosingChannelFutureListener in the reference graph]

To me that was a strong indication that buffers are being leaked while connections/channels/streams are being closed. We can also see that buffers are being held in the DefaultHttp2RemoteFlowController$FlowState#pendingWriteQueue.

I then tried to understand why these buffers were not being flushed, and got as far as WeightedFairQueueByteDistributor not being able to write them (possibly due to streamableBytes already being 0 because of the onStreamClosed() callback, or because the underlying window of the connection will never be > 0, since the connection is already closed). The DefaultHttp2Connection$DefaultStream holding all these pendingWrite buffers is in a HALF_CLOSED_REMOTE state, and DefaultHttp2RemoteFlowController$FlowState instances are not writeable:

[screenshot: DefaultHttp2Connection$DefaultStream in HALF_CLOSED_REMOTE state with buffered pendingWrites]

My read of the Netty code (and there is a high probability I missed something) is that active channels in this state will stay stuck forever, until they are forcibly reaped by DefaultHttp2RemoteFlowController$FlowState#cancel, which only happens when both sides of the stream are closed (HALF_CLOSED is not enough).

Next thing I started to investigate is why these streams were never being forced closed. It turns out Http2ConnectionHandler$ClosingChannelFutureListener#timeoutTask had negative deadline values:

[screenshot: Http2ConnectionHandler$ClosingChannelFutureListener#timeoutTask with a negative deadline]

These timeout tasks were apparently never being executed by the NioEventLoop (executor for scheduled tasks in netty), so streams were never being forced closed.

... which brings us to the fix in this PR, and to my current workaround of always explicitly setting a reasonable value for NettyServerBuilder#maxConnectionAgeGrace. We are using 5s, since we don't have any RPCs that should take longer than that after channels are closed.

Regardless of fixing negative deadlines for scheduled timeoutTasks, from what I can tell there may still be an issue with timeouts being too large. Memory may still be leaked if streams are never forced closed and grace times for RPC calls are too high (or indefinite, the default).

A proper fix for this seems more complicated, and may need changes in Netty to make sure all active streams are forced closed (both sides) when the underlying TCP connection is gone. At that point there is not much sense in holding onto buffers; pendingWrites will never be completed.

@thelinuxfoundation

Thank you for your pull request. Before we can look at your contribution, we need to ensure all contributors are covered by a Contributor License Agreement.

After the following items are addressed, please respond with a new comment here, and the automated system will re-verify.

Regards,
CLA GitHub bot

@fabiokung
Contributor Author

CLA signed (as an employee of Netflix).

@ejona86
Member

ejona86 commented Mar 7, 2019

CC @carl-mastrangelo

The fix to Netty was because we filed netty/netty#7970. IIRC that was a Netty regression, which I think was caused by netty/netty#7402 (which impacted Nio, even though the problem being fixed was with Epoll).

When this code was originally written, the -1 thing wouldn't work. It seems zpencer from our team changed Netty in netty/netty#6954 to add support, but gRPC didn't swap over to using it.

It looks like it would be beneficial to use -1 so that the ClosingChannelFutureListener will avoid scheduling a task. It seems that no harm should come from that task with any recent version of Netty, however.
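
A minimal sketch of that idea (helper and constant names are illustrative, not the actual patch): map gRPC's "infinite grace" sentinel to Netty's -1 before calling Http2ConnectionHandler#gracefulShutdownTimeoutMillis, so no timeout task is scheduled at all.

```java
import java.util.concurrent.TimeUnit;

final class GraceTimeConversionSketch {
  static final long MAX_CONNECTION_AGE_GRACE_NANOS_INFINITE = Long.MAX_VALUE;

  static long toGracefulShutdownTimeoutMillis(long graceTimeInNanos) {
    return graceTimeInNanos == MAX_CONNECTION_AGE_GRACE_NANOS_INFINITE
        ? -1L // Netty treats -1 as "no timeout", so no task gets scheduled
        : TimeUnit.NANOSECONDS.toMillis(graceTimeInNanos);
  }
}
```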


It is weird to backport what is essentially a workaround to an unsupported release of gRPC from over a year ago for a Netty bug. We only support the two most recent releases; while we can still help with releases older than that, this is a really old release. Could you give a bit of a glimpse why it is hard to upgrade from grpc-java v1.10?

But also, it looks like ClosingChannelFutureListener properly cleans up the scheduled timer when the channel closes. It seems like the channel must still be alive; if it was closed then the HTTP/2 code would have released the buffers. So I don't think this change alone will solve your problem.

It's unclear to me why the connections are still alive. The only way I'd know for them to be alive is if RPCs were still running on them. Are you doing long-lived RPCs?

maxConnectionAgeGrace defaults to infinite, but so does maxConnectionAge. Since you are setting maxConnectionAge, then it seems very reasonable to set maxConnectionAgeGrace as well. Although that should really only matter if you have long-lived RPCs.

@@ -60,7 +60,7 @@

static final long MAX_CONNECTION_IDLE_NANOS_DISABLED = Long.MAX_VALUE;
static final long MAX_CONNECTION_AGE_NANOS_DISABLED = Long.MAX_VALUE;
static final long MAX_CONNECTION_AGE_GRACE_NANOS_INFINITE = Long.MAX_VALUE;
Member

I'd much rather you leave all this as-is. Instead, only change secondGoAwayAndClose() to convert from this value to -1.

Contributor Author

No strong opinion there. Will do.

@@ -96,6 +96,7 @@
* the context of the Netty Channel thread.
*/
class NettyServerHandler extends AbstractNettyHandler {

Member

Please avoid full-file formatters; every developer can have slightly different settings and they can constantly "fight" causing lots of useless churn. Please revert these unnecessary changes.

Contributor Author

Will do. I'd suggest updating CONTRIBUTING.md with that info, I just followed it and used the recommended Google settings for IntelliJ.

Contributor Author

@fabiokung Mar 8, 2019

My bad, CONTRIBUTING.md already has it. I missed it somehow:

Don't fix code style and formatting unless you are already changing that line to address an issue. PRs with irrelevant changes won't be merged. If you do want to fix formatting or style, do that in a separate PR.

@fabiokung
Contributor Author

fabiokung commented Mar 8, 2019

@ejona86 I made the changes you suggested. PTAL.

> Could you give a bit of a glimpse why it is hard to upgrade from grpc-java v1.10?

Long story short: transitive dependencies. We have tons of other internal libraries stuck at guava 19.x, so we can not easily upgrade grpc-java. We are working on it, but it is going to take months.

> But also, it looks like ClosingChannelFutureListener properly cleans up the scheduled timer when the channel closes. It seems like the channel must still be alive; if it was closed then the HTTP/2 code would have released the buffers. So I don't think this change alone will solve your problem.

Exactly, channels never being closed is the real root cause here. My read of DefaultHttp2RemoteFlowController is that resources will only be freed when both sides of the Stream are closed, which is never happening (unless the timeout task from ClosingChannelFutureListener fires and forces it to be closed).

What solved my problem was the workaround of always forcing a reasonable (short) grace time for RPCs, since this seems to be the only way to force these Streams to be properly closed when the underlying TCP connection is terminated. This is what I tried to convey in the last two paragraphs of my "Longer report". If there is really a problem here, it seems dangerous having an infinite timeout (no timeout) as the default, as it will likely cause memory leaks often. TBH I'm a bit surprised this hasn't been experienced (or noticed) by others yet.

> It's unclear to me why the connections are still alive. The only way I'd know for them to be alive is if RPCs were still running on them. Are you doing long-lived RPCs?

All our RPCs are documented here: https://github.com/Netflix/titus-api-definitions/blob/0.0.1-rc57/src/main/proto/netflix/titus/titus_job_api.proto#L748

None of them are particularly long, and should take at most a few seconds to complete. There are a few server streaming RPCs (JobManagementService::ObserveJob[s]) that keep the channel open for longer. However, even with server/client stream RPCs, it seems to me that resources should be freed as soon as the underlying TCP connection is closed. An infinite timeout doesn't make sense here, since pending writes will never be able to complete.

Commit: Keep it consistent with other timeout configurations.
@ejona86
Member

ejona86 commented Mar 8, 2019

> internal libraries stuck at guava 19.x

Ah, that I fully understand. It's useful to know that not all of the old Guava versions are out of circulation yet. If there's something small we can do to help, we may be open to the idea. But I also understand it just sort of takes time.

> My read of DefaultHttp2RemoteFlowController is that resources will only be freed when both sides of the Stream are closed, which is never happening (unless the timeout task from ClosingChannelFutureListener fires and forces it to be closed).

Thanks for pointing that out. I think I may know what's happening. You may be bitten by half-close cleanup issues: #4202. The client-side fix was in 1.11. It seems the server-side is still not fixed: #2162. (Only one has to be fixed to avoid leaks, but we can't guarantee that the remote side will behave as we would hope.) Are the clients in a different language or were all tests using grpc-java 1.10 clients?

The easiest way to trigger this is for the server to respond with an error before the client half closed. That's easiest to trigger in a streaming RPC.
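
For concreteness, a generic sketch of that trigger (kept free of any particular proto; the type parameters stand in for generated request/response message types):

```java
import io.grpc.Status;
import io.grpc.stub.StreamObserver;

final class EarlyErrorSketch {
  // A streaming handler body that fails the RPC before the client half-closes,
  // i.e. while the client may still be sending messages.
  static <ReqT, RespT> StreamObserver<ReqT> handle(StreamObserver<RespT> responseObserver) {
    responseObserver.onError(
        Status.RESOURCE_EXHAUSTED.withDescription("rejected early").asRuntimeException());
    return new StreamObserver<ReqT>() {
      @Override public void onNext(ReqT value) {}   // client may still be sending
      @Override public void onError(Throwable t) {}
      @Override public void onCompleted() {}        // the client half-close, if it ever arrives
    };
  }
}
```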

> it seems to me that resources should be freed as soon as the underlying TCP connection is closed

The problem is the TCP connection is never being closed, because there are still believed to be RPCs on it :-).

@ejona86
Member

ejona86 commented Mar 8, 2019

The code looks good. We'll probably want to reword the commit message to make it more clear this is more of an optimization at this point.

@fabiokung
Contributor Author

Oh nice! The issues you linked seem to be the root cause indeed. Our clients are in varied languages and versions; we have no control over their code. Ideally we'd prefer to protect servers against any and all potentially bad clients.

> The easiest way to trigger this is for the server to respond with an error before the client half closed. That's easiest to trigger in a streaming RPC.

How can the server know when clients are about to cut their side of the connection?

> The problem is the TCP connection is never being closed, because there are still believed to be RPCs on it :-)

By TCP connection being closed I mean the actual socket is in a terminal state, because the TCP connection was broken (TCP FINs, timeouts, etc). I'm not familiar enough with the HTTP2 spec and Netty internals to know if streams will see an END_STREAM in all cases when it happens, but if so, your proposal on netty/netty#8275 seems reasonable.

> We'll probably want to reword the commit message to make it more clear this is more of an optimization at this point.

SGTM. Feel free to do it as you squash commits for merging. Or let me know if you prefer I amend mine.

@ejona86
Member

ejona86 commented Mar 8, 2019

> > The easiest way to trigger this is for the server to respond with an error before the client half closed. That's easiest to trigger in a streaming RPC.

> How can the server know when clients are about to cut their side of the connection?

I was saying all that from the HTTP/2 stream perspective; TCP connections didn't come into play. Client half close is just that the client finished sending the request.

@fabiokung
Contributor Author

It's pretty common in our environment for clients to cut the (server) streaming RPC by just doing exit(2). In any case these client processes can just die at any point without being a well-behaved http2/grpc client.

@ejona86
Member

ejona86 commented Mar 8, 2019

> In any case these client processes can just die at any point without being a well-behaved http2/grpc client.

If the clients die then things should be better off. If the client's kernel sends a TCP RST and it is received by the server, everything is golden; the channel will be cleaned up. If the client machine dies or there's a network breakage or the TCP RST is dropped by the network, the server may get stuck. That is why we have NettyServerBuilder.keepAliveTime. It should default to 2 hours today, so eventually the server will notice and clean things up.
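
For illustration, a sketch of tightening those settings so dead peers are noticed sooner (values are hypothetical, not recommendations):

```java
import io.grpc.netty.NettyServerBuilder;
import java.util.concurrent.TimeUnit;

final class KeepAliveSketch {
  static NettyServerBuilder tighten(NettyServerBuilder builder) {
    return builder
        .keepAliveTime(10, TimeUnit.MINUTES)     // ping an idle connection after 10 minutes
        .keepAliveTimeout(20, TimeUnit.SECONDS); // close it if the ping is not acked within 20 seconds
  }
}
```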

@fabiokung
Contributor Author

Got it. That is definitely not happening in my case, then; the memory leak builds up throughout the day, and we see ~40GB of RAM leaked over a period of ~11h, at which point we kill these processes/instances.

@fabiokung
Contributor Author

What if NettyServerBuilder.keepAliveTime is being interrupted/impacted by the Http2Connection being in the middle of a gracefulShutdown? (ClosingChannelFutureListener is involved)

@ejona86
Member

ejona86 commented Mar 8, 2019

If the keepalive times out, then it kills all RPCs: https://github.com/grpc/grpc-java/blob/v1.19.0/netty/src/main/java/io/grpc/netty/NettyServerHandler.java#L831. And that would catch ones leaked by the half-close issue.

@fabiokung
Contributor Author

fabiokung commented Mar 8, 2019

Hrmm, then I can't explain why the leaked half-closed ones are not being reaped by the ping timeouts. There might be another issue somewhere else.

We are using the default NettyServerBuilder.keepAliveTime on 1.10.1, which seems to be 2h (with a timeout of 20s).

@creamsoup added the kokoro:run label Mar 12, 2019
@grpc-kokoro removed the kokoro:run label Mar 12, 2019
@dapengzhang0 added this to the 1.20 milestone Mar 15, 2019
@dapengzhang0
Member

@carl-mastrangelo :shipit: ?

Contributor

@carl-mastrangelo left a comment

LGTM

@carl-mastrangelo added the kokoro:force-run label Mar 18, 2019
@grpc-kokoro removed the kokoro:force-run label Mar 18, 2019
@carl-mastrangelo merged commit 7df2d5f into grpc:master Mar 18, 2019
@carl-mastrangelo
Contributor

@fabiokung Merged, thanks!

@lock bot locked as resolved and limited conversation to collaborators Jun 16, 2019
@fabiokung deleted the fix-rpc-graceful-leak branch Jul 15, 2019