Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

netty: fix StreamBufferingEncoder GOAWAY bug #8020

Merged
merged 8 commits into from Apr 16, 2021

Conversation

dapengzhang0
Copy link
Member

@dapengzhang0 dapengzhang0 commented Mar 26, 2021

There is a bug in StreamBufferingEncoder such that when client receives GOWAY while there are pending streams due to MAX_CONCURRENT_STREAMS, we see the following error:

io.netty.handler.codec.http2.Http2Exception$StreamException: Maximum active streams violated for this endpoint.
        at io.netty.handler.codec.http2.Http2Exception.streamError(Http2Exception.java:147)
        at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultEndpoint.checkNewStreamAllowed(DefaultHttp2Connection.java:896)
        at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultEndpoint.createStream(DefaultHttp2Connection.java:748)
        at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultEndpoint.createStream(DefaultHttp2Connection.java:668)
        at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeHeaders0(DefaultHttp2ConnectionEncoder.java:201)
        at io.netty.handler.codec.http2.DefaultHttp2ConnectionEncoder.writeHeaders(DefaultHttp2ConnectionEncoder.java:167)
        at io.netty.handler.codec.http2.DecoratingHttp2FrameWriter.writeHeaders(DecoratingHttp2FrameWriter.java:53)
        at io.netty.handler.codec.http2.StreamBufferingEncoder.writeHeaders(StreamBufferingEncoder.java:153)
        at io.netty.handler.codec.http2.StreamBufferingEncoder.writeHeaders(StreamBufferingEncoder.java:141)
        at io.grpc.netty.NettyClientHandler.createStreamTraced(NettyClientHandler.java:584) 
        at io.grpc.netty.NettyClientHandler.createStream(NettyClientHandler.java:567)
        at io.grpc.netty.NettyClientHandler.write(NettyClientHandler.java:328)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:709)
        at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:792)
        at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:702)
        at io.netty.channel.DefaultChannelPipeline.write(DefaultChannelPipeline.java:1015)
        at io.netty.channel.AbstractChannel.write(AbstractChannel.java:289)
        at io.grpc.netty.WriteQueue$AbstractQueuedCommand.run(WriteQueue.java:213)
        at io.grpc.netty.WriteQueue.flush(WriteQueue.java:128)
        at io.grpc.netty.WriteQueue.drainNow(WriteQueue.java:114)
        at io.grpc.netty.NettyClientHandler.goingAway(NettyClientHandler.java:783)
        at io.grpc.netty.NettyClientHandler.access$300(NettyClientHandler.java:91)
        at io.grpc.netty.NettyClientHandler$3.onGoAwayReceived(NettyClientHandler.java:280)
        at io.netty.handler.codec.http2.DefaultHttp2Connection.goAwayReceived(DefaultHttp2Connection.java:236)
        at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.onGoAwayRead0(DefaultHttp2ConnectionDecoder.java:218)
        at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onGoAwayRead(DefaultHttp2ConnectionDecoder.java:551)
        at io.netty.handler.codec.http2.Http2InboundFrameLogger$1.onGoAwayRead(Http2InboundFrameLogger.java:119)
        at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readGoAwayFrame(DefaultHttp2FrameReader.java:591)
        at io.netty.handler.codec.http2.DefaultHttp2FrameReader.processPayloadState(DefaultHttp2FrameReader.java:272)
        at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readFrame(DefaultHttp2FrameReader.java:160)
        at io.netty.handler.codec.http2.Http2InboundFrameLogger.readFrame(Http2InboundFrameLogger.java:41)
        at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.decodeFrame(DefaultHttp2ConnectionDecoder.java:174)
        at io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:378)
        at io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:438)
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:498)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:437)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1486)
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1235)
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1282)
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:498)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:437)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)

Made a local copy of StreamBufferingEncoder with fix.
Also upstreamed the fix to netty/netty#11144

@dapengzhang0 dapengzhang0 force-pushed the fix-goaway branch 3 times, most recently from 2b37b39 to 95060b2 Compare March 26, 2021 00:41
@dapengzhang0
Copy link
Member Author

dapengzhang0 commented Mar 29, 2021

The bug should come from the way StreamBufferingEncoder.writeHeaders() handles the condition connection().goAwayReceived(). The current behavior is to delegate to super.writeHeaders() and let the stream fail, but this will end up with Http2Exception. See https://github.com/netty/netty/blob/e5951d46fc89db507ba7d2968d2ede26378f0b04/codec-http2/src/main/java/io/netty/handler/codec/http2/StreamBufferingEncoder.java#L152-L155

The right way to handle goaway condition should be either

  • Approach 1: create a pending stream anyway even if goAwayReceived() is true and let the external goaway handling logic (e.g. NettyClientHandler) to cleanup the pending streams.

or

  • Approach 2: abort the stream immediately with goaway details rather than fail with Http2Exception.

Copy link
Member

@ejona86 ejona86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this could be achieved without copying the buffering encoder.

This change looks like it might fail more RPCs that previously were allowed, because previously the failure required GOAWAY+MAX_CONCURRENT_STREAMS and now it just requires GOAWAY.

Is goAwayReceived() == true during

// Try to allocate as many in-flight streams as possible, to reduce race window of
?

@dapengzhang0
Copy link
Member Author

It looks like this could be achieved without copying the buffering encoder.

How? And shouldn't StreamBufferingEncoder be fixed anyway?

This change looks like it might fail more RPCs that previously were allowed, because previously the failure required GOAWAY+MAX_CONCURRENT_STREAMS and now it just requires GOAWAY.

We can switch the order of if-conditions to retain the old behavior, then this is almost the same as approach 1:

    if (isExistingStream(streamId) || canCreateStream()) { 
      return super.writeHeaders(ctx, streamId, headers, streamDependency, weight,
          exclusive, padding, endOfStream, promise);
    }
    if (connection().goAwayReceived()) {
      promise.setFailure(new Http2Exception(Http2Error.NO_ERROR, "GOAWAY received"));
      return promise;
    }

@ejona86
Copy link
Member

ejona86 commented Apr 7, 2021

How? And shouldn't StreamBufferingEncoder be fixed anyway?

Yes, we should fix any upstream bug in StreamBufferingEncoder. But I'd prefer to not copy the entire class if it were easy to avoid.

With the current change it looks like we'd just need a if (connection().goAwayReceived()) condition before where gRPC calls writeHeaders.

We can switch the order of if-conditions to retain the old behavior, then this is almost the same as approach 1:

I do think it is essential to check canCreateStream() before goAwayReceived() as the encoder is only supposed to deal with MAX_CONCURRENT_STREAMS, so everything else should be let through.

@dapengzhang0
Copy link
Member Author

we'd just need a if (connection().goAwayReceived()) condition before where gRPC calls writeHeaders.

As we are not able to check MAX_CONCURRENT_STREAMS number in NettyClientHandler, this looks like it might fail more RPCs that previously were allowed again.

@ejona86
Copy link
Member

ejona86 commented Apr 16, 2021

we'd just need a if (connection().goAwayReceived()) condition before where gRPC calls writeHeaders.

As we are not able to check MAX_CONCURRENT_STREAMS number in NettyClientHandler, this looks like it might fail more RPCs that previously were allowed again.

That was equivalent behavior before your recent Netty changes. With the changes that have been made to netty/netty#11144 it is no longer equivalent. It is possible to do the isExistingStream/canCreateStream checks within gRPC, but we'd need to monitor maxConcurrentStreams.

s = Status.INTERNAL.withDescription(
"Failed due to abrupt GOAWAY, but can't find GOAWAY details");
if (connection().goAwayReceived()) {
if (streamId > connection().local().lastStreamKnownByPeer()
Copy link
Member

@ejona86 ejona86 Apr 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ordinarily I'd ask you to add a comment mentioning that checking for numActiveStreams is a workaround until we update to a netty version with netty/netty#11144.

However, I think we may need the numActiveStreams check. The error message "Abrupt GOAWAY closed unsent stream. HTTP/2 error code: NO_ERROR" is misleading when max concurrent streams is hit. Should we construct a special status in the case connection().local().numActiveStreams() == connection().local().maxActiveStreams()? "At MAX_CONCURRENT_STREAMS limit and GOAWAY has been received. limit: " + maxActiveStreams()

I'm fine doing that as follow-up (or making a bug to track it; although that seems probably more work than just making the change).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #8097. Will fix in a follow-up PR.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants