
Address maxConcurrentStreams violation on write timeout #3908

Merged
merged 10 commits into line:master on Nov 23, 2021

Conversation

@jrhee17 (Contributor) commented on Oct 31, 2021

Background

Currently, Armeria maintains a piece of state, HttpResponseDecoder#unfinishedResponses, to track how many in-flight requests are being processed on a connection.

Armeria uses this value to decide whether all existing connections already hold too many concurrent streams, and creates a new connection if necessary.

On the other hand, Netty maintains its own state to track how many in-flight requests are being processed on a connection (DefaultHttp2Connection.DefaultEndpoint#numActiveStreams).

Netty checks this value before creating a stream and throws an Http2Exception$StreamException if the new stream would exceed MAX_CONCURRENT_STREAMS.

Problem Statement

Currently, when a WriteTimeoutException is triggered, Armeria decrements unfinishedResponses and removes the response. (A WriteTimeoutException is thrown when the request headers aren't written within the configured writeTimeoutMillis.)
However, Netty may not be aware that Armeria has failed the response. Consequently, Netty's numActiveStreams can become greater than Armeria's unfinishedResponses, which may cause additional requests on the connection to violate MAX_CONCURRENT_STREAMS.
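
To make the mismatch concrete, here is a small, self-contained sketch (illustration only; all class and method names are hypothetical, and this is not Armeria's or Netty's actual code) of the two counters and how they diverge after a write timeout:

// Simplified model of the two counters (all names hypothetical).
final class StreamAccountingSketch {

    // Plays the role of Armeria's HttpResponseDecoder#unfinishedResponses.
    static int unfinishedResponses;

    // Plays the role of Netty's DefaultHttp2Connection.DefaultEndpoint#numActiveStreams.
    static int numActiveStreams;

    static final int MAX_CONCURRENT_STREAMS = 1;

    // Mirrors the admission check: a new stream is rejected once the limit is reached.
    static void openStream() {
        if (numActiveStreams >= MAX_CONCURRENT_STREAMS) {
            throw new IllegalStateException("Maximum active streams violated for this endpoint.");
        }
        numActiveStreams++;
        unfinishedResponses++;
    }

    public static void main(String[] args) {
        openStream();

        // A write timeout: Armeria removes the response and decrements its counter,
        // but the stream is still active as far as Netty is concerned.
        unfinishedResponses--;

        // Armeria now sees a free slot (unfinishedResponses == 0) and reuses the
        // connection, while Netty rejects the new stream because numActiveStreams
        // is still at the limit.
        openStream();
    }
}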

Motivation
Netty always calls Http2ResponseDecoder.onStreamClosed before decrementing numActiveStreams.
If we want numActiveStreams to stay in sync with unfinishedResponses, I propose moving the decrement of unfinishedResponses to Http2ResponseDecoder.onStreamClosed.

In detail, when a WriteTimeoutException is scheduled

timeoutFuture = ch.eventLoop().schedule(
        () -> failAndReset(WriteTimeoutException.get()),
        timeoutMillis, TimeUnit.MILLISECONDS);

the response is closed.

Consequently, after the stream processes the close event, whenComplete is triggered.

resWrapper.whenComplete().handle((unused, cause) -> {
    if (eventLoop.inEventLoop()) {
        onWrapperCompleted(resWrapper, id, cause);
    } else {
        eventLoop.execute(() -> onWrapperCompleted(resWrapper, id, cause));
    }
    return null;
});

And the response is removed (and unfinishedResponses is decremented).

However, as far as Netty is concerned, the request may have been written and may still be in flight.

Misc

Reproduced the maxConcurrentStreams violation on WriteTimeoutException in 225a684

Modifications

  • Remove the removeResponse call from Http2ResponseDecoder.onWrapperCompleted, and rely on onStreamClosed to remove the response and decrement unfinishedResponses (see the sketch after this list).
  • When receiving the onHeadersRead, onDataRead, and onRstStreamRead callbacks, also check whether resWrapper has been closed. This preserves the previous behavior, since res used to be removed on WriteTimeoutException, resulting in res == null.
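
As a rough sketch of this direction (hypothetical and simplified, not the actual diff; removeResponse and shouldSendGoAway only echo the helpers quoted elsewhere in this PR), the removal moves into Netty's stream-closed callback:

import io.netty.handler.codec.http2.Http2EventAdapter;
import io.netty.handler.codec.http2.Http2Stream;

// Hypothetical sketch of the proposed timing; not the actual Armeria code.
final class OnStreamClosedSketch extends Http2EventAdapter {

    @Override
    public void onStreamClosed(Http2Stream stream) {
        // Netty invokes this callback before decrementing numActiveStreams, so removing
        // the response (and decrementing unfinishedResponses) here keeps Armeria's
        // accounting in step with Netty's.
        removeResponse(stream.id());

        // The connection may have reached its lifespan; checking here replaces the
        // shouldSendGoAway() checks previously done in onHeadersRead/onDataRead.
        if (shouldSendGoAway()) {
            closeChannel(); // stands in for channel().close() in the real decoder
        }
    }

    // Placeholders standing in for the real helpers referenced in this PR.
    private void removeResponse(int id) {}
    private boolean shouldSendGoAway() { return false; }
    private void closeChannel() {}
}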

Update

I realized that if we simply skip processing when headers/data/rst frames are received, we might not send a GOAWAY and close the connection when disconnectWhenFinished = true, due to df43379.

I've verified this behavior with the test cases added in 8018da1.

I've made further modifications such that:

  • Only remove responses when onStreamClosed is called.
  • Remove the calls to channel().close() when shouldSendGoAway() is true from onDataRead and onHeadersRead, since onStreamClosed will handle this instead.
  • Modify onStreamClosed to try to close the ResponseWrapper only if the underlying delegate is open.

d1183d8

There is a slight behavior change: a GOAWAY may now also be triggered from onRstStreamRead. Let me know if this change shouldn't be made 🙏

Result:

@jrhee17 force-pushed the bugfix/max-concurrent-streams-v2 branch from 5be65fc to e5d769f on November 1, 2021 05:26
@codecov (bot) commented on Nov 1, 2021

Codecov Report

Merging #3908 (dca247c) into master (de88d77) will increase coverage by 0.03%.
The diff coverage is 80.00%.


@@             Coverage Diff              @@
##             master    #3908      +/-   ##
============================================
+ Coverage     73.29%   73.32%   +0.03%     
- Complexity    15548    15563      +15     
============================================
  Files          1365     1365              
  Lines         59863    59882      +19     
  Branches       7598     7606       +8     
============================================
+ Hits          43877    43909      +32     
+ Misses        12136    12120      -16     
- Partials       3850     3853       +3     
Impacted Files Coverage Δ
...m/linecorp/armeria/client/HttpResponseDecoder.java 84.33% <ø> (-0.52%) ⬇️
.../linecorp/armeria/client/Http2ResponseDecoder.java 71.52% <80.00%> (+3.96%) ⬆️
...ecorp/armeria/server/LengthBasedServiceNaming.java 75.00% <0.00%> (-16.67%) ⬇️
.../linecorp/armeria/client/retrofit2/PipeBuffer.java 83.72% <0.00%> (-4.66%) ⬇️
...eria/client/eureka/EurekaEndpointGroupBuilder.java 31.25% <0.00%> (-1.65%) ⬇️
...ia/internal/common/stream/ByteBufDecoderInput.java 85.61% <0.00%> (-1.44%) ⬇️
...corp/armeria/client/RefreshingAddressResolver.java 75.00% <0.00%> (-1.43%) ⬇️
.../common/stream/DefaultStreamMessageDuplicator.java 77.56% <0.00%> (-1.26%) ⬇️
...com/linecorp/armeria/client/RedirectingClient.java 71.75% <0.00%> (-0.93%) ⬇️
... and 20 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jrhee17 jrhee17 marked this pull request as ready for review November 4, 2021 15:24
@minwoox minwoox added the defect label Nov 5, 2021
@minwoox (Member) left a comment

What an awesome analysis! 😄

@@ -104,6 +104,7 @@ void clientTimeout() throws InterruptedException {
        assertThat(loggingEventCaptor.getAllValues()).noneMatch(event -> {
            return event.getLevel() == Level.WARN &&
                   event.getThrowableProxy() != null &&
                   event.getThrowableProxy().getMessage() != null &&
@minwoox (Member) commented:

revert?

removeResponse(id);
// Removing the response and decrementing {@code unfinishedResponses} isn't done immediately
// here. Instead, we rely on {@code Http2ResponseDecoder#onStreamClosed} to decrement
// `unfinishedResponses` to match the timing where netty decrements {@code numActiveStreams}.
@minwoox (Member) commented on Nov 5, 2021:

How about?

Suggested change
// `unfinishedResponses` to match the timing where netty decrements {@code numActiveStreams}.
// `unfinishedResponses` after Netty decrements `numActiveStreams` in `DefaultHttp2Connection` so that `unfinishedResponses` is never greater than `numActiveStreams`.

(We don't have to use {@code ...} in the comment. 😉 )

@ikhoon ikhoon added this to the 1.14.0 milestone Nov 5, 2021
@ikhoon (Contributor) left a comment

Thanks!

@jrhee17 (Contributor, Author) commented on Nov 6, 2021

Sorry, reviewers.

While taking another look, I realized that the channel might not send a GOAWAY frame when necessary.
Let me add a unit test and a revised commit soon 🙏

@jrhee17 force-pushed the bugfix/max-concurrent-streams-v2 branch from bcc82f0 to d1183d8 on November 7, 2021 05:45
@jrhee17 (Contributor, Author) commented on Nov 7, 2021

Sorry 😅 I've added 3 commits; I'd appreciate it if you could take another look 🙏

@minwoox (Member) left a comment

Thanks, left some questions. 😄

.responseTimeoutMillis(0)
.build();

assertThat(client.get("/delayed?seconds=4").aggregate().join().status()).isEqualTo(OK);
@minwoox (Member) commented:

The request doesn't seem to be closed, unlike what the method name says?

I may be missing something, but if MAX_CONNECTION_AGE is two seconds and the response is sent after 4 seconds, shouldn't the connection be closed forcefully? 🤔

@jrhee17 (Contributor, Author) commented on Nov 9, 2021

Sorry about the late check

Maybe I'm misunderstanding the specification, but I thought that even if MAX_CONNECTION_AGE is reached, the client waits for requests to finish (so it doesn't forcefully fail long-running requests). I've updated the test name to better reflect this 😅

I've also organized how I think MAX_CONNECTION_AGE works (see also the configuration sketch after this list):

  1. A timer is scheduled for MAX_CONNECTION_AGE, but is only respected if there are no running requests:

    if (!isServer && !hasRequestsInProgress(ctx)) {
        logger.debug("{} Closing a {} connection exceeding the max age: {}ns",
                     ctx.channel(), name, maxConnectionAgeNanos);
        ctx.channel().close();
    }

  2. Instead, every time a request may have finished, we check whether MAX_CONNECTION_AGE has passed and there are no running requests:

if (shouldSendGoAway()) {
    channel().close();
}

if (shouldSendGoAway()) {
    channel().close();
}

if (shouldSendGoAway()) {
    // The connection has reached its lifespan.
    // Should send a GOAWAY frame if it did not receive or send a GOAWAY frame.
    channel().close();
}

  3. At the same time, we don't allow future requests for connections where MAX_CONNECTION_AGE has passed:

    return active && !responseDecoder.needsToDisconnectWhenFinished();
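
For context, a minimal client-side setup along the lines of the test under discussion might look like the sketch below (illustration only: the endpoint URL is made up, and it assumes ClientFactoryBuilder#maxConnectionAgeMillis is the option behind MAX_CONNECTION_AGE):

import com.linecorp.armeria.client.ClientFactory;
import com.linecorp.armeria.client.WebClient;

// Sketch: connections are retired after 2 seconds, while the response timeout is disabled
// so that a long-running request can outlive the connection age without being failed.
final class MaxConnectionAgeSketch {
    public static void main(String[] args) {
        final ClientFactory factory = ClientFactory.builder()
                .maxConnectionAgeMillis(2_000) // assumed builder option for MAX_CONNECTION_AGE
                .build();
        final WebClient client = WebClient.builder("http://127.0.0.1:8080") // hypothetical endpoint
                .factory(factory)
                .responseTimeoutMillis(0)
                .build();

        // The in-flight request is awaited even though it exceeds the connection age;
        // the connection is only closed once no requests remain in progress.
        System.out.println(client.get("/delayed?seconds=4").aggregate().join().status());
    }
}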

@minwoox (Member) commented:

I've updated the test name to better reflect this 😅

Thanks for that. 😄

but I thought that even if MAX_CONNECTION_AGE is reached, the client waits for requests to finish (so it doesn't forcefully fail long-running requests)

Exactly. My memory was wrong. Thanks for checking it. 😄

@ikhoon (Contributor) left a comment

Still LGTM!

minwoox pushed a commit that referenced this pull request Nov 23, 2021
…bort (#3920)

Motivation:

#3908 (comment)

Once a request reaches `HttpSessionHandler#invoke`
https://github.com/line/armeria/blob/d39c8f5719d7de7eef26df37917cc2df786b5b53/core/src/main/java/com/linecorp/armeria/client/HttpSessionHandler.java#L169

`HttpResponseDecoder.unfinishedResponses` has been incremented from one of the three following locations:
1. https://github.com/line/armeria/blob/d39c8f5719d7de7eef26df37917cc2df786b5b53/core/src/main/java/com/linecorp/armeria/client/HttpChannelPool.java#L276
2. https://github.com/line/armeria/blob/d39c8f5719d7de7eef26df37917cc2df786b5b53/core/src/main/java/com/linecorp/armeria/client/HttpChannelPool.java#L788
3. https://github.com/line/armeria/blob/d39c8f5719d7de7eef26df37917cc2df786b5b53/core/src/main/java/com/linecorp/armeria/client/HttpChannelPool.java#L503

Usually, once a request completes (with either success or failure), the response is removed and `HttpResponseDecoder.unfinishedResponses` is decremented.

However, if a request is cancelled early, it is possible that `unfinishedResponses` for the session isn't decremented.


This is problematic because:
1. An HTTP/2 connection may not be able to fully utilize the maximum number of streams (although 25 streams are allowed, only 24 are utilized)
2. Max connection age may not function properly, since it needs `HttpResponseDecoder.unfinishedResponses == 0` before closing connections.

Modifications:

- Decrement `HttpResponseDecoder.unfinishedResponses` if a request is cancelled immediately.

Result:

- More stable behavior from Armeria client
@minwoox (Member) commented on Nov 23, 2021

https://github.com/line/armeria/runs/4285541086?check_suite_focus=true#step:6:1154
@jrhee17 Could you check if the failure is related to the change, please?

@jrhee17 force-pushed the bugfix/max-concurrent-streams-v2 branch from 435a06b to dca247c on November 23, 2021 04:45
@jrhee17 (Contributor, Author) commented on Nov 23, 2021

I've tried force pushing to rerun the test.

Could you check if the failure is related to the change, please?

SessionProtocol[3] is H1, and this PR doesn't touch anything HTTP/1-related, so I think this failure probably isn't related to the change.
Let me look into the source of flakiness separately 🙏 (unless the same test fails again with the re-run 😅 )

@ikhoon (Contributor) commented on Nov 23, 2021

I've tried force pushing to rerun the test.

You might want to install @trustin's awesome script https://gist.github.com/trustin/05cbb70e22fc5e7c8b5ffbd1f0d99c8b

@minwoox (Member) commented on Nov 23, 2021

Thanks for checking it. 😅

Let me look into the source of flakiness separately

We don't have to do it right now. I just wanted to make sure. 😄

@ikhoon (Contributor) commented on Nov 23, 2021

https://github.com/line/armeria/runs/4285541086?check_suite_focus=true#step:6:1154 @jrhee17 Could you check if the failure is related to the change, please?

I guess the flaky test is not related to this PR.

@minwoox minwoox merged commit 8763394 into line:master Nov 23, 2021
@minwoox (Member) commented on Nov 23, 2021

Thanks a lot, @jrhee17, for fixing this bug whose cause was really hard to find. 😄

Successfully merging this pull request may close these issues.

Received a Maximum active streams violated for this endpoint. from client side