
Reduce HTTP 1.1 Full msg pipeline traversals #13785

Merged
merged 4 commits into netty:4.1 from 4.1_less_traversals on Jan 24, 2024

Conversation

franz1981
Contributor

Reduce HTTP 1.1 Full msg pipeline traversals

Motivation:

HttpObjectEncoder can already embed the data into the header buffer, enabling MessageToMessageEncoder to avoid allocating promise combiners and reducing the number of pipeline traversals. Sadly, this happens only if the header size estimation leaves enough room in the header buffer.

Modifications:

Enhance the ability to embed the full msg content into the header buffer when it is below specific thresholds, or when it is a heap buffer: heap content would likely be copied into a direct buffer later on anyway, so embedding it just anticipates that copy.

Result:

Faster writes of small or heap-based content
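
Roughly, the idea is the following (a simplified sketch, not the actual HttpObjectEncoder code; the class, constant and helper names are illustrative only):

import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;

// Simplified sketch: when encoding a full HTTP message, decide whether the content can be
// folded into the header buffer, so a single ByteBuf travels the pipeline instead of two.
final class EmbedSketch {

    // Illustrative cutoff: very small contents are always worth copying.
    private static final int COPY_CUTOFF = 128;

    static ByteBuf encode(ByteBufAllocator alloc, byte[] encodedHeaders, ByteBuf content) {
        int contentBytes = content.readableBytes();
        // Copy when the content is tiny, or when it is a heap buffer that the transport
        // would copy into a direct buffer later anyway.
        if (contentBytes <= COPY_CUTOFF || !content.isDirect()) {
            ByteBuf single = alloc.directBuffer(encodedHeaders.length + contentBytes);
            single.writeBytes(encodedHeaders);
            single.writeBytes(content);
            content.release();
            return single; // one message, one pipeline traversal, no promise combiner
        }
        // Otherwise headers and content stay separate and both travel the pipeline.
        ByteBuf headers = alloc.directBuffer(encodedHeaders.length).writeBytes(encodedHeaders);
        // ... the caller then writes `headers` and `content` as two messages ...
        return headers;
    }

    private EmbedSketch() { }
}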

@franz1981
Contributor Author

franz1981 commented Jan 17, 2024

This is similar to #13783, but I'm still working on the thresholds and on simplifying the code.

Please let me know what you think about the idea behind it @normanmaurer @idelpivnitskiy

My concerns with this change relate to the "unbounded" copy strategy for heap buffers:

  1. a successful single-buffer write still relies on the header size estimation: if the estimation turns out to be wrong, we risk wasting a LOT of bytes, e.g. estimated headers = 64 bytes, real headers = 65 bytes, heap data content = 4K: because of a single wrongly estimated header byte, we waste 4093 bytes!!!
  2. what if always copying heap buffers is such a good choice that direct buffers would also benefit beyond the 128-byte cutoff/threshold I've chosen? It risks making heap buffers a faster choice than direct ones :"/

The latter is more related to the motivation behind this change:

  • additional pipeline traversals are not nice, but how do we estimate their cost vs a copy? It's very architecture dependent...
  • creating a PromiseCombiner has a heap footprint (i.e. 40 bytes) and introduces additional atomic operations (i.e. adding its listener to each write's future), whose cost is difficult to compare/relate to the copy cost

In short, the cutoff value could be slightly bigger for heap buffers (1KB, or a proportion of the estimated header size?), in order to avoid the first problem while still ensuring heap buffers don't gain too much of an advantage over direct ones.

@franz1981 franz1981 marked this pull request as draft January 17, 2024 11:55
@franz1981 franz1981 marked this pull request as ready for review January 17, 2024 13:27
@franz1981
Contributor Author

franz1981 commented Jan 22, 2024

@normanmaurer @chrisvest I've collected some data from a simple hello-world benchmark, where the stack uses a byte[]-backed ByteBuf in the full HTTP msg, and the difference is so big that it is difficult to believe.

On my local box, I've created a Quarkus instance (which uses Vert.x and Netty under the hood) with 4 I/O threads (=== event loops) and a very simple endpoint, like https://github.com/franz1981/quarkus-profiling-workshop/blob/master/src/main/java/profiling/workshop/greeting/GreetingResource.java, but adding the NonBlocking annotation to make it run on the event loop threads, without any handoff.

The load generation uses h2load and is fairly simplistic, just to capture any improvement in peak throughput:

$ h2load http://localhost:8080/hello -c 10 -m 10 -D 60s -t 2 --h1

This makes sure it has enough CPU to push the server to the limit.
It uses an HTTP 1.1 pipelining level of 10, to make any improvement more evident, although I'm aware that's not very realistic.

Before this change I got:

1470780.67 req/s, 150.08MB/s

while, after:

1730044.33 req/s, 176.54MB/s

which is an improvement of ~17% (!!).

The flamegraphs and profiling data indeed tell a similar story, showing that both the encoding and the promise completion on response writes have halved their relative costs.

I can make the CUTOFF value a static final and, if you agree, this is good to go.

@normanmaurer
Member

@franz1981 this is amazing! I like it

@normanmaurer normanmaurer added this to the 4.1.107.Final milestone Jan 22, 2024
Contributor

@bryce-anderson bryce-anderson left a comment


A few questions, but in general I think it's a sound strategy.

@franz1981
Contributor Author

franz1981 commented Jan 23, 2024

@bryce-anderson PTAL

I've limited the copy size to 12.5% of the estimated header size, which limits the cost of a wrong estimation; even if the estimation accounts for tons of headers (> 2K, which seems unlikely), we can still afford to embed > 256 bytes of content.
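
In code the cap boils down to something like this (just a sketch; the actual constant/field names in HttpObjectEncoder may differ):

// Sketch of the copy cap described above: content is embedded only when it is at most
// 12.5% of the estimated header size, so a wrong estimation can waste at most ~1/8th
// of the extra space reserved in the header buffer.
final class CopyCapSketch {
    static boolean canEmbed(int estimatedHeaderSize, int contentBytes) {
        int maxEmbeddedBytes = estimatedHeaderSize >>> 3; // divide by 8 == 12.5%
        return contentBytes <= maxEmbeddedBytes;
    }

    private CopyCapSketch() { }
}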

I would prefer to raise the value for heap buffers, because they benefit a lot more from embedding the content, but I cannot see how to do it without losing space in case of a wrong estimation...
My point is that estimations should adapt based on the context, and when they are wrong, they are wrong regardless, losing space. The hit is still somewhat acceptable because if the pooled allocator is using thread-local buffers for the event loop thread, a new buffer likely won't be allocated; an already allocated one is just reused.

I can provide another commit to raise the limit to ~1K (or 512 bytes?) just for heap buffers if you agree, wdyt @normanmaurer ?

The point is that requests and responses act very differently in common scenarios: "usually" requests tend to have more headers and less content, while responses are the opposite, unless it's a POST/PUT request.

Contributor

@bryce-anderson bryce-anderson left a comment


I can provide another commit to raise the limit to ~1K (or 512 bytes?) just for heap buffers if you agree, wdyt @normanmaurer ?

@franz1981, I'll defer to you and @normanmaurer.

@franz1981
Contributor Author

With 23fb903 I've also fixed a benchmark that had been broken for a long time, and I've compared before/after, getting:

before:

Benchmark                                (contentBytes)  (pooledAllocator)  (typePollution)  (voidPromise)   Mode  Cnt        Score        Error  Units
HttpRequestEncoderBenchmark.fullMessage             128               true            false           true  thrpt   20  8141334.970 ± 213631.397  ops/s
HttpRequestEncoderBenchmark.fullMessage             128               true            false          false  thrpt   20  5305790.008 ± 544748.708  ops/s
HttpRequestEncoderBenchmark.fullMessage             128              false            false           true  thrpt   20  7333156.696 ± 781234.881  ops/s
HttpRequestEncoderBenchmark.fullMessage             128              false            false          false  thrpt   20  5267705.708 ± 361571.646  ops/s

after:

Benchmark                                (contentBytes)  (pooledAllocator)  (typePollution)  (voidPromise)   Mode  Cnt        Score        Error  Units
HttpRequestEncoderBenchmark.fullMessage             128               true            false           true  thrpt   20  8075926.038 ±  95644.252  ops/s
HttpRequestEncoderBenchmark.fullMessage             128               true            false          false  thrpt   20  7659763.433 ± 278948.037  ops/s
HttpRequestEncoderBenchmark.fullMessage             128              false            false           true  thrpt   20  7221123.772 ± 660299.770  ops/s
HttpRequestEncoderBenchmark.fullMessage             128              false            false          false  thrpt   20  7131083.271 ± 616222.581  ops/s

which correctly shows that:

  • given that there is essentially no pipeline here, there's no penalty in sending 2 ByteBufs over it
  • void promises get special treatment in the 2-ByteBuf case, which keeps their performance decent
  • non-void promises with multiple buffers force creating a promise combiner, which performs costly atomic operations to register a new listener on each write's future

In short, this benchmark shows that it is beneficial to avoid creating the combiner, which we kind of knew already, but it's nice to have something that makes it evident, although it is very artificial.
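
To make it concrete, the two write paths compared above look roughly like this (a sketch of the general pattern, not the exact HttpObjectEncoder/MessageToMessageEncoder code):

import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelPromise;
import io.netty.util.concurrent.PromiseCombiner;

final class WritePathSketch {

    static void writeSeparately(ChannelHandlerContext ctx, ByteBuf headers, ByteBuf content,
                                ChannelPromise promise) {
        if (promise.isVoid()) {
            // Void promise: no combiner needed, failures are reported through the pipeline.
            ctx.write(headers, ctx.voidPromise());
            ctx.write(content, ctx.voidPromise());
            return;
        }
        // Non-void promise: both write futures must complete the caller's promise, which
        // requires a PromiseCombiner (heap allocation + atomic listener registrations).
        PromiseCombiner combiner = new PromiseCombiner(ctx.executor());
        combiner.add(ctx.write(headers));
        combiner.add(ctx.write(content));
        combiner.finish(promise);
    }

    static void writeEmbedded(ChannelHandlerContext ctx, ByteBuf headersAndContent,
                              ChannelPromise promise) {
        // Single buffer: one traversal, one future, no combiner regardless of the promise type.
        ctx.write(headersAndContent, promise);
    }

    private WritePathSketch() { }
}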

@franz1981
Contributor Author

franz1981 commented Jan 24, 2024

PTAL @bryce-anderson @normanmaurer @He-Pin this could be good to go

@franz1981
Contributor Author

franz1981 commented Jan 24, 2024

@bryce-anderson @normanmaurer

I've got an idea to help "expert" users: what if I create a protected method which users could override to decide on their own whether to account for the content or not?

I could expose a method to know whether the operation actually succeeded, otherwise there is no way to get feedback...
but my point is to leave users the option to increase the value beyond 128, or to apply different strategies, if they feel it's worth it.
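
Something along these lines, just to make the idea concrete (completely hypothetical: neither this method name nor its signature exist in HttpObjectEncoder, it only sketches what such a hook could look like):

// Hypothetical hook that HttpObjectEncoder could expose; everything here is made up to
// illustrate the idea above, it is not an existing Netty API.
protected boolean embedContentIntoHeaders(int estimatedHeaderSize, int contentBytes,
                                          boolean directContent) {
    // Default policy: the current built-in thresholds. "Expert" subclasses could override
    // this to go beyond 128 bytes, or to apply a completely different strategy.
    return contentBytes <= 128 || !directContent;
}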

Alternatively, I can add a new system property to make it configurable, e.g. by exposing the current threshold values too: this alone could be enough, although it applies to both heap and direct buffers, while users who reach for such a low-level tunable probably prefer to have more control over it.

Any suggestion?

@normanmaurer
Member

I would just not do it for now... Let's keep things simple

Contributor

@bryce-anderson bryce-anderson left a comment


@franz1981, I really like how this shook out: simple but very effective.

@normanmaurer normanmaurer merged commit 7893d5f into netty:4.1 Jan 24, 2024
15 checks passed
@normanmaurer
Member

@franz1981 thanks a lot, this is great!

Can you do a PR against main as well?

@franz1981
Contributor Author

Yep @normanmaurer, this can be done easily... for other changes I made, I was more confident that newer JDK versions would have been fine and I didn't want to pollute the code with cheap tricks, but this one is different...

franz1981 added a commit to franz1981/netty that referenced this pull request Feb 9, 2024
Reduce HTTP 1.1 Full msg pipeline traversals

@franz1981 franz1981 deleted the 4.1_less_traversals branch April 3, 2024 12:03