
Reduce HTTP 1.1 Full msg pipeline traversals #13785

Merged
merged 4 commits into netty:4.1 from 4.1_less_traversals on Jan 24, 2024

Conversation

franz1981
Contributor

Reduce HTTP 1.1 Full msg pipeline traversals

Motivation:

HttpObjectEncoder can already embed the data into the header buffer, enabling MessageToMessageEncoder to avoid allocating promise combiners and reducing the number of pipeline traversals. Sadly, this happens only if the header size estimation leaves enough room in the header buffer.

Modifications:

Enhance the ability to embed the full msg content into the header buffer when it is below specific thresholds, or when it is a heap buffer: heap content would likely be copied into a direct buffer later on anyway, so embedding it just anticipates that copy.

Result:

Faster writes of small or heap-based content
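
Roughly, the idea is the following (a simplified sketch, not the actual HttpObjectEncoder code; the class, constant and helper names are illustrative only):

import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;

// Simplified sketch: when encoding a full HTTP message, decide whether the content can be
// folded into the header buffer, so a single ByteBuf travels the pipeline instead of two.
final class EmbedSketch {

    // Illustrative cutoff: very small contents are always worth copying.
    private static final int COPY_CUTOFF = 128;

    static ByteBuf encode(ByteBufAllocator alloc, byte[] encodedHeaders, ByteBuf content) {
        int contentBytes = content.readableBytes();
        // Copy when the content is tiny, or when it is a heap buffer that the transport
        // would copy into a direct buffer later anyway.
        if (contentBytes <= COPY_CUTOFF || !content.isDirect()) {
            ByteBuf single = alloc.directBuffer(encodedHeaders.length + contentBytes);
            single.writeBytes(encodedHeaders);
            single.writeBytes(content);
            content.release();
            return single; // one message, one pipeline traversal, no promise combiner
        }
        // Otherwise headers and content stay separate and both travel the pipeline.
        ByteBuf headers = alloc.directBuffer(encodedHeaders.length).writeBytes(encodedHeaders);
        // ... the caller then writes `headers` and `content` as two messages ...
        return headers;
    }

    private EmbedSketch() { }
}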

@franz1981
Contributor Author

franz1981 commented Jan 17, 2024

This is similar to #13783, but I'm still working on the thresholds and on simplifying the code.

Please let me know what you think about the idea behind it @normanmaurer @idelpivnitskiy

My concerns with this change relate to the "unbounded" copy strategy for heap buffers:

  1. a successful single-buffer write still relies on the header size estimation: if the estimation turns out to be wrong, we risk wasting a LOT of bytes, e.g. estimated headers = 64 bytes, real headers = 65 bytes, heap data content = 4K: because of a single wrongly estimated header byte, we waste 4093 bytes!!!
  2. what if always copying heap buffers is such a good choice that direct buffers would also benefit beyond the 128-byte cutoff/threshold I've chosen? It risks making heap buffers a faster choice than direct ones :"/

The latter is more related to the motivation behind this change:

  • additional pipeline traversals are not nice, but how do we estimate their cost vs a copy? It's very architecture dependent...
  • creating a PromiseCombiner has a heap footprint (i.e. 40 bytes) and introduces additional atomic operations (i.e. adding its listener to each write's future), whose cost is difficult to compare/relate to the copy cost

In short, the cutoff value could be slightly bigger for heap buffers (1KB, or a proportion of the estimated header size?), in order to avoid the first problem while still ensuring heap buffers don't gain too much of an advantage over direct ones.

@franz1981 franz1981 marked this pull request as draft January 17, 2024 11:55
@franz1981 franz1981 marked this pull request as ready for review January 17, 2024 13:27
@franz1981
Contributor Author

franz1981 commented Jan 22, 2024

@normanmaurer @chrisvest I've collected some data from a simple hello-world benchmark, where the stack uses a byte[]-backed ByteBuf in the full HTTP msg, and the difference is so big that it is difficult to believe.

On my local box, I've created a Quarkus instance (which uses Vert.x and Netty under the hood) with 4 I/O threads (=== event loops) and a very simple endpoint, like https://github.com/franz1981/quarkus-profiling-workshop/blob/master/src/main/java/profiling/workshop/greeting/GreetingResource.java, but adding the NonBlocking annotation to make it run on the event loop threads, without any handoff.

The load generation uses h2load and is fairly simplistic, just to capture any improvement in peak throughput:

$ h2load http://localhost:8080/hello -c 10 -m 10 -D 60s -t 2 --h1

This makes sure it has enough CPU to push the server to the limit.
It uses an HTTP 1.1 pipelining level of 10, to make any improvement more evident, although I'm aware that's not very realistic.

Before this change I got:

1470780.67 req/s, 150.08MB/s

while, after:

1730044.33 req/s, 176.54MB/s

which is an improvement of ~17% (!!).

The flamegraphs and profiling data indeed tell a similar story, showing that both the encoding and the promise completion on response writes have halved their relative costs.

I can make the CUTOFF value a static final and, if you agree, this is good to go.

@normanmaurer
Member

@franz1981 this is amazing! I like it

@normanmaurer normanmaurer added this to the 4.1.107.Final milestone Jan 22, 2024
Contributor

@bryce-anderson bryce-anderson left a comment


A few questions, but in general I think it's a sound strategy.

@franz1981
Contributor Author

franz1981 commented Jan 23, 2024

@bryce-anderson PTAL

I've limited the copy size to 12.5% of the estimated header size, which limits the cost of a wrong estimation; even if the estimation accounts for tons of headers (> 2K, which seems unlikely), we can still afford to embed > 256 bytes of content.
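
In code the cap boils down to something like this (just a sketch; the actual constant/field names in HttpObjectEncoder may differ):

// Sketch of the copy cap described above: content is embedded only when it is at most
// 12.5% of the estimated header size, so a wrong estimation can waste at most ~1/8th
// of the extra space reserved in the header buffer.
final class CopyCapSketch {
    static boolean canEmbed(int estimatedHeaderSize, int contentBytes) {
        int maxEmbeddedBytes = estimatedHeaderSize >>> 3; // divide by 8 == 12.5%
        return contentBytes <= maxEmbeddedBytes;
    }

    private CopyCapSketch() { }
}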

I would prefer to raise the value for heap buffers, because they benefit a lot more from embedding the content, but I cannot see how to do it without losing space in case of a wrong estimation...
My point is that estimations should adapt based on the context, and when they are wrong, they are wrong regardless, losing space. The hit is still somewhat acceptable because if the pooled allocator is using thread-local buffers for the event loop thread, a new buffer likely won't be allocated; an already allocated one is just reused.

I can provide another commit to raise the limit to ~1K (or 512 bytes?) just for heap buffers if you agree, wdyt @normanmaurer ?

The point is that requests and responses act very differently in common scenarios: "usually" requests tend to have more headers and less content, while responses are the opposite, unless it's a POST/PUT request.

Contributor

@bryce-anderson bryce-anderson left a comment


I can provide another commit to raise the limit to ~1K (or 512 bytes?) just for heap buffers if you agree, wdyt @normanmaurer ?

@franz1981, I'll defer to you and @normanmaurer.

@franz1981
Contributor Author

With 23fb903 I've also fixed a benchmark that had been broken for a long time, and I've compared before/after, getting:

before:

Benchmark                                (contentBytes)  (pooledAllocator)  (typePollution)  (voidPromise)   Mode  Cnt        Score        Error  Units
HttpRequestEncoderBenchmark.fullMessage             128               true            false           true  thrpt   20  8141334.970 ± 213631.397  ops/s
HttpRequestEncoderBenchmark.fullMessage             128               true            false          false  thrpt   20  5305790.008 ± 544748.708  ops/s
HttpRequestEncoderBenchmark.fullMessage             128              false            false           true  thrpt   20  7333156.696 ± 781234.881  ops/s
HttpRequestEncoderBenchmark.fullMessage             128              false            false          false  thrpt   20  5267705.708 ± 361571.646  ops/s

after:

Benchmark                                (contentBytes)  (pooledAllocator)  (typePollution)  (voidPromise)   Mode  Cnt        Score        Error  Units
HttpRequestEncoderBenchmark.fullMessage             128               true            false           true  thrpt   20  8075926.038 ±  95644.252  ops/s
HttpRequestEncoderBenchmark.fullMessage             128               true            false          false  thrpt   20  7659763.433 ± 278948.037  ops/s
HttpRequestEncoderBenchmark.fullMessage             128              false            false           true  thrpt   20  7221123.772 ± 660299.770  ops/s
HttpRequestEncoderBenchmark.fullMessage             128              false            false          false  thrpt   20  7131083.271 ± 616222.581  ops/s

which correctly shows that:

  • given that there is essentially no pipeline here, there's no penalty in sending 2 ByteBufs over it
  • void promises get special treatment in the 2-ByteBuf case, which keeps their performance decent
  • non-void promises with multiple buffers force creating a promise combiner, which performs costly atomic operations to register a new listener on each write's future

In short, this benchmark shows that it is beneficial to avoid creating the combiner, which we kind of knew already, but it's nice to have something that makes it evident, although it is very artificial.
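
To make it concrete, the two write paths compared above look roughly like this (a sketch of the general pattern, not the exact HttpObjectEncoder/MessageToMessageEncoder code):

import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelPromise;
import io.netty.util.concurrent.PromiseCombiner;

final class WritePathSketch {

    static void writeSeparately(ChannelHandlerContext ctx, ByteBuf headers, ByteBuf content,
                                ChannelPromise promise) {
        if (promise.isVoid()) {
            // Void promise: no combiner needed, failures are reported through the pipeline.
            ctx.write(headers, ctx.voidPromise());
            ctx.write(content, ctx.voidPromise());
            return;
        }
        // Non-void promise: both write futures must complete the caller's promise, which
        // requires a PromiseCombiner (heap allocation + atomic listener registrations).
        PromiseCombiner combiner = new PromiseCombiner(ctx.executor());
        combiner.add(ctx.write(headers));
        combiner.add(ctx.write(content));
        combiner.finish(promise);
    }

    static void writeEmbedded(ChannelHandlerContext ctx, ByteBuf headersAndContent,
                              ChannelPromise promise) {
        // Single buffer: one traversal, one future, no combiner regardless of the promise type.
        ctx.write(headersAndContent, promise);
    }

    private WritePathSketch() { }
}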

@franz1981
Contributor Author

franz1981 commented Jan 24, 2024

PTAL @bryce-anderson @normanmaurer @He-Pin this could be good to go

@franz1981
Contributor Author

franz1981 commented Jan 24, 2024

@bryce-anderson @normanmaurer

I've got an idea to help "expert" users: what if I create a protected method which users could override to decide on their own whether to account for the content or not?

I could expose a method to know whether the operation actually succeeded, otherwise there is no way to get feedback...
but my point is to leave users the option to increase the value beyond 128, or to apply different strategies, if they feel it's worth it.
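
Something along these lines, just to make the idea concrete (completely hypothetical: neither this method name nor its signature exist in HttpObjectEncoder, it only sketches what such a hook could look like):

// Hypothetical hook that HttpObjectEncoder could expose; everything here is made up to
// illustrate the idea above, it is not an existing Netty API.
protected boolean embedContentIntoHeaders(int estimatedHeaderSize, int contentBytes,
                                          boolean directContent) {
    // Default policy: the current built-in thresholds. "Expert" subclasses could override
    // this to go beyond 128 bytes, or to apply a completely different strategy.
    return contentBytes <= 128 || !directContent;
}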

Alternatively, I can add a new system property to make it configurable, e.g. by exposing the current threshold values too: this alone could be enough, although it applies to both heap and direct buffers, while users who reach for such a low-level tunable probably prefer to have more control over it.

Any suggestion?

@normanmaurer
Member

I would just not do it for now... Let's keep things simple

Contributor

@bryce-anderson bryce-anderson left a comment


@franz1981, I really like how this shook out: simple but very effective.

@normanmaurer normanmaurer merged commit 7893d5f into netty:4.1 Jan 24, 2024
15 checks passed
@normanmaurer
Member

@franz1981 thanks a lot, this is great!

Can you do a PR against main as well?

@franz1981
Contributor Author

Yep @normanmaurer, this can be done easily... for other changes I made, I was more confident that newer JDK versions would have been fine and I didn't want to pollute the code with cheap tricks, but this one is different...

franz1981 added a commit to franz1981/netty that referenced this pull request Feb 9, 2024
Reduce HTTP 1.1 Full msg pipeline traversals

@franz1981 franz1981 deleted the 4.1_less_traversals branch April 3, 2024 12:03