
Benchmarks and improve performance of ServletIo#requestBody #171

Open · wants to merge 10 commits into base: series/0.23
Conversation

@TimWSpence commented Jan 24, 2023

Background

We merged some dependency updates and noticed a spike in CPU usage and in GC frequency/duration. We narrowed it down to http4s-servlet:0.23.12 (0.23.13 is also affected), and with some git bisection and local snapshots we pinned it down to #50.

Changes in this PR

This PR adds a benchmark comparing the performance of requestBody and reader, and makes some suggested improvements to requestBody off the back of it. The benchmarks show a very significant reduction in throughput and increase in allocations when using requestBody. The most significant finding was that the requestBody benchmark was allocating an enormous number of int[] (35-40% of allocations) because we spawn three additional fibers per chunk: one in the parallel Dispatcher, and two because the use of Stream#concurrently makes Pull.eval(q.take) an interruptible eval.

Consequently I made the following changes (a simplified sketch of the resulting shape follows the list):

  1. Use a sequential Dispatcher to eliminate one fiber allocation (I was actually confused about how the current code works, as there are no ordering guarantees for q.submit(chunk).unsafeRunAndForget when running on the parallel Dispatcher?). I'm only forcing sequential in the benchmarks so far, but this very much supports "Create a sequential dispatcher for each request and response" #170.
  2. Replace concurrently with flatMap so that Pull.eval(q.take) is no longer interruptible, eliminating the other fiber allocations.
  3. Remove the boxing on the queue. I didn't see a huge difference from this, but given that the benchmarks suggest the improved requestBody still gives ~50% less throughput and allocates about 10x more than reader, every little helps.
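
The sketch below shows roughly the shape these changes aim for. It is hedged: requestBodySketch, onChunk and onComplete are simplified placeholders for the ReadListener wiring, not the actual ServletIo code.

```scala
import cats.effect.IO
import cats.effect.std.{Dispatcher, Queue}
import fs2.{Chunk, Stream}

// Simplified sketch only; onChunk/onComplete stand in for the ReadListener
// callbacks and are not the actual http4s-servlet code.
def requestBodySketch(
    dispatcher: Dispatcher[IO] // sequential, per change 1
): IO[(Array[Byte] => Unit, () => Unit, Stream[IO, Byte])] =
  // Unboxed queue of Option[Chunk[Byte]]; None terminates the stream (change 3).
  Queue.unbounded[IO, Option[Chunk[Byte]]].map { q =>
    // The servlet container invokes these from its own threads, so they are
    // submitted through the dispatcher. A sequential Dispatcher preserves
    // submission order and avoids spawning an extra fiber per chunk.
    val onChunk: Array[Byte] => Unit =
      bytes => dispatcher.unsafeRunAndForget(q.offer(Some(Chunk.array(bytes))))
    val onComplete: () => Unit =
      () => dispatcher.unsafeRunAndForget(q.offer(None))

    // Drain the queue directly rather than via concurrently, so q.take is not
    // evaluated inside an interruptible scope and no fiber is forked per pull
    // (change 2).
    val body: Stream[IO, Byte] =
      Stream.repeatEval(q.take).unNoneTerminate.flatMap(Stream.chunk)

    (onChunk, onComplete, body)
  }
```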

Benchmark results

Current version

Benchmark                                                         (iters)  (size)   Mode  Cnt         Score        Error   Units
ServletIoBenchmarks.reader                                           1000  100000  thrpt   20       206.176 ±      2.882   ops/s
ServletIoBenchmarks.reader:·gc.alloc.rate                            1000  100000  thrpt   20       514.322 ±      7.181  MB/sec
ServletIoBenchmarks.reader:·gc.alloc.rate.norm                       1000  100000  thrpt   20   2748150.063 ±      5.020    B/op
ServletIoBenchmarks.reader:·gc.churn.G1_Eden_Space                   1000  100000  thrpt   20       511.571 ±      8.988  MB/sec
ServletIoBenchmarks.reader:·gc.churn.G1_Eden_Space.norm              1000  100000  thrpt   20   2733572.797 ±  37047.601    B/op
ServletIoBenchmarks.reader:·gc.churn.G1_Survivor_Space               1000  100000  thrpt   20         0.032 ±      0.010  MB/sec
ServletIoBenchmarks.reader:·gc.churn.G1_Survivor_Space.norm          1000  100000  thrpt   20       172.966 ±     52.647    B/op
ServletIoBenchmarks.reader:·gc.count                                 1000  100000  thrpt   20       668.000               counts
ServletIoBenchmarks.reader:·gc.time                                  1000  100000  thrpt   20       361.000                   ms

ServletIoBenchmarks.requestBody                                      1000  100000  thrpt   20        39.742 ±      0.779   ops/s
ServletIoBenchmarks.requestBody:·gc.alloc.rate                       1000  100000  thrpt   20      2371.828 ±     46.455  MB/sec
ServletIoBenchmarks.requestBody:·gc.alloc.rate.norm                  1000  100000  thrpt   20  65741530.766 ±   2074.704    B/op
ServletIoBenchmarks.requestBody:·gc.churn.G1_Eden_Space              1000  100000  thrpt   20      2396.249 ±     47.821  MB/sec
ServletIoBenchmarks.requestBody:·gc.churn.G1_Eden_Space.norm         1000  100000  thrpt   20  66418490.614 ± 273169.904    B/op
ServletIoBenchmarks.requestBody:·gc.churn.G1_Survivor_Space          1000  100000  thrpt   20         0.029 ±      0.012  MB/sec
ServletIoBenchmarks.requestBody:·gc.churn.G1_Survivor_Space.norm     1000  100000  thrpt   20       806.306 ±    327.324    B/op
ServletIoBenchmarks.requestBody:·gc.count                            1000  100000  thrpt   20      1886.000               counts
ServletIoBenchmarks.requestBody:·gc.time                             1000  100000  thrpt   20       926.000                   ms

This PR

Benchmark                                                         (iters)  (size)   Mode  Cnt         Score       Error   Units
ServletIoBenchmarks.reader                                           1000  100000  thrpt   20       188.385 ±     3.473   ops/s
ServletIoBenchmarks.reader:·gc.alloc.rate                            1000  100000  thrpt   20       470.069 ±     8.604  MB/sec
ServletIoBenchmarks.reader:·gc.alloc.rate.norm                       1000  100000  thrpt   20   2749302.659 ±  1382.197    B/op
ServletIoBenchmarks.reader:·gc.churn.G1_Eden_Space                   1000  100000  thrpt   20       468.061 ±    10.620  MB/sec
ServletIoBenchmarks.reader:·gc.churn.G1_Eden_Space.norm              1000  100000  thrpt   20   2737395.458 ± 25786.814    B/op
ServletIoBenchmarks.reader:·gc.churn.G1_Survivor_Space               1000  100000  thrpt   20         0.026 ±     0.009  MB/sec
ServletIoBenchmarks.reader:·gc.churn.G1_Survivor_Space.norm          1000  100000  thrpt   20       150.846 ±    54.054    B/op
ServletIoBenchmarks.reader:·gc.count                                 1000  100000  thrpt   20       611.000              counts
ServletIoBenchmarks.reader:·gc.time                                  1000  100000  thrpt   20       357.000                  ms

ServletIoBenchmarks.requestBody                                      1000  100000  thrpt   20       104.962 ±     0.326   ops/s
ServletIoBenchmarks.requestBody:·gc.alloc.rate                       1000  100000  thrpt   20      1942.809 ±     6.386  MB/sec
ServletIoBenchmarks.requestBody:·gc.alloc.rate.norm                  1000  100000  thrpt   20  20396363.510 ±  8978.371    B/op
ServletIoBenchmarks.requestBody:·gc.churn.G1_Eden_Space              1000  100000  thrpt   20      1960.285 ±    11.354  MB/sec
ServletIoBenchmarks.requestBody:·gc.churn.G1_Eden_Space.norm         1000  100000  thrpt   20  20579794.985 ± 91489.964    B/op
ServletIoBenchmarks.requestBody:·gc.churn.G1_Survivor_Space          1000  100000  thrpt   20         0.022 ±     0.007  MB/sec
ServletIoBenchmarks.requestBody:·gc.churn.G1_Survivor_Space.norm     1000  100000  thrpt   20       234.842 ±    70.506    B/op
ServletIoBenchmarks.requestBody:·gc.count                            1000  100000  thrpt   20      1545.000              counts
ServletIoBenchmarks.requestBody:·gc.time                             1000  100000  thrpt   20       739.000                  ms

[Edited by @rossabaker to make the benchmarks monospace]

Concurrently introduces an fs2 interruptible scope which means that each
pull from the queue forks a fiber. The GC impact of this is significant.
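
A simplified illustration of the difference described above (not the actual ServletIo code; fill is a hypothetical stand-in for whatever pushes chunks onto the queue):

```scala
import cats.effect.IO
import cats.effect.std.Queue
import fs2.{Chunk, Stream}

// Before: draining the queue `concurrently` with the producer introduces an
// fs2 interruption scope, so each q.take is an interruptible eval that must be
// raced against interruption, forking a fiber per pull.
def drainConcurrently(
    q: Queue[IO, Option[Chunk[Byte]]],
    fill: Stream[IO, Nothing] // hypothetical producer side
): Stream[IO, Byte] =
  Stream.fromQueueNoneTerminated(q).flatMap(Stream.chunk).concurrently(fill)

// After: drain the queue directly; q.take is a plain eval with no interruption
// scope and no per-pull fiber.
def drainDirectly(q: Queue[IO, Option[Chunk[Byte]]]): Stream[IO, Byte] =
  Stream.repeatEval(q.take).unNoneTerminate.flatMap(Stream.chunk)
```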
@rossabaker (Member) left a comment

This looks like major progress, though I'm awfully disappointed that it still falls so far short. We may need to revert the whole queue concept.

The unsafeRunAndForget does seem dubious here with the parallel dispatcher. I'm now concerned we may have a critical correctness bug in the current version.

@rossabaker (Member) commented Jan 25, 2023

This should be a unit test, but I think we've got a bigger problem than performance with that parallel dispatcher: 4c9eb64. Specifically, I'm seeing it reading an empty body!

[Edit: no, sequential is failing that way, too. I think I need to look at this with fresh eyes tomorrow.]

@TimWSpence (Author) commented Jan 25, 2023

> This looks like major progress, though I'm awfully disappointed that it still falls so far short. We may need to revert the whole queue concept.

Yeah, it's a shame because the new code is much easier to understand. I wonder if part of the problem is that we're effectively double-queuing each chunk, since there's a queue inside the sequential dispatcher as well?

> The unsafeRunAndForget does seem dubious here with the parallel dispatcher. I'm now concerned we may have a critical correctness bug in the current version.

Yeah, it seems to me like it shouldn't work. But there may be a critical correctness bug in my understanding 😅

Also thanks very much for fixing the spacing on the benchmarks! 😂

@rossabaker (Member) commented:

I went ahead and made it a per-request dispatcher. Quick-and-dirty as of 9210893:

[info] Benchmark                        (iters)  (size)   Mode  Cnt    Score    Error  Units
[info] ServletIoBenchmarks.reader          1000  100000  thrpt    3  828.621 ± 15.904  ops/s
[info] ServletIoBenchmarks.requestBody     1000  100000  thrpt    3   56.251 ± 10.994  ops/s

Here's where requestBody is spending its time:

[info]   3.1%  28.4% sun.misc.Unsafe.unpark
[info]   2.2%  20.3% java.lang.Object.hashCode
[info]   1.5%  13.1% sun.misc.Unsafe.park
[info]   0.9%   8.4% java.io.InputStream.read
[info]   0.8%   6.9% cats.effect.IOPlatform.<init>
[info]   0.8%   6.8% cats.effect.IOFiber.runLoop
[info]   0.2%   2.2% cats.effect.IOFiber.succeeded
[info]   0.2%   1.4% cats.effect.IO.<init>
[info]   0.1%   1.2% java.util.Random.nextBytes
[info]   0.1%   1.1% cats.effect.kernel.Resource.loop$1
[info]   1.1%  10.1% <other>

This is reader:

[info]   1.8%  19.8% org.http4s.servlet.NonBlockingServletIo.org$http4s$servlet$NonBlockingServletIo$$read$1
[info]   1.7%  19.4% java.util.Random.nextBytes
[info]   1.4%  15.5% cats.effect.IOFiber.runLoop
[info]   1.0%  10.6% cats.effect.IOPlatform.<init>
[info]   0.4%   4.3% cats.effect.IOFiber.succeeded
[info]   0.3%   2.8% cats.effect.IO$$anon$4.delay
[info]   0.2%   2.7% cats.effect.IO.flatMap
[info]   0.2%   2.0% cats.effect.kernel.Resource$$Lambda$19/1133484341.<init>
[info]   0.2%   1.9% fs2.Pull$.fs2$Pull$$viewL$1
[info]   0.1%   1.6% cats.effect.IO$.pure
[info]   1.8%  19.5% <other>

@rossabaker (Member) commented Jan 25, 2023

Note that the above stack profile is on Java 8; we get a different picture on Java 17. We should be targeting Java 8 with the library for maximal support, but we should be optimizing for 17.

requestBody stack:

[info]   7.7%  44.0% <stack is empty, everything is filtered?>
[info]   2.6%  14.7% cats.effect.IOFiber.runLoop
[info]   1.4%   8.2% jdk.internal.misc.Unsafe.unpark
[info]   0.9%   5.1% java.util.Random.nextBytes
[info]   0.7%   4.1% cats.effect.unsafe.ScalQueue.poll
[info]   0.7%   4.1% cats.effect.IOFiber.succeeded
[info]   0.3%   1.5% cats.effect.kernel.Resource.loop$1
[info]   0.2%   1.2% cats.effect.std.Dispatcher.unsafeRunAndForget
[info]   0.2%   1.2% cats.effect.IO.<init>
[info]   0.2%   1.0% cats.Monad.$anonfun$map$1
[info]   2.6%  14.8% <other>

and

[info] ServletIoBenchmarks.reader              1000  100000  thrpt    3  353.705 ± 580.082  ops/s
[info] ServletIoBenchmarks.requestBody         1000  100000  thrpt    3   77.763 ±   2.734  ops/s

Fascinating that reader got worse, and I don't know WTF that error margin is about.

@@ -86,6 +86,20 @@ class ServletIoBenchmarks {
readListener.onAllDataRead()
result
}

override def read(buf: Array[Byte]) = {
@TimWSpence (Author) commented on this diff:

🤦 I should have realized that the default is one byte at a time via read()
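
A minimal stand-in showing the point (not the actual benchmark stub): java.io.InputStream's default read(Array[Byte]) delegates to read(Array[Byte], Int, Int), which falls back to the single-byte read() in a loop, so the stub needs a bulk override to avoid per-byte overhead.

```scala
// Simplified stand-in for the benchmark's input stream stub (not the actual code).
class ArrayBackedInputStream(data: Array[Byte]) extends java.io.InputStream {
  private var pos = 0

  // Without a bulk override, InputStream.read(Array[Byte]) would end up
  // calling this once per byte.
  override def read(): Int =
    if (pos >= data.length) -1
    else {
      val b = data(pos) & 0xff
      pos += 1
      b
    }

  // Bulk read: copy as much as is available in one go.
  override def read(buf: Array[Byte], off: Int, len: Int): Int =
    if (len == 0) 0
    else if (pos >= data.length) -1
    else {
      val n = math.min(len, data.length - pos)
      System.arraycopy(data, pos, buf, off, n)
      pos += n
      n
    }
}
```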

@rossabaker (Member) commented:

I started a Pull-based implementation that avoids the queues and quickly ran into about a 3x performance loss. If I change requestBody to be equivalent to reader, so that the only difference is creating and releasing an (unused in this impl) sequential Dispatcher per request, we are already in a mess!

[info] ServletIoBenchmarks.reader                 1000  100000  thrpt    2  470.326          ops/s
[info] ServletIoBenchmarks.requestBody            1000  100000  thrpt    2  153.507          ops/s

What I don't know is how much a decline in these benchmarks translates to a decline in real-world throughput. But I don't think we can compete with reader with a dispatcher-per-request.
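
For context, the experiment described above amounts to something like this sketch (assuming Dispatcher.sequential from cats-effect 3.4+; readerStyleBody is a hypothetical stand-in for the existing reader-based body, not the actual Pull-based code):

```scala
import cats.effect.IO
import cats.effect.std.Dispatcher
import fs2.Stream

// Hypothetical: the sequential Dispatcher is acquired and released per request
// but never used, which isolates the overhead of the per-request Dispatcher
// itself from the rest of the body-reading machinery.
def requestBodyLikeReader(readerStyleBody: Stream[IO, Byte]): Stream[IO, Byte] =
  Stream.resource(Dispatcher.sequential[IO]).flatMap(_ => readerStyleBody)
```

Even unused, that per-request acquire/release accounts for the roughly 3x gap shown in the numbers above.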

@TimWSpence (Author) commented Jan 31, 2023

@rossabaker sorry, I let this fall off my radar. Is there anything I can do to help get this performance work over the line? I wasn't sure whether you were planning to carry on with this PR or revert to the previous implementation.

@rossabaker (Member) commented:

I got distracted by life and a couple other projects this week. Yeah... the current one is not only not performant, but strikingly incorrect. My inclination is that we should revert without breaking binary compatibility, and meanwhile keep exploring a way to clean it up and remain performant.

@rossabaker mentioned this pull request Feb 2, 2023
@rossabaker (Member) commented:

Alright, we can release #175 to undo the damage, and then keep exploring better ways via this PR.
