
Optimize HTTP2.Adapter.read_req_body/2 #37

Merged (9 commits, Dec 4, 2022)
Conversation

sabiwara
Contributor

@sabiwara sabiwara commented Oct 16, 2022

Hello,

I tried rewriting HTTP2.Adapter.read_req_body/2 with simple recursion instead of streams, both for simplicity and to see how it would affect performance.

In a quick benchmark I found that:

  • moving to recursion seems to have a noticeable impact on reducing memory usage, with only a small speedup
  • no longer calling IO.iodata_length/1 on every new chunk, and instead keeping track of the remaining length, changes the time complexity from quadratic to linear, which leads to huge speedups for big bodies (with essentially the same memory usage)

I wanted to try to benchmark this in a more realistic setup; do you have any recommendations on where I could get started?
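For illustration, the core of the optimization looks roughly like this (a minimal sketch with hypothetical names, not Bandit's actual adapter code, which reads chunks from the HTTP/2 stream process): decrement a remaining-byte counter per chunk rather than re-measuring the whole accumulator with IO.iodata_length/1 on every iteration.

```elixir
# Sketch: read up to `max_bytes` by simple recursion. Each step does O(1)
# work (measuring only the new chunk), so the whole read is linear; calling
# IO.iodata_length/1 on the growing accumulator each time would be quadratic.
defmodule ReadBodySketch do
  def read_req_body(chunks, max_bytes) do
    do_read(chunks, max_bytes, [])
  end

  # Budget exhausted: reverse the accumulated chunks and flatten once.
  defp do_read(_chunks, remaining, acc) when remaining <= 0 do
    IO.iodata_to_binary(Enum.reverse(acc))
  end

  # No more chunks: same finalization.
  defp do_read([], _remaining, acc), do: IO.iodata_to_binary(Enum.reverse(acc))

  defp do_read([chunk | rest], remaining, acc) do
    # Only the new chunk is measured; the accumulator is never re-scanned.
    do_read(rest, remaining - byte_size(chunk), [chunk | acc])
  end
end
```

Note the sketch consumes whole chunks, so it may overshoot `max_bytes` by up to one chunk; the point is the O(1) bookkeeping per chunk, not exact truncation.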

@mtrudel
Owner

mtrudel commented Oct 16, 2022

This is great work!

I'm planning to turn attention at performance issues specifically as a part of the 0.7.0 release train (coming shortly after phoenixframework/phoenix#5003 and a few other stragglers land, likely mid-Nov 2022). Rough plans are laid out at mtrudel/network_benchmark#1 (comment).

As part of that work I'm planning on adding in reproducible micro benchmarks on a number of common flows to run as part of CI. This work would be a perfect test candidate for such work, at least within the HTTP/2 path. I'd propose to hold off on merging this until that comes along, as I'd like to quantify changes like this with stable benchmarks.

WDYT?

@sabiwara
Contributor Author

Sounds good!
I love the idea of reproducible micro-benchmarks, I'm looking forward to seeing the results 😁

@mtrudel mtrudel added the benchmark Assign this to a PR to have the benchmark CI suite run label Nov 17, 2022
@mtrudel
Owner

mtrudel commented Nov 17, 2022

@sabiwara as you may have noticed I'm spamming this PR with some changes to get the new benchmarking setup working with forked repos (this is something I can only test with forked PRs).

Ignore the chatter for now - I'm just puttering with the workflow files, not your changes.

@mtrudel mtrudel added benchmark Assign this to a PR to have the benchmark CI suite run and removed benchmark Assign this to a PR to have the benchmark CI suite run labels Nov 18, 2022
@mtrudel
Owner

mtrudel commented Nov 18, 2022

Alright, chatter done. You're now the proud owner of Bandit's first benchmark CI!

Check out https://github.com/mtrudel/bandit/actions/runs/3493007558#summary-9563992273 for the overview, and the CSV download on that same page for the details.

Curiously, it's showing your changes as being rather worse on memory. Not sure that I believe that, given the numbers you're showing in Benchee. Maybe not the best start for the benchmarker 😬

I'll dig into this more in-depth and report back!

@sabiwara
Contributor Author

Wow, I didn't know what to expect, but this is still a surprising result 😁
I'll try to dive in a bit on my end as well, really curious about what is causing this.

@mtrudel
Owner

mtrudel commented Nov 19, 2022

The other thing I'd say is not to put too much faith in the benchmarker. It's a new tool and not particularly proven, especially as regards its memory profiling (which is a tough job to get right).

@mtrudel
Owner

mtrudel commented Dec 3, 2022

Dove more into this and all looks good (I'm seeing about a 5% lift on perf, and though I can't get sensible memory numbers out of the benchmarks, your Benchee work is pretty convincing).

I say we land this. Good to merge from your perspective?

@sabiwara
Contributor Author

sabiwara commented Dec 3, 2022

> I say we land this. Good to merge from your perspective?

Sounds good!! Memory does seem tricky to measure, but like you said, the micro benchmark seems quite convincing 👍

@mtrudel mtrudel merged commit 1bf43b5 into mtrudel:main Dec 4, 2022
@mtrudel
Owner

mtrudel commented Dec 4, 2022

A W E S O M E

Thanks for the great work @sabiwara! I'll be adding a few more HTTP/2 things in the next week or so and will drop a release once it's all done.

@sabiwara sabiwara deleted the bench branch December 4, 2022 02:46
@michallepicki

michallepicki commented Dec 4, 2022

As per the Erlang Efficiency Guide, it's probably faster (and more memory efficient) to append to a binary accumulator instead of collecting a list, reversing it, and calling IO.iodata_to_binary()

@sabiwara
Contributor Author

sabiwara commented Dec 4, 2022

> As per the Erlang Efficiency Guide, it's probably faster (and more memory efficient) to append to a binary accumulator instead of collecting a list, reversing it, and calling IO.iodata_to_binary()

I'm not sure this is true, and the linked page doesn't explicitly compare the two approaches. I'm really curious if you've got a benchmark showing this; my attempts tend to show the opposite (>2x slower and slightly more memory usage with binary concatenation). Building and reversing lists is quite fast, and so is IO.iodata_to_binary/1. BTW, this is also how Enum.join is implemented.
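For reference, the two accumulation strategies under discussion look roughly like this (toy functions for illustration, not code from either project):

```elixir
defmodule AccumSketch do
  # Strategy 1: prepend each chunk to a list (O(1) per chunk), then do a
  # single reverse and a single flatten at the end.
  def via_iolist(chunks) do
    chunks
    |> Enum.reduce([], fn chunk, acc -> [chunk | acc] end)
    |> Enum.reverse()
    |> IO.iodata_to_binary()
  end

  # Strategy 2: append each chunk onto a growing binary accumulator,
  # relying on the runtime's binary-append optimization.
  def via_binary(chunks) do
    Enum.reduce(chunks, <<>>, fn chunk, acc -> acc <> chunk end)
  end
end
```

Both produce identical binaries; which is faster is exactly what the Benchee numbers later in the thread measure.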

@michallepicki

I applied the optimization in elixir-mail. It works when there are no other references to the accumulator and the append happens in the tail position - so the binary is shared rather than copied, as explained in the linked docs

@sabiwara
Contributor Author

sabiwara commented Dec 4, 2022

Interesting, but I'm not sure this is exactly the same thing. The old version in your example wasn't building an IO list and directly calling IO.iodata_to_binary/1; it was building binaries in each call and then calling Enum.join/1 at the end (which has to do one more pass over the list to call to_string/1 on each element). It would be interesting to see how it compares if we just replace [<<char>> | acc] with [char | acc] and Enum.join/1 with IO.iodata_to_binary/1.
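The variant being suggested relies on iodata accepting raw bytes (integers 0–255) directly, so there is no need to wrap each char in a one-byte binary. A minimal sketch (hypothetical accumulator loop, not elixir-mail's actual decoder):

```elixir
# Sketch: accumulate raw byte values in a list and flatten once at the end.
# [char | acc] avoids allocating a <<char>> sub-binary per character, and
# IO.iodata_to_binary/1 skips the extra to_string/1 pass that Enum.join/1 does.
defmodule CharAccSketch do
  def collect(bytes) do
    bytes
    |> Enum.reduce([], fn byte, acc -> [byte | acc] end)
    |> Enum.reverse()
    |> IO.iodata_to_binary()
  end
end
```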

@sabiwara
Contributor Author

sabiwara commented Dec 4, 2022

@michallepicki FYI, here is the previous benchmark I used, updated with binary concat: link.
I might be doing something wrong, but I'm basically getting:

```
recursive_no_iodata_length       1048.36
recursive_binary                  471.28 - 2.22x slower +1.17 ms

recursive_no_iodata_length       273.61 KB
recursive_binary                 312.65 KB - 1.14x memory usage +39.04 KB
```

@michallepicki

@sabiwara Your bench looks good and I see similar results! So in that case appending to a binary accumulator is slower, thanks for measuring!
