
Optimize HTTP2.Adapter.read_req_body/2 #37

Merged (9 commits, Dec 4, 2022)
Conversation

sabiwara
Contributor

@sabiwara sabiwara commented Oct 16, 2022

Hello,

I tried rewriting HTTP2.Adapter.read_req_body/2 with simple recursion instead of streams, both for simplicity and to see how it would affect performance.

In a quick benchmark I found that:

  • moving to recursion seems to have a noticeable impact on reducing memory usage, with only a small speedup
  • no longer calling IO.iodata_length/1 on every new chunk, and instead keeping track of the remaining length, changes the time complexity from quadratic to linear, which leads to huge speedups for big bodies (with essentially the same memory usage)

I wanted to try to benchmark this in a more realistic setup; do you have any recommendations on where I could get started?
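For illustration, the core of the optimization looks roughly like this (a minimal sketch with hypothetical names, not Bandit's actual adapter code, which reads chunks from the HTTP/2 stream process): decrement a remaining-byte counter per chunk rather than re-measuring the whole accumulator with IO.iodata_length/1 on every iteration.

```elixir
# Sketch: read up to `max_bytes` by simple recursion. Each step does O(1)
# work (measuring only the new chunk), so the whole read is linear; calling
# IO.iodata_length/1 on the growing accumulator each time would be quadratic.
defmodule ReadBodySketch do
  def read_req_body(chunks, max_bytes) do
    do_read(chunks, max_bytes, [])
  end

  # Budget exhausted: reverse the accumulated chunks and flatten once.
  defp do_read(_chunks, remaining, acc) when remaining <= 0 do
    IO.iodata_to_binary(Enum.reverse(acc))
  end

  # No more chunks: same finalization.
  defp do_read([], _remaining, acc), do: IO.iodata_to_binary(Enum.reverse(acc))

  defp do_read([chunk | rest], remaining, acc) do
    # Only the new chunk is measured; the accumulator is never re-scanned.
    do_read(rest, remaining - byte_size(chunk), [chunk | acc])
  end
end
```

Note the sketch consumes whole chunks, so it may overshoot `max_bytes` by up to one chunk; the point is the O(1) bookkeeping per chunk, not exact truncation.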

@mtrudel
Owner

mtrudel commented Oct 16, 2022

This is great work!

I'm planning to turn attention at performance issues specifically as a part of the 0.7.0 release train (coming shortly after phoenixframework/phoenix#5003 and a few other stragglers land, likely mid-Nov 2022). Rough plans are laid out at mtrudel/network_benchmark#1 (comment).

As part of that work I'm planning on adding in reproducible micro benchmarks on a number of common flows to run as part of CI. This work would be a perfect test candidate for such work, at least within the HTTP/2 path. I'd propose to hold off on merging this until that comes along, as I'd like to quantify changes like this with stable benchmarks.

WDYT?

@sabiwara
Contributor Author

Sounds good!
I love the idea of reproducible micro-benchmarks, I'm looking forward to seeing the results 😁

@mtrudel mtrudel added the benchmark Assign this to a PR to have the benchmark CI suite run label Nov 17, 2022
@mtrudel
Owner

mtrudel commented Nov 17, 2022

@sabiwara as you may have noticed I'm spamming this PR with some changes to get the new benchmarking setup working with forked repos (this is something I can only test with forked PRs).

Ignore the chatter for now - I'm just puttering with the workflow files, not your changes.

@mtrudel mtrudel added benchmark Assign this to a PR to have the benchmark CI suite run and removed benchmark Assign this to a PR to have the benchmark CI suite run labels Nov 18, 2022
@mtrudel
Owner

mtrudel commented Nov 18, 2022

Alright, chatter done. You're now the proud owner of Bandit's first benchmark CI!

Check out https://github.com/mtrudel/bandit/actions/runs/3493007558#summary-9563992273 for the overview, and the CSV download on that same page for the details.

Curiously, it's showing your changes as being rather worse on memory. Not sure that I believe that, given the numbers you're showing in Benchee. Maybe not the best start for the benchmarker 😬

I'll dig into this more in-depth and report back!

@sabiwara
Contributor Author

Wow, I didn't know what to expect, but this is still a surprising result 😁
I'll try to dive in a bit on my end as well, really curious about what is causing this.

@mtrudel
Owner

mtrudel commented Nov 19, 2022

The other thing I'd say is not to put too much faith in the benchmarker. It's a new tool and not particularly proven, especially as regards its memory profiling (which is a tough job to get right).

@mtrudel
Owner

mtrudel commented Dec 3, 2022

Dove more into this and all looks good (I'm seeing about a 5% lift on perf, and though I can't get sensible memory numbers out of the benchmarks, your Benchee work is pretty convincing).

I say we land this. Good to merge from your perspective?

@sabiwara
Contributor Author

sabiwara commented Dec 3, 2022

> I say we land this. Good to merge from your perspective?

Sounds good!! Memory does seem tricky to measure, but like you said, the micro benchmark seems quite convincing 👍

@mtrudel mtrudel merged commit 1bf43b5 into mtrudel:main Dec 4, 2022
@mtrudel
Owner

mtrudel commented Dec 4, 2022

A W E S O M E

Thanks for the great work @sabiwara! I'll be adding a few more HTTP/2 things in the next week or so and will drop a release once it's all done.

@sabiwara sabiwara deleted the bench branch December 4, 2022 02:46
@michallepicki

michallepicki commented Dec 4, 2022

As per the Erlang Efficiency Guide, it's probably faster (and more memory efficient) to append to a binary accumulator instead of collecting a list, reversing it, and calling IO.iodata_to_binary()

@sabiwara
Contributor Author

sabiwara commented Dec 4, 2022

> As per the Erlang Efficiency Guide, it's probably faster (and more memory efficient) to append to a binary accumulator instead of collecting a list, reversing it, and calling IO.iodata_to_binary()

I'm not sure this is true, and the linked page doesn't explicitly compare the two approaches. I'm really curious if you've got a benchmark showing this; my attempts tend to show the opposite (>2x slower and slightly more memory usage with binary concatenation). Building and reversing lists is quite fast, and so is IO.iodata_to_binary/1. BTW, this is also how Enum.join is implemented.
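For reference, the two accumulation strategies under discussion look roughly like this (toy functions for illustration, not code from either project):

```elixir
defmodule AccumSketch do
  # Strategy 1: prepend each chunk to a list (O(1) per chunk), then do a
  # single reverse and a single flatten at the end.
  def via_iolist(chunks) do
    chunks
    |> Enum.reduce([], fn chunk, acc -> [chunk | acc] end)
    |> Enum.reverse()
    |> IO.iodata_to_binary()
  end

  # Strategy 2: append each chunk onto a growing binary accumulator,
  # relying on the runtime's binary-append optimization.
  def via_binary(chunks) do
    Enum.reduce(chunks, <<>>, fn chunk, acc -> acc <> chunk end)
  end
end
```

Both produce identical binaries; which is faster is exactly what the Benchee numbers later in the thread measure.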

@michallepicki

I applied the optimization in elixir-mail. It works when there are no other references to the accumulator and the append happens in the tail position - so the binary is shared rather than copied, as explained in the linked docs

@sabiwara
Contributor Author

sabiwara commented Dec 4, 2022

Interesting, but I'm not sure this is exactly the same thing. The old version in your example wasn't building an IO list and directly calling IO.iodata_to_binary/1; it was building binaries in each call and then calling Enum.join/1 at the end (which has to do one more pass over the list to call to_string/1 on each element). It would be interesting to see how it compares if we just replace [<<char>> | acc] with [char | acc] and Enum.join/1 with IO.iodata_to_binary/1.
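The variant being suggested relies on iodata accepting raw bytes (integers 0–255) directly, so there is no need to wrap each char in a one-byte binary. A minimal sketch (hypothetical accumulator loop, not elixir-mail's actual decoder):

```elixir
# Sketch: accumulate raw byte values in a list and flatten once at the end.
# [char | acc] avoids allocating a <<char>> sub-binary per character, and
# IO.iodata_to_binary/1 skips the extra to_string/1 pass that Enum.join/1 does.
defmodule CharAccSketch do
  def collect(bytes) do
    bytes
    |> Enum.reduce([], fn byte, acc -> [byte | acc] end)
    |> Enum.reverse()
    |> IO.iodata_to_binary()
  end
end
```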

@sabiwara
Contributor Author

sabiwara commented Dec 4, 2022

@michallepicki FYI, here is the previous benchmark I used, updated with binary concat: link.
I might be doing something wrong, but I'm basically getting:

```
recursive_no_iodata_length       1048.36
recursive_binary                  471.28 - 2.22x slower +1.17 ms

recursive_no_iodata_length       273.61 KB
recursive_binary                 312.65 KB - 1.14x memory usage +39.04 KB
```

@michallepicki

@sabiwara Your bench looks good and I see similar results! So in that case appending to a binary accumulator is slower, thanks for measuring!
