New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve client response code, chunked bodies #2595
Conversation
I saw this after I submitted my PR #2597 I did some experiments:
For chunked body benchmark: Using your branch:
My experiments:
|
No problem. I've got a hybrid that I'm about to push. More then... |
Pushed. If you run the following commands from the Puma folder after compiling:
You should get timing info that I find varies less on my system than wrk. You can also grab the diff for the 'Add test and benchmark files' commit and add it to master or your PR. I you have time to run it, I'd be interested in your results... A few (possibly crazy) thoughts:
It looks like there's an issue with macOS. I'll see what that's about. I'm Windows & WSL2/Ubuntu... |
lib/puma/request.rb
Outdated
next if (byte_size = part.bytesize).zero? | ||
running_len += byte_size | ||
if running_len > BUFFER_LENGTH && byte_size != running_len | ||
io.write_nonblock strm.read |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to handle IO::WaitWritable
and Errno::EINTR
, also write_nonblock may perform partial write according to the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. That was a mistake. Fixed in push just now. And you're correct, write_nonblock
& syswrite
both may do partial writes for large strings.
Updated things to use The chunked and normal array have 1kB elements. Also, I can't see any performance improvement with cork/uncork, so they're commented out for now. General change is that headers are written to the IOBuffer, and it is written with the first body element/segment. Early-hints are written separately. |
…rite_nonblock Co-authored-by: Calvin Xiao <calvin325@gmail.com>
I have tested this in several ways. In every test, it performs better than current code. In some tests, it may only be a few percent faster, in other tests, it's much more significant. It no longer uses cork/uncork, that code is currently commented out. It should be removed at some point. Also, I can't test on MacOS, so if anyone can test on that platform, results would be interesting to see. |
So, the justification for removing cork is basically just "after this PR, it makes no difference to latency or throughput", correct? |
Let's split the benchmark changes into a separate PR (because they look good and should just go in), and the test folder changes also into a separate PR and consider the lib directory changes by themselves. |
I'm making a few changes in the test code, and documenting it. Give me a few days. More later, as I'll add the docs somewhere public... |
Hey @MSP-Greg let me know if I can help any way here. |
Nobody can help me! Sorry, standard response from back in the crazy jazz musician days... I have cleaned up much of the test code. Added doc, more sensible names for methods, an overall md file, etc. I've got some benchmark files I'll be creating a PR for soon, then another with more of the test framework code. My branch with all of this is showing improved request per second (RPS) metrics, especially for chunked or enumerated response bodies... |
The code here grouped the test results by body type (array, chunked, string). which requires one to compare 'across' the results. The code used and shown in #2696 (comment) groups the data by body size, which I find much easier to look at. That, along with improvements, more shared code, etc, means that I'll close this in favor of code blocked by a few PR's... |
Description
When clients start writing a request, Puma does several things.
Step 3 is trivial, so improving Puma involves optimizing the other steps. Many of the benchmarks use a minimal response with few headers and a small, single string body. This drives the time needed for step 4 to zero, and hence measures steps 1 and 2.
This PR looks at the performance of step 4, using more typical responses, especially bodies that are not simple, small string arrays.
What this PR changes:
Puma::IOBuffer
is subclassed fromStringIO
. The basic API is not changed, and because it is more stream-like than aString
, it isn't affected by encoding issues (see Puma versions >= v5.0.3 throw Encoding::CompatibilityError: incompatible character encodings: US-ASCII and UTF-16LE #2583). It is now used to assemble both the headers and the body.Puma::Request
- addedfast_write_body
method. This is optimized for increasing the speed of writing the request, both for when body is an enum, and also if it is a File object.Current code wrote the headers, then the body. Updated code writes when the next response addition would result in over 128kB in the response buffer. This limit can be increased, I just used it as it was the largest chunk that my system seemed to read in the client code...
Current code uses
syswrite
for most client socket operations. After a lot of benchmarking, and reviewing code elsewhere (like Unicorn), I'm not sure if there's a clear case for eithersyswrite
orwrite
. The code in the PR has been changed towrite_nonblock
.Several files have been included that allow benchmarking request/response performance. Scripts for using
test/helpers/sockets.rb
to summarize total response times and 'wrk' are included. Seebenchmarks/request_reponse_time_benchmarks.md
for a doc regarding these additions.Performance using included files is below. Some of the differences are insignificant, but the speed increase in 'Chunked Body' - '10kB Body' and especially '100kB Body' are considerable. The 'Chunked Body' rack app returns a 1kB byte string, so the '100kB Body' has 100 steps in the enumeration.
Data is included for this PR, PR 2597 (
write_nonblock
), and master. 'RPS' is requests per second, the timing info is in mS, timed from the start of the request write to receiving the full body. Note that when using sockets.rb / create_clients, one can hit a server hard enough that the request/response time will begin to increase and the RPS number will only show small increases.SSL sockets - 10,000 requests - 10 loops of 100 clients * 10 requests per client
TCP sockets - 10,000 requests - 10 loops of 100 clients * 10 requests per client
Unix sockets - 10,000 requests - 10 loops of 100 clients * 10 requests per client
Your checklist for this pull request
[ci skip]
to the title of the PR.#issue
" to the PR description or my commit messages.