Improve client response code, chunked bodies #2595

MSP-Greg · 2021-04-09T03:37:45Z

Description

When clients start writing a request, Puma does several things.

1. Determine which client to process next
2. Read and parse the request
3. Send the request to the app
4. Process the response

Step 3 is trivial, so improving Puma involves optimizing the other steps. Many of the benchmarks use a minimal response with few headers and a small, single string body. This drives the time needed for step 4 to zero, and hence measures steps 1 and 2.

This PR looks at the performance of step 4, using more typical responses, especially bodies that are not simple, small string arrays.

What this PR changes:

Puma::IOBuffer is now subclassed from StringIO. The basic API is unchanged, and because it is more stream-like than a String, it isn't affected by encoding issues (see #2583, "Puma versions >= v5.0.3 throw Encoding::CompatibilityError: incompatible character encodings: US-ASCII and UTF-16LE"). It is now used to assemble both the headers and the body.
Puma::Request - added a fast_write_body method. It is optimized to speed up writing the response, both when the body is an enum and when it is a File object.
Current code wrote the headers, then the body. Updated code writes when the next response addition would result in over 128kB in the response buffer. This limit can be increased; I used it because it was the largest chunk that my system seemed to read in the client code...
Current code uses syswrite for most client socket operations. After a lot of benchmarking, and reviewing code elsewhere (like Unicorn), I'm not sure if there's a clear case for either syswrite or write. The code in the PR has been changed to write_nonblock.
Several files have been included that allow benchmarking request/response performance. Scripts for using test/helpers/sockets.rb to summarize total response times and 'wrk' are included. See benchmarks/request_reponse_time_benchmarks.md for a doc regarding these additions.
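
A minimal sketch of the buffering strategy described in the list above. The helper name `buffered_write` and the constant `BUFFER_LIMIT` are hypothetical; this is not the PR's actual fast_write_body code. Headers and body parts accumulate in a binary StringIO, and the socket is written only when the next part would push the buffer past the limit:

```ruby
require 'stringio'

# Hypothetical limit, mirroring the 128kB figure mentioned above.
BUFFER_LIMIT = 128 * 1024

# Illustrative helper (not Puma's API): buffers headers and body parts,
# writing to io only when the next part would exceed BUFFER_LIMIT.
# Returns the number of write calls made.
def buffered_write(io, headers, body)
  buffer = StringIO.new(String.new) # String.new is binary (ASCII-8BIT)
  buffer.write headers
  writes = 0

  # Sends the buffered bytes, then empties the buffer for reuse.
  flush = lambda do
    io.write buffer.string
    writes += 1
    buffer.truncate 0
    buffer.rewind
  end

  body.each do |part|
    next if part.empty?
    flush.call if buffer.size > 0 && buffer.size + part.bytesize > BUFFER_LIMIT
    buffer.write part
  end

  flush.call if buffer.size > 0
  writes
end
```

With a 100kB body made of 1kB parts, the headers and the whole body fit under the limit, so this sketch issues a single write instead of one per part.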

Performance using the included files is below. Some of the differences are insignificant, but the speed increases in 'Chunked Body' - '10kB Body' and especially '100kB Body' are considerable. The 'Chunked Body' rack app returns a 1kB byte string, so the '100kB Body' has 100 steps in the enumeration.

Data is included for this PR, PR 2597 (write_nonblock), and master. 'RPS' is requests per second. The timing info is in ms, timed from the start of the request write to receiving the full body. Note that when using sockets.rb / create_clients, one can hit a server hard enough that the request/response time will begin to increase and the RPS number will only show small increases.
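
For readers unfamiliar with the columns in the tables that follow, here is one way such figures could be derived from raw per-request timings. It uses the nearest-rank percentile definition, which is an assumption; the helper names are hypothetical and this is not the code in test/helpers/sockets.rb:

```ruby
# Illustrative helpers (hypothetical names, not from test/helpers/sockets.rb).

# Requests per second over a wall-clock interval.
def rps(request_count, elapsed_seconds)
  request_count / elapsed_seconds.to_f
end

# Nearest-rank percentile of response times; pct is an Integer from 1 to 100.
def percentile(times, pct)
  sorted = times.sort
  rank = (pct * sorted.length).fdiv(100).ceil
  sorted[[rank, 1].max - 1]
end

# Made-up sample timings in ms, only to show the call pattern.
times_ms = [0.8, 0.6, 3.2, 0.7, 1.1, 0.9, 2.2, 0.7, 0.8, 1.0]
[50, 90, 95].each do |pct|
  puts format('%3d%% %6.3f', pct, percentile(times_ms, pct))
end
```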

SSL sockets - 10,000 requests - 10 loops of 100 clients * 10 requests per client

════════════════════════════════════════════════════════════════════════════ Chunked Body
  RPS      5%     10%     20%     40%     50%     60%     80%     90%     95%
──────────────────────────────────────────────────────────────────────────────   1kB Body
 5,985   0.583   0.645   0.695   0.771   0.815   0.867   1.058   2.176   3.174  PR 2595
 2,208   3.431   3.529   3.652   3.857   3.968   4.104   4.566   5.268   5.827  PR 2597
 2,215   3.405   3.507   3.623   3.815   3.930   4.073   4.570   5.249   5.801  Master

──────────────────────────────────────────────────────────────────────────────  10kB Body
 5,731   0.624   0.674   0.715   0.777   0.815   0.872   1.087   2.345   3.511  PR 2595
   623  14.535  14.707  14.925  15.234  15.382  15.557  16.221  17.065  17.902  PR 2597
   622  14.483  14.687  14.912  15.242  15.396  15.571  16.273  17.200  17.962  Master

────────────────────────────────────────────────────────────────────────────── 100kB Body
 1,847   3.954   4.093   4.269   4.521   4.659   4.805   5.268   5.982   6.759  PR 2595
    67  141.32  142.18  143.35  145.26  146.09  146.80  148.61  150.21  151.57  PR 2597
    67  141.59  142.42  143.51  145.16  145.84  146.55  148.51  150.19  151.55  Master


══════════════════════════════════════════════════════════════════════════════ Array Body
  RPS      5%     10%     20%     40%     50%     60%     80%     90%     95%
──────────────────────────────────────────────────────────────────────────────   1kB Body
 5,993   0.591   0.651   0.700   0.773   0.813   0.863   1.042   2.182   3.223  PR 2595
 3,997   1.412   1.506   1.597   1.741   1.822   1.923   2.313   3.192   3.856  PR 2597
 4,007   1.412   1.507   1.593   1.741   1.817   1.919   2.317   3.176   3.831  Master

──────────────────────────────────────────────────────────────────────────────  10kB Body
 5,721   0.628   0.687   0.735   0.811   0.853   0.905   1.083   2.280   3.273  PR 2595
 1,598   5.039   5.154   5.298   5.516   5.632   5.776   6.315   7.042   7.685  PR 2597
 1,614   5.007   5.116   5.254   5.468   5.589   5.738   6.305   7.070   7.723  Master

────────────────────────────────────────────────────────────────────────────── 100kB Body
 1,956   3.698   3.828   3.974   4.209   4.346   4.497   5.006   5.864   6.658  PR 2595
   200  46.873  47.263  47.791  48.604  48.988  49.395  50.396  51.328  52.309  PR 2597
   200  47.085  47.423  47.858  48.656  49.078  49.509  50.582  51.603  52.648  Master


═════════════════════════════════════════════════════════════════════════════ String Body
  RPS      5%     10%     20%     40%     50%     60%     80%     90%     95%
──────────────────────────────────────────────────────────────────────────────   1kB Body
 5,951   0.586   0.655   0.709   0.789   0.831   0.882   1.062   2.219   3.164  PR 2595
 4,767   1.000   1.108   1.187   1.306   1.372   1.447   1.778   2.745   3.439  PR 2597
 4,768   0.997   1.105   1.184   1.303   1.365   1.442   1.765   2.730   3.456  Master

──────────────────────────────────────────────────────────────────────────────  10kB Body
 5,716   0.615   0.687   0.740   0.818   0.861   0.909   1.083   2.299   3.247  PR 2595
 4,593   1.036   1.140   1.213   1.331   1.398   1.481   1.850   2.853   3.603  PR 2597
 4,629   1.011   1.129   1.209   1.326   1.388   1.469   1.833   2.882   3.591  Master

────────────────────────────────────────────────────────────────────────────── 100kB Body
 1,944   3.748   3.879   4.034   4.296   4.434   4.591   5.059   5.705   6.477  PR 2595
 1,793   4.190   4.311   4.475   4.761   4.896   5.061   5.549   6.199   6.889  PR 2597
 1,779   4.187   4.328   4.511   4.782   4.927   5.092   5.610   6.251   6.902  Master

TCP sockets - 10,000 requests - 10 loops of 100 clients * 10 requests per client

════════════════════════════════════════════════════════════════════════════ Chunked Body
  RPS      5%     10%     20%     40%     50%     60%     80%     90%     95%
──────────────────────────────────────────────────────────────────────────────   1kB Body
10,485   0.285   0.308   0.336   0.383   0.411   0.464   0.812   0.986   1.160  PR 2595
10,420   0.286   0.310   0.339   0.389   0.421   0.481   0.823   0.991   1.169  PR 2597
 8,669   0.277   0.305   0.344   0.511   0.692   0.833   1.150   1.435   1.653  Master

──────────────────────────────────────────────────────────────────────────────  10kB Body
10,063   0.313   0.333   0.362   0.406   0.433   0.470   0.782   1.019   1.184  PR 2595
 9,812   0.310   0.335   0.365   0.415   0.444   0.494   0.846   1.047   1.236  PR 2597
 6,527   0.276   0.315   0.372   0.706   0.935   1.233   1.819   2.265   2.725  Master

────────────────────────────────────────────────────────────────────────────── 100kB Body
 5,110   0.938   1.066   1.134   1.218   1.262   1.325   1.627   1.903   2.186  PR 2595
 4,965   0.976   1.078   1.147   1.241   1.296   1.382   1.744   2.004   2.320  PR 2597
 2,033   0.835   0.930   1.077   1.398   1.857   3.918   7.797  10.291  14.146  Master


══════════════════════════════════════════════════════════════════════════════ Array Body
  RPS      5%     10%     20%     40%     50%     60%     80%     90%     95%
──────────────────────────────────────────────────────────────────────────────   1kB Body
10,527   0.289   0.312   0.341   0.387   0.415   0.460   0.773   0.978   1.142  PR 2595
10,372   0.288   0.312   0.341   0.390   0.421   0.475   0.797   0.997   1.171  PR 2597
 9,386   0.281   0.308   0.347   0.424   0.570   0.727   0.969   1.206   1.439  Master

──────────────────────────────────────────────────────────────────────────────  10kB Body
10,151   0.306   0.328   0.357   0.402   0.427   0.467   0.766   1.007   1.167  PR 2595
 9,992   0.304   0.329   0.358   0.407   0.435   0.482   0.817   1.022   1.208  PR 2597
 8,260   0.292   0.321   0.363   0.470   0.677   0.851   1.200   1.528   1.817  Master

────────────────────────────────────────────────────────────────────────────── 100kB Body
 5,612   0.823   0.952   1.022   1.101   1.146   1.204   1.458   1.748   2.039  PR 2595
 5,666   0.849   0.953   1.015   1.094   1.135   1.194   1.479   1.738   1.991  PR 2597
 3,993   0.784   0.878   0.995   1.177   1.385   1.757   2.941   4.148   5.245  Master


═════════════════════════════════════════════════════════════════════════════ String Body
  RPS      5%     10%     20%     40%     50%     60%     80%     90%     95%
──────────────────────────────────────────────────────────────────────────────   1kB Body
10,537   0.289   0.312   0.341   0.387   0.416   0.463   0.771   0.978   1.139  PR 2595
10,420   0.291   0.312   0.340   0.387   0.416   0.471   0.810   0.984   1.157  PR 2597
 9,543   0.287   0.313   0.348   0.418   0.519   0.686   0.944   1.131   1.362  Master

──────────────────────────────────────────────────────────────────────────────  10kB Body
10,131   0.309   0.332   0.361   0.407   0.434   0.471   0.760   1.002   1.163  PR 2595
10,084   0.307   0.329   0.358   0.405   0.432   0.476   0.801   1.022   1.184  PR 2597
 9,262   0.304   0.330   0.366   0.430   0.490   0.664   0.964   1.152   1.408  Master

────────────────────────────────────────────────────────────────────────────── 100kB Body
 5,664   0.845   0.955   1.019   1.097   1.137   1.190   1.462   1.740   2.017  PR 2595
 5,620   0.849   0.960   1.024   1.102   1.143   1.199   1.467   1.766   2.059  PR 2597
 5,567   0.853   0.963   1.024   1.105   1.149   1.213   1.543   1.792   2.109  Master

Unix sockets - 10,000 requests - 10 loops of 100 clients * 10 requests per client

════════════════════════════════════════════════════════════════════════════ Chunked Body
  RPS      5%     10%     20%     40%     50%     60%     80%     90%     95%
──────────────────────────────────────────────────────────────────────────────   1kB Body
11,566   0.280   0.300   0.325   0.364   0.384   0.411   0.629   0.763   0.913  PR 2595
11,361   0.289   0.308   0.332   0.371   0.391   0.417   0.638   0.780   0.945  PR 2597
 9,456   0.281   0.308   0.342   0.410   0.513   0.686   0.968   1.184   1.429  Master

──────────────────────────────────────────────────────────────────────────────  10kB Body
10,731   0.310   0.331   0.357   0.397   0.419   0.446   0.654   0.800   0.945  PR 2595
10,137   0.332   0.354   0.381   0.421   0.443   0.472   0.682   0.853   1.017  PR 2597
 6,927   0.305   0.334   0.379   0.511   0.753   1.002   1.597   2.053   2.445  Master

────────────────────────────────────────────────────────────────────────────── 100kB Body
 5,293   0.928   1.054   1.118   1.195   1.233   1.281   1.526   1.726   1.916  PR 2595
 4,025   1.274   1.412   1.496   1.592   1.640   1.697   1.929   2.175   2.387  PR 2597
 1,945   0.938   1.072   1.265   1.835   3.104   4.112   7.347   9.340  12.977  Master


══════════════════════════════════════════════════════════════════════════════ Array Body
  RPS      5%     10%     20%     40%     50%     60%     80%     90%     95%
──────────────────────────────────────────────────────────────────────────────   1kB Body
11,489   0.287   0.307   0.331   0.369   0.388   0.412   0.623   0.758   0.908  PR 2595
11,371   0.291   0.310   0.335   0.372   0.391   0.416   0.629   0.756   0.915  PR 2597
10,577   0.285   0.310   0.339   0.386   0.414   0.468   0.769   0.966   1.101  Master

──────────────────────────────────────────────────────────────────────────────  10kB Body
10,986   0.305   0.323   0.348   0.389   0.409   0.435   0.639   0.780   0.931  PR 2595
10,731   0.312   0.334   0.360   0.400   0.421   0.448   0.656   0.806   0.958  PR 2597
 9,017   0.301   0.326   0.363   0.427   0.497   0.676   1.002   1.281   1.525  Master

────────────────────────────────────────────────────────────────────────────── 100kB Body
 5,850   0.824   0.954   1.009   1.078   1.113   1.158   1.380   1.549   1.723  PR 2595
 4,245   1.326   1.444   1.636   1.749   1.798   1.858   2.074   2.238   2.402  PR 2597
 3,489   1.222   1.430   1.642   1.834   1.941   2.098   2.796   3.937   5.012  Master


═════════════════════════════════════════════════════════════════════════════ String Body
  RPS      5%     10%     20%     40%     50%     60%     80%     90%     95%
──────────────────────────────────────────────────────────────────────────────   1kB Body
11,506   0.284   0.303   0.329   0.367   0.387   0.412   0.622   0.757   0.903  PR 2595
11,402   0.290   0.309   0.334   0.374   0.394   0.419   0.634   0.775   0.923  PR 2597
10,948   0.291   0.312   0.340   0.382   0.407   0.443   0.684   0.874   1.016  Master

──────────────────────────────────────────────────────────────────────────────  10kB Body
10,996   0.305   0.326   0.351   0.389   0.410   0.434   0.641   0.782   0.938  PR 2595
10,868   0.310   0.328   0.354   0.392   0.412   0.438   0.648   0.796   0.940  PR 2597
10,504   0.311   0.332   0.360   0.404   0.427   0.459   0.693   0.887   1.049  Master

────────────────────────────────────────────────────────────────────────────── 100kB Body
 5,824   0.824   0.952   1.009   1.079   1.116   1.163   1.390   1.565   1.743  PR 2595
 5,845   0.837   0.953   1.009   1.080   1.117   1.163   1.379   1.556   1.747  PR 2597
 5,715   0.813   0.953   1.024   1.099   1.138   1.192   1.416   1.611   1.841  Master

Your checklist for this pull request

I have reviewed the guidelines for contributing to this repository.
I have added (or updated) appropriate tests if this PR fixes a bug or adds a feature.
My pull request is 100 lines added/removed or less so that it can be easily reviewed.
If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
I have updated the documentation accordingly.
All new and existing tests passed, including Rubocop.

calvinxiao · 2021-04-11T15:47:53Z

I saw this after I submitted my PR #2597

I did some experiments:

implement fast_write with write_nonblock as in my PR
replace io.write strm.read with fast_write(io, strm.read)

For chunked body benchmark:

Using your branch:

1KB Body, 7508.39 RPS
10KB Body, 6859.80 RPS
100KB Body, 4339.23 RPS

My experiments:

1KB Body, 8340.42 RPS
10KB Body, 8080.94 RPS
100KB Body, 5340.13 RPS

MSP-Greg · 2021-04-11T16:34:24Z

@calvinxiao

No problem. I've got a hybrid that I'm about to push. More then...

MSP-Greg · 2021-04-11T17:00:11Z

@calvinxiao

Pushed. If you run the following commands from the Puma folder after compiling:

benchmarks/local/chunked_string_times.sh -l10 -c100 -r10 -s tcp -t5:5 -w2
benchmarks/local/chunked_string_times.sh -l10 -c100 -r10 -s unix -t5:5 -w2

You should get timing info that I find varies less on my system than wrk. You can also grab the diff for the 'Add test and benchmark files' commit and add it to master or your PR.

If you have time to run it, I'd be interested in your results...

A few (possibly crazy) thoughts:

Since cork doesn't work with UNIXSockets, it would be nice to eliminate it, if it doesn't affect requests per second (RPS) or response times. I keep thinking 'this was designed for telnet'...
I changed all the methods to use write_nonblock. For code that may get hit a lot, I'll trade a bit more complexity for fewer method calls, especially if they have begin/rescue statements.

It looks like there's an issue with macOS. I'll see what that's about. I'm on Windows & WSL2/Ubuntu...

calvinxiao · 2021-04-12T00:56:43Z

lib/puma/request.rb

+        next if (byte_size = part.bytesize).zero?
+        running_len += byte_size
+        if running_len > BUFFER_LENGTH && byte_size != running_len
+          io.write_nonblock strm.read


This needs to handle IO::WaitWritable and Errno::EINTR. Also, write_nonblock may perform a partial write, according to the documentation.

Thanks. That was a mistake. Fixed in push just now. And you're correct, write_nonblock & syswrite both may do partial writes for large strings.
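
The concern raised in this review can be sketched as a loop that retries until every byte has been written. The helper name and the timeout value are hypothetical; this is an illustration of the technique, not the code pushed to the PR:

```ruby
# Illustrative helper (hypothetical name): writes all of data to io using
# write_nonblock, handling partial writes, IO::WaitWritable, and EINTR.
# Returns the number of bytes written.
def write_all_nonblock(io, data, timeout: 10)
  total = data.bytesize
  written = 0

  while written < total
    begin
      # write_nonblock returns how many bytes were accepted, which may be
      # fewer than requested, so advance by that amount and loop again.
      written += io.write_nonblock(data.byteslice(written, total - written))
    rescue IO::WaitWritable
      # The socket buffer is full; wait until it is writable, then retry.
      raise IOError, 'write timed out' unless IO.select(nil, [io], nil, timeout)
      retry
    rescue Errno::EINTR
      retry
    end
  end

  written
end
```

The same loop shape applies to syswrite, which can also return after writing only part of a large string.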

MSP-Greg · 2021-04-12T01:37:35Z

Updated things to use write_nonblock. Added a third rackup file, and used in the benchmarks/local/chunked_string_times.sh script. It now runs against a chunked array, a normal array, and an array with one string.

The chunked and normal array have 1kB elements.
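
The rackup files themselves are not shown in this thread. As a rough sketch of the three body types being compared, the apps might look something like the following; all names are hypothetical, and the assumption is that the 'chunked' case is simply an enumerated body sent without a content-length header:

```ruby
# Hypothetical stand-ins for the three body types (not the PR's rackup files).
CHUNK = ('a' * 1023 + "\n").freeze # one 1kB element

# Enumerated body with no content-length, so a server would typically
# fall back to chunked transfer encoding.
def chunked_app(kb)
  lambda do |_env|
    body = Enumerator.new { |y| kb.times { y << CHUNK } }
    [200, { 'content-type' => 'text/plain' }, body]
  end
end

# Array of 1kB elements with a known content-length.
def array_app(kb)
  lambda do |_env|
    headers = { 'content-type' => 'text/plain', 'content-length' => (kb * 1024).to_s }
    [200, headers, Array.new(kb, CHUNK)]
  end
end

# Array holding a single string of the full size.
def string_app(kb)
  lambda do |_env|
    headers = { 'content-type' => 'text/plain', 'content-length' => (kb * 1024).to_s }
    [200, headers, [CHUNK * kb]]
  end
end
```

Each lambda follows the Rack convention of returning a status, a headers hash, and a body that responds to each.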

Also, I can't see any performance improvement with cork/uncork, so they're commented out for now.

General change is that headers are written to the IOBuffer, and it is written with the first body element/segment. Early-hints are written separately.


MSP-Greg · 2021-04-17T13:46:13Z

I have tested this in several ways. In every test, it performs better than current code. In some tests, it may only be a few percent faster; in other tests, the difference is much more significant.

It no longer uses cork/uncork; that code is currently commented out. It should be removed at some point.

Also, I can't test on macOS, so if anyone can test on that platform, results would be interesting to see.

nateberkopec · 2021-05-31T13:13:07Z

So, the justification for removing cork is basically just "after this PR, it makes no difference to latency or throughput", correct?

nateberkopec · 2021-05-31T13:14:48Z

Let's split the benchmark changes into a separate PR (because they look good and should just go in), and the test folder changes also into a separate PR and consider the lib directory changes by themselves.

MSP-Greg · 2021-06-01T15:30:53Z

I'm making a few changes in the test code, and documenting it. Give me a few days. More later, as I'll add the docs somewhere public...

nateberkopec · 2021-08-18T14:37:29Z

Hey @MSP-Greg let me know if I can help any way here.

MSP-Greg · 2021-08-18T15:15:43Z

> let me know if I can help any way here.

Nobody can help me! Sorry, standard response from back in the crazy jazz musician days...

I have cleaned up much of the test code. Added doc, more sensible names for methods, an overall md file, etc.

I've got some benchmark files I'll be creating a PR for soon, then another with more of the test framework code.

My branch with all of this is showing improved requests per second (RPS) metrics, especially for chunked or enumerated response bodies...

MSP-Greg · 2021-09-14T16:00:28Z

The code here grouped the test results by body type (array, chunked, string), which requires one to compare 'across' the results. The code used and shown in #2696 (comment) groups the data by body size, which I find much easier to look at.

That, along with improvements, more shared code, etc, means that I'll close this in favor of code blocked by a few PRs...

MSP-Greg mentioned this pull request Apr 9, 2021

Refactor response writes, test refactors, fix UNPACK_TCP_STATE_FROM_TCP_INFO location #2554

Closed

8 tasks

nateberkopec added maintenance perf waiting-for-review Waiting on review from anyone labels Apr 10, 2021

MSP-Greg force-pushed the 00-chunked branch from 88d6566 to 9440417 Compare April 11, 2021 16:41

calvinxiao mentioned this pull request Apr 11, 2021

Use write_nonblock in fast_write #2597

Closed

7 tasks

MSP-Greg force-pushed the 00-chunked branch from 9440417 to 30ecef5 Compare April 11, 2021 23:59

calvinxiao reviewed Apr 12, 2021


MSP-Greg force-pushed the 00-chunked branch from 30ecef5 to 208298b Compare April 12, 2021 01:29

MSP-Greg force-pushed the 00-chunked branch from 208298b to a262e29 Compare April 13, 2021 14:57

MSP-Greg and others added 4 commits April 16, 2021 19:42

Add test and benchmark files

e0f60cb

request.rb - update fast_writes for chunked and/or enum bodies, use w…

67bbaa7

…rite_nonblock Co-authored-by: Calvin Xiao <calvin325@gmail.com>

io_buffer.rb - use StringIO instead of String

7dd2d76

minissl.rb - allow chaining with '<<', improve/simplify write

58664f8

MSP-Greg force-pushed the 00-chunked branch from a262e29 to 58664f8 Compare April 17, 2021 00:42

MSP-Greg marked this pull request as ready for review April 17, 2021 12:53

calvinxiao mentioned this pull request Apr 17, 2021

Defer checking if socket is closed #2602

Closed

7 tasks

MSP-Greg added the refactor label Apr 17, 2021

MSP-Greg mentioned this pull request Apr 19, 2021

Puma versions >= v5.0.3 throw Encoding::CompatibilityError: incompatible character encodings: US-ASCII and UTF-16LE #2583

Closed

This was referenced May 5, 2021

Add rack_url_scheme to Puma::DSL, allows setting of rack.url_scheme header #2586

Merged

Puma::IOBuffer can be more efficient and used in more places #2457

Closed

nateberkopec added waiting-for-changes Waiting on changes from the requestor and removed waiting-for-review Waiting on review from anyone labels May 17, 2021

nateberkopec added waiting-for-review Waiting on review from anyone and removed waiting-for-changes Waiting on changes from the requestor labels May 26, 2021

nateberkopec added waiting-for-changes Waiting on changes from the requestor and removed waiting-for-review Waiting on review from anyone labels Jun 4, 2021

MSP-Greg mentioned this pull request Sep 14, 2021

io_buffer.rb, request.rb - improve handling with a body array/enumeration #2696

Closed

7 tasks

MSP-Greg closed this Sep 14, 2021

MSP-Greg deleted the 00-chunked branch November 2, 2021 19:03
