Improve client response code, chunked bodies #2595

Closed
wants to merge 4 commits into from

Conversation

MSP-Greg
Member

@MSP-Greg MSP-Greg commented Apr 9, 2021

Description

When clients start writing a request, Puma does several things.

  1. Determines which client to process next
  2. Reads and parses the request
  3. Sends the request to the app
  4. Processes the response

Step 3 is trivial, so improving Puma involves optimizing the other steps. Many of the benchmarks use a minimal response with few headers and a small, single string body. This drives the time needed for step 4 to zero, and hence measures steps 1 and 2.

This PR looks at the performance of step 4, using more typical responses, especially bodies that are not simple, small string arrays.

What this PR changes:

  1. Puma::IOBuffer is subclassed from StringIO. The basic API is not changed, and because it is more stream-like than a String, it isn't affected by encoding issues (see Puma versions >= v5.0.3 throw Encoding::CompatibilityError: incompatible character encodings: US-ASCII and UTF-16LE #2583). It is now used to assemble both the headers and the body.

  2. Puma::Request - added fast_write_body method. This is optimized to speed up writing the response, both when the body is an enum and when it is a File object.

  3. Current code wrote the headers, then the body. Updated code writes when the next response addition would result in over 128kB in the response buffer. This limit can be increased, I just used it as it was the largest chunk that my system seemed to read in the client code...

  4. Current code uses syswrite for most client socket operations. After a lot of benchmarking, and reviewing code elsewhere (like Unicorn), I'm not sure if there's a clear case for either syswrite or write. The code in the PR has been changed to write_nonblock.

  5. Several files have been included that allow benchmarking request/response performance. Scripts for using test/helpers/sockets.rb to summarize total response times and 'wrk' are included. See benchmarks/request_reponse_time_benchmarks.md for a doc regarding these additions.
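Item 3 above (write only when the next addition would push the buffer past 128kB) can be sketched roughly as follows. This is a minimal illustration, not the PR's fast_write_body: the names BUFFER_LIMIT and buffered_response_write are hypothetical, and a plain StringIO stands in for Puma::IOBuffer.

```ruby
require 'stringio'

# Hypothetical threshold, taken from the 128kB figure in the description.
BUFFER_LIMIT = 128 * 1024

# Hypothetical helper: writes headers and body parts to io, flushing the
# buffer only when the next part would push it past BUFFER_LIMIT.
# Returns the number of writes performed.
def buffered_response_write(io, headers, body)
  buffer = StringIO.new(String.new(encoding: Encoding::BINARY))
  buffer.write headers
  writes = 0

  body.each do |part|
    next if part.empty?
    if buffer.size > 0 && buffer.size + part.bytesize > BUFFER_LIMIT
      io.write buffer.string
      writes += 1
      buffer.truncate 0   # drop the flushed bytes
      buffer.rewind       # reset the position so the next write starts at 0
    end
    buffer.write part
  end

  unless buffer.size.zero?
    io.write buffer.string
    writes += 1
  end
  writes
end
```

With a body of three hundred 1kB parts, this performs a handful of large writes instead of one write per part, which is the effect the description attributes to the buffering change.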

Performance using the included files is below. Some of the differences are insignificant, but the speed increases for 'Chunked Body' with '10kB Body', and especially '100kB Body', are considerable. The 'Chunked Body' rack app returns 1kB byte strings, so the '100kB Body' has 100 steps in the enumeration.
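For readers unfamiliar with this kind of body, a Rack-style app that yields 1kB strings might look like the sketch below. The names OneKBParts and chunked_app are hypothetical and are not the rackup files included in the PR. Because the response sets no Content-Length, a server would typically frame it with chunked transfer encoding for HTTP/1.1 clients.

```ruby
# Hypothetical body object: yields `count` strings of `part_size` bytes each.
class OneKBParts
  def initialize(count, part_size = 1024)
    @count = count
    @part = ('a' * part_size).freeze
  end

  # Rack bodies only need to respond to #each and yield strings.
  def each
    return enum_for(:each) unless block_given?
    @count.times { yield @part }
  end
end

# Hypothetical app factory: a 100kB body is 100 enumeration steps of 1kB.
def chunked_app(kilobytes)
  lambda do |_env|
    [200, { 'Content-Type' => 'text/plain' }, OneKBParts.new(kilobytes)]
  end
end
```

Calling `chunked_app(100).call({})` returns a body whose `each` yields 100 strings, which is the case where per-part writes are most expensive.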

Data is included for this PR, PR 2597 (write_nonblock), and master. 'RPS' is requests per second. The timing info is in ms, timed from the start of the request write to receiving the full body. Note that when using sockets.rb / create_clients, one can hit a server hard enough that the request/response time will begin to increase and the RPS number will only show small increases.
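As a reading aid for the tables, the sketch below shows one way to derive an RPS figure and the percentile columns from a list of per-request times. It is an assumption for illustration: the helper names are hypothetical, and the nearest-rank percentile method is a guess, since the description does not say how sockets.rb computes these values.

```ruby
# Nearest-rank percentile over an already sorted array (assumed method).
def percentile(sorted, pct)
  return nil if sorted.empty?
  rank = (pct * sorted.length).fdiv(100).ceil
  sorted[[rank, 1].max - 1]
end

# Hypothetical summary: times_ms are per-request times in milliseconds,
# wall_seconds is the elapsed time for the whole run.
def summarize(times_ms, wall_seconds, pcts = [5, 10, 20, 40, 50, 60, 80, 90, 95])
  sorted = times_ms.sort
  {
    rps: (times_ms.length / wall_seconds).round,
    percentiles: pcts.to_h { |p| [p, percentile(sorted, p)] }
  }
end
```

Each row in the tables then corresponds to one such summary: a single RPS number followed by the response time at each percentile.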

SSL sockets - 10,000 requests - 10 loops of 100 clients * 10 requests per client
════════════════════════════════════════════════════════════════════════════ Chunked Body
  RPS      5%     10%     20%     40%     50%     60%     80%     90%     95%
──────────────────────────────────────────────────────────────────────────────   1kB Body
 5,985   0.583   0.645   0.695   0.771   0.815   0.867   1.058   2.176   3.174  PR 2595
 2,208   3.431   3.529   3.652   3.857   3.968   4.104   4.566   5.268   5.827  PR 2597
 2,215   3.405   3.507   3.623   3.815   3.930   4.073   4.570   5.249   5.801  Master

──────────────────────────────────────────────────────────────────────────────  10kB Body
 5,731   0.624   0.674   0.715   0.777   0.815   0.872   1.087   2.345   3.511  PR 2595
   623  14.535  14.707  14.925  15.234  15.382  15.557  16.221  17.065  17.902  PR 2597
   622  14.483  14.687  14.912  15.242  15.396  15.571  16.273  17.200  17.962  Master

────────────────────────────────────────────────────────────────────────────── 100kB Body
 1,847   3.954   4.093   4.269   4.521   4.659   4.805   5.268   5.982   6.759  PR 2595
    67  141.32  142.18  143.35  145.26  146.09  146.80  148.61  150.21  151.57  PR 2597
    67  141.59  142.42  143.51  145.16  145.84  146.55  148.51  150.19  151.55  Master


══════════════════════════════════════════════════════════════════════════════ Array Body
  RPS      5%     10%     20%     40%     50%     60%     80%     90%     95%
──────────────────────────────────────────────────────────────────────────────   1kB Body
 5,993   0.591   0.651   0.700   0.773   0.813   0.863   1.042   2.182   3.223  PR 2595
 3,997   1.412   1.506   1.597   1.741   1.822   1.923   2.313   3.192   3.856  PR 2597
 4,007   1.412   1.507   1.593   1.741   1.817   1.919   2.317   3.176   3.831  Master

──────────────────────────────────────────────────────────────────────────────  10kB Body
 5,721   0.628   0.687   0.735   0.811   0.853   0.905   1.083   2.280   3.273  PR 2595
 1,598   5.039   5.154   5.298   5.516   5.632   5.776   6.315   7.042   7.685  PR 2597
 1,614   5.007   5.116   5.254   5.468   5.589   5.738   6.305   7.070   7.723  Master

────────────────────────────────────────────────────────────────────────────── 100kB Body
 1,956   3.698   3.828   3.974   4.209   4.346   4.497   5.006   5.864   6.658  PR 2595
   200  46.873  47.263  47.791  48.604  48.988  49.395  50.396  51.328  52.309  PR 2597
   200  47.085  47.423  47.858  48.656  49.078  49.509  50.582  51.603  52.648  Master


═════════════════════════════════════════════════════════════════════════════ String Body
  RPS      5%     10%     20%     40%     50%     60%     80%     90%     95%
──────────────────────────────────────────────────────────────────────────────   1kB Body
 5,951   0.586   0.655   0.709   0.789   0.831   0.882   1.062   2.219   3.164  PR 2595
 4,767   1.000   1.108   1.187   1.306   1.372   1.447   1.778   2.745   3.439  PR 2597
 4,768   0.997   1.105   1.184   1.303   1.365   1.442   1.765   2.730   3.456  Master

──────────────────────────────────────────────────────────────────────────────  10kB Body
 5,716   0.615   0.687   0.740   0.818   0.861   0.909   1.083   2.299   3.247  PR 2595
 4,593   1.036   1.140   1.213   1.331   1.398   1.481   1.850   2.853   3.603  PR 2597
 4,629   1.011   1.129   1.209   1.326   1.388   1.469   1.833   2.882   3.591  Master

────────────────────────────────────────────────────────────────────────────── 100kB Body
 1,944   3.748   3.879   4.034   4.296   4.434   4.591   5.059   5.705   6.477  PR 2595
 1,793   4.190   4.311   4.475   4.761   4.896   5.061   5.549   6.199   6.889  PR 2597
 1,779   4.187   4.328   4.511   4.782   4.927   5.092   5.610   6.251   6.902  Master
TCP sockets - 10,000 requests - 10 loops of 100 clients * 10 requests per client
════════════════════════════════════════════════════════════════════════════ Chunked Body
  RPS      5%     10%     20%     40%     50%     60%     80%     90%     95%
──────────────────────────────────────────────────────────────────────────────   1kB Body
10,485   0.285   0.308   0.336   0.383   0.411   0.464   0.812   0.986   1.160  PR 2595
10,420   0.286   0.310   0.339   0.389   0.421   0.481   0.823   0.991   1.169  PR 2597
 8,669   0.277   0.305   0.344   0.511   0.692   0.833   1.150   1.435   1.653  Master

──────────────────────────────────────────────────────────────────────────────  10kB Body
10,063   0.313   0.333   0.362   0.406   0.433   0.470   0.782   1.019   1.184  PR 2595
 9,812   0.310   0.335   0.365   0.415   0.444   0.494   0.846   1.047   1.236  PR 2597
 6,527   0.276   0.315   0.372   0.706   0.935   1.233   1.819   2.265   2.725  Master

────────────────────────────────────────────────────────────────────────────── 100kB Body
 5,110   0.938   1.066   1.134   1.218   1.262   1.325   1.627   1.903   2.186  PR 2595
 4,965   0.976   1.078   1.147   1.241   1.296   1.382   1.744   2.004   2.320  PR 2597
 2,033   0.835   0.930   1.077   1.398   1.857   3.918   7.797  10.291  14.146  Master


══════════════════════════════════════════════════════════════════════════════ Array Body
  RPS      5%     10%     20%     40%     50%     60%     80%     90%     95%
──────────────────────────────────────────────────────────────────────────────   1kB Body
10,527   0.289   0.312   0.341   0.387   0.415   0.460   0.773   0.978   1.142  PR 2595
10,372   0.288   0.312   0.341   0.390   0.421   0.475   0.797   0.997   1.171  PR 2597
 9,386   0.281   0.308   0.347   0.424   0.570   0.727   0.969   1.206   1.439  Master

──────────────────────────────────────────────────────────────────────────────  10kB Body
10,151   0.306   0.328   0.357   0.402   0.427   0.467   0.766   1.007   1.167  PR 2595
 9,992   0.304   0.329   0.358   0.407   0.435   0.482   0.817   1.022   1.208  PR 2597
 8,260   0.292   0.321   0.363   0.470   0.677   0.851   1.200   1.528   1.817  Master

────────────────────────────────────────────────────────────────────────────── 100kB Body
 5,612   0.823   0.952   1.022   1.101   1.146   1.204   1.458   1.748   2.039  PR 2595
 5,666   0.849   0.953   1.015   1.094   1.135   1.194   1.479   1.738   1.991  PR 2597
 3,993   0.784   0.878   0.995   1.177   1.385   1.757   2.941   4.148   5.245  Master


═════════════════════════════════════════════════════════════════════════════ String Body
  RPS      5%     10%     20%     40%     50%     60%     80%     90%     95%
──────────────────────────────────────────────────────────────────────────────   1kB Body
10,537   0.289   0.312   0.341   0.387   0.416   0.463   0.771   0.978   1.139  PR 2595
10,420   0.291   0.312   0.340   0.387   0.416   0.471   0.810   0.984   1.157  PR 2597
 9,543   0.287   0.313   0.348   0.418   0.519   0.686   0.944   1.131   1.362  Master

──────────────────────────────────────────────────────────────────────────────  10kB Body
10,131   0.309   0.332   0.361   0.407   0.434   0.471   0.760   1.002   1.163  PR 2595
10,084   0.307   0.329   0.358   0.405   0.432   0.476   0.801   1.022   1.184  PR 2597
 9,262   0.304   0.330   0.366   0.430   0.490   0.664   0.964   1.152   1.408  Master

────────────────────────────────────────────────────────────────────────────── 100kB Body
 5,664   0.845   0.955   1.019   1.097   1.137   1.190   1.462   1.740   2.017  PR 2595
 5,620   0.849   0.960   1.024   1.102   1.143   1.199   1.467   1.766   2.059  PR 2597
 5,567   0.853   0.963   1.024   1.105   1.149   1.213   1.543   1.792   2.109  Master
Unix sockets - 10,000 requests - 10 loops of 100 clients * 10 requests per client
════════════════════════════════════════════════════════════════════════════ Chunked Body
  RPS      5%     10%     20%     40%     50%     60%     80%     90%     95%
──────────────────────────────────────────────────────────────────────────────   1kB Body
11,566   0.280   0.300   0.325   0.364   0.384   0.411   0.629   0.763   0.913  PR 2595
11,361   0.289   0.308   0.332   0.371   0.391   0.417   0.638   0.780   0.945  PR 2597
 9,456   0.281   0.308   0.342   0.410   0.513   0.686   0.968   1.184   1.429  Master

──────────────────────────────────────────────────────────────────────────────  10kB Body
10,731   0.310   0.331   0.357   0.397   0.419   0.446   0.654   0.800   0.945  PR 2595
10,137   0.332   0.354   0.381   0.421   0.443   0.472   0.682   0.853   1.017  PR 2597
 6,927   0.305   0.334   0.379   0.511   0.753   1.002   1.597   2.053   2.445  Master

────────────────────────────────────────────────────────────────────────────── 100kB Body
 5,293   0.928   1.054   1.118   1.195   1.233   1.281   1.526   1.726   1.916  PR 2595
 4,025   1.274   1.412   1.496   1.592   1.640   1.697   1.929   2.175   2.387  PR 2597
 1,945   0.938   1.072   1.265   1.835   3.104   4.112   7.347   9.340  12.977  Master


══════════════════════════════════════════════════════════════════════════════ Array Body
  RPS      5%     10%     20%     40%     50%     60%     80%     90%     95%
──────────────────────────────────────────────────────────────────────────────   1kB Body
11,489   0.287   0.307   0.331   0.369   0.388   0.412   0.623   0.758   0.908  PR 2595
11,371   0.291   0.310   0.335   0.372   0.391   0.416   0.629   0.756   0.915  PR 2597
10,577   0.285   0.310   0.339   0.386   0.414   0.468   0.769   0.966   1.101  Master

──────────────────────────────────────────────────────────────────────────────  10kB Body
10,986   0.305   0.323   0.348   0.389   0.409   0.435   0.639   0.780   0.931  PR 2595
10,731   0.312   0.334   0.360   0.400   0.421   0.448   0.656   0.806   0.958  PR 2597
 9,017   0.301   0.326   0.363   0.427   0.497   0.676   1.002   1.281   1.525  Master

────────────────────────────────────────────────────────────────────────────── 100kB Body
 5,850   0.824   0.954   1.009   1.078   1.113   1.158   1.380   1.549   1.723  PR 2595
 4,245   1.326   1.444   1.636   1.749   1.798   1.858   2.074   2.238   2.402  PR 2597
 3,489   1.222   1.430   1.642   1.834   1.941   2.098   2.796   3.937   5.012  Master


═════════════════════════════════════════════════════════════════════════════ String Body
  RPS      5%     10%     20%     40%     50%     60%     80%     90%     95%
──────────────────────────────────────────────────────────────────────────────   1kB Body
11,506   0.284   0.303   0.329   0.367   0.387   0.412   0.622   0.757   0.903  PR 2595
11,402   0.290   0.309   0.334   0.374   0.394   0.419   0.634   0.775   0.923  PR 2597
10,948   0.291   0.312   0.340   0.382   0.407   0.443   0.684   0.874   1.016  Master

──────────────────────────────────────────────────────────────────────────────  10kB Body
10,996   0.305   0.326   0.351   0.389   0.410   0.434   0.641   0.782   0.938  PR 2595
10,868   0.310   0.328   0.354   0.392   0.412   0.438   0.648   0.796   0.940  PR 2597
10,504   0.311   0.332   0.360   0.404   0.427   0.459   0.693   0.887   1.049  Master

────────────────────────────────────────────────────────────────────────────── 100kB Body
 5,824   0.824   0.952   1.009   1.079   1.116   1.163   1.390   1.565   1.743  PR 2595
 5,845   0.837   0.953   1.009   1.080   1.117   1.163   1.379   1.556   1.747  PR 2597
 5,715   0.813   0.953   1.024   1.099   1.138   1.192   1.416   1.611   1.841  Master

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added (or updated) appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

@calvinxiao
Contributor

I saw this after I submitted my PR #2597

I did some experiments:

  1. implement fast_write with write_nonblock as in my PR
  2. replace io.write strm.read with fast_write(io, strm.read)

For chunked body benchmark:

Using your branch:

  • 1KB Body, 7508.39 RPS
  • 10KB Body, 6859.80 RPS
  • 100KB Body, 4339.23 RPS

My experiments:

  • 1KB Body, 8340.42 RPS
  • 10KB Body, 8080.94 RPS
  • 100KB Body, 5340.13 RPS

@MSP-Greg
Member Author

@calvinxiao

No problem. I've got a hybrid that I'm about to push. More then...

@MSP-Greg
Member Author

MSP-Greg commented Apr 11, 2021

@calvinxiao

Pushed. If you run the following commands from the Puma folder after compiling:

benchmarks/local/chunked_string_times.sh -l10 -c100 -r10 -s tcp -t5:5 -w2
benchmarks/local/chunked_string_times.sh -l10 -c100 -r10 -s unix -t5:5 -w2

You should get timing info that I find varies less on my system than wrk. You can also grab the diff for the 'Add test and benchmark files' commit and add it to master or your PR.

If you have time to run it, I'd be interested in your results...

A few (possibly crazy) thoughts:

  1. Since cork doesn't work with UNIXSockets, it would be nice to eliminate it, if it doesn't affect requests per second (RPS) or response times. I keep thinking 'this was designed for telnet'...

  2. I changed all the methods to use write_nonblock. For code that may get hit a lot, I'll trade a bit more complexity for fewer method calls, especially if they have begin/rescue statements.

It looks like there's an issue with macOS. I'll see what that's about. I'm on Windows & WSL2/Ubuntu...

next if (byte_size = part.bytesize).zero?
running_len += byte_size
if running_len > BUFFER_LENGTH && byte_size != running_len
io.write_nonblock strm.read
Contributor

This needs to handle IO::WaitWritable and Errno::EINTR. Also, write_nonblock may perform a partial write, according to the documentation.

Member Author

Thanks. That was a mistake. Fixed in push just now. And you're correct, write_nonblock & syswrite both may do partial writes for large strings.
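The concern raised in this review thread can be illustrated with a small loop that keeps writing until every byte is sent. This is a sketch under assumptions, not the code pushed to the PR: write_all_nonblock and its timeout parameter are hypothetical names, and IO.select is used here as one way to wait for writability.

```ruby
# Hypothetical helper: writes all of str to io with write_nonblock,
# retrying after partial writes, IO::WaitWritable, and Errno::EINTR.
# Returns the number of bytes written.
def write_all_nonblock(io, str, timeout: 5)
  remaining = str.b # binary copy, so slicing is by byte
  until remaining.empty?
    begin
      n = io.write_nonblock(remaining)
    rescue IO::WaitWritable
      # The kernel buffer is full: wait until the socket is writable again.
      raise IOError, 'write timed out' unless IO.select(nil, [io], nil, timeout)
      retry
    rescue Errno::EINTR
      retry
    end
    # write_nonblock may accept only part of the data; keep the rest.
    remaining = remaining.byteslice(n..-1)
  end
  str.bytesize
end
```

A quick way to exercise the partial-write path is to send a payload larger than the pipe or socket buffer, for example 1MB through IO.pipe, while another thread reads the other end.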

@MSP-Greg
Member Author

MSP-Greg commented Apr 12, 2021

Updated things to use write_nonblock. Added a third rackup file, and used it in the benchmarks/local/chunked_string_times.sh script. It now runs against a chunked array, a normal array, and an array with one string.

The chunked and normal array have 1kB elements.

Also, I can't see any performance improvement with cork/uncork, so they're commented out for now.

General change is that headers are written to the IOBuffer, and it is written with the first body element/segment. Early-hints are written separately.
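The idea of sending the headers together with the first body segment can be sketched as below for a chunked response. This is an illustration under assumptions rather than the PR's code: chunk and head_with_first_chunk are hypothetical helpers, a StringIO stands in for Puma::IOBuffer, and the chunk framing (hex size, CRLF, data, CRLF) follows the HTTP/1.1 chunked transfer coding.

```ruby
require 'stringio'

CRLF = "\r\n"

# Frames one body part as an HTTP/1.1 chunk: hex size, CRLF, data, CRLF.
def chunk(part)
  "#{part.bytesize.to_s(16)}#{CRLF}#{part}#{CRLF}"
end

# Hypothetical helper: assembles the status line, headers, and the first
# chunk in one buffer, so they can go out in a single write.
def head_with_first_chunk(status_line, headers, first_part)
  buf = StringIO.new(String.new(encoding: Encoding::BINARY))
  buf << status_line << CRLF
  headers.each { |name, value| buf << name << ': ' << value << CRLF }
  buf << 'Transfer-Encoding: chunked' << CRLF << CRLF
  buf << chunk(first_part)
  buf.string
end
```

The returned string would be handed to a single socket write; later parts would be framed with `chunk` and the body terminated with a zero-length chunk.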

@MSP-Greg MSP-Greg marked this pull request as ready for review April 17, 2021 12:53
@MSP-Greg
Member Author

I have tested this in several ways. In every test, it performs better than current code. In some tests, it may only be a few percent faster, in other tests, it's much more significant.

It no longer uses cork/uncork, that code is currently commented out. It should be removed at some point.

Also, I can't test on macOS, so if anyone can test on that platform, results would be interesting to see.

@nateberkopec
Member

So, the justification for removing cork is basically just "after this PR, it makes no difference to latency or throughput", correct?

@nateberkopec
Member

Let's split the benchmark changes into a separate PR (because they look good and should just go in), and the test folder changes also into a separate PR and consider the lib directory changes by themselves.

@MSP-Greg
Member Author

MSP-Greg commented Jun 1, 2021

I'm making a few changes in the test code, and documenting it. Give me a few days. More later, as I'll add the docs somewhere public...

@nateberkopec nateberkopec added waiting-for-changes Waiting on changes from the requestor and removed waiting-for-review Waiting on review from anyone labels Jun 4, 2021
@nateberkopec
Member

Hey @MSP-Greg let me know if I can help any way here.

@MSP-Greg
Member Author

let me know if I can help any way here.

Nobody can help me! Sorry, standard response from back in the crazy jazz musician days...

I have cleaned up much of the test code. Added doc, more sensible names for methods, an overall md file, etc.

I've got some benchmark files I'll be creating a PR for soon, then another with more of the test framework code.

My branch with all of this is showing improved requests per second (RPS) metrics, especially for chunked or enumerated response bodies...

@MSP-Greg
Member Author

The code here grouped the test results by body type (array, chunked, string), which requires one to compare 'across' the results. The code used and shown in #2696 (comment) groups the data by body size, which I find much easier to look at.

That, along with improvements, more shared code, etc, means that I'll close this in favor of code blocked by a few PRs...

@MSP-Greg MSP-Greg closed this Sep 14, 2021
@MSP-Greg MSP-Greg deleted the 00-chunked branch November 2, 2021 19:03
Labels
maintenance perf refactor waiting-for-changes Waiting on changes from the requestor
3 participants