Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Puma::IOBuffer can be more efficient and used in more places #2457

Closed
WJWH opened this issue Oct 26, 2020 · 14 comments · Fixed by #2896
Closed

Puma::IOBuffer can be more efficient and used in more places #2457

WJWH opened this issue Oct 26, 2020 · 14 comments · Fixed by #2896
Labels
Milestone

Comments

@WJWH
Copy link
Contributor

WJWH commented Oct 26, 2020

At the moment, Puma::IOBuffer is just a subclass of String, with the append method overwritten and reset aliased to clear. Strangely, the overwritten append behavior seems to implement the same behavior as a normal String, but that is not important for this issue.

The main use of IOBuffer is during request serving where it is used to accumulate lines for the response status line and headers. During the course of a request, several lines are appended to the empty string and all in the end they are all fed to fast_write in a single call. Because a String in Ruby is implemented as a dynamic array of bytes representing codepoints, these repeated additions will cause the underlying array to be doubled in size and copied whenever the String requires more space. Because the initial size is very low and all the headers together can easily be a few hundred bytes, this doubling occurs several times, incurring allocations and copying each time. It's still amortized O(1), but the constant is higher than it could be.

A simple way to prevent this would be to change IOBuffer class so that it maintains a total_size along with an array of Strings that have been added so far. At the end when to_s is called on it, it allocates a single String with the correct capacity straight away and copies the entire array of strings into that.

The early hints construction in str_early_hints could probably also benefit from being buffered this way. I can create a PR if desired?

@nateberkopec
Copy link
Member

Certainly! For inclusion, we'll be looking at changes in results in the benchmarks folder. You can also write your own benchmark in that folder for a "worst case scenario" that you believe will showcase the changes the best.

@WJWH
Copy link
Contributor Author

WJWH commented Oct 26, 2020

I rewrote the IOBuffer class in several different ways but did not manage to get it faster than the current implementation. I suspect that the bookkeeping in Ruby land introduces more overhead than the String concatenation operations implemented in C, even if they are slightly more efficient algorithmically speaking.

With this in mind, I changed it back the original implementation but:

  • Preallocating a 4k string from the start.
  • Not iterating over appended strings with each but doing it in C with the concat function from String.
  • Don't clear the entire string (resetting the underlying size to zero) but allocate a new, preallocated String instead.

On the realistic_response benchmark I got results of [6081, 6251, 6181] req/s for my branch vs [6050, 6109, 6035] req/s for the code on master branch (running each version three times). This is technically 1.7% faster but the effect is quite small and the variability is quite high. It could probably be faster by dipping more into C but I quite like the readability and mostly-pure-Ruby-ness of the code as is so I'm not sure.

@MSP-Greg
Copy link
Member

Anyone ever try the NIO::ByteBuffer class?

@nateberkopec
Copy link
Member

Linking the original removal of the C extension here: #1980

@WJWH
Copy link
Contributor Author

WJWH commented Oct 27, 2020

@MSP-Greg I haven't used the NIO class in particular but this kind of buffers/stringbuilders were what inspired the issue in the first place. As I mentioned above I could not get the Ruby implementation to keep up with the C appending functions for String.

@nateberkopec Interesting to see the discussion on the other topic. For some reason it got into my head that this was pretty old code but it's only been a year or so. In the discussions it's mentioned that we have to support all the way back to Ruby 2.2, is there a list somewhere of supported Rubies? I'm not advocating we ruthlessly drop support but I do want to point out that even 2.4 is no longer supported by the core Ruby team and supporting older versions implicitly encourages people to run potentially unsecure Ruby versions.

After a good nights sleep I had some more ideas and will do some extra tests later this week.

EDIT: After some more digging, it seems that TCP_NODELAY is NOT set by default for most apps, since even though add_tcp_listener and add_ssl_listener have a optimize_for_latency parameter that defaults to true, in this bit in binder it practically always gets set to false unless people explicitly add low_latency=true to their bind uris. This would probably mean that in the vast majority of users will have Nagles algorithm enabled since the option is only mentioned in a single line of the DSL docs. Since the use of IOBuffer is preventing too many short writes, it might be interesting to see if we can just fast_write every string as it comes in and let the OS handle the packet coalescing.

@nateberkopec
Copy link
Member

is there a list somewhere of supported Rubies

The test suite is the spec - if it ain't in the test suite, we don't support it. We'll investigate the Ruby support issue again when we start thinking about Puma 6.0, but that is at least 1 month away.

@MSP-Greg
Copy link
Member

is there a list somewhere of supported Rubies

In terms of the supported version:

s.required_ruby_version = Gem::Requirement.new(">= 2.2")

As Nate mentioned, CI gives a good idea, as it may show platform info that can't be placed in the gemspec.

Sorry for mentioning NIO::ByteBuffer. At present, string operations aren't considered a bottleneck in request processing.

As to optimize_for_latency, CI has a matrix of three OS's, and there several Ruby platforms/versions. Regardless of the default setting, I think it's probably best to have a minimal number of writes. JFYI, CI creates client requests to test Puma, and we've found that syswrite does improve the stability, (as opposed to write or <<).

@WJWH
Copy link
Contributor Author

WJWH commented Nov 3, 2020

I finally got around to benchmarking a few more variants:

  • Changing IOBuffer to just an Array of strings and then iobuf.each {|str| fast_write(io,str)} had only about 30% the req/s of the current implementation, presumably due to the vastly higher number of syscalls.
  • Changing IOBuffer to just an Array of strings and using IO.write and IO.flush so that it falls through to the writev() syscall resulted in about 10% extra requests per second although I did not account for nonblocking IO and rescue-ing EWOULDBLOCK and friends. If it reduces reliability it's probably not worth it.

That leaves the few % to be gained from optimizing how the underlying String for the IOBuffer is initialized. Since Ruby 2.2 is still supported that won't work either since setting the initial capacity was only added later. I'll just add it to my longterm todo list and come back in a few years 🙂

@nateberkopec nateberkopec added this to the 6.0.0 milestone Nov 3, 2020
@nateberkopec
Copy link
Member

I'll mark as "6.0" since we'll consider dropping 2.2 support at that time.

@calvinxiao
Copy link
Contributor

calvinxiao commented Apr 11, 2021

I think append is just an alias of concat, oh it's mentioned in #1980

Benchmark shows the current append implementation is slower in args = ["Lorem", "ipsum", "dolor", "sit", "amet"]

Warming up --------------------------------------
              append   114.374k i/100ms
              concat   186.215k i/100ms
Calculating -------------------------------------
              append      1.151M (± 0.9%) i/s -      5.833M in   5.066979s
              concat      1.880M (± 0.9%) i/s -      9.497M in   5.051291s

Comparison:
              concat:  1880249.0 i/s
              append:  1151287.2 i/s - 1.63x  (± 0.00) slower

@FooBarWidget
Copy link

In Passenger we use writev() to perform a gathered write. This way we don't even have to allocate that final string.

Not sure whether Ruby core has support for writev nowadays. In Passenger we wrote our own support for it, through a native extension.

@MSP-Greg
Copy link
Member

MSP-Greg commented May 7, 2021

@FooBarWidget

Thanks for the post.

Not sure whether Ruby core has support for writev nowadays

IO.write has support for multiple arguments, but ssl sockets aren't true IO objects.

Given that, and also issues with encoding and bodies that need Puma to support 'chunking', I personally thought using a StringIO buffer was best, both for headers and the body. See PR #2595.

Again personally, I think a lot of code re IO and sockets was developed when memory and bandwidth were much lower than they are today.

@wjordan and others are looking at other options, like writev.

@nateberkopec
Copy link
Member

@MSP-Greg I think this is closed by #2896?

@MSP-Greg
Copy link
Member

MSP-Greg commented Sep 4, 2022

Yes. I'll add to the comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants