Use "interned" Strings when possible to avoid unnecessary heap allocation. #276

Closed
wants to merge 1 commit into from

Commits on Jul 20, 2021

  1. Use "interned" Strings when possible to avoid unnecessary heap allocation.
    
    Fixes ohler55#275
    
    - Background info: https://en.wikipedia.org/wiki/String_interning
    - Available to C extensions since MRI Ruby 3.0:
      - https://bugs.ruby-lang.org/issues/13381
      - https://bugs.ruby-lang.org/issues/16029
    
    This change makes `Ox` use "interned" (frozen and deduplicated) `String`s anywhere possible,
    the same effect as calling `String#-@` except without the heap churn of `RVALUE` allocation:
    https://ruby-doc.org/core/String.html#method-i-2D-40
    
    - Uses the `HAVE_<funcname>` preprocessor macros to detect this functionality
      and avoid breaking older Rubies (confirmed with at least MRI 2.7.2).
    - I explicitly use interned `String`s anywhere an allocated `String` seemed
      entirely internal to `Ox`, e.g. in `sax_value_as_time` when allocating
      the `String` argument to `ox_time_class`.
    - Adds a new user-facing option `intern_strings` to control the behavior
      of `String` return values from user-facing functions, e.g. from `sax_as_s`.
    - Results in a ~25% reduction in startup GC pressure in my library! :D
    
    Inspired by similar changes to `json` and `msgpack`:
    
    - flori/json#451
    - msgpack/msgpack-ruby#196
    
    My goal is `Object` allocation reduction in the `Ox::Sax` handler of my MIME Type
    file-identification library caused by the large number of duplicate `String`s in the
    `shared-mime-info`-format XML it uses as a data source.
    
    I was already calling `#-@` (or using the `-some_str_variable` syntax) everywhere possible
    in my handler, but that only freezes and deduplicates an already-allocated `String`:
    
      irb(main):068:0> oid = proc { puts "#{_1} is object_id #{_1.object_id}" }
      => #<Proc:0x0000564d53eb6850 (irb):66>
      irb(main):069:0> lol = 'lol'.tap(&oid).-@.tap(&oid)
      lol is object_id 19800
      lol is object_id 19740
      => "lol"
      irb(main):070:0> -lol.tap(&oid).-@.tap(&oid)
      lol is object_id 19740
      lol is object_id 19740
      => "lol"
    
    That avoids duplicate retained `String`s but still causes my library to take
    a big GC hit right at startup when it loads its data.
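    A minimal sketch of that distinction, using only core Ruby. The `allocations`
    helper and the sample MIME type are hypothetical names chosen for illustration,
    not part of `Ox` or of my library. `String#-@` hands back one shared frozen object
    for equal contents, but the throwaway `String` it was called on still has to be
    allocated first:

```ruby
# Count the objects allocated while the block runs (hypothetical helper).
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

a = String.new("text/plain")
b = String.new("text/plain")

deduped_a = -a
deduped_b = -b

DISTINCT_BEFORE = !a.equal?(b)                # two separate heap objects
SHARED_AFTER    = deduped_a.equal?(deduped_b) # one deduplicated object
FROZEN_AFTER    = deduped_a.frozen?

# Deduplicating N freshly built Strings still allocates at least N objects,
# because each temporary exists before -@ can look it up.
N = 1_000
CHURN = allocations { N.times { -String.new("text/plain") } }

puts "distinct before -@: #{DISTINCT_BEFORE}"
puts "shared after -@:    #{SHARED_AFTER}"
puts "frozen after -@:    #{FROZEN_AFTER}"
puts "objects allocated for #{N} deduplications: #{CHURN}"
```

    The retained set stays small, yet the allocation counter keeps climbing, which is
    the startup churn described above.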
    
    All the stats below were collected using SamSaffron's `memory_profiler`:
    https://github.com/SamSaffron/memory_profiler
    
    First, a sanity-check to make sure I didn't break MRI < 3.0,
    using MRI 2.7 + latest official `Ox` from RubyGems (2.14.5):
    
      [okeeblow@emi#CHECKING-YOU-OUT] ruby -v
      ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
      [okeeblow@emi#CHECKING-YOU-OUT] ./bin/are-we-unallocated-yet|grep Total
      Total allocated: 20561351 bytes (401372 objects)
      Total retained:  1680971 bytes (22102 objects)
    
    versus same MRI 2.7 but with my patched `Ox`:
    
      [okeeblow@emi#CHECKING-YOU-OUT] gem install ../../ox/ox-2.14.5.gem
      [okeeblow@emi#CHECKING-YOU-OUT] ./bin/are-we-unallocated-yet|grep Total
      Total allocated: 20561111 bytes (401366 objects)
      Total retained:  1680971 bytes (22102 objects)
    
    Effectively no difference between the two (6 objects and 240 bytes out of roughly
    400k objects), showing that this patch is a no-op for pre-3.0.
    Sorry, I did not test older versions of MRI or other Rubies such as JRuby or TruffleRuby
    since I don't have them on my system.
    
    Now the real gainz can be seen when comparing MRI 3.0 with unpatched `Ox`:
    
      [okeeblow@emi#CHECKING-YOU-OUT] bundle install
      Fetching gem metadata from https://rubygems.org/......
      …
      Installing ox 2.14.5 with native extensions
      Using checking-you-out 0.7.0 from source at `.`
      Bundle complete! 11 Gemfile dependencies, 19 gems now installed.
      [okeeblow@emi#CHECKING-YOU-OUT] ./bin/are-we-unallocated-yet|grep Total
      Total allocated: 20081080 bytes (390133 objects)
      Total retained:  1713209 bytes (22095 objects)
    
    against same MRI 3.0 with my patched `Ox`:
    
      [okeeblow@emi#CHECKING-YOU-OUT] gem install ../../ox/ox-2.14.5.gem
      Building native extensions. This could take a while...
      Successfully installed ox-2.14.5
      Parsing documentation for ox-2.14.5
      unknown encoding name ""UTF-8"" for README.md, skipping
      Installing ri documentation for ox-2.14.5
      Done installing documentation for ox after 0 seconds
      1 gem installed
      [okeeblow@emi#CHECKING-YOU-OUT] ./bin/are-we-unallocated-yet|grep Total
      Total allocated: 17298804 bytes (322860 objects)
      Total retained:  1713202 bytes (22073 objects)
    
    against same MRI 3.0, my patched `Ox`, and `Ox.sax_parse` called with `intern_strings: true`:
    
      [okeeblow@emi#CHECKING-YOU-OUT] ./bin/are-we-unallocated-yet|grep Total
      Total allocated: 16414938 bytes (301382 objects)
      Total retained:  1713226 bytes (22073 objects)
    
    This shows that just having Ruby 3.0 available results in a huge win
    (390133 -> 322860 allocated objects, about 17% fewer), and opting into immutable
    `String`s from `Value#as_s` helps me even more (301382 objects, about 23% fewer)!
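
    The reductions can be checked from the MRI 3.0 totals reported above. This is
    only arithmetic on those `memory_profiler` figures; `percent_reduction` and the
    `RUNS` labels are hypothetical names for this sketch:

```ruby
# Totals copied from the three MRI 3.0 memory_profiler runs above.
RUNS = {
  "unpatched Ox"                => { bytes: 20_081_080, objects: 390_133 },
  "patched Ox"                  => { bytes: 17_298_804, objects: 322_860 },
  "patched Ox + intern_strings" => { bytes: 16_414_938, objects: 301_382 },
}.freeze

# Percentage saved going from `before` to `after`, rounded to one decimal.
def percent_reduction(before, after)
  ((before - after) * 100.0 / before).round(1)
end

baseline = RUNS.fetch("unpatched Ox")

RUNS.each do |label, totals|
  next if totals.equal?(baseline)

  objects = percent_reduction(baseline[:objects], totals[:objects])
  bytes   = percent_reduction(baseline[:bytes], totals[:bytes])
  puts "#{label}: #{objects}% fewer objects, #{bytes}% fewer bytes allocated"
end
```

    On these runs the patch alone saves 17.2% of allocated objects (13.9% of bytes),
    and enabling `intern_strings` brings that to 22.7% of objects (18.3% of bytes).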
    
    Unit tests pass:
    
      [okeeblow@emi#test] ruby tests.rb
      Loaded suite tests
      Started
      .................
      16 tests, 40 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
      100% passed
      .....
      18 tests, 42 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
      100% passed
      ...............................................................................................................................
      151 tests, 249 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
      100% passed
    
    Apologies for possible indentation issues in this patch. There seem to be lots of existing
    lines with a mix of tabs and spaces — nbd, it happens — so I tried to match the surrounding
    lines everywhere I made an addition to make sure things line up and look okay in `git diff`.
    
    I only use `Ox` for SAX parsing, not for marshalling, so please scrutinize my changes
    in `gen_load.c` and friends extra hard in case I omitted any `intern_strings` option
    checks that could result in an unwary user getting an immutable `String` where
    there wasn't one before. The one change I had to make to a `String#force_encoding`-using
    test case is an example of what I want to avoid surprising anyone with.
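
    A sketch of the kind of surprise I mean, in core Ruby only. The interned `String`
    here is built with `String#-@` as a stand-in for a parser return value; no `Ox`
    call is made and the sample text is an assumption for illustration:

```ruby
# Stand-in for a String returned with interning enabled (assumption:
# such a String is frozen, as the commit message describes).
INTERNED = -String.new("caf\u00E9")

# In-place mutators such as force_encoding raise on a frozen String.
RAISED = begin
  INTERNED.force_encoding(Encoding::BINARY)
  false
rescue FrozenError
  true
end

# String#+@ returns an unfrozen copy, leaving the shared object untouched.
MUTABLE = (+INTERNED).force_encoding(Encoding::BINARY)

puts "force_encoding on interned String raised: #{RAISED}"
puts "copy frozen?       #{MUTABLE.frozen?}"
puts "copy encoding:     #{MUTABLE.encoding}"
puts "original encoding: #{INTERNED.encoding}"
```

    Callers that need to mutate a result can take a copy with `+str` or `str.dup`;
    code that never opts in should keep getting mutable `String`s as before.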
    okeeblow committed Jul 20, 2021
    b0d9a8e