Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use "interned" (frozen and deduplicated) Strings in Ruby 3.0+ to minimize object allocations. #275

Open
okeeblow opened this issue Jul 20, 2021 · 3 comments

Comments

@okeeblow
Copy link
Contributor

I have a PR ready to go for this feature but this can be for discussion of the general concept.

okeeblow added a commit to okeeblow/ox that referenced this issue Jul 20, 2021
…cation.

Fixes ohler55#275

- Background info: https://en.wikipedia.org/wiki/String_interning
- Available to C extensions since MRI Ruby 3.0:
  - https://bugs.ruby-lang.org/issues/13381
  - https://bugs.ruby-lang.org/issues/16029

This change makes `Ox` use "interned" (frozen and deduplicated) `String`s anywhere possible,
the same effect as calling `String#-@` except without the heap churn of `RVALUE` allocation:
https://ruby-doc.org/core/String.html#method-i-2B-40

- Uses the `HAVE_<funcname>` preprocessor macros to detect this functionality
  and avoid breaking older Rubies (confirmed with at least MRI 2.7.2).
- I explicitly use interned `String`s anywhere an allocated `String` seemed
  entirely internal to `Ox`, e.g. in `sax_value_as_time` when allocating
  the `String` argument to `ox_time_class`.
- Adds a new user-facing option `intern_strings` to control the behavior
  of `String` return values from user-facing functions, e.g. from `sax_as_s`.
- Results in a ~25% reduction in startup GC pressure in my library! :D

Inspired by similar changes to `json` and `msgpack`:

- flori/json#451
- msgpack/msgpack-ruby#196

My goal is `Object` allocation reduction in the `Ox::Sax` handler of my MIME Type
file-identification library caused by the large number of duplicate `String`s in the
`shared-mime-info`-format XML it uses as a data source.

I was already calling `#-@` (or using the `-some_str_variable` syntax) everywhere possible
in my handler, but that only freezes and deduplicates an already-allocated `String`:

  irb(main):068:0> oid = proc { puts "#{_1} is object_id #{_1.object_id}" }
  => #<Proc:0x0000564d53eb6850 (irb):66>
  irb(main):069:0> lol = 'lol'.tap(&oid).-@.tap(&oid)
  lol is object_id 19800
  lol is object_id 19740
  => "lol"
  irb(main):070:0> -lol.tap(&oid).-@.tap(&oid)
  lol is object_id 19740
  lol is object_id 19740
  => "lol"

That avoids duplicate retained `String`s but still causes my library to take
a big GC hit right at startup when it loads its data.

All the stats below were collected using SamSaffron's `memory_profiler`:
https://github.com/SamSaffron/memory_profiler

First, a sanity-check to make sure I didn't break MRI < 3.0,
using MRI 2.7 + latest official `Ox` from RubyGems (2.14.5):

  [okeeblow@emi#CHECKING-YOU-OUT] ruby -v
  ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
  [okeeblow@emi#CHECKING-YOU-OUT] ./bin/are-we-unallocated-yet|grep Total
  Total allocated: 20561351 bytes (401372 objects)
  Total retained:  1680971 bytes (22102 objects)

versus same MRI 2.7 but with my patched `Ox`:

  [okeeblow@emi#CHECKING-YOU-OUT] gem install ../../ox/ox-2.14.5.gem
  [okeeblow@emi#CHECKING-YOU-OUT] ./bin/are-we-unallocated-yet|grep Total
  Total allocated: 20561111 bytes (401366 objects)
  Total retained:  1680971 bytes (22102 objects)

No difference between the two, showing that this patch is a no-op for pre-3.0.
Sorry I did not test older versions of MRI or other Rubies like J/Truffle/etc
since I don't have them on my system.

Now the real gainz can be seen when comparing MRI 3.0 with unpatched `Ox`:

  [okeeblow@emi#CHECKING-YOU-OUT] bundle install
  Fetching gem metadata from https://rubygems.org/......
  …
  Installing ox 2.14.5 with native extensions
  Using checking-you-out 0.7.0 from source at `.`
  Bundle complete! 11 Gemfile dependencies, 19 gems now installed.
  [okeeblow@emi#CHECKING-YOU-OUT] ./bin/are-we-unallocated-yet|grep Total
  Total allocated: 20081080 bytes (390133 objects)
  Total retained:  1713209 bytes (22095 objects)

against same MRI 3.0 with my patched `Ox`:

  [okeeblow@emi#CHECKING-YOU-OUT] gem install ../../ox/ox-2.14.5.gem
  Building native extensions. This could take a while...
  Successfully installed ox-2.14.5
  Parsing documentation for ox-2.14.5
  unknown encoding name ""UTF-8"" for README.md, skipping
  Installing ri documentation for ox-2.14.5
  Done installing documentation for ox after 0 seconds
  1 gem installed
  [okeeblow@emi#CHECKING-YOU-OUT] ./bin/are-we-unallocated-yet|grep Total
  Total allocated: 17298804 bytes (322860 objects)
  Total retained:  1713202 bytes (22073 objects)

against same MRI 3.0, my patched `Ox`, and `Ox.parse_sax(intern_strings: true)`:

  [okeeblow@emi#CHECKING-YOU-OUT] ./bin/are-we-unallocated-yet|grep Total
  Total allocated: 16414938 bytes (301382 objects)
  Total retained:  1713226 bytes (22073 objects)

This shows that just having Ruby 3.0 available results in a huge win,
and opting into immutable `String`s from `Value#as_s` helps me even more!

Unit tests pass:

  [okeeblow@emi#test] ruby tests.rb
  Loaded suite tests
  Started
  .................
  16 tests, 40 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
  100% passed
  .....
  18 tests, 42 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
  100% passed
  ...............................................................................................................................
  151 tests, 249 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
  100% passed

Apologies for possible indentation issues in this patch. There seem to be lots of existing
lines with a mix of tabs and spaces — nbd, it happens — so I tried to match the surrounding
lines everywhere I made an addition to make sure things line up and look okay in `git diff`.

I only use `Ox` for Sax parsing, not for marshalling, so please scrutinize my changes
in `gen_load.c` and friends extra hard in case I omitted any `intern_strings` option
checks that could result in an unwary user getting an immutable `String` where
there wasn't one before. The one change I had to make to a `String#force_encoding`-using
test case is an example of what I want to avoid surprising anyone with.
@okeeblow
Copy link
Contributor Author

For what it's worth Ox is already the fastest Ruby XML parser for my use case even without this change. My library wouldn't be feasible without it since it's the only one that can parse shared-mime-info faster than ruby-mime-types starts up, so thank you! https://github.com/okeeblow/DistorteD/blob/master/CHECKING-YOU-OUT/docs/XML.md#Benchmarks

@ohler55
Copy link
Owner

ohler55 commented Jul 20, 2021

I'll look over this tonight. Good to know about the latests string optimizations.

okeeblow added a commit to okeeblow/ox that referenced this issue Jul 26, 2021
…ax` example.

For ohler55#277

I've been experimenting with `Ractor`s since upgrading to Ruby 3.0 for ohler55#275
but quickly ran into `Ractor::UnsafeError` when trying to call `Ox::sax_parse` in anything except the main `Ractor`.

Per Ruby's `Ractor` C extension documentation (link below):
"By default, all C extensions are recognized as `Ractor`-unsafe. If C extension becomes `Ractor`-safe,
the extension should call `rb_ext_ractor_safe(true)` at the `Init_` function and all defined method marked as `Ractor`-safe.
`Ractor`-unsafe C-methods only been called from main-ractor. If non-main ractor calls it, then `Ractor::UnsafeError` is raised."

I don't like to open seemingly-large feature requests like this without making some attempt at it myself first,
and luckily it seems like `Ox::sax_parse` Just Works™ since I marked it `Ractor`-safe, even with the `class_cache`.
Confirming this safety, making any remaining changes to `Ox::Sax`, and expanding this to the non-`Sax` parts of `Ox`
are all unfortunately out of my depth as a n00b C coder, so I would appreciate if you could take this over if it interests you.
I am happy with just `Sax` support since I have no current need for marshalling,
but I imagine other `Ox` users wouldn't be satisfied if stratified.

In this commit:
- Enable `rb_ext_ractor_safe` preprocessor macro via `have_func` in `extconf.rb`.
- Mark `Init_Ox` and `ox_sax_parse` as `Ractor` -safe.
- Add a new `Ractor`-based `Ox::Sax` example exercising both parallel and serial `Ox::Sax` handler `Ractor`s
  to parse data from `shared-mime-info` XML files many users likely already have on their systems.

Official `Ractor` info:

- "Ractor: a proposal for a new concurrent abstraction without thread-safety issues": https://bugs.ruby-lang.org/issues/17100 (ruby/ruby#3365)
- Ruby's official `Ractor` documentation: https://docs.ruby-lang.org/en/master/doc/ractor_md.html
- "A way to mark C extensions as thread-safe, Ractor-safe, or unsafe": https://bugs.ruby-lang.org/issues/17307 (ruby/ruby#3824)
- Ruby's C Extension `Ractor` documention covering `rb_ext_ractor_safe`: https://docs.ruby-lang.org/en/master/doc/extension_rdoc.html#label-Appendix+F.+Ractor+support
- A `Ractor` C Extension from the creator of `Ractor` that might serve as a useful example: https://github.com/ko1/ractor-tvar

Blogs:

- "Ractors: Multi-Core Parallel Processing Comes to Ruby 3": https://www.ruby3.dev/ruby-3-fundamentals/2021/01/27/ractors-multi-core-parallel-processing-in-ruby-3/
- "Ruby Ractor Experiments: Safe async communication" :https://ivoanjo.me/blog/2021/02/14/ractor-experiments-safe-async/
- "Playing with Ruby Ractors": https://billy-ruffian.co.uk/playing-with-ruby-ractors/
- "How Fast are Ractors?": https://www.fastruby.io/blog/ruby/performance/how-fast-are-ractors.html (https://github.com/noahgibbs/ractor_basic_benchmarks/tree/main/benchmarks)

Before this change:

```
[okeeblow@emi#CHECKING-YOU-OUT] time ./bin/checking-you-out ~/224031-dot-jpg
Received /home/okeeblow/.local/share/mime/packages/user-extension-rsrc.xml
/home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival/mr_mime.rb:375:in `sax_parse': ractor unsafe method called from not main ractor (Ractor::UnsafeError)
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival/mr_mime.rb:375:in `block in open'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival/mr_mime.rb:331:in `open'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival/mr_mime.rb:331:in `open'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival.rb:24:in `block (2 levels) in remember_me'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival.rb:19:in `loop'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival.rb:19:in `block in remember_me'
<internal:ractor>:583:in `send': The incoming-port is already closed (Ractor::ClosedError)
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival.rb:86:in `block in extended'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival.rb:63:in `each'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival.rb:63:in `each_with_object'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival.rb:63:in `extended'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/inner_spirit.rb:216:in `extend'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/inner_spirit.rb:216:in `<top (required)>'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out.rb:10:in `require_relative'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out.rb:10:in `<top (required)>'
        from ./bin/checking-you-out:3:in `require_relative'
        from ./bin/checking-you-out:3:in `<main>'
./bin/checking-you-out ~/224031-dot-jpg  0.12s user 0.05s system 100% cpu 0.168 total
```

My new `Ox::Sax` Ractor example script's usage:

```
[okeeblow@emi#ox] ./examples/sax_ractor.rb
Please provide the path to a `shared-mime-info` XML package and some media-type query arguments (e.g. 'image/jpeg')
```

```
[okeeblow@emi#ox] ./examples/sax_ractor.rb /usr/share/mime/packages/freedesktop.org.xml
Please provide some media-type query arguments (e.g. 'image/jpeg')
```

Finding all-extant types:

```
[okeeblow@emi#ox] ./examples/sax_ractor.rb /usr/share/mime/packages/freedesktop.org.xml image/jpeg font/ttf application/xhtml+xml image/x-pict
Parallel Ractors
["Worker 0 gave us JPEG 影像 (image/jpeg) [.jpeg,.jpg,.jpe]",
 "Worker 1 gave us TrueType 字型 (font/ttf) [.ttf]",
 "Worker 2 gave us XHTML 網頁 (application/xhtml+xml) [.xhtml,.xht]",
 "Worker 3 gave us Macintosh Quickdraw/PICT 繪圖 (image/x-pict) [.pct,.pict,.pict1,.pict2]"]

Serial Ractor
"ONLY ONE OX gave us [#<CYO JPEG 影像 (image/jpeg) [.jpeg,.jpg,.jpe]>, #<CYO TrueType 字型 (font/ttf) [.ttf]>, #<CYO XHTML 網頁 (application/xhtml+xml) [.xhtml,.xht]>, #<CYO Macintosh Quickdraw/PICT 繪圖 (image/x-pict) [.pct,.pict,.pict1,.pict2]>]"
```

…and not finding invalid ones:

```
[okeeblow@emi#ox] ./examples/sax_ractor.rb /usr/share/mime/packages/freedesktop.org.xml lol/rofl fart/butt image/jpeg
Parallel Ractors
["Worker 0 gave us nothing",
 "Worker 1 gave us nothing",
 "Worker 2 gave us JPEG 影像 (image/jpeg) [.jpeg,.jpg,.jpe]"]

Serial Ractor
"ONLY ONE OX gave us [nil, nil, #<CYO JPEG 影像 (image/jpeg) [.jpeg,.jpg,.jpe]>]"
```

Unit tests pass.
ohler55 pushed a commit that referenced this issue Jul 27, 2021
…ax` example. (#278)

For #277

I've been experimenting with `Ractor`s since upgrading to Ruby 3.0 for #275
but quickly ran into `Ractor::UnsafeError` when trying to call `Ox::sax_parse` in anything except the main `Ractor`.

Per Ruby's `Ractor` C extension documentation (link below):
"By default, all C extensions are recognized as `Ractor`-unsafe. If C extension becomes `Ractor`-safe,
the extension should call `rb_ext_ractor_safe(true)` at the `Init_` function and all defined method marked as `Ractor`-safe.
`Ractor`-unsafe C-methods only been called from main-ractor. If non-main ractor calls it, then `Ractor::UnsafeError` is raised."

I don't like to open seemingly-large feature requests like this without making some attempt at it myself first,
and luckily it seems like `Ox::sax_parse` Just Works™ since I marked it `Ractor`-safe, even with the `class_cache`.
Confirming this safety, making any remaining changes to `Ox::Sax`, and expanding this to the non-`Sax` parts of `Ox`
are all unfortunately out of my depth as a n00b C coder, so I would appreciate if you could take this over if it interests you.
I am happy with just `Sax` support since I have no current need for marshalling,
but I imagine other `Ox` users wouldn't be satisfied if stratified.

In this commit:
- Enable `rb_ext_ractor_safe` preprocessor macro via `have_func` in `extconf.rb`.
- Mark `Init_Ox` and `ox_sax_parse` as `Ractor` -safe.
- Add a new `Ractor`-based `Ox::Sax` example exercising both parallel and serial `Ox::Sax` handler `Ractor`s
  to parse data from `shared-mime-info` XML files many users likely already have on their systems.

Official `Ractor` info:

- "Ractor: a proposal for a new concurrent abstraction without thread-safety issues": https://bugs.ruby-lang.org/issues/17100 (ruby/ruby#3365)
- Ruby's official `Ractor` documentation: https://docs.ruby-lang.org/en/master/doc/ractor_md.html
- "A way to mark C extensions as thread-safe, Ractor-safe, or unsafe": https://bugs.ruby-lang.org/issues/17307 (ruby/ruby#3824)
- Ruby's C Extension `Ractor` documention covering `rb_ext_ractor_safe`: https://docs.ruby-lang.org/en/master/doc/extension_rdoc.html#label-Appendix+F.+Ractor+support
- A `Ractor` C Extension from the creator of `Ractor` that might serve as a useful example: https://github.com/ko1/ractor-tvar

Blogs:

- "Ractors: Multi-Core Parallel Processing Comes to Ruby 3": https://www.ruby3.dev/ruby-3-fundamentals/2021/01/27/ractors-multi-core-parallel-processing-in-ruby-3/
- "Ruby Ractor Experiments: Safe async communication" :https://ivoanjo.me/blog/2021/02/14/ractor-experiments-safe-async/
- "Playing with Ruby Ractors": https://billy-ruffian.co.uk/playing-with-ruby-ractors/
- "How Fast are Ractors?": https://www.fastruby.io/blog/ruby/performance/how-fast-are-ractors.html (https://github.com/noahgibbs/ractor_basic_benchmarks/tree/main/benchmarks)

Before this change:

```
[okeeblow@emi#CHECKING-YOU-OUT] time ./bin/checking-you-out ~/224031-dot-jpg
Received /home/okeeblow/.local/share/mime/packages/user-extension-rsrc.xml
/home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival/mr_mime.rb:375:in `sax_parse': ractor unsafe method called from not main ractor (Ractor::UnsafeError)
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival/mr_mime.rb:375:in `block in open'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival/mr_mime.rb:331:in `open'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival/mr_mime.rb:331:in `open'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival.rb:24:in `block (2 levels) in remember_me'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival.rb:19:in `loop'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival.rb:19:in `block in remember_me'
<internal:ractor>:583:in `send': The incoming-port is already closed (Ractor::ClosedError)
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival.rb:86:in `block in extended'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival.rb:63:in `each'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival.rb:63:in `each_with_object'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/ghost_revival.rb:63:in `extended'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/inner_spirit.rb:216:in `extend'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out/inner_spirit.rb:216:in `<top (required)>'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out.rb:10:in `require_relative'
        from /home/okeeblow/Works/DistorteD/CHECKING-YOU-OUT/lib/checking-you-out.rb:10:in `<top (required)>'
        from ./bin/checking-you-out:3:in `require_relative'
        from ./bin/checking-you-out:3:in `<main>'
./bin/checking-you-out ~/224031-dot-jpg  0.12s user 0.05s system 100% cpu 0.168 total
```

My new `Ox::Sax` Ractor example script's usage:

```
[okeeblow@emi#ox] ./examples/sax_ractor.rb
Please provide the path to a `shared-mime-info` XML package and some media-type query arguments (e.g. 'image/jpeg')
```

```
[okeeblow@emi#ox] ./examples/sax_ractor.rb /usr/share/mime/packages/freedesktop.org.xml
Please provide some media-type query arguments (e.g. 'image/jpeg')
```

Finding all-extant types:

```
[okeeblow@emi#ox] ./examples/sax_ractor.rb /usr/share/mime/packages/freedesktop.org.xml image/jpeg font/ttf application/xhtml+xml image/x-pict
Parallel Ractors
["Worker 0 gave us JPEG 影像 (image/jpeg) [.jpeg,.jpg,.jpe]",
 "Worker 1 gave us TrueType 字型 (font/ttf) [.ttf]",
 "Worker 2 gave us XHTML 網頁 (application/xhtml+xml) [.xhtml,.xht]",
 "Worker 3 gave us Macintosh Quickdraw/PICT 繪圖 (image/x-pict) [.pct,.pict,.pict1,.pict2]"]

Serial Ractor
"ONLY ONE OX gave us [#<CYO JPEG 影像 (image/jpeg) [.jpeg,.jpg,.jpe]>, #<CYO TrueType 字型 (font/ttf) [.ttf]>, #<CYO XHTML 網頁 (application/xhtml+xml) [.xhtml,.xht]>, #<CYO Macintosh Quickdraw/PICT 繪圖 (image/x-pict) [.pct,.pict,.pict1,.pict2]>]"
```

…and not finding invalid ones:

```
[okeeblow@emi#ox] ./examples/sax_ractor.rb /usr/share/mime/packages/freedesktop.org.xml lol/rofl fart/butt image/jpeg
Parallel Ractors
["Worker 0 gave us nothing",
 "Worker 1 gave us nothing",
 "Worker 2 gave us JPEG 影像 (image/jpeg) [.jpeg,.jpg,.jpe]"]

Serial Ractor
"ONLY ONE OX gave us [nil, nil, #<CYO JPEG 影像 (image/jpeg) [.jpeg,.jpg,.jpe]>]"
```

Unit tests pass.
@ohler55
Copy link
Owner

ohler55 commented Apr 1, 2022

Can this be closed? Is it still relevant?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants