Use rb_enc_interned_str when available #196

casperisfine · 2020-11-30T10:59:41Z

I just got ruby/ruby#3786 merged, which mean in the upcoming Ruby 3.0, C-extensions will be able to return already interned strings with 0 allocations.

For now I use it only in msgpack_buffer_read_top_as_string because since we have access to the full C string, it's fairly easy to integrate. However to use it in read_raw_body_cont we'd need to do the string buffering in C, I'm open to try but I figured a simpler PR would be easier to review, and also that larger strings are less likely to benefit from this.

@tagomoris what do you think ?

For reference here's the json version of this PR: flori/json#451

cc @XrXr @tenderlove @rafaelfranca

casperisfine · 2020-11-30T11:02:15Z

ext/msgpack/unpacker.c

@@ -583,7 +525,7 @@ static int read_primitive(msgpack_unpacker_t* uk)
                READ_CAST_BLOCK_OR_RETURN_EOF(cb, uk, 4);
                uint32_t count = _msgpack_be32(cb->u32);
                if(count == 0) {
-                    return object_complete_string(uk, rb_str_buf_new(0));
+                    return object_complete(uk, rb_utf8_str_new_static("", 0));


Two things here.

Since I removed object_complete_string and object_complete_binary we have to use API that set the proper encoding directly.

And also rb_str_buf_new is meant for string that will be concatenated to, so they are instantiated with a capacity of 63 bytes (so 103bytes total):

#define STR_BUF_MIN_SIZE 63

Since here we only want an empty string, it's a bit of a waste.

I may be misunderstanding your comment. Do you mean:

the previous code using rb_str_buf_new instantiated 63 bytes larger value

the newer code will not consume additional 63 bytes
Is that correct?

The first one. rb_str_buf_new(0) will always instantiate a string with 63 bytes of capacity, so a total of 103 bytes.

The new code will directly right size the string, so saving up to 63 bytes.

casperisfine · 2020-11-30T11:06:34Z

ext/msgpack/buffer.h


+    result = rb_str_new(b->read_buffer, length);
+#ifdef COMPAT_HAVE_ENCODING


Not sure what the support matrix is, but if we could get rid of COMPAT_HAVE_ENCODING we could more easily directly instantiate the string with rb_utf8_str_new which would be preferable.

If this for Ruby 1.8?

Yes. Ruby 1.8 is of course out of the support list of msgpack-ruby.
I strongly believe that we can remove COMPAT_HAVE_ENCODING, but it should be out of this pull request.

I can try to do it if you'd like, we could then rebase this PR.

That is definitely helpful to me :D

casperisfine · 2021-01-07T09:17:56Z

@tagomoris any chance this PR could be considered?

tagomoris · 2021-01-08T02:07:14Z

I'll make sure to check this change this week 👀

casperisfine · 2021-01-08T08:12:40Z

Thank you!

tagomoris

Totally, this change looks good to me.

tagomoris · 2021-01-12T13:25:57Z

ext/msgpack/buffer.h


+    result = rb_str_new(b->read_buffer, length);
+#ifdef COMPAT_HAVE_ENCODING


tagomoris · 2021-01-12T14:02:31Z

ext/msgpack/unpacker.c

@@ -583,7 +525,7 @@ static int read_primitive(msgpack_unpacker_t* uk)
                READ_CAST_BLOCK_OR_RETURN_EOF(cb, uk, 4);
                uint32_t count = _msgpack_be32(cb->u32);
                if(count == 0) {
-                    return object_complete_string(uk, rb_str_buf_new(0));
+                    return object_complete(uk, rb_utf8_str_new_static("", 0));


I may be misunderstanding your comment. Do you mean:

the previous code using rb_str_buf_new instantiated 63 bytes larger value

the newer code will not consume additional 63 bytes
Is that correct?

casperisfine · 2021-01-12T14:26:31Z

this change looks good to me.

Awesome. I'll wait on #198 then, and rebase ?

tagomoris · 2021-01-12T14:34:35Z

Merged #198

tagomoris · 2021-01-12T14:35:00Z

Sorry, I'll do further operations (my) tomorrow.

casperisfine · 2021-01-12T14:36:08Z

Sorry, I'll do further operations (my) tomorrow.

No problem. Thanks for the reviews!

I just rebased the PR.

tagomoris

LGTM

tagomoris · 2021-01-13T00:11:18Z

Thank you!
My plan is to release this change with other changes about #192 and confirmation about M1 mac build, but please let me know if you want this change earlier.

casperisfine · 2021-01-13T08:06:24Z

but please let me know if you want this change earlier.

Thanks for asking. There's no big rush. I can point to the git repo anyway.

Thanks a lot for the merge!

…cation. Fixes ohler55#275 - Background info: https://en.wikipedia.org/wiki/String_interning - Available to C extensions since MRI Ruby 3.0: - https://bugs.ruby-lang.org/issues/13381 - https://bugs.ruby-lang.org/issues/16029 This change makes `Ox` use "interned" (frozen and deduplicated) `String`s anywhere possible, the same effect as calling `String#-@` except without the heap churn of `RVALUE` allocation: https://ruby-doc.org/core/String.html#method-i-2B-40 - Uses the `HAVE_<funcname>` preprocessor macros to detect this functionality and avoid breaking older Rubies (confirmed with at least MRI 2.7.2). - I explicitly use interned `String`s anywhere an allocated `String` seemed entirely internal to `Ox`, e.g. in `sax_value_as_time` when allocating the `String` argument to `ox_time_class`. - Adds a new user-facing option `intern_strings` to control the behavior of `String` return values from user-facing functions, e.g. from `sax_as_s`. - Results in a ~25% reduction in startup GC pressure in my library! :D Inspired by similar changes to `json` and `msgpack`: - flori/json#451 - msgpack/msgpack-ruby#196 My goal is `Object` allocation reduction in the `Ox::Sax` handler of my MIME Type file-identification library caused by the large number of duplicate `String`s in the `shared-mime-info`-format XML it uses as a data source. I was already calling `#-@` (or using the `-some_str_variable` syntax) everywhere possible in my handler, but that only freezes and deduplicates an already-allocated `String`: irb(main):068:0> oid = proc { puts "#{_1} is object_id #{_1.object_id}" } => #<Proc:0x0000564d53eb6850 (irb):66> irb(main):069:0> lol = 'lol'.tap(&oid).-@.tap(&oid) lol is object_id 19800 lol is object_id 19740 => "lol" irb(main):070:0> -lol.tap(&oid).-@.tap(&oid) lol is object_id 19740 lol is object_id 19740 => "lol" That avoids duplicate retained `String`s but still causes my library to take a big GC hit right at startup when it loads its data. All the stats below were collected using SamSaffron's `memory_profiler`: https://github.com/SamSaffron/memory_profiler First, a sanity-check to make sure I didn't break MRI < 3.0, using MRI 2.7 + latest official `Ox` from RubyGems (2.14.5): [okeeblow@emi#CHECKING-YOU-OUT] ruby -v ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux] [okeeblow@emi#CHECKING-YOU-OUT] ./bin/are-we-unallocated-yet|grep Total Total allocated: 20561351 bytes (401372 objects) Total retained: 1680971 bytes (22102 objects) versus same MRI 2.7 but with my patched `Ox`: [okeeblow@emi#CHECKING-YOU-OUT] gem install ../../ox/ox-2.14.5.gem [okeeblow@emi#CHECKING-YOU-OUT] ./bin/are-we-unallocated-yet|grep Total Total allocated: 20561111 bytes (401366 objects) Total retained: 1680971 bytes (22102 objects) No difference between the two, showing that this patch is a no-op for pre-3.0. Sorry I did not test older versions of MRI or other Rubies like J/Truffle/etc since I don't have them on my system. Now the real gainz can be seen when comparing MRI 3.0 with unpatched `Ox`: [okeeblow@emi#CHECKING-YOU-OUT] bundle install Fetching gem metadata from https://rubygems.org/...... … Installing ox 2.14.5 with native extensions Using checking-you-out 0.7.0 from source at `.` Bundle complete! 11 Gemfile dependencies, 19 gems now installed. [okeeblow@emi#CHECKING-YOU-OUT] ./bin/are-we-unallocated-yet|grep Total Total allocated: 20081080 bytes (390133 objects) Total retained: 1713209 bytes (22095 objects) against same MRI 3.0 with my patched `Ox`: [okeeblow@emi#CHECKING-YOU-OUT] gem install ../../ox/ox-2.14.5.gem Building native extensions. This could take a while... Successfully installed ox-2.14.5 Parsing documentation for ox-2.14.5 unknown encoding name ""UTF-8"" for README.md, skipping Installing ri documentation for ox-2.14.5 Done installing documentation for ox after 0 seconds 1 gem installed [okeeblow@emi#CHECKING-YOU-OUT] ./bin/are-we-unallocated-yet|grep Total Total allocated: 17298804 bytes (322860 objects) Total retained: 1713202 bytes (22073 objects) against same MRI 3.0, my patched `Ox`, and `Ox.parse_sax(intern_strings: true)`: [okeeblow@emi#CHECKING-YOU-OUT] ./bin/are-we-unallocated-yet|grep Total Total allocated: 16414938 bytes (301382 objects) Total retained: 1713226 bytes (22073 objects) This shows that just having Ruby 3.0 available results in a huge win, and opting into immutable `String`s from `Value#as_s` helps me even more! Unit tests pass: [okeeblow@emi#test] ruby tests.rb Loaded suite tests Started ................. 16 tests, 40 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications 100% passed ..... 18 tests, 42 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications 100% passed ............................................................................................................................... 151 tests, 249 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications 100% passed Apologies for possible indentation issues in this patch. There seem to be lots of existing lines with a mix of tabs and spaces — nbd, it happens — so I tried to match the surrounding lines everywhere I made an addition to make sure things line up and look okay in `git diff`. I only use `Ox` for Sax parsing, not for marshalling, so please scrutinize my changes in `gen_load.c` and friends extra hard in case I omitted any `intern_strings` option checks that could result in an unwary user getting an immutable `String` where there wasn't one before. The one change I had to make to a `String#force_encoding`-using test case is an example of what I want to avoid surprising anyone with.

casperisfine commented Nov 30, 2020

View reviewed changes

casperisfine force-pushed the use-rb_enc_interned_str branch from 483917c to 45afbbe Compare November 30, 2020 11:48

tagomoris approved these changes Jan 12, 2021

View reviewed changes

casperisfine mentioned this pull request Jan 12, 2021

Remove COMPAT_HAVE_ENCODING ifdef #198

Merged

Use rb_enc_interned_str when available

6b9915e

casperisfine force-pushed the use-rb_enc_interned_str branch from 45afbbe to 6b9915e Compare January 12, 2021 14:35

tagomoris approved these changes Jan 13, 2021

View reviewed changes

tagomoris merged commit 3469ab0 into msgpack:master Jan 13, 2021

casperisfine mentioned this pull request Jan 27, 2021

Set string encoding after reading a heap_mapped_string #201

Merged

marcotc mentioned this pull request Jan 28, 2021

Failing to install for Ruby <= 2.1 since 1.4.0 #205

Closed

okeeblow mentioned this pull request Jul 20, 2021

Use "interned" Strings when possible to avoid uneccessary heap allocation. ohler55/ox#276

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use rb_enc_interned_str when available #196

Use rb_enc_interned_str when available #196

casperisfine commented Nov 30, 2020

casperisfine Nov 30, 2020

tagomoris Jan 12, 2021

casperisfine Jan 12, 2021

casperisfine Nov 30, 2020

tagomoris Jan 12, 2021

tagomoris Jan 12, 2021

casperisfine Jan 12, 2021

tagomoris Jan 12, 2021

casperisfine commented Jan 7, 2021

tagomoris commented Jan 8, 2021

casperisfine commented Jan 8, 2021

tagomoris left a comment

tagomoris Jan 12, 2021

tagomoris Jan 12, 2021

casperisfine commented Jan 12, 2021

tagomoris commented Jan 12, 2021

tagomoris commented Jan 12, 2021

casperisfine commented Jan 12, 2021

tagomoris left a comment

tagomoris commented Jan 13, 2021

casperisfine commented Jan 13, 2021


		result = rb_str_new(b->read_buffer, length);
		#ifdef COMPAT_HAVE_ENCODING

Use rb_enc_interned_str when available #196

Use rb_enc_interned_str when available #196

Conversation

casperisfine commented Nov 30, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

casperisfine commented Jan 7, 2021

tagomoris commented Jan 8, 2021

casperisfine commented Jan 8, 2021

tagomoris left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

casperisfine commented Jan 12, 2021

tagomoris commented Jan 12, 2021

tagomoris commented Jan 12, 2021

casperisfine commented Jan 12, 2021

tagomoris left a comment

Choose a reason for hiding this comment

tagomoris commented Jan 13, 2021

casperisfine commented Jan 13, 2021