Make invalid Unicode data raise when encoding through Oj::Rails::Encoder #912

KJTsanaktsidis · 2024-02-01T11:28:00Z

This is a potential fix for #911. Currently, whether or not Oj::Rails::Encoder raises on invalid Unicode data depends on the value of ActiveSupport.escape_html_entities_in_json. To accurately mimic the behaviour of stock Rails with the stock json gem, it should raise an exception regardless of that setting.

So far I've deliberately duplicated the logic common to RailsEsc and RailsXEsc mode rather than sharing it, because I wasn't quite sure how to factor the similarities out. We can leave it like this, or I'm happy to take pointers on a better way to factor it.

I added a testcase for invalid Unicode to the Rails 6 & 7 encoding tests, and also parameterised the existing unicode-related tests to make sure they work correctly with both settings of ActiveSupport.escape_html_entities_in_json.

ohler55 · 2024-02-01T15:13:48Z

I'll have to spend a little time looking at the changes. It makes me uncomfortable that some of the existing tests were removed. Did you run any benchmarks to see what impact the change had on performance?

KJTsanaktsidis · 2024-02-01T17:25:35Z

I definitely shuffled some tests around but I didn’t mean to remove any. What did I delete? I’ll fix that for sure :)

Good call on benchmarking - I’ll put something together today.

ohler55 · 2024-02-01T17:31:37Z

Maybe it was the moving around part that made it appear as if some tests were removed. I will look more carefully.

ohler55 · 2024-02-11T16:47:00Z

Were you able to put together a benchmark and fix the clang formatting issues?

KJTsanaktsidis · 2024-02-12T10:55:36Z

Sorry, I haven’t gotten to that - I do plan to in the next couple of days!

These tests were not even loading Oj::Rails; they were definitely not actually testing the Oj rails shim.

ActiveSupport and the JSON gem raise an exception when trying to encode an object containing a string with byte sequences that are invalid for the string's encoding. Oj correctly raises if escape_html_entities_in_json is enabled, but if that's disabled, the invalid byte sequence is copied directly to the output. Use the same logic to validate Unicode in that case as well.
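
For illustration, the stock behaviour described above can be reproduced with Ruby's bundled json library alone. This is a minimal sketch, not part of the patch: try_generate is a hypothetical helper and the sample string is made up. It shows that a UTF-8-tagged string containing the byte 0xFF fails String#valid_encoding? and makes JSON.generate raise JSON::GeneratorError rather than copying the byte to the output.

```ruby
require "json"

# A string tagged as UTF-8 that contains 0xFF, a byte that never occurs
# in well-formed UTF-8. (Sample data, chosen only for this sketch.)
INVALID_UTF8 = "abc\xFF"

# Hypothetical helper: returns the generated JSON, or a short marker
# naming the exception class if the json library refuses the input.
def try_generate(obj)
  JSON.generate(obj)
rescue JSON::GeneratorError => e
  "raised #{e.class}"
end

puts INVALID_UTF8.valid_encoding?   # false
puts try_generate(["abc"])          # ["abc"]
puts try_generate([INVALID_UTF8])   # raised JSON::GeneratorError
```

The intent of this change is for Oj::Rails::Encoder to behave like the last line in both RailsEsc and RailsXEsc mode.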

KJTsanaktsidis · 2024-02-13T00:41:30Z

OK, I've fixed the clang-format problems, and I put together a benchmark: https://gist.github.com/KJTsanaktsidis/f85be084d61aca54f8493ab63fe0707f

Without this patch:

Calculating -------------------------------------
long_7bit_printable_string with RailsXEsc mode
                          2.133k (± 4.0%) i/s -     10.835k in   5.088119s
long_7bit_printable_string with RailsEsc mode
                          1.914k (± 3.3%) i/s -      9.750k in   5.099691s
long_ascii_string with RailsXEsc mode
                        163.294 (± 1.8%) i/s -    832.000 in   5.096756s
long_ascii_string with RailsEsc mode
                        175.841 (± 2.3%) i/s -    884.000 in   5.029591s
long_angle_bracket_string with RailsXEsc mode
                        275.981 (± 2.9%) i/s -      1.400k in   5.077551s
long_angle_bracket_string with RailsEsc mode
                          1.951k (± 3.3%) i/s -      9.945k in   5.103185s
long_utf8_multibyte_string with RailsXEsc mode
                        359.297 (± 2.2%) i/s -      1.800k in   5.012476s
long_utf8_multibyte_string with RailsEsc mode
                        654.515 (± 2.3%) i/s -      3.315k in   5.067395s

With this patch:

Calculating -------------------------------------
long_7bit_printable_string with RailsXEsc mode
                          2.120k (± 3.0%) i/s -     10.761k in   5.080555s
long_7bit_printable_string with RailsEsc mode
                          2.130k (± 3.0%) i/s -     10.812k in   5.081807s
long_ascii_string with RailsXEsc mode
                        169.397 (± 2.4%) i/s -    848.000 in   5.009189s
long_ascii_string with RailsEsc mode
                        179.139 (± 2.2%) i/s -    901.000 in   5.032772s
long_angle_bracket_string with RailsXEsc mode
                        284.548 (± 2.5%) i/s -      1.430k in   5.028596s
long_angle_bracket_string with RailsEsc mode
                          2.119k (± 3.1%) i/s -     10.750k in   5.079096s
long_utf8_multibyte_string with RailsXEsc mode
                        425.998 (± 2.3%) i/s -      2.142k in   5.031145s
long_utf8_multibyte_string with RailsEsc mode
                        426.126 (± 2.3%) i/s -      2.142k in   5.029241s

The only substantial difference is that the "lots of multibyte characters" case (long_utf8_multibyte_string) now runs at the same speed in RailsEsc and RailsXEsc mode with this patch (about 426 i/s in both), whereas before it was faster in RailsEsc mode (about 655 i/s, against about 359 i/s in RailsXEsc mode) because RailsEsc wasn't validating any of the characters. In workloads that aren't multibyte-heavy, performance seems the same.

ohler55 · 2024-02-13T23:40:53Z

Benchmarks look good. I'll start a more detailed review to get this merged.

test/activesupport6/encoding_test.rb

ext/oj/dump.c

ohler55 · 2024-02-14T00:20:08Z

LGTM other than one open question.

ohler55 · 2024-02-14T22:22:59Z

Thanks for the work. I know I was a little picky. Maybe too much so, sorry.

KJTsanaktsidis · 2024-02-15T04:11:03Z

> Thanks for the work. I know I was a little picky. Maybe too much so, sorry.

Not at all! Thanks for your attention on this.

My best idea is...
- Rename rails_xss_friendly_size to size_t required_buffer_size_for_escaped_string(const uint8_t *str, size_t len, const char *cmap, bool *has_hi_out). Essentially, make it accept the cmap to use as a parameter, and pass hi as an explicit out-param rather than smuggling it out via the sign bit.
- Call required_buffer_size_for_escaped_string in case RailsXEsc and case RailsEsc with different cmaps (rails_xss_friendly_chars vs rails_friendly_chars).
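
The proposal above could look roughly like the following. This is only a Ruby model of the idea, written to show the shape of the interface; the real change would be C code in ext/oj/dump.c. The tables produced by build_cmap are placeholders invented for this sketch and do not reproduce Oj's rails_xss_friendly_chars or rails_friendly_chars, and returning a [size, has_hi] pair stands in for the bool *has_hi_out out-param.

```ruby
# Builds a hypothetical 256-entry table: cmap[byte] is the number of
# output bytes needed for that input byte. These values are placeholders,
# not Oj's real tables.
def build_cmap(extra_escaped)
  Array.new(256) do |byte|
    if byte < 0x20 || extra_escaped.include?(byte)
      6 # written as a \uXXXX escape
    elsif byte == 0x22 || byte == 0x5C
      2 # double quote and backslash get a backslash prefix
    else
      1 # copied through unchanged
    end
  end
end

PLAIN_CMAP = build_cmap([])          # stand-in for the RailsEsc table
XSS_CMAP   = build_cmap("<>&".bytes) # stand-in for the RailsXEsc table

# One sizing routine shared by both modes. The table is a parameter, and the
# "saw a byte >= 0x80" flag is returned explicitly instead of being encoded
# in the sign of the size.
def required_buffer_size_for_escaped_string(str, cmap)
  size = 0
  has_hi = false
  str.each_byte do |byte|
    size += cmap[byte]
    has_hi = true if byte >= 0x80
  end
  [size, has_hi]
end

p required_buffer_size_for_escaped_string("a<b", PLAIN_CMAP)   # [3, false]
p required_buffer_size_for_escaped_string("a<b", XSS_CMAP)     # [8, false]
p required_buffer_size_for_escaped_string("\u00e9", XSS_CMAP)  # [2, true]
```

Passing the table in keeps a single loop for both modes, so the two case branches would differ only in which table they hand over.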

Do you want me to open another PR with those changes?

KJTsanaktsidis mentioned this pull request Feb 1, 2024

Rejecting invalid UTF-8 without opting in to XSS character escaping #911

Open

KJTsanaktsidis added 2 commits February 13, 2024 09:54

Actually run activesupport7 tests with Oj

742daa7

KJTsanaktsidis force-pushed the ktsanaktsidis/make_invalid_unicode_raise branch from ad29f92 to eb3febb Compare February 13, 2024 00:38

ohler55 reviewed Feb 13, 2024

test/activesupport6/encoding_test.rb

ohler55 reviewed Feb 14, 2024

ext/oj/dump.c

ohler55 approved these changes Feb 14, 2024

ohler55 merged commit 46b3d4d into ohler55:develop Feb 14, 2024
41 of 43 checks passed

KJTsanaktsidis deleted the ktsanaktsidis/make_invalid_unicode_raise branch February 15, 2024 04:12
