Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make invalid Unicode data raise when encoding through Oj::Rails::Encoder #912

Conversation

KJTsanaktsidis
Copy link
Contributor

This is a potential fix for #911. Currently, whether or not Oj::Rails::Encoder raises on invalid unicode data depends on the value of ActiveSupport.escape_html_entities_in_json. In order to accurately mimic the behaviour of stock Rails with the stock json gem, it should in fact raise an exception regardless.

I've so far deliberately copied rather than shared functionality that's shared between RailsEsc and RailsXEsc mode, because I wasn't quite sure how to factor the similarities out. We can leave it like this, or I'm happy to take pointers on a way to factor this down better.

I added a testcase for invalid Unicode to the Rails 6 & 7 encoding tests, and also parameterised the existing unicode-related tests to make sure they work correctly with both settings of ActiveSupport.escape_html_entities_in_json.

@ohler55
Copy link
Owner

ohler55 commented Feb 1, 2024

I'll have to spend a little time looking at the changes. It make me uncomfortable that some of the existing tests were removed. Did you run any benchmarks to see what impact on performance the change had?

@KJTsanaktsidis
Copy link
Contributor Author

I definitely shuffled some tests around but I didn’t mean to remove any. What did I delete? I’ll fix that for sure :)

good call on benchmarking - I’ll put something together today.

@ohler55
Copy link
Owner

ohler55 commented Feb 1, 2024

Maybe it was the moving around part that made it appear as if some tests were removed. I will look more carefully.

@ohler55
Copy link
Owner

ohler55 commented Feb 11, 2024

Were you able to put together a benchmark and fix the clang formatting issues?

@KJTsanaktsidis
Copy link
Contributor Author

Sorry, I haven’t gotten to that - I do plan to in the next couple of days!

These tests were not even loading Oj::Rails; they were definitely not
actually testing the Oj rails shim.
Activesupport & JSON gem will raise an exception when trying to an
encode an object containing a string with invalid byte sequences for the
string's encoding. Oj correctly raises if escaspe_html_entites_in_json
is enabled, but if that's disabled, the invalid byte sequence is copied
directly to the output.

Use the same logic to validate unicode in that case as well.
@KJTsanaktsidis KJTsanaktsidis force-pushed the ktsanaktsidis/make_invalid_unicode_raise branch from ad29f92 to eb3febb Compare February 13, 2024 00:38
@KJTsanaktsidis
Copy link
Contributor Author

OK, I've fixed the clang-format problems, and I put together a benchmark: https://gist.github.com/KJTsanaktsidis/f85be084d61aca54f8493ab63fe0707f

Without this patch:

Calculating -------------------------------------
long_7bit_printable_string with RailsXEsc mode
                          2.133k (± 4.0%) i/s -     10.835k in   5.088119s
long_7bit_printable_string with RailsEsc mode
                          1.914k (± 3.3%) i/s -      9.750k in   5.099691s
long_ascii_string with RailsXEsc mode
                        163.294 (± 1.8%) i/s -    832.000 in   5.096756s
long_ascii_string with RailsEsc mode
                        175.841 (± 2.3%) i/s -    884.000 in   5.029591s
long_angle_bracket_string with RailsXEsc mode
                        275.981 (± 2.9%) i/s -      1.400k in   5.077551s
long_angle_bracket_string with RailsEsc mode
                          1.951k (± 3.3%) i/s -      9.945k in   5.103185s
long_utf8_multibyte_string with RailsXEsc mode
                        359.297 (± 2.2%) i/s -      1.800k in   5.012476s
long_utf8_multibyte_string with RailsEsc mode
                        654.515 (± 2.3%) i/s -      3.315k in   5.067395s

With this patch:

Calculating -------------------------------------
long_7bit_printable_string with RailsXEsc mode
                          2.120k (± 3.0%) i/s -     10.761k in   5.080555s
long_7bit_printable_string with RailsEsc mode
                          2.130k (± 3.0%) i/s -     10.812k in   5.081807s
long_ascii_string with RailsXEsc mode
                        169.397 (± 2.4%) i/s -    848.000 in   5.009189s
long_ascii_string with RailsEsc mode
                        179.139 (± 2.2%) i/s -    901.000 in   5.032772s
long_angle_bracket_string with RailsXEsc mode
                        284.548 (± 2.5%) i/s -      1.430k in   5.028596s
long_angle_bracket_string with RailsEsc mode
                          2.119k (± 3.1%) i/s -     10.750k in   5.079096s
long_utf8_multibyte_string with RailsXEsc mode
                        425.998 (± 2.3%) i/s -      2.142k in   5.031145s
long_utf8_multibyte_string with RailsEsc mode
                        426.126 (± 2.3%) i/s -      2.142k in   5.029241s

The only substantial difference is that the "lots of multibyte characters" case with this patch now takes the same amount of time regardless of using RailsEsc mode or RailsXEsc mode, whereas before it was faster in RailsEsc mode (because it wasn't validating any of the characters). But in non-multibyte-heavy workloads it seems the same.

@ohler55
Copy link
Owner

ohler55 commented Feb 13, 2024

Benchmarks look good. I'll start a more detailed review to get this merged.

ext/oj/dump.c Show resolved Hide resolved
@ohler55
Copy link
Owner

ohler55 commented Feb 14, 2024

LGTM other than one open question.

@ohler55 ohler55 merged commit 46b3d4d into ohler55:develop Feb 14, 2024
41 of 43 checks passed
@ohler55
Copy link
Owner

ohler55 commented Feb 14, 2024

Thanks for the work. I know I was a little picky. Maybe too much so, sorry.

@KJTsanaktsidis
Copy link
Contributor Author

Thanks for the work. I know I was a little picky. Maybe too much so, sorry.

Not at all! Thanks for your attention on this.

My best idea is...
Rename rails_xss_friendly_size to size_t required_buffer_size_for_escaped_string(const uint8_t *str, size_t len, const char *cmap, bool *has_hi_out). Make it accept the cmap to use as a parameter, essentially, and pass hi as an explicit out-param rather than smuggling it out via the sign bit
Call required_buffer_size_for_escaped_string in case RailsXEsc and case RailsEsc with different cmaps (rails_xss_friendly_chars vs rails_friendly_chars)

Do you want me to open another PR with those changes?

@KJTsanaktsidis KJTsanaktsidis deleted the ktsanaktsidis/make_invalid_unicode_raise branch February 15, 2024 04:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants