Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: utf8 -> utf16 decoding bug on surrogate pairs #1486

Merged
merged 4 commits into from Oct 9, 2020

Conversation

turbio
Copy link
Contributor

@turbio turbio commented Sep 10, 2020

This fixes #1473

The custom utf8 -> utf16 decoder appears to be subtly flawed. From my reading it appears the chunking mechanism doesn't account for surrogate pairs at the end of a chunk causing variable size chunks. A larger chunk followed by a smaller chunk leaves behind garbage that'll be included in the latter chunk.

The chunking mechanism was added to prevent stack overflows when calling formCharCode with too many args. From some simple benchmarking it doesn't seem like putting utf16 code units in an array and spreading that into fromCharCode is helping performance much anyway. I've removed the chunking code which simplifies it significantly and fixes the particular issue we've encountered.

Here's a repro of the existing encoding bug in a fuzzing suite
https://repl.it/@turbio/oh-no-our-strings#decoder.js

turbio and others added 2 commits September 10, 2020 14:29
This fixes protobufjs#1473

The custom utf8 -> utf16 decoder appears to be subtly flawed. From my reading it appears the chunking mechanism doesn't account for surrogate pairs at the end of a chunk causing variable size chunks. A larger chunk followed by a smaller chunk leaves behind garbage that'll be included in the latter chunk.

It looks like the chunking mechanism was added to prevent stack overflows when calling `formCharCode` with too many args. From some benchmarking it appears putting utf16 code units in an array and spreading that into `fromCharCode` wasn't helping performance much anyway. I simplified it significantly.

Here's a repro of the existing encoding bug in a fuzzing suite
https://repl.it/@turbio/oh-no-our-strings#decoder.js
chunk[i++] = t;
else if (t > 191 && t < 224)
chunk[i++] = (t & 31) << 6 | buffer[start++] & 63;
else if (t > 239 && t < 365) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of this function is that the passed in buffer is always an array of uint8s, each byte being a utf8 code unit. The condition t < 365 is a bit odd under this assumption, from profiling it actually forces v8 to deopt when t is compared to a non uint8 value. Alternatively I may have grossly misunderstood how this is expected to work.

@alexander-fenster
Copy link
Contributor

Hi @turbio,

Would it be possible to have a test that shows the problem included in the PR?

Thank you!

@turbio
Copy link
Contributor Author

turbio commented Sep 22, 2020

@alexander-fenster yup!

added a test which fails on the old utf8_read function.

Copy link
Contributor

@alexander-fenster alexander-fenster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @turbio!

@alexander-fenster alexander-fenster merged commit 75172cd into protobufjs:master Oct 9, 2020
@alexander-fenster alexander-fenster changed the title fix utf8 -> utf16 decoding bug on surrogate pairs fix: utf8 -> utf16 decoding bug on surrogate pairs Oct 9, 2020
masad-frost added a commit to replit/crosis that referenced this pull request Dec 29, 2020
protobufjs/protobuf.js#1486 released in protobufjs 6.10.2
and updated in @replit/protocol in replit/protocol#9 and released in 0.2.15
This was referenced May 20, 2022
@github-actions github-actions bot mentioned this pull request Jul 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

utf8.read function producing wrong strings
2 participants