fix: utf8 -> utf16 decoding bug on surrogate pairs #1486

turbio · 2020-09-10T21:33:30Z

This fixes #1473

The custom utf8 -> utf16 decoder appears to be subtly flawed. From my reading, the chunking mechanism doesn't account for a surrogate pair landing at the end of a chunk, which makes chunk sizes variable. A larger chunk followed by a smaller chunk leaves behind garbage that gets included in the latter chunk.
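To make the failure mode concrete, here is a minimal sketch of a reused chunk array being flushed without trimming it to the number of units actually written. It is a miniature for illustration only, not the protobuf.js source: `CHUNK_LIMIT`, `toGroups`, and `joinChunks` are hypothetical names, and the limit is shrunk to 4 so the effect shows up on a short string.

```javascript
// Hypothetical miniature of the failure mode; none of these names come from protobuf.js.
var CHUNK_LIMIT = 4;

// Split a string into per-character groups of UTF-16 code units:
// one unit for a BMP character, two for a surrogate pair.
function toGroups(str) {
    return Array.from(str, function (ch) {
        return ch.split("").map(function (c) { return c.charCodeAt(0); });
    });
}

// Join the groups chunk by chunk, reusing one array for every chunk.
// With trimBeforeFlush === false the whole array is flushed, stale tail included.
function joinChunks(groups, trimBeforeFlush) {
    var chunk = [], i = 0, parts = [];
    groups.forEach(function (units) {
        units.forEach(function (u) { chunk[i++] = u; });
        if (i >= CHUNK_LIMIT) {
            var full = trimBeforeFlush ? chunk.slice(0, i) : chunk;
            parts.push(String.fromCharCode.apply(String, full));
            i = 0; // the array keeps its old length and contents
        }
    });
    if (i) {
        parts.push(String.fromCharCode.apply(String, chunk.slice(0, i)));
    }
    return parts.join("");
}

var input = "abc\uD83D\uDE00defgh"; // the surrogate pair straddles the chunk limit
console.log(joinChunks(toGroups(input), true) === input);  // true
console.log(joinChunks(toGroups(input), false) === input); // false: a stale \uDE00 follows "defg"
```

The first chunk grows to five units because the pair is written after the third unit, and the second chunk is flushed at four units while the array still holds the fifth unit from the previous round.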

The chunking mechanism was added to prevent stack overflows when calling `fromCharCode` with too many args. From some simple benchmarking, putting utf16 code units in an array and spreading that into `fromCharCode` doesn't seem to help performance much anyway. I've removed the chunking code, which simplifies the decoder significantly and fixes the particular issue we've encountered.
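For readers who want the shape of a chunk-free decoder, the sketch below appends each decoded character directly to the result string, so there is no shared array that can carry stale units between flushes. It is an illustration written for this discussion, not the merged diff: the name `utf8ReadSketch` is hypothetical, and it assumes `buffer` is a `Uint8Array` (or array of bytes) holding well-formed UTF-8.

```javascript
// Hypothetical sketch of a decoder without chunking; assumes well-formed UTF-8 input.
function utf8ReadSketch(buffer, start, end) {
    var str = "";
    var i = start;
    while (i < end) {
        var t = buffer[i++];
        if (t < 0x80) {
            // 1-byte sequence (ASCII)
            str += String.fromCharCode(t);
        } else if (t >= 0xC0 && t < 0xE0) {
            // 2-byte sequence
            str += String.fromCharCode((t & 0x1F) << 6 | buffer[i++] & 0x3F);
        } else if (t >= 0xE0 && t < 0xF0) {
            // 3-byte sequence
            str += String.fromCharCode((t & 0x0F) << 12 | (buffer[i++] & 0x3F) << 6 | buffer[i++] & 0x3F);
        } else if (t >= 0xF0) {
            // 4-byte sequence: becomes a surrogate pair in UTF-16
            var cp = ((t & 0x07) << 18 | (buffer[i++] & 0x3F) << 12 | (buffer[i++] & 0x3F) << 6 | buffer[i++] & 0x3F) - 0x10000;
            str += String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
        }
        // Stray continuation bytes (0x80..0xBF) are skipped in this sketch.
    }
    return str;
}

// "a", "é", "€", and U+1F600 encoded as UTF-8
var bytes = new Uint8Array([0x61, 0xC3, 0xA9, 0xE2, 0x82, 0xAC, 0xF0, 0x9F, 0x98, 0x80]);
console.log(utf8ReadSketch(bytes, 0, bytes.length) === "a\u00E9\u20AC\uD83D\uDE00"); // true
```

Each `fromCharCode` call here receives at most two arguments, so the argument-count limit that motivated chunking never comes into play.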

Here's a repro of the existing encoding bug in a fuzzing suite
https://repl.it/@turbio/oh-no-our-strings#decoder.js

turbio · 2020-09-10T22:26:36Z

lib/utf8/index.js

-            chunk[i++] = t;
-        else if (t > 191 && t < 224)
-            chunk[i++] = (t & 31) << 6 | buffer[start++] & 63;
-        else if (t > 239 && t < 365) {


My understanding of this function is that the passed-in buffer is always an array of uint8s, each byte being a utf8 code unit. Under this assumption the condition `t < 365` is a bit odd, since a uint8 can never exceed 255. From profiling, it actually forces v8 to deopt when `t` is compared to a non-uint8 value. Alternatively, I may have grossly misunderstood how this is expected to work.
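A quick way to check the redundancy, assuming the buffer really does hold only uint8 values: enumerate all 256 possible bytes and compare the bounded test with the unbounded one. The helper names `sequenceLength` and `upperBoundIsRedundant` are hypothetical and exist only for this sketch.

```javascript
// Hypothetical helper: number of bytes in the UTF-8 sequence that starts with lead byte t.
// Returns 0 for a continuation byte (0x80..0xBF), which cannot start a sequence.
function sequenceLength(t) {
    if (t < 0x80) return 1;
    if (t >= 0xC0 && t < 0xE0) return 2;
    if (t >= 0xE0 && t < 0xF0) return 3;
    if (t >= 0xF0) return 4; // no upper bound needed for a uint8
    return 0;
}

// True when `t > 239 && t < 365` and `t > 239` agree for every possible byte value.
function upperBoundIsRedundant() {
    for (var t = 0; t < 256; t++) {
        if ((t > 239 && t < 365) !== (t > 239)) {
            return false;
        }
    }
    return true;
}

console.log(upperBoundIsRedundant()); // true
```

This only shows the two conditions are logically equivalent on byte input; whether dropping the bound changes how an engine optimizes the loop is a separate question that needs profiling.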

alexander-fenster · 2020-09-18T22:48:27Z

Hi @turbio,

Would it be possible to have a test that shows the problem included in the PR?

Thank you!

turbio · 2020-09-22T21:15:52Z

@alexander-fenster yup!

added a test which fails on the old utf8_read function.

alexander-fenster

Thanks @turbio!

turbio and others added 2 commits September 10, 2020 14:29

fix lint

696acac

add test case for surrogate pair bug

de742f2

Merge branch 'master' into patch-1

8b58788

alexander-fenster approved these changes Oct 9, 2020

alexander-fenster merged commit 75172cd into protobufjs:master Oct 9, 2020

alexander-fenster changed the title ~~fix utf8 -> utf16 decoding bug on surrogate pairs~~ fix: utf8 -> utf16 decoding bug on surrogate pairs Oct 9, 2020

masad-frost added a commit to replit/crosis that referenced this pull request Dec 29, 2020

No need to monkey patch protobufjs anymore

cf1eb5d

protobufjs/protobuf.js#1486 released in protobufjs 6.10.2 and updated in @replit/protocol in replit/protocol#9 and released in 0.2.15

daviderenger mentioned this pull request Mar 1, 2021

utf8.read function producing wrong strings #1473

Closed

jcready mentioned this pull request Oct 30, 2021

Use TextDecoder API for decoding UTF-8 from binary data timostamm/protobuf-ts#184

Closed

This was referenced May 20, 2022

chore(6.x): release 6.11.0 #1736

Closed

chore(6.x): release 6.12.0 #1738

Closed

sync-by-unito bot mentioned this pull request Jun 2, 2022

Bump protobufjs from 6.10.1 to 6.11.3 privacyresearchgroup/libsignal-protocol-typescript#77

Open

github-actions bot mentioned this pull request Jul 8, 2022

chore: release master #1771

Merged
