
Provide an option to replace lone surrogates in strings with replacement characters #827

Closed
lucacasonato opened this issue Nov 23, 2021 · 10 comments

Comments

@lucacasonato
Contributor

This is a continuation of #495.

JavaScript engines universally agree that lone surrogates are valid inside JSON strings (V8: https://bugs.chromium.org/p/v8/issues/detail?id=11193, SpiderMonkey: https://bugzilla.mozilla.org/show_bug.cgi?id=1496747, JSC: try JSON.stringify('\u{1f3b5}'[0]) in the Safari console). All of their JSON parsers accept lone surrogates (try JSON.stringify(JSON.parse(JSON.stringify('\u{1f3b5}'[0]))) in their respective DevTools consoles).

serde_json always errors on lone surrogates right now. This is understandable given that JS represents its strings as WTF-16, while in Rust strings are always valid UTF-8. The unfortunate reality, though, is that JS engines produce JSON text containing these lone surrogates, and these JSON messages sometimes need to be parsed. As serde_json aims to bridge the gap between JSON and Rust, it would be great if it provided a way to decode strings with lone surrogates, replacing the surrogates with the replacement char instead of erroring.
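
For context, a minimal reproduction of the current behaviour (a sketch assuming a recent serde_json, with the lone surrogate written the way a JS engine would escape it):

fn main() {
    // JSON.stringify('\u{1f3b5}'[0]) in a JS engine yields the escaped lone
    // surrogate "\ud83c".
    let json = r#""\ud83c""#;

    // serde_json currently refuses to decode this into a String and reports
    // an error about the lone surrogate instead. The proposed lossy mode
    // would instead produce "\u{FFFD}".
    let res = serde_json::from_str::<String>(json);
    assert!(res.is_err());
    println!("{}", res.unwrap_err());
}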

What about something along the lines of serde_json::from_str_lossy?

@dtolnay
Member

dtolnay commented Nov 23, 2021

Those inputs are already supported by deserializing to byte strings. You can't put the deserialization result directly into a String because of the UTF-8 invariant, but you can deserialize to byte strings and perform the lossy conversion yourself if you need it.

fn main() {
    // The first byte of the emoji's UTF-8 encoding (0xF0), which is not
    // valid UTF-8 on its own, between two JSON quote characters.
    let bytes = [b'"', "\u{1f3b5}".as_bytes()[0], b'"'];
    assert!(serde_json::from_slice::<String>(&bytes).is_err());
    let v = serde_json::from_slice::<serde_bytes::ByteBuf>(&bytes).unwrap();
    println!("{:?}", v);
}

@dtolnay dtolnay closed this as completed Nov 23, 2021
@lucacasonato
Contributor Author

Oh, that's very interesting. I was not aware strings could be deserialized into byte buffers. Unfortunately I don't think this solves the problem though, as the embedder now needs to handle escape sequences manually, essentially maintaining a copy of the serde_json code that does this.

Your example also doesn't entirely match what a JS engine will emit: engines emit lone surrogates and other malformed code points as escape sequences in the JSON, rather than as the code points themselves (https://github.com/tc39/proposal-well-formed-stringify). For example, JSON.stringify('\u{1f3b5}'[0]) will return the string "\ud83c". A more accurate Rust example, which showcases the issue well, would be:

fn main() {
    // What JSON.stringify('\u{1f3b5}'[0]) actually emits: an escaped lone surrogate.
    let input = "\"\\ud83c\"";
    assert!(serde_json::from_str::<String>(input).is_err());
    let v = serde_json::from_str::<serde_bytes::ByteBuf>(input).unwrap();
    println!("{:?}", v);
}

An implementation inside of serde_json would be great, as it has all the infrastructure and code necessary to reliably and accurately parse out the string escape sequences. Alternatively it would be great if this code was exposed, with a flag to switch the behaviour that occurs when lone surrogates are encountered (error or replace).

@dtolnay
Member

dtolnay commented Nov 23, 2021

the embedder now needs to handle escape sequences manually, essentially maintaining a copy of the serde_json code that does this

I don't think this is accurate, or maybe I don't understand what you mean. In println!("{:?}", serde_json::from_str::<serde_bytes::ByteBuf>("\"\\u0001\\n\"")) you can see those escape sequences are still expanded by serde_json during deserialization.

For "\"\\ud83c\"", I would accept a PR to make this decode to [237, 160, 188] as a byte string.
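
For illustration, a sketch of the behaviour such a PR would produce (assuming serde_bytes for the byte-buffer type, as in the examples above):

fn main() {
    // With the proposed change, the escaped lone surrogate no longer errors
    // and instead decodes to the three bytes of its UTF-8-style encoding.
    let v = serde_json::from_str::<serde_bytes::ByteBuf>("\"\\ud83c\"").unwrap();
    assert_eq!(&v[..], &[237u8, 160, 188][..]);
    println!("{:?}", v); // [237, 160, 188]
}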

@lucacasonato
Contributor Author

I don't think this is accurate, or maybe I don't understand what you mean. In println!("{:?}", serde_json::from_str::<serde_bytes::ByteBuf>("\"\\u0001\\n\"")) you can see those escape sequences are still expanded by serde_json during deserialization.

Ah, my bad. I had misread the source code.

For "\"\\ud83c\"", I would accept a PR to make this decode to [237, 160, 188] as a byte string.

I'll try to throw up a PR later today.

@lucacasonato
Contributor Author

@dtolnay I am not super familiar with all of this. Could you elaborate on how one would get from the escape sequence \ud83c to the three bytes [237, 160, 188]? The escape sequence encodes the bytes [216, 60]. How do I get from those two bytes to the three bytes you mentioned?

Sorry if this is a really dumb question!

@lucacasonato
Contributor Author

Oh, never mind, figured it out! [237, 160, 188] is the result of applying the UTF-8 encoding scheme to the code point U+D83C.
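
For anyone following along, here is the arithmetic spelled out (a sketch; the bit pattern is just the standard three-byte UTF-8 layout applied to a code point that strict UTF-8 would normally reject, since it is a surrogate):

fn main() {
    // U+D83C falls in the surrogate range, so strict UTF-8 forbids it, but
    // applying the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx anyway:
    let cp: u32 = 0xD83C;
    let bytes = [
        0b1110_0000 | ((cp >> 12) & 0x0F) as u8, // 0xED = 237
        0b1000_0000 | ((cp >> 6) & 0x3F) as u8,  // 0xA0 = 160
        0b1000_0000 | (cp & 0x3F) as u8,         // 0xBC = 188
    ];
    assert_eq!(bytes, [237, 160, 188]);
}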

@lucacasonato
Contributor Author

lucacasonato commented Nov 25, 2021

Thanks for landing that, @dtolnay. This only gets me about halfway there though, as contrary to what I had thought, RawValue cannot contain values with escaped lone surrogates.

Test case:

#[cfg(feature = "raw_value")]
#[test]
fn test_raw_de_lone_surrogate() {
    use serde_json::value::RawValue;

    let res = from_str::<Box<RawValue>>(r#""\ud83c""#);

    assert!(res.is_ok());
}

I was under the impression that RawValue would preserve the input text, and so would not try to decode escape sequences.

Is this something you think could be changed? If so, what would be the best way to go about it? I'd be happy to create another PR.

@lucacasonato
Contributor Author

Ah - ignore_escape is the culprit. I guess this should be changed to also accept lone surrogates.

@lucacasonato
Contributor Author

I wonder if it should just be changed to not validate the \u escape values at all. That will be done by the "real" parse later. During ignore_escape we are just trying to consume tokens so we can find the end of the string; whether the values here are valid is not really relevant at this stage, I think.

@lucacasonato
Contributor Author

After initial integration, back with some feedback from the real world: I think deserializing \ud83c to [237, 160, 188] is conceptually an interesting idea, but it is fundamentally flawed. A UTF-8 parser in "lossy" (replacement) mode will decode [237, 160, 188] into three replacement chars, not one.
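
A quick demonstration of that point (a sketch; Rust's lossy conversion replaces each byte of the invalid sequence):

fn main() {
    // The three bytes the earlier patch produces for the lone surrogate U+D83C.
    let bytes = [237u8, 160, 188];
    // Lossy UTF-8 decoding turns each invalid byte into its own U+FFFD,
    // so the single surrogate becomes three replacement characters.
    let lossy = String::from_utf8_lossy(&bytes);
    assert_eq!(lossy, "\u{FFFD}\u{FFFD}\u{FFFD}");
    println!("{}", lossy);
}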

No self-respecting UTF-8 encoder will encode U+D83C as [237, 160, 188]. It will either encode it as [239, 191, 189] (the encoded form of the replacement char) or error.

I think my earlier patch that deserializes \ud83c into a byte buffer of [237, 160, 188] should be reverted, as I think it is more wrong than correct.

Instead, the proper solution would be to introduce a LossyString type that replaces lone surrogates with the replacement char. With #830 this could be implemented outside of serde_json with a little bit of copied code (probably fine). Ideally that new type would live inside serde_json, though.
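
To make the idea concrete, here is a rough sketch of such a LossyString implemented outside serde_json (hypothetical type and helper names; it assumes the byte-string decoding discussed above, where an escaped lone surrogate comes through as its three-byte encoding):

use serde::{Deserialize, Deserializer};

// Hypothetical LossyString newtype: a string in which lone surrogates from
// the JSON input become U+FFFD instead of causing a deserialization error.
#[derive(Debug)]
struct LossyString(String);

impl<'de> Deserialize<'de> for LossyString {
    fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
        // Deserialize the JSON string as raw bytes first, then convert.
        let bytes = serde_bytes::ByteBuf::deserialize(deserializer)?;
        Ok(LossyString(replace_lone_surrogates(&bytes)))
    }
}

fn replace_lone_surrogates(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    loop {
        match std::str::from_utf8(bytes) {
            Ok(s) => {
                out.push_str(s);
                return out;
            }
            Err(e) => {
                // Keep the valid prefix, then emit one U+FFFD for the
                // invalid sequence that follows it.
                let valid = e.valid_up_to();
                out.push_str(std::str::from_utf8(&bytes[..valid]).unwrap());
                bytes = &bytes[valid..];
                out.push('\u{FFFD}');
                // A surrogate code point encodes as ED A0..BF 80..BF, so the
                // whole three-byte sequence collapses into that single U+FFFD.
                if bytes.len() >= 3
                    && bytes[0] == 0xED
                    && (0xA0..=0xBF).contains(&bytes[1])
                    && (0x80..=0xBF).contains(&bytes[2])
                {
                    bytes = &bytes[3..];
                } else {
                    bytes = &bytes[e.error_len().unwrap_or(bytes.len())..];
                }
            }
        }
    }
}

fn main() {
    let v: LossyString = serde_json::from_str(r#""music: \ud83c""#).unwrap();
    assert_eq!(v.0, "music: \u{FFFD}");
    println!("{:?}", v);
}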
