
Provide an option to replace lone surrogates in strings with replacement characters #827

Closed
lucacasonato opened this issue Nov 23, 2021 · 10 comments

Comments

@lucacasonato
Contributor

This is a continuation of #495.

JavaScript engines universally agree that lone surrogates are valid inside JSON strings (V8: https://bugs.chromium.org/p/v8/issues/detail?id=11193, SpiderMonkey: https://bugzilla.mozilla.org/show_bug.cgi?id=1496747, JSC: try JSON.stringify('\u{1f3b5}'[0]) in the Safari console). All of their JSON parsers accept lone surrogates (try JSON.stringify(JSON.parse(JSON.stringify('\u{1f3b5}'[0]))) in their respective DevTools consoles).

serde_json always errors on lone surrogates right now. This is understandable given that JS represents its strings as WTF-16, while in Rust strings are always valid UTF-8. The unfortunate reality, though, is that JS engines produce JSON text containing these lone surrogates, and these JSON messages sometimes need to be parsed. As serde_json aims to bridge the gap between JSON and Rust, it would be great if it provided a way to decode strings with lone surrogates, replacing the surrogates with the replacement char instead of erroring.
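
For context, a minimal reproduction of the current behaviour (a sketch assuming a recent serde_json, with the lone surrogate written the way a JS engine would escape it):

fn main() {
    // JSON.stringify('\u{1f3b5}'[0]) in a JS engine yields the escaped lone
    // surrogate "\ud83c".
    let json = r#""\ud83c""#;

    // serde_json currently refuses to decode this into a String and reports
    // an error about the lone surrogate instead. The proposed lossy mode
    // would instead produce "\u{FFFD}".
    let res = serde_json::from_str::<String>(json);
    assert!(res.is_err());
    println!("{}", res.unwrap_err());
}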

What about something along the lines of serde_json::from_str_lossy?

@dtolnay
Member

dtolnay commented Nov 23, 2021

Those inputs are already supported by deserializing to byte strings. You can't put the deserialization result directly into a String because of the UTF-8 invariant, but you can deserialize to byte strings and perform the lossy conversion yourself if you need it.

fn main() {
    // The first byte of the emoji's UTF-8 encoding (0xF0), which is not
    // valid UTF-8 on its own, between two JSON quote characters.
    let bytes = [b'"', "\u{1f3b5}".as_bytes()[0], b'"'];
    assert!(serde_json::from_slice::<String>(&bytes).is_err());
    let v = serde_json::from_slice::<serde_bytes::ByteBuf>(&bytes).unwrap();
    println!("{:?}", v);
}

@dtolnay dtolnay closed this as completed Nov 23, 2021
@lucacasonato
Contributor Author

Oh, that's very interesting. I was not aware strings could be deserialized into byte buffers. Unfortunately I don't think this solves the problem though, as the embedder now needs to handle escape sequences manually, essentially maintaining a copy of the serde_json code that does this.

Your example also doesn't entirely match what a JS engine will emit: engines emit lone surrogates and other malformed code points as escape sequences in the JSON, rather than as the code points themselves (https://github.com/tc39/proposal-well-formed-stringify). For example, JSON.stringify('\u{1f3b5}'[0]) will return the string "\ud83c". A more accurate Rust example, which showcases the issue well, would be:

fn main() {
    // What JSON.stringify('\u{1f3b5}'[0]) actually emits: an escaped lone surrogate.
    let input = "\"\\ud83c\"";
    assert!(serde_json::from_str::<String>(input).is_err());
    let v = serde_json::from_str::<serde_bytes::ByteBuf>(input).unwrap();
    println!("{:?}", v);
}

An implementation inside of serde_json would be great, as it has all the infrastructure and code necessary to reliably and accurately parse out the string escape sequences. Alternatively it would be great if this code was exposed, with a flag to switch the behaviour that occurs when lone surrogates are encountered (error or replace).

@dtolnay
Member

dtolnay commented Nov 23, 2021

the embedder now needs to handle escape sequences manually, essentially maintaining a copy of the serde_json code that does this

I don't think this is accurate, or maybe I don't understand what you mean. In println!("{:?}", serde_json::from_str::<serde_bytes::ByteBuf>("\"\\u0001\\n\"")) you can see those escape sequences are still expanded by serde_json during deserialization.

For "\"\\ud83c\"", I would accept a PR to make this decode to [237, 160, 188] as a byte string.
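
For illustration, a sketch of the behaviour such a PR would produce (assuming serde_bytes for the byte-buffer type, as in the examples above):

fn main() {
    // With the proposed change, the escaped lone surrogate no longer errors
    // and instead decodes to the three bytes of its UTF-8-style encoding.
    let v = serde_json::from_str::<serde_bytes::ByteBuf>("\"\\ud83c\"").unwrap();
    assert_eq!(&v[..], &[237u8, 160, 188][..]);
    println!("{:?}", v); // [237, 160, 188]
}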

@lucacasonato
Contributor Author

I don't think this is accurate, or maybe I don't understand what you mean. In println!("{:?}", serde_json::from_str::<serde_bytes::ByteBuf>("\"\\u0001\\n\"")) you can see those escape sequences are still expanded by serde_json during deserialization.

Ah, my bad. I had misread the source code.

For "\"\\ud83c\"", I would accept a PR to make this decode to [237, 160, 188] as a byte string.

I'll try to throw up a PR later today.

@lucacasonato
Contributor Author

@dtolnay I am not super familiar with all of this. Could you elaborate on how one would get from the escape sequence \ud83c to the three bytes [237, 160, 188]? The escape sequence encodes the bytes [216, 60]. How do I get from those two bytes to the three bytes you mentioned?

Sorry if this is a really dumb question!

@lucacasonato
Contributor Author

Oh, never mind, figured it out! [237, 160, 188] is the result of applying the UTF-8 encoding scheme to the code point U+D83C.
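
For anyone following along, here is the arithmetic spelled out (a sketch; the bit pattern is just the standard three-byte UTF-8 layout applied to a code point that strict UTF-8 would normally reject, since it is a surrogate):

fn main() {
    // U+D83C falls in the surrogate range, so strict UTF-8 forbids it, but
    // applying the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx anyway:
    let cp: u32 = 0xD83C;
    let bytes = [
        0b1110_0000 | ((cp >> 12) & 0x0F) as u8, // 0xED = 237
        0b1000_0000 | ((cp >> 6) & 0x3F) as u8,  // 0xA0 = 160
        0b1000_0000 | (cp & 0x3F) as u8,         // 0xBC = 188
    ];
    assert_eq!(bytes, [237, 160, 188]);
}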

@lucacasonato
Contributor Author

lucacasonato commented Nov 25, 2021

Thanks for landing that, @dtolnay. This only gets me about halfway there though, as contrary to what I had thought, RawValue cannot contain values with escaped lone surrogates.

Test case:

#[cfg(feature = "raw_value")]
#[test]
fn test_raw_de_lone_surrogate() {
    use serde_json::value::RawValue;

    let res = from_str::<Box<RawValue>>(r#""\ud83c""#);

    assert!(res.is_ok());
}

I was under the impression that RawValue would preserve the input text, and so would not try to decode escape sequences.

Is this something you think could be changed? If so, what would be the best way to go about it? I'd be happy to create another PR.

@lucacasonato
Contributor Author

Ah - ignore_escape is the culprit. I guess this should be changed to also accept lone surrogates.

@lucacasonato
Contributor Author

I wonder if it should just be changed to not validate the \u escape values at all. That will be done by the "real" parse later. During ignore_escape we are just trying to consume tokens so we can find the end of the string; whether the values here are valid is not really relevant at this stage, I think.

@lucacasonato
Contributor Author

After initial integration, back with some feedback from the real world: I think deserializing \ud83c to [237, 160, 188] is conceptually an interesting idea, but it is fundamentally flawed. A UTF-8 parser in "lossy" (replacement) mode will decode [237, 160, 188] into three replacement chars, not one.
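
A quick demonstration of that point (a sketch; Rust's lossy conversion replaces each byte of the invalid sequence):

fn main() {
    // The three bytes the earlier patch produces for the lone surrogate U+D83C.
    let bytes = [237u8, 160, 188];
    // Lossy UTF-8 decoding turns each invalid byte into its own U+FFFD,
    // so the single surrogate becomes three replacement characters.
    let lossy = String::from_utf8_lossy(&bytes);
    assert_eq!(lossy, "\u{FFFD}\u{FFFD}\u{FFFD}");
    println!("{}", lossy);
}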

No self-respecting UTF-8 encoder will encode U+D83C as [237, 160, 188]. It will either encode it as [239, 191, 189] (the encoded form of the replacement char) or error.

I think my earlier patch that deserializes \ud83c into a byte buffer of [237, 160, 188] should be reverted, as I think it is more wrong than correct.

Instead, the proper solution would be to introduce a LossyString type that replaces lone surrogates with the replacement char. With #830 this could be implemented outside of serde_json with a little bit of copied code (probably fine). Ideally that new type would live inside serde_json, though.
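
To make the idea concrete, here is a rough sketch of such a LossyString implemented outside serde_json (hypothetical type and helper names; it assumes the byte-string decoding discussed above, where an escaped lone surrogate comes through as its three-byte encoding):

use serde::{Deserialize, Deserializer};

// Hypothetical LossyString newtype: a string in which lone surrogates from
// the JSON input become U+FFFD instead of causing a deserialization error.
#[derive(Debug)]
struct LossyString(String);

impl<'de> Deserialize<'de> for LossyString {
    fn deserialize<D: Deserializer<'de>>(deserializer: D) -> Result<Self, D::Error> {
        // Deserialize the JSON string as raw bytes first, then convert.
        let bytes = serde_bytes::ByteBuf::deserialize(deserializer)?;
        Ok(LossyString(replace_lone_surrogates(&bytes)))
    }
}

fn replace_lone_surrogates(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    loop {
        match std::str::from_utf8(bytes) {
            Ok(s) => {
                out.push_str(s);
                return out;
            }
            Err(e) => {
                // Keep the valid prefix, then emit one U+FFFD for the
                // invalid sequence that follows it.
                let valid = e.valid_up_to();
                out.push_str(std::str::from_utf8(&bytes[..valid]).unwrap());
                bytes = &bytes[valid..];
                out.push('\u{FFFD}');
                // A surrogate code point encodes as ED A0..BF 80..BF, so the
                // whole three-byte sequence collapses into that single U+FFFD.
                if bytes.len() >= 3
                    && bytes[0] == 0xED
                    && (0xA0..=0xBF).contains(&bytes[1])
                    && (0x80..=0xBF).contains(&bytes[2])
                {
                    bytes = &bytes[3..];
                } else {
                    bytes = &bytes[e.error_len().unwrap_or(bytes.len())..];
                }
            }
        }
    }
}

fn main() {
    let v: LossyString = serde_json::from_str(r#""music: \ud83c""#).unwrap();
    assert_eq!(v.0, "music: \u{FFFD}");
    println!("{:?}", v);
}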
