Provide an option to replace lone surrogates in strings with replacement characters #827
Comments
Those inputs are already supported by deserializing to byte strings. You can't directly put the deserialization result into a String because of the UTF-8 invariant, but you can deserialize to byte strings and perform the lossy conversion yourself if you need it.

fn main() {
    let bytes = [b'"', "\u{1f3b5}".as_bytes()[0], b'"'];
    assert!(serde_json::from_slice::<String>(&bytes).is_err());
    let v = serde_json::from_slice::<serde_bytes::ByteBuf>(&bytes).unwrap();
    println!("{:?}", v);
}
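The lossy conversion step mentioned above needs only the standard library. A minimal sketch, assuming the byte buffer holds the string contents without the surrounding quotes (here the lone leading byte of the UTF-8 encoding of U+1F3B5); the helper name `lossy_string` is hypothetical and not part of serde_json or serde_bytes:

```rust
/// Hypothetical helper: turn possibly-invalid UTF-8 bytes into a String,
/// substituting U+FFFD for every invalid sequence.
fn lossy_string(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // First byte (0xF0) of the four-byte UTF-8 encoding of U+1F3B5,
    // standing in for the contents of the deserialized byte buffer.
    let bytes = ["\u{1f3b5}".as_bytes()[0]];
    let s = lossy_string(&bytes);
    println!("{:?}", s);
}
```

This only helps when the malformed data is present as raw bytes in the input; it does nothing for escape sequences.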
Oh that's very interesting - I was not aware strings could be deserialized into byte buffers. Unfortunately I don't think this solves the problem though, as the embedder now needs to handle escape sequences manually (essentially maintaining a copy of the code from

Your example is also not entirely accurate to what a JS engine will emit: it emits lone surrogates and other malformed code points as escape sequences in the JSON, rather than as the code points themselves (https://github.com/tc39/proposal-well-formed-stringify). For example:

fn main() {
    let str = "\"\\ud83c\"";
    assert!(serde_json::from_str::<String>(str).is_err());
    let v = serde_json::from_str::<serde_bytes::ByteBuf>(&str).unwrap();
    println!("{:?}", v);
}

An implementation inside of
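To illustrate the difference between the two examples: when the surrogate arrives as an escape sequence, the JSON text itself is plain ASCII, so a byte-level UTF-8 check on the raw input finds nothing wrong with it. A small std-only sketch (the helper name `is_valid_utf8` is illustrative, not a serde_json API):

```rust
/// Illustrative helper: report whether the raw input is valid UTF-8.
fn is_valid_utf8(input: &[u8]) -> bool {
    std::str::from_utf8(input).is_ok()
}

fn main() {
    // JSON text containing the escape sequence \ud83c: eight ASCII bytes.
    let escaped: &[u8] = br#""\ud83c""#;
    // JSON text containing a raw, truncated UTF-8 sequence instead.
    let raw: &[u8] = &[b'"', 0xF0, b'"'];

    println!("escaped is valid UTF-8: {}", is_valid_utf8(escaped));
    println!("raw is valid UTF-8: {}", is_valid_utf8(raw));
}
```

The lone surrogate in the escaped form only appears once the escape is decoded, which is why it has to be dealt with in the parser's escape handling.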
I don't think this is accurate, or maybe I don't understand what you mean. In For
Ah, my bad. I had misread the source code.
I'll try to throw up a PR later today.
@dtolnay I am not super familiar with all of this. Could you elaborate on how one would get from an escape sequence

Sorry if this is a really dumb question!

Oh nvm, figured it out! The
Thanks for landing @dtolnay. This only gets me about halfway there though, as contrary to what I had thought,

Test case:

#[cfg(feature = "raw_value")]
#[test]
fn test_raw_de_lone_surrogate() {
    use serde_json::value::RawValue;
    let res = from_str::<Box<RawValue>>(r#""\ud83c""#);
    assert!(res.is_ok());
}

I was under the impression that

Is this something you think could be changed? If so, what would be the best way to go about it? I'd be happy to create another PR.
Ah -

I wonder if it should just be changed to not validate the codes in
After initial integration, back with some feedback from the real world: I think deserialization of

No self-respecting UTF-8 encoder will encode U+D83C into

I think my earlier patch that deserializes

Instead the proper solution would be to introduce a
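For context on the U+D83C remark, a short std-only check of how Rust treats that code point. The bytes ED A0 BC used here are what the generic three-byte UTF-8 bit pattern would produce for U+D83C; whether those are the bytes the comment had in mind is an assumption, and `is_scalar_value` is a hypothetical helper:

```rust
/// Hypothetical helper: true if the code point can be held in a Rust `char`.
fn is_scalar_value(code_point: u32) -> bool {
    char::from_u32(code_point).is_some()
}

fn main() {
    // U+D83C is a high surrogate, so it is not a Unicode scalar value.
    assert!(!is_scalar_value(0xD83C));
    // U+1F3B5 is an ordinary scalar value.
    assert!(is_scalar_value(0x1F3B5));
    // Pushing U+D83C through the three-byte pattern gives ED A0 BC,
    // which Rust rejects as invalid UTF-8.
    assert!(String::from_utf8(vec![0xED, 0xA0, 0xBC]).is_err());
}
```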
This is a continuation of #495.
JavaScript engines unanimously agree that lone surrogates are valid inside of JSON strings (V8: https://bugs.chromium.org/p/v8/issues/detail?id=11193, SpiderMonkey: https://bugzilla.mozilla.org/show_bug.cgi?id=1496747, JSC: try JSON.stringify('\u{1f3b5}'[0]) in the Safari console). All of their JSON parsers accept lone surrogates (try JSON.stringify(JSON.parse(JSON.stringify('\u{1f3b5}'[0]))) in their relevant DevTools consoles).

serde_json always errors on lone surrogates right now. This is understandable given that JS represents its strings as WTF-16, while in Rust they are always valid UTF-8. The unfortunate reality is that JS engines do produce JSON text with these lone surrogates, and these JSON messages sometimes need to be parsed. As serde_json aims to bridge the gap between JSON and Rust, it would be great if serde_json provided a way to decode strings with lone surrogates, where the surrogates are replaced by the replacement character instead of erroring.

What about something along the lines of serde_json::from_str_lossy?
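The replacement behavior being asked for can be sketched with the standard library, independent of serde_json. This assumes the \uXXXX escapes of a string have already been parsed into UTF-16 code units; the function name `decode_units_lossy` is hypothetical and is not the proposed serde_json API:

```rust
/// Hypothetical helper: decode UTF-16 code units into a String, replacing
/// every lone surrogate with U+FFFD instead of returning an error.
fn decode_units_lossy(units: &[u16]) -> String {
    std::char::decode_utf16(units.iter().copied())
        .map(|unit| unit.unwrap_or(std::char::REPLACEMENT_CHARACTER))
        .collect()
}

fn main() {
    // A well-formed surrogate pair (D83C DFB5) decodes to U+1F3B5.
    println!("{}", decode_units_lossy(&[0xD83C, 0xDFB5]));
    // A lone high surrogate between 'a' and 'b' becomes U+FFFD.
    println!("{:?}", decode_units_lossy(&[0x0061, 0xD83C, 0x0062]));
}
```

The conversion is lossy by design: distinct malformed inputs map to the same output, so the original text cannot be reconstructed from the result.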