Fix UTF-8 unsoundness in string::merge #194
Conversation
If `string::merge` encounters an encoding error, the appended string value is left containing broken UTF-8. The same can happen if any of the `Buf` methods panics. This results in UB if the string value is used after the error return, or during unwinding, respectively. Change the implementation to truncate the string back to valid UTF-8 content on any code path that does not go through validation of the newly appended bytes.
Test that adding an invalid UTF-8 sequence results in an error, and that the string is reverted to its state prior to the merge call.
Note that the merge implementation does not clear the existing value, as the protobuf spec suggests it should. If it did, the guard code would be simpler and marginally more efficient.
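The regression test described above could be sketched roughly like this (`merge_str` is a hypothetical stand-in for prost's `string::merge`, using only safe buffer moves; it is not the crate's actual implementation):

```rust
// Sketch of the regression test described above: merging an invalid UTF-8
// sequence must fail, and the target string must be left exactly as it was
// before the call. `merge_str` stands in for prost's string::merge.
fn merge_str(value: &mut String, incoming: &[u8]) -> Result<(), &'static str> {
    let mut bytes = std::mem::take(value).into_bytes();
    bytes.extend_from_slice(incoming);
    match String::from_utf8(bytes) {
        Ok(s) => {
            *value = s;
            Ok(())
        }
        Err(e) => {
            // Revert: drop the appended bytes and restore the prior content.
            let mut bytes = e.into_bytes();
            bytes.truncate(bytes.len() - incoming.len());
            *value = String::from_utf8(bytes).expect("prior content was valid UTF-8");
            Err("invalid string value: data is not UTF-8 encoded")
        }
    }
}
```

A test would merge `&[0x80, 0x81]` (a stray continuation byte, so invalid UTF-8) into `"hello"`, assert the `Err`, and assert the string still equals `"hello"`.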
This looks good, thanks! I would be tempted to add a safe wrapper function which takes a closure to operate on the `Vec<u8>`, then commits, so that the unsafe code is encapsulated.
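Such a wrapper might look something like this (a sketch only; the name `with_string_bytes` and the error handling are illustrative, not prost's API — and this formulation happens to need no `unsafe` at all, at the cost of moving the string):

```rust
// Sketch of the suggested closure-taking wrapper: the closure gets the
// string's bytes as a Vec<u8>; afterwards the wrapper validates the buffer
// and restores a valid String, truncating back to the previously valid
// prefix on failure. Assumes the closure only appends, as bytes::merge does.
fn with_string_bytes<F, E>(value: &mut String, f: F) -> Result<(), &'static str>
where
    F: FnOnce(&mut Vec<u8>) -> Result<(), E>,
{
    let prev_len = value.len();
    let mut bytes = std::mem::take(value).into_bytes();
    let result = f(&mut bytes).map_err(|_| "merge failed");
    match String::from_utf8(bytes) {
        Ok(s) => {
            *value = s;
            result
        }
        Err(e) => {
            // Discard the unvalidated suffix; keep the previously valid prefix.
            let mut bytes = e.into_bytes();
            bytes.truncate(prev_len);
            *value = String::from_utf8(bytes).expect("prefix was valid UTF-8");
            Err("invalid string value: data is not UTF-8 encoded")
        }
    }
}
```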
src/encoding.rs (Outdated)

```rust
fn commit(mut self) -> Result<(), DecodeError> {
    let new_bytes = &self.string.as_bytes()[self.committed_len..];
    str::from_utf8(new_bytes)
```
I think you need to check the whole string here, not just the new bytes, since while the user has mutable access to the internal string they can modify it as well as append to it.
In a general-purpose `StringGuard` I would do that, but here we can use the knowledge that `bytes::merge` does not change the already filled part of the vector, and that nothing else can modify it while it is borrowed.

But again, clearing the string value before the new wire data are merged, and also on `StringGuard` drop without commit, would be simpler and would not rely on the other function's behavior for safety. I just don't understand why this implementation concatenates non-repeating fields of types `string` and `bytes`.
> clearing the string value before the new wire data are merged... would be simpler
Also, in the `BytesString` (or `StrChunk`) rework in #190, I would just suck the data into `Bytes` and convert that into the new string value with a safe UTF-8 content check, rather than fiddle with unsafe mutable representations.
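The safe conversion being described reduces to a single checked constructor call; sketched here with `Vec<u8>` standing in for the `Bytes` buffer:

```rust
// Sketch of the safe approach described above, with Vec<u8> standing in for
// bytes::Bytes: read the field data into a plain byte buffer, then build the
// new string value with a checked UTF-8 conversion. No mutable access to a
// String's internal representation is ever needed.
fn string_from_wire(data: Vec<u8>) -> Result<String, &'static str> {
    String::from_utf8(data)
        .map_err(|_| "invalid string value: data is not UTF-8 encoded")
}
```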
> I just don't understand why this implementation concatenates non-repeating fields of types string and bytes.
It's entirely possible that this is done by accident, and should be fixed. Do you have a reference for where the spec / PB docs mention that merging will overwrite strings?
In the developer guide on encoding:

> Normally, an encoded message would never have more than one instance of a non-repeated field. However, parsers are expected to handle the case in which they do. For numeric types and strings, if the same field appears multiple times, the parser accepts the last value it sees.
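The last-value-wins behavior quoted above can be sketched for a string field as a replace rather than an append (illustrative only; prost's actual merge for string fields appends, which is the point under discussion):

```rust
// Last-value-wins semantics from the spec passage above: each occurrence of
// a non-repeated string field replaces the previous value rather than being
// concatenated onto it. Validation happens before the old value is touched,
// so a decode error leaves the previous value intact.
fn merge_replace(value: &mut String, field_bytes: &[u8]) -> Result<(), &'static str> {
    let s = std::str::from_utf8(field_bytes)
        .map_err(|_| "invalid string value: data is not UTF-8 encoded")?;
    value.clear();
    value.push_str(s);
    Ok(())
}
```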
src/encoding.rs (Outdated)

```rust
// If an unwind happens before the check is performed and the result
// is committed, the string is truncated to the previously valid length.
struct StringGuard<'a> {
    string: &'a mut String,
```
From my (admittedly weak) understanding of the unsafe rules, it's instant UB to create a `str`/`String` containing invalid UTF-8. Given that, I think this pattern is still unsafe, since it appends bytes to a `String` and only then checks validity. On my understanding, the only safe way to do what we are trying to do here is to accumulate into a `Vec<u8>`, check that it's valid UTF-8, and then and only then convert to `String`.
As far as I understand, it's not UB until the invalid data are accessed as `str`/`String`; but, theoretically, `.as_bytes()` or even `.as_mut_vec()` might be in violation of this.
src/encoding.rs (Outdated)

```rust
super::bytes::merge(wire_type, value.as_mut_vec(), buf)?;
str::from_utf8(value.as_bytes())
    .map_err(|_| DecodeError::new("invalid string value: data is not UTF-8 encoded"))?;
super::bytes::merge(wire_type, guard.as_mut_vec(), buf)?;
```
I think using some clever buffer swaps we can simplify this and remove the need for `unsafe` entirely. Here's what I'm thinking:

```rust
pub fn merge<B>(wire_type: WireType, value: &mut String, buf: &mut B) -> Result<(), DecodeError>
where
    B: Buf,
{
    let mut value_bytes = mem::replace(value, String::new()).into_bytes();
    bytes::merge(wire_type, &mut value_bytes, buf)?;
    *value = String::from_utf8(value_bytes)
        .map_err(|_| DecodeError::new("invalid string value: data is not UTF-8 encoded"))?;
    Ok(())
}
```
That can even be simplified a bit more if we decide we need to overwrite on merge instead of append. WDYT?
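The buffer-swap trick in the snippet above can be isolated from the prost types like this (the name `append_checked` is illustrative):

```rust
// The buffer-swap trick isolated: temporarily move the String out as a
// Vec<u8>, mutate the bytes, then move it back through the checked
// String::from_utf8 constructor. The String is empty while the bytes are in
// flight, so it never observes invalid UTF-8. Note the trade-off discussed
// below: on failure the prior content is lost and the string is left empty.
fn append_checked(value: &mut String, extra: &[u8]) -> Result<(), &'static str> {
    let mut value_bytes = std::mem::replace(value, String::new()).into_bytes();
    value_bytes.extend_from_slice(extra);
    *value = String::from_utf8(value_bytes)
        .map_err(|_| "invalid string value: data is not UTF-8 encoded")?;
    Ok(())
}
```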
@danburkert I think your implementation is good, unless somebody cares about the existing string data being lost on merge failure.
As suggested by Dan Burkert: https://github.com/danburkert/prost/pull/194#discussion_r292096009 The existing string value is dropped in case of a merge failure, but this is better than exposing an invalid value.
Thanks again @mzabaluev, this is a great catch!
3ab36fda9a1788e0 introduced a ~3% regression on some decoding benchmarks. My best guess is that swapping the value temporarily with a new string is not being optimized well. This commit reverts to using String::as_mut_vec, but carefully clears the string on decode failure. It also fixes up the associated regression test, which wasn't properly covering the malformed UTF-8 case.

```
// Before #194
dataset/google_message3_2/decode
    time:   [219.45 ms 219.48 ms 219.51 ms]
    thrpt:  [458.64 MiB/s 458.72 MiB/s 458.78 MiB/s]

// #194
dataset/google_message3_2/decode
    time:   [226.08 ms 226.11 ms 226.18 ms]
    thrpt:  [445.12 MiB/s 445.25 MiB/s 445.32 MiB/s]

// This commit
dataset/google_message3_2/decode
    time:   [219.05 ms 219.76 ms 221.78 ms]
    thrpt:  [453.94 MiB/s 458.12 MiB/s 459.61 MiB/s]
```
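The pattern the commit reverts to could be sketched as follows (illustrative only, with `extend_from_slice` standing in for the actual `bytes::merge` call; note that, as the thread discusses, whether the string may transiently hold invalid bytes at all is a contested point of the unsafe rules):

```rust
// Sketch of the as_mut_vec approach the commit describes: mutate the
// String's bytes in place, validate afterwards, and clear the string on
// failure so invalid UTF-8 never remains observable after the error return.
fn merge_in_place(value: &mut String, extra: &[u8]) -> Result<(), &'static str> {
    unsafe {
        // Safety relies on validating (or clearing) before the &mut String
        // is used as a string again.
        value.as_mut_vec().extend_from_slice(extra);
    }
    match std::str::from_utf8(value.as_bytes()) {
        Ok(_) => Ok(()),
        Err(_) => {
            // Invalid bytes must not remain observable: drop everything.
            value.clear();
            Err("invalid string value: data is not UTF-8 encoded")
        }
    }
}
```

This avoids the temporary swap that appeared to defeat the optimizer, at the cost of reintroducing `unsafe`.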
This PR seems to fix unsoundness that can be triggered in practice. Please file a security advisory at https://github.com/RustSec/advisory-db so that dependents can check whether their crate is affected and upgrade.
Fixes #193