Fix UTF-8 unsoundness in string::merge #194
Conversation
If `string::merge` encounters an encoding error, the appended string value is left containing broken UTF-8. The same can happen if any of the `Buf` methods panics. This results in UB if the string value is used after the error return, or during unwinding, respectively. Change the implementation to truncate the string back to valid UTF-8 content on any code path that does not go through validation of the newly appended bytes.
Test that adding an invalid UTF-8 sequence results in an error, and that the string is reverted to its state prior to the merge call.
Note that the merge implementation does not clear the existing value, as the protobuf spec suggests it should. If it did, the guard code would be simpler and marginally more efficient.
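The regression test described above could be sketched roughly like this (`merge_str` is a hypothetical stand-in for prost's `string::merge`, using only safe buffer moves; it is not the crate's actual implementation):

```rust
// Sketch of the regression test described above: merging an invalid UTF-8
// sequence must fail, and the target string must be left exactly as it was
// before the call. `merge_str` stands in for prost's string::merge.
fn merge_str(value: &mut String, incoming: &[u8]) -> Result<(), &'static str> {
    let mut bytes = std::mem::take(value).into_bytes();
    bytes.extend_from_slice(incoming);
    match String::from_utf8(bytes) {
        Ok(s) => {
            *value = s;
            Ok(())
        }
        Err(e) => {
            // Revert: drop the appended bytes and restore the prior content.
            let mut bytes = e.into_bytes();
            bytes.truncate(bytes.len() - incoming.len());
            *value = String::from_utf8(bytes).expect("prior content was valid UTF-8");
            Err("invalid string value: data is not UTF-8 encoded")
        }
    }
}
```

A test would merge `&[0x80, 0x81]` (a stray continuation byte, so invalid UTF-8) into `"hello"`, assert the `Err`, and assert the string still equals `"hello"`.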
This looks good, thanks! I would be tempted to add a safe wrapper function which takes a closure to operate on the `Vec<u8>`, then commits, so that the unsafe code is encapsulated.
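Such a wrapper might look something like this (a sketch only; the name `with_string_bytes` and the error handling are illustrative, not prost's API — and this formulation happens to need no `unsafe` at all, at the cost of moving the string):

```rust
// Sketch of the suggested closure-taking wrapper: the closure gets the
// string's bytes as a Vec<u8>; afterwards the wrapper validates the buffer
// and restores a valid String, truncating back to the previously valid
// prefix on failure. Assumes the closure only appends, as bytes::merge does.
fn with_string_bytes<F, E>(value: &mut String, f: F) -> Result<(), &'static str>
where
    F: FnOnce(&mut Vec<u8>) -> Result<(), E>,
{
    let prev_len = value.len();
    let mut bytes = std::mem::take(value).into_bytes();
    let result = f(&mut bytes).map_err(|_| "merge failed");
    match String::from_utf8(bytes) {
        Ok(s) => {
            *value = s;
            result
        }
        Err(e) => {
            // Discard the unvalidated suffix; keep the previously valid prefix.
            let mut bytes = e.into_bytes();
            bytes.truncate(prev_len);
            *value = String::from_utf8(bytes).expect("prefix was valid UTF-8");
            Err("invalid string value: data is not UTF-8 encoded")
        }
    }
}
```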
src/encoding.rs (Outdated)

```rust
fn commit(mut self) -> Result<(), DecodeError> {
    let new_bytes = &self.string.as_bytes()[self.committed_len..];
    str::from_utf8(new_bytes)
```
I think you need to check the whole string here, not just the new bytes, since while the user has mutable access to the internal string they can modify it as well as append to it.
In a general-purpose `StringGuard` I would do that, but here we can use the knowledge that `bytes::merge` does not change the already filled part of the vector, and that nothing else can modify it while it is borrowed.

But again, clearing the string value before the new wire data are merged, and also on `StringGuard` drop without commit, would be simpler and would not rely on the other function's behavior for safety. I just don't understand why this implementation concatenates non-repeating fields of types `string` and `bytes`.
> clearing the string value before the new wire data are merged... would be simpler
Also, in the `BytesString` (or `StrChunk`) rework in #190, I would just suck the data into `Bytes` and convert that into the new string value with a safe UTF-8 content check, rather than fiddle with unsafe mutable representations.
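The safe conversion being described reduces to a single checked constructor call; sketched here with `Vec<u8>` standing in for the `Bytes` buffer:

```rust
// Sketch of the safe approach described above, with Vec<u8> standing in for
// bytes::Bytes: read the field data into a plain byte buffer, then build the
// new string value with a checked UTF-8 conversion. No mutable access to a
// String's internal representation is ever needed.
fn string_from_wire(data: Vec<u8>) -> Result<String, &'static str> {
    String::from_utf8(data)
        .map_err(|_| "invalid string value: data is not UTF-8 encoded")
}
```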
> I just don't understand why this implementation concatenates non-repeating fields of types string and bytes.
It's entirely possible that this is done by accident, and should be fixed. Do you have a reference for where the spec / PB docs mention that merging will overwrite strings?
In the developer guide on encoding:

> Normally, an encoded message would never have more than one instance of a non-repeated field. However, parsers are expected to handle the case in which they do. For numeric types and strings, if the same field appears multiple times, the parser accepts the last value it sees.
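The last-value-wins behavior quoted above can be sketched for a string field as a replace rather than an append (illustrative only; prost's actual merge for string fields appends, which is the point under discussion):

```rust
// Last-value-wins semantics from the spec passage above: each occurrence of
// a non-repeated string field replaces the previous value rather than being
// concatenated onto it. Validation happens before the old value is touched,
// so a decode error leaves the previous value intact.
fn merge_replace(value: &mut String, field_bytes: &[u8]) -> Result<(), &'static str> {
    let s = std::str::from_utf8(field_bytes)
        .map_err(|_| "invalid string value: data is not UTF-8 encoded")?;
    value.clear();
    value.push_str(s);
    Ok(())
}
```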
src/encoding.rs (Outdated)

```rust
// If an unwind happens before the check is performed and the result
// is committed, the string is truncated to the previously valid length.
struct StringGuard<'a> {
    string: &'a mut String,
```
From my (admittedly weak) understanding of the unsafe rules, it's instant UB to create a `str`/`String` containing invalid UTF-8. Given that, I think this pattern is still unsafe, since it appends bytes to a `String` and only then checks validity. On my understanding, the only safe way to do what we are trying to do here is to accumulate into a `Vec<u8>`, check that it's valid UTF-8, and then and only then convert to `String`.
As far as I understand, it's not UB until the invalid data are accessed as `str`/`String`; but, theoretically, `.as_bytes()` or even `.as_mut_vec()` might be in violation of this.
src/encoding.rs (Outdated)

```rust
super::bytes::merge(wire_type, value.as_mut_vec(), buf)?;
str::from_utf8(value.as_bytes())
    .map_err(|_| DecodeError::new("invalid string value: data is not UTF-8 encoded"))?;
super::bytes::merge(wire_type, guard.as_mut_vec(), buf)?;
```
I think using some clever buffer swaps we can simplify this and remove the need for `unsafe` entirely. Here's what I'm thinking:

```rust
pub fn merge<B>(wire_type: WireType, value: &mut String, buf: &mut B) -> Result<(), DecodeError>
where
    B: Buf,
{
    let mut value_bytes = mem::replace(value, String::new()).into_bytes();
    bytes::merge(wire_type, &mut value_bytes, buf)?;
    *value = String::from_utf8(value_bytes)
        .map_err(|_| DecodeError::new("invalid string value: data is not UTF-8 encoded"))?;
    Ok(())
}
```
That can even be simplified a bit more if we decide we need to overwrite on merge instead of append. WDYT?
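The buffer-swap trick in the snippet above can be isolated from the prost types like this (the name `append_checked` is illustrative):

```rust
// The buffer-swap trick isolated: temporarily move the String out as a
// Vec<u8>, mutate the bytes, then move it back through the checked
// String::from_utf8 constructor. The String is empty while the bytes are in
// flight, so it never observes invalid UTF-8. Note the trade-off discussed
// below: on failure the prior content is lost and the string is left empty.
fn append_checked(value: &mut String, extra: &[u8]) -> Result<(), &'static str> {
    let mut value_bytes = std::mem::replace(value, String::new()).into_bytes();
    value_bytes.extend_from_slice(extra);
    *value = String::from_utf8(value_bytes)
        .map_err(|_| "invalid string value: data is not UTF-8 encoded")?;
    Ok(())
}
```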
@danburkert I think your implementation is good, unless somebody cares about the existing string data being lost on merge failure.
As suggested by Dan Burkert: https://github.com/danburkert/prost/pull/194#discussion_r292096009 The existing string value is dropped in case of a merge failure, but this is better than exposing an invalid value.
Thanks again @mzabaluev, this is a great catch!
3ab36fda9a1788e0 introduced a ~3% regression on some decoding benchmarks. My best guess is that swapping the value temporarily with a new string is not being optimized well. This commit reverts to using String::as_mut_vec, but carefully clears the string on decode failure. It also fixes up the associated regression test, which wasn't properly covering the malformed UTF-8 case.

```
// Before #194
dataset/google_message3_2/decode
    time:   [219.45 ms 219.48 ms 219.51 ms]
    thrpt:  [458.64 MiB/s 458.72 MiB/s 458.78 MiB/s]

// #194
dataset/google_message3_2/decode
    time:   [226.08 ms 226.11 ms 226.18 ms]
    thrpt:  [445.12 MiB/s 445.25 MiB/s 445.32 MiB/s]

// This commit
dataset/google_message3_2/decode
    time:   [219.05 ms 219.76 ms 221.78 ms]
    thrpt:  [453.94 MiB/s 458.12 MiB/s 459.61 MiB/s]
```
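The pattern the commit reverts to could be sketched as follows (illustrative only, with `extend_from_slice` standing in for the actual `bytes::merge` call; note that, as the thread discusses, whether the string may transiently hold invalid bytes at all is a contested point of the unsafe rules):

```rust
// Sketch of the as_mut_vec approach the commit describes: mutate the
// String's bytes in place, validate afterwards, and clear the string on
// failure so invalid UTF-8 never remains observable after the error return.
fn merge_in_place(value: &mut String, extra: &[u8]) -> Result<(), &'static str> {
    unsafe {
        // Safety relies on validating (or clearing) before the &mut String
        // is used as a string again.
        value.as_mut_vec().extend_from_slice(extra);
    }
    match std::str::from_utf8(value.as_bytes()) {
        Ok(_) => Ok(()),
        Err(_) => {
            // Invalid bytes must not remain observable: drop everything.
            value.clear();
            Err("invalid string value: data is not UTF-8 encoded")
        }
    }
}
```

This avoids the temporary swap that appeared to defeat the optimizer, at the cost of reintroducing `unsafe`.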
This PR seems to fix unsoundness that can be triggered in practice. Please file a security advisory at https://github.com/RustSec/advisory-db so that dependents can check whether their crate is affected and upgrade.
Fixes #193