Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix UTF-8 unsoundness in string::merge #194

Merged
merged 3 commits into from
Jun 11, 2019
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
56 changes: 50 additions & 6 deletions src/encoding.rs
Original file line number Diff line number Diff line change
Expand Up @@ -716,6 +716,44 @@ macro_rules! length_delimited {
pub mod string {
use super::*;

// String::as_mut_vec is unsafe because it doesn't check that the bytes
// inserted into the resulting vec are valid UTF-8. We check
// after fully extending the buffer in order to ensure this is safe.
// If an unwind happens before the check is performed and the result
// is committed, the string is truncated to the previously valid length.
struct StringGuard<'a> {
string: &'a mut String,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my (admittedly weak) understanding of the unsafe rules, it's instant UB to create a str/String containing invalid UTF-8. Given that, I think this pattern is still unsafe, since it's appending bytes to a String, then checking validity. Given my understanding, I think the only safe way to do what we are trying to do here is accumulate into a Vec<u8>, check if it's valid utf-8, and then and only then convert to String.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, it's not UB until the invalid data are accessed as str/String; but, theoretically, .as_bytes() or even .as_mut_vec() might be in violation of this.

committed_len: usize,
}

impl<'a> Drop for StringGuard<'a> {
fn drop(&mut self) {
self.string.truncate(self.committed_len);
}
}

impl<'a> StringGuard<'a> {
fn new(string: &'a mut String) -> Self {
let committed_len = string.len();
StringGuard {
string,
committed_len,
}
}

unsafe fn as_mut_vec(&mut self) -> &mut Vec<u8> {
self.string.as_mut_vec()
}

fn commit(mut self) -> Result<(), DecodeError> {
let new_bytes = &self.string.as_bytes()[self.committed_len..];
str::from_utf8(new_bytes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to check the whole string, not just the new bytes here, since when the user has mutable access to the internal string, they can modify as well as append.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a general purpose StringGuard I would do, but here we can use the knowledge that bytes::merge would not change the already filled part of the vector, and nothing else can modify it while it is borrowed.

But again, clearing the string value before the new wire data are merged, and also on StringGuard drop without commit, would be simpler and not rely on the other function's behavior to be safe. I just don't understand why this implementation concatenates non-repeating fields of types string and bytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clearing the string value before the new wire data are merged... would be simpler

Also in the BytesString (or StrChunk) rework in #190, I would just suck the data into Bytes and convert that into the new string value with a safe UTF-8 content check rather than fiddle with unsafe mutable representations.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just don't understand why this implementation concatenates non-repeating fields of types string and bytes.

It's entirely possible that this is done by accident, and should be fixed. Do you have a reference for where the spec / PB docs mention that merging will overwrite strings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the developer guide on encoding:

Normally, an encoded message would never have more than one instance of a non-repeated field. However, parsers are expected to handle the case in which they do. For numeric types and strings, if the same field appears multiple times, the parser accepts the last value it sees.

.map_err(|_| DecodeError::new("invalid string value: data is not UTF-8 encoded"))?;
self.committed_len += new_bytes.len();
Ok(())
}
}

pub fn encode<B>(tag: u32, value: &String, buf: &mut B)
where
B: BufMut,
Expand All @@ -728,14 +766,11 @@ pub mod string {
where
B: Buf,
{
let mut guard = StringGuard::new(value);
unsafe {
// String::as_mut_vec is unsafe because it doesn't check that the bytes
// inserted into it the resulting vec are valid UTF-8. We check
// explicitly in order to ensure this is safe.
super::bytes::merge(wire_type, value.as_mut_vec(), buf)?;
str::from_utf8(value.as_bytes())
.map_err(|_| DecodeError::new("invalid string value: data is not UTF-8 encoded"))?;
super::bytes::merge(wire_type, guard.as_mut_vec(), buf)?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using some clever buffer swaps we can simplify this and remove the need for unsafe entirely, here's what I'm thinking:

    pub fn merge<B>(wire_type: WireType, value: &mut String, buf: &mut B) -> Result<(), DecodeError>
    where
        B: Buf,
    {
        let mut value_bytes = mem::replace(value, String::new()).into_bytes();
        bytes::merge(wire_type, &mut value_bytes, buf)?;
        *value = String::from_utf8(value_bytes)
            .map_err(|_| DecodeError::new("invalid string value: data is not UTF-8 encoded"))?;
        Ok(())
    }

That can even be simplified a bit more if we decide we need to overwrite on merge instead of append. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danburkert I think your implementation is good, unless somebody cares about the existing string data being lost on merge failure.

}
guard.commit()?;
Ok(())
}

Expand Down Expand Up @@ -1307,6 +1342,15 @@ mod test {
}
}

#[test]
fn string_merge_failure() {
let mut s = String::new();
let mut buf = Cursor::new(b"\x80\x80");
let r = string::merge(WireType::LengthDelimited, &mut s, &mut buf);
r.expect_err("must be an error");
assert!(s.is_empty());
}

#[test]
fn varint() {
fn check(value: u64, encoded: &[u8]) {
Expand Down