Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix UTF-8 unsoundness in string::merge #194

Merged
merged 3 commits into from Jun 11, 2019

Conversation

mzabaluev
Copy link
Contributor

Fixes #193

In case of an encoding error in string::merge, the appended string value
is left with broken UTF-8 in case of an error. The same can happen if
any of the Buf methods panics. This results in UB if the string value is used
after the error return or in unwind, respectively.

Change the implementation to truncate the string back to valid UTF-8
content in any code path that does not go through validation of the
newly appended bytes.
Test that adding an invalid UTF-8 sequence results in an error,
and that the string is reverted to its state prior to the merge call.
@mzabaluev
Copy link
Contributor Author

Note that the merge implementation does not clear the existing value, as the protobuf spec suggests it should. If it did, the guard code would be simpler and marginally more efficient.

Copy link
Contributor

@nrc nrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, thanks! I would be tempted to add a safe wrapper function which takes a closure to operate on the Vec<u8>, then commits so that the unsafe code is encapsulated.

src/encoding.rs Outdated

fn commit(mut self) -> Result<(), DecodeError> {
let new_bytes = &self.string.as_bytes()[self.committed_len..];
str::from_utf8(new_bytes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to check the whole string, not just the new bytes here, since when the user has mutable access to the internal string, they can modify as well as append.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a general purpose StringGuard I would do, but here we can use the knowledge that bytes::merge would not change the already filled part of the vector, and nothing else can modify it while it is borrowed.

But again, clearing the string value before the new wire data are merged, and also on StringGuard drop without commit, would be simpler and not rely on the other function's behavior to be safe. I just don't understand why this implementation concatenates non-repeating fields of types string and bytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clearing the string value before the new wire data are merged... would be simpler

Also in the BytesString (or StrChunk) rework in #190, I would just suck the data into Bytes and convert that into the new string value with a safe UTF-8 content check rather than fiddle with unsafe mutable representations.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just don't understand why this implementation concatenates non-repeating fields of types string and bytes.

It's entirely possible that this is done by accident, and should be fixed. Do you have a reference for where the spec / PB docs mention that merging will overwrite strings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the developer guide on encoding:

Normally, an encoded message would never have more than one instance of a non-repeated field. However, parsers are expected to handle the case in which they do. For numeric types and strings, if the same field appears multiple times, the parser accepts the last value it sees.

src/encoding.rs Outdated
// If an unwind happens before the check is performed and the result
// is committed, the string is truncated to the previously valid length.
struct StringGuard<'a> {
string: &'a mut String,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my (admittedly weak) understanding of the unsafe rules, it's instant UB to create a str/String containing invalid UTF-8. Given that, I think this pattern is still unsafe, since it's appending bytes to a String, then checking validity. Given my understanding, I think the only safe way to do what we are trying to do here is accumulate into a Vec<u8>, check if it's valid utf-8, and then and only then convert to String.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, it's not UB until the invalid data are accessed as str/String; but, theoretically, .as_bytes() or even .as_mut_vec() might be in violation of this.

src/encoding.rs Outdated
super::bytes::merge(wire_type, value.as_mut_vec(), buf)?;
str::from_utf8(value.as_bytes())
.map_err(|_| DecodeError::new("invalid string value: data is not UTF-8 encoded"))?;
super::bytes::merge(wire_type, guard.as_mut_vec(), buf)?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using some clever buffer swaps we can simplify this and remove the need for unsafe entirely, here's what I'm thinking:

    pub fn merge<B>(wire_type: WireType, value: &mut String, buf: &mut B) -> Result<(), DecodeError>
    where
        B: Buf,
    {
        let mut value_bytes = mem::replace(value, String::new()).into_bytes();
        bytes::merge(wire_type, &mut value_bytes, buf)?;
        *value = String::from_utf8(value_bytes)
            .map_err(|_| DecodeError::new("invalid string value: data is not UTF-8 encoded"))?;
        Ok(())
    }

That can even be simplified a bit more if we decide we need to overwrite on merge instead of append. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danburkert I think your implementation is good, unless somebody cares about the existing string data being lost on merge failure.

As suggested by Dan Burkert:
https://github.com/danburkert/prost/pull/194#discussion_r292096009

The existing string value is dropped in case of a merge failure,
but this is better than exposing an invalid value.
@danburkert danburkert merged commit 5a25f27 into tokio-rs:master Jun 11, 2019
@danburkert
Copy link
Collaborator

Thanks again @mzabaluev, this is a great catch!

danburkert added a commit that referenced this pull request Jun 17, 2019
3ab36fda9a1788e0 introduced a ~3% regression on some decoding
benchmarks. My best guess is that swapping the value temporarily with a
new string is not being optimized well. This commit reverts to using
String::as_mut_vec, but carefully clears the string on decode failure.
It also fixes up the associated regression test, which wasn't properly
covering the malformed UTF-8 case.

// Before #194
dataset/google_message3_2/decode
                        time:   [219.45 ms 219.48 ms 219.51 ms]
                        thrpt:  [458.64 MiB/s 458.72 MiB/s 458.78 MiB/s]

// #194
dataset/google_message3_2/decode
                        time:   [226.08 ms 226.11 ms 226.18 ms]
                        thrpt:  [445.12 MiB/s 445.25 MiB/s 445.32 MiB/s]

// This commit
dataset/google_message3_2/decode
                        time:   [219.05 ms 219.76 ms 221.78 ms]
                        thrpt:  [453.94 MiB/s 458.12 MiB/s 459.61 MiB/s]
danburkert added a commit that referenced this pull request Jun 17, 2019
3ab36fda9a1788e0 introduced a ~3% regression on some decoding
benchmarks. My best guess is that swapping the value temporarily with a
new string is not being optimized well. This commit reverts to using
String::as_mut_vec, but carefully clears the string on decode failure.
It also fixes up the associated regression test, which wasn't properly
covering the malformed UTF-8 case.

    // Before #194
    dataset/google_message3_2/decode
                            time:   [219.45 ms 219.48 ms 219.51 ms]
                            thrpt:  [458.64 MiB/s 458.72 MiB/s 458.78 MiB/s]

    // #194
    dataset/google_message3_2/decode
                            time:   [226.08 ms 226.11 ms 226.18 ms]
                            thrpt:  [445.12 MiB/s 445.25 MiB/s 445.32 MiB/s]

    // This commit
    dataset/google_message3_2/decode
                            time:   [219.05 ms 219.76 ms 221.78 ms]
                            thrpt:  [453.94 MiB/s 458.12 MiB/s 459.61 MiB/s]
danburkert added a commit that referenced this pull request Jun 17, 2019
3ab36fda9a1788e0 introduced a ~3% regression on some decoding
benchmarks. My best guess is that swapping the value temporarily with a
new string is not being optimized well. This commit reverts to using
String::as_mut_vec, but carefully clears the string on decode failure.
It also fixes up the associated regression test, which wasn't properly
covering the malformed UTF-8 case.

    // Before #194
    dataset/google_message3_2/decode
                            time:   [219.45 ms 219.48 ms 219.51 ms]
                            thrpt:  [458.64 MiB/s 458.72 MiB/s 458.78 MiB/s]

    // #194
    dataset/google_message3_2/decode
                            time:   [226.08 ms 226.11 ms 226.18 ms]
                            thrpt:  [445.12 MiB/s 445.25 MiB/s 445.32 MiB/s]

    // This commit
    dataset/google_message3_2/decode
                            time:   [219.05 ms 219.76 ms 221.78 ms]
                            thrpt:  [453.94 MiB/s 458.12 MiB/s 459.61 MiB/s]
@Shnatsel
Copy link

This RP seems to fix unsoundness that can be triggered in practice. Please file a security advisory at https://github.com/RustSec/advisory-db so that dependents can check if their crate is affected and upgrade.

danburkert added a commit that referenced this pull request Jan 12, 2020
3ab36fda9a1788e0 introduced a ~3% regression on some decoding
benchmarks. My best guess is that swapping the value temporarily with a
new string is not being optimized well. This commit reverts to using
String::as_mut_vec, but carefully clears the string on decode failure.
It also fixes up the associated regression test, which wasn't properly
covering the malformed UTF-8 case.

    // Before #194
    dataset/google_message3_2/decode
                            time:   [219.45 ms 219.48 ms 219.51 ms]
                            thrpt:  [458.64 MiB/s 458.72 MiB/s 458.78 MiB/s]

    // #194
    dataset/google_message3_2/decode
                            time:   [226.08 ms 226.11 ms 226.18 ms]
                            thrpt:  [445.12 MiB/s 445.25 MiB/s 445.32 MiB/s]

    // This commit
    dataset/google_message3_2/decode
                            time:   [219.05 ms 219.76 ms 221.78 ms]
                            thrpt:  [453.94 MiB/s 458.12 MiB/s 459.61 MiB/s]
danburkert added a commit that referenced this pull request Jan 12, 2020
3ab36fda9a1788e0 introduced a ~3% regression on some decoding
benchmarks. My best guess is that swapping the value temporarily with a
new string is not being optimized well. This commit reverts to using
String::as_mut_vec, but carefully clears the string on decode failure.
It also fixes up the associated regression test, which wasn't properly
covering the malformed UTF-8 case.

    // Before #194
    dataset/google_message3_2/decode
                            time:   [219.45 ms 219.48 ms 219.51 ms]
                            thrpt:  [458.64 MiB/s 458.72 MiB/s 458.78 MiB/s]

    // #194
    dataset/google_message3_2/decode
                            time:   [226.08 ms 226.11 ms 226.18 ms]
                            thrpt:  [445.12 MiB/s 445.25 MiB/s 445.32 MiB/s]

    // This commit
    dataset/google_message3_2/decode
                            time:   [219.05 ms 219.76 ms 221.78 ms]
                            thrpt:  [453.94 MiB/s 458.12 MiB/s 459.61 MiB/s]
danburkert added a commit that referenced this pull request Jan 12, 2020
3ab36fda9a1788e0 introduced a ~3% regression on some decoding
benchmarks. My best guess is that swapping the value temporarily with a
new string is not being optimized well. This commit reverts to using
String::as_mut_vec, but carefully clears the string on decode failure.
It also fixes up the associated regression test, which wasn't properly
covering the malformed UTF-8 case.

    // Before #194
    dataset/google_message3_2/decode
                            time:   [219.45 ms 219.48 ms 219.51 ms]
                            thrpt:  [458.64 MiB/s 458.72 MiB/s 458.78 MiB/s]

    // #194
    dataset/google_message3_2/decode
                            time:   [226.08 ms 226.11 ms 226.18 ms]
                            thrpt:  [445.12 MiB/s 445.25 MiB/s 445.32 MiB/s]

    // This commit
    dataset/google_message3_2/decode
                            time:   [219.05 ms 219.76 ms 221.78 ms]
                            thrpt:  [453.94 MiB/s 458.12 MiB/s 459.61 MiB/s]
dunmatt pushed a commit to dunmatt/prost that referenced this pull request Apr 1, 2020
3ab36fda9a1788e0 introduced a ~3% regression on some decoding
benchmarks. My best guess is that swapping the value temporarily with a
new string is not being optimized well. This commit reverts to using
String::as_mut_vec, but carefully clears the string on decode failure.
It also fixes up the associated regression test, which wasn't properly
covering the malformed UTF-8 case.

    // Before tokio-rs#194
    dataset/google_message3_2/decode
                            time:   [219.45 ms 219.48 ms 219.51 ms]
                            thrpt:  [458.64 MiB/s 458.72 MiB/s 458.78 MiB/s]

    // tokio-rs#194
    dataset/google_message3_2/decode
                            time:   [226.08 ms 226.11 ms 226.18 ms]
                            thrpt:  [445.12 MiB/s 445.25 MiB/s 445.32 MiB/s]

    // This commit
    dataset/google_message3_2/decode
                            time:   [219.05 ms 219.76 ms 221.78 ms]
                            thrpt:  [453.94 MiB/s 458.12 MiB/s 459.61 MiB/s]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unsound use of String::as_mut_vec
4 participants