Add some safety docs #720

Manishearth · 2024-04-26T21:28:25Z

I was performing a safety review of this crate and found the existing string comments a bit hard to follow. In particular, a lot of them refer to functions outside of the module for their invariants, which feels sketchy, until you realize that most of them are relying on relatively local invariants based on constants.

I added explicit safety comments for all of the unsafe in toml_edit. However I'm unable to figure out why mlb_quotes and mll_quotes are safe. There is no call to bytes there, and it seems to be using a terminated parser.

crates/toml_edit/src/parser/numbers.rs

        .map(|b: &[u8]| unsafe { from_utf8_unchecked(b, "`digit` and `_` filter out non-ASCII") })
        .parse_next(input)
+    /// Safety-usable invariant upheld by only using `digit` and `_` in the parser


crates/toml_edit/src/parser/numbers.rs

        .map(|b: &[u8]| unsafe { from_utf8_unchecked(b, "`digit` and `_` filter out non-ASCII") })
        .parse_next(input)
+    /// Safety-usable invariant upheld by only using `digit` and `_` in the parser


epage · 2024-04-26T21:48:23Z

crates/toml_edit/src/parser/datetime.rs

+        // Safety: `digit` only produces ASCII
        .map(|b: &[u8]| unsafe { from_utf8_unchecked(b, "`is_ascii_digit` filters out on-ASCII") })


This is duplicating the debug assertion message. Is that a specific special comment syntax you are needing?

The comment is stale from our conversion from is_ascii_digit to DIGIT

@epage Yeah so this comment is twofold, for one I wasn't sure what the debug message was referring to, and the other thing is that it is a bit of a norm to justify safety with a Safety: comment. I think the debug message mostly satisfies the need, but I got super confused as an unsafe reviewer so decided to just do it the proper way.

I can understand the standard convention but we are double checking the invariants in debug builds and I want those panics to carry over what assumption is being broken, so we have the message already and I don't want to be doubling up on the message. Or put another way, this is an extra level of protection over a simple comment.

epage · 2024-04-26T21:49:18Z

crates/toml_edit/src/parser/key.rs

@@ -101,6 +102,7 @@ pub(crate) fn is_unquoted_char(c: u8) -> bool {
    UNQUOTED_CHAR.contains_token(c)
 }

+// Safety-usable invariant: UNQUOTED_CHAR is only ASCII ranges


Why do we need this comment?

It's useful to make safety as local as possible; split out what needs to be verified. If some function is depending on some other function's behavior, it's useful to document that on the function (or constant). For two reasons:

when verifying, you don't have to chase down invariants and make sense of them each time

when modifying code, it's clear what additional properties one must uphold.

I'm following a higher standard of unsafe code commenting, one that I do not necessarily apply during review but try to follow to make the lives of future reviewers easier when properly unsafe commenting a crate.

(The general framing for that higher standard is "unsafe code should be documented such that an unsafe reviewer can easily validate the code whilst maintaining minimal state in their head")

So this isn't 100% necessary, but I felt if I was adding comments I might as well do this anyway.

My concern is that this is likely to go stale without a linter that can process the links of safety invariants.

Well, yes. Ideally people maintaining unsafe code also keep this up to date, but if not, it'll get caught on future unsafe reviews when updates are pushed.

Basically, had these comments been there, my safety review would have been much less work. If these comments are there and incorrect, it would still be less work because it's quite easy to verify incorrectness when it's local.

epage · 2024-04-27T00:40:13Z

crates/toml_edit/src/parser/strings.rs

+            // Safety: ???
            .map(|b| unsafe { from_utf8_unchecked(b, "`bytes` out non-ASCII") })


t. However I'm unable to figure out why mlb_quotes and mll_quotes are safe. There is no call to bytes there, and it seems to be using a terminated parser.

terminated says to ignore the second parser, only returning the result of the first parser. So in this case, we are effectively calling from_utf8_unchecked("\""). So the invariant we are upholding is that " is ASCII.

epage · 2024-04-27T00:41:07Z

btw some core assumptions to the parser are

While we parse using bytes, the only content that can enter this parser started as &str and so we can assume the data is valid UTF-8
At every point, we should only be splitting text on character boundaries

So while the other safety comments are more specific, we have an extra layer of invariants helping out.

Manishearth · 2024-05-02T20:52:35Z

While we parse using bytes, the only content that can enter this parser started as &str and so we can assume the data is valid UTF-8

At every point, we should only be splitting text on character boundaries

Yes, however these assumptions are extremely nonlocal and thus overall not very useful for unsafe review.

Manishearth · 2024-05-02T20:55:06Z

Fixed comments about terminated

github-advanced-security bot found potential problems Apr 26, 2024

View reviewed changes

epage reviewed Apr 26, 2024

View reviewed changes

Add some safety docs

c3d0870

Manishearth force-pushed the safety branch from de9fe50 to c3d0870 Compare April 26, 2024 22:02

epage reviewed Apr 27, 2024

View reviewed changes

fix terminated

f3b6185

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add some safety docs #720

Add some safety docs #720

Manishearth commented Apr 26, 2024

epage Apr 26, 2024

Manishearth Apr 26, 2024

epage Apr 27, 2024

epage Apr 26, 2024

Manishearth Apr 26, 2024

epage Apr 27, 2024

Manishearth Apr 27, 2024

Manishearth Apr 27, 2024

epage Apr 27, 2024

epage commented Apr 27, 2024

Manishearth commented May 2, 2024

Manishearth commented May 2, 2024

		// Safety: `digit` only produces ASCII
		.map(\|b: &[u8]\| unsafe { from_utf8_unchecked(b, "`is_ascii_digit` filters out on-ASCII") })

		// Safety: ???
		.map(\|b\| unsafe { from_utf8_unchecked(b, "`bytes` out non-ASCII") })

Add some safety docs #720

Are you sure you want to change the base?

Add some safety docs #720

Conversation

Manishearth commented Apr 26, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

epage commented Apr 27, 2024

Manishearth commented May 2, 2024

Manishearth commented May 2, 2024