Any way to "forge" provenance? #466

Open
joshlf opened this issue Oct 3, 2023 · 11 comments

@joshlf

joshlf commented Oct 3, 2023

As part of google/zerocopy#170, zerocopy is trying to figure out how to support users who need to dereference pointers received over FFI (including in a kernel (or kernel emulator), where the other side is a user-space process providing pointers into memory which the kernel (emulator) has access to). These pointers are passed as untyped bytes, as C void pointers, etc. The user needs to perform some validation (bounds checking etc) and then dereference these pointers. As discussed in the linked issue, this presents a serious footgun since the pointers may not have valid provenance after being round-tripped through a byte representation or other untyped representation.

We're hoping to support this use case by providing an API which can convert &[u8] or some other untyped representation to &T where T contains raw pointers that can be soundly dereferenced. In order to do this, we need to ensure that the following holds: If a user has obtained an object via FFI or some other not-visible-to-Rust mechanism, if our mechanism converts that object to one or more raw pointers, and if certain facts hold about the pointers (they've been bounds checked, they point into "external" memory, etc), then dereferencing those pointers is sound.

My question is: Is it possible, inside of a function with the signature &T -> &U where T may be [u8] or some other "untyped" representation and U contains raw pointers, to ensure that such future operations will be sound? I assume that this question is equivalent to asking: Is it possible to forge provenance such that the compiler will understand future pointer operations to have valid provenance, and thus be sound? But maybe that's not the whole story? I assume this is at a minimum possible to do inside the compiler or inside the standard library, as these need to support extern "C" fn, syscalls, etc.

@chorman0773
Contributor

chorman0773 commented Oct 3, 2023

If the compiler cannot see what FFI code (or any unanalyzed code in general) does, it will assume that the code may do anything it could legally do.

When doing untyped copies (via copy_nonoverlapping/copy) or typed copies as certain types (MaybeUninit<u8> is the main one, and arrays thereof, but really any union type w/o any interior or tail padding will most likely have this behaviour), the bytes are preserved exactly from the source to the destination, including any provenance.

Inventing provenance from thin air, such as for a memory-mapped I/O device or, in the case of a kernel, the basic kernel allocator, can be done by an as cast or by a strict_provenance function (I'm not sure whether this function is defined yet, or what its current name is). In both of these cases, the provenance cannot refer to any allocations "owned" by the Abstract Machine (AM), though an as cast can also pick up any of those allocations which have been exposed. Pointers to userspace would presumably already be allocated, so the provenance does exist.
Obviously you cannot prove to the compiler that the access would be sound, because these mechanisms do not by themselves make it sound (for example, casting an unmapped address to a pointer won't give you valid provenance, because accessing that address will raise an exception). They do, however, make the compiler assume that the pointer has valid provenance (or, in the case of the as cast, that it may have). The compiler is unlikely to get angry unless it knows for a fact that the access is not valid (for example, when accessing the address causes an exception).

@RalfJung
Member

RalfJung commented Oct 4, 2023

I'm having a hard time understanding what exactly you are asking.

Generally the first strategy should be to try to preserve provenance. The type for "untyped memory that may contain provenance" is MaybeUninit<u8>. That type can be used to pass provenance from C code to Rust code and back. Using u8 doesn't work since integers do not carry provenance. But of course that only applies to by-value data: passing a struct with a [u8; 64] field will preserve only the address, not the provenance, of any pointer in that buffer. A struct can, however, contain a *mut [u8; 64] pointer, which you can later offset and cast to *mut *const () and still do loads with provenance.
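
As a minimal sketch of the "preserve provenance" strategy (not code from the thread; the helper names `stash` and `unstash` are hypothetical), a pointer can be written into a `MaybeUninit<u8>` buffer and read back later. Under the model described above, the bytes are expected to keep the pointer's provenance, which a `[u8; N]` buffer would not:

```rust
use std::mem::{size_of, MaybeUninit};
use std::ptr;

/// Size of one raw pointer, in bytes.
const PTR_SIZE: usize = size_of::<*const u32>();

/// Hypothetical helper: store a pointer in an untyped byte buffer.
/// The element type is `MaybeUninit<u8>`, so the bytes can carry the
/// pointer's provenance as well as its address.
fn stash(p: *const u32) -> [MaybeUninit<u8>; PTR_SIZE] {
    let mut buf = [MaybeUninit::<u8>::uninit(); PTR_SIZE];
    // SAFETY: `buf` is exactly `PTR_SIZE` bytes long, and `write_unaligned`
    // places no alignment requirement on the destination.
    unsafe { ptr::write_unaligned(buf.as_mut_ptr().cast::<*const u32>(), p) };
    buf
}

/// Hypothetical helper: read the pointer back out of the buffer.
///
/// # Safety
/// `buf` must have been produced by `stash`, so that every byte is
/// initialized and together they form a pointer value.
unsafe fn unstash(buf: &[MaybeUninit<u8>; PTR_SIZE]) -> *const u32 {
    // SAFETY: guaranteed by the caller, see above.
    unsafe { ptr::read_unaligned(buf.as_ptr().cast::<*const u32>()) }
}
```

Whether the recovered pointer may actually be dereferenced still depends on the original allocation being live; the sketch only shows that the round trip itself does not go through an integer type.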

If preserving provenance is not possible, the alternative is expose_addr/from_exposed_addr. That's the only "forging" of provenance. Not being otherwise forgeable is literally the only reason provenance is useful.
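
For illustration only, here is a small sketch of the expose/from-exposed round trip. It uses plain `as` casts, which the standard library documents as having the same semantics as these functions, so it does not depend on the unstable names used in this thread (as far as I know, later Rust releases provide them as `expose_provenance` and `with_exposed_provenance`). The function name `bump_via_integer` is made up for the example:

```rust
/// Round-trips a pointer through a plain integer and uses the result.
fn bump_via_integer(value: &mut u32) -> u32 {
    let p: *mut u32 = value;
    // Pointer-to-integer `as` cast: yields the address and marks the
    // provenance of `p` as exposed.
    let addr = p as usize;
    // Integer-to-pointer `as` cast: may pick up a previously exposed
    // provenance that makes the following accesses valid.
    let q = addr as *mut u32;
    // SAFETY: `q` has the address of `*value`, whose provenance was exposed
    // above, and `value` is not used again before this function returns.
    unsafe {
        q.write(q.read() + 1);
        q.read()
    }
}
```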

I think a more concrete example (as small as possible :) would help. I don't know if you are asking for versions of expose_addr/from_exposed_addr that work on arbitrary buffers (which seems like a reasonable operation but we'd have to figure out the API) or something more radical.

@joshlf
Author

joshlf commented Oct 22, 2023

Here's a (made-up) example. Imagine that I'm implementing a kernel, and I have a syscall with the following signature:

#[repr(C)]
struct Buffer {
    base: *const u8,
    len: usize,
    next: *const Buffer,
}

extern "C" fn do_thing_with_buffers(buf: Buffer);

do_thing_with_buffers will be invoked from outside of Rust (ie, by userland).

There are two scenarios we can imagine:

  • The provided pointers point into kernel space. The kernel performs bounds checking and, once it succeeds, uses the pointers like normal.
  • The provided pointers point into user space. The kernel is able to translate the pointers into pointers whose addresses are valid in the context of the kernel's address space (i.e. by translating to kernel virtual memory addresses which correspond to the same physical memory pages that the userland pointers refer to), after which it can use the pointers like normal.

The question is: In each of the two cases, how can we produce pointers with valid provenance that can be used as vanilla pointers in the kernel's address space? I intentionally chose Buffer to be a linked list so that we need to contend with nested pointers.

@RalfJung
Member

The syscall interface is an asm-level FFI boundary. There's no provenance there.

To access user memory, the kernel needs to either keep track of the provenance that is used for "all userland memory", or just make it exposed. In the former case, it can use user_provenance.with_addr(addr_from_syscall_buffer) to obtain a pointer with the right provenance; in the latter case it can use ptr::from_exposed_addr(addr_from_syscall_buffer).

For accessing kernel memory, it's the same story. Presumably here it will know the pointer points into some buffer, so it should probably do buffer.as_ptr().with_addr(addr_from_syscall_buffer) -- that will result in the same pointer that we would have gotten if the user just provided an offset and we did buffer[offset].
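
A rough sketch of the kernel-buffer case, with the bounds check done before the pointer is rebuilt. The helper name `ptr_in_buffer` and its parameters are hypothetical, and the sketch assumes a toolchain where `addr` and `with_addr` are available on stable (Rust 1.84 or later, to my knowledge; at the time of this thread they required `feature(strict_provenance)`):

```rust
/// Hypothetical helper: turn an untrusted address into a pointer that
/// carries the provenance of `buffer`, or return `None` if the range
/// `addr .. addr + len` is not fully inside `buffer`.
fn ptr_in_buffer(buffer: &[u8], addr: usize, len: usize) -> Option<*const u8> {
    let base = buffer.as_ptr();
    // Offset of the requested address from the start of the buffer;
    // `checked_sub` rejects addresses below the buffer.
    let offset = addr.checked_sub(base.addr())?;
    // End of the requested range, rejecting arithmetic overflow.
    let end = offset.checked_add(len)?;
    if end > buffer.len() {
        return None;
    }
    // Same address as requested, provenance taken from `buffer`.
    Some(base.with_addr(addr))
}
```

The user-memory case would presumably have the same shape, with `buffer` replaced by whatever pointer the kernel keeps for "all userland memory".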

@comex

comex commented Oct 28, 2023

> To access user memory, the kernel needs to either keep track of the provenance that is used for "all userland memory", or just make it exposed.

That leaves the question of what it means to "make [some memory] exposed" when you are a kernel and are directly manipulating the page tables to map data at address X. I think the correct answer is to just treat the data you mapped as already exposed, right? But is that documented somewhere?

This is similar to other cases like

  • microcontrollers with a fixed memory map where you might just have some hardcoded integer address you want to cast to a pointer
  • some userland low-level memory mapping APIs, e.g. ones that use integer rather than pointer types to represent addresses

@RalfJung
Member

> That leaves the question of what it means to "make [some memory] exposed" when you are a kernel and are directly manipulating the page tables to map data at address X. I think the correct answer is to just treat the data you mapped as already exposed, right?

Exposing is a "ghost operation", so any inline asm block (without nomem/readonly) can be declared to "expose" any allocation that it has access to.

> But is that documented somewhere?

No... we don't have enough established consensus about what provenance is and how it works to even really start officially documenting this. :/

@joshlf
Author

joshlf commented Oct 29, 2023

> But is that documented somewhere?

> No... we don't have enough established consensus about what provenance is and how it works to even really start officially documenting this. :/

One thing we'd like to do in zerocopy eventually is be able to make a complete guarantee (where "complete" means "no holes in our logic") that all code is sound based only on the language semantics. Historically, we've taken an approach of trying to get "lower bounds" documented in the Reference or stdlib docs (ie, "if your code does this, it's definitely sound; if it doesn't do this, it might still be sound, but we can't guarantee it"). It feels like that approach would be appropriate here: we could agree on a strictest-possible definition of strict provenance, and add it to the Reference and say, "as long as your program abides by these rules, it's definitely sound." That doesn't preclude the ability to relax the rules in the future - and we can disclaim as much in the text. It also doesn't require us to stabilize an API for provenance (ie, feature(strict_provenance)). It's almost as forwards-compatible as the current state of affairs. The only way in which it constrains us is that we could never require code be more strict than strict provenance in the future, but it sounds like nobody wants that anyway.

@RalfJung
Member

RalfJung commented Oct 29, 2023 via email

@saethlin
Member

I for one would be quite happy to see the strict provenance APIs stabilized, with that one omitted for now. It's not a big deal to just do null().with_addr instead.
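
A tiny sketch of the `null().with_addr` pattern mentioned here; the wrapper name `address_only` is hypothetical, and `with_addr`/`addr` are assumed to be available (stable since Rust 1.84, to my knowledge). The result carries the requested address but no provenance, so it is useful as a sentinel or tagged value and must not be used to access memory:

```rust
use std::ptr;

/// Hypothetical helper: build a pointer that has the given address and the
/// (empty) provenance of the null pointer.
fn address_only(addr: usize) -> *const u8 {
    ptr::null::<u8>().with_addr(addr)
}
```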

@RalfJung
Member

RalfJung commented Oct 29, 2023 via email

@joshlf
Author

joshlf commented Oct 29, 2023

> @joshlf reading the discussion here again, it seems for your question we would have to stabilize the idea of exposed provenance. That's a bigger ask than the core of strict provenance. from_exposed_addr with its angelic choice is the most sketchy part of our entire op.sem... I don't see us guaranteeing much about int2ptr casts any time soon, unfortunately. That operation is deeply cursed and hideously hard to specify well.

For this specific issue, I agree. However, we're also generally interested in stabilizing strict provenance - even the subset that doesn't include the discussion in this issue. It'd be a huge step forward because it's the last significant part of Rust's memory model that zerocopy needs to rely on which is still unspecified.
