
Could we have Rust without provenance? #287

Closed
RalfJung opened this issue Jun 13, 2021 · 38 comments
Labels
A-provenance Topic: Related to when which values have which provenance (but not which alias restrictions follow)


RalfJung (Member) commented Jun 13, 2021

Every time I make provenance-based arguments, some people dispute that the need for provenance exists in the first place. I disagree with them, and I am not alone in this, but it is a very reasonable objection nevertheless.

I will not be able to fairly represent the arguments of the no-provenance faction, so I hope @digama0 will reply here summarizing the key arguments and explaining how alias analysis can reasonably be done in a compiler without provenance -- or, alternatively, how a "Rust without provenance" could be compiled to a backend like LLVM that demonstrably does have provenance as support for its alias analysis.

@RalfJung RalfJung added the A-provenance Topic: Related to when which values have which provenance (but not which alias restrictions follow) label Jun 13, 2021
digama0 commented Jun 13, 2021

To begin with, the model I proposed in the linked thread is not about eliminating all provenance. It is about treating the *mut T type as equivalent to usize, as a provenance free integer, while treating &mut T and &T and &Cell<T> and Box<T> as references with stacked borrows style provenance.

The lowering of all the operations into LLVM is fairly easy to guess from this description:

  • (x: &mut T) as *mut T is ptrtoint(x)
  • *(x: *mut T) is *inttoptr(x)
  • (x: *mut T) as usize is x
  • (x: usize) as *mut T is x

An alternative lowering maps *mut T to LLVM pointers, but in this case one has to be careful to erase the provenance of all pointers on entry to a function and when reading new values (roughly in the same places as the SB retag operation). The calling convention for Rust functions would ensure that one only passes LLVM-provenance-free pointers to a function, so adding an additional erase_provenance(x) on all pointers x: *T coming into an LLVM function is valid.

The main issue with this proposal is that it prevents most alias analysis on such pointers. Here is an example:

fn foo() {
  let mut x = 0;
  let mut y = 0;
  let x_ptr = &mut x as *mut u8;
  let y_ptr = &mut y as *mut u8;
  *y_ptr = 1;
  *x_ptr = 2;
  println!("{}", *y); // rewrite `*y` to 1? Yes!
}

It isn't possible to perform alias analysis directly here based on the separateness of the allocations to deduce that the write to x doesn't invalidate y, but the information is local enough that we can still figure out what the write will hit. If we offset the pointer by a compiler-unknown amount, alias analysis is defeated:

fn foo(n: usize) {
  let mut x = 0;
  let mut y = 0;
  let x_ptr = (&mut x as *mut u8).wrapping_add(n);
  let y_ptr = &mut y as *mut u8;
  *y_ptr = 1;
  *x_ptr = 2;
  println!("{}", *y); // rewrite `*y` to 1? No!
}

Here, LLVM-style provenance would permit the rewrite, but when pointers are just integers, it is possible for n to translate x_ptr so that it equals y_ptr, in which case the two pointers alias.

I contend that this problem is far less disruptive for Rust than it would be in C/C++. Rust already has a much better alias analysis, in the form of references and stacked borrows semantics, so pointers are usually used in low level code in small functions that don't directly benefit much from such optimizations. However, issues start to crop up when these functions are inlined, and I think the main question will hinge on the performance of commonly used functions like IndexMut for Vec that could add raw pointer operations into idiomatic Rust code.

digama0 commented Jun 13, 2021

I think the main question will hinge on the performance of commonly used functions like IndexMut for Vec that could add raw pointer operations into idiomatic Rust code.

Actually IndexMut for Vec isn't a problem, because it can be done entirely with references: &mut Vec<T> -> &mut [T] -> &mut T -> write, so we actually know that the write is inbounds of the relevant allocation. That means that in let mut x = vec![0]; let y = vec![0]; x[i] = 1; use(y[0]) we can still optimize y[0] to 0 even though we have an indirect write, because that write first had to get mutable access to the value, which is prevented by y's ownership of its data.

thomcc (Member) commented Jun 13, 2021

I think this would cause performance problems for iterators, like slice::Iter, and various data structures.

But I also think an opt-in type could be provided for such cases — something like Unique (though probably something distinct from Unique) that behaves like a pointer type with the "current" provenance semantics in order to get more optimizations.

In general, I am heavily in favor of this investigation. The current semantics lead to a sort of "spooky action at a distance" that I thought I had left behind when switching to a computer science major years ago at university (😉). It's very hard to reason about in practice, and a staggering amount of unsafe code in the wild gets it wrong. (That code is likely only saved by the fact that we don't do anything too aggressively inter-procedural)

Diggsey commented Jun 13, 2021

It would definitely be interesting to experiment with a model where alias and/or provenance information can be explicitly provided by the programmer for raw pointers via some kind of attribute.

Given that raw pointers are very rarely used in Rust, it could make more sense to go this route rather than having complicated and confusing provenance rules.

One could even imagine re-defining references to be equivalent to pointers with a particularly restrictive combination of those attributes applied.

RalfJung (Member, Author) commented Jun 14, 2021

To begin with, the model I proposed in the linked thread is not about eliminating all provenance. It is about treating the *mut T type as equivalent to usize, as a provenance free integer, while treating &mut T and &T and &Cell and Box as references with stacked borrows style provenance.

Okay. (I thought you had proposed multiple different models, but I am okay with focusing on that one for now.)

But I don't think this solves #286; now we just have that same problem with reference-to-int (and even reference-to-ptr) transmutes. As long as there is provenance anywhere in the logic, this problem will arise.

(When I said "pointer" in #286, I meant all types carrying pointer values -- references, raw pointers, function pointers. I should have been more clear about this.)

RalfJung (Member, Author):

Regarding the proposal itself, I am skeptical because one reason people reach to unsafe code is performance -- so if the pointer model we give them there is even less optimizable than what C does, that will be a problem. For example, the only reason wrapping_offset even exists is that enough people thought it is really important to preserve provenance.

If we have two kinds of raw pointers (one with and one without provenance), then we get plenty of subtle issues when converting between the two -- in fact, those will be exactly the same issues that we currently have when converting between pointers and integers. So in a sense we can already have a "raw pointer without provenance" as a user-defined type:

struct PtrWithoutProvenance<T>(usize, PhantomData<*mut T>);

I don't quite see how turning this into a primitive type would make the language easier to learn.

In general, I am heavily in favor of this investigation. The current semantics lead to a sort of "spooky action at a distance" that I thought I had left behind when switching to a computer science major years ago at university (wink).

I am not sure what you are referring to by "action at a distance" here. The underlying problem is that many people have a fundamentally wrong model of what pointers look like. Naturally, when you come in with an incorrect mental model, the behavior of the "real thing" will look spooky and non-sensical.

The "only" thing people need to accept and learn when using raw pointers is that each pointer is tied to a particular allocated object. A pointer consists of some address it points to, plus some unique ID identifying the object it is associated with. Whenever a pointer is used, if the addresses it accesses are outside of the object identified by that ID, we have UB. I don't think this is complicated, and there is nothing going on "at a distance" here. The problem is not that these rules are complicated, the problem is that people don't expect any such rules, because there is a lot of false information and bad teaching material out there that makes people believe pointers in C or Rust work the same as addresses in assembly.

thomcc (Member) commented Jun 14, 2021

But I don't think this solves #286; now we just have that same problem with reference-to-int (and even reference-to-ptr) transmutes. As long as there is provenance anywhere in the logic, this problem will arise.

Yeah, I suspected this was the case in general, and was in the middle of writing this last night before I went to bed. I didn't post it because I wanted to look at it with clear eyes, but I think it's mostly still relevant, although part of it is not (or at least, it's now confirmed), so you should be aware of that going in.


(Begin post written before most recent replies)

That said, would pointers not having provenance fix the "spooky action at a distance" issue that I'm alluding to? I think the answer is no, although perhaps it would reduce the severity of it.

Like, I could be wrong (I have a track record of being... very wrong about this stuff, let's say) but I think this does very little if anything to fix the most common issues I've seen, which is where some reference is created (either accidentally, because of ergonomics, etc) and it invalidates the provenance of a pointer.

I think the only way to fix that issue is reducing the provenance of references. I'm fairly confident we don't want that, or at least it's not being proposed here...

That said, I guess I do think it's worth seeing what the performance impact of a "Rust without provenance" would be. I'm guessing all the talk of it being unoptimizable means we expect it to be worse than "C++ with -fno-strict-aliasing"? Is there alias analysis that C/C++ compilers will do that's not the TBAA sort? Sorry, every compiler I've worked on was very heavily in the "very much a DSL" camp, and even then it was quite a long time ago.

If C++-sans-TBAA is the same: in practice... honestly, strict aliasing doesn't help much at all — it helps some, but code is still pretty fast without it. When I wrote more C++, I usually would turn it off, just because I didn't want to worry about UB in a library accidentally biting me, and because I won't miss the ~5% speed hit (this is about what I've seen on average, and if you google around, that's in the same ballpark as what others seem to have seen, so it seems plausible).

That said, even a 5% speed hit is a tough pill to swallow across all code... Especially since some code will get hit much harder, and this comes with the threat of making it harder to optimize things in the future... Realistically, for Rust it's probably higher than 5% too, no? — I think we generally believe that Rust's alias analysis is much more helpful than C++'s, especially where removing redundant loads from inner loops is concerned, which could have a big impact...

OTOH it's just 5% though, it might not be worth it when you compare to the sorts of algorithmic-ish optimizations that libraries would be able to perform without those restrictions. Increased flexibility there would be very helpful, and could easily make up a perf difference like that.

Anyway, all of that is just speculation, but without more details it's hard for someone like me to try this out. Is removing noalias from the LLVM IR that rustc generates enough? Do we need to remove dereferenceable too? Or do LLVM's pointers already imply some amount of provenance, such that to avoid it we need to move everything to integers?

(End post written before most recent replies)


So yeah, after some thought:

  1. I think we should be pretty clear about what code we want to enable when proposing something like this. (Doing so will be critical to "sell it" more broadly, especially if it comes with any kind of performance hit)

    This is especially true for the half-measures like "removing pointer prov. but keeping it for refs", which seemed more promising until I thought about it, and seems even less useful now that Ralf has chimed in.

  2. Does anybody know if Rust without prov. would be equivalent in terms of optimizability to, say, C/C++ with -fno-strict-aliasing? I'm assuming there's more to it.

    If there's not, I think it's pretty misleading to say something like "it is not possible to write a reasonably optimizing compiler for a model where pointers have no provenance" — in practice that drastically overstates how much those optimizations help at all. IIRC, MSVC didn't even start doing them at all until a year or two ago.

    That's not to say we can definitely give it up, but just that it's not so far-gone that it should be thrown out from the start.

  3. If we wanted to try this, what's needed?

    • Disabling noalias (for both &T and &mut T) (this is easy)
    • Do we need to remove dereferenceable too? (this is also easy, I think, but it seems like it wouldn't be needed)
      • Is there anything else? Presumably align and nonnull can stay?
    • Do we need to go all the way to casting every pointer to int and back? It was kind of implied earlier, but I'm not sure. I'm guessing that if it's needed, it's to avoid some "latent" provenance already present in LLVM-IR pointers?

One final note I have to say something about though:

Regarding the proposal itself, I am skeptical because one reason people reach to unsafe code is performance -- so if the pointer model we give them there is even less optimizable than what C does, that will be a problem

This is sort of not right. I do often reach for unsafe for performance, but the performance I get is usually not from compiler optimizations, it's various sorts of algorithmic ones. Let's stretch a bit and call "explicit vectorization" or other stuff of that ilk an algorithmic optimization, but it often is a real algorithmic change too — often in the form of a smarter data structure, or more clever synchronization, and that sort of thing.

In general, I reach for it when the optimizer has already failed me, and I'm taking matters into my own hands. For sure, there are cases where, when I do a thing with unsafe, it's to help the optimizer out, but I don't think this is really the common case at all. I can't speak for other authors of unsafe Rust, but I think this is pretty typical.

This is largely because the sorts of speed-ups you can get from doing something in a smarter way... They're real, easy to measure, easy to iterate on, and almost always much larger than almost anything a smarter compiler will give you. (Well, unless you're saying something like "this will turn your code into -O0 code" which I have a... hard time believing).

chorman0773 (Contributor) commented Jun 15, 2021

I'm guessing all the talk of it being unoptimizable means we expect it to be worse than "C++ with -fno-strict-aliasing"? Is there alias analysis that C/C++ compilers will do that's not the TBAA sort?

Yes, the same kinds of things Rust can do on raw pointers: i.e., can this pointer x access/modify object y.
A simple example (C++ code, though the same applies to C):

void bar(int* i);
int foo() {
    int x;
    int y;
    int* px = &x;
    int* py = &y;
    *px = 5;
    bar(py);
    return x;
}

That can be transformed to

int foo() {
    int y;
    bar(&y);
    return 5;
}

because of provenance and local alias analysis. IMO, provenance is a source of local reasoning about foreign code. Without provenance, one could argue it's valid for the py passed to bar to be used to access x, the pointee of px, rendering the assignment elision invalid: bar may access x, and since the compiler cannot inspect bar, it must assume it does so.

Likewise, any time you have a translation unit boundary in a Rust program, provenance becomes necessary to ensure the scope of accesses. In rustc currently, translation unit boundaries are crates. In lccc, the intention is to convert each source file into a translation unit, to improve parallelization (though this can be disabled).

Diggsey commented Jun 15, 2021

@chorman0773 in your example, it seems like you can achieve the same optimization using escape-analysis alone.

While bar could materialize any random address and write to it, the compiler can see that the address of neither x nor px escapes from the function, so bar has no way of knowing for sure that it's writing to x.

chorman0773 (Contributor):

While bar could materialize any random address and write to it, the compiler can see that the address of neither x nor px escapes from the function, so bar has no way of knowing for sure that it's writing to x.

I'd argue that escape-analysis is a form of provenance. The same applies if there is a deterministic way to reach x's address from y, but not the actual object (for example, if they are adjacent fields with the same access modifier in a non-standard layout structure).

RalfJung (Member, Author) commented Jun 20, 2021

@thomcc

That said, I guess I do think it's worth seeing the performance impact of a "Rust without provenance" would be. I'm guessing all the talk of it being unoptimizable means we expect it to be worse than "C++ with -fno-strict-aliasing"?

This is an interesting question. Performance would probably be the same as "C/C++ without provenance", which is very different from "C/C++ with -fno-strict-aliasing". I don't think the experiment of building such a compiler has been done. Such a compiler would essentially have to treat every local that ever has its address taken as escaped, even if that address is not given away, since someone else might "guess" the right address. (You can sometimes argue that guessing is impossible, but that is non-trivial. This is the argument @Diggsey also makes.)

You seem to think provenance in C/C++ is only about TBAA. That is not so, even C/C++ sans TBAA have provenance. It is reflected in the standard where it says that doing pointer arithmetic on a pointer cannot be used to create a pointer that points to a different allocated object. In C/C++/Rust without provenance, we would have to account for the fact that any kind of pointer arithmetic can compute pointers to any allocated object. So, basically, once you did ptr.offset(foo), now ptr potentially aliases anything, even local variables of the current function.

If we wanted to try this, what's needed?

We'd need a compiler that does not exploit provenance in its alias analysis. I do not know of such a compiler even for C/C++.

@Diggsey
Compilers currently assume that if ptr1 and ptr2 point to different allocated objects, then ptr1.offset will also never point to the allocated object of ptr2. That assumption is simply wrong if you remove provenance from the Abstract Machine. ptr1.offset(ptr2.offset_from(ptr1)) will alias ptr2. Put differently, the escape analysis made by typical compilers has to be made weaker (consider more things as "escaped") to be correct in a provenance-free world.

I would love to see the experiment of building such a compiler actually being made! This discussion is frustratingly void of good data. We have plenty of theorems and counterexamples showing what is correct and what is not correct, but we lack good indications for what the performance trade-offs are in this landscape of possible Abstract Machines. The only datum we can really go off of is that pretty much all compilers have an escape analysis that is correct only in the presence of explicit provenance in the Abstract Machine, so clearly these compiler authors thought they need that stronger analysis.

Of course, provenance in Rust goes way further than that -- all the Stacked Borrows optimizations would of course also be lost in a totally provenance-free world.

@chorman0773

I'd argue that escape-analysis is a form of provenance.

This mixes up operational semantics and static analysis.
The operational semantics is the "source of truth", it defines what the program might do when being executed. A static analysis is an approximation that the compiler computes of these possible program behaviors; it is justified against the operational semantics. A static analysis can, in principle, be proven correct; an operational semantics cannot be proven correct since it defines what "correct" even means when talking about program executions.

So, if the operational semantics (the Abstract Machine) does not have provenance, a static analysis cannot magically "add" it.
That said, of course the notion of a "data/control dependency" can be defined in any operational semantics, and can be useful for analysis, so maybe that's what you mean here. However, it is important to separate "explicit" provenance that exists in the Abstract Machine of the language "by fiat", from "implicit"/"analysis-only" notions that are defined by static analysis and can be proven correct against the operational semantics of the Abstract Machine.

To give a concrete example: if the abstract machine does not have provenance, then no amount of escape analysis can recover the Stacked Borrows optimizations.

bjorn3 (Member) commented Jun 20, 2021

I would love to see the experiment of building such a compiler actually being made!

cg_clif is currently such a compiler. Cranelift itself doesn't have any provenance. It doesn't have many optimizations either, though, so it is much slower than what a compiler without provenance could theoretically be.

RalfJung (Member, Author):

Cranelift will be [even more] interesting to watch for its future development then. :)
(Also I have a nagging worry here that people used to work in compilers with provenance -- so, presumably the vast majority of compiler engineers -- will introduce provenance-exploiting optimizations to cranelift without even realizing...)

bjorn3 (Member) commented Jun 20, 2021

Cranelift is mostly meant for WebAssembly compilation, where there is only a single object (the linear memory), so provenance wouldn't help anything. I have been wanting to add some memory optimizations as passes before passing the clif IR to Cranelift, but at the moment I don't yet know enough about how to implement such optimizations.

RalfJung (Member, Author):

Right, that's fair -- wasm doesn't have provenance. :)

digama0 commented Jun 20, 2021

In C/C++/Rust without provenance, we would have to account for the fact that any kind of pointer arithmetic can compute pointers to any allocated object. So, basically, once you did ptr.offset(foo), now ptr potentially aliases anything, even local variables of the current function.

Stacked borrows (without pointer provenance) already prevents this. A local that is not made accessible to pointers (i.e. &x is okay, but not &x as *const _) cannot be interfered with by pointers. That is, while you can synthesize a pointer to x by guessing, you can't dereference it to read or write x without causing UB, because x's memory says that pointer reads are not permitted.

This is certainly less than full pointer provenance will give you but it's more than you seem to give credit for.

RalfJung (Member, Author) commented Jun 20, 2021

Stacked borrows (without pointer provenance)

This sequence of words does not make any sense. Stacked Borrows is all about provenance.
I have no idea what Stacked Borrows without provenance could look like (it could look like many different things, no idea which one you have in your mind), but it would be nothing at all like Stacked Borrows.

Note that the Stacked Borrows "predecessor", Types as Contracts, had no provenance. It did not go very well. Tons of existing unsafe code in the standard library was rejected.

digama0 commented Jun 20, 2021

I already explained this in my first post on this thread. I have no desire to remove provenance for references, only raw pointers. I'm not sure why the discussion always seems to be drawn to the extreme position. (Even when I have talked about removing all provenance, it was to replace it with permissions, which are mostly equivalent to reference provenance.) Well, maybe you are responding to other variations on the proposal but I hope it's not being straw-manned.

Regarding "types as contracts", could you elaborate here on how it differs from Stacked Borrows and what kinds of problems arose? To me they look roughly equivalent.

You say that tons of existing unsafe code in the standard library was rejected by Types as Contracts, but I don't see how making rust provenance-free on its own (following any of the proposals in flight) would cause this. Eliminating provenance only makes more things defined, which can invalidate optimizations and maybe the safety arguments around certain unsafe functions but should not cause correct unsafe code in Rust-with-provenance to become UB.

bjorn3 (Member) commented Jun 20, 2021

See https://www.ralfj.de/blog/2017/08/11/types-as-contracts-evaluation.html and https://www.ralfj.de/blog/2018/08/07/stacked-borrows.html. To quote from the latter:

An “access”-based model is one where certain properties – in this case, mutable references being unique and shared references pointing to read-only memory – are only enforced when the reference is actually used to access memory. In contrast, a “validity”-based model requires these properties to always hold for all references that could be used. In both cases, violating a property that the model requires to hold is undefined behavior.

Essentially, with a validity-based model like “Types as Contracts”, the basic idea is that all data is always valid according to the type it is given. Enforcing the restrictions of such a model (e.g., when checking whether a program has undefined behavior) amounts to eagerly checking all reachable data for validity. An access-based model, on the other hand, only requires data to be valid when used. Enforcing it amounts to lazily checking the bare minimum at each operation.

Validity-based models have several advantages: Eager checking means we can typically identify which code is actually responsible for producing the “bad” value. “All data must always be valid” is also easier to explain than a long list of operations and the kind of restrictions they place upon the data.

However, in Rust, we cannot talk about references and whether they are valid at their given type without talking about lifetimes. With “Types as Contracts”, the exact place where a lifetime ended turned out to be really important. Not only did this make the specification complex and hard to understand; the implementation in Miri also had to actively work against the compiler’s general intention to forget about lifetimes as early as possible. With non-lexical lifetimes, the “end” of a lifetime is not even so clearly defined any more.

The last paragraph was, I think, the biggest problem. Lifetimes are normally completely erased before codegen, but with "Types as Contracts", lifetimes are very important. Stacked Borrows, however, doesn't need lifetimes at all; instead it is pretty much a runtime counterpart to lifetimes.

Diggsey commented Jun 20, 2021

You can sometimes argue that guessing is impossible, but that is non-trivial. This is the argument @Diggsey also makes.

Yeah that paper is exactly what I was suggesting 😄 - however, I would argue that the "twin allocation" strategy is not necessary. I think it would be sufficient to say that this kind of reasoning (based on exhaustion of the entire address space) is simply not valid, and the compiler will optimize programs assuming that you are not doing this reasoning.

Whilst this is a strange rule to have, it's also a rule that you can't really violate by accident. You have to really go out of your way to try and exploit this, and even then, in most systems you don't have access to the entire address space in the first place.

Put differently, the escape analysis made by typical compilers has to be made weaker (consider more things as "escaped") to be correct in a provenance-free world.

Yep, but there's also fewer ways for unsafe code to cause UB 😛.

all the Stacked Borrows optimizations would of course also be lost in a totally provenance-free world.

So, reading this line from the SB paper:

Stacked Borrows also makes no attempt to distinguish different raw pointers pointing to the same thing: raw pointers are untagged.

Doesn't that mean that the SB operational semantics only requires provenance for references?

scottmcm (Member):

I have no desire to remove provenance for references, only raw pointers.

Given how many low-level things are implemented with pointers, how would this affect Rust code, @digama0? Would slice iterators lose provenance because iteration is done using pointers? Would mem::swap and [T]::swap lose provenance because they're implemented using pointers? Can Box<T> even ever have provenance in such a world, when the allocator returns a pointer? Making more intrinsics for all these things seems suboptimal...

chorman0773 (Contributor) commented Jun 20, 2021 via email

digama0 commented Jun 21, 2021

Given how many low-level things are implemented with pointers, how would this affect rust code, @digama0? Would slice iterators lose provenance because it's done using pointers? Would mem::swap and [T]::swap lose provenance because they're implemented using pointers? Can Box<T> even ever have provenance in such a world, when the allocator returns a pointer? Making more intrinsics for all these things seems suboptimal...

These are good questions and I'm not sure I have all the answers. Regarding mem::swap, it would only matter if the two mutable references are themselves pointing to something with provenance, so for example exchanging the contents of two &mut &mut u8's. Since we're talking about sound code here, it's difficult to imagine some situation where the provenance would make a difference, which is to say, the two &mut u8's are necessarily pointing to disjoint locations. But if we imagine what has to happen inside swap itself, I guess we have something like this:

fn swap(x: &mut &mut u8, y: &mut &mut u8) {
  let x_ptr = x as *mut &mut u8;
  let y_ptr = y as *mut &mut u8;
  // &mut u8 is not Copy, so moving it through raw pointers needs ptr::read;
  // the raw pointer accesses also require an unsafe block.
  unsafe {
    let temp: &mut u8 = std::ptr::read(x_ptr);
    std::ptr::write(x_ptr, std::ptr::read(y_ptr));
    std::ptr::write(y_ptr, temp);
  }
}

Now, if pointers are provenance free, then x_ptr and y_ptr are just numbers, and so things like the order of operations here can't be freely adjusted. However, the provenance we are interested in preserving is that of the &mut u8 values behind x and y, and in particular temp is a valid &mut u8 which is written in a provenance-preserving way to the memory of y_ptr. So the copy from *x to *y and vice versa should be provenance preserving, which is what we expect of mem::swap. I think a similar argument will show that [T]::swap also preserves provenance.

Can Box<T> even ever have provenance in such a world, when the allocator returns a pointer?

This is more interesting. I think Box<T> has to be magic (a.k.a compiler builtin) for it to have provenance, but I agree that it really should have provenance. Ralf has said in the past that it's basically the same as &mut T except for the way that dropping is handled, and I think this is a good model; it can't be built directly from &mut T because there is no lifetime, however. Then again, Box is currently already magic, so this on its own doesn't seem too problematic.

As for how we can somehow vivify a provenance-carrying box from an allocator, this probably gets into the part of int-to-ptr casts that cause all the problems. What I would say is that the allocator is passing ownership of the data to the box, but the integer itself that is passed (i.e. the pointer) does not communicate this information. What I would like to happen here is that the allocator creates a pointer to the memory, in the process marking the memory as open for pointer access, and then the receiver of the pointer uses it to create a Box<T>, which asserts at that point that the memory is now under the unique jurisdiction of the box. Subsequent pointer access to that memory (if the allocator pointer was intercepted and copied elsewhere before being boxed) is not UB, but will invalidate the box, causing UB to occur when the box is later used or dropped.

@RalfJung
Copy link
Member Author

RalfJung commented Jun 21, 2021

@digama0

I already explained this in my first post on this thread.

Okay, sorry, I misunderstood. "pointer provenance" often means "provenance on all pointer(-like) values"; I did not realize that you meant specifically Rust's raw pointers. We are also discussing another model in the same thread that has no provenance at all (wasm-style), and were hypothesizing about the impact on compilers. We need names for all these models...

I already responded to that model ("Stacked-Borrows-only-ref"?) earlier, and still stand by that response: I think this solves none of the problems, since we still have some values with provenance and some values without provenance, so conversions between the two are "strangely behaved". I fail to see how having "strangely behaved" raw ptr <-> reference conversion is any better than having "strangely behaved" integer <-> raw ptr conversion.

@Diggsey

Yeah that paper is exactly what I was suggesting :smile: - however, I would argue that the "twin allocation" strategy is not necessary. I think it would be sufficient to say that this kind of reasoning (based on exhaustion of the entire address space) is simply not valid, and the compiler will optimize programs assuming that you are not doing this reasoning.

😂
Sorry, but this is really funny.^^ You cannot 'simply' say "this argument is not valid" unless you can find a flaw in my reasoning -- and I can prove correctness of this argument, so I doubt there is such a flaw. Mathematical arguments like this show necessary conclusions that follow from the given definitions; we are not at liberty to reject the conclusion if we want to stick to the definitions. The way to make the argument not valid is by changing the definition (of the Abstract Machine) such that the proof no longer works. That's what we did for that paper.
The only alternative I can think of is to impose additional restrictions to the compiler correctness statement itself, such as "we don't talk about program executions where the address space is more than half-way exhausted". Maybe that's what you meant. That might seem appealing at first, but it is equivalent to saying "exhausting more than half the address space is UB". Which is, ugh, not good, I think?

But anyway, properly justifying optimizations based on non-determinism is mostly orthogonal to this discussion I think. I am just pointing out that it is hard. The more interesting question is: we certainly lose some of the things that typical escape and alias analyses do; how bad does that loss hurt?

Doesn't that mean that the SB operational semantics only requires provenance for references?

SB as described in the paper and as implemented in Miri does not really have provenance for raw ptrs, indeed. This is sufficient for all the SB optimizations. (It also doesn't solve #286, it just shifts the problem to "reference-to-raw-ptr transmutes".) This lack of provenance on raw ptrs is incompatible with our LLVM codegen backend, however (Cc #86). The -Zmiri-track-raw-pointers flag enables proper provenance on raw pointers.

@digama0
Copy link

digama0 commented Jun 21, 2021

I already responded to that model ("Stacked-Borrows-only-ref"?) earlier, and still stand by that response: I think this solves none of the problems, since we still have some values with provenance and some values without provenance, so conversions between the two are "strangely behaved". I fail to see how having "strangely behaved" raw ptr <-> reference conversion is any better than having "strangely behaved" integer <-> raw ptr conversion.

I don't really disagree with this analysis, especially since I think a reasonable refinement of my model is to incorporate LLVM style pointer provenance for an off-the-beaten-track type for experts who really want to write unsafe code with provenance based optimizations (as suggested earlier), and that model would be basically the same as the current rust model except the names of types are switched around a bit. But my takeaway from this is that the two approaches are on roughly the same footing re: theoretical analysis.

But if we ignore the existence of such a hypothetical LLVM style pointer type, then we only have to worry about SB references and how they can be forged from integers, and I don't think we need a twinsem-style solution for that, because SB references solve the guessing problem by marking the memory itself as uniquely owned by some tag and thus barring wild pointer access.

The only alternative I can think of is to impose additional restrictions to the compiler correctness statement itself, such as "we don't talk about program executions where the address space is more than half-way exhausted". Maybe that's what you meant. That might seem appealing at first, but it is equivalent to saying "exhausting more than half the address space is UB". Which is, ugh, not good, I think?

As long as you can reliably signal abnormal termination by resource exhaustion, I don't think it has to be UB. In any case, one has to deal with resource exhaustion at some point, because the abstract machine states that pointers have finite width so address space exhaustion at least has to exist. As for "half the address space", I think that's just a peculiarity of twinsem, no? You should be able to use almost all of the address space before resource exhaustion becomes inevitable.

@Diggsey
Copy link

Diggsey commented Jun 21, 2021

Sorry, but this is really funny.^^ You cannot 'simply' say "this argument is not valid"

I think you are misrepresenting what I'm saying. I never said your argument is invalid: it's perfectly valid, I'm saying that the case it protects against will never occur in a real/useful program, so it would be far better to define programs which hit this edge-case as being invalid rather than to either accept that 50% of our memory is unusable or to discard this approach entirely.

The only alternative I can think of is to impose additional restrictions to the compiler correctness statement itself, such as "we don't talk about program executions where the address space is more than half-way exhausted". Maybe that's what you meant. That might seem appealing at first, but it is equivalent to saying "exhausting more than half the address space is UB". Which is, ugh, not good, I think?

This kind of thing is what I meant, but I think there are better rules that could be used. There clearly exist some programs that use more than 50% of the available address space, and yet cannot "infer" the location of any specific allocation, so we know there's at least some room for improvement. It also seems like we should just be able to "forbid negative reasoning" on pointers.

SB as described in the paper and as implemented in Miri does not really have provenance for raw ptrs, indeed.

Great 😄

It also doesn't solve #286, it just shifts the problem to "reference-to-raw-ptr transmutes".

Well, that shift may be a solution. To me at least, making reference-to-raw-ptr transmutes UB (or maybe vice versa, we only need one direction to be UB right?) is far more palatable than making integer-to-raw-ptr transmutes UB, and I would expect a lot of people to feel the same way.

This lack of provenance on raw ptrs is incompatible with our LLVM codegen backend

I think that for those pointers which we deem to not escape, we can continue to emit them as pointers-with-provenance to LLVM, no? And for those pointers which do escape, we can use LLVM integers to represent them.

@thomcc
Copy link
Member

thomcc commented Jun 21, 2021

To me at least, making reference-to-raw-ptr transmutes UB (or maybe vice versa, we only need one direction to be UB right?) is far more palatable than making integer-to-raw-ptr transmutes UB, and I would expect a lot of people to feel the same way.

I don't have strong feelings about ref-to-raw-ptr, but raw-ptr-to-ref transmutes I do have strong feelings about because of const eval.

Because &*ptr has remained forbidden in const eval while transmuting (say) *const T to &T is not, the latter is quite common in macros and the like, so I think it would be a big problem to make that direction UB.

@digama0
Copy link

digama0 commented Jun 21, 2021

Well, that shift may be a solution. To me at least, making reference-to-raw-ptr transmutes UB (or maybe vice versa, we only need one direction to be UB right?) is far more palatable than making integer-to-raw-ptr transmutes UB, and I would expect a lot of people to feel the same way.

I don't think reference-to-raw-ptr casts can be made UB, they are done all the time. This is the easy direction though, this just drops the provenance. The hard direction is raw-ptr-to-ref, which is also done all the time, but I think it is sufficient here to make a new tag and modify memory to push the tag onto the borrow stacks of the affected locations.

Once you have casts, you can do transmutes as well by just doing the same thing when you perform a read at the wrong type.

@Diggsey
Copy link

Diggsey commented Jun 21, 2021

@digama0 my understanding is that casts were never the issue here, it's only a problem when you are writing to a memory location at one type and reading it at a different type. As long as there is an explicit conversion step then there is no problem.

Once you have casts, you can do transmutes as well by just doing the same thing when you perform a read at the wrong type.

In general, you can't know if the read is of the wrong type. At least not without special-casing transmute in some way in the compiler.

@digama0
Copy link

digama0 commented Jun 21, 2021

In general, you can't know if the read is of the wrong type. At least not without special-casing transmute in some way in the compiler.

This is an abstract machine operation. It knows whether the target memory location has stored provenance or not. This has nothing to do with the compiler.

Writing a memory location at one type and reading it at another type is equivalent to a cast - you have exactly the same information, except that rather than knowing both types at once, you have to use in-memory provenance to communicate the difference. In the case of transmuting from &mut T to/from usize, we don't care about any fancy properties of the types, so it suffices to store the SB tag in memory and drop it when you read the value back out in the ref-to-int case, and forge a new unique borrow to the memory location in the int-to-ref case. You would go through exactly the same steps for a cast between these types.
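As a concrete rendition of those two directions (a sketch only; `ref_to_int`/`int_to_ref` are illustrative names, and a strict checker such as Miri with raw-pointer tracking may well flag the int-to-ref direction):

```rust
unsafe fn ref_to_int(r: &mut u8) -> usize {
    std::mem::transmute(r) // drops the SB tag; only the address survives
}

unsafe fn int_to_ref<'a>(n: usize) -> &'a mut u8 {
    std::mem::transmute(n) // must forge a fresh unique borrow of the location
}

fn main() {
    let mut x = 7u8;
    let n = unsafe { ref_to_int(&mut x) };
    unsafe { *int_to_ref(n) = 8; }
    assert_eq!(x, 8);
}
```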

@RalfJung
Copy link
Member Author

RalfJung commented Jun 22, 2021

I said

SB as described in the paper and as implemented in Miri does not really have provenance for raw ptrs, indeed

I spoke a bit too fast: raw pointers still have basic C-like provenance, as in, they remember the allocated object they belong to. But with default flags, raw pointers do not have extra Stacked Borrows provenance.

So, in this implementation, ptr <-> int transmutes are still very much a problem. (ptr <-> ref transmutes are fine.)

@digama0

But if we ignore the existence of such a hypothetical LLVM style pointer type, then we only have to worry about SB references and how they can be forged from integers, and I don't think we need a twinsem-style solution for that, because SB references solve the guessing problem by marking the memory itself as uniquely owned by some tag and thus barring wild pointer access.

We still need it for analyzing programs that do turn a reference to a ptr and that to an integer but keep that all local to the current function so we can still have total knowledge about who can know about this address. Though I suppose one could say, that's such a niche case, it's okay to lose those optimizations.

The more interesting problem is that the reason this works is that in Miri, casting a reference to a raw ptr has a side-effect somewhat similar to the "exposed" flag of the PNVI-ae proposal for C. This doesn't really square with LLVM treating those casts as pure and effect-free.

As for "half the address space", I think that's just a peculiarity of twinsem, no?

It's what you need to make the pointer guessing argument without twinsem -- because then you can refine into twinsem and do the argument there. I don't know any other way to make this argument correctly (as in, mathematically sound, without hand-waving).

@Diggsey

I think you are misrepresenting what I'm saying. I never said your argument is invalid: it's perfectly valid, I'm saying that the case it protects against will never occur in a real/useful program, so it would be far better to define programs which hit this edge-case as being invalid rather than to either accept that 50% of our memory is unusable or to discard this approach entirely.

I can only interpret this as suggesting that programs that use up "too much" of the address space are UB. (Not sure what else you could mean by "invalid".) That would be a disaster since it breaks Rust's soundness guarantee.

This kind of thing is what I meant, but I think there are better rules that could be used. There clearly exist some programs that use more than 50% of the available address space, and yet cannot "infer" the location of any specific allocation, so we know there's at least some room for improvement.

We know no such thing, since the compiler, when doing the optimization, only sees parts of the whole program -- it has no way of knowing if the rest of the program is trying to conspire to infer addresses or not. The only correct thing here is to assume the worst.

It also seems like we should just be able to "forbid negative reasoning" on pointers.

Well, then just do it. :) That's where I started, and then I came up with twinsem. So far, to my knowledge, no other correct formal argument has been made for justifying such optimizations.

So maybe just avoid saying "just" when this is demonstrably a hard problem. ;)

To me at least, making reference-to-raw-ptr transmutes UB (or maybe vice versa, we only need one direction to be UB right?) is far more palatable than making integer-to-raw-ptr transmutes UB, and I would expect a lot of people to feel the same way.

Interesting. I personally feel quite exactly the opposite.

@digama0

Writing a memory location at one type and reading it at another type is equivalent to a cast

No, it is not. Or rather, of course you can make it equivalent, but that would be a bad idea because then this counterexample kicks in and you lose some redundant store elimination optimizations. So the most likely candidates for LLVM/MIR semantics will not make transmutes (via type-punning loads) and casts equivalent.

@Diggsey
Copy link

Diggsey commented Jun 22, 2021

So maybe just avoid saying "just" when this is demonstrably a hard problem. ;)

I didn't mean to imply it was simple, only that it was replacing something else, not in addition to the other thing. I'll use "instead" in future...

I spoke a bit too fast: raw pointers still have basic C-like provenance, as in, they remember the allocated object they belong to. But with default flags, raw pointers do not have extra Stacked Borrows provenance.

Could you expand on this more / point me to where I can read about this? Does SB build on top of this C-like provenance, or are they two systems that must be evaluated in parallel, but otherwise have no interaction?

How do the optimizations fit into this? I'm confused because in the paper you prove various optimizations given the SB rules, seemingly without reference to C-like provenance.

@RalfJung
Copy link
Member Author

RalfJung commented Jun 22, 2021

Could you expand on this more / point me to where I can read about this. Does SB build on top of this C-like provenance, or are they two systems that must be evaluated in parallel, but otherwise have no interaction?

The paper considers a memory model without integer-pointer casts and a CompCert-style memory model: a "pointer" consists of a block ID ("allocation ID", identifying the allocated object) and an offset within that block -- and then we add the Stacked Borrows tag to that. Pointer arithmetic only affects the offset. There simply is no way to perform ptr arithmetic across object boundaries, that kind of provenance is strictly and implicitly enforced everywhere.

In Miri, a Pointer likewise consists of

  • an AllocId for the allocated object
  • an offset (Size) within the allocated object
  • a Tag for Stacked Borrows (always Tag::Untagged for raw pointers, under default settings)

So, yeah, essentially the last option: the two have (basically) no interaction.
(I am working on a patch for Miri that makes the offset be the absolute physical address instead of the offset within the allocated object, but that will not have much impact on this discussion I think.)
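A toy rendition of that pointer shape (names are illustrative, loosely mirroring the fields listed above, not Miri's actual types):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct AllocId(u64);

#[derive(Clone, Copy, Debug, PartialEq)]
enum Tag { Untagged, Unique(u64) }

#[derive(Clone, Copy, Debug, PartialEq)]
struct Pointer {
    alloc_id: AllocId, // which allocated object
    offset: u64,       // offset within that object
    tag: Tag,          // Stacked Borrows tag (Untagged for raw pointers)
}

impl Pointer {
    // Arithmetic only ever touches the offset, so crossing into another
    // object is unrepresentable by construction.
    fn wrapping_add(self, n: u64) -> Pointer {
        Pointer { offset: self.offset.wrapping_add(n), ..self }
    }
}

fn main() {
    let p = Pointer { alloc_id: AllocId(1), offset: 0, tag: Tag::Untagged };
    let q = p.wrapping_add(8);
    assert_eq!(q.alloc_id, p.alloc_id); // still the same object
    assert_eq!(q.offset, 8);
}
```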

@digama0
Copy link

digama0 commented Jun 22, 2021

But if we ignore the existence of such a hypothetical LLVM style pointer type, then we only have to worry about SB references and how they can be forged from integers, and I don't think we need a twinsem-style solution for that, because SB references solve the guessing problem by marking the memory itself as uniquely owned by some tag and thus barring wild pointer access.

We still need it for analyzing programs that do turn a reference to a ptr and that to an integer but keep that all local to the current function so we can still have total knowledge about who can know about this address. Though I suppose one could say, that's such a niche case, it's okay to lose those optimizations.

I suppose you mean:

let x: &mut u8 = &mut 0;
let x_ptr = x as *mut u8; // "escapes" x
unknown_func();
println!("{}", *x_ptr); // can't optimize to 0

Indeed we lose this optimization, and I wish I knew how performance-consequential it is.

The more interesting problem is that the reason this works is that in Miri, casting a reference to a raw ptr has a side-effect somewhat similar to the "exposed" flag of the PNVI-ae proposal for C. This doesn't really square with LLVM treating those casts as pure and effect-free.

Isn't there a way to generate code that explicitly marks the value as exposed in LLVM? I guess that casting to an integer and passing to a compiler black box would be sufficient, but maybe there is a lighter weight alternative.

As for "half the address space", I think that's just a peculiarity of twinsem, no?

It's what you need to make the pointer guessing argument without twinsem -- because then you can refine into twinsem and do the argument there. I don't know any other way to make this argument correctly (as in, mathematically sound, without hand-waving).

I agree that you need twinsem to defend against pointer guessing. What I meant is that the model I'm gunning for doesn't rely on pointer guessing protection in any way - the ability to magically know the addresses of local variables does not change what is UB or what optimizations apply. That has consequences for some examples like the one I wrote above, where we would like to be able to use a pointer guessing argument to say that x is not escaping but the model doesn't allow it, but we can still do some optimizations of this form using a different mechanism:

let x: &mut u8 = &mut 0;
let x_ptr = x as *mut u8; // "escapes" x
let y = &mut *x_ptr; // locks the memory of x
unknown_func();
println!("{}", *y); // can optimize to 0

Here, we know that y locks the memory of x, which has to have the value 0 at the time it was created because any concurrent writes to x would be unsequenced; then while unknown_func() runs, even if it guesses x_ptr (or even if it is passed x_ptr and uses it), any write to that memory location will invalidate y, making the *y access UB, so therefore *y will read the value that was originally in it, which is 0, so the optimization is valid.

I can only interpret this as suggesting that programs that use up "too much" of the address space are UB. (Not sure what else you could mean by "invalid".) That would be a disaster since it breaks Rusts soundness guarantee.

What do you want the spec to say about programs that use too much of the address space, whether that is 50% or 100%? Obviously something has to break down but I'm not sure what solution you prefer. I guess we can say that the allocator is required to give an unallocated block if it succeeds, so at some point it is forced to fail, and if it fails then it follows a normal failure mode like panicking.

Writing a memory location at one type and reading it at another type is equivalent to a cast

No, it is not. Or rather, of course you can make it equivalent, but that would be a bad idea because then this counterexample kicks in and you lose some redundant store elimination optimizations. So the most likely candidates for LLVM/MIR semantics will not make transmutes (via type-punning loads) and casts equivalent.

It's true, reading a variable at a given type and writing it back might not be a no-op. As suggested later in that thread, you can recover most of the content of the optimization by replacing the read/write with a clear_provenance(&x) operation that affects only the abstract memory and is a no-op at the lower level. Potentially MIR or LLVM could have an additional type, not exposed to user code but usable by IR transformations, to represent "arbitrary memory" for which redundant stores are eliminable; as I understand it, this is the idea behind the proposed bN type in LLVM.

@bjorn3
Copy link
Member

bjorn3 commented Jun 22, 2021

Here, we know that y locks the memory of x, which has to have the value 0 at the time it was created because any concurrent writes to x would be unsequenced; then while unknown_func() runs, even if it guesses x_ptr (or even if it is passed x_ptr and uses it), any write to that memory location will invalidate y, making the *y access UB, so therefore *y will read the value that was originally in it, which is 0, so the optimization is valid.

That will not allow eliminating the 0 allocation entirely if it is completely unused according to a memory model with provenance, as unknown_func may still write to it.

@RalfJung
Copy link
Member Author

Isn't there a way to generate code that explicitly marks the value as exposed in LLVM? I guess that casting to an integer and passing to a compiler black box would be sufficient, but maybe there is a lighter weight alternative.

Not currently, as far as I know. But it presumably could be added as an intrinsic -- that then many optimizations would have to be taught about.

But maybe LLVM needs such an intrinsic anyway to robustly model integer-ptr casts interacting with 'noalias'.

That has consequences for some examples like the one I wrote above, where we would like to be able to use a pointer guessing argument to say that x is not escaping but the model doesn't allow it, but we can still do some optimizations of this form using a different mechanism:

So you are basically saying, Stacked-Borrows-like optimizations don't care about pointer guessing. That is true; the theorems we prove in the paper establish this.

I personally think we should still have provenance for raw pointers though. That is orthogonal to your point, in this model we can still do all sorts of provenance-based optimizations without worrying about pointer guessing -- but unlike in your model we can do them even when casting to and from raw pointers. Personally I think that's strictly better.

The problem with both of these models is that we have casts (along the reference -> raw pointer -> integer chain) that have side-effects and thus cannot be removed even if their result is unused. (Though to be fair, I simply don't know how to define Stacked Borrows without such side-effecting casts. And they solve so many problems...)

What do you want the spec to say about programs that use too much of the address space, whether that is 50% or 100%? Obviously something has to break down but I'm not sure what solution you prefer. I guess we can say that the allocator is required to give an unallocated block if it succeeds, so at some point it is forced to fail, and if it fails then it follows a normal failure mode like panicking.

Yes: every allocation can non-deterministically succeed or fail; if the address space is full, it always fails.
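In today's std this failure mode surfaces as `alloc` returning null, which the caller must check rather than treat as a valid pointer; a minimal sketch (`try_alloc_bytes` is an illustrative helper, not a std API):

```rust
use std::alloc::{alloc, dealloc, Layout};

// Returns None on layout overflow or allocator failure instead of a wild pointer.
fn try_alloc_bytes(n: usize) -> Option<*mut u8> {
    let layout = Layout::array::<u8>(n).ok()?;
    if layout.size() == 0 { return None; } // zero-sized alloc is UB
    let p = unsafe { alloc(layout) };
    if p.is_null() { None } else { Some(p) }
}

fn main() {
    if let Some(p) = try_alloc_bytes(16) {
        unsafe { dealloc(p, Layout::array::<u8>(16).unwrap()) };
    }
    // A request the address space can never satisfy must fail, not be UB.
    assert!(try_alloc_bytes(usize::MAX).is_none());
}
```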

That will not allow eliminating the 0 allocation entirely if it is completely unused accordimg to a memory model with provenance, as unknown_func may still write to it.

True -- specifically, if unknown_func writes and then enters an endless loop, there is no UB.

@RalfJung
Copy link
Member Author

I came up with another "fun" example around provenance, Stacked Borrows, and pointer-to-integer casts: #291. This confirms my long-lasting suspicion that we cannot do "dead cast elimination" in models like Stacked Borrows -- but this also applies to C restrict.

Note that @digama0's model with untracked raw pointers has the same problem around reference-to-pointer casts.

@RalfJung
Copy link
Member Author

A year later, we are in the middle of the "Strict Provenance" experiment. I think at this point it seems fairly clear that Rust will have provenance on references and raw pointers, but not on integers, so I will close this issue.
