Spectre mitigations: add a mode that monitors branch mispredictions and dynamically turns on fences #8175

Open
cfallin opened this issue Mar 18, 2024 · 6 comments

Comments

@cfallin
Member

cfallin commented Mar 18, 2024

In discussion today with @fitzgen, @jameysharp, @elliottt and @lpereira, we were considering the idea of dynamically monitoring branch mispredictions and isolating execution of any Wasm instance that had used up a "misspeculation quota". I realized that what we could actually do is (effectively) turn off speculation -- you run out, you can't use it anymore! -- by dynamically inserting lfences.

Specifically: the (one?) neat thing about fully coherent icaches on x86 is that we can switch out the code that's running, on the fly, even if other threads are in the middle of functions we're switching out, as long as we're very careful to do it atomically (state between any two stores is valid code).

Consider the case where we want an lfence before every indirect branch (say; or before every branch; orthogonal detail) and we have:

    ...
    mov rax, ... # compute branch target (e.g. from br_table)
    nop # space for `lfence` (3 bytes)
    nop
    nop
    jmp rax

we can replace the three bytes of nop (0x90, 0x90, 0x90) with lfence (0x0f, 0xae, 0xe8) if we want to "turn off speculation" for this module for a little bit.

There are at least three ways to do that on an x86 machine (with coherent icaches):

  • Do an atomic store to code memory. For this we'd need W+X mappings temporarily, and an extra nop to make this a 32-bit region we could overwrite with one 32-bit store.
  • As above, but switch from R+X to R+W; take the fault signal (SIGSEGV on Linux) from any running thread, hold it temporarily, and release it when we switch the mapping back (via a futex?).
  • The one I like best: keep another version of the code segment around, and mmap it over the first.

The last one is pretty neat: mmap is atomic with respect to every other thread (it appears as a single store in the total store order; it must, because if another thread had it mapped, that thread would receive an IPI, which is a synchronizing edge). So we basically "yank out the code ROM and replace it" in between instructions, and the new code doesn't speculate.

Using this, we can build a control loop in a separate thread that monitors mispredict counters, and can flip the switch at will for any module that has excessive counts. It doesn't have to be a one-way trapdoor: a module could have a "mispredict quota" per time unit, and could reset to the fast code (no lfences) after a set period. There is no impact on other modules -- it only impacts the module with the mispredicts.
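To make the quota idea concrete, here is a minimal sketch of the decision logic only. Everything in it is an assumption for illustration: the names `MispredictGovernor`, `report`, `quota`, `window_s` and `penalty_s` are hypothetical, time is passed in explicitly, and reading the hardware mispredict counters and performing the actual remap are out of scope.

```python
from dataclasses import dataclass


@dataclass
class ModuleState:
    window_start: float = 0.0   # start of the current accounting window
    mispredicts: int = 0        # mispredicts seen in the current window
    fenced_until: float = 0.0   # module runs fenced code until this time


class MispredictGovernor:
    """Per-module mispredict quota per time window (hypothetical sketch)."""

    def __init__(self, quota: int, window_s: float, penalty_s: float):
        self.quota = quota
        self.window_s = window_s
        self.penalty_s = penalty_s
        self.modules: dict[str, ModuleState] = {}

    def report(self, module: str, new_mispredicts: int, now: float) -> bool:
        """Record a counter delta; return True if the module should be fenced."""
        st = self.modules.setdefault(module, ModuleState(window_start=now))
        # Start a fresh window once the previous one has elapsed.
        if now - st.window_start >= self.window_s:
            st.window_start = now
            st.mispredicts = 0
        st.mispredicts += new_mispredicts
        # Over quota: (re)arm the fenced period. It is not a one-way trapdoor.
        if st.mispredicts > self.quota:
            st.fenced_until = now + self.penalty_s
        return now < st.fenced_until
```

A monitoring thread would call `report` with each module's counter delta and, whenever the returned value changes, map the fenced or the fast code image over that module's code. State is kept per module, so one module going over quota does not change the answer for any other.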

Finally, I suspect this will be a bit harder on non-coherent-icache architectures (aarch64, riscv64), but actually maybe the "mmap a new thing on top of running code" is enough of a jolt to yoink all other cores into coherent happiness again. Note that I haven't tested that!

@cfallin
Member Author

cfallin commented Mar 18, 2024

One slight tweak: those three nops need to be one 3-byte nop for the atomic "store" of the new code with re-mmap to work safely; otherwise RIP might be right in the middle of where the lfence is about to spontaneously appear.
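A small sketch of how the two code images could be derived from one another under this constraint. It assumes the compiler records the offset of every fence site (the `fence_sites` list and the function name are hypothetical) and uses `0F 1F 00` as the single 3-byte NOP, so that both images have identical instruction boundaries.

```python
NOP3 = bytes([0x0F, 0x1F, 0x00])    # nop dword ptr [rax]: one 3-byte instruction
LFENCE = bytes([0x0F, 0xAE, 0xE8])  # lfence: also exactly 3 bytes


def build_fenced_copy(fast_code: bytes, fence_sites: list[int]) -> bytes:
    """Return a same-length copy of fast_code with LFENCE at each NOP3 site."""
    fenced = bytearray(fast_code)
    for off in fence_sites:
        # Refuse to patch anything that is not a recorded 3-byte NOP placeholder.
        if fenced[off:off + 3] != NOP3:
            raise ValueError(f"offset {off:#x} is not a 3-byte NOP placeholder")
        fenced[off:off + 3] = LFENCE
    return bytes(fenced)


# mov rax, rdi ; <3-byte nop> ; jmp rax
fast = bytes([0x48, 0x89, 0xF8]) + NOP3 + bytes([0xFF, 0xE0])
fenced = build_fenced_copy(fast, [3])
```

Because every instruction starts at the same offset in both images, any RIP that is valid in one image is also an instruction boundary in the other, which is the property the remap relies on.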

@sunfishcode
Member

Is a single mmap that spans multiple pages guaranteed to be entirely atomic?

@cfallin
Member Author

cfallin commented Mar 18, 2024

I think so? At the very least, in the Linux implementation, the memory-map changes are made under one lock, and one IPI is performed to other cores if needed; it'd be neat to find something in the POSIX spec either way to cite though.

@bjorn3
Contributor

bjorn3 commented Mar 18, 2024

Even with low amounts of mispredicted branches it would be possible to (slowly) leak data, right?

@cfallin
Member Author

cfallin commented Mar 19, 2024

The idea is that one would set the quota according to the desired probability (leak bit-rate bound). I haven't thought too much about the control algorithm here but perhaps one puts a module in "non-speculative mode" for the remaining duration of any individual instance alive at the time of the heightened branch mispredict rate (one could implement this with epochs, labeling instances at startup and keeping a count of active instances in epoch N-1 and N). Or something like that.
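Roughly, the epoch bookkeeping could look like the sketch below. This is one possible reading of the idea, not a worked-out design: the class and method names are hypothetical, and it keeps a count for every epoch rather than only N-1 and N so that repeated spikes are handled the same way.

```python
class EpochTracker:
    """Per-module bookkeeping: stay fenced until every instance that was
    alive at the time of a mispredict spike has exited (hypothetical sketch)."""

    def __init__(self) -> None:
        self.epoch = 0
        self.live = {0: 0}   # epoch -> live instances that started in it
        self.fenced = False

    def start_instance(self) -> int:
        """Count a new instance; the caller keeps the returned epoch label."""
        self.live[self.epoch] += 1
        return self.epoch

    def end_instance(self, label: int) -> None:
        self.live[label] -= 1
        # Unfence once no instance from an epoch before the current one remains.
        if self.fenced and all(
            n == 0 for e, n in self.live.items() if e < self.epoch
        ):
            self.fenced = False

    def on_quota_exceeded(self) -> None:
        """Enter non-speculative mode and open a new epoch."""
        self.fenced = True
        self.epoch += 1
        self.live[self.epoch] = 0
```

Instances started after the spike get the new epoch label, so they do not extend the fenced period; only the instances that could have contributed to the heightened mispredict rate do.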

I should also note that this can be layered with existing mitigations: so e.g. any explicit bounds checks are protected already (cannot read others' heaps even in misspeculation) and this technique is mainly to address the "indirect branches can jump anywhere and find a read gadget" problem, which itself should have a lower effective bit-rate...

@cfallin
Member Author

cfallin commented Mar 19, 2024

I just experimented a bit with this idea by writing a little program that mmaps two assembly routines over the top of each other -- identical except for LFENCEs vs. 3-byte NOPs -- while running, and observing the effective timing difference. (The second thread can actually mmap back and forth with different duty cycles, and one can observe the runtime changing smoothly as the amount of speculation varies -- a very weird sort of PWM.) Here is the gist. Note that this doesn't verify the page-crossing behavior (the little snippet lives on one page); it just shows that the remap-it-live action does work.
