Cranelift: implement "precise store traps" in the presence of store-tearing hardware. #8221

Open

cfallin wants to merge 1 commit into main
Conversation

@cfallin (Member) commented Mar 22, 2024

As discussed in #7237 and WebAssembly/design#1490, some instruction-set architectures do not guarantee that a store that "partially traps" (overlaps multiple pages, only one of which disallows the store) has no side effects. In particular, the part of the store that is legal might succeed.

This has fascinating implications for virtual memory-based WebAssembly heap implementations: when a store is partially out-of-bounds, it should trap (there is no "partially" here: if the last byte is out of bounds, the store is out of bounds). A trapping store should not alter the final memory state, which is observable by the outside world after the trap. Yet, the simple lowering of a Wasm store to a machine store instruction could violate this expectation, in the presence of "store tearing" as described above.

Neither ARMv8 (aarch64) nor RISC-V guarantees the absence of store tearing, and we have observed it in tests on RISC-V.

This PR implements the idea first proposed in WebAssembly/design#1490 (comment), namely to prepend a load of the same size to every store. The idea is that if the store will trap, the load will as well; and precise exception handling, a well-established principle in all modern ISAs, guarantees that instructions beyond a trapping instruction have no effect.

This is off by default, and is mainly meant as an option to study the impact of this idea and to allow for precise trap execution semantics on affected machines unless/until the spec is clarified.

On an Apple M2 Pro machine (aarch64), this was measured to have a 2% performance impact when running spidermonkey.wasm with a simple recursive Fibonacci program. It can be used via the -Ccranelift-ensure_precise_store_traps=true flag to Wasmtime.
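
To make the transformation concrete, here is a minimal sketch of the idea (not the actual patch): walk each function and insert a same-sized load just before every plain store. The cursor and InstructionData names follow the snippets quoted in the review comments below; exact field and method names may differ between Cranelift versions.

use cranelift_codegen::cursor::{Cursor, FuncCursor};
use cranelift_codegen::ir::{Function, InstBuilder, InstructionData, Opcode};

fn ensure_precise_store_traps(func: &mut Function) {
    let mut pos = FuncCursor::new(func);
    while let Some(_block) = pos.next_block() {
        while let Some(inst) = pos.next_inst() {
            // Only the plain `store` opcode is shown here; the real pass also
            // covers istore8/16/32, stack stores, and other storing opcodes.
            let idata = pos.func.dfg.insts[inst].clone();
            if let InstructionData::Store { opcode: Opcode::Store, args, flags, offset } = idata {
                let (data, addr) = (args[0], args[1]);
                let ty = pos.func.dfg.value_type(data);
                // The load's result is unused; it exists only to take any trap
                // before the store can partially commit.
                let _ = pos.ins().load(ty, flags, addr, offset);
            }
        }
    }
}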

@cfallin requested review from a team as code owners March 22, 2024 17:31
@cfallin requested review from elliottt and fitzgen and removed the request for a team March 22, 2024 17:31
@cfallin (Member, Author) commented Mar 22, 2024

@afonso360, if you still have access to your RISC-V hardware and a bit of time to test, it would be useful to confirm that this changes the result of the "torn stores" test case from earlier! @candymate, it would be interesting to confirm that this resolves your fuzzbug as well, and this could be useful for further fuzzing for you.

Review comment (Contributor) on:

    offset,
} => {
    let ty = covering_type_for_value(&pos.func, data);
    let _ = pos.ins().stack_load(ty, stack_slot, offset);

Stack stores are guaranteed to succeed when stack probing (or some other kind of stack size check) is used.

@github-actions bot added the cranelift, cranelift:meta, and wasmtime:api labels Mar 22, 2024

@bjorn3 (Contributor) commented Mar 22, 2024

> this was measured to have a 2% performance impact

How does this have only a 2% perf impact? Is some optimization pass removing some of the loads, or is the M2 chip able to exploit a lot of instruction-level parallelism?

@cfallin (Member, Author) commented Mar 22, 2024

Probably the latter -- the loads are not removed, and make it all the way to machine code.

@cfallin (Member, Author) commented Mar 22, 2024

(More specifically, a load inserted by this pass could later be removed by alias analysis, but only if subsumed by another earlier load to the same address, so there is still always some load before the store.)

@cfallin (Member, Author) commented Mar 22, 2024

One other performance-related note: this will likely have a worse impact when the pooling allocator with copy-on-write is enabled in Wasmtime, because the "first touch" to a page matters: if a page is first demanded in by a load, a read-only mapping is created, which may immediately fault again on the store (performing the actual CoW). We've seen such "mapping change" operations become bottlenecks in multithreaded processes due to IPIs. I don't expect we would turn this option on in any production setting, though; rather, we'd ensure (as is already the case on x86) that any production hardware does not have the torn-write behavior. But that's something to consider if we ever discuss having this on by default in all cases.

@afonso360 (Contributor) commented:

I can confirm that this fixes the torn writes at least on QEMU. 🎉

Interestingly, QEMU RISC-V only exhibits the torn write in this test when the write is performed with the vector extensions (v128), not with a normal i64.store.

I'm going to test this on real hardware over the weekend, and also see if I can run some benchmarks.

For completeness, here's the WAST test that I used.

WAST Test
(module
  (memory 1)
  (func (export "i128.store") (param i32 i64)
    local.get 0
    v128.const i32x4 0xffffffff 0xffffffff 0xffffffff 0xffffffff
    v128.store align=1)
  (func (export "i32.load8_u") (param i32) (result i32)
    local.get 0
    i32.load8_u))

(assert_trap (invoke "i128.store"
                     (i32.const 65529)
                     (i64.const 0xffffffffffffffff))
             "out of bounds memory access")

;; Partial bytes were not written.
(assert_return (invoke "i32.load8_u" (i32.const 65529))
               (i32.const 0))
(assert_return (invoke "i32.load8_u" (i32.const 65530))
               (i32.const 0))
(assert_return (invoke "i32.load8_u" (i32.const 65531))
               (i32.const 0))
(assert_return (invoke "i32.load8_u" (i32.const 65532))
               (i32.const 0))
(assert_return (invoke "i32.load8_u" (i32.const 65533))
               (i32.const 0))
(assert_return (invoke "i32.load8_u" (i32.const 65534))
               (i32.const 0))
(assert_return (invoke "i32.load8_u" (i32.const 65535))
               (i32.const 0))

Review comment (Member) on:

    let _ = pos.ins().load(ty, flags, addr, offset);
}
&InstructionData::StoreNoOffset {
    opcode: Opcode::AtomicStore,

I think atomics can be exempted here because they are already required to be aligned? At least at the wasm layer that's validated; I'm not sure about the Cranelift layer. This isn't needed for the wasm-level problem though.

Review comment (Member) on:

    Opcode::Istore32 => I32,
    _ => unreachable!(),
};
let _ = pos.ins().load(ty, flags, addr, offset);

Can this be skipped if flags.trap_code().is_none() or if flags.aligned()? Only possibly-trapping, non-aligned loads need to be emitted (which is all wasm loads, but not anything to a vmctx, for example).
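
A minimal sketch of the suggested guard, assuming the pass has the store's MemFlags bound to `flags` inside its instruction-walking loop (illustrative only):

// Skip the synthetic load when this store cannot trap or is statically aligned
// (e.g. accesses through the vmctx); only possibly-trapping, unaligned wasm
// accesses need the extra load.
if flags.trap_code().is_none() || flags.aligned() {
    continue;
}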

@tschneidereit (Member) commented:

I fully assume that I'm missing something and the answer is no, but just in case: would it not also work to first attempt to do the part of the store operation touching the highest page, and then do the rest of the write? ISTM that if any writes would fail that'd be guaranteed to include the highest page, and if that one succeeds, a write to all lower pages for the rest of the store operation is certain to succeed as well?

@cfallin (Member, Author) commented Mar 22, 2024

The problem I see with that approach is that we only know the split point dynamically: if we have a 64-bit store (allowed to be unaligned) to address X, we would always have to decompose it into bytes and perform them in reverse order (store X+7, then X+6, then ...), because we know nothing about X's alignment. A static analysis could sometimes know X's alignment, but usually not, e.g. if it is a value loaded from memory itself. So that is a store-amplification factor of 8 for 64-bit ops, 4 for 32-bit ops, etc., which is likely to have quite a big perf impact, I think.
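
For concreteness, a hedged sketch of what that dynamic decomposition would amount to for a single unaligned 64-bit store; `addr` and `value` are hypothetical, and little-endian byte order is assumed:

// Illustration only (not proposed code): emulate a 64-bit store by issuing
// byte stores from the highest address downward, so any fault on the last
// page occurs before any lower byte is written: 8 stores instead of 1.
unsafe fn store64_reverse_bytes(addr: *mut u8, value: u64) {
    for i in (0..8usize).rev() {
        let byte = ((value >> (i * 8)) & 0xff) as u8;
        core::ptr::write_volatile(addr.add(i), byte); // X+7 first, ..., X last
    }
}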

@tschneidereit (Member) commented:

Hmm, I see, I think. Could we then instead write a single byte (of whatever value) at the highest address, and do the actual store if that succeeded?

@cfallin (Member, Author) commented Mar 22, 2024

I think that would work, yeah. There are some other interesting performance tradeoffs here. Two that come to mind are "partial-store forwarding" (partially overlapping loads/stores can be problematic for uarch memory performance) and the fact that stores are in general more expensive than loads (store buffers typically have fewer entries than load buffers, so they can stall the OoO window, and stores take extra effort at retire). Balancing those against all the other considerations here (we want something we could turn on by default on tier-2 architecture(s) if needed, and we're less concerned with interactions with IPIs in the pooling allocator, where we're more likely to be very careful to select the right configuration and avoid this mechanism), I think I'd still prefer the load+store over narrow-store+store.

@tschneidereit (Member) commented:

That all makes sense to me. I was mainly concerned about the impact on the pooling allocator, and the missing piece was the ability to select a configuration that'd allow us to avoid any of this.

@cfallin (Member, Author) commented Mar 22, 2024

Ah yes, indeed. This is only a problem on aarch64 and riscv64, and AFAICT all the aarch64 hardware one would want to run a high-performance server on (Apple implementations, AWS machines, Ampere CPUs, even the RPi4) does not exhibit store tearing, so on aarch64 it's mostly theoretical and/or confined to very small (low-power / embedded) cores, I think.

@afonso360 (Contributor) commented:

I can confirm that this fix also works on real hardware. I ran a few benchmarks at random, and it does have a bit of an impact on smaller cores (as expected). Most benchmarks are 5-10% slower, but there is one outlier at 40%.

The results below are from a dual-issue, in-order core (SiFive U74).

Results
execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 1459784.00 ± 60785.14 (confidence = 99%)

  noload.so is 1.07x to 1.07x faster than load.so!

  [22663159 22821360.40 22862178] load.so
  [21326450 21361576.40 21405304] noload.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 2560349.00 ± 1538031.16 (confidence = 99%)

  noload.so is 1.01x to 1.04x faster than load.so!

  [110537308 111942491.90 113530980] load.so
  [108818829 109382142.90 111609052] noload.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [3465 3954.40 6169] load.so
  [3444 4122.00 6204] noload.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 92094.60 ± 29972.88 (confidence = 99%)

  noload.so is 1.06x to 1.12x faster than load.so!

  [1100973 1120327.60 1169522] load.so
  [1019878 1028233.00 1083605] noload.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [2281318 2391173.30 2505976] load.so
  [2212191 2312207.60 2376594] noload.so

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [1028 1098.00 1448] load.so
  [1050 1103.10 1170] noload.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 66417.80 ± 543.64 (confidence = 99%)

  noload.so is 1.39x to 1.40x faster than load.so!

  [233706 234547.90 235270] load.so
  [167648 168130.10 168742] noload.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [4541616 4965704.40 5992445] load.so
  [4423989 4751356.50 5875859] noload.so

instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [1714 2113.50 2415] load.so
  [1798 2135.50 2299] noload.so

  
compilation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm

  Δ = 44240.30 ± 40968.63 (confidence = 99%)

  noload.so is 1.00x to 1.03x faster than load.so!

  [2848804 2880309.90 2923861] load.so
  [2803879 2836069.60 2912357] noload.so

instantiation :: cycles :: benchmarks/blake3-scalar/benchmark.wasm

  No difference in performance.

  [1186 1391.10 1495] load.so
  [1185 1445.70 1604] noload.so

execution :: cycles :: benchmarks/blake3-scalar/benchmark.wasm

  No difference in performance.

  [9675 9782.10 10005] load.so
  [9572 9637.80 9814] noload.so

instantiation :: cycles :: benchmarks/regex/benchmark.wasm

  Δ = 340.10 ± 243.11 (confidence = 99%)

  noload.so is 1.03x to 1.19x faster than load.so!

  [2939 3406.10 3681] load.so
  [2992 3066.00 3281] noload.so

execution :: cycles :: benchmarks/regex/benchmark.wasm

  Δ = 245010.10 ± 1279.07 (confidence = 99%)

  noload.so is 1.06x to 1.06x faster than load.so!

  [4076050 4077873.90 4079406] load.so
  [3831298 3832863.80 3834736] noload.so

compilation :: cycles :: benchmarks/regex/benchmark.wasm

  Δ = 369482.20 ± 93448.03 (confidence = 99%)

  noload.so is 1.03x to 1.04x faster than load.so!

  [10706461 10872794.20 10932079] load.so
  [10360994 10503312.00 10590922] noload.so

@elliottt removed their request for review March 27, 2024 00:08
@candymate commented Apr 4, 2024

> @afonso360, if you still have access to your RISC-V hardware and a bit of time to test, it would be useful to confirm that this changes the result of the "torn stores" test case from earlier! @candymate, it would be interesting to confirm that this resolves your fuzzbug as well, and this could be useful for further fuzzing for you.

I'm sorry that I saw this just now - I'll check this in a day or two from now and let you know.

Tiny update: I'm sorry to say this, but I'm not available till 5/1. I'll see this again at that time.

I'm not sure about this, but I think I observed that this problem persists with QEMU 8.1.5 - need recheck...
