
Zeroize performance on u8 arrays #743

Open
blckngm opened this issue Mar 1, 2022 · 23 comments

@blckngm

blckngm commented Mar 1, 2022

I inspected the generated assembly code and benchmarked zeroize for [u8; 32] on x86_64 and found it quite inefficient, storing one byte at a time:

https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=3f44f4b90e6af0eac0dcb1f649390329

On my Ryzen CPU it takes ~7.8324 ns, i.e. roughly 1 cycle per byte. The binary code size is also quite large.

Using inline assembly (just stabilized in 1.59) and SSE2, zeroing a [u8; 32] takes just 3 instructions and ~492.87 ps (~16 bytes per cycle):

let mut buf = [0xffu8; 32]; // some sensitive data to wipe
unsafe {
    core::arch::asm!(
        "xorps {zero}, {zero}",
        "movups {zero}, ({ptr})",
        "movups {zero}, 16({ptr})",
        zero = out(xmm_reg) _,
        ptr = in(reg) buf.as_mut_ptr(),
        options(att_syntax, nostack, preserves_flags),
    );
}

So it might be something worth optimizing/documenting.

If you do not want to use inline assembly, maybe you should encourage using larger types or SIMD types, e.g. [u64; 4] or [__m128; 2] instead of [u8; 32]. Using write_volatile on *mut __m128 generates code that is as compact and efficient as the assembly above.
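A minimal sketch of that suggestion, assuming x86_64 with SSE available (the function name and buffer shape are just for illustration):

#[cfg(target_arch = "x86_64")]
pub fn zeroize_m128x2(buf: &mut [core::arch::x86_64::__m128; 2]) {
    use core::arch::x86_64::_mm_setzero_ps;
    unsafe {
        // Two volatile 128-bit stores; each compiles down to a single movaps/movups.
        core::ptr::write_volatile(&mut buf[0], _mm_setzero_ps());
        core::ptr::write_volatile(&mut buf[1], _mm_setzero_ps());
    }
}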

@tarcieri
Member

tarcieri commented Mar 1, 2022

@sopium we've done practically no optimization work, so I'm glad you're thinking about it.

Inline assembly is fine, but would need to be feature gated due to the higher MSRV (we can eventually consider bumping it, but our last bump was already painful)

For something like [T; N] though, it would be good to ensure this doesn't monomorphize to N distinct blobs of ASM.

@blckngm
Author

blckngm commented Mar 1, 2022

It boils down to LLVM not being able to optimize write_volatile for u8 pointers: the bytes have to be written individually.

I think monomorphizing for 16/32/64-byte arrays should be fine. These are also commonly used key sizes in cryptography, so the optimization can be justified.

For a u8 slice with unknown but largish size you can also use align_to_mut:

pub fn zeroize_slice(x: &mut [u8]) {
    unsafe {
        // Split into an unaligned prefix, an aligned middle, and an unaligned
        // suffix. Use __m128 instead of u64 on x86/x86_64 for wider stores.
        let (prefix, aligned, suffix) = x.align_to_mut::<u64>();
        for b in prefix {
            core::ptr::write_volatile(b, 0);
        }
        for w in aligned {
            core::ptr::write_volatile(w, 0);
        }
        for b in suffix {
            core::ptr::write_volatile(b, 0);
        }
    }
}
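A quick usage sketch (the buffer and its contents are just an example):

let mut secret = vec![0xAAu8; 4096];
zeroize_slice(&mut secret);
assert!(secret.iter().all(|&b| b == 0));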

@tarcieri
Member

tarcieri commented Mar 1, 2022

The use of write_volatile was something of a hack to ensure that the writes would not be elided by the compiler.

But inline assembly should also ensure that as well, and permit a highly performant implementation.

@blckngm
Author

blckngm commented Mar 1, 2022

The use of write_volatile was something of a hack to ensure that the writes would not be elided by the compiler.

But inline assembly should also ensure that as well, and permit a highly performant implementation.

Yes, I understand this. If it were a “normal” write in a loop, LLVM would be able to recognize it and generate efficient instructions or call memset.

@tarcieri
Member

tarcieri commented Mar 1, 2022

Yeah, the typical way to implement a zeroization primitive efficiently is some sort of volatile memset, usually realized with a compiler fence: invoking memset in conjunction with an assembly-based dataflow optimization barrier.

While Rust has bindings to LLVM's volatile memset intrinsics, they are perma-unstable and no work has been done on a stable API AFAIK:

https://doc.rust-lang.org/std/intrinsics/fn.volatile_set_memory.html
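For reference, a nightly-only sketch of what using that intrinsic looks like (it requires the core_intrinsics feature, so it is not an option for a stable crate):

#![feature(core_intrinsics)]

pub fn zeroize_bytes(buf: &mut [u8]) {
    // Volatile memset of `buf.len()` bytes to zero.
    unsafe { core::intrinsics::volatile_set_memory(buf.as_mut_ptr(), 0, buf.len()) }
}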

But perhaps inline assembly could provide the next best thing, albeit on a target architecture-by-architecture basis.

@newpavlov
Member

We cannot use write_volatile on a *mut __m128 obtained from a [u8; 32], since the pointer could be insufficiently aligned. AFAIK we do not have an unaligned version of the volatile write. As a hack we could split the [u8; 32] into aligned and unaligned parts; in theory, the compiler should be able to remove the unaligned code paths if the array is sufficiently aligned.

@blckngm
Author

blckngm commented Mar 1, 2022

We cannot use write_volatile on a *mut __m128 obtained from a [u8; 32], since the pointer could be insufficiently aligned. AFAIK we do not have an unaligned version of the volatile write. As a hack we could split the [u8; 32] into aligned and unaligned parts; in theory, the compiler should be able to remove the unaligned code paths if the array is sufficiently aligned.

I was suggesting using [__m128; N] from the start, and transmuting to &[u8; N * 16] when necessary.

As for splitting into aligned and unaligned parts, that's exactly what align_to_mut does.

@blckngm
Author

blckngm commented Mar 1, 2022

Yeah, the typical way to implement a zeroization primitive efficiently is some sort of volatile memset, usually realized with a compiler fence: invoking memset in conjunction with an assembly-based dataflow optimization barrier.

Oh, this reminds me: now that we have stable inline assembly, this can be done too, and it's not architecture-dependent:

pub fn zeroize(x: &mut [u8]) {
    for b in x.iter_mut() {
        *b = 0;
    }
    unsafe {
        // Optimization barrier: the empty asm block takes the pointer as an
        // input and is not `pure`, so the compiler must assume it may observe
        // the buffer's contents and cannot elide the writes above.
        core::arch::asm!(
            "/* {ptr} */",
            ptr = in(reg) x.as_mut_ptr(),
            options(nostack, readonly, preserves_flags),
        );
    }
}

pub fn f() {
    let mut x = [0u8; 32];
    zeroize(&mut x);
}

Playground: https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=32fc07b8173bd2ce233eaa9bf0d8d81a

The zeroize call in f is optimized to two movaps stores, but it is not optimized out.

@cbeck88

cbeck88 commented Jan 6, 2023

@sopium I have a question about the use of readonly in the assembly options. (This is mostly for my own understanding.)

I watched a talk by Chandler Carruth (major contributor to LLVM) a long time ago, where he described creating similar functions in order to defeat the optimizer for benchmarking purposes: https://www.youtube.com/watch?v=nXaxk27zwlk&t=2476s

static void escape(void * p) {
    asm volatile("" : : "g"(p) : "memory");
}

I realize that this is a C++ talk and not Rust, but they both go through LLVM in the end, so there's a lot of overlap.

In his talk he seems to say that it's important for the optimizer to believe that:

  1. This asm has side-effects that the optimizer cannot know about. In C++ that's done with "volatile" asm, and in Rust I think that's done by not marking your asm pure.
  2. This asm potentially clobbers all memory, so any reads from memory after this point have to actually go back to memory, and any values cached in registers are dirty.

In your block you use the readonly flag and I'm trying to understand what this does and why this is good.

  1. We are telling the optimizer that this assembly is not pure; it has some "crazy unknowable side-effect", in Chandler's words.
  2. However, we are also telling the optimizer that this assembly is readonly and doesn't write to memory.
  3. It also preserves flags.

So, the optimizer has to believe that

  1. This assembly can read from the memory which we are supposed to have zeroed.
  2. This assembly has unknowable side-effects, not affecting memory, but something else (flags? some CPU state?) that it has to care about, and whatever that something is might become wrong if we don't actually write the zeroes beforehand
  3. Afterwards, memory has not been clobbered, and flags have not been changed, so we do not have to re-read any cached values in registers or modify any flags.

So 3 is good because it will help the efficiency of code using this block.

But it's hard for me to understand: if we're promising the optimizer that flags are not changed, registers are not changed (only the in(reg) for ptr is read from), and memory is not changed, what is the side-effect that the compiler still has to worry about that prevents it from dropping this block? Or is the model just that it still has to assume there might be something beyond this that it doesn't understand?

@newpavlov
Member

newpavlov commented Jan 6, 2023

@cbeck88
AFAIK readonly applies only to the memory of the current process; from the compiler's PoV the asm block may in theory perform a syscall, and that would still be well within the bounds of the asm! options contract.

@cbeck88

cbeck88 commented Jan 6, 2023

I see, thank you

@cbeck88

cbeck88 commented Feb 25, 2023

It looks to me like one barrier to actually creating a patch that does this is the need for some form of the specialization language feature.

We currently have these impls:

/// Impl [`Zeroize`] on arrays of types that impl [`Zeroize`].
impl<Z, const N: usize> Zeroize for [Z; N]
where
    Z: Zeroize,
{
    fn zeroize(&mut self) {
        self.iter_mut().zeroize();
    }
}

impl<Z> Zeroize for [Z]
where
    Z: DefaultIsZeroes,

If we wanted to also add impls for the special case Z = u8, then these would be overlapping impls, and we obviously cannot remove the existing impls.


I had this same problem in a hashing library: https://github.com/mobilecoinfoundation/mobilecoin/blob/b07c17fd2e1946c15b3a38c30b0fb1dc4ab516e3/crypto/digestible/src/lib.rs#L218

What I decided to do there was simply not to support my trait on the primitive type u8, so that I could have a nice impl for [T] and a nice impl for [u8]. Since it's extremely rare that someone actually needs to put a single u8 in a struct, I never actually had a problem, and I was able to do this kind of optimization without waiting for specialization to land.
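Roughly, the shape of that approach (the trait, method, and impl bodies here are illustrative, not the actual library):

trait Digestible {
    fn append_to(&self, out: &mut Vec<u8>);
}

// A dedicated bulk impl for byte slices...
impl Digestible for [u8] {
    fn append_to(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(self);
    }
}

// ...and a generic element-wise impl for slices of other digestible types.
impl<T: Digestible> Digestible for [T] {
    fn append_to(&self, out: &mut Vec<u8>) {
        for item in self {
            item.append_to(out);
        }
    }
}

// Since there is no `impl Digestible for u8`, these two impls never overlap,
// so no specialization is needed.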

In your case you already shipped a stable API which supports u8, so you probably don't want to take this approach.

@tarcieri
Member

u8 is pretty commonly used as a carrier for masked boolean values when writing constant-time code

@cbeck88

cbeck88 commented Feb 25, 2023

Good point

So do you see a path forward for this? Or just wait for specialization?

It seems to me we could make an optimized "zeroize_bytes" function based on the approach here, and people could either implement Zeroize manually and call out to it, or we could add something like a #[zeroize(with = ...)] attribute to the zeroize-derive proc macro, similar to what serde does. Is that something you would support?
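For example, the manual-impl route might look something like this (zeroize_bytes is the hypothetical optimized free function; the stub below just reuses the asm-barrier idea from earlier in the thread):

use zeroize::Zeroize;

// Hypothetical optimized free function (not part of the zeroize API today).
fn zeroize_bytes(buf: &mut [u8]) {
    for b in buf.iter_mut() {
        *b = 0;
    }
    unsafe {
        // Single optimization barrier so the writes above cannot be elided.
        core::arch::asm!(
            "/* {ptr} */",
            ptr = in(reg) buf.as_mut_ptr(),
            options(nostack, readonly, preserves_flags),
        );
    }
}

struct SecretKey([u8; 32]);

impl Zeroize for SecretKey {
    fn zeroize(&mut self) {
        zeroize_bytes(&mut self.0);
    }
}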

@tarcieri
Member

tarcieri commented Feb 25, 2023

A free function works, although it is somewhat inelegant.

It might be possible to optimize the impl on [T] generically: pack as many copies of <T as Default>::default() as can fit into a quadword, write quadwords with movups until the remaining space in the slice is less than a quadword, then fall back to writing elements one at a time to fill the remaining space.

Something like this:

// Sketch only: `movups` and `slow_path` are placeholders.
const QUADWORD_SIZE: usize = 8;
const SIZE_OF_T: usize = mem::size_of::<T>();

if (SIZE_OF_T < QUADWORD_SIZE) && (QUADWORD_SIZE % SIZE_OF_T == 0) {
    const COPIES_IN_QUADWORD: usize = QUADWORD_SIZE / SIZE_OF_T;

    let quadword = [T::default(); COPIES_IN_QUADWORD];
    let mut iter = slice.chunks_exact_mut(COPIES_IN_QUADWORD);

    for chunk in &mut iter {
        movups(chunk, quadword.as_ptr() as *const u64);
    }

    slow_path(iter.into_remainder());
} else {
    slow_path(slice);
}

@cbeck88

cbeck88 commented Feb 25, 2023

Interesting idea, maybe this is a good use-case for https://doc.rust-lang.org/std/primitive.slice.html#method.align_to_mut

Looking more at the code, it looks like we already have this impl -- maybe not directly relevant to the OP:

/// Impl [`Zeroize`] on slices of types that can be zeroized with [`Default`].
///
/// This impl can eventually be optimized using a memset intrinsic,
/// such as [`core::intrinsics::volatile_set_memory`]. For that reason the
/// blanket impl on slices is bounded by [`DefaultIsZeroes`].
///
/// To zeroize a mut slice of `Z: Zeroize` which does not impl
/// [`DefaultIsZeroes`], call `iter_mut().zeroize()`.
impl<Z> Zeroize for [Z]
where
    Z: DefaultIsZeroes,
{
    fn zeroize(&mut self) {
        assert!(self.len() <= isize::MAX as usize);

        // Safety:
        //
        // This is safe, because the slice is well aligned and is backed by a single allocated
        // object for at least `self.len()` elements of type `Z`.
        // `self.len()` is also not larger than an `isize`, because of the assertion above.
        // The memory of the slice should not wrap around the address space.
        unsafe { volatile_set(self.as_mut_ptr(), Z::default(), self.len()) };
        atomic_fence();
    }
}

It's too bad that Vec<T>, Box<[T]>, and [T; n] don't seem to call into this impl; I guess they can't because they don't have the DefaultIsZeroes bound? And that's where we'd need negative trait bounds or specialization or something in order to let them do it in the cases where that bound does hold.

But maybe this is the place to do this kind of optimization work, and people can call foo.as_mut_slice().zeroize() or whatever when the perf matters, for now.
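For instance (illustrative):

use zeroize::Zeroize;

let mut key = vec![0x55u8; 64];
// Explicitly go through the slice impl rather than the Vec impl.
key.as_mut_slice().zeroize();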

@tarcieri
Member

The impls on slice types call volatile_set, so that seems like the proper place to provide an optimized implementation for filling slices

@cbeck88

cbeck88 commented Feb 25, 2023

Here's the thing -- I bet that volatile_set is already going to be well understood by the compiler and that splitting this up for alignment is going to be done by LLVM, so I feel like writing out the align_to_mut stuff is probably going to be wasted effort. (Could look at the code gen and see.) I know I've seen LLVM generate all of this before for memset calls.

Here's an idea: maybe we can refactor things so that we don't have special-casing based on the DefaultIsZeroes trait?

What if we made Zeroize like this:

pub trait Zeroize {
    /// The bit pattern that we will set objects of this type to when zeroizing them
    fn zero_pattern() -> Self;

    /// Write the zero pattern over an object, can't be optimized away
    fn zeroize(&mut self) {
        volatile_write(self, Self::zero_pattern());
        atomic_fence();
    }
}

Then the idea is we could have impl<Z> Zeroize for [Z] without any additional bounds, using the same code as above, and it would use Z::zero_pattern() rather than Z::default(). And we would similarly be able to implement Zeroize for Vec<Z>, Box<[Z]>, etc. in a way that calls into the optimized implementation for [Z].

I think we might not need to wait for specialization if we did that, or something along these lines.

@tarcieri
Member

tarcieri commented Feb 25, 2023

First, you're proposing breaking changes to a post-1.0 crate which exposes a trait-oriented API. Those would only be accepted as a last resort.

But also, zeroize is designed to work on structured types that don't necessarily implement Copy. That's why zeroize takes &mut self: so it can mutate data in-place.

DefaultIsZeroes exists to provide a strategy for zeroizing Copy types by replacing them with the Default value, but that approach doesn't work for non-Copy types, and zeroize is very much designed to support both.

@cbeck88

cbeck88 commented Feb 25, 2023

Yeah -- I see that; it's going to get more complicated when you want the implementation for [Z] not to assume that Z is Copy anymore, so that it can be parametric.

Intuitively, what I'd like to say is: what if we try to implement the general case where Z might have destructors, inserting calls to drop_in_place etc. as appropriate, which become no-ops when Z is Copy? We could try to be parametric over all types. At least it's not obvious to me that "that approach doesn't work for non-Copy types". Maybe it's obvious to you; you've thought about it more.

I understand it's a breaking change, just trying to think about directions that might make things simpler or better.

@tarcieri
Member

The big problem with any breaking change is that, like serde, zeroize contains traits which compose across many crates.

It has ~500 downstream dependents, and any kind of upgrade needs to be coordinated across all of them.

So breaking changes really need to be reserved for things that absolutely require them, and I don't see any reason that mandates such a drastic change.

it's going to get more complicated when you want the implementation for [Z] not to assume that Z is Copy anymore

That's not an issue. The impl of Zeroize for [Z] already bounds on DefaultIsZeroes, which in turn bounds on Copy, Default, and Sized, so we can write an optimized implementation which does a simple copy. It was designed that way from the start.

The impl on IterMut can be used to zeroize [Z] where Z is non-Copy, i.e. non_copy_slice.iter_mut().zeroize().

@cbeck88

cbeck88 commented Feb 26, 2023

I'm just trying to understand: even if we did optimize the impl Zeroize for [Z] where Z: DefaultIsZeroes, what, as a user, would I have to do to actually get it?

Because if I have code like the OP's:

let mut x = [7u8; 32];
x.zeroize();

it's not going to call the zeroize function for slices; it's going to call the [T; n] one.

Suppose I have code like

#[derive(Zeroize)]
struct A {
   a: Vec<u8>
}

This is not going to call the impl Zeroize for [Z] where Z: DefaultIsZeroes that we're proposing to optimize, which is kind of unfortunate. So optimizing this stuff feels kind of like wasted effort until that is somehow improved.

In the meantime, from experiments it looks like

let mut x = [7u8; 32];
(&mut x as &mut [u8]).zeroize();

will convince the compiler to pick the faster implementation, so maybe we could do the optimization we're talking about and document that that's what you have to do as a user if zeroizing one byte at a time isn't good enough?

And for a struct like A you'd probably have to implement Zeroize by hand, and you probably should only do that if you're sure your vec is never getting resized? Or you would have to marry the impl for Vec<Z> with the optimizations we have for slices with DefaultIsZeroes.
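For illustration, such a hand-written impl might look like this (it only wipes the current contents, not any spare capacity left behind by earlier reallocations):

use zeroize::Zeroize;

struct A {
    a: Vec<u8>,
}

impl Zeroize for A {
    fn zeroize(&mut self) {
        // Route through the slice impl explicitly to get the bulk path.
        self.a.as_mut_slice().zeroize();
    }
}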

It would be nice if we could make this easier for the user of the crate somehow, but I guess I don't see a way if we're not interested in discussing zeroize 2.0 here. Thanks.

@tarcieri
Member

tarcieri commented Feb 26, 2023

what as a user would I have to do to actually get it.

Borrow the value as a mutable slice, then call zeroize.

It's a bit of a pain, but the best we can do without specialization.

guess I don't see a way if we're not interested in discussing zeroize 2.0 here

I still don't see anything worth making breaking changes over until specialization lands, and even then it would probably still make sense to have a specialization feature which adds the specialized impls.

cbeck88 added a commit to cbeck88/utils that referenced this issue Feb 28, 2023
The purpose of the change is to make calls to `x.as_mut_slice().zeroize()`
considerably faster, particularly for types like `[u8; n]`.

The reason it becomes faster is that the call to `volatile_set` before
this change appears not to be easily optimizable, and (for example) leads
to setting bytes one at a time, instead of the compiler consolidating them
into SIMD instructions.

In the modified code we don't use `volatile_set`; instead we loop over
the slice setting the elements to `Default::default()`, and to ensure
that the writes are not optimized out, we use an empty asm block.
There is discussion of the correct asm options to use in the issue.

Because, as far as the compiler knows, the asm block may read from the
pointer and make a syscall of some kind, it cannot optimize out the
zeroizing without risking observable side-effects. In the improved code
we only create such an optimization barrier once, rather than after each
byte that is written.

The call to `atomic_fence()` is not changed.

---

This change may give users a way to improve performance if they have to
zeroize very large objects, or frequently have to zeroize many small
objects. We tested code-gen in godbolt (in addition to the tests posted
in the GitHub issue) and found that this change is typically enough for
LLVM to start emitting SIMD optimizations that zero many bytes at once.
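A hedged sketch of the shape of the change described in that commit message (not the actual patch; the DefaultIsZeroes bound and the SeqCst fence mirror the existing slice impl):

use core::sync::atomic::{fence, Ordering};
use zeroize::DefaultIsZeroes;

fn zeroize_with_barrier<Z: DefaultIsZeroes>(slice: &mut [Z]) {
    // Plain writes that LLVM is free to vectorize or turn into a memset...
    for elem in slice.iter_mut() {
        *elem = Z::default();
    }
    // ...followed by a single optimization barrier so they cannot be elided.
    unsafe {
        core::arch::asm!(
            "/* {ptr} */",
            ptr = in(reg) slice.as_mut_ptr(),
            options(nostack, readonly, preserves_flags),
        );
    }
    fence(Ordering::SeqCst);
}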