
Implement Fill for [MaybeUninit<T>] #1080

Closed
kangalio opened this issue Dec 19, 2020 · 25 comments
Labels: P-postpone (Waiting on something else)

Comments

@kangalio

Background

What is your motivation? Right now, the Rng::fill function can only write to buffers that are already initialized, because passing uninitialized memory as a &mut [T] is undefined behavior. With an AsByteSliceMut implementation for [MaybeUninit<T>], rand could write vast amounts of random numbers into memory without having to zero it out beforehand.

What type of application is this? Fast generation of vast amounts of numbers

Feature request

The rand::AsByteSliceMut trait should be implemented for all [MaybeUninit<T>] where T is one of the integer types. In other words, just as you can fill a &mut [u32]/&mut [i32]/&mut [usize]/... or a &mut [Wrapping<u32>]/&mut [Wrapping<i32>]/&mut [Wrapping<usize>]/..., rand should support filling a &mut [MaybeUninit<u32>]/&mut [MaybeUninit<i32>]/&mut [MaybeUninit<usize>]/...
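For illustration, here is roughly what this would enable (a sketch: the fill call over a MaybeUninit slice is the hypothetical part and does not exist in rand today):

use std::mem::MaybeUninit;
use rand::Rng;

let mut rng = rand::thread_rng();

// Sound: an array of MaybeUninit<u32> may itself be left uninitialized.
let mut buf: [MaybeUninit<u32>; 1024] = unsafe { MaybeUninit::uninit().assume_init() };

// Hypothetical: fill the slice with random values, with no zeroing beforehand.
rng.fill(&mut buf[..]);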

@dhardy
Member

dhardy commented Dec 20, 2020

Hmm, the problem there is that implementing AsByteSliceMut for [MaybeUninit<T>] gives a "safe" way to read uninitialised memory — as such the implementation would violate memory safety rules around unsafe, even if typical usage wouldn't.

Rand v0.8 has already replaced this trait with Fill which doesn't have this problem, so implementing that is fine (should you like to make a PR).

(Note also that to use MaybeUninit you have to use unsafe anyway, so you could just transmute to &mut [u8] yourself.)

@kangalio
Author

kangalio commented Dec 20, 2020

Rand v0.8 has already replaced this trait with Fill which doesn't have this problem, so implementing that is fine

Oh damn, that's what I get for forgetting to click the "Go to latest version" button on docs.rs.

That's good news then, I'm glad there's nothing in the way of this feature.

(Note also that to use MaybeUninit you have to use unsafe anyway, so you could just transmute to &mut [u8] yourself.)

Actually no, because it's undefined behavior in Rust to produce a mutable reference to uninitialized memory, even if it's never read through: a &mut [u8] asserts that the bytes behind it are initialized. (And for a type with drop glue, assigning through a &mut T would additionally try to drop the old, uninitialized value.) So a dedicated Fill implementation for [MaybeUninit<T>] is unfortunately required.

@kangalio
Author

kangalio commented Dec 20, 2020

Question about the source code: in the Fill trait's try_fill implementations, after the slice has been filled with random data, all the elements are converted to little endian. Why? The bytes are completely random anyway, so it makes no difference whether they're little-endian or big-endian.

rand/src/rng.rs, lines 350 to 358 at 98a1aaf:

rng.try_fill_bytes(unsafe {
    slice::from_raw_parts_mut(
        self.as_mut_ptr() as *mut u8,
        self.len() * mem::size_of::<$t>(),
    )
})?;
for x in self {
    *x = x.to_le();
}

@newpavlov
Member

This trait is also used by PRNGs and for some applications (e.g. games) it's important to keep RNG output reproducible on different platforms.
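For illustration, a minimal example of the portability concern (not code from rand):

let bytes = [0x01u8, 0x02, 0x03, 0x04];

// Interpreting the same bytes in native order gives different values per platform:
let native = u32::from_ne_bytes(bytes); // 0x0403_0201 on LE, 0x0102_0304 on BE

// Fixing a little-endian interpretation makes the result reproducible:
let portable = u32::from_le_bytes(bytes); // 0x0403_0201 everywhere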

@kangalio
Author

Currently, the API of RngCore can only fill &mut [u8]. However, it is undefined behavior to have a &mut [u8] pointing to uninitialized memory, therefore RngCore in its current state does not support filling uninitialized bytes.

In order to implement the feature proposed in this issue, a new method would need to be added to RngCore:

fn try_fill_uninitialized_bytes(&mut self, dest: &mut [MaybeUninit<u8>]) -> Result<(), Error>

What do the rand maintainers think about this suggestion?

@kangalio changed the title from "Implement AsByteSliceMut for [MaybeUninit<T>]" to "Implement Fill for [MaybeUninit<T>]" on Dec 20, 2020
@dhardy
Member

dhardy commented Dec 21, 2020

I'd really prefer not to go that route since (a) we want to keep RngCore minimal and (b) avoid breaking changes to rand_core.

How much performance overhead is there if you copy to a buffer first? The compiler may even optimise the extra copy away, and in many cases the perf. isn't critical anyway.
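For illustration, that workaround might look like this (a sketch; fill_uninit_via_copy is a hypothetical helper, not rand API). It fills a small initialized scratch buffer and copies out through raw pointers, never forming a &mut [u8] over the uninitialized region:

use std::mem::MaybeUninit;
use rand::RngCore;

fn fill_uninit_via_copy<R: RngCore>(rng: &mut R, dest: &mut [MaybeUninit<u8>]) {
    let mut tmp = [0u8; 64]; // initialized scratch buffer
    let mut written = 0;
    while written < dest.len() {
        let n = (dest.len() - written).min(tmp.len());
        rng.fill_bytes(&mut tmp[..n]);
        // SAFETY: `dest` is valid for writes of `dest.len()` bytes; the copy
        // initializes `n` bytes starting at offset `written`.
        unsafe {
            std::ptr::copy_nonoverlapping(
                tmp.as_ptr(),
                dest.as_mut_ptr().cast::<u8>().add(written),
                n,
            );
        }
        written += n;
    }
}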

@dhardy
Member

dhardy commented Dec 21, 2020

Of course at this point you might as well just use a zeroed array to start with and avoid using MaybeUninit at all.

If we really need to redesign the core RNG for better performance here, it would be better to look at making BlockRngCore::generate take ptr: *mut u8, len: usize and use this trait directly for filling the uninit array. But I don't think we need to consider that yet. (And in this case, you could just consider your use-case special and write a custom RNG for it. But only if you really need to fully optimise this case.)
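For concreteness, a hypothetical shape for that redesign (this is not the actual rand_core trait, whose generate fills a typed Results buffer):

pub trait RawBlockRngCore {
    /// Write exactly `len` freshly generated bytes to `ptr`.
    ///
    /// # Safety
    /// `ptr` must be valid for writes of `len` bytes; the memory it points
    /// to need not be initialized.
    unsafe fn generate(&mut self, ptr: *mut u8, len: usize);
}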

@vks
Collaborator

vks commented Dec 21, 2020

This should probably be backed up by a benchmark demonstrating the performance advantage.

@newpavlov
Member

We should probably also wait for rust-lang/rust#78485 to be implemented and use the proposed ReadBuf type.

@dhardy
Member

dhardy commented May 27, 2021

The ReadBuf RFC now has an impl PR but it is not merged. Assuming this does get stabilised, we could add a method like the following to RngCore:

fn try_fill_read_buf(&mut self, buf: &mut ReadBuf<'_>) -> Result<(), Error> {
    // initialize_unfilled() zeroes the unfilled portion before handing it
    // out as &mut [u8]; an overriding implementation could skip that.
    self.try_fill_bytes(buf.initialize_unfilled())
}

This is non-breaking and fairly low complexity, but still an addition to what is supposed to be a simple core trait.

What are our thoughts on this? Mine is that a benchmark demonstrating significant improvement is required. I guess it would be interesting to look at both something like SmallRng and something stronger like ChaCha8Rng.

@newpavlov
Member

I think it may be useful to have fill methods for u32/u64. In some (AFAIK relatively rare) cases they may result in a significant acceleration, especially for PRNGs like ChaCha which can use SIMD to produce several blocks in parallel. They should also result in fewer memory stores if a PRNG's state fits into registers. Those methods could replace the next_* methods: by marking the fill methods #[inline], the &mut [T; 1] case should be trivially optimizable by the compiler. They could also be introduced gradually (first implemented in terms of next_*, then deprecating the next_* methods, and eventually removing them in the next breaking release), limiting the migration churn.
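For illustration, the first gradual step might look like this (a sketch with assumed names, not the actual RngCore definition):

pub trait RngCore {
    fn next_u32(&mut self) -> u32;

    /// Hypothetical addition: a default implementation in terms of next_u32.
    /// A SIMD-capable PRNG could override this to produce several values per
    /// call; #[inline] lets the compiler reduce the &mut [u32; 1] case back
    /// to a single next_u32.
    #[inline]
    fn fill_u32(&mut self, dest: &mut [u32]) {
        for x in dest {
            *x = self.next_u32();
        }
    }
}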

I am significantly less sure about the benefits of supporting uninitialized memory. First, it should be demonstrated that zero-filling initialization indeed does not get optimized out for our PRNGs; I will not be surprised if it does (but note that the compiler should not be able to perform such an optimization for OsRng). And even if it does not, I think the potential performance loss will be minuscule, especially if user code properly reuses buffers.

@vks
Collaborator

vks commented May 27, 2021

@newpavlov Isn't fill_bytes supposed to fill that niche? I suppose that it might not due to alignment issues.

I think it might make sense to improve Distribution to better support vectorization as well.

@newpavlov
Member

@vks
It does indeed fill this niche right now, but its drawbacks are not only the alignment issue (which can reduce performance a bit), but also the fact that we have to compensate for endianness outside the fill method, which on BE targets results in a two-pass approach. Plus, I don't quite like the possibility of getting different results when filling a slice via the next_* methods versus via the Fill trait.

@vks
Collaborator

vks commented Jun 13, 2021

Maybe a good alternative is to provide something like DistString from #1133 for [u8; N] and Vec<u8> instead?
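For illustration, such an alternative might look like this (a hypothetical trait with assumed names, loosely modeled on DistString; not part of rand or #1133):

use rand::Rng;

pub trait DistBytes {
    fn append_bytes<R: Rng + ?Sized>(&self, rng: &mut R, buf: &mut Vec<u8>, len: usize);

    fn sample_bytes<R: Rng + ?Sized>(&self, rng: &mut R, len: usize) -> Vec<u8> {
        let mut buf = Vec::with_capacity(len);
        self.append_bytes(rng, &mut buf, len);
        buf
    }
}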

@SUPERCILEX
Contributor

I threw together a quick benchmark:

use std::mem::MaybeUninit;

use criterion::Criterion;
use rand::{RngCore, SeedableRng};
use rand_xorshift::XorShiftRng;

fn tmp(c: &mut Criterion) {
    c.bench_function("tmp", |b| {
        let mut rng = XorShiftRng::from_entropy();
        b.iter(|| {
            let mut buf = [0u8; 4096];
            rng.fill_bytes(&mut buf);
            buf
        });
    });

    c.bench_function("pmt", |b| {
        let mut rng = XorShiftRng::from_entropy();
        b.iter(|| {
            // I know this is UB, but there's no way around this right now
            let mut buf: [u8; 4096] = unsafe { MaybeUninit::uninit().assume_init() };
            rng.fill_bytes(&mut buf);
            buf
        });
    });
}

I'm seeing a consistent 0.1µs difference between the two, which I guess isn't all that much (it's still measurable though, and would presumably get worse as the buffer size increases). I feel like it'd be a huge bummer to make it impossible to use MaybeUninit without UB. What's the argument against adding these two methods with default implementations?

fn fill_uninit_bytes(&mut self, dest: &mut [MaybeUninit<u8>]);
fn try_fill_uninit_bytes(&mut self, dest: &mut [MaybeUninit<u8>]) -> Result<(), Error>;

People who know what they're doing (this repo) would implement those methods, but other people can just ignore them and pay the cost of zeroing everything out.
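For illustration, the zero-and-fill default implementation could plausibly look like this (a sketch using the proposed method name; the body is an assumption, not rand code):

fn try_fill_uninit_bytes(&mut self, dest: &mut [MaybeUninit<u8>]) -> Result<(), Error> {
    unsafe {
        // Zero the memory through the raw pointer; after this the bytes are
        // initialized, so forming a &mut [u8] over them is sound.
        core::ptr::write_bytes(dest.as_mut_ptr(), 0u8, dest.len());
        let buf = core::slice::from_raw_parts_mut(dest.as_mut_ptr().cast::<u8>(), dest.len());
        self.try_fill_bytes(buf)
    }
}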

@dhardy
Member

dhardy commented Feb 8, 2022

            // I know this is UB, but there's no way around this right now

Interesting benchmark. Does it do what you think it does?

Quips aside, saving 100ns isn't sufficient justification for adding new API.

We could however try building an API around ReadBuf now (nightly-only and feature-flagged).

@SUPERCILEX
Contributor

Playing with this on godbolt, you get basically the same assembly apart from the memset for zero-initialization (also, I'm using the UB method in my project with tests that pass, so it seems fine for now):

# zero-initialized buffer: note the memset call
example::test:
        push    rbx
        mov     eax, 4096
        call    __rust_probestack
        sub     rsp, rax
        mov     rbx, rsp
        mov     edx, 4096
        mov     rdi, rbx
        xor     esi, esi
        call    qword ptr [rip + memset@GOTPCREL]
        add     rsp, 4096
        pop     rbx
        ret

# uninitialized buffer: the memset is gone
example::test:
        mov     eax, 4096
        call    __rust_probestack
        sub     rsp, rax
        mov     rax, rsp
        add     rsp, 4096
        ret

Quips aside, saving 100ns isn't sufficient justification for adding new API.

I was kind of thinking that too, but those 100ns are way more significant if you look at it from the percentage perspective:

zero_init                     time:   [1.5549 us 1.5631 us 1.5724 us]
uninit                        time:   [1.4582 us 1.4675 us 1.4770 us]

That's 6% lost for nothing.


Regarding the ReadBuf stuff, I don't understand its relevance. ReadBuf is a higher-level API that offers users access to either an uninitialized buffer or an initialized version as needed. Why would we force people to use that when they can just pass in a MaybeUninit buffer (which can come from a ReadBuf if they want)?


Regarding the API, why is there resistance against something that consumers don't have to implement? That doesn't seem like bloat to me. Also, the current API forces users to leave performance on the table, which I'm strongly against: you should always have the choice to go lower level if you'd like.

@dhardy
Member

dhardy commented Feb 9, 2022

Regarding the API, why is there resistance against something that consumers don't have to implement?

It's still extra complexity, which requires some justification.

1.46μs vs 1.56μs is potentially valid justification, but it's a micro-benchmark and basically the best-case scenario for this change. 100ns on its own is nothing, so this is only significant if (a) repeated often or (b) a much larger buffer is filled. But, if repeating often, the same buffer should be reused, so (a) is out. With a larger buffer, probably the only reason you'd do that is to write to disk, but then the disk writes will be much more significant than the extra initialisation time. Hence, 6% faster in this benchmark is on its own very weak grounds for any new addition. Demonstrating a 5% improvement in a real-world example which saves non-negligible amounts of CPU time would be much stronger justification.

@SUPERCILEX
Contributor

That's a fair point. I came up with a somewhat convoluted benchmark that would be best case for disk writes as it doesn't reuse the buffer at all:

use std::io::Write;

use criterion::Criterion;
use rand::{RngCore, SeedableRng};
use rand_xorshift::XorShiftRng;

fn maybe(c: &mut Criterion) {
    let mut g = c.benchmark_group("maybe");
    g.sample_size(1000);

    g.bench_function("init", |b| {
        let mut rng = XorShiftRng::from_entropy();
        let mut file = std::fs::File::create("/tmp/init.bench").unwrap();
        b.iter(|| {
            let mut buf = [0u8; 4096];
            rng.fill_bytes(&mut buf);
            file.write_all(&buf).unwrap();
        });
        std::fs::remove_file("/tmp/init.bench").unwrap();
    });

    g.bench_function("uninit", |b| {
        let mut rng = XorShiftRng::from_entropy();
        let mut file = std::fs::File::create("/tmp/uninit.bench").unwrap();
        b.iter(|| {
            let mut buf: [u8; 4096] = unsafe { std::mem::MaybeUninit::uninit().assume_init() };
            rng.fill_bytes(&mut buf);
            file.write_all(&buf).unwrap();
        });
        std::fs::remove_file("/tmp/uninit.bench").unwrap();
    });
}

Note that I changed machines so everything is way slower.

maybe/init              time:   [18.179 us 18.407 us 18.631 us]
maybe/uninit            time:   [17.311 us 17.619 us 17.924 us]
4.37462% difference

Perhaps more interesting is finding the number of iterations it takes to make that lost performance insignificant by reusing the buffer N times:

use std::io::Write;

use criterion::Criterion;
use rand::{RngCore, SeedableRng};
use rand_xorshift::XorShiftRng;

fn maybe(c: &mut Criterion) {
    let mut g = c.benchmark_group("maybe");
    g.sample_size(1000);

    const N: usize = 10;
    let mut rng = XorShiftRng::from_entropy();
    let mut file = std::fs::File::create("/tmp/maybe.bench").unwrap();

    g.bench_function("init", |b| {
        b.iter(|| {
            let mut buf = [0u8; 4096];
            for _ in 0..N {
                rng.fill_bytes(&mut buf);
                file.write_all(&buf).unwrap();
            }
        });
    });

    g.bench_function("uninit", |b| {
        b.iter(|| {
            let mut buf: [u8; 4096] = unsafe { std::mem::MaybeUninit::uninit().assume_init() };
            for _ in 0..N {
                rng.fill_bytes(&mut buf);
                file.write_all(&buf).unwrap();
            }
        });
    });

    std::fs::remove_file("/tmp/maybe.bench").unwrap();
}

I couldn't get any consistent numbers, but at 10 iterations it takes somewhere around 200µs to complete. If we use the 1-2µs difference from the first benchmark, 200µs is the threshold where the cost of zero initialization drops just below 1%, meaning we have to use the buffer 10 times before the cost becomes insignificant. Realistically, I would argue that you have to use the buffer ~100 times before it becomes truly insignificant (0.1%).

Honestly not really sure if these benchmarks are useful since they're so finicky, but I thought it'd be worth sharing what I've been playing around with.

@dhardy
Member

dhardy commented Feb 13, 2022

Thanks for the extra benchmarking. Summary: 4.4% or 800ns impact in the most significant case, and it's still something that doesn't exactly scale (unless you have code that writes thousands of random files but doesn't share the buffer used for some reason).

If it were some internal change, then sure, but I'm still not convinced this justifies API additions.

@SUPERCILEX
Contributor

Yeah, that's fair. BTW, is the API addition issue specifically around RngCore, or just in general? I'm wondering if it could make sense to add an RngExt trait (maybe behind a feature flag?) with the uninit stuff. Thoughts?

@dhardy
Member

dhardy commented Feb 14, 2022

Honestly, it's about keeping the complexity of the library in check. You seem oddly committed to getting this optimisation in somehow, so I'm not saying 100% no, just that it'll take more to convince me — but personally, I'd drop the idea. We're spending more CPU time discussing this issue than I can see it saving.

I don't want an RngExt trait for this in Rand, no. Also, Ext traits are often extra functionality over a base trait which is auto-implemented for all types (like our Rng trait is — I dislike the naming we picked now).

@SUPERCILEX
Contributor

I guess what bothers me is that it's impossible to do this without UB. I also don't feel like 2 extra methods and some transmutes to bridge the gap add much complexity, but since neither of us is budging, I'll drop it. :)

@SUPERCILEX
Contributor

@dhardy based on the latest ReadBuf developments, #1241 should be closed and this issue should be renamed to something along the lines of "Implement the read_buf method of Read" in this impl:

impl std::io::Read for dyn RngCore

@dhardy
Member

dhardy commented Oct 15, 2022

@SUPERCILEX thanks for the comments and benches. In the end, I think we won't do this, at least not unless there is good real-world justification. Your benchmarks are interesting, but not quite enough to convince me of its utility.

The original motivation was this:

What type of application is this? Fast generation of vast amounts of numbers

... in which case, re-use of buffers should surely make the savings from using uninit memory insignificant.

Regarding Read::read_buf, it should not be considered, since RngCore is no_std capable; std::io::Read isn't part of core.

The core issue is whether to add a method like try_fill_uninitialized_bytes as @kangalio mentioned above.
