
[Question] Next steps about vtable and mmap? #359

Open
quark-zju opened this issue Jan 17, 2020 · 17 comments

@quark-zju

I have been looking for a zero-copy way of sharing mmap content via Bytes. I noticed the vtable feature added by #294 and didn't find follow-ups.

I wonder if bytes is interested in getting an Arc<memmap::Mmap> version implemented (probably gated by an mmap feature that is off by default), or if other approaches are preferred (e.g. exposing the vtable interface and implementing the mmap support in other crates).

I personally think there are very few types that meaningfully fit the vtable interface (I can hardly think of one other than mmap). So it seems easier for end users if bytes just implements them all (using features to keep the default deps slim). But I can see concerns about memmap being unmaintained.

If the next steps are clear, I can help implement them.
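
(For context, a minimal illustration of the building block in question, not a proposed API: an Arc'd memmap::Mmap already hands out zero-copy &[u8] views; the open question is how to surface that through Bytes.)

use std::fs::File;
use std::sync::Arc;

// Illustration only: an Arc'd mmap can hand out &[u8] views with no copying.
fn open_shared(path: &str) -> std::io::Result<Arc<memmap::Mmap>> {
    let file = File::open(path)?;
    // Safety: the file must not be truncated while it is mapped.
    let map = unsafe { memmap::Mmap::map(&file)? };
    Ok(Arc::new(map))
}

fn view(map: &Arc<memmap::Mmap>, start: usize, end: usize) -> &[u8] {
    &map[start..end] // Mmap derefs to [u8]; this is a borrow, not a copy
}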

@spl
Contributor

spl commented Feb 20, 2020

For what it's worth, I'm also interested in a Bytes-like interface to a memory-mapped region.

@quark-zju
Author

Hi @spl, FYI I ended up reinventing it. It's not a lot of code if Arc<dyn Trait> is used.
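
(For reference, a minimal sketch of the Arc<dyn Trait> shape described here. The BytesOwner name mirrors quark-zju's library, discussed below, but the code itself is illustrative rather than the actual implementation.)

use std::sync::Arc;

pub trait BytesOwner: AsRef<[u8]> + Send + Sync + 'static {}

impl BytesOwner for Vec<u8> {}
// A small wrapper around memmap::Mmap would be the zero-copy case.

pub struct SharedBytes {
    owner: Arc<dyn BytesOwner>,
    offset: usize,
    len: usize,
}

impl SharedBytes {
    pub fn from_owner(owner: impl BytesOwner) -> Self {
        let len = owner.as_ref().len();
        SharedBytes { owner: Arc::new(owner), offset: 0, len }
    }

    pub fn as_slice(&self) -> &[u8] {
        let full: &[u8] = (*self.owner).as_ref();
        &full[self.offset..self.offset + self.len]
    }

    // Cloning the Arc (not the bytes) gives cheap zero-copy slices.
    pub fn slice(&self, start: usize, end: usize) -> Self {
        assert!(start <= end && end <= self.len);
        SharedBytes { owner: self.owner.clone(), offset: self.offset + start, len: end - start }
    }
}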

@spl
Contributor

spl commented Feb 25, 2020

@quark-zju Nice! Your BytesOwner is a neat approach. I drafted something using the following:

use std::sync::Arc;
use memmap::Mmap;

enum Buf {
    Mmap(Mmap),
    Static(&'static [u8]),
    Vec(Vec<u8>),
}

pub struct SharedBuf {
    buf: Arc<Buf>,
    offset: usize,
    len: usize,
}

Of course, this wasn't intended for general-purpose use as a library.
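
(Purely for illustration, extending the snippet above: a hypothetical accessor over that enum could look like the following. The method is mine, not part of the actual draft.)

impl SharedBuf {
    fn as_slice(&self) -> &[u8] {
        let full: &[u8] = match &*self.buf {
            Buf::Mmap(m) => &m[..],      // memmap::Mmap derefs to [u8]
            Buf::Static(s) => *s,
            Buf::Vec(v) => v.as_slice(),
        };
        &full[self.offset..self.offset + self.len]
    }
}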

@spl
Contributor

spl commented Feb 25, 2020

Could bytes provide something like the following from @quark-zju's library?

impl Bytes {
    pub fn from_owner(value: impl BytesOwner) -> Self { ... }
}

pub trait BytesOwner: AsRef<[u8]> + Send + Sync + 'static {}
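
(With such an API, mmap support could live entirely outside bytes. A rough, hypothetical usage sketch follows; MmapOwner is an illustrative wrapper, not anything from an existing crate, and from_owner is the proposed method above.)

use std::fs::File;
use bytes::Bytes;

struct MmapOwner(memmap::Mmap);

impl AsRef<[u8]> for MmapOwner {
    fn as_ref(&self) -> &[u8] {
        &self.0 // Mmap derefs to [u8]
    }
}

impl BytesOwner for MmapOwner {}

fn mmap_bytes(path: &str) -> std::io::Result<Bytes> {
    let file = File::open(path)?;
    // Safety: the file must not be truncated while mapped.
    let map = unsafe { memmap::Mmap::map(&file)? };
    // Hypothetical: relies on the proposed Bytes::from_owner above.
    Ok(Bytes::from_owner(MmapOwner(map)))
}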

@scottlamb

I personally think there are very few types that meaningfully fit the vtable interface (I can hardly think of one other than mmap).

Here's some food for thought. Moonfire NVR currently uses a bytes::Buf implementation that's a wrapper around reffers::ARefss<'static, [u8]>. It puts these into a hyper::body::HttpBody implementation and sends them to hyper. It does two things with this:

1. mmap

It does this today, but I'm not really sure it's a good idea because of the page fault problem. As lucio-rs suggested here, it might be better to simply copy stuff from the mmap()ed region to a "normal" Bytes rather than having an mmap-backed Bytes. My current approach of not dealing with this at all means that the tokio threads may stall on page faults, which is not so great. It's really not good enough to have a Bytes object for an mmap()ed region that is only special when loading and unloading, and not also when reading from it.

2. more complex shared ownership / minimizing allocations

tl;dr version: there's one large object, moonfire_nvr::mp4::File, which lives for the entire response, is reference-counted as a whole via struct File(Arc<FileInner>), and has a small number of Vecs that turn into a larger number of HTTP response chunks. I'd need more allocations (bytes::bytes::Shared ones) to duplicate what it does with bytes::Bytes as it exists today. (A rough sketch of this ownership shape follows the detail list below.)

Some detail, in case you're curious:

mp4.rs constructs a .mp4 file on-the-fly, including:

  • chunks from a "global" buf: Vec<u8> which has a bunch of dynamic file metadata (including stuff like length of structures, timestamps, etc) that gets stuffed between other parts of the file. It's nice to have that backed by the overall mp4::File rather than have a separate bytes::bytes::Shared allocation.
  • actual chunks of video frame data. Each segment comes from a different file with ~1 minute of H.264-encoded video frames in it.
  • indexes for each segment. There are three kinds of index needed for each segment (stts = sample time table, stsz = sample size table, stss = sync sample table). It's most efficient to generate all the indexes for a segment at once (but only if the HTTP request actually asks for a byte range that overlaps with at least one of them) and stuff them into one Vec. It's nice to not need another per-segment allocation for a bytes::bytes::Shared.
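
(As referenced above, a rough, hypothetical sketch of that ownership shape; the field names are illustrative, not Moonfire NVR's actual code. One Arc'd File holds the buffers, and many response chunks are (offset, len) views into it.)

use std::ops::Range;
use std::sync::Arc;

struct FileInner {
    // The "global" buf with dynamic metadata, plus per-segment index Vecs, etc.
    buf: Vec<u8>,
}

#[derive(Clone)]
struct File(Arc<FileInner>);

// A chunk carries no lifetime; it keeps the whole File alive and remembers a range.
struct Chunk {
    file: File,
    range: Range<usize>,
}

impl Chunk {
    fn as_slice(&self) -> &[u8] {
        &self.file.0.buf[self.range.clone()]
    }
}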

Here's the Debug output on a typical mp4::File. This structure gets generated on every HTTP request, then chunks may or may not actually get generated and served depending on the byte range the browser asked for.

@LucioFranco
Member

My current approach of not dealing with this at all means that the tokio threads may stall on page faults, which is not so great.

This effect may honestly not be horrible; if you're on modern hardware it should be somewhat fast. Really prolonged blocking is the issue.

@scottlamb

By modern hardware I assume you mean SSD? Maybe so in that case. I have terabytes of files on cheap hardware, so I use spinning disks instead. I typically assume a seek takes about 10 ms, but I just benchmarked it on this system at closer to 20 ms. Wikipedia says a seek took 20 ms in the 1980s, so basically on this metric I might as well be using 40-year-old hardware. And even the fastest modern hard drives can't do an order of magnitude better. The physics are prohibitive.

@LucioFranco
Member

Wikipedia says a seek took 20 ms in the 1980s, so basically on this metric I might as well be using 40-year-old hardware.

Haha, taking commodity hardware to a new level!

You are right, you can't assume what those times might be.

cc @seanmonstar on this issue since he wrote most of the vtable stuff.

@quark-zju
Author

quark-zju commented Jun 20, 2020

@scottlamb I have some questions about the mp4::File use-case.

When constructing the Bytes backed by mp4::File, would it read the entire content of the file as an ordinary Bytes backed by Vec<u8> immediately or lazily?

If immediately, what is the benefit of not closing the File after reading? If the File is closed, it seems an ordinary Bytes fits the use case.

If lazily, how would it actually implement the vtable interface? The current vtable struct needs a plain ptr: *const u8 and len: usize. What would their values be?

@scottlamb

When constructing the Bytes backed by mp4::File, would it read the entire content of the file as an ordinary Bytes backed by Vec<u8> immediately or lazily?

Neither. There's not a single chunk (~ one Bytes struct) for the whole file. An mp4::File produces a stream of many chunks. In the debug output I linked above, that file had 95 slices. When an HTTP request comes in, my code examines the requested byte range; there's a chunk for the portion of each slice that overlaps with that request, so up to 95 chunks in the stream. Chunks are created on demand via a bounded channel as hyper consumes previous chunks. Each is immediately usable once created.

"Immediately usable"...except for the problem of file-backed chunks causing major page faults that wait for a disk seek. Unless folks have a better idea, I plan to either abandon mmap entirely or copy from a mmaped region to a heap-backed Bytes instead to avoid this. I once was thinking of having a Bytes to represent a mlocked chunk of a mmaped file, but having a bunch of mlock/munlock calls doesn't seem like a good idea at all post-Meltdown.

The best thing long-term, IMHO, would be passing around something that causes an IORING_OP_SPLICE from the file to the HTTP stream (with KTLS for https), with some fallback for older Linux kernels and non-Linux OSs. But I assume that's out of scope for the bytes crate, and the ecosystem as a whole needs a lot of work before it's possible...
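
(To make the chunk-streaming description above concrete, a hedged sketch of the general pattern: chunks are produced on demand into a bounded channel and consumed as a body stream. The names and the read_chunk helper are hypothetical, not Moonfire's actual code.)

use bytes::Bytes;
use tokio::sync::mpsc;

type ChunkResult = Result<Bytes, std::io::Error>;

fn spawn_chunk_producer(ranges: Vec<std::ops::Range<u64>>) -> mpsc::Receiver<ChunkResult> {
    // Small bound: the producer stalls until the consumer (e.g. hyper) takes previous chunks.
    let (tx, rx) = mpsc::channel(1);
    tokio::spawn(async move {
        for range in ranges {
            let chunk = read_chunk(range).await; // hypothetical helper that reads one slice
            if tx.send(chunk).await.is_err() {
                break; // client went away
            }
        }
    });
    rx
}

async fn read_chunk(_range: std::ops::Range<u64>) -> ChunkResult {
    // Placeholder: real code would read the overlapping portion of a slice here,
    // and can fail (e.g. the backing file was deleted).
    Ok(Bytes::new())
}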

@quark-zju
Author

quark-zju commented Jun 20, 2020

Thanks for the explanation. I think the Bytes::as_ref(&self) -> &[u8] API makes it unsuitable for non-contiguous buffers, and it is hard to change that API now. For a complex backend backed by many chunks, perhaps other abstractions like io::Seek + io::Read, or their async versions, would be a better fit.

@seanmonstar
Member

Buf is a trait in this crate that has fine support for non-contiguous buffers. The Bytes type is meant to be an API that allows easy sharing and slicing of a single contiguous buffer.
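
(A small illustration of that, using the current bytes method names: Chain presents two buffers as one logical, non-contiguous sequence of chunks.)

use bytes::Buf;

fn main() {
    let a: &[u8] = b"hello ";
    let b: &[u8] = b"world";
    let mut buf = a.chain(b); // non-contiguous: two underlying slices

    let mut out = Vec::new();
    while buf.has_remaining() {
        let chunk = buf.chunk();          // current contiguous piece
        out.extend_from_slice(chunk);
        let n = chunk.len();
        buf.advance(n);
    }
    assert_eq!(out, b"hello world");
}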

@scottlamb

scottlamb commented Jun 20, 2020

I don't think Bytes being contiguous is a problem for me.

I built Entity::get_range on top of streams of chunks. While I'm using Buf today, it's a simple implementation over contiguous buffers, just like Bytes, as you can see here:

https://github.com/scottlamb/moonfire-nvr/blob/6b5359b7cb1502c4e24f9ae7fa81a2ceac9a4f23/src/body.rs#L69-L75

hyper will often batch up my chunks into a single syscall, as you can see from the strace output:

8638  writev(42, [{iov_base="HTTP/1.1 206 Partial Content\r\nac"..., iov_len=449}, {iov_base="\0\0\0 ftypisom\0\0\2\0isomiso2avc1mp41", iov_len=32}, {iov_base="\0\24\r,moov\0\0\0xmvhd\1\0\0\0\0\0\0\0\333\23\270\225\0\0\0\0"..., iov_len=48}, {iov_base="\0\1\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=76}, {iov_base="\0\0\0\3\0\24\5[trak\0\0\0\\tkhd\0\0\0\7\333\23\270\225\333\23\270\225"..., iov_len=44}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=52}, {iov_base="\7\200\0\0\48\0\0\0\24\4\367mdia\0\0\0,mdhd\1\0\0\0\0\0\0\0"..., iov_len=60}, {iov_base="\0\0\0!hdlr\0\0\0\0\0\0\0\0vide\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=33}, {iov_base="\0\24\4\242", iov_len=4}, {iov_base="minf\0\0\0\24vmhd\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0$dinf"..., iov_len=60}, {iov_base="\0\24\4bstbl\0\0\0\226stsd\0\0\0\0\0\0\0\1", iov_len=24}, {iov_base="\0\0\0\206avc1\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=134}, {iov_base="\0\r/\20stts\0\0\0\0\0\1\245\340", iov_len=16}, {iov_base="\0\0\0\1\0\0\7\10\0\0\0\1\0\0\16N\0\0\0\1\0\0\16\n\0\0\0\1\0\0\7,"..., iov_len=14400}, {iov_base="\0\0\0\1\0\0\7\24\0\0\0\1\0\0\16x\0\0\0\1\0\0\r\240\0\0\0\1\0\0\16\n"..., iov_len=14400}, {iov_base="\0\0\0\1\0\0\10r\0\0\0\1\0\0\r\34\0\0\0\1\0\0\r\220\0\0\0\1\0\0\6\362"..., iov_len=14400}, {iov_base="\0\0\0\1\0\0\6\370\0\0\0\1\0\0\168\0\0\0\1\0\0\r\374\0\0\0\1\0\0\6\374"..., iov_len=14400}], 17) = 58632

Buf's additional flexibility wouldn't help me in any significant way that I can see. I can't represent the whole response with a single Buf because there's the possibility of error while reading from a file. [edit: mostly in opening the file. Old files get cleaned up to make space for new ones, and it's possible a backing file has been deleted by the time the client reaches that portion of the mp4::File. Similarly, generating part of the index can fail if that database row has been deleted.] Buf::bytes_vectored doesn't return a Result, so it can't be lazy when the reading is fallible.

@scottlamb

[Moonfire NVR passes mmap-backed chunks to hyper] today, but I'm not really sure it's a good idea because of the page fault problem.

FYI, I stopped for this reason (see scottlamb/moonfire-nvr#88). Moonfire still uses mmap, but now only within a dedicated per-disk thread. That thread does a memcpy into chunks backed by the global allocator. I did not find using all of (Bytes, mmap, spinning disks, tokio) in one place to be a winning combo.
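
(One way to express that pattern, hedged and simplified: confine mmap access to a blocking thread and hand the async side an allocator-backed Bytes. Moonfire uses a dedicated per-disk thread rather than spawn_blocking; this sketch only shows the general shape.)

use bytes::Bytes;

async fn read_range(map: std::sync::Arc<memmap::Mmap>, start: usize, end: usize) -> std::io::Result<Bytes> {
    tokio::task::spawn_blocking(move || -> std::io::Result<Bytes> {
        // Any page fault (and disk seek) happens on this thread, not a tokio worker.
        Ok(Bytes::copy_from_slice(&map[start..end]))
    })
    .await
    .expect("spawn_blocking task panicked")
}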

I still do the "more complex shared ownership / minimizing allocations" thing a little, but I wouldn't have bothered if I had used this mmap design from the start. I don't have numbers on how much it's saving me, but probably not that much, especially given that the portions I read from HDD have per-read chunk allocations now anyway.

@NobodyXu
Contributor

That thread does a memcpy into chunks backed by the global allocator.

I think you could ask the kernel to load everything at once, to avoid page faults?

Copying the mmap into memory sounds worse than just reading the file.

@scottlamb

scottlamb commented Mar 20, 2024

I think you could ask the kernel to load everything at once, to avoid page faults?

MAP_POPULATE? Yes, but there's no guarantee it won't page it back out again unless you also mlock.

Copying the mmap into memory sounds worse than just reading the file.

It was faster for me. YMMV. There are several factors. One somewhat surprising one was discussed in this article a while ago: userspace memcpy can be faster than the in-kernel memcpy because it uses SIMD registers.

@NobodyXu
Contributor

MAP_POPULATE? Yes, but there's no guarantee it won't page it back out again unless you also mlock.

There's also no guarantee that your heap-allocated memory will stay uncompressed in main memory if the user has enabled swapping or configured something like zswap.

I think madvise would help prevent the page-out.
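
(A hedged sketch of that hint via the libc crate, illustrative only: MADV_WILLNEED asks the kernel to prefetch the mapped pages, though it reduces rather than eliminates later major faults, since pages can still be reclaimed.)

fn will_need(map: &memmap::Mmap) -> std::io::Result<()> {
    // Safety: the pointer and length come from a live mapping.
    let ret = unsafe {
        libc::madvise(map.as_ptr() as *mut libc::c_void, map.len(), libc::MADV_WILLNEED)
    };
    if ret == 0 { Ok(()) } else { Err(std::io::Error::last_os_error()) }
}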

One somewhat surprising one was discussed in this article a while ago: userspace memcpy can be faster than the in-kernel memcpy because it uses SIMD registers.

Thanks, that makes sense.
