Build our own file I/O API #1960

Open · njsmith opened this issue Apr 14, 2021 · 18 comments

njsmith (Member) commented Apr 14, 2021

Right now, our file I/O API is just a re-export of the one built into Python, with threads wrapped around all the I/O operations.

Python's file I/O API is very rich. For example, io.FileIO is a type of io.RawIOBase, which is a type of io.IOBase. And open by default returns an io.TextIOWrapper wrapped around an io.BufferedRandom wrapped around an io.FileIO object, with incremental unicode and newline decoding, custom buffering, etc. Reusing this code lets us isolate ourselves + our users from the details of low-level file I/O.

The downside, of course, is that if we don't like how Python is handling that low-level file I/O, there's not much we can do about it, because there are like 3 abstraction layers between us and the actual syscalls. Historically, this hasn't been a big deal, because there hasn't been any better option than running regular blocking syscalls in a thread. But the tide of history is turning.

First, Linux added preadv2(..., RWF_NOWAIT), which is very simple -- it just lets you skip going to a thread if the data is already in cache; you still have to go to a thread otherwise. But this is still enough for a dramatic speedup if you can use it. I was hoping that we could extend the io module to support this (see bpo-32561), but (a) this hasn't really gone anywhere, and (b) see the next paragraph.
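
For concreteness, here's roughly what that fast path looks like from Python (a sketch; the helper name is made up, and os.preadv / os.RWF_NOWAIT need Python 3.7+ and Linux 4.14+):

```python
import os
import trio

async def pread_nowait_fallback(fd: int, size: int, offset: int) -> bytes:
    buf = bytearray(size)
    try:
        # Fast path: returns immediately if the data is in the page cache
        n = os.preadv(fd, [buf], offset, os.RWF_NOWAIT)
        return bytes(buf[:n])
    except BlockingIOError:
        # Data not cached (EAGAIN): fall back to a blocking pread in a thread
        return await trio.to_thread.run_sync(os.pread, fd, size, offset)
```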

Then, io_uring came along, which is completely incompatible with the io module. And this article makes a compelling case that you really need an io_uring-like API to get reasonable performance on modern hardware; the RWF_NOWAIT trick isn't enough:

https://itnext.io/modern-storage-is-plenty-fast-it-is-the-apis-that-are-bad-6a68319fbc1a?gi=acab1e8296c4

We also have this request to support FreeBSD's native aio API: #1953

So... it seems like sooner or later we need to give up on io and write our own async file API. What should that look like?

One option would be to copy the io API in detail, but... it's huge, so that would be difficult, and also... I'm not sure all the hair is really useful? One thing in particular I'm not a big fan of is the way everything is centered around the "current file position". This means every file object has some global state. Especially in a concurrent program, an API where you simply say which offset you want to read/write at each time seems better. (This is what Unix calls pread/pwrite, as opposed to read/write, which use the "current position".) This would mean we can't support treating streaming data like sockets as files, the way the io module can, but... that seems fine.

What do you really need to do with files?

  • read/write bytes at offset
  • write bytes/text at end
  • iterate through byte chunks
  • iterate through text chunks
  • iterate through text lines
  • read the whole thing as a single big blob of bytes/text
  • [edit to add] truncate

Is there anything else? Those all seem pretty simple, and don't require anything like the io module's elaborate inheritance hierarchy.
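
To make that concrete, here's a hypothetical sketch of what such a positionless surface could look like (names and signatures are illustrative only, not a proposal I'm committing to):

```python
from abc import ABC, abstractmethod

class AsyncFile(ABC):
    # Every operation names its offset explicitly, so there's no shared
    # "current position" state for concurrent tasks to trip over.
    @abstractmethod
    async def pread(self, size: int, offset: int) -> bytes:
        """Read up to size bytes starting at offset."""

    @abstractmethod
    async def pwrite(self, data: bytes, offset: int) -> int:
        """Write data at offset; return the number of bytes written."""

    @abstractmethod
    async def append(self, data: bytes) -> int:
        """Write data at the current end of the file."""

    @abstractmethod
    async def truncate(self, size: int) -> None:
        """Cut the file down to size bytes."""
```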

The article I linked above also links to a rust library for io_uring; they might have some useful API inspiration.

njsmith commented Apr 14, 2021

It probably doesn't make much sense to worry about this until after we have basic support for io_uring: #932

njsmith commented Apr 14, 2021

We may also want to consider some things like teaching trio.Path to do async stat or async listdir natively through io_uring, though that's probably lower priority than bulk read/write.

Sxderp commented Apr 16, 2021

> One thing in particular I'm not a big fan of is the way everything is centered around the "current file position". This means every file object has some global state.

I don't really dabble in low level stuff, but doesn't the OS / Kernel keep state anyway? Is that not accessible from Python?

smurfix (Contributor) commented Apr 16, 2021

> I don't really dabble in low level stuff, but doesn't the OS / Kernel keep state anyway? Is that not accessible from Python?

Yes, it is. The problem is that it's global state: two tasks that access the same file will interfere with each other.
So we should keep the current position at the end of the file, so that appending data works, and use pread/pwrite for everything else.

Frankly I'm looking forward to a trio.os module that has the same API as os except that everything is async (except for getpid and other syscalls that cannot sleep).
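
Something like this, conceptually (a rough sketch that just punts everything to threads; a real implementation would eventually route through io_uring, and these wrappers are made up):

```python
import functools
import os
import trio

def _async_wrap(fn):
    # Same signature as the os function, but async: run it in a thread
    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        return await trio.to_thread.run_sync(
            functools.partial(fn, *args, **kwargs)
        )
    return wrapper

stat = _async_wrap(os.stat)
listdir = _async_wrap(os.listdir)
getpid = os.getpid  # can't sleep, so no need to wrap
```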

richardsheridan (Contributor):

Not sure if this is in scope for this API, but I tend to mmap big arrays and do read-only operations on many small chunks of them. I've always assumed that those reads need to be in a thread since I can't know when a cache miss is coming, but maybe there is a smarter way?

gmacon commented Apr 21, 2021

In my experience, I don't often handle a stream in a streaming fashion and with random access at once, just one or the other. I think it might make sense to have two different file objects, a FileStream for reading or writing without random access, and a RandomAccessFile that only exposes the equivalents of pread and pwrite.

A while ago, I did a paper design for an HTTP API trying to pretend that I'd never used requests, but that I had read Trio's documentation and knew h11. One of the things that ended up in that design was a collection of stream adapters with signatures like `def lines(inner: ReceiveStream, codec='utf-8', ...) -> ReceiveChannel[str]` and `async def slurp(inner: ReceiveStream) -> bytes`. I wonder if those could be useful to cover all of the cases on top of FileStream and RandomAccessFile.
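
To sketch what I mean (illustrative only; unlike the signatures above, `lines` here is an async generator rather than something returning a ReceiveChannel):

```python
import codecs
import trio

async def slurp(inner: trio.abc.ReceiveStream) -> bytes:
    # Read the stream to EOF and return the whole thing as one blob
    chunks = []
    while True:
        chunk = await inner.receive_some()
        if not chunk:  # b"" means end-of-stream
            return b"".join(chunks)
        chunks.append(chunk)

async def lines(inner: trio.abc.ReceiveStream, codec: str = "utf-8"):
    # Incrementally decode bytes and yield complete text lines
    decoder = codecs.getincrementaldecoder(codec)()
    pending = ""
    while True:
        chunk = await inner.receive_some()
        pending += decoder.decode(chunk, final=not chunk)
        *complete, pending = pending.split("\n")
        for line in complete:
            yield line
        if not chunk:
            if pending:
                yield pending  # trailing line without a newline
            return
```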

smurfix (Contributor) commented Apr 21, 2021

> a RandomAccessFile that only exposes the equivalents of pread and pwrite.

plus append, please.

@richardsheridan Putting these into a thread might not help if the code in question doesn't know to relinquish the GIL while it waits for the data. Out of scope for this issue probably.

njsmith commented Apr 23, 2021

Yeah, unfortunately there's really no way to make a memory access async. But on Linux, at least, I'd expect random access to large files via io_uring to have similar performance to mmap, while also being async-friendly.

@smurfix appending is an interesting case. I guess the options are:

  • open in append mode: on Unix, the kernel guarantees that all writes will append. No idea what happens on Windows. Since this is a flag on the fd, it probably needs, like, an entirely different interface – you can't just make it a method on a generic random access file.

  • track where we think the end of the file is, and write to that offset (sketched below). Obviously this is race-y if someone else is also appending to the same file. But in practice this might not matter. (Append mode is also race-y, but it guarantees that data won't be lost, only interleaved.)

  • call stat to check where the end of the file is, and then write to that offset. Still race-y, but a much narrower race window than the previous option. Requires an extra syscall on every append.
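
Here's a minimal sketch of the second option (hypothetical helper, written synchronously for brevity; the real thing would issue the pwrite via a thread or io_uring):

```python
import os

class TrackedAppender:
    """Track the presumed end-of-file and pwrite there. Racy if
    someone else appends to the same file concurrently."""

    def __init__(self, fd: int):
        self.fd = fd
        self._end = os.fstat(fd).st_size  # one stat, at open time

    def append(self, data: bytes) -> int:
        written = os.pwrite(self.fd, data, self._end)
        self._end += written
        return written
```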

A related issue: "disk" files that aren't really on disk, like /dev/zero, /proc/mountinfo, Unix named pipes, Windows COM ports, etc. These aren't seekable (I think? Maybe magic files in /proc are?), and you kind of need to just read/write at the current position, maybe? Or do you always read/write at offset zero?

smurfix (Contributor) commented Apr 23, 2021

Append mode also has the problem that pwrite64 ignores your offset; it always appends. Not a good idea. Personally I'd prefer the "call stat plus use file locking" approach if there's a risk of multiple writers; otherwise just track.
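
Roughly like this, assuming all writers cooperate by taking the same advisory lock (sketch only):

```python
import fcntl
import os

def locked_append(fd: int, data: bytes) -> int:
    fcntl.lockf(fd, fcntl.LOCK_EX)        # advisory exclusive lock
    try:
        end = os.fstat(fd).st_size        # find the current end of file
        return os.pwrite(fd, data, end)   # write exactly there
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN)
```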

Many nontrivial /proc files are seekable, though the seek offset often doesn't correspond to the number of bytes you've read. Instead it encodes the number of the current entry, shifted left so that it covers the max line length.

Tinche commented Apr 23, 2021

I see you folks are considering some pretty advanced stuff here. I was wondering if there was anything trio-specific in this, or whether it could be built as a separate library? That way I could also depend on it in aiofiles, and asyncio users would benefit too.

njsmith commented Apr 23, 2021

@Tinche interesting question. The main motivation here is io_uring, and that's not something you can easily drop on top of an existing event loop. Ideally it involves rewriting the event loop itself. So on asyncio, I think you'd be putting everything in threads anyway? And aiofiles already does that, so I'm not sure you'd gain much?

njsmith commented Apr 23, 2021

I guess technically you can use io_uring with an epoll loop by doing something tricky with eventfd, so your io_uring multiplexing code gets called when the main event loop detects that there's activity on the ring. I feel like there are enough tricky design questions here already though, so we'd probably want to first focus on how to get it working at all on trio, and then think about if it makes sense to add that extra complexity?

Tinche commented Apr 23, 2021

Ah, I see. If it requires changes in the event loop itself, then implementing this in asyncio itself or uvloop might be a better strategy.

smurfix (Contributor) commented Apr 23, 2021

> I guess technically you can use io_uring with an epoll loop by doing something tricky with eventfd,

Yeah, just as technically you can emulate a 32-bit CPU with an 8-bit CPU. Presto, your 8-bit-CPU runs Linux, albeit somewhat slowly. ;-)

IMHO if we do the io_uring thing (which IMHO we should do) then the ring shall be the center of our part of the universe, and everything else shall revolve around it (eventually). It's at the bottom of the stack. Writing an "io_uring for aiofiles" back-end is starting from the wrong side of the fence. The right way IMHO is to implement a basic io_uring syscall mechanism, build the Trio mainloop on top of that, create a trio.os module that does the same thing os does (except async), implement the rest of the infrastructure we need on top of that, and then create a shallow shim for aiofiles that simply calls the corresponding Trio code.

YoSTEALTH:

There is no need to use preadv2(..., RWF_NOWAIT) if you are using io_uring, since io_uring already implements (and improves on) this. axboe/liburing#280 (comment)

njsmith commented May 29, 2021

Note: some released versions of Linux 5.9 and 5.10 have broken RWF_NOWAIT (might affect io_uring too): tokio-rs/tokio#3803

takluyver (Contributor):

> appending is an interesting case. I guess the options are...

pwritev2 has an RWF_APPEND flag to write at the end of the file regardless of the offset, and if I've followed the documentation of io_uring correctly it should have the same thing. Would that work? Is there anything similar on other platforms?
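
E.g. with the bindings CPython already ships (sketch; os.pwritev has taken a flags argument since 3.7, but os.RWF_APPEND itself needs Python 3.10+ and Linux 4.16+):

```python
import os

def append(fd: int, data: bytes) -> int:
    # pwritev2 with RWF_APPEND: the kernel ignores the offset and writes
    # atomically at the current end of file, even if the fd wasn't
    # opened with O_APPEND
    return os.pwritev(fd, [data], 0, os.RWF_APPEND)
```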

njsmith commented Jun 16, 2021

> pwritev2 has an RWF_APPEND flag to write at the end of the file regardless of the offset, and if I've followed the documentation of io_uring correctly it should have the same thing. Would that work? Is there anything similar on other platforms?

Oh, good point!

It looks like Windows does have this:

> To write to the end of file, specify both the Offset and OffsetHigh members of the OVERLAPPED structure as 0xFFFFFFFF. This is functionally equivalent to previously calling the CreateFile function to open hFile using FILE_APPEND_DATA access.

From a quick look I'm not finding anything similar on macOS/*BSD, though.

OTOH, it sounds like macOS might implement O_APPEND as seek+write, so it's race-y no matter what you do? https://stackoverflow.com/questions/50752288/atomicity-of-write2-on-a-file-opened-with-the-o-append-flag?noredirect=1&lq=1
