Build our own file I/O API #1960
Comments
It probably doesn't make much sense to worry about this until after we have basic support for |
We may also want to consider some things like teaching |
I don't really dabble in low level stuff, but doesn't the OS / Kernel keep state anyway? Is that not accessible from Python? |
Yes, it is. The problem is that it's global state. Two tasks that access the same file will thus interfere with each other. Frankly I'm looking forward to a |
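The "global state" problem above can be seen in plain Python, without any async machinery: a file object's current position is shared by everyone holding that object, whereas `os.pread` (Unix-only) takes an explicit offset on every call. A minimal demo:

```python
import os
import tempfile

# Demo (plain Python, not Trio): a file object's "current position" is
# shared state, while os.pread names an explicit offset per call.
fd, path = tempfile.mkstemp()
os.write(fd, b"abcdef")

f = open(path, "rb")
assert f.read(2) == b"ab"   # first reader advances the shared position...
assert f.read(2) == b"cd"   # ...so a second reader sees different bytes
f.close()

# pread has no shared cursor: each call states its own offset, so
# concurrent readers cannot interfere with each other.
assert os.pread(fd, 2, 0) == b"ab"
assert os.pread(fd, 2, 0) == b"ab"  # independent of any other reader

os.close(fd)
os.remove(path)
```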
Not sure if in scope of this API, but I tend to |
In my experience, I don't often handle a stream in a streaming fashion and with random access at once, just one or the other. I think it might make sense to have two different file objects, a

A while ago, I did a paper design for an HTTP API, trying to pretend that I'd never used requests, but that I had read Trio's documentation and knew h11. One of the things that ended up in that design was a collection of stream adapters with signatures like |
@richardsheridan Putting these into a thread might not help if the code in question doesn't know to relinquish the GIL while it waits for the data. Out of scope for this issue probably. |
Yeah, unfortunately there's really no way to make a memory access

@smurfix appending is an interesting case. I guess the options are:
A related issue: "disk" files that aren't really on disk, like |
Append mode also has the problem that

Many nontrivial |
I see you folks are considering some pretty advanced stuff here. I was wondering if there was anything trio-specific in this, or could it be built as a separate library? That way I could also depend on it in aiofiles, and asyncio users would benefit too. |
@Tinche interesting question. The main motivation here is io_uring, and that's not something you can easily drop on top of an existing event loop. Ideally it involves rewriting the event loop itself. So on asyncio, I think you'd be putting everything in threads anyway? And aiofiles already does that, so I'm not sure you'd gain much? |
I guess technically you can use io_uring with an epoll loop by doing something tricky with eventfd, so your io_uring multiplexing code gets called when the main event loop detects that there's activity on the ring. I feel like there are enough tricky design questions here already though, so we'd probably want to first focus on how to get it working at all on trio, and then think about if it makes sense to add that extra complexity? |
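The eventfd half of that trick can be sketched in isolation with the standard library (`os.eventfd` needs Linux and Python 3.10+; the io_uring side, which in real life would use `IORING_REGISTER_EVENTFD` to make the kernel signal the fd on completions, is elided here):

```python
import os
import select

# Sketch of the eventfd-wakeup pattern: a producer (in real life, the
# kernel completing io_uring entries registered via IORING_REGISTER_EVENTFD)
# writes to an eventfd, and the epoll-based main loop wakes up and drains it.
if hasattr(os, "eventfd"):  # Linux, Python 3.10+
    efd = os.eventfd(0)

    ep = select.epoll()
    ep.register(efd, select.EPOLLIN)

    os.eventfd_write(efd, 1)        # simulate a completion notification

    events = ep.poll(timeout=1)     # main loop detects activity on the fd
    assert events and events[0][0] == efd

    count = os.eventfd_read(efd)    # drain; here we'd process the CQ ring
    assert count == 1

    ep.close()
    os.close(efd)
```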
Ah, I see. If it requires changes in the event loop itself, then implementing this in asyncio itself or uvloop might be a better strategy. |
Yeah, just as technically you can emulate a 32-bit CPU with an 8-bit CPU. Presto, your 8-bit-CPU runs Linux, albeit somewhat slowly. ;-) IMHO if we do the io_uring thing (which IMHO we should do) then the ring shall be the center of our part of the universe, and everything else shall revolve around it (eventually). It's at the bottom of the stack. Writing an "io_uring for aiofiles" back-end is starting from the wrong side of the fence. The right way IMHO is to implement a basic io_uring syscall mechanism, build the Trio mainloop on top of that, create a |
There is no need to use |
Note: some released versions of Linux 5.9 and 5.10 have broken RWF_NOWAIT (might affect io_uring too): tokio-rs/tokio#3803 |
Oh, good point! It looks like Windows does have this:
From a quick look I'm not finding anything similar on macOS/*BSD, though. OTOH, it sounds like macOS might implement |
Right now, our file I/O API is just a re-export of the one built into Python, with threads wrapped around all the I/O operations.
Python's file I/O API is very rich. For example, `io.FileIO` is a type of `io.RawIOBase`, which is a type of `io.IOBase`. And `open` by default returns an `io.TextIOWrapper` wrapped around an `io.BufferedRandom` wrapped around an `io.FileIO` object, with incremental unicode and newline decoding, custom buffering, etc. Reusing this code lets us isolate ourselves + our users from the details of low-level file I/O.

The downside, of course, is that if we don't like how Python is handling that low-level file I/O, there's not much we can do about it, because there are like 3 abstraction layers between us and the actual syscalls. Historically, this hasn't been a big deal, because there hasn't been any better option than running regular blocking syscalls in a thread. But the tide of history is changing.
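That wrapper stack is easy to inspect directly; a quick sanity check of the layers described above:

```python
import io
import os
import tempfile

# Peek at the wrapper stack that builtin open() constructs by default.
fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, "r+")                              # text mode, read/write
assert isinstance(f, io.TextIOWrapper)            # unicode + newline decoding
assert isinstance(f.buffer, io.BufferedRandom)    # the buffering layer
assert isinstance(f.buffer.raw, io.FileIO)        # raw OS fd wrapper
assert isinstance(f.buffer.raw, io.RawIOBase)     # FileIO is a RawIOBase
assert isinstance(f, io.IOBase)                   # and everything is an IOBase
f.close()
os.remove(path)
```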
First, Linux added `preadv2(..., RWF_NOWAIT)`, which is very simple -- it just lets you skip going to a thread if the data is already in cache; you still have to go to a thread otherwise. But this is still enough for a dramatic speedup if you can use it. I was hoping that we could extend the `io` module to support this (see bpo-32561), but (a) this hasn't really gone anywhere, and (b) see next paragraph.

Then, io_uring came along, which is completely incompatible with the `io` module. And this article makes a compelling case that you really need an io_uring-like API to get reasonable performance on modern hardware; the `RWF_NOWAIT` trick isn't enough:

https://itnext.io/modern-storage-is-plenty-fast-it-is-the-apis-that-are-bad-6a68319fbc1a?gi=acab1e8296c4
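For what it's worth, the `RWF_NOWAIT` trick is already reachable from Python via `os.preadv` (Linux 4.14+, CPython 3.7+). A minimal sketch -- the helper name `read_from_cache` is made up, and a real version would fall back to a worker thread on a miss:

```python
import os

def read_from_cache(fd: int, offset: int, nbytes: int):
    """Try a non-blocking read from the page cache.

    Returns the bytes if they were already cached, or None if the read
    would have blocked (i.e. we should fall back to doing it in a thread).
    Linux-only: relies on preadv2(..., RWF_NOWAIT) via os.preadv.
    """
    buf = bytearray(nbytes)
    try:
        n = os.preadv(fd, [buf], offset, os.RWF_NOWAIT)
    except BlockingIOError:
        return None  # cache miss: the kernel refused to block
    return bytes(buf[:n])
```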
We also have this request to support FreeBSD's native aio API: #1953
So... it seems like sooner or later we need to give up on `io` and write our own async file API. What should that look like?

One option would be to copy the `io` API in detail, but... it's huge, so that would be difficult, and also... I'm not sure all the hair is really useful? One thing in particular I'm not a big fan of is the way everything is centered around the "current file position". This means every file object has some global state. Especially in a concurrent program, an API where you simply say which offset you want to read/write at each time seems better. (This is what Unix calls `pread`/`pwrite`, as opposed to `read`/`write`, which use the "current position".) This would mean we can't support treating streaming data like sockets as files, the way the `io` module can, but... that seems fine.

What do you really need to do with files?
Is there anything else? Those all seem pretty simple, and don't require anything like the `io` module's elaborate inheritance hierarchy.

The article I linked above also links to a Rust library for io_uring; they might have some useful API inspiration.
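A positional API of the kind described above might look something like this. The class and method names are made up, and the calls are synchronous to keep the sketch self-contained; a real Trio version would be `async` and dispatch to worker threads or io_uring instead:

```python
import os

# Hypothetical sketch of a positional (pread/pwrite-style) file API.
# Unix-only: uses os.pread / os.pwrite.
class PositionalFile:
    def __init__(self, path: str, flags: int = os.O_RDWR | os.O_CREAT):
        self._fd = os.open(path, flags, 0o644)

    def pread(self, nbytes: int, offset: int) -> bytes:
        # Every call names its own offset: there is no shared "current
        # position", so concurrent users cannot interfere with each other.
        return os.pread(self._fd, nbytes, offset)

    def pwrite(self, data: bytes, offset: int) -> int:
        return os.pwrite(self._fd, data, offset)

    def close(self) -> None:
        os.close(self._fd)
```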