Read upload files using read(CHUNK_SIZE) rather than iter(). #1948

Merged
tomchristie merged 4 commits into master from cap-upload-chunk-sizes on Nov 22, 2021

Conversation

tomchristie
Member

Resolves #1911

When digging into this, it turns out that when sending an upload file we're using iter(file_obj), which is a line-by-line iterator and can yield super-large chunks for binary files. That's slow because you don't really end up streaming the file to the network at all, but rather batching it all up in memory first.

The low-hanging fruit here is to cap the size of the chunks that we send to a max of 64k, which from a bit of prodding seems to be a fairly decent value.

It's possible that using .read() on a stream, if it exists, might be beneficial too, but I've not dug into that yet.
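
To illustrate the underlying behaviour, here's a rough sketch using an in-memory buffer rather than httpx itself:

import io

# iter() on a file-like object yields newline-delimited chunks, so binary
# data with no newlines comes back as one huge chunk.
blob = io.BytesIO(b"\x00" * 10_000_000)           # 10 MB, no newlines
print([len(chunk) for chunk in iter(blob)])       # -> [10000000]

# Reading with an explicit chunk size keeps each piece bounded.
blob.seek(0)
sizes = []
while chunk := blob.read(65_536):
    sizes.append(len(chunk))
print(sizes[:3])                                  # -> [65536, 65536, 65536]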

tomchristie added the perf (Issues relating to performance) label on Nov 22, 2021
@tomchristie
Member Author

Okay, this makes even more sense:

  • Use read(CHUNK_SIZE) directly when available.
  • Otherwise use the iterator interface.
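
A minimal sketch of that fallback logic (illustrative only, not the actual httpx implementation):

CHUNK_SIZE = 65_536  # 64k cap, as discussed above

def stream_file(file_obj):
    # Prefer read(CHUNK_SIZE) when the object supports it, so chunk sizes
    # stay bounded; otherwise fall back to plain iteration.
    if hasattr(file_obj, "read"):
        chunk = file_obj.read(CHUNK_SIZE)
        while chunk:
            yield chunk
            chunk = file_obj.read(CHUNK_SIZE)
    else:
        yield from file_obj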

tomchristie changed the title from Cap upload chunk sizes to Read upload files using read(CHUNK_SIZE) rather than iter(). on Nov 22, 2021
tomchristie merged commit 6f5865f into master on Nov 22, 2021
tomchristie deleted the cap-upload-chunk-sizes branch on Nov 22, 2021 at 13:15
tomchristie mentioned this pull request on Jan 5, 2022
@andrewshadura

andrewshadura commented Feb 23, 2022

(edited)
@tomchristie, for some reason I’m still observing this behaviour with 0.22.0 when using content= with a file-like object. I assumed this was only fixed for multipart uploads, but shouldn’t this new code also be used in this case?

@tomchristie
Member Author

Can you show me where you mean?

@andrewshadura

andrewshadura commented Feb 23, 2022

I have this code:

from aiofiles.tempfile import NamedTemporaryFile

async with NamedTemporaryFile() as tmpfile:
    debug(f"Buffering into {tmpfile.name}")
    # Buffer the incoming request body into the temporary file.
    async for data in request.body:
        await tmpfile.write(data)

    # Rewind and hand the file object to httpx as the request content.
    await tmpfile.seek(0)
    debug(f"Uploading to {uri} from {tmpfile.name}")
    return await client.post(uri, content=tmpfile, …)

(this uses aiofiles.tempfile)
and even with 0.22.0 I can see data being uploaded in tiny chunks, except, as I understand it, when the disk cache kicks in:

…
uploading 37 bytes
uploading 44 bytes
uploading 2 bytes
uploading 8 bytes
uploading 13 bytes
uploading 21 bytes
uploading 1820 bytes
uploading 2056 bytes
uploading 5129 bytes
uploading 1012 bytes
uploading 7 bytes
uploading 17 bytes
uploading 65476 bytes
uploading 262144 bytes
uploading 213018 bytes

@andrewshadura

andrewshadura commented Feb 23, 2022

Oh, I see, aiofiles has methods named differently from what your code expects.
Wait, aread is a method of AsyncByteStream; it only needs __aiter__ from the underlying file object, so it’s supposed to work?
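
A quick diagnostic sketch of what that __aiter__ fallback yields from an aiofiles handle (blob.bin is just a placeholder file, this assumes aiofiles is installed, and it isn't meant to pin down the cause here):

import asyncio
import aiofiles

async def main() -> None:
    # Async iteration over an aiofiles handle goes through __aiter__,
    # which yields one line at a time, so chunk sizes track where
    # b"\n" happens to fall in the data.
    async with aiofiles.open("blob.bin", "rb") as f:
        async for chunk in f:
            print(len(chunk))

asyncio.run(main())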

@andrewshadura

@tomchristie, any ideas?

@tomchristie
Member Author

Rather than me dig into this myself, lemme point you in the right direction towards figuring this out.
It might take a little longer, but I'm sure it'll be a valuable approach.

First up, you've given me a partial example. Can you give me a complete replication, but make it absolutely as simple as you can. Ideally I ought to be able to copy and paste your example to see the behaviour you're talking about.

(Once we've got that we'll work through the next steps...)
