Read upload files using read(CHUNK_SIZE) rather than iter(). #1948

Merged
tomchristie merged 4 commits into master from cap-upload-chunk-sizes on Nov 22, 2021

Conversation

tomchristie
Member

Resolves #1911

When digging into this, it turns out that when sending an upload file we're using iter(file_obj), which is a line-by-line iterator and can yield super-large chunks for binary files. That's slow because you don't really end up streaming the file to the network at all, but rather batching it all up in memory first.

The low-hanging fruit here is to cap the size of the chunks that we send to a max of 64k, which from a bit of prodding seems to be a fairly decent value.

It's possible that using .read() on a stream, if it exists, might be beneficial too, but I've not dug into that yet.
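
To illustrate the underlying behaviour, here's a rough sketch using an in-memory buffer rather than httpx itself:

import io

# iter() on a file-like object yields newline-delimited chunks, so binary
# data with no newlines comes back as one huge chunk.
blob = io.BytesIO(b"\x00" * 10_000_000)           # 10 MB, no newlines
print([len(chunk) for chunk in iter(blob)])       # -> [10000000]

# Reading with an explicit chunk size keeps each piece bounded.
blob.seek(0)
sizes = []
while chunk := blob.read(65_536):
    sizes.append(len(chunk))
print(sizes[:3])                                  # -> [65536, 65536, 65536]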

tomchristie added the perf (Issues relating to performance) label on Nov 22, 2021
@tomchristie
Member Author

Okay, this makes even more sense:

  • Use read(CHUNK_SIZE) directly when available.
  • Otherwise use the iterator interface.
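
A minimal sketch of that fallback logic (illustrative only, not the actual httpx implementation):

CHUNK_SIZE = 65_536  # 64k cap, as discussed above

def stream_file(file_obj):
    # Prefer read(CHUNK_SIZE) when the object supports it, so chunk sizes
    # stay bounded; otherwise fall back to plain iteration.
    if hasattr(file_obj, "read"):
        chunk = file_obj.read(CHUNK_SIZE)
        while chunk:
            yield chunk
            chunk = file_obj.read(CHUNK_SIZE)
    else:
        yield from file_obj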

tomchristie changed the title from Cap upload chunk sizes to Read upload files using read(CHUNK_SIZE) rather than iter(). on Nov 22, 2021
tomchristie merged commit 6f5865f into master on Nov 22, 2021
tomchristie deleted the cap-upload-chunk-sizes branch on Nov 22, 2021 at 13:15
tomchristie mentioned this pull request on Jan 5, 2022
@andrewshadura

andrewshadura commented Feb 23, 2022

(edited)
@tomchristie, for some reason I’m still observing this behaviour with 0.22.0 when using content= with a file-like object. I assumed this was only fixed for multipart uploads, but shouldn’t this new code also be used in this case?

@tomchristie
Member Author

Can you show me where you mean?

@andrewshadura

andrewshadura commented Feb 23, 2022

I have this code:

from aiofiles.tempfile import NamedTemporaryFile

async with NamedTemporaryFile() as tmpfile:
    debug(f"Buffering into {tmpfile.name}")
    # Buffer the incoming request body into the temporary file.
    async for data in request.body:
        await tmpfile.write(data)

    # Rewind and hand the file object to httpx as the request content.
    await tmpfile.seek(0)
    debug(f"Uploading to {uri} from {tmpfile.name}")
    return await client.post(uri, content=tmpfile, …)

(this uses aiofiles.tempfile)
and even with 0.22.0 I can see data being uploaded in tiny chunks, except, as I understand it, when the disk cache kicks in:

…
uploading 37 bytes
uploading 44 bytes
uploading 2 bytes
uploading 8 bytes
uploading 13 bytes
uploading 21 bytes
uploading 1820 bytes
uploading 2056 bytes
uploading 5129 bytes
uploading 1012 bytes
uploading 7 bytes
uploading 17 bytes
uploading 65476 bytes
uploading 262144 bytes
uploading 213018 bytes

@andrewshadura

andrewshadura commented Feb 23, 2022

Oh, I see, aiofiles has methods named differently from what your code expects.
Wait, aread is a method of AsyncByteStream; it only needs __aiter__ from the underlying file object, so it’s supposed to work?
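
A quick diagnostic sketch of what that __aiter__ fallback yields from an aiofiles handle (blob.bin is just a placeholder file, this assumes aiofiles is installed, and it isn't meant to pin down the cause here):

import asyncio
import aiofiles

async def main() -> None:
    # Async iteration over an aiofiles handle goes through __aiter__,
    # which yields one line at a time, so chunk sizes track where
    # b"\n" happens to fall in the data.
    async with aiofiles.open("blob.bin", "rb") as f:
        async for chunk in f:
            print(len(chunk))

asyncio.run(main())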

@andrewshadura

@tomchristie, any ideas?

@tomchristie
Member Author

Rather than me dig into this myself, lemme point you in the right direction towards figuring this out.
It might take a little longer, but I'm sure it'll be a valuable approach.

First up, you've given me a partial example. Can you give me a complete replication, but make it absolutely as simple as you can. Ideally I ought to be able to copy and paste your example to see the behaviour you're talking about.

(Once we've got that we'll work through the next steps...)
