SNOW-535791: PUT operations read full input data into memory #536

Open
asonawalla opened this issue Jan 26, 2022 · 3 comments
Labels
enhancement: The issue is a request for improvement or a new feature
status-triage_done: Initial triage done, will be further handled by the driver team

Comments

@asonawalla

We use the streaming PUT feature of gosnowflake to upload data to Snowflake internal stages. Recently, we started throwing GB-sized files at it and saw our memory usage explode. Through some experimentation with an isolated production workload, we saw that for a 1GB input file, the driver was using approximately 8GB of memory when the PUT operation was configured as a stream, and 2-3GB when it was configured as a file on disk.

Looking at the code, the problem seems to be that the alleged "streams" are often passed around as bytes.Buffers that read the entire input data into memory, multiple times over, with the exact amount depending on the command and options (e.g. here, here, here, everywhere this is invoked, etc.). Some crude experimentation with a fork I've created suggests there are likely more places.

Beyond the obvious issue that the documentation on streaming puts is misleading, I would go so far as to say the driver should never have to read the entire input contents into memory. The major operations it's responsible for (compression, encryption, calculating digests, etc) are all possible with modest buffers that don't need to scale with the input size.
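
As an illustration (a hedged sketch only, not the driver's actual code paths or its exact digest scheme), compression and digest calculation can be chained over an io.Reader with fixed memory by feeding a gzip.Writer through an io.Pipe and hashing the raw bytes with an io.TeeReader as they stream past:

```go
// Sketch only: streaming compression + digest with memory independent of input size.
package main

import (
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

func main() {
	// Stand-in for a caller-supplied stream; could be many GB in practice.
	src := strings.NewReader("pretend this is a very large input stream")

	digest := sha256.New()
	pr, pw := io.Pipe()

	go func() {
		gz := gzip.NewWriter(pw)
		// TeeReader hashes each raw byte as gzip consumes it; nothing is
		// retained beyond gzip's internal window.
		_, err := io.Copy(gz, io.TeeReader(src, digest))
		if err == nil {
			err = gz.Close()
		}
		pw.CloseWithError(err)
	}()

	// pr yields compressed bytes as they are produced; a real consumer would
	// hand this reader to the uploader instead of draining it like this.
	n, _ := io.Copy(io.Discard, pr)
	fmt.Printf("compressed to %d bytes, sha256(raw) = %s\n",
		n, hex.EncodeToString(digest.Sum(nil)))
}
```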

I'd be happy to contribute here, but wanted to start a discussion since the required changes seem to be nontrivial.

@github-actions github-actions bot changed the title PUT operations read full input data into memory SNOW-535791: PUT operations read full input data into memory Jan 26, 2022
@github-actions github-actions bot closed this as completed Jul 1, 2022
@sfc-gh-jfan sfc-gh-jfan reopened this Jul 1, 2022
@github-actions github-actions bot closed this as completed Jul 2, 2022
@sfc-gh-jfan sfc-gh-jfan reopened this Jul 6, 2022
@sfc-gh-dszmolka
Contributor

Thank you for submitting this issue! I read through the linked PR #538, which references another PR, #527, that has since been merged, so I guess this should be resolved for now.
If I got it wrong, please feel free to reopen or comment.

@williamhbaker

I can confirm that this is still an issue. There are at least a couple of places where the stream is still read fully into memory, here and here (in the S3 uploader).

The concept of having the entire contents of the stream available seems pretty baked-in, for example here, where the size of the stream is needed. I can imagine it will take some effort to fully refactor this, but it would be useful to handle streaming PUTs in a memory-efficient way.
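
For what it's worth, the cloud-side upload doesn't inherently require the total size up front either. A hedged sketch (illustrated with the classic aws-sdk-go v1 s3manager API, which may not be the SDK flavor the driver actually links; the bucket and key names are placeholders): the upload manager runs a multipart upload from a plain io.Reader, buffering only one part per worker at a time.

```go
// Sketch only: multipart upload from an io.Reader of unknown length.
package putstream

import (
	"io"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func uploadStream(body io.Reader) error {
	sess := session.Must(session.NewSession())

	uploader := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
		u.PartSize = 16 << 20 // 16 MiB buffered per part
		u.Concurrency = 4     // at most 4 parts in memory at once
	})

	// Body is just an io.Reader, so the total size never needs to be known
	// (or buffered) ahead of time.
	_, err := uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("example-stage-bucket"),
		Key:    aws.String("example/path/data.gz"),
		Body:   body,
	})
	return err
}
```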

@sfc-gh-dszmolka
Contributor

Thank you for reviewing and commenting, and especially for providing useful details! Reopening this issue and marking it as an enhancement for the product team to consider.

@sfc-gh-dszmolka sfc-gh-dszmolka added the enhancement The issue is a request for improvement or a new feature label Apr 13, 2023
@sfc-gh-dszmolka sfc-gh-dszmolka added the status-triage_done Initial triage done, will be further handled by the driver team label Mar 12, 2024