
SNOW-899773: Allow specification of batch_size for batch-generating functions #1712

Open
willsthompson opened this issue Aug 24, 2023 · 4 comments
Assignees
Labels
feature status-triage_done Initial triage done, will be further handled by the driver team

Comments

@willsthompson

What is the current behavior?

Batch size is not controllable from the client when using batch-generating functions, e.g. get_result_batches(), fetch_arrow_batches(), fetch_pandas_batches()

What is the desired behavior?

Allow specification of a batch_size parameter when making batch requests that determines the number of records returned in each batch.

How would this improve snowflake-connector-python?

Many applications require tight control over memory usage to operate reliably. This applies to essentially any service running on a remote server, i.e. not a user's laptop. Our application provides connections to multiple databases and cloud storage providers, and the only way we can provide the equivalent level of reliability (every other database and storage provider we support offers this feature in their connector) is for Snowflake to include the ability to control the size of responses for large requests.

References and other background

  • The same request was made a few years ago, then evidently closed after a Snowflake developer reached out to the OP, but no details about how the issue was resolved were included in the issue (SNOW-165822: fetch_pandas_batches batch size #320)
  • Someone filed a bug complaining about small and varying batch sizes, but a bot closed it earlier this year after no comments for over a year (SNOW-606540: fetch_pandas_batches tiny batch size #1160)
  • It would still solve our problem (and I think most similar problems) if the returned batches are only close, but not exactly the requested size. Generally, we only want to ensure we do not receive batches so big that a worker process runs out of memory or receive many batches so small that it takes too long to iterate and merge them.
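Until such a parameter exists, the approximate re-chunking described above can be approximated client-side. The sketch below is a hypothetical helper (`rebatch` is not part of the connector) that buffers the variable-sized DataFrames yielded by `fetch_pandas_batches()` and re-slices them into batches of a requested row count; only the final batch may be smaller.

```python
from typing import Iterable, Iterator

import pandas as pd


def rebatch(frames: Iterable[pd.DataFrame], batch_size: int) -> Iterator[pd.DataFrame]:
    """Re-chunk an iterable of DataFrames into batches of `batch_size` rows.

    Buffers incoming frames until at least `batch_size` rows are available,
    then slices off exact-size chunks; the last chunk may be smaller.
    """
    buffer: list[pd.DataFrame] = []
    buffered = 0
    for frame in frames:
        buffer.append(frame)
        buffered += len(frame)
        while buffered >= batch_size:
            merged = pd.concat(buffer, ignore_index=True)
            yield merged.iloc[:batch_size]
            leftover = merged.iloc[batch_size:]
            buffer = [leftover] if len(leftover) else []
            buffered = len(leftover)
    if buffered:  # flush whatever is left as a final, smaller batch
        yield pd.concat(buffer, ignore_index=True)
```

With a real cursor this would be used as `for chunk in rebatch(cur.fetch_pandas_batches(), 50_000): ...`. The trade-off is that buffering can briefly hold up to one extra server batch in memory, so this only bounds memory if the server's own batches are not themselves enormous, which is exactly why a server-side `batch_size` would still help.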
@github-actions github-actions bot changed the title Allow specification of of batch_size for batch-generating functions SNOW-899773: Allow specification of of batch_size for batch-generating functions Aug 24, 2023
@kylejcaron

kylejcaron commented Dec 14, 2023

It would still solve our problem (and I think most similar problems) if the returned batches are only close, but not exactly the requested size. Generally, we only want to ensure we do not receive batches so big that a worker process runs out of memory or receive many batches so small that it takes too long to iterate and merge them.

Hope y'all don't mind me chiming in here - a lot of ML frameworks require tightly controlled batch sizes. An example pattern that's compatible with a lot of these frameworks is to_torch_datapipe from snowflake.ml

It might be worth finding an expert who uses TensorFlow/PyTorch/JAX at scale to weigh in

@bitshop

bitshop commented Jan 5, 2024

I found this issue while searching around, thinking I must be misreading the docs or just not finding another command in the connector that fetches up to a specific size. NOTE: I suggest this may be worth breaking into two requests:

  1. Specify a MAX batch size, for memory constrained apps this is fine
  2. Specify an EXACT batch size - This potentially requires more compute on Snowflake's side, since rows would have to be borrowed from other batches to make each fetched batch exactly the requested size. Assuming a highly distributed query, I would think each producer of rows would only publish its own row count - hence my thinking that this request is harder.

To be clear, the difference is that #1 is about memory constraints: if some batches have 1 row and others the max, that's acceptable in that use case.
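Option 1 (a MAX batch size) is also the easy one to emulate client-side, since it needs no buffering at all. The sketch below is a hypothetical helper (`cap_batch_size` is not part of the connector): small batches pass through untouched and oversized ones are split, so peak memory stays bounded by one source batch.

```python
from typing import Iterator, Sequence


def cap_batch_size(batches, max_rows: int) -> Iterator[Sequence]:
    """Enforce only a MAXIMUM batch size (option 1 above).

    Works on any row container that supports len() and slicing,
    e.g. lists of rows or pandas DataFrames.
    """
    for batch in batches:
        # Slice oversized batches into max_rows-sized pieces;
        # batches already at or under the cap yield a single piece.
        for start in range(0, len(batch), max_rows):
            yield batch[start:start + max_rows]
```

Note this gives the memory guarantee but not uniform sizes: a 1-row server batch still comes out as a 1-row batch, matching the "some batches have 1 row" case described above.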

@sfc-gh-dszmolka

Thank you for opening this request with us - we'll consider it for a possible future improvement in the connector.

@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage_done Initial triage done, will be further handled by the driver team and removed needs triage labels Mar 11, 2024
@JHuangg

JHuangg commented Mar 18, 2024

Is there a way to understand how the batch size is being determined?
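As the thread above suggests, the sizes appear to be chosen server-side when the result set is split into chunks, not by the client. One way to at least inspect the sizes you actually received is via `cursor.get_result_batches()`, whose `ResultBatch` objects expose a `rowcount` attribute. A minimal sketch (the helper name is made up for illustration):

```python
def batch_row_counts(batches) -> list:
    """Return the per-batch row counts for a query result.

    With the real connector, `batches` would be
    `cursor.get_result_batches()`; each ResultBatch reports its
    row count without downloading the batch's data.
    """
    return [batch.rowcount for batch in batches]
```

Logging these counts for a few representative queries makes it easy to see how uneven the server-chosen batch sizes are in practice.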


7 participants