
SNOW-899773: Allow specification of batch_size for batch-generating functions #1712

Open
willsthompson opened this issue Aug 24, 2023 · 4 comments
Assignees
Labels
feature status-triage_done Initial triage done, will be further handled by the driver team

Comments

@willsthompson

What is the current behavior?

Batch size is not controllable from the client when using batch-generating functions, e.g. get_result_batches(), fetch_arrow_batches(), fetch_pandas_batches()

What is the desired behavior?

Allow specification of a batch_size parameter when making batch requests that determines the number of records returned in each batch.

How would this improve snowflake-connector-python?

Many applications require tight control over memory usage to operate reliably. This applies to essentially any service running on a remote server, i.e. not a user's laptop. Our application provides connections to multiple databases and cloud storage providers, and the only way we can provide the equivalent level of reliability (every other database and storage provider we support offers this feature in their connector) is for Snowflake to include the ability to control the size of responses for large requests.

References and other background

  • The same request was made a few years ago, then evidently closed after a Snowflake developer reached out to the OP, but no details about how the issue was resolved were included in the issue (SNOW-165822: fetch_pandas_batches batch size #320)
  • Someone filed a bug complaining about small and varying batch sizes, but a bot closed it earlier this year after no comments for over a year (SNOW-606540: fetch_pandas_batches tiny batch size #1160)
  • It would still solve our problem (and I think most similar problems) if the returned batches are only close, but not exactly the requested size. Generally, we only want to ensure we do not receive batches so big that a worker process runs out of memory or receive many batches so small that it takes too long to iterate and merge them.
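Until such a parameter exists, the approximate re-chunking described above can be approximated client-side. The sketch below is a hypothetical helper (`rebatch` is not part of the connector) that buffers the variable-sized DataFrames yielded by `fetch_pandas_batches()` and re-slices them into batches of a requested row count; only the final batch may be smaller.

```python
from typing import Iterable, Iterator

import pandas as pd


def rebatch(frames: Iterable[pd.DataFrame], batch_size: int) -> Iterator[pd.DataFrame]:
    """Re-chunk an iterable of DataFrames into batches of `batch_size` rows.

    Buffers incoming frames until at least `batch_size` rows are available,
    then slices off exact-size chunks; the last chunk may be smaller.
    """
    buffer: list[pd.DataFrame] = []
    buffered = 0
    for frame in frames:
        buffer.append(frame)
        buffered += len(frame)
        while buffered >= batch_size:
            merged = pd.concat(buffer, ignore_index=True)
            yield merged.iloc[:batch_size]
            leftover = merged.iloc[batch_size:]
            buffer = [leftover] if len(leftover) else []
            buffered = len(leftover)
    if buffered:  # flush whatever is left as a final, smaller batch
        yield pd.concat(buffer, ignore_index=True)
```

With a real cursor this would be used as `for chunk in rebatch(cur.fetch_pandas_batches(), 50_000): ...`. The trade-off is that buffering can briefly hold up to one extra server batch in memory, so this only bounds memory if the server's own batches are not themselves enormous, which is exactly why a server-side `batch_size` would still help.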
@github-actions github-actions bot changed the title Allow specification of of batch_size for batch-generating functions SNOW-899773: Allow specification of of batch_size for batch-generating functions Aug 24, 2023
@kylejcaron

kylejcaron commented Dec 14, 2023

It would still solve our problem (and I think most similar problems) if the returned batches are only close, but not exactly the requested size. Generally, we only want to ensure we do not receive batches so big that a worker process runs out of memory or receive many batches so small that it takes too long to iterate and merge them.

Hope y'all don't mind me chiming in here - a lot of ML frameworks require tightly controlled batch sizes. An example pattern that's compatible with a lot of these frameworks is to_torch_datapipe from snowflake.ml

It might be worth finding an expert who uses TensorFlow/PyTorch/JAX at scale to weigh in

@bitshop

bitshop commented Jan 5, 2024

I found this issue while searching around, thinking I must be misreading the docs or just not finding another command in the connector that fetches up to a specific size. NOTE: I suggest this may be worth breaking into two requests:

  1. Specify a MAX batch size, for memory constrained apps this is fine
  2. Specify an EXACT batch size - This potentially requires more compute on Snowflake's side, since rows would have to be borrowed from other batches to make each fetched batch exactly the requested size. Assuming a highly distributed query, I would think each producer of rows would only publish its own row count - hence my thinking that this request is harder.

To be clear, the difference is that #1 is about memory constraints: if some batches have 1 row and others the max, that's acceptable in that use case.
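Option 1 (a MAX batch size) is also the easy one to emulate client-side, since it needs no buffering at all. The sketch below is a hypothetical helper (`cap_batch_size` is not part of the connector): small batches pass through untouched and oversized ones are split, so peak memory stays bounded by one source batch.

```python
from typing import Iterator, Sequence


def cap_batch_size(batches, max_rows: int) -> Iterator[Sequence]:
    """Enforce only a MAXIMUM batch size (option 1 above).

    Works on any row container that supports len() and slicing,
    e.g. lists of rows or pandas DataFrames.
    """
    for batch in batches:
        # Slice oversized batches into max_rows-sized pieces;
        # batches already at or under the cap yield a single piece.
        for start in range(0, len(batch), max_rows):
            yield batch[start:start + max_rows]
```

Note this gives the memory guarantee but not uniform sizes: a 1-row server batch still comes out as a 1-row batch, matching the "some batches have 1 row" case described above.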

@sfc-gh-dszmolka

Thank you for opening this request with us - we'll consider it for a possible future improvement in the connector.

@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage_done Initial triage done, will be further handled by the driver team and removed needs triage labels Mar 11, 2024
@JHuangg

JHuangg commented Mar 18, 2024

Is there a way to understand how the batch size is being determined?
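As the thread above suggests, the sizes appear to be chosen server-side when the result set is split into chunks, not by the client. One way to at least inspect the sizes you actually received is via `cursor.get_result_batches()`, whose `ResultBatch` objects expose a `rowcount` attribute. A minimal sketch (the helper name is made up for illustration):

```python
def batch_row_counts(batches) -> list:
    """Return the per-batch row counts for a query result.

    With the real connector, `batches` would be
    `cursor.get_result_batches()`; each ResultBatch reports its
    row count without downloading the batch's data.
    """
    return [batch.rowcount for batch in batches]
```

Logging these counts for a few representative queries makes it easy to see how uneven the server-chosen batch sizes are in practice.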


7 participants