Should we deprecate chunksize from read_parquet? #9043
…` and ``aggregate_files`` (#9052) As discussed in #9043 (for `chunksize`) and #9051 (for `aggregate_files`), I propose that we deprecate two complex and rarely-utilized arguments from `read_parquet`: `chunksize` and `aggregate_files`. This PR simply adds "pre-deprecation" warnings for the targeted arguments (including links to the relevant Issues discussing their deprecation). My goal is to find (and inform) whatever users may be depending on these obscure options.
We use `chunksize` quite a lot because some of the data we have to read is not under our control and has a large number of partitions, i.e. one parquet sub-directory contains many files. I believe the large number of partitions causes the task graph to grow and Dask workers to die.
Since moving over to using the following:
df = dd.read_parquet(path,
                     engine="pyarrow",
                     chunksize="200MB",
                     aggregate_files=True,
                     gather_statistics=True,
                     filters=some_filter)
Reading the data, and processing it, has become faster and more reliable. My guess is that, in the future, the replacement will be `split_row_groups=N`, where N is a number we reach by trial and error?
Rather than deprecate, could we lean in and automate? For example we might randomly sample the size of row groups, and from that and a default chunk size determine how many row groups to include in every task.
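A rough sketch of what that sampling heuristic could look like, reusing the pyarrow dataset API that comes up later in this thread; the helper name and the 128 MiB default are made up for illustration and are not an existing Dask API:

```python
import random

import pyarrow.dataset as ds


def row_groups_per_task(path, default_chunksize=128 * 2**20, sample=10):
    """Hypothetical helper (not a Dask API): sample row-group sizes and
    decide how many row groups to pack into each read task."""
    fragments = list(ds.dataset(path, format="parquet").get_fragments())
    sampled = random.sample(fragments, min(sample, len(fragments)))
    sizes = [
        piece.row_groups[0].total_byte_size
        for frag in sampled
        for piece in frag.split_by_row_group()
    ]
    typical = sorted(sizes)[len(sizes) // 2]  # rough median row-group size
    return max(1, default_chunksize // typical)
```

The result could then feed the existing integer form of `split_row_groups`, e.g. `dd.read_parquet(path, split_row_groups=row_groups_per_task(path))`.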
Yes - This is similar to what Merlin/NVTabular does, and is much easier to support than the current `chunksize` logic.
Thanks for the example @alienscience - It seems like you are relying on the current relationship between `chunksize` and `aggregate_files`.
We use `chunksize` as well. We read a small number of large Apache Parquet files that, expanded in memory, can easily exceed 32GB each. Using a 64-core/512GB machine, for example, we could use 16 workers (32GB per worker) and 48 workers would be idle. The idea is to use `chunksize` to break those files into smaller partitions so that all of the workers have something to do.
I admit there might be better alternatives, and I confess I was a bit reluctant to use this parameter, which looked obscure and undocumented, with the current implementation not being very efficient. But I'm very much keen to learn and contribute, and any recommendations would be immensely appreciated.
In summary, there seems to be a use case where reading the input data puts the user in out-of-memory territory, but using the power of Dask to break it into smaller pieces could be a solution. Greatly appreciative of your work and maintaining Dask! Thank you so much for creating this issue for discussion.
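To make the arithmetic above concrete, and to sketch how the same goal can be expressed without `chunksize` (the path is a placeholder, and whether this fits the original pipeline is an assumption):

```python
import dask.dataframe as dd

# 512 GB machine, ~32 GB per expanded file: only 512 // 32 = 16 workers
# can hold a whole-file partition, leaving 48 of the 64 cores idle.
# Splitting each file at row-group boundaries yields many smaller
# partitions (assuming each file contains multiple row groups), so
# all cores can participate.
df = dd.read_parquet(
    "s3://bucket/large-parquet-files/",  # placeholder path
    engine="pyarrow",
    split_row_groups=True,  # one output partition per parquet row group
)
```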
Thank you for commenting @g4brielvs! This is good information - Hopefully we can preserve the features you are finding useful as we clean things up. A few questions with this motivation in mind, starting with: what do the row-group sizes in your dataset look like? You can check with something like the following:
import pandas as pd
import pyarrow.dataset as ds
path = "/your/dataset/path"
sizes = []
first_row_group = None
for file_frag in ds.dataset(path).get_fragments():
    for rg_frag in file_frag.split_by_row_group():
        row_group = rg_frag.row_groups[0]
        if first_row_group is None:
            first_row_group = row_group
        sizes.append(row_group.total_byte_size)
print(f"First row-group size: {first_row_group.total_byte_size} bytes\n")
print(pd.DataFrame({"sizes": sizes}).describe())
Example Output
I ask this because I am hoping that we can capture the benefits of `chunksize` even as we simplify `read_parquet`.
I wish I could use `chunksize`, but I'm not sure that I'm doing it correctly.
My case: an imbalanced parquet dataset on S3.
Example: 30 parts, each ~70 MB of compressed numbers, and ONE part of ~1 GB. I want to read this parquet from an S3 location (s3fs) with Dask with N workers (processes), each worker having the same amount M of RAM. By default Dask tries to parallelize read operations across all workers (1 worker per parquet part); 29 read successfully, but the last one (~1 GB) goes to a single worker in the same way, and that worker fails to read it because of memory. I'd be happy if Dask could read that big file in chunks (`chunksize`), or at least multiply the number of partitions by a factor passed with `split_row_groups`.
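For reference, a hedged sketch of that idea using the existing integer form of `split_row_groups` (the path is a placeholder, and this only helps if the large file actually contains multiple row groups):

```python
import dask.dataframe as dd

# Each output partition corresponds to at most 8 row groups per file
# (8 is an arbitrary choice here), so the single ~1 GB part is split
# into several smaller tasks instead of landing on one worker whole.
df = dd.read_parquet(
    "s3://my-bucket/imbalanced-dataset/",  # placeholder path
    engine="pyarrow",
    split_row_groups=8,
)
```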
First row-group size: 96876297 bytes
count 3.600000e+01
Great! Please do let us know if you run into any issues.
I've run into this a couple of times recently. In trying to read the nyc-taxi data on the NYC-TLC S3 bucket I now get 12 large partitions rather than many, which was unpleasant. I was also just speaking with a customer who routinely has 100 GB partitions and was sad about the new default (although also quite happy about not running into metadata issues).
If it were the case that reading metadata for a single file/row-group was reliably easy, then I would be in favor of at least that (maybe sampling would be good as well). Are there good arguments against this? Client/worker mismatches in environment and data access?
I do think the most reliable approach for getting good partition sizes is to simply read the first row-group in the dataset up front, and to use its actual in-memory size to choose the partitioning.
The possible drawbacks are that (1) this approach is a bit slower than the current default, (2) the first row group may not be representative of all row-groups in the dataset, and (3) it's yet another new feature and behavior change that some users may not like :)
With these drawbacks in mind, I suppose we could start by using a simple metadata-based heuristic.
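A minimal sketch of that idea, assuming pyarrow is the engine; the `decide_split` helper and the 128 MiB target are made up for illustration:

```python
import pyarrow.dataset as ds


def decide_split(path, target_partition_size=128 * 2**20):
    """Hypothetical heuristic: materialize the first row group and use its
    real in-memory footprint to choose between file- and row-group-sized
    partitions."""
    dataset = ds.dataset(path, format="parquet")
    first_file = next(iter(dataset.get_fragments()))
    first_rg = first_file.split_by_row_group()[0]
    # Actual in-memory size of one row group (pandas representation).
    in_memory = first_rg.to_table().to_pandas().memory_usage(deep=True).sum()
    if in_memory >= target_partition_size:
        # A single row group is already near the target: split by row group.
        return True
    # Otherwise pack several row groups into each partition.
    return max(1, int(target_partition_size // in_memory))
```

Either return value maps onto the existing `split_row_groups` argument (a bool, or an integer number of row groups per partition).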
Adding the functionality seems like a general good to me. Making it default introduces some questions. I would defer to you and @ian-r-rose on what makes sense there. My current pain points me to want to make it default, but that might be short-sighted if I don't have all of the context in-brain.
@ian-r-rose any concerns that you'd like to raise here? I'll propose a basic "look at the metadata for the first partition" policy. Concerns?
Apologies for the slow response -- I don't have any objections to peeking at the statistics for the first row group and making a decision based on some reasonable heuristics about whether to read in parquet files per-row-group or per-file. It's likely that our estimates of the expansion factor from parquet files to in-memory dataframes would be pretty rough, as they would be dependent on both features of the data (dictionary encoding, length encoding, etc.) and on output formats (e.g., are we using pyarrow strings? cf. #9617). So I'd imagine this would be something like a 75% solution, and would require some extra thought for some users.
An interesting wrinkle here is that a good choice for whether to split by row group or not is a function of how much memory your individual workers have. So implementing this well might also require knowing something about the workers' memory.
Given the feedback I think that we should pursue this. @rjzamora is this something you're interested in?
I kinda agree with this, but not entirely. I think that ideal chunk size has more to do with the CPU/memory ratio than it does with total memory. Most machines today have a pretty consistent ratio of 4 GB/core.
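A back-of-the-envelope version of that reasoning; the 25% headroom factor is an arbitrary assumption, not a Dask recommendation:

```python
# ~4 GB of RAM per core, as noted above.
memory_per_core = 4 * 2**30
# Leave room for intermediate copies, shuffles, etc.
safety_fraction = 0.25
target_partition_size = int(memory_per_core * safety_fraction)
print(target_partition_size // 2**20, "MiB per partition")  # -> 1024 MiB
```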
Sure - I'd be interested (unless someone else is eager to jump in). I do think it makes sense for us to try to agree on the desired API first... My immediate thought is to change the existing […]. It may also make sense for […]. Some drawbacks to either of these […]
I don't have strong opinions. I'd encourage you to push ahead if no one else jumps in in the next few days.
Update: My initial proposal in #9637 is to add a (default) […].
Update: #9637 is now merged - The "final" decision there was to continue deprecating `chunksize`.
Follow-up to recent issues like #8937
Recent PRs have simplified the `read_parquet` API a bit, but the code is still vast. After some careful consideration, I'd like to suggest that we deprecate the `chunksize` option.
Why to deprecate `chunksize`:
cc @jcrist