Change split_row_groups default to "infer" #9637

Merged (26 commits into dask:main on Feb 21, 2023)

Conversation

rjzamora
Member

@rjzamora rjzamora commented Nov 8, 2022

This PR proposes new "infer" and "adaptive" options for split_row_groups, as well as a new blocksize argument (defaulting to "128 MiB") to replace chunksize. Using split_row_groups="infer" (the new default) results in the uncompressed storage sizes specified in the metadata of the first file being used to set split_row_groups to either "adaptive" or False (depending on blocksize). The immediate result is that users should be much less likely to get large/problematic partition sizes by default. For users with unbalanced row-group sizes, split_row_groups="adaptive" can be set directly to ensure that none of the output partitions will exceed blocksize (according to the uncompressed storage sizes recorded in the parquet metadata).
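As a usage sketch of the options described above (the dataset path below is hypothetical, not from this PR):

import dask.dataframe as dd

# New default: use the first file's metadata to decide whether to map whole
# files or row-group ranges to output partitions
df = dd.read_parquet("s3://bucket/dataset/")  # split_row_groups="infer", blocksize="128 MiB"

# Force adaptive splitting so that no partition exceeds blocksize
# (according to the uncompressed sizes recorded in the parquet metadata)
df = dd.read_parquet(
    "s3://bucket/dataset/",
    split_row_groups="adaptive",
    blocksize="256 MiB",
)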

After this PR is revised/merged, I propose that we do something similar for aggregate_files (probably a mixture of aggregate_files="adaptive" and the changes in #9197). For example, ...

The difference is that the current chunksize option results in us (slowly) choosing a distinct number of row-groups to aggregate together for every individual partition. I suppose we can still support this (slower) fine-grained partitioning approach when the user specifies split_row_groups=True (or aggregate_files=True) and chunksize=<something>.

@mrocklin
Member

mrocklin commented Nov 8, 2022

Oh cool. I'm excited about this.

🤔 who should review this? @ian-r-rose is out

Any concerns you want to call out @rjzamora ? I notice that this is in draft mode.

Member

@jrbourbeau jrbourbeau left a comment

Thanks @rjzamora -- I'll take a look at this

@jrbourbeau jrbourbeau self-requested a review November 8, 2022 22:59
@rjzamora rjzamora marked this pull request as ready for review November 8, 2022 23:01
@rjzamora
Member Author

rjzamora commented Nov 8, 2022

I'll take a look at this

Thanks @jrbourbeau !

Any concerns you want to call out @rjzamora ? I notice that this is in draft mode.

It should be ready for feedback. My main points of uncertainty are:

  1. Should we remove the deprecation plan (and the corresponding FutureWarning) for "legacy" chunksize usage? With the introduction of split_row_groups="auto", we are able to do something much faster than what chunksize was doing before. However, there may still be messy datasets where each partition really does need to correspond to a distinct row-group count (from 1+ files). I propose that we still support that behavior if the user explicitly specifies split_row_groups=True (rather than an integer or "auto") and chunksize=<something> (see the sketch after this list).
  2. Are we okay with chunksize corresponding to an uncompressed storage size, or do we want to use the real in-memory size? Although I was originally planning for in-memory size, I eventually decided on using a value that is available in the parquet metadata (for performance reasons).
  3. What should the default chunksize be when split_row_groups="auto"? This PR proposes the same as the default blocksize used in read_csv (1/10 the per-core memory).
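A sketch of the explicit, fine-grained combination mentioned in point 1 (hypothetical path; assumes the legacy behavior is kept):

import dask.dataframe as dd

# Opt in to the slower, fine-grained aggregation: iterate the row-group
# statistics and pack adjacent row-groups into each partition until ~1 GiB
# of uncompressed storage is reached (aggregate_files controls whether a
# partition may span file boundaries)
df = dd.read_parquet(
    "s3://bucket/dataset/",
    split_row_groups=True,
    chunksize="1 GiB",
    aggregate_files=True,
)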

Member

@fjetter fjetter left a comment

However, there may still be messy datasets where each partition really does need to correspond to a distinct row-group count (from 1+ files)

What use case would that be? Honestly, I don't think we should offer this functionality at all. Row groups are more or less random chunks with statistics to optimize data access. Why would we want to guarantee a one-to-one mapping to dask partitions?
Even if that is a use case, how frequent is it really? Is this edge case really worth the complexity?

Are we okay with chunksize corresponding to an uncompressed storage size, or do we want to use the real in-memory size? Although I was originally planning for in-memory size, I eventually decided on using a value that is available in the parquet metadata (for performance reasons).

I don't see a problem with this as long as it is clearly documented. I'd love some utility functions that could tell the user what the average compression ratio for a given dataset is (even if it is just sampled) to help advanced users make good decisions.
In my experience, compressed vs uncompressed is maybe a factor of 2-3 unless the user really knows what they are doing and write a dataset highly compressed (I've seen parquet files with 20-100 compression ratios but this is very rare and this kind of expert user can use the API appropriately).

What should the default chuksize be when split_row_groups="auto"? This PR proposes the same as the default blocksize used in read_csv (1/10 the per-core memory).

I was surprised to see the 1/10 implementation. This will break down for all kinds of distributed compute scenarios. Even for a LocalCluster this might be a bit much.

One data point: I have a 16GB / 4 CPU notebook. That would create 400MB uncompressed data partitions, which will easily blow up to a GB or more after decompression. That's a bit heavy.
This will, of course, be caught by min(blocksize, int(64e6)), i.e. realistically I get 64MB * compression ratio almost all the time, unless somebody has many CPUs with virtually no memory.

So, two things

  1. I think the logic we're applying here is overkill. psutil + rounding down to 64MB sounds smart (sketched after this list), but in the end it's a hard-coded 64MB, so why not be explicit? Even without the rounding, the psutil-based value will lead to confusing results when executing on a distributed cluster: why would the client's memory size have any bearing on worker memory? Lastly, the truly optimal chunksize should consider cluster size as well. Given all that, why not drop the "auto" calculation and just set "64/100MiB" as the default? In my experience this is better UX than some magical/opaque "auto".
  2. 64MB uncompressed feels OK. That'll create 100-500MB partitions, which I think is a comfortable place for a default value.
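For reference, a rough sketch of the 1/10-per-core heuristic under discussion (mirroring the read_csv-style logic, not necessarily the exact code in this PR):

import psutil

# 1/10 of the per-core memory, capped at 64 MB -- which is why, in practice,
# the psutil machinery almost always lands on the hard-coded 64 MB anyway
total_memory = psutil.virtual_memory().total
cpu_count = psutil.cpu_count()
blocksize = min(int(total_memory // cpu_count / 10), int(64e6))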

Comment on lines 278 to 300
split_row_groups : "auto", bool, or int, default "auto"
If True, then each output dataframe partition will correspond to a single
parquet-file row-group. If False, each partition will correspond to a
complete file. If a positive integer value is given, each dataframe
partition will correspond to that number of parquet row-groups (or fewer).
If "auto" (the default), the uncompressed storage size of all row-groups
in the first file will be used to automatically set a value that is
consistent with ``chunksize``.
chunksize : int or str, default None
WARNING: The ``chunksize`` argument will be deprecated in the future.
Please use ``split_row_groups`` to specify how many row-groups should be
mapped to each output partition. If you strongly oppose the deprecation of
``chunksize``, please comment at https://github.com/dask/dask/issues/9043".

The desired size of each output ``DataFrame`` partition in terms of total
(uncompressed) parquet storage space. If specified, adjacent row-groups
and/or files will be aggregated into the same output partition until the
cumulative ``total_byte_size`` parquet-metadata statistic reaches this
value. Use `aggregate_files` to enable/disable inter-file aggregation.
(uncompressed) parquet storage space. If ``split_row_groups='auto'``,
this argument will default to 1/10 the per-core system memory. The metadata
of the first file will then be used to choose an ``split_row_groups`` value
that is consistent with ``chunksize``.

WARNING: Using the ``chunksize`` argument in the absence of
``split_row_groups='auto'`` is often slow on large and/or remote datasets.

If ``split_row_groups`` is set to ``True``, ``chunksize`` defaults to
``None``. If ``chunksize`` is set to an explicit value, adjacent row-groups
will be aggregated into the same output partition until the cumulative
``total_byte_size`` parquet-metadata statistic reaches that value.
Use `aggregate_files` to enable/disable inter-file aggregation.
Member

High-level-API-wise, I don't feel great about supporting two kwargs that are so strongly entangled.

Isn't chunksize already sufficient? The fact that we're splitting the files per row group almost feels like an implementation detail.

chunksize
"auto": Generate reasonably large partitions, e.g. 50-100MB or larger. Fuse or split files as appropriate.
int/str: Specify the size in bytes, e.g. 100 * 1024**2 or "100 MiB". Fuse or split files as appropriate.
False: Disable all fusing/splitting and serve files as-is.

The same argument applies to aggregate_files. Do we need this? (Out of scope for this PR, of course.)

@rjzamora
Member Author

Thanks for the review @fjetter !

What use case would that be? Honestly, I don't think we should offer this functionality at all. Row groups are more or less random chunks with statistics to optimize data access. Why would we want to guarantee a one-to-one mapping to dask partitions?
Even if that is a use case, how frequent is it really? Is this edge case really worth the complexity?

I probably agree with you, but I’m not sure I understand this question: “Why would we want to guarantee a one-to-one mapping to dask partitions?”



To clarify exactly what I am talking about here, let's pretend we have a simple Parquet dataset with two files, and each file has 3 x 500MB row-groups:

File-0: [rg-0, rg-1, rg-2]
File-1: [rg-0, rg-1, rg-2]

Case A - The result I am talking about supporting (chunksize="1GB", split_row_groups=True) would be:

Partition-0: [(File-0, [rg-0, rg-1])]
Partition-1: [(File-0, [rg-2]), (File-1, [rg-0])]
Partition-2: [(File-1, [rg-1, rg-2])]

Case B - The (faster) result implemented in this PR (chunksize="1GB", split_row_groups="auto") would be:

Partition-0: [(File-0, [rg-0, rg-1])]
Partition-1: [(File-0, [rg-2])]
Partition-2: [(File-1, [rg-0, rg-1])]
Partition-3: [(File-1, [rg-2])]

We currently support "case A", but it doesn't get used much (and currently returns a "pre-deprecation" warning), because iterating through the row-group statistics can be slow. I'd be very happy to drop support for this altogether, but it also seems like a reasonable result to want.



I don't see a problem with this as long as it is clearly documented. I'd love some utility functions that could tell the user what the average compression ratio for a given dataset is (even if it is just sampled) to help advanced users make good decisions.
In my experience, compressed vs uncompressed is maybe a factor of 2-3 unless the user really knows what they are doing and write a dataset highly compressed (I've seen parquet files with 20-100 compression ratios but this is very rare and this kind of expert user can use the API appropriately).

Utility functions do sound useful here. However, just to clarify one point: Compression shouldn’t really come into play here. The parquet metadata we use corresponds to the parquet storage size without compression (even if the data ends up being compressed). Therefore, the distinction between the specified chunksize and the “real” memory usage will have more to do with dictionary encoding and other variations in the memory layout (e.g. “object” vs string).
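For example, the relevant statistic can be inspected directly with pyarrow (file name hypothetical); total_byte_size reports the uncompressed storage footprint even when the column chunks are compressed on disk:

import pyarrow.parquet as pq

md = pq.ParquetFile("part.0.parquet").metadata  # hypothetical file name
rg = md.row_group(0)
uncompressed = rg.total_byte_size  # what the chunksize/blocksize logic compares against
compressed = sum(rg.column(i).total_compressed_size for i in range(rg.num_columns))
print(uncompressed, compressed)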

I was surprised to see the 1/10 implementation. This will break down for all kinds of distributed compute scenarios. Even for a LocalCluster this might be a bit much.




The 1/10 implementation is just the precedent set in other places (like dd.read_csv). Do you think we should overhaul auto-partitioning in other places as well? I agree that there is no guarantee that the workers and client will have a comparable memory/core ratio, so perhaps using/documenting a reasonable default is better than playing with psutil to produce a system-dependent default.




To partially summarize my thoughts here: I think I agree that the auto-partitioning logic is overkill. I also want to make sure it is clear that we don’t need to worry about compression (unless you are using compression to refer to memory-layout details like dictionary encoding).

@fjetter
Member

fjetter commented Dec 1, 2022

Utility functions do sound useful here. However, just to clarify one point: Compression shouldn’t really come into play here. The parquet metadata we use corresponds to the parquet storage size without compression (even if the data ends up being compressed).

Never mind then.

To clarify exactly what I am talking about here, let's pretend we have a simple Parquet dataset with two files, and each file has 3 x 500MB row-groups:

Case A / Case B

Thanks for the clarification. I was under the impression that we'd always do A, even now.

The one-to-one mapping I'm talking about is a use case that would break out every RG into a dedicated partition, regardless of actual RG sizes:

Partition-0: [(File-0, [rg-0])]
Partition-1: [(File-0, [rg-1])]
Partition-2: [(File-0, [rg-2])]
Partition-3: [(File-1, [rg-0])]
Partition-4: [(File-1, [rg-1])]
Partition-5: [(File-1, [rg-2])]

This example would still be fine because 500MB is a nice partition size, but in general I think RGs are smaller (at least I've been working very successfully with much smaller RGs), and this would produce horrible results.

Isn't chunksize=None, split_row_groups=True/auto doing this, or is this not possible?

My bigger point is actually: why do we have a split_row_groups at all? What's the point of defining a chunk size if we do not split row groups?

The 1/10 implementation is just the precedent set in other places (like dd.read_csv). Do you think we should overhaul auto-partitioning in other places as well? I agree that there is no guarantee that the workers and client will have a comparable memory/core ratio, so perhaps using/documenting a reasonable default is better than playing with psutil to produce a system-dependent default.

If that's what we're using, yes, I believe we should. The psutil logic is smarter than it needs to be, since in almost all situations it will default to 64MB. From a UX perspective, I'd rather set the default to 64MB and write a sentence in the docs (or point to an entire page weighing the pros/cons).

@fjetter
Member

fjetter commented Dec 1, 2022

This argument also makes me wonder: what datasets are we using to measure this? 500MB RGs sound insanely large in my experience. I wonder if we have a severe misalignment here.

@fjetter
Member

fjetter commented Dec 14, 2022

FWIW I'm not blocking on this. Just wanted to learn more about what we do and why. I think this is a good step even if we pursue something different long term (e.g. #9637)

@rjzamora
Member Author

FWIW I'm not blocking on this. Just wanted to learn more about what we do and why. I think this is a good step even if we pursue something different long term (e.g. #9637)

That makes sense. I do ultimately want to move in a different direction, but agree that this PR is clearly an improvement.

The only lingering question I have about this particular PR is how to merge it without effectively "blocking" the possibility of further improvement. For example, I'd personally like to continue with the plan to deprecate chunksize (under its current meaning), because (1) the current implementation is too ugly to maintain, and (2) chunksize usually corresponds to a row count (rather than a byte size) in pandas.

Perhaps the answer is to continue deprecating chunksize, and to add a distinct blocksize argument whose sole purpose (for now) is to control split_row_groups when it is set to "auto".

@rjzamora rjzamora changed the title Change split_row_groups default to "auto" Change split_row_groups default to "infer" Jan 26, 2023
@fjetter
Member

fjetter commented Feb 9, 2023

I'd be happy about this but to_parquet should be changed to automatically split into sensible row groups by default

Off topic but this is a -1 from me. Finding sensible row group sizes depends on many different factors like data types, number of columns, entropy of the data, cardinality, access patterns, etc.

I agree that 67_108_864 is too large but finding a good default that suits everyone is not possible. Hence, pyarrow is choosing a very boring but very safe default.
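For reference, row-group sizing is already a write-time knob; a minimal pyarrow sketch (values purely illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": range(1_000_000)})
# Cap each row-group at 100k rows instead of relying on the writer's default
pq.write_table(table, "out.parquet", row_group_size=100_000)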

@fjetter
Member

fjetter commented Feb 9, 2023

My backlog for reviews is currently a bit full. @jrbourbeau do you have capacity for this?

@@ -11,7 +11,7 @@ dataframe:
   shuffle-compression: null # compression for on disk-shuffling. Partd supports ZLib, BZ2, SNAPPY
   parquet:
     metadata-task-size-local: 512 # Number of files per local metadata-processing task
-    metadata-task-size-remote: 16 # Number of files per remote metadata-processing task
+    metadata-task-size-remote: 1 # Number of files per remote metadata-processing task
Member Author

Setting this greater than 1 is only an optimization when you are dealing with many small files. However, it is more likely that we will need to parse metadata when files are large, and we need to split them by row-group. For large remote files, metadata_task_size=1 is often a better choice.
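For users who want a different trade-off (e.g. many small remote files), this remains a regular dask config option; a minimal sketch:

import dask

# Batch 16 remote files per metadata-processing task again, which can be
# faster when the dataset consists of many small files
dask.config.set({"dataframe.parquet.metadata-task-size-remote": 16})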

@rjzamora
Member Author

Small update: I'd like to get this in before the next release.

cc @martindurant - In case you have thoughts

assert "_metadata" not in files
path = os.path.join(dirname, "*.parquet")

# Use (default) split_row_groups="auto"
Member

Default is actually "infer" ?

@martindurant
Member

The description, conversation and reasoning above all seem sound to me.
The names "infer" and "auto" mean more or less the same in my mind, so I wonder if better names could be found - but I don't have an immediate better suggestion.

I feel like test_blocksize should ideally have an exact result to measure against. If we specify no compression, no nulls, and no dict encoding, then the sizes of the column chunks we play with should be nearly deterministic ("nearly" because the thrift header stuff adds a little). Actually writing tests like this might be time-consuming, however; but it would be nice to at least check that our idea is reasonable:

import pandas as pd
import fastparquet

df = pd.DataFrame({"a": range(10000)})

# written without compression via fastparquet
df.to_parquet("out.parquet", index=False, engine="fastparquet", compression=None)
pf = fastparquet.ParquetFile("out.parquet")
pf.row_groups[0].total_byte_size  # => 80039  ~= 10000 * 8

# written without compression or dictionary encoding via pyarrow
df.to_parquet("out2.parquet", index=False, engine="pyarrow", compression=None, use_dictionary=False)
pf = fastparquet.ParquetFile("out2.parquet")
pf.row_groups[0].total_byte_size  # => 80075

(note that pyarrow always wants to use dictionary encoding, which would have made this data appear ~20% bigger)

@github-actions github-actions bot added the documentation Improve or add to documentation label Feb 21, 2023
@rjzamora
Member Author

Thanks for the review @martindurant !

The names "infer" and "auto" mean more or less the same in my mind

I also struggled with this naming question a bit. The default setting of "infer" is meant to mean: "use a metadata sample from the first file to decide if we should be partitioning by file, or if we should be using a row-group range for each partition." I like the "infer" name for this a bit more than "auto", because we are using a metadata sample to "infer" properties for the rest of the dataset.

I am less comfortable with the meaning of "auto". Right now, it means that we will "individualize" the number of row groups for each partition, depending on the size of each row-group. Perhaps the name for this should be something like "variable", "adaptive", or "dynamic"?

@martindurant
Member

adaptive or dynamic sounds better to me

@rjzamora
Member Author

adaptive or dynamic sounds better to me

Okay - Thanks for the quick feedback! Going with "adaptive"

@rjzamora rjzamora changed the title Change split_row_groups default to "infer" Change split_row_groups default to "infer" Feb 21, 2023
@rjzamora rjzamora merged commit 556d3df into dask:main Feb 21, 2023
@rjzamora rjzamora deleted the auto-split-row-groups branch February 21, 2023 20:12
mrocklin pushed a commit to dask/dask-expr that referenced this pull request Feb 22, 2023
Updates `ReadParquet` to use metadata-parsing and IO logic from `dask.dataframe.io.parquet`.

Requires dask/dask#9637 (only because my environment was using that PR when I put this together).