Remove statistics-based set_index logic from read_parquet #9661

rjzamora · 2022-11-14T23:33:41Z

dd.read_parquet will only gather statistics for columns that either (1) have been designated as an index or (2) are being filtered. For this reason, it no longer makes sense to auto-infer an index that was not specified in the pandas metadata or by the user (via dd.read_parquet(..., index=)). By using statistics to automatically set an arbitrary sorted column as the index, we open ourselves up to problematic/surprising behavior (see dask-contrib/dask-sql#903 (comment)).

This PR proposes that we officially remove the logic used to automatically select an index column using statistics.

Closes #xxxx
Tests added / passed
Passes pre-commit run --all-files

rjzamora · 2022-11-14T23:33:53Z

cc @charlesbluca

fjetter · 2022-11-29T11:05:49Z

+1 for the change itself. I don't think we should auto-set the index. From a UX perspective, I could see this being a utility function answering the question "Are there sorted columns in my dataset that would be useful as an index?", but setting it automatically with no opt-out seems strange.

This is likely breaking to some. I'm wondering if we want to communicate this somehow. How did we deal with these situations in the past?

Code also LGTM. I'm OK with merging after a brief discussion about the deprecation cycle.
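
The opt-in utility suggested above ("are there sorted columns that would be useful as an index?") could look roughly like the sketch below. It is an illustration only, not code from this PR or from dask: the function name `find_sorted_columns` and the statistics layout (one dict per partition, mapping column name to a `(min, max)` tuple) are assumptions made for the example.

```python
from typing import Any, Dict, List, Sequence, Tuple

# Hypothetical layout (an assumption for this sketch): one dict per
# partition, mapping column name -> (min, max) for that partition.
PartitionStats = Dict[str, Tuple[Any, Any]]


def find_sorted_columns(stats: Sequence[PartitionStats]) -> List[str]:
    """Return columns whose min/max ranges are ordered across partitions."""
    if not stats:
        return []

    # Only columns that have statistics in every partition can be checked.
    candidates = set(stats[0])
    for part in stats[1:]:
        candidates &= set(part)

    sorted_columns = []
    for name in sorted(candidates):
        ranges = [part[name] for part in stats]
        # Missing statistics tell us nothing, so skip the column.
        if any(lo is None or hi is None for lo, hi in ranges):
            continue
        # Each partition must start at or after the previous one ends.
        # With a single partition this is trivially true.
        ordered = all(
            prev_hi <= next_lo
            for (_, prev_hi), (next_lo, _) in zip(ranges, ranges[1:])
        )
        if ordered:
            sorted_columns.append(name)
    return sorted_columns


stats = [
    {"a": (0, 4), "c": (0, 6)},
    {"a": (5, 9), "c": (5, 9)},
]
print(find_sorted_columns(stats))
```

Here `a` is reported because its partition ranges do not overlap, while `c` is not. A helper like this only reports candidates; choosing one as the index stays an explicit decision made through `index=`.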

rjzamora · 2022-11-29T15:47:17Z

> This is likely breaking to some. I'm wondering if we want to communicate this somehow. How did we deal with these situations in the past?

Good point. Although this PR changes the code to be more consistent with documentation, I agree that some users may be expecting the "undocumented" behavior that we are trying to fix.

The good news is that the only case we are really changing here is when index=None and filters is defined. This is because we only collect min/max statistics for index columns and filtered columns anyway. Therefore, I was able to add a simple UserWarning for the case that we have detected a sorted column that is not being used as the index already (and index=None).
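
The condition described above can be sketched as follows. This is a minimal illustration, not the code added in this PR: the helper name `warn_if_sorted_columns_ignored` is hypothetical, and the message wording beyond the "Sorted columns detected" prefix (which the new test matches on) is an assumption.

```python
import warnings
from typing import Optional, Sequence


def warn_if_sorted_columns_ignored(
    index: Optional[str], sorted_columns: Sequence[str]
) -> bool:
    """Warn when sorted columns were found but no index was requested.

    Returns True if a warning was emitted, False otherwise.
    """
    # Only the index=None case changes behavior: an explicit index is
    # respected as before, so there is nothing to warn about.
    if index is None and sorted_columns:
        warnings.warn(
            # Message text after the prefix is an assumption for this sketch.
            f"Sorted columns detected: {list(sorted_columns)}\n"
            "Use the `index` argument to set one of them as the index.",
            UserWarning,
        )
        return True
    return False


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_sorted_columns_ignored(None, ["a"])  # warns
    warn_if_sorted_columns_ignored("a", ["a"])   # silent: index was given
    warn_if_sorted_columns_ignored(None, [])     # silent: nothing detected

print(len(caught))
```

Only the first call warns, so users who already pass `index=` see no change and no new output.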

fjetter

I like the suggestion with the user warning

fjetter · 2022-12-01T09:51:28Z

dask/dataframe/io/tests/test_parquet.py

+def test_select_filtered_column(tmp_path, engine):
+
+    df = pd.DataFrame({"a": range(10), "b": ["cat"] * 10})
+    path = tmp_path / "test_select_filtered_column.parquet"
+    df.to_parquet(path, index=False)
+
+    with pytest.warns(UserWarning, match="Sorted columns detected"):
+        ddf = dd.read_parquet(path, engine=engine, filters=[("b", "==", "cat")])
+    assert_eq(df, ddf)


The fact that you are deleting so much logic without deleting a single test, and are even adding a new one, is possibly the most compelling reason why this feature needs to go.

rjzamora added 3 commits November 14, 2022 12:13

remove auto indexing

bf7f207

fix test

bf93d70

cleanup

f336634

rjzamora added the dataframe, parquet, and bug labels Nov 14, 2022

github-actions bot added the io label Nov 14, 2022

charlesbluca approved these changes Nov 15, 2022

add UserWarning when we ignore sorted columns (when we used to set them as the index)

9978200

rjzamora added 2 commits November 29, 2022 08:54

Merge remote-tracking branch 'upstream/main' into remove-auto-index

ec0c471

fix ci failures

4e30555

benrutter mentioned this pull request Nov 30, 2022

Method changed to support duplicate column cum-functions #9685

Merged

3 tasks

Merge remote-tracking branch 'upstream/main' into remove-auto-index

11d1d94

fjetter approved these changes Dec 1, 2022

fjetter merged commit 945435b into dask:main Dec 1, 2022

rjzamora deleted the remove-auto-index branch December 1, 2022 14:17
