Remove statistics-based set_index logic from read_parquet #9661

Merged
merged 7 commits into from Dec 1, 2022


rjzamora
Member

dd.read_parquet will only gather statistics for columns that either (1) have been designated as an index or (2) are being filtered. For this reason, it no longer makes sense to auto-infer an index that was not specified in the pandas metadata or by the user (via dd.read_parquet(..., index=)). By using statistics to automatically set an arbitrary sorted column as the index, we open ourselves up to problematic/surprising behavior (see dask-contrib/dask-sql#903 (comment)).

This PR proposes that we officially remove the logic used to automatically select an index column using statistics.

  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

@rjzamora rjzamora added dataframe parquet bug Something is broken labels Nov 14, 2022
@rjzamora
Member Author

cc @charlesbluca

@github-actions github-actions bot added the io label Nov 14, 2022
@fjetter
Member

fjetter commented Nov 29, 2022

+1 for the change itself. I don't think we should auto set the index. From a UX perspective, I could see this being a utility function answering the question "Are there sorted columns that could be useful to be an index in my dataset" but setting it automatically without opt-out appears to be strange.

This is likely breaking to some. I'm wondering if we want to communicate this somehow. How did we deal with these situations in the past?

Code also LGTM. I'm OK with merging after a brief discussion about the deprecation cycle.
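The kind of opt-in utility described in this comment ("are there sorted columns that could be useful to be an index in my dataset") could be sketched roughly as follows. This is only an illustration in plain Python, not dask's implementation: the function name `find_sorted_columns` and the input shape (one dict per row group mapping a column name to its `(min, max)` statistics) are assumptions made for the example.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical input shape: one dict per row group, in file order,
# mapping column name -> (min, max) as read from Parquet statistics.
RowGroupStats = Dict[str, Tuple[Any, Any]]


def find_sorted_columns(row_groups: List[RowGroupStats]) -> List[str]:
    """Return columns whose row-group ranges are ascending and do not overlap."""
    if not row_groups:
        return []

    # Only columns that have statistics in every row group can be checked.
    candidates = set(row_groups[0])
    for rg in row_groups[1:]:
        candidates &= set(rg)

    sorted_cols = []
    for col in sorted(candidates):
        is_sorted = True
        prev_max = None
        for rg in row_groups:
            lo, hi = rg[col]
            # Missing statistics or an inverted range disqualify the column.
            if lo is None or hi is None or lo > hi:
                is_sorted = False
                break
            # Each row group must start at or after the previous one's max.
            if prev_max is not None and lo < prev_max:
                is_sorted = False
                break
            prev_max = hi
        if is_sorted:
            sorted_cols.append(col)
    return sorted_cols


stats = [
    {"a": (0, 4), "b": ("cat", "cat")},
    {"a": (5, 9), "b": ("cat", "cat")},
]
print(find_sorted_columns(stats))  # ['a', 'b']
```

Note that the constant column `b` also qualifies as "sorted" under this check, which hints at why picking an index automatically from such a result can be surprising.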

@rjzamora
Member Author

> This is likely breaking to some. I'm wondering if we want to communicate this somehow. How did we deal with these situations in the past?

Good point. Although this PR changes the code to be more consistent with documentation, I agree that some users may be expecting the "undocumented" behavior that we are trying to fix.

The good news is that the only case we are really changing here is when index=None and filters is defined. This is because we only collect min/max statistics for index columns and filtered columns anyway. Therefore, I was able to add a simple UserWarning for the case that we have detected a sorted column that is not being used as the index already (and index=None).

Member

@fjetter fjetter left a comment


I like the suggestion with the user warning

Comment on lines +4406 to +4414
def test_select_filtered_column(tmp_path, engine):

    df = pd.DataFrame({"a": range(10), "b": ["cat"] * 10})
    path = tmp_path / "test_select_filtered_column.parquet"
    df.to_parquet(path, index=False)

    with pytest.warns(UserWarning, match="Sorted columns detected"):
        ddf = dd.read_parquet(path, engine=engine, filters=[("b", "==", "cat")])
    assert_eq(df, ddf)
Member


The fact that you are deleting so much logic and are not even deleting a test but adding a new one is possibly the most compelling reason why this feature needs to go.

@fjetter fjetter merged commit 945435b into dask:main Dec 1, 2022
@rjzamora rjzamora deleted the remove-auto-index branch December 1, 2022 14:17