Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parsing struct stats after schema evolution #901

Merged
merged 9 commits into from
Nov 3, 2022
Merged

Fix parsing struct stats after schema evolution #901

merged 9 commits into from
Nov 3, 2022

Conversation

Tom-Newton
Copy link
Contributor

@Tom-Newton Tom-Newton commented Oct 24, 2022

Description

When a delta table's schema is evolved the struct stat schemas in checkpoints are also evolved. Since the struct stats are stored in a columnar way adding a single file with the new columns will cause nulls to appear in the struct stats for all other files. This is a significant difference compared to the json stats.

Unfortunately I overlooked this in #656 for both nullCounts and min/max values. This caused parsed struct stats to have extra columns full of nulls. I don't know if this was actually an issue at all but it should be fixed even if just for the sake of the warnings spam.

[2022-10-24T22:13:22Z WARN  deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN  deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN  deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN  deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN  deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null

Related Issue(s)

Changes:

  • Replace the test data with similar test data that includes a schema evolution.
  • Add error handling for min/max values to ensure warnings will be logged for other unexpected types (there probably shouldn't be any). As a total rust noob I originally filled with nulls but I think that was a mistake.
  • Ignore nulls for min/max stats and null count stats since these are expected after schema evolution and should be ignored without logging a warning.

Usual disclaimer on a PR from me: I don't know what I'm doing writing rust code. (thanks to wjones for tidying up my dodgy rust code 馃檪)

@Tom-Newton Tom-Newton changed the title Fix warning spam when parsing struct stats after schema evolution Fix parsing struct stats after schema evolution Oct 24, 2022
Copy link
Collaborator

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a draft, but thought I might leave some early feedback.

rust/src/action/parquet_read/mod.rs Outdated Show resolved Hide resolved
rust/src/action/parquet_read/mod.rs Outdated Show resolved Hide resolved
rust/src/action/parquet_read/mod.rs Outdated Show resolved Hide resolved
rust/src/action/parquet_read/mod.rs Outdated Show resolved Hide resolved
rust/src/action/parquet_read/mod.rs Outdated Show resolved Hide resolved
Tom-Newton and others added 5 commits October 25, 2022 08:19
Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
refactor: simplify logic for parsing struct stats
@Tom-Newton Tom-Newton marked this pull request as ready for review November 3, 2022 08:52
@wjones127
Copy link
Collaborator

Thanks @Tom-Newton

@wjones127 wjones127 merged commit fae50cc into delta-io:main Nov 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants