
Utilise struct stats when available #656

Merged
merged 42 commits into from Jul 6, 2022

Conversation

Contributor

@Tom-Newton Tom-Newton commented Jun 26, 2022

Description

Updates Add.get_stats() to use parquet struct stats when available, parsing them into the same format that json stats are parsed to. This means delta-rs can still utilise stats on tables where checkpoint files are configured to only store struct stats.
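The idea of parsing struct stats into the same shape as json stats can be sketched with a minimal std-only example. The types and function here are hypothetical simplifications; the actual code in this PR walks parquet `Field` values and builds `serde_json::Value`s instead of strings.

```rust
// Hypothetical simplified stat value, standing in for parquet's `Field`.
enum StructStat {
    Int(i64),
    Text(String),
    Group(Vec<(String, StructStat)>),
}

// Render a struct stat in the JSON shape that json stats parse to,
// recursing into nested groups (e.g. minValues of a nested struct).
fn to_json_stats(stat: &StructStat) -> String {
    match stat {
        StructStat::Int(i) => i.to_string(),
        StructStat::Text(s) => format!("\"{}\"", s),
        StructStat::Group(fields) => {
            let inner: Vec<String> = fields
                .iter()
                .map(|(name, v)| format!("\"{}\":{}", name, to_json_stats(v)))
                .collect();
            format!("{{{}}}", inner.join(","))
        }
    }
}
```

Converting both representations to one common form lets the rest of delta-rs consume stats without caring which kind the checkpoint stored.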

Related Issue(s)

Documentation

  • Added functions for converting struct stats to the same form as json stats.
  • Added a test asserting that the result of parsing the parquet struct stats matches the result we get from parsing the json stats.
    • The test data for this includes every type supported by Spark, including nested structs and a struct, array, map combination. Spark does not support unsigned ints, so these are omitted.
    • It seems that Delta currently does not write stats for binary, map or array types, so I have not added conversions for these.

I did my best but I've never written Rust code before...

@Tom-Newton Tom-Newton marked this pull request as ready for review June 26, 2022 14:50
@@ -193,16 +193,6 @@ def test_read_table_with_stats():
# assert data.num_rows == 0


def test_read_table_with_only_struct_stats():
Contributor Author
I'm removing this because everything should now be covered by the lower level rust test I added. This did surface something interesting though. It seems like the DeltaTable.to_pyarrow_schema() function cannot handle the struct of array of map type I put in my test data. I think this is probably an issue for another time though.

Collaborator

@wjones127 wjones127 left a comment


Thanks for making this. I especially appreciate the example data with a variety of data types.

I left a few code suggestions.

I think we'll need to clean up decimal handling later. The behavior of Spark didn't match my existing expectations.

Comment on lines +410 to +413
```rust
Field::Decimal(decimal) => match BigInt::from_signed_bytes_be(decimal.data()).to_f64() {
    Some(int) => json!(int / (10_i64.pow((decimal.scale()).try_into().unwrap()) as f64)),
    _ => serde_json::Value::Null,
},
```
Collaborator
So on the decimal representation, it does seem to be the case that the Spark implementation writes them out as numbers in the JSON file. I can't find the implementation code, but it does seem like they have a special parser that handles decimals. We don't, so I'm worried about having a lossy conversion of Decimal into float.

It's a complicated subject and sort of an edge case, so I'm fine with this for now, but we should do a follow-up to make sure we are using decimal statistics appropriately.
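The lossiness being discussed can be demonstrated with a small std-only sketch. `decimal_to_f64` is a hypothetical helper, not the PR's code, but it mirrors the same divide-by-power-of-ten approach: once the unscaled value exceeds f64's 53-bit significand (roughly 15-16 decimal digits), the conversion silently drops precision.

```rust
// Hypothetical helper mirroring the unscaled-value / 10^scale conversion.
// Any decimal with more significant digits than f64 can hold is rounded.
fn decimal_to_f64(unscaled: i128, scale: u32) -> f64 {
    (unscaled as f64) / 10_f64.powi(scale as i32)
}

// decimal_to_f64(12345, 2) recovers 123.45 exactly, but a 20-digit
// unscaled value such as 12_345_678_901_234_567_891 cannot round-trip:
// casting the f64 back to i128 yields a different integer.
```

This is why using such stats for query planning needs care: a min/max that was rounded toward the data range is still a valid (if loose) bound, but a value rounded away from it is not.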

Contributor Author
Yeah I was a bit concerned about this too. I'm tempted to just remove it since that is probably a safer option.

Tom-Newton and others added 7 commits June 27, 2022 08:43
Co-authored-by: Will Jones <willjones127@gmail.com>
@wjones127 wjones127 requested review from houqp and mosyp June 30, 2022 17:24
@wjones127 wjones127 merged commit 8d7f3b6 into delta-io:main Jul 6, 2022
@wjones127
Collaborator

Thanks @Tom-Newton !

@Tom-Newton
Contributor Author

I'm afraid I made a mistake in this PR. I just realised while testing #712. I've made a follow-up PR to fix this.

wjones127 added a commit that referenced this pull request Nov 3, 2022
# Description
When a delta table's schema is evolved, the struct stat schemas in
checkpoints are also evolved. Since the struct stats are stored in a
columnar way, adding a single file with the new columns will cause nulls
to appear in the struct stats for all other files. This is a significant
difference compared to the json stats.

Unfortunately I overlooked this in
#656 for both nullCounts and
min/max values. This caused parsed struct stats to have extra columns
full of nulls. I don't know if this was actually an issue at all, but it
should be fixed even if just for the sake of the warning spam.
```
[2022-10-24T22:13:22Z WARN  deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN  deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN  deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN  deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN  deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
``` 

# Related Issue(s)
- Relates to #653 but for the
most part it's an already solved issue.

# Changes:
- Replace the test data with similar test data that includes a schema
evolution.
- Add error handling for min/max values to ensure warnings will be
logged for other unexpected types (there probably shouldn't be any). As
a total Rust noob I originally filled with nulls, but I think that was a
mistake.
- Ignore nulls for min/max stats and null count stats since these are
expected after schema evolution and should be ignored without logging a
warning.
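The null-skipping behaviour in the changes above can be sketched with a std-only example (hypothetical simplified types; the real code operates on parquet `Field` values inside `parquet_read`). The point is that nulls are filtered out silently rather than emitted or warned about, since they are expected for files written before a schema evolution.

```rust
use std::collections::HashMap;

// Hypothetical simplified stat value for one column of the struct stats.
enum StatValue {
    Null,
    Int(i64),
}

// Collect per-column null counts, silently skipping nulls: a null here
// just means the file predates the column, not that the stats are bad.
fn collect_null_counts(columns: &[(&str, StatValue)]) -> HashMap<String, i64> {
    columns
        .iter()
        .filter_map(|(name, v)| match v {
            StatValue::Null => None, // expected after schema evolution; no warning
            StatValue::Int(n) => Some((name.to_string(), *n)),
        })
        .collect()
}
```

Genuinely unexpected types (anything that is neither null nor the expected stat type) would still be logged, which is what the error-handling change above preserves.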

Usual disclaimer on a PR from me: I don't know what I'm doing writing
Rust code. (Thanks to wjones for tidying up my dodgy Rust code 🙂)

Co-authored-by: Will Jones <willjones127@gmail.com>

Successfully merging this pull request may close these issues.

Can't utilise struct delta stats with delta_table.to_pyarrow_dataset()
2 participants