Creating checkpoints for tables with missing column stats results in Err #2493

Open
shanisolomon opened this issue May 8, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@shanisolomon

Delta-rs version:
0.16.5


Bug

In a Delta table with more than 32 columns, creating a checkpoint with the checkpoints::create_checkpoint API fails when the transaction log was written by Spark. By default, Spark includes stats for only 32 columns rather than for all columns, and we get the following Err:
Failed to convert into Arrow schema: Json error: whilst decoding field 'add': whilst decoding field 'stats_parsed': whilst decoding field 'minValues': Encountered unmasked nulls in non-nullable StructArray child: < child>.

I suspect this is either a bug in the arrow-json package, which for some reason receives null positions for the overflowing columns when decoding the transaction log statistics, or a bug in the 'add' action JSON that delta-rs builds during checkpointing: the schema contains > 32 columns, but the 'stats_parsed' JSON does not have a corresponding value for every column.
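To confirm which columns are affected, the commit JSON can be inspected directly. The following is a minimal sketch, not part of delta-rs: `columns_missing_stats` is a hypothetical helper, and the demo log is synthetic. It relies only on the Delta log layout, where `metaData.schemaString` and `add.stats` are JSON-encoded strings, and it checks top-level columns only.

```python
import json


def columns_missing_stats(log_lines):
    """Map each added file path to the top-level schema columns
    that have no entry in its stats.minValues."""
    actions = [json.loads(line) for line in log_lines if line.strip()]

    # Use the schema from the last metaData action found in the input.
    schema_columns = []
    for action in actions:
        if "metaData" in action:
            schema = json.loads(action["metaData"]["schemaString"])
            schema_columns = [field["name"] for field in schema["fields"]]

    missing = {}
    for action in actions:
        add = action.get("add")
        if add is None:
            continue
        raw_stats = add.get("stats")  # JSON string, may be absent
        stats = json.loads(raw_stats) if raw_stats else {}
        covered = set(stats.get("minValues") or {})
        absent = [name for name in schema_columns if name not in covered]
        if absent:
            missing[add["path"]] = absent
    return missing


# Synthetic example: 34 columns in the schema, stats for only the first 32.
fields = [
    {"name": f"c{i}", "type": "long", "nullable": True, "metadata": {}}
    for i in range(34)
]
meta = {"metaData": {"id": "demo", "schemaString": json.dumps(
    {"type": "struct", "fields": fields})}}
stats = {
    "numRecords": 1,
    "minValues": {f"c{i}": 0 for i in range(32)},
    "maxValues": {f"c{i}": 0 for i in range(32)},
    "nullCount": {f"c{i}": 0 for i in range(32)},
}
add = {"add": {"path": "part-00000.parquet", "stats": json.dumps(stats)}}

result = columns_missing_stats([json.dumps(meta), json.dumps(add)])
print(result)
```

Note that a column can also be absent from `minValues` for other reasons (for example, when all of its values are null), so a non-empty result is a hint rather than proof of the 32-column cutoff.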

What you expected to happen:
I expect to be able to construct the Arrow JSON schema when stats are not present for all columns and, more broadly, to be able to create a checkpoint file with the delta-rs library after Spark has optimized the table.

How to reproduce it:
A table with > 32 columns on which the Spark engine ran an OPTIMIZE transaction that doesn't include stats for all fields.
The delta log itself is enough to reproduce this issue. I can provide the necessary example files if needed.
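Since the log alone is said to be enough, a synthetic `_delta_log` with the same shape can be generated for experimentation. This is a hedged sketch: `write_synthetic_log`, the column names, and the file name `part-00000.parquet` are all made up for illustration, the referenced data file is not created, and it is an untested assumption that this log triggers the same Err as a real Spark-written one.

```python
import json
import tempfile
from pathlib import Path


def write_synthetic_log(table_dir, num_columns=33, num_stats_columns=32):
    """Write a single-commit _delta_log whose add action carries stats
    for only the first `num_stats_columns` columns of the schema."""
    names = [f"col_{i}" for i in range(num_columns)]
    schema = {
        "type": "struct",
        "fields": [
            {"name": n, "type": "long", "nullable": True, "metadata": {}}
            for n in names
        ],
    }
    with_stats = names[:num_stats_columns]
    stats = {
        "numRecords": 1,
        "minValues": {n: 0 for n in with_stats},
        "maxValues": {n: 0 for n in with_stats},
        "nullCount": {n: 0 for n in with_stats},
    }
    actions = [
        {"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}},
        {"metaData": {
            "id": "00000000-0000-0000-0000-000000000000",  # placeholder id
            "format": {"provider": "parquet", "options": {}},
            "schemaString": json.dumps(schema),
            "partitionColumns": [],
            "configuration": {},
        }},
        {"add": {
            "path": "part-00000.parquet",  # hypothetical; file is not written
            "partitionValues": {},
            "size": 1,
            "modificationTime": 0,
            "dataChange": False,
            "stats": json.dumps(stats),
        }},
    ]
    log_dir = Path(table_dir) / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    # Commit files are named with a zero-padded 20-digit version number.
    commit = log_dir / f"{0:020d}.json"
    commit.write_text("\n".join(json.dumps(a) for a in actions) + "\n")
    return commit


table_dir = tempfile.mkdtemp()
commit_path = write_synthetic_log(table_dir)
print(commit_path)
```

The resulting directory could then be passed to the checkpoint API to see whether the error appears; a log taken from the real table remains the more reliable reproduction.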

More details:
If this is intended behavior and not a bug, please let me know. Thanks in advance!

@shanisolomon added the bug label May 8, 2024
@shanisolomon changed the title from "Creating checkpoints for tables with missing column stats panics" to "Creating checkpoints for tables with missing column stats results in Err" May 8, 2024