Add validation to RecordBatch for non-nullable fields containing null values #1890

Merged
andygrove merged 8 commits into apache:master from check-nulls-in-non-nullable-column on Jun 18, 2022

Conversation

andygrove
Member

Which issue does this PR close?

Closes #1888

Rationale for this change

Allowing null values in non-nullable fields can lead to incorrect results and missed optimizations in downstream crates such as DataFusion.

What changes are included in this PR?

Add a validation check when creating a RecordBatch: construction now fails if a column contains null values but its field is declared non-nullable.
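
As a rough illustration of the check described here, the sketch below models it with stand-in types. `Field`, `Column`, `validate_nullability`, and the error text are hypothetical names invented for this example and are not the arrow crate's API; only the "field is not nullable and null count is greater than zero" condition is taken from the diff under review.

```rust
/// Hypothetical stand-in for an Arrow field: a name plus a nullability flag.
pub struct Field {
    pub name: String,
    pub nullable: bool,
}

/// Hypothetical stand-in for an Arrow column; `None` models a null slot.
pub struct Column {
    pub values: Vec<Option<i64>>,
}

impl Column {
    /// Number of null slots in the column.
    pub fn null_count(&self) -> usize {
        self.values.iter().filter(|v| v.is_none()).count()
    }
}

/// Returns an error naming the first non-nullable field whose column
/// contains at least one null; otherwise returns `Ok(())`.
/// The error message is illustrative, not the message arrow emits.
pub fn validate_nullability(fields: &[Field], columns: &[Column]) -> Result<(), String> {
    for (c, f) in columns.iter().zip(fields) {
        if !f.nullable && c.null_count() > 0 {
            return Err(format!(
                "Column '{}' is declared as non-nullable but contains null values",
                f.name
            ));
        }
    }
    Ok(())
}
```

Under these assumptions, a column holding `[Some(1), None]` is rejected when its field has `nullable: false` and accepted when the field is declared with `nullable: true`, which is the kind of schema correction affected users would need to make.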

Are there any user-facing changes?

Yes. Users may start getting errors if their schemas declare fields as non-nullable even though the data contains null values.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 16, 2022
@andygrove andygrove requested a review from alamb June 16, 2022 18:26
@andygrove
Member Author

@alamb I think this PR has exposed a bug in dictionary IPC encoding. The checked-in dictionary_generated files seem to have an incorrect schema (declaring a column as non-nullable when it contains null values). The roundtrip tests fail for the same reason.

@andygrove andygrove changed the title from "WIP: Add validation to RecordBatch for non-nullable fields containing null values" to "Add validation to RecordBatch for non-nullable fields containing null values" Jun 16, 2022
@andygrove andygrove marked this pull request as ready for review June 16, 2022 18:36
Comment on lines 144 to 145
// hacky workaround for known issue with dictionary IPC encoding
// https://github.com/apache/arrow-rs/issues/1892
Member Author

See #1892

Contributor

Personally, I'd prefer an if !condition over an if/else with an empty if body.

@codecov-commenter

codecov-commenter commented Jun 16, 2022

Codecov Report

Merging #1890 (c91f355) into master (f0bf7f9) will increase coverage by 0.00%.
The diff coverage is 97.29%.

@@           Coverage Diff           @@
##           master    #1890   +/-   ##
=======================================
  Coverage   83.44%   83.44%           
=======================================
  Files         202      202           
  Lines       57135    57146   +11     
=======================================
+ Hits        47678    47688   +10     
- Misses       9457     9458    +1     
Impacted Files                      Coverage Δ
arrow/src/array/array_binary.rs     94.18% <ø> (ø)
arrow/src/record_batch.rs           93.90% <90.90%> (-0.09%) ⬇️
arrow/src/csv/reader.rs             89.89% <100.00%> (ø)
arrow/src/ipc/writer.rs             81.78% <100.00%> (ø)
arrow/src/json/reader.rs            84.58% <100.00%> (ø)
arrow/src/json/writer.rs            92.11% <100.00%> (ø)
arrow/src/datatypes/datatype.rs     65.42% <0.00%> (-0.38%) ⬇️
arrow/src/ffi_stream.rs             79.89% <0.00%> (ø)
parquet/src/encodings/encoding.rs   93.65% <0.00%> (+0.19%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

@tustvold tustvold left a comment

Nice, minor nits, big fan of this

@@ -138,6 +138,20 @@ impl RecordBatch {
                )
            })?;

        for (c, f) in columns.iter().zip(&schema.fields) {
            if !f.is_nullable() && c.null_count() > 0 {
                if f.name().len() == 0 {
Contributor

is_empty() is normally preferred
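
The two style nits raised in review (`is_empty()` instead of `len() == 0`, and `if !condition` instead of an `if`/`else` with an empty `if` body) could be applied roughly as follows. This is a sketch only: `check_field` and its error text are hypothetical, and the assumption that the empty `if` branch existed to skip unnamed fields is inferred from the workaround comment, not confirmed by the PR.

```rust
/// Hypothetical helper showing the reviewed check with both nits applied.
/// `name`, `nullable`, and `null_count` stand in for the field and column
/// accessors used in the diff.
pub fn check_field(name: &str, nullable: bool, null_count: usize) -> Result<(), String> {
    if !nullable && null_count > 0 {
        // Shape as reviewed (assumed):
        //     if name.len() == 0 { /* workaround: do nothing */ } else { return Err(..) }
        // Shape after the nits: `is_empty()` replaces `len() == 0`, and the
        // condition is negated so the empty branch disappears.
        if !name.is_empty() {
            return Err(format!(
                "Column '{}' is declared as non-nullable but contains null values",
                name
            ));
        }
    }
    Ok(())
}
```

Both shapes behave the same; the negated form just makes the error path the only branch a reader has to follow.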

Contributor

@alamb alamb left a comment

Thanks @andygrove

Member

@viirya viirya left a comment

lgtm

@andygrove
Member Author

Integration tests failed with:

################# FAILURES #################
FAILED TEST: datetime Java producing,  C# consuming

@andygrove
Member Author

Failed with a Rust / C# test this time

==========================================================
Testing file duration
==========================================================
Traceback (most recent call last):
################# FAILURES #################
FAILED TEST: datetime Rust producing,  C# consuming
1 failures
  File "/arrow/dev/archery/archery/integration/runner.py", line 246, in _run_ipc_test_case
    run_binaries(producer, consumer, test_case)
  File "/arrow/dev/archery/archery/integration/runner.py", line 286, in _produce_consume
    consumer.stream_to_file(producer_stream_path, consumer_file_path)
  File "/arrow/dev/archery/archery/integration/tester_csharp.py", line 63, in stream_to_file
    self.run_shell_command(cmd)
  File "/arrow/dev/archery/archery/integration/tester.py", line 49, in run_shell_command
    subprocess.check_call(cmd, shell=True)
  File "/opt/conda/envs/arrow/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/arrow/csharp/artifacts/Apache.Arrow.IntegrationTest/Debug/net6.0/Apache.Arrow.IntegrationTest --mode stream-to-file -a /tmp/tmpyt_o8mf0/b64aa0aa_generated_datetime.consumer_stream_as_file < /tmp/tmpyt_o8mf0/b64aa0aa_generated_datetime.producer_file_as_stream' returned non-zero exit status 1.

@andygrove
Member Author

Other builds are failing with the same issue. I wonder if it is related to apache/arrow#13279?

Contributor

@liukun4515 liukun4515 left a comment

LGTM

Comment on lines 143 to 144
// hacky workaround for known issue with dictionary IPC encoding
// https://github.com/apache/arrow-rs/issues/1892
Member

#1893 should fix this issue.

Member Author

Thanks @viirya, I have upmerged and removed this workaround.

@tustvold
Contributor

I think a rebase should clear the test failures related to IPC dictionaries

@github-actions github-actions bot added the arrow-flight Changes to the arrow-flight crate label Jun 18, 2022
@andygrove andygrove removed the arrow-flight Changes to the arrow-flight crate label Jun 18, 2022
@andygrove andygrove merged commit 535cd20 into apache:master Jun 18, 2022
@andygrove andygrove deleted the check-nulls-in-non-nullable-column branch June 18, 2022 14:43
@alamb
Contributor

alamb commented Jun 19, 2022

🎉

@alamb alamb added the api-change Changes to the arrow API label Jun 23, 2022
@alamb
Contributor

alamb commented Jun 23, 2022

Marking this as an API change because it caused many errors in the DataFusion tests (see apache/datafusion#2778), and I think it should be highlighted prominently in the release notes.

Labels: api-change (Changes to the arrow API), arrow (Changes to the arrow crate)
Development

Successfully merging this pull request may close these issues.

Creating a RecordBatch with null values in non-nullable fields does not cause an error
6 participants