Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure the row count is preserved when coalescing over empty records #3439

Merged
merged 2 commits into from Sep 12, 2022

Conversation

isidentical
Copy link
Contributor

Which issue does this PR close?

Closes #3283.

Rationale for this change

The row count information is required for the validation happening during the RecordBatch construction. Luckily we already calculate it and propagate it in the stream handler code, but just don't preserve it when creating the final record.

This PR also changes the SQL test code to never double-optimize logical plans (as it used to) since that would hide this problem (and maybe different ones in the future).

What changes are included in this PR?

This PR now creates the final record with the propagated row_count for the cases like #3283 where the schema might not contain any fields at all and the validation logic in arrow-rs would fail.

Are there any user-facing changes?

This is a bug fix in general. It also includes a change in our internal tooling to never double-optimize logical plans, which the reasoning is provided inn #3283.

@github-actions github-actions bot added the core Core datafusion crate label Sep 10, 2022
@codecov-commenter
Copy link

codecov-commenter commented Sep 10, 2022

Codecov Report

Merging #3439 (928eb9a) into master (4d22076) will increase coverage by 0.06%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3439      +/-   ##
==========================================
+ Coverage   85.61%   85.67%   +0.06%     
==========================================
  Files         297      298       +1     
  Lines       54490    54647     +157     
==========================================
+ Hits        46650    46821     +171     
+ Misses       7840     7826      -14     
Impacted Files Coverage Δ
...afusion/core/src/physical_plan/coalesce_batches.rs 93.38% <100.00%> (+0.09%) ⬆️
datafusion/core/tests/sql/mod.rs 97.78% <100.00%> (+0.01%) ⬆️
datafusion/physical-expr/src/aggregate/mod.rs 12.50% <0.00%> (-12.50%) ⬇️
datafusion/common/src/error.rs 78.65% <0.00%> (-1.81%) ⬇️
datafusion/expr/src/window_frame.rs 92.43% <0.00%> (-0.85%) ⬇️
...fusion/optimizer/src/pre_cast_lit_in_comparison.rs 94.00% <0.00%> (-0.38%) ⬇️
datafusion/expr/src/logical_plan/plan.rs 77.19% <0.00%> (-0.17%) ⬇️
datafusion/physical-expr/src/expressions/binary.rs 97.63% <0.00%> (-0.08%) ⬇️
datafusion/sql/src/planner.rs 80.88% <0.00%> (-0.04%) ⬇️
... and 15 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@isidentical isidentical marked this pull request as ready for review September 11, 2022 10:09
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @isidentical -- this looks great. I left some small suggestions but they are not required -- let me know if you want to address them in this PR or a follow on

Thanks again 🏅

datafusion/core/tests/sql/mod.rs Show resolved Hide resolved
Comment on lines +295 to +296
let mut options = RecordBatchOptions::default();
options.row_count = Some(row_count);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an FYI (not important) you can write this same pattern in what is likely more idiomatic rust (avoid mut) like:

Suggested change
let mut options = RecordBatchOptions::default();
options.row_count = Some(row_count);
let options = RecordBatchOptions{
row_count:Some(row_count),
..Default::default()
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to fail since RecordBatchOptions is non-exhaustive and comes directly from arrow-rs. I am not really familiar with this part of Rust semantics, but there seems to be an existing issue about it.

error[E0639]: cannot create non-exhaustive struct using struct expression
   --> /home/isidentical/projects/arrow-datafusion/datafusion/core/src/physical_plan/coalesce_batches.rs:295:19
    |
295 |       let options = RecordBatchOptions {
    |  ___________________^
296 | |         row_count: Some(row_count),
297 | |         ..Default::default()
298 | |     };
    | |_____^

For more information about this error, try `rustc --explain E0639`.

Is there any way to achieve it? The only other example of RecordBatchOptions construction present in DF is the following which uses the same pattern as the code present in this PR but I'd be happy to refactor both if there is a more idiomatic way.
https://github.com/apache/arrow-datafusion/blob/9956f80f197550051db7debae15d5c706afc22a3/datafusion/core/src/physical_plan/file_format/mod.rs#L286-L288

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for trying @isidentical -- I will file a follow on in arrow-rs to make this API more ergonomic

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed apache/arrow-rs#2728 as a follow on (I also hit the same thing in #3454 FWIW)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI apache/arrow-rs#2729 fixed this issue and #3483 includes using the new API.

@Dandandan Dandandan merged commit 6de0796 into apache:master Sep 12, 2022
@ursabot
Copy link

ursabot commented Sep 12, 2022

Benchmark runs are scheduled for baseline = 69d05aa and contender = 6de0796. 6de0796 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

project_column_with_filters_that_cant_pushed_down_always_true only passes if plan is optimized twice
6 participants