Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to arrow 23.0.0 #3483

Merged
merged 8 commits into from Sep 20, 2022
Merged

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Sep 14, 2022

Which issue does this PR close?

re apache/arrow-rs#2665

Rationale for this change

I am putting up this PR so I can incrementally test the effect on DataFusion of splitting up the arrow crate in PRs such as apache/arrow-rs#2693

Hopefully we can use it to actually upgrade to arrow 23.0.0 when it is released as well

Arrow 23.0.0 was released -- need to keep up with dependencies

What changes are included in this PR?

Update to arrow 23.0.0

Are there any user-facing changes?

No

@alamb
Copy link
Contributor Author

alamb commented Sep 15, 2022

The failure in the test job https://github.com/apache/arrow-datafusion/actions/runs/3055080369/jobs/4927734152 appears to be real:

failures:

---- sql::expr::csv_query_nullif_divide_by_0 stdout ----
thread 'sql::expr::csv_query_nullif_divide_by_0' panicked at 'attempt to subtract with overflow', C:\Users\runneradmin\.cargo\git\checkouts\arrow-rs-3b86e19e889d5acc\5146663\arrow\src\compute\kernels\arity.rs:198:18

Looks related to some changes to the arity kernels -- specifically the code that is panic'ing seems to have been changed in apache/arrow-rs#2666 by @tustvold . I am not sure if DataFusion is passing bad input or if there is some corner case in arrow that is hit.

I plan to look into it later today if no one else beat me to it.

@tustvold
Copy link
Contributor

It will actually be apache/arrow-rs#2643 which changed the unchecked divide kernels to panic on divide by zero - apache/arrow-rs#2647

@github-actions github-actions bot added the core Core datafusion crate label Sep 15, 2022
@alamb
Copy link
Contributor Author

alamb commented Sep 15, 2022

It will actually be apache/arrow-rs#2643 which changed the unchecked divide kernels to panic on divide by zero - apache/arrow-rs#2647

Thanks @tustvold -- great hint -- fixed in 0493aa1

@@ -24,7 +24,7 @@ use std::{any::Any, sync::Arc};

use arrow::array::*;
use arrow::compute::kernels::arithmetic::{
add, add_scalar, divide, divide_scalar, modulus, modulus_scalar, multiply,
add, add_scalar, divide_opt, divide_scalar, modulus, modulus_scalar, multiply,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default divide kernel changed behavior -- divide_opt now has the "NULL on divide by zero" behavior datafusion expects

@@ -372,14 +372,18 @@ pub(crate) fn multiply_decimal_scalar(
Ok(array)
}

pub(crate) fn divide_decimal(
pub(crate) fn divide_opt_decimal(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to be consistent with the arrow rename

@codecov-commenter
Copy link

codecov-commenter commented Sep 16, 2022

Codecov Report

Merging #3483 (e6c9ced) into master (f30fc4e) will increase coverage by 0.00%.
The diff coverage is 87.50%.

@@           Coverage Diff           @@
##           master    #3483   +/-   ##
=======================================
  Coverage   85.79%   85.79%           
=======================================
  Files         300      300           
  Lines       55403    55397    -6     
=======================================
- Hits        47533    47530    -3     
+ Misses       7870     7867    -3     
Impacted Files Coverage Δ
...usion/core/src/avro_to_arrow/arrow_array_reader.rs 0.00% <0.00%> (ø)
...afusion/core/src/physical_plan/coalesce_batches.rs 93.33% <100.00%> (-0.05%) ⬇️
...tafusion/core/src/physical_plan/file_format/mod.rs 96.94% <100.00%> (-0.02%) ⬇️
datafusion/physical-expr/src/expressions/binary.rs 97.63% <100.00%> (-0.01%) ⬇️
...sical-expr/src/expressions/binary/kernels_arrow.rs 95.03% <100.00%> (ø)
datafusion/expr/src/logical_plan/plan.rs 77.68% <0.00%> (-0.17%) ⬇️
datafusion/common/src/scalar.rs 85.18% <0.00%> (+0.06%) ⬆️
datafusion/core/src/physical_plan/metrics/value.rs 87.43% <0.00%> (+0.50%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@@ -286,8 +286,7 @@ impl SchemaAdapter {
let projected_schema = Arc::new(self.table_schema.clone().project(projections)?);

// Necessary to handle empty batches
let mut options = RecordBatchOptions::default();
options.row_count = Some(batch.num_rows());
let options = RecordBatchOptions::new().with_row_count(Some(batch.num_rows()));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update to use the nice API @askoa added in apache/arrow-rs#2729

if dep_name in ("arrow", "parquet", "arrow-flight") and constraint.get("version") is not None:
doc[section][dep_name]["version"] = new_version
if dep_name in ("arrow", "parquet", "arrow-flight"):
if type(constraint) == tomlkit.items.String:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is code I wrote in IOx in https://github.com/influxdata/influxdb_iox/blob/main/scripts/update_arrow_deps.py

Basically the previous version didn't handle arrow = "22.0.0" type toml

@github-actions github-actions bot added development-process Related to development process of arrow-datafusion logical-expr Logical plan and expressions optimizer Optimizer rules sql labels Sep 16, 2022
@alamb alamb changed the title Upgrade to arrow 23.0.0 (WIP) Upgrade to arrow 23.0.0 Sep 19, 2022
@alamb alamb marked this pull request as ready for review September 19, 2022 21:35
@alamb alamb merged commit 67002a0 into apache:master Sep 20, 2022
@alamb alamb deleted the alamb/update_arrow_23.0.0 branch September 20, 2022 09:55
@ursabot
Copy link

ursabot commented Sep 20, 2022

Benchmark runs are scheduled for baseline = e873423 and contender = 67002a0. 67002a0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core datafusion crate development-process Related to development process of arrow-datafusion logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants