
Upgrade to object_store 0.9.0 and arrow 50.0.0 #8758

Merged · 14 commits · Jan 14, 2024

Conversation

@tustvold (Contributor) commented Jan 5, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

github-actions bot added the `core` (Core datafusion crate) label · Jan 5, 2024
@alamb (Contributor) left a comment:

Thank you for this

datafusion-cli/src/exec.rs (resolved thread)
@tustvold tustvold changed the title Prepare object_store 0.9.0 Prepare object_store 0.9.0 and arrow 50.0.0 Jan 8, 2024
@github-actions github-actions bot added sql physical-expr Physical Expressions labels Jan 8, 2024
@@ -40,12 +40,12 @@ async fn describe() -> Result<()> {
"+------------+-------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------+-------------------------+--------------------+-------------------+",
@tustvold (author) commented:

I believe these precision changes relate to apache/arrow-rs#5100

Contributor:

This seems like a better result to me (less numeric instability)


@@ -194,7 +194,7 @@ worth noting that using the settings in the `[profile.release]` section will sig

```toml
[dependencies]
datafusion = { version = "22.0" , features = ["simd"]}
datafusion = { version = "22.0" }

@@ -1746,6 +1746,7 @@ mod tests {
}

#[tokio::test]
#[ignore]
@tustvold (author) commented Jan 8, 2024:

This is failing with a memory-exhausted error. I don't believe it is an inherent issue with the arrow release; rather, this is a very sensitive test, so I don't think it should block the arrow release.

Contributor:

We should figure out how to update the test to avoid the error, though -- @kazuyukitanimura do you have any thoughts on how to do so?

Contributor:

We can update the max_memory of new_spill_ctx(2, 1500) in check_aggregates, as long as we understand why we need more memory.
It looks like this value was actually reduced from 2500 to 1500 in #7587.
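The fragility being discussed comes from giving the test context a fixed byte budget: any change in per-batch allocation size tips the same workload over the limit. A minimal, hypothetical sketch of that failure mode (this is not DataFusion's actual `MemoryPool` API, and the byte numbers are illustrative only):

```rust
// Hypothetical illustration of a fixed memory budget like the one
// new_spill_ctx configures; NOT DataFusion's real MemoryPool API.
struct FixedBudget {
    limit: usize,
    used: usize,
}

impl FixedBudget {
    fn new(limit: usize) -> Self {
        Self { limit, used: 0 }
    }

    // Attempt to reserve `bytes`; fail (forcing a spill or an
    // "memory exhausted" error) once the budget is exceeded.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.used + bytes > self.limit {
            Err(format!(
                "memory exhausted: {} + {} > {}",
                self.used, bytes, self.limit
            ))
        } else {
            self.used += bytes;
            Ok(())
        }
    }
}

fn main() {
    // A budget tuned to exactly fit two batches passes today...
    let mut pool = FixedBudget::new(1500);
    assert!(pool.try_grow(700).is_ok());
    assert!(pool.try_grow(700).is_ok());
    // ...but a small increase in per-batch overhead (extra struct
    // fields, alignment padding, etc.) makes the same test fail.
    assert!(pool.try_grow(200).is_err());
}
```

This is why a dependency upgrade that only changes internal buffer layout can flip such a test from passing to failing without any behavioral bug.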

Contributor:

I adjusted the memory sizes in 0c4a8a1 to get the tests to pass.

However, I don't have a good idea of what requires more memory now. Any thoughts on allocation changes, @tustvold? For example, did we change alignment in 50, or add some new fields to the array / buffer structures?

@tustvold (author) commented Jan 12, 2024:

We changed the way aggregates are computed, which might have impacted buffer sizing in some way; it is hard to know for sure without investing a lot of time. If it isn't a major change I wouldn't be overly concerned about it. These sorts of tests are always extremely fragile.

It could even be that the aggregates are now much faster and therefore we end up buffering more, I don't know 😅

Contributor:

I don't think it is a major change personally

@alamb (Contributor) left a comment:

I looked through this PR and I agree there is nothing that looks like it should block the arrow-rs / arrow 50 release apache/arrow-rs#5234

Thank you @tustvold

datafusion/core/Cargo.toml (resolved thread)

@@ -43,7 +43,7 @@ async fn csv_query_custom_udf_with_cast() -> Result<()> {
"+------------------------------------------+",
"| AVG(custom_sqrt(aggregate_test_100.c11)) |",
"+------------------------------------------+",
"| 0.6584408483418833 |",
"| 0.6584408483418835 |",
Contributor:

This differs only in the last decimal place, so I think it is related to floating point stability and this change is fine.
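Last-decimal-place drift like this is the usual consequence of floating point addition being non-associative: grouping the same operands differently (as a changed aggregation strategy may do) can change the final digits. A minimal, self-contained sketch:

```rust
fn main() {
    // f64 addition is not associative: the same operands grouped
    // differently round differently at each intermediate step.
    let left = (0.1_f64 + 0.2) + 0.3;  // 0.6000000000000001
    let right = 0.1_f64 + (0.2 + 0.3); // 0.6
    assert_ne!(left, right);
    // Both results are correct to within rounding, so tests that
    // compare exact printed values are sensitive to evaluation order.
    println!("{left:.17} vs {right:.17}");
}
```

This is why test expectations pinned to a full-precision printed value (such as the `AVG(custom_sqrt(...))` result above) have to be updated when the summation order changes, even though neither answer is wrong.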


@@ -451,10 +451,6 @@ Dml: op=[Insert Into] table=[test_decimal]
"INSERT INTO test_decimal (nonexistent, price) VALUES (1, 2), (4, 5)",
"Schema error: No field named nonexistent. Valid fields are id, price."
)]
#[case::type_mismatch(
"INSERT INTO test_decimal SELECT '2022-01-01', to_timestamp('2022-01-01T12:00:00')",
"Error during planning: Cannot automatically convert Timestamp(Nanosecond, None) to Decimal128(10, 2)"
Contributor:

👍

@alamb alamb changed the title Prepare object_store 0.9.0 and arrow 50.0.0 Upgrade to object_store 0.9.0 and arrow 50.0.0 Jan 12, 2024
github-actions bot added the `documentation` (Improvements or additions to documentation) label · Jan 12, 2024
@@ -29,7 +29,6 @@ rust-version = "1.70"
[features]
ci = []
default = ["mimalloc"]
simd = ["datafusion/simd"]
Contributor:

arrow 50 removed the manual SIMD implementation and now relies on auto vectorization - apache/arrow-rs#5184
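"Auto vectorization" here means that plain loops compile down to SIMD instructions when optimizations are enabled, without hand-written intrinsics. An illustrative sketch (this is not arrow-rs code; the function name and shapes are invented for the example) of the kind of kernel LLVM can vectorize on its own:

```rust
// Illustrative only: a simple element-wise kernel shaped so the
// optimizer can auto-vectorize it (no data dependencies across
// iterations, equal-length slices). arrow 50 relies on this instead
// of the removed manual SIMD paths.
fn add_arrays(a: &[f64], b: &[f64], out: &mut [f64]) {
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

fn main() {
    let a = vec![1.0_f64; 8];
    let b = vec![2.0_f64; 8];
    let mut out = vec![0.0_f64; 8];
    add_arrays(&a, &b, &mut out);
    assert!(out.iter().all(|&v| v == 3.0));
    println!("{out:?}");
}
```

The practical upshot for downstream users is visible in this diff: the `simd` cargo feature (and the nightly toolchain it required) simply disappears, with no API change.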

datafusion-cli/src/exec.rs (resolved thread)
@@ -340,13 +340,10 @@ mod tests {
let session_token = "fake_session_token";
let location = "s3://bucket/path/file.parquet";

// Missing region
// Missing region, use object_store defaults
Contributor:

object_store now defaults to us-east-1: apache/arrow-rs#5244

@@ -138,7 +138,7 @@ physical_plan
SortPreservingMergeExec: [column1@0 ASC NULLS LAST]
--CoalesceBatchesExec: target_batch_size=8192
----FilterExec: column1@0 != 42
------ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:0..197], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..201], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:201..403], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:197..394]]}, projection=[column1], output_ordering=[column1@0 ASC NULLS LAST], predicate=column1@0 != 42, pruning_predicate=column1_min@0 != 42 OR 42 != column1_max@1, required_guarantees=[column1 not in (42)]
------ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:0..202], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..207], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:207..414], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:202..405]]}, projection=[column1], output_ordering=[column1@0 ASC NULLS LAST], predicate=column1@0 != 42, pruning_predicate=column1_min@0 != 42 OR 42 != column1_max@1, required_guarantees=[column1 not in (42)]
Contributor:

The parquet file appears to be slightly larger, so the offsets are now slightly different. This can happen because, for example, the metadata written changed: instead of "arrow-rs 49.0.0" it may now say "arrow-rs 50.0.0".

@tustvold (author) replied:

I think it is also because we now write the column sort order information

Contributor:

Perhaps this PR apache/arrow-rs#5110

@alamb alamb marked this pull request as ready for review January 13, 2024 10:02
@tustvold tustvold merged commit acf0f78 into apache:main Jan 14, 2024
23 checks passed
Labels: core (Core datafusion crate), documentation (Improvements or additions to documentation), physical-expr (Physical Expressions), sql, sqllogictest
3 participants