
Upgrade to object_store 0.9.0 and arrow 50.0.0 #8758

Merged · 14 commits · Jan 14, 2024

Conversation

@tustvold (Contributor) commented Jan 5, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

github-actions bot added the `core` (Core datafusion crate) label · Jan 5, 2024
@alamb (Contributor) left a comment:

Thank you for this

datafusion-cli/src/exec.rs (resolved thread)
@tustvold tustvold changed the title Prepare object_store 0.9.0 Prepare object_store 0.9.0 and arrow 50.0.0 Jan 8, 2024
@github-actions github-actions bot added sql physical-expr Physical Expressions labels Jan 8, 2024
@@ -40,12 +40,12 @@ async fn describe() -> Result<()> {
"+------------+-------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------+-------------------------+--------------------+-------------------+",
@tustvold (author) commented:

I believe these precision changes relate to apache/arrow-rs#5100

Contributor:

This seems like a better result to me (less numeric instability)


@@ -194,7 +194,7 @@ worth noting that using the settings in the `[profile.release]` section will sig

```toml
[dependencies]
datafusion = { version = "22.0" , features = ["simd"]}
datafusion = { version = "22.0" }

@@ -1746,6 +1746,7 @@ mod tests {
}

#[tokio::test]
#[ignore]
@tustvold (author) commented Jan 8, 2024:

This is failing with a memory-exhausted error. I don't believe it is an inherent issue with the arrow release; rather, this is a very sensitive test, so I don't think it should block the arrow release.

Contributor:

We should figure out how to update the test to avoid the error, though -- @kazuyukitanimura do you have any thoughts on how to do so?

Contributor:

We can update the max_memory of new_spill_ctx(2, 1500) in check_aggregates, as long as we understand why we need more memory.
It looks like this value was actually reduced from 2500 to 1500 in #7587.
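The fragility being discussed comes from giving the test context a fixed byte budget: any change in per-batch allocation size tips the same workload over the limit. A minimal, hypothetical sketch of that failure mode (this is not DataFusion's actual `MemoryPool` API, and the byte numbers are illustrative only):

```rust
// Hypothetical illustration of a fixed memory budget like the one
// new_spill_ctx configures; NOT DataFusion's real MemoryPool API.
struct FixedBudget {
    limit: usize,
    used: usize,
}

impl FixedBudget {
    fn new(limit: usize) -> Self {
        Self { limit, used: 0 }
    }

    // Attempt to reserve `bytes`; fail (forcing a spill or an
    // "memory exhausted" error) once the budget is exceeded.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.used + bytes > self.limit {
            Err(format!(
                "memory exhausted: {} + {} > {}",
                self.used, bytes, self.limit
            ))
        } else {
            self.used += bytes;
            Ok(())
        }
    }
}

fn main() {
    // A budget tuned to exactly fit two batches passes today...
    let mut pool = FixedBudget::new(1500);
    assert!(pool.try_grow(700).is_ok());
    assert!(pool.try_grow(700).is_ok());
    // ...but a small increase in per-batch overhead (extra struct
    // fields, alignment padding, etc.) makes the same test fail.
    assert!(pool.try_grow(200).is_err());
}
```

This is why a dependency upgrade that only changes internal buffer layout can flip such a test from passing to failing without any behavioral bug.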

Contributor:

I adjusted the memory sizes in 0c4a8a1 to get the tests to pass.

However, I don't have a good idea of what requires more memory now. Any thoughts on allocation changes, @tustvold? For example, did we change alignment in 50, or add some new fields to the array / buffer structures?

@tustvold (author) commented Jan 12, 2024:

We changed the way aggregates are computed, which might have impacted buffer sizing in some way; it is hard to know for sure without investing a lot of time. If it isn't a major change I wouldn't be overly concerned about it. These sorts of tests are always extremely fragile.

It could even be that the aggregates are now much faster and therefore we end up buffering more, I don't know 😅

Contributor:

I don't think it is a major change personally

@alamb (Contributor) left a comment:

I looked through this PR and I agree there is nothing that looks like it should block the arrow-rs / arrow 50 release apache/arrow-rs#5234

Thank you @tustvold

datafusion/core/Cargo.toml (resolved thread)

@@ -43,7 +43,7 @@ async fn csv_query_custom_udf_with_cast() -> Result<()> {
"+------------------------------------------+",
"| AVG(custom_sqrt(aggregate_test_100.c11)) |",
"+------------------------------------------+",
"| 0.6584408483418833 |",
"| 0.6584408483418835 |",
Contributor:

This differs only in the last decimal place, so I think it is related to floating point stability and this change is fine.
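Last-decimal-place drift like this is the usual consequence of floating point addition being non-associative: grouping the same operands differently (as a changed aggregation strategy may do) can change the final digits. A minimal, self-contained sketch:

```rust
fn main() {
    // f64 addition is not associative: the same operands grouped
    // differently round differently at each intermediate step.
    let left = (0.1_f64 + 0.2) + 0.3;  // 0.6000000000000001
    let right = 0.1_f64 + (0.2 + 0.3); // 0.6
    assert_ne!(left, right);
    // Both results are correct to within rounding, so tests that
    // compare exact printed values are sensitive to evaluation order.
    println!("{left:.17} vs {right:.17}");
}
```

This is why test expectations pinned to a full-precision printed value (such as the `AVG(custom_sqrt(...))` result above) have to be updated when the summation order changes, even though neither answer is wrong.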


@@ -451,10 +451,6 @@ Dml: op=[Insert Into] table=[test_decimal]
"INSERT INTO test_decimal (nonexistent, price) VALUES (1, 2), (4, 5)",
"Schema error: No field named nonexistent. Valid fields are id, price."
)]
#[case::type_mismatch(
"INSERT INTO test_decimal SELECT '2022-01-01', to_timestamp('2022-01-01T12:00:00')",
"Error during planning: Cannot automatically convert Timestamp(Nanosecond, None) to Decimal128(10, 2)"
Contributor:

👍

@alamb alamb changed the title Prepare object_store 0.9.0 and arrow 50.0.0 Upgrade to object_store 0.9.0 and arrow 50.0.0 Jan 12, 2024
github-actions bot added the `documentation` (Improvements or additions to documentation) label · Jan 12, 2024
@@ -29,7 +29,6 @@ rust-version = "1.70"
[features]
ci = []
default = ["mimalloc"]
simd = ["datafusion/simd"]
Contributor:

arrow 50 removed the manual SIMD implementation and now relies on auto vectorization - apache/arrow-rs#5184
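"Auto vectorization" here means that plain loops compile down to SIMD instructions when optimizations are enabled, without hand-written intrinsics. An illustrative sketch (this is not arrow-rs code; the function name and shapes are invented for the example) of the kind of kernel LLVM can vectorize on its own:

```rust
// Illustrative only: a simple element-wise kernel shaped so the
// optimizer can auto-vectorize it (no data dependencies across
// iterations, equal-length slices). arrow 50 relies on this instead
// of the removed manual SIMD paths.
fn add_arrays(a: &[f64], b: &[f64], out: &mut [f64]) {
    for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

fn main() {
    let a = vec![1.0_f64; 8];
    let b = vec![2.0_f64; 8];
    let mut out = vec![0.0_f64; 8];
    add_arrays(&a, &b, &mut out);
    assert!(out.iter().all(|&v| v == 3.0));
    println!("{out:?}");
}
```

The practical upshot for downstream users is visible in this diff: the `simd` cargo feature (and the nightly toolchain it required) simply disappears, with no API change.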

datafusion-cli/src/exec.rs (resolved thread)
@@ -340,13 +340,10 @@ mod tests {
let session_token = "fake_session_token";
let location = "s3://bucket/path/file.parquet";

// Missing region
// Missing region, use object_store defaults
Contributor:

object_store now defaults to us-east-1: apache/arrow-rs#5244

@@ -138,7 +138,7 @@ physical_plan
SortPreservingMergeExec: [column1@0 ASC NULLS LAST]
--CoalesceBatchesExec: target_batch_size=8192
----FilterExec: column1@0 != 42
------ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:0..197], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..201], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:201..403], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:197..394]]}, projection=[column1], output_ordering=[column1@0 ASC NULLS LAST], predicate=column1@0 != 42, pruning_predicate=column1_min@0 != 42 OR 42 != column1_max@1, required_guarantees=[column1 not in (42)]
------ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:0..202], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..207], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:207..414], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:202..405]]}, projection=[column1], output_ordering=[column1@0 ASC NULLS LAST], predicate=column1@0 != 42, pruning_predicate=column1_min@0 != 42 OR 42 != column1_max@1, required_guarantees=[column1 not in (42)]
Contributor:

The parquet file appears to be slightly larger, so the offsets are now slightly different. This can happen because, for example, the metadata written changed: instead of "arrow-rs 49.0.0" it may now say "arrow-rs 50.0.0".

@tustvold (author) replied:

I think it is also because we now write the column sort order information

Contributor:

Perhaps this PR apache/arrow-rs#5110

@alamb alamb marked this pull request as ready for review January 13, 2024 10:02
@tustvold tustvold merged commit acf0f78 into apache:main Jan 14, 2024
23 checks passed
Labels: core (Core datafusion crate), documentation (Improvements or additions to documentation), physical-expr (Physical Expressions), sql, sqllogictest
3 participants