
Split out arrow-schema (#2594) #2711

Merged: 20 commits, merged into apache:master on Sep 21, 2022

Conversation

tustvold (Contributor)

Which issue does this PR close?

Part of #2594

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

bench = false

[dependencies]
serde = { version = "1.0", default-features = false, features = ["derive"], optional = true }
tustvold (Contributor Author)

This is somewhat annoying, but the orphan rule means serde needs to tag along for the ride
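For context, the orphan rule only permits a trait impl when either the trait or the type is local to the crate, so derives for a foreign trait like serde::Serialize must be emitted in the crate that defines the type. A minimal, dependency-free sketch (with the hypothetical trait `Describe` standing in for `serde::Serialize`):

```rust
// A local trait standing in for serde::Serialize. Implementing a foreign
// trait on a foreign type is forbidden by the orphan rule, so in arrow-schema
// the real code gates the derives behind an optional feature on the types the
// crate itself defines, roughly:
// `#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]`
trait Describe {
    fn describe(&self) -> String;
}

pub enum SimpleDataType {
    Int32,
    Utf8,
}

// Allowed here because `SimpleDataType` is local. If both the trait and the
// type lived in other crates, this impl would be rejected (error E0117),
// which is why serde support has to live inside arrow-schema itself.
impl Describe for SimpleDataType {
    fn describe(&self) -> String {
        match self {
            SimpleDataType::Int32 => "Int32".to_string(),
            SimpleDataType::Utf8 => "Utf8".to_string(),
        }
    }
}

fn main() {
    assert_eq!(SimpleDataType::Int32.describe(), "Int32");
    println!("{}", SimpleDataType::Utf8.describe());
}
```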

Review thread on arrow-schema/src/error.rs (outdated, resolved)
@tustvold tustvold added the api-change Changes to the arrow API label Sep 12, 2022
@@ -179,8 +186,7 @@ impl Schema {

     /// Returns a vector with references to all fields (including nested fields)
     #[inline]
-    #[cfg(feature = "ipc")]
-    pub(crate) fn all_fields(&self) -> Vec<&Field> {
+    pub fn all_fields(&self) -> Vec<&Field> {
tustvold (Contributor Author)

This needs to be made public so that the IPC reader can use it. I don't think this is a big problem.
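A sketch of what `all_fields` does, with simplified stand-ins for arrow's `Field` and `DataType` (the real types carry more variants and metadata): it walks nested struct fields and returns references to every field, including intermediate ones.

```rust
// Simplified stand-ins for arrow's schema types, for illustration only.
enum DataType {
    Int32,
    Struct(Vec<Field>),
}

struct Field {
    name: String,
    data_type: DataType,
}

// Collect references to all fields, including nested struct children,
// in depth-first order -- the behavior `all_fields` exposes to the IPC reader.
fn all_fields(fields: &[Field]) -> Vec<&Field> {
    let mut out = Vec::new();
    for f in fields {
        out.push(f);
        if let DataType::Struct(children) = &f.data_type {
            out.extend(all_fields(children));
        }
    }
    out
}

fn main() {
    let schema = vec![Field {
        name: "s".into(),
        data_type: DataType::Struct(vec![Field {
            name: "a".into(),
            data_type: DataType::Int32,
        }]),
    }];
    let names: Vec<&str> = all_fields(&schema).iter().map(|f| f.name.as_str()).collect();
    assert_eq!(names, ["s", "a"]);
    println!("{names:?}");
}
```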

@@ -296,6 +293,791 @@ pub(crate) fn singed_cmp_le_bytes(left: &[u8], right: &[u8]) -> Ordering {
Ordering::Equal
}

// MAX decimal256 value of little-endian format for each precision.
tustvold (Contributor Author)

These were moved out of datatypes as they aren't anything to do with the schema, so they didn't really belong there. I'm not entirely sure they belong here either, but they are pub(crate), so we can always move them later.
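For readers unfamiliar with these constants: the maximum value representable at decimal precision p is 10^p - 1, and the tables store those maxima in little-endian byte form. A small sketch of the relationship (using i128, which covers the decimal128 range; decimal256 needs wider arithmetic, hence the precomputed byte tables):

```rust
// The value each table entry encodes: the largest decimal of a given
// precision, i.e. 10^p - 1. The real code precomputes these as little-endian
// byte arrays because decimal256 exceeds any primitive integer width.
fn max_decimal_value(precision: u32) -> i128 {
    10i128.pow(precision) - 1
}

fn main() {
    assert_eq!(max_decimal_value(1), 9);
    assert_eq!(max_decimal_value(5), 99_999);
    // Little-endian byte layout, matching how the lookup tables are stored.
    let le = max_decimal_value(2).to_le_bytes();
    assert_eq!(le[0], 99);
    println!("{}", max_decimal_value(5));
}
```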

Review thread on arrow/src/util/decimal.rs (outdated, resolved)
@tustvold tustvold mentioned this pull request Sep 12, 2022
@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 12, 2022
tustvold (Contributor Author)

Working on trying to disentangle pyarrow 😭

Self::from_pyarrow(value)
}
}
/// A newtype wrapper around a `T: PyArrowConvert` that implements
tustvold (Contributor Author)

This is a breaking change, but it is necessary to get around the orphan rule.
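The workaround here is the standard newtype pattern: wrapping a type in a local struct makes it legal to implement a foreign trait for it. A dependency-free sketch, with `Hex` and `fmt::Display` standing in for `PyArrowType` and the pyo3 conversion traits:

```rust
use std::fmt;

// We cannot write `impl fmt::Display for Vec<u8>` in this crate: both the
// trait and the type are foreign, so the orphan rule rejects it. Wrapping
// the foreign type in a local newtype makes the impl legal -- the same trick
// the `PyArrowType` wrapper uses for the pyo3 trait impls.
struct Hex(Vec<u8>);

impl fmt::Display for Hex {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for b in &self.0 {
            write!(f, "{b:02x}")?;
        }
        Ok(())
    }
}

fn main() {
    let s = Hex(vec![0xde, 0xad]).to_string();
    assert_eq!(s, "dead");
    println!("{s}");
}
```

The cost, as noted, is that downstream code must wrap and unwrap the newtype at the boundary, which is where the breaking change shows up.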

tustvold (Contributor Author)

I think this is ready to go now. I've also created #2723, which would theoretically allow dropping the json flag from arrow-schema.

tustvold (Contributor Author)

Marking as draft pending #2724

@tustvold tustvold marked this pull request as draft September 13, 2022 17:57
alamb (Contributor) left a comment

I like everything about this PR except for the change in Error types

Review threads on arrow-schema/src/error.rs and arrow-schema/src/schema.rs (outdated, resolved)
alamb (Contributor) commented Sep 13, 2022

Thanks for starting this conversation @tustvold

@tustvold tustvold marked this pull request as ready for review September 14, 2022 15:22
@@ -28,9 +28,13 @@ use arrow::compute::kernels;
 use arrow::datatypes::{DataType, Field, Schema};
 use arrow::error::ArrowError;
 use arrow::ffi_stream::ArrowArrayStreamReader;
-use arrow::pyarrow::PyArrowConvert;
+use arrow::pyarrow::{PyArrowConvert, PyArrowException, PyArrowType};
tustvold (Contributor Author) commented Sep 14, 2022

The pyarrow bindings take a bit of a hit from this split, but I don't really see an obvious way around this, unless we push pyo3 into arrow-schema also. Thoughts?

Edit: This doesn't actually work, because the conversions for Schema require the FFI bindings, so I don't think there is a way around this

Contributor

cc @kszucs / @andygrove

I don't use the python bindings so I don't understand the implications of this change

Member

I will look into this. These are important in DataFusion/Ballista for executing Python UDFs. I will have time to review tomorrow.

Member

I ran out of time today - the Ballista release work took longer than hoped. I will try and look at this over the weekend.

Member

I think we need to wait until apache/datafusion#3483 is passing, and then we can create a PR in https://github.com/apache/arrow-datafusion-python/ to use that version and make sure the tests pass

Contributor

apache/datafusion#3483 is now ready for review / merge.

tustvold (Contributor Author)

@@ -75,7 +75,7 @@ default = ["csv", "ipc", "json"]
 ipc_compression = ["ipc", "zstd", "lz4"]
 csv = ["csv_crate"]
 ipc = ["flatbuffers"]
-json = ["serde", "serde_json"]
+json = ["serde_json"]
tustvold (Contributor Author)

There is a subtle breaking change here: previously the json feature would enable serde derives on Schema, etc., despite these not actually being part of the JSON API. This is now an explicit feature flag on arrow-schema.
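Downstream crates that relied on the json feature implicitly pulling in the derives would now opt in explicitly on arrow-schema. Assuming the serde feature name implied by the dependency declaration in this PR (version elided), the migration is roughly:

```toml
# Before this split, `features = ["json"]` on arrow also enabled the serde
# derives on Schema and friends. Afterwards, enable them explicitly:
[dependencies]
arrow-schema = { version = "...", features = ["serde"] }
```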

alamb (Contributor) left a comment

This looks great to me -- thank you @tustvold

It would be nice if someone familiar with pyarrow / pyo3 could weigh in before we merge this, as I don't understand the implications of the changes to that interface

@@ -63,6 +63,8 @@ jobs:
cargo run --example read_csv_infer_schema
- name: Run non-archery based integration-tests
run: cargo test -p arrow-integration-testing
- name: Test arrow-schema with all features
Contributor

Should we do the same for arrow-buffer as added in #2693?

tustvold (Contributor Author)

We could, but arrow-buffer doesn't have any feature flags that explicitly need testing

alamb (Contributor) commented Sep 16, 2022

Maybe we could add a comment for future readers like myself


bench = false

[dependencies]
serde = { version = "1.0", default-features = false, features = ["derive", "std"], optional = true }
Contributor

that is certainly a nice (very small) list of dependencies!

tustvold added a commit to tustvold/arrow-datafusion that referenced this pull request Sep 20, 2022
tustvold added a commit to apache/datafusion-python that referenced this pull request Sep 20, 2022
tustvold (Contributor Author)

I'm going to get this in; apache/datafusion-python#54 shows that the pyarrow changes don't necessitate major downstream changes. If there is further feedback, I will be more than happy to address it in a follow-up PR prior to the next arrow release.

@tustvold tustvold merged commit 48cc8be into apache:master Sep 21, 2022
ursabot commented Sep 21, 2022

Benchmark runs are scheduled for baseline = 74f639c and contender = 48cc8be. 48cc8be is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@alamb alamb changed the title Split out arrow-schema (#2594) Split out arrow-schema (#2594) Sep 30, 2022
andygrove added a commit to apache/datafusion-python that referenced this pull request Oct 13, 2022
* Update for changes in apache/arrow-rs#2711

* Clippy

* Use Vec conversion

* Update to DF 13 (#59)

* [DataFrame] - Add write_csv/write_parquet/write_json to DataFrame (#58)

* [SessionContext] - Add read_csv/read_parquet/read_avro functions to SessionContext (#57)

Co-authored-by: Francis Du <me@francis.run>

* remove patch from cargo toml

* add notes on git submodule for test data

Co-authored-by: Raphael Taylor-Davies <r.taylordavies@googlemail.com>
Co-authored-by: Francis Du <me@francis.run>
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate