
Split out arrow-schema (#2594) #2711

Merged: 20 commits, merged into apache:master on Sep 21, 2022

Conversation

tustvold (Contributor)

Which issue does this PR close?

Part of #2594

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

bench = false

[dependencies]
serde = { version = "1.0", default-features = false, features = ["derive"], optional = true }
tustvold (Contributor Author)

This is somewhat annoying, but the orphan rule means serde needs to tag along for the ride
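For context, the orphan rule only permits a trait impl when either the trait or the type is local to the crate, so derives for a foreign trait like serde::Serialize must be emitted in the crate that defines the type. A minimal, dependency-free sketch (with the hypothetical trait `Describe` standing in for `serde::Serialize`):

```rust
// A local trait standing in for serde::Serialize. Implementing a foreign
// trait on a foreign type is forbidden by the orphan rule, so in arrow-schema
// the real code gates the derives behind an optional feature on the types the
// crate itself defines, roughly:
// `#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]`
trait Describe {
    fn describe(&self) -> String;
}

pub enum SimpleDataType {
    Int32,
    Utf8,
}

// Allowed here because `SimpleDataType` is local. If both the trait and the
// type lived in other crates, this impl would be rejected (error E0117),
// which is why serde support has to live inside arrow-schema itself.
impl Describe for SimpleDataType {
    fn describe(&self) -> String {
        match self {
            SimpleDataType::Int32 => "Int32".to_string(),
            SimpleDataType::Utf8 => "Utf8".to_string(),
        }
    }
}

fn main() {
    assert_eq!(SimpleDataType::Int32.describe(), "Int32");
    println!("{}", SimpleDataType::Utf8.describe());
}
```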

Review thread on arrow-schema/src/error.rs (outdated, resolved)
@tustvold tustvold added the api-change Changes to the arrow API label Sep 12, 2022
@@ -179,8 +186,7 @@ impl Schema {

     /// Returns a vector with references to all fields (including nested fields)
     #[inline]
-    #[cfg(feature = "ipc")]
-    pub(crate) fn all_fields(&self) -> Vec<&Field> {
+    pub fn all_fields(&self) -> Vec<&Field> {
tustvold (Contributor Author)

This needs to be made public so that the IPC reader can use it. I don't think this is a big problem.
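A sketch of what `all_fields` does, with simplified stand-ins for arrow's `Field` and `DataType` (the real types carry more variants and metadata): it walks nested struct fields and returns references to every field, including intermediate ones.

```rust
// Simplified stand-ins for arrow's schema types, for illustration only.
enum DataType {
    Int32,
    Struct(Vec<Field>),
}

struct Field {
    name: String,
    data_type: DataType,
}

// Collect references to all fields, including nested struct children,
// in depth-first order -- the behavior `all_fields` exposes to the IPC reader.
fn all_fields(fields: &[Field]) -> Vec<&Field> {
    let mut out = Vec::new();
    for f in fields {
        out.push(f);
        if let DataType::Struct(children) = &f.data_type {
            out.extend(all_fields(children));
        }
    }
    out
}

fn main() {
    let schema = vec![Field {
        name: "s".into(),
        data_type: DataType::Struct(vec![Field {
            name: "a".into(),
            data_type: DataType::Int32,
        }]),
    }];
    let names: Vec<&str> = all_fields(&schema).iter().map(|f| f.name.as_str()).collect();
    assert_eq!(names, ["s", "a"]);
    println!("{names:?}");
}
```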

@@ -296,6 +293,791 @@ pub(crate) fn singed_cmp_le_bytes(left: &[u8], right: &[u8]) -> Ordering {
Ordering::Equal
}

// MAX decimal256 value of little-endian format for each precision.
tustvold (Contributor Author)

These were moved out of datatypes as they aren't anything to do with the schema, so they didn't really belong there. I'm not entirely sure they belong here either, but they are pub(crate), so we can always move them later.
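For readers unfamiliar with these constants: the maximum value representable at decimal precision p is 10^p - 1, and the tables store those maxima in little-endian byte form. A small sketch of the relationship (using i128, which covers the decimal128 range; decimal256 needs wider arithmetic, hence the precomputed byte tables):

```rust
// The value each table entry encodes: the largest decimal of a given
// precision, i.e. 10^p - 1. The real code precomputes these as little-endian
// byte arrays because decimal256 exceeds any primitive integer width.
fn max_decimal_value(precision: u32) -> i128 {
    10i128.pow(precision) - 1
}

fn main() {
    assert_eq!(max_decimal_value(1), 9);
    assert_eq!(max_decimal_value(5), 99_999);
    // Little-endian byte layout, matching how the lookup tables are stored.
    let le = max_decimal_value(2).to_le_bytes();
    assert_eq!(le[0], 99);
    println!("{}", max_decimal_value(5));
}
```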

Review thread on arrow/src/util/decimal.rs (outdated, resolved)
@tustvold tustvold mentioned this pull request Sep 12, 2022
@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 12, 2022
tustvold (Contributor Author)

Working on trying to disentangle pyarrow 😭

Self::from_pyarrow(value)
}
}
/// A newtype wrapper around a `T: PyArrowConvert` that implements
tustvold (Contributor Author)

This is a breaking change, but it is necessary to get around the orphan rule.
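The workaround here is the standard newtype pattern: wrapping a type in a local struct makes it legal to implement a foreign trait for it. A dependency-free sketch, with `Hex` and `fmt::Display` standing in for `PyArrowType` and the pyo3 conversion traits:

```rust
use std::fmt;

// We cannot write `impl fmt::Display for Vec<u8>` in this crate: both the
// trait and the type are foreign, so the orphan rule rejects it. Wrapping
// the foreign type in a local newtype makes the impl legal -- the same trick
// the `PyArrowType` wrapper uses for the pyo3 trait impls.
struct Hex(Vec<u8>);

impl fmt::Display for Hex {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for b in &self.0 {
            write!(f, "{b:02x}")?;
        }
        Ok(())
    }
}

fn main() {
    let s = Hex(vec![0xde, 0xad]).to_string();
    assert_eq!(s, "dead");
    println!("{s}");
}
```

The cost, as noted, is that downstream code must wrap and unwrap the newtype at the boundary, which is where the breaking change shows up.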

tustvold (Contributor Author)

I think this is ready to go now. I've also created #2723, which would theoretically allow dropping the json flag from arrow-schema.

tustvold (Contributor Author)

Marking as draft pending #2724

@tustvold tustvold marked this pull request as draft September 13, 2022 17:57
alamb (Contributor) left a comment

I like everything about this PR except for the change in Error types

Review threads on arrow-schema/src/error.rs and arrow-schema/src/schema.rs (outdated, resolved)
alamb (Contributor) commented Sep 13, 2022

Thanks for starting this conversation @tustvold

@tustvold tustvold marked this pull request as ready for review September 14, 2022 15:22
@@ -28,9 +28,13 @@ use arrow::compute::kernels;
 use arrow::datatypes::{DataType, Field, Schema};
 use arrow::error::ArrowError;
 use arrow::ffi_stream::ArrowArrayStreamReader;
-use arrow::pyarrow::PyArrowConvert;
+use arrow::pyarrow::{PyArrowConvert, PyArrowException, PyArrowType};
tustvold (Contributor Author) commented Sep 14, 2022

The pyarrow bindings take a bit of a hit from this split, but I don't really see an obvious way around this, unless we push pyo3 into arrow-schema also. Thoughts?

Edit: This doesn't actually work, because the conversions for Schema require the FFI bindings, so I don't think there is a way around this

Contributor

cc @kszucs / @andygrove

I don't use the python bindings so I don't understand the implications of this change

Member

I will look into this. These are important in DataFusion/Ballista for executing Python UDFs. I will have time to review tomorrow.

Member

I ran out of time today - the Ballista release work took longer than hoped. I will try and look at this over the weekend.

Member

I think we need to wait until apache/datafusion#3483 is passing, and then we can create a PR in https://github.com/apache/arrow-datafusion-python/ to use that version and make sure the tests pass

Contributor

apache/datafusion#3483 is now ready for review / merge.

tustvold (Contributor Author)

@@ -75,7 +75,7 @@ default = ["csv", "ipc", "json"]
 ipc_compression = ["ipc", "zstd", "lz4"]
 csv = ["csv_crate"]
 ipc = ["flatbuffers"]
-json = ["serde", "serde_json"]
+json = ["serde_json"]
tustvold (Contributor Author)

There is a subtle breaking change here: previously the json feature would enable serde derives on Schema, etc., despite these not actually being part of the JSON API. This is now an explicit feature flag on arrow-schema.
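Downstream crates that relied on the json feature implicitly pulling in the derives would now opt in explicitly on arrow-schema. Assuming the serde feature name implied by the dependency declaration in this PR (version elided), the migration is roughly:

```toml
# Before this split, `features = ["json"]` on arrow also enabled the serde
# derives on Schema and friends. Afterwards, enable them explicitly:
[dependencies]
arrow-schema = { version = "...", features = ["serde"] }
```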

alamb (Contributor) left a comment

This looks great to me -- thank you @tustvold

It would be nice if someone familiar with pyarrow / pyo3 could weigh in before we merge this, as I don't understand the implications of the changes to that interface

@@ -63,6 +63,8 @@ jobs:
cargo run --example read_csv_infer_schema
- name: Run non-archery based integration-tests
run: cargo test -p arrow-integration-testing
- name: Test arrow-schema with all features
Contributor

Should we do the same for arrow-buffer as added in #2693?

tustvold (Contributor Author)

We could, but arrow-buffer doesn't have any feature flags that explicitly need testing

alamb (Contributor) commented Sep 16, 2022

Maybe we could add a comment for future readers like myself


bench = false

[dependencies]
serde = { version = "1.0", default-features = false, features = ["derive", "std"], optional = true }
Contributor

that is certainly a nice (very small) list of dependencies!

tustvold added a commit to tustvold/arrow-datafusion that referenced this pull request Sep 20, 2022
tustvold added a commit to apache/datafusion-python that referenced this pull request Sep 20, 2022
tustvold (Contributor Author)

I'm going to get this in; apache/datafusion-python#54 shows that the pyarrow changes don't necessitate major downstream changes. If there is further feedback, I will be more than happy to address it in a follow-up PR prior to the next arrow release.

@tustvold tustvold merged commit 48cc8be into apache:master Sep 21, 2022
ursabot commented Sep 21, 2022

Benchmark runs are scheduled for baseline = 74f639c and contender = 48cc8be. 48cc8be is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@alamb alamb changed the title Split out arrow-schema (#2594) Split out arrow-schema (#2594) Sep 30, 2022
andygrove added a commit to apache/datafusion-python that referenced this pull request Oct 13, 2022
* Update for changes in apache/arrow-rs#2711

* Clippy

* Use Vec conversion

* Update to DF 13 (#59)

* [DataFrame] - Add write_csv/write_parquet/write_json to DataFrame (#58)

* [SessionContext] - Add read_csv/read_parquet/read_avro functions to SessionContext (#57)

Co-authored-by: Francis Du <me@francis.run>

* remove patch from cargo toml

* add notes on git submodule for test data

Co-authored-by: Raphael Taylor-Davies <r.taylordavies@googlemail.com>
Co-authored-by: Francis Du <me@francis.run>
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate