Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: relationship / unification of arrow-rs and arrow2 going forward #1176

Closed
alamb opened this issue Jan 14, 2022 · 46 comments
Closed
Labels
question Further information is requested

Comments

@alamb
Copy link
Contributor

alamb commented Jan 14, 2022

TLDR: please comment on this ticket if you have opinions about if and/or how the community should unite its efforts on a single Rust implementation of Apache Arrow.

Related mailing list thread: https://lists.apache.org/thread/dsyks2ylbonhs8ngnx6529dzfyfdjjzo

There is active discussion and a PR apache/datafusion#1556 about switching the DataFusion project to use the arrow2 Rust implementation of Arrow from @jorgecarleitao. While this DataFusion PR is not yet ready to merge, if DataFusion were to switch to arrow2, that leaves a question of what will happen with this (arrow-rs) code.

Since many of the PRs, contributors and maintainers of this (arrow-rs) crate are part of the DataFusion community, I believe if DataFusion switches to arrow2, much of the maintenance and extension efforts would follow arrow2

arrow2is largely developed by @jorgecarleitao, who is an Apache Arrow PMC member and committer, but the project itself has not been under the Apache Software Foundation’s governance. Additional background can be found on the mailing list archives and past mailing list threads such as this and this

It is my opinion that the Rust / Arrow / DataFusion community has general consensus on:

  1. Having one implementation of Arrow in Rust where we can focus would be better than 2 which split attention and resources
  2. The technical underpinnings of arrow2 are more ergonomic

It is not clear to me if there is a consensus on:

  1. How important the Apache Governance model is (please lend your opinions here!)
  2. How important the stability of APIs / the specific versioning scheme (0.x vs 1.x or later)

Possible ideas for a way forward:

  1. Switch datafusion to arrow2, making no changes to arrow-rs. It could be maintained by anyone who wished to contribute,
  2. Bring arrow2 code into the arrow-rs repo, with appropriate IP clearance and adopt that as the officially maintained arrow implementation (*)
  3. Start more actively porting the more ergonomic parts of arrow2 into arrow-rs to reduce the feature gap as suggested in Discussion: Switch DataFusion to using arrow2? datafusion#1532 (comment) by @tustvold
  4. Others?

Option 2 leaves open the question of “how does arrow2 development move forward” – where would patches be sent, for example? I would hope we can find a way that is compatible with Apache governance, but I don't think we have a specific proposal yet, and it also depends in large part on what @jorgecarleitao is comfortable with

So, for any users of this crate not also in the DataFusion community, what are your hopes / needs / plans from this crate? How important is the apache governance to you? Please tell us your thoughts!

@alamb alamb added the question Further information is requested label Jan 14, 2022
@andygrove
Copy link
Member

My personal hope is that arrow2 can be donated to the ASF as a 0.x project. Given how widely used arrow2 is, I believe that it is becoming the de-facto implementation anyway and IMHO there would be benefits to having Apache governance.

For many of us who work for large corporations, we have to seek permission to contribute to open source projects. For example, I have permission to contribute to Apache Arrow and its subprojects, but I cannot contribute to arrow2 while it is an independent project (not that I would necessarily be contributing anyway, but others may be in a similar position).

I see no reason why the stable arrow crate cannot continue to exist and evolve independently of arrow2, assuming that there are contributors motivated to do so.

@alamb
Copy link
Contributor Author

alamb commented Jan 15, 2022

For the record, I would be willing to help maintain arrow2 if it were donated into the ASF doing things like coordinating release process / voting, bug fixing, answering questions, and the like

@houqp
Copy link
Member

houqp commented Jan 16, 2022

Thank you @alamb for starting and driving this discussion. Great summary on the current community consensus.

It is not clear to me if there is a consensus on:
How important the Apache Governance model is (please lend your opinions here!)

Personally, I think the Apache model works better for relatively slow moving monolith projects, while arrow2/parquet2 are fast evolving projects with a vision to be broken up into even smaller modular crates. @alamb has done an exceptional work on driving the arrow-rs releases. But seeing how much effort and time it takes, I would consider it a unnecessary overhead for arrow2 at its current stage. @jorgecarleitao was able to react to user feedbacks fast and release 3-4 new versions in a week for arrow2, this is simply not possible with the Apache Governance model. That said, I think the Apache voting process is very useful when you need high confidence on the quality of every single release and has a large diverse set of PMC members who can participate in the voting in a timely manner. But arrow2 seems still pretty far away from this.

@andygrove brought up a good point that it might become an issue for large corporations with restrictive open source contribution guide lines. This is the first time I am aware of this issue, previously I was under the impression that software license is all what matters. On the other hand, I am guessing ASF is not the only governance that's allowed? Perhaps we could help @jorgecarleitao come up with a different compatible governance model for arrow2 until it's ready for the ASF contribution? If Andy wants to contribute to arrow2 now but is blocked by lack of governance, then I would consider this a serious issue that we should address. Otherwise I would optimize for iteration velocity over governance until it becomes a real problem.

In short, from what I have seen so far, the upside from adopting the Apache governance model is to unblock potential contributions from big corporations. The downside is it will slow down our iteration process and potentially even disincentivize @jorgecarleitao from actively working on the project. Reading from his past emails, I get the feeling that he did try very hard to pass the IP clearance and donate arrow2 to ASF last year, but got frustrated by the bureaucracy. I am personally much more concerned about latter than the former.

How important the stability of APIs / the specific versioning scheme (0.x vs 1.x or later)

IMHO, this is not important as long as it is well communicated to the users. i.e. be explicit that we are special and please treat our 8.x as 0.x until we say otherwise. But Jorge has a strong opinion on this and want to strictly follow what the rest of the Rust ecosystem does. I also understand where he is coming from and respect his stance on this.

Switch datafusion to arrow2, making no changes to arrow-rs. It could be maintained by anyone who wished to contribute,

I agree with @andygrove on this. As long as there is community interests in this, we should probably still open arrow-rs up for contributions. This is not the result I want to see, but I have a feeling that this is likely what is going to happen :(

Start more actively porting the more ergonomic parts of arrow2 into arrow-rs

I think this is certainly doable, but then I stand by my previous comment that it won't be a good use of our time unless there is fundamental design tradeoffs in arrow-rs that are not compatible with arrow2's design. Simply replicating the design another project has is not a good reason to start a fork IMHO. I know @tustvold has a fairly strong opinion on this option and is more familiar with the parquet code base than I do, so perhaps he could help shed some light on this.

Option 2 leaves open the question of “how does arrow2 development move forward” – where would patches be sent, for example?

Just throwing out random idea here, one potential variant of option 2 is we use arrow-rs as the place to maintain stable arrow2 branches and let arrow2 iterate as fast as it could without the fear of introducing breaking changes. While the stable branch will cherry-pick compatible commits for a specific 0.x release that we want to maintain for X months. This way, we can still direct all contributions back to arrow2. The downside is I don't know how much interests the community has for a stable API considering we just decided to stop maintaining stable releases for arrow-rs.

@tustvold
Copy link
Contributor

tustvold commented Jan 16, 2022

I know @tustvold has a fairly strong opinion on this option and is more familiar with the parquet code base than I do, so perhaps he could help shed some light on this.

I would just like to get away from this situation where we have two concurrent projects. It is just demoralizing, draining, and to be completely honest it just seems a tad unnecessary. Whilst I do not like the idea of porting stuff across, and yes it would be an annoying use of time, I am willing to contribute to such an effort if it sees an end to this situation. It overcomes what is otherwise a potentially indefinite political and bureaucratic discussion with pure technical brute force. It is very similar in my mind to option 2, simply changing the merge direction.

Ultimately I'm going to end up porting code regardless, I would prefer an outcome that allows me to save others the same effort 😄

@houqp
Copy link
Member

houqp commented Jan 20, 2022

I totally agree with you @tustvold . We are all engineers looking to solve interesting technical problems together, not playing political games after all.

@sunchao
Copy link
Member

sunchao commented Mar 8, 2022

For many of us who work for large corporations, we have to seek permission to contribute to open source projects. For example, I have permission to contribute to Apache Arrow and its subprojects, but I cannot contribute to arrow2 while it is an independent project (not that I would necessarily be contributing anyway, but others may be in a similar position).

I'd like to echo @andygrove 's point here. The only reason we're able to contribute to Arrow/Parquet rust implementation is because it's under the governance of Apache. Otherwise, it'd be very hard for us.

@jorgecarleitao
Copy link
Member

I would just like to get away from this situation where we have two concurrent projects. [...]

I agree. I agree that the situation is not productive. I am sorry that I caused frustration to people here.

Whilst I do not like the idea of porting stuff across, and yes it would be an annoying use of time, I am willing to contribute to such an effort if it sees an end to this situation.

I am also willing to contribute to such an effort.

What do you think about something to the effect of:

  • Arrow2 is donated to Apache Arrow and its development ceases in jorgecarleitao/arrow2
  • The core of arrow2 (array/, bitmap/, offsets.rs, types/) are lifted to a crate living in this repo (e.g. arrow-core or something).
  • the core receives relevant methods from arrow-rs; add methods existing in arrow-rs with "deprecated" to give time for arrow-rs users to use them.
  • arrow-rs' FFI of arrow2 is moved to a separate crate and replaces arrow-rs' one
  • Arrow-rs' compute is migrated to use arrow-core
  • Arrow-rs' IO except IPC is migrated to use arrow-core
  • Arrow-rs' IPC IO is replaced by arrow2's implementation with necessary adjustments
  • arrow-core will add RunEndArray (this is missing there atm)
  • RecordBatch (arrow-rs) and Chunk (arrow2) co-exist to give room for both communities to use (in core or something else)
  • Development and governance follows Apache and this repo of community-driven development.

This could result in the following changes to arrow-rs:

  • ArrayData is removed
  • It becomes interoperable with Vec but no longer aligned with cache lines.
  • IPC support is improved with e.g. unsafe free, big endian support, mmap of IPC files
  • There is code churn related to from vs from_slice (we can switch to arrow-rs names)
  • slicing of arrays become easier (handling of offsets)

It would also end the arrow-arrow2 split e.g. removing the un-productive discussions around "which is better", and combine development efforts.

Some challenges:

  • Arrow2 uses Box<dyn Array> as children to allow easy mutation; arrow-rs uses Arc<dyn Array>
  • Arrow2 does not have TimestampArray nor DecimalArray, and instead sticks to the physical types only

@tustvold
Copy link
Contributor

I am also willing to contribute to such an effort.

That would be fantastic, finding a way to unify our efforts would be amazing

What do you think about something to the effect of

Let me take some time to think about the implications of this and how we might do it in an incremental fashion. Whilst I have no particular affection for ArrayData, it is a fairly useful escape hatch and is important some more advanced use-cases (Vec's allocator support is still unstable IIRC).

@ritchie46
Copy link
Contributor

I am also willing to contribute to such an effort.

I am also willing to help on this. And if this would go forward consider me as invested as long term maintainer of the project as well.

@alamb
Copy link
Contributor Author

alamb commented Feb 15, 2023

Thank you @jorgecarleitao @ritchie46 and @tustvold

Arrow2 is donated to Apache Arrow and its development ceases in jorgecarleitao/arrow2

I am willing to run the "Ip clearance" process to get the arrow2 codebase into the arrow-rs repository. I have done this before for object_store and while it takes some time I think it is worth it in this case.

As for the technical plans, the only thing I feel (very) strongly is that there is a migration path that doesn't involve a "all downstream crates need to rewrite the world" -- given the above discussions, it sounds like there are already some good thoughts on this matter so I trust it will be in good hands.

I would also be interested in what some of the other recently active contributors / maintainers of arrow-rs such @viirya @askoa and @iajoiner think of these ideas

FYI @liukun4515

@andygrove
Copy link
Member

This is fantastic news! Thank you @jorgecarleitao and @ritchie46!

@jorgecarleitao
Copy link
Member

That would be fantastic, finding a way to unify our efforts would be amazing

Agree. :D

Let me take some time to think about the implications of this and how we might do it in an incremental fashion.

incremental fashion 👍

Do let me know if we need a small demo of the current arrow2 design to motivate it or something else.

One idea is to wrap arrow2 arrays in arrow-rs newtypes struct PrimitiveArray<T>(PrimitiveArrayInternal<T>);, so that arrow-rs users continue to see (almost) the same API; internally we could glue them together via From<> implementations - just an idea.

I am willing to run the "Ip clearance" process to get the arrow2 codebase into the arrow-rs repository. I have done this before for object_store and while it takes some time I think it is worth it in this case.

Thank you 🙇 - I will certainly help in gathering evidence regarding contributors.

As for the technical plans, the only thing I feel (very) strongly is that there is a migration path that doesn't involve a "all downstream crates need to rewrite the world"

@alamb - I agree - I believe ultimately some rewriting will have to happen as we have arrow2 and arrow-rs users with two APIs sometimes with the same name and different signatures. I do think that the way arrow-rs has been re-writing its APIs to be very conductive and productive (i.e. gradually, deprecation warnings, etc.) 🎩 tip to the team here.

@askoa
Copy link
Contributor

askoa commented Feb 15, 2023

@alamb Thanks for the tag. Though I can't gauge the benefits of arrow2 v arrow-rs, I agree to the notion that the change should happen incrementally to reduce the impact for downstream of both projects.

Removing ArrayData is a huge change for arrow-rs. Also, my impression is that ArrayData allowed to slice arrays to different logical views on same underlying physical data. So it is possible to have many logical arrays built on top of one physical array. By removing ArrayData we might increase memory footprint if users want different logical view of same physical data.

I, as usual, continue to pick up issues based on my availability.

@ritchie46
Copy link
Contributor

it is possible to have many logical arrays built on top of one physical array. By removing ArrayData we might increase memory footprint if users want different logical view of same physical data.

This is perfectly possible in arrow2 as well. The logical types are stored by the DataType of the physical array. Memory footprint is only dictated by physical arrays.

@viirya
Copy link
Member

viirya commented Feb 15, 2023

Thanks @alamb for pinging me. I've read @jorgecarleitao's comments (and previous comments) when it was posted yesterday. I think my feelings come from two aspects.

As a downstream crate user of arrow-rs, ideally we hope that the unification of arrow-rs and arrow2 wouldn't bring too much API changes, although sometimes it might be not avoidable. Based on the proposed change list and our code dependency, our impact might be slight because there are middle layers like DataFusion so some changes will be ingested there, but I'm also wondering how much it could be on projects which are fully coupled with. At first glance at the list of change in @jorgecarleitao's comment, seems at least API stuffs could be existing for a while.

As a contributor / maintainer of arrow-rs or an open source contributor of Rust arrow implementation, this is a fantastic news definitely. We can contribute to and benefit from others works in one single place. I'd be very pleasant to see this happens, and definitely very willing to help on it. 💪

@tustvold
Copy link
Contributor

tustvold commented Feb 15, 2023

What do people think about doing something like #1799 and adding a layout enumeration inside of ArrayData consisting of the arrow2 array abstractions, or some variation thereupon? I think this should be possible without major public API changes, avoiding churning any downstreams, whilst still gaining us many of the things @jorgecarleitao mentioned above? We could then at our leisure update the arrow-rs arrays / builders to use them #1811

I'm mainly trying to avoid making changes to Array and its implementations, as any change there will be painful.

@jorgecarleitao
Copy link
Member

jorgecarleitao commented Feb 16, 2023

This comment is not about how we merge arrow-rs and arrow2, just trying to outline the overall issues I observed in arrow-rs that resulted in arrow2. It may serve as an inspiration to find a solution that combines both.

1. Buffer is untyped

ptr::reading a buffer requires knowing the layout it was created from. This means that the struct Buffer cannot be semantically used. In other words, the information it carries is insufficient to use it - something else, usually its position in the ArrayData and the DataType, is required to use it in any way.

2. Buffer logical memory region is unknown

we need to know the offset, length and physical type to get its logical memory region. This is stored in ArrayData. Whether we need to offset the buffer or not with ArrayData is dependent on the physical type. How we should apply offset is also dependent on the physical type (offsets of bitmaps are measured in bits).

Any user of ArrayData will need to correctly apply these rules when reading any of its Buffers.

3. ArrayData is untyped

We need to use DataType (a logical type) to discern, at runtime, how to interpret its buffers and children. I.e. any physical typing is erased from ArrayData when it is built.

#1799 is an idea to address this.

4. Efficient use of ArrayData requires self-referencing

This is done performance reasons: since accessing an element requires runtime indirection to the buffers of ArrayData, which is expensive, we need to self-reference them to get a good performance. Self-referencing is considered an anti-pattern in Rust.

#1811 is one way to address this.

5. Buffer and MutableBuffer layout is inconsistent with Vec

Most Rust uses Vec. However, we allocate according to cache lines irrespectively of the type, resulting in a layout incompatible with Vec. This results in our inability to easily data from Buffer / MutableBuffer back to a Vec. The other way is possible due to foreign allocated capabilities of Buffer.

How arrow2 solves these

Problem 1. and 2.

  • Buffer is typed
  • Buffer contains an offset and length
#[derive(Clone)]
pub struct Buffer<T> {
    /// the internal byte buffer.
    data: Arc<Bytes<T>>,

    /// The offset into the buffer.
    offset: usize,

    // the length of the buffer. Given a region `data` of N elements, [offset..offset+length] is visible
    // to this buffer.
    length: usize,
}
  • Bitmap is a first-class citizen
#[derive(Clone)]
pub struct Bitmap {
    bytes: Arc<Bytes<u8>>,
    // both are measured in bits. They are used to bound the bitmap to a region of Bytes.
    offset: usize,
    length: usize,
    // this is a cache: it is computed on initialization
    unset_bits: usize,
}

#[derive(Clone)]
pub struct MutableBitmap {
    buffer: Vec<u8>,
    // invariant: length.saturating_add(7) / 8 == buffer.len();
    length: usize,
}

These together are necessary and sufficient to represent all physical regions of an Arrow Array.

Problem 3. and 4.

An array is composed by the individual parts. Examples:

#[derive(Clone)]
pub struct PrimitiveArray<T: NativeType> {
    data_type: DataType,
    values: Buffer<T>,
    validity: Option<Bitmap>,
}

#[derive(Clone)]
pub struct BooleanArray {
    data_type: DataType,
    values: Bitmap,
    validity: Option<Bitmap>,
}

#[derive(Clone)]
pub struct BinaryArray<O: Offset> {
    data_type: DataType,
    offsets: OffsetsBuffer<O>,  // a newtype buffer where elements are guaranteed to be in increasing order
    values: Buffer<u8>,
    validity: Option<Bitmap>,
}

Because Buffer implements Deref to Target [T], we can access individual values without self-referencing.

Problem 5.

Arrow2 uses Vec as its "native" container. In particular, it does not use MutableBuffer and instead uses std::vec::Vec. There is no "freeze"; instead, there is Arc<Bytes<T>> (a small variation of Arc<Vec<T>> to allow for foreign allocated regions).

Wrap up

What I am trying to say with "remove ArrayData" is that the moment we expose such API to users, we expose a significant amount of footguns, almost all of them unsafe. In my opinion, we should strive to remove such footguns. In my opinion, this could be done by exposing all the necessary functionality around individual arrays on the arrays themselves (or their Mutable counterparts), so that, as a user, you can only use those APIs.

For example,

impl<T: NativeType> PrimitiveArray<T> {
  pub fn try_new(
          data_type: DataType,
          values: Buffer<T>,
          validity: Option<Bitmap>,
      ) -> Result<Self, Error>;
}

Provides a complete construction of a PrimitiveArray in Arrow2 - ArrayData::null_count is the unset bits validity, ArrayData::offset is values.offset().

A simpler construct (both are O(1)):

impl<T: NativeType> PrimitiveArray<T> {
    pub fn from_vec(values: Vec<T>) -> Self {
        Self::try_new(T::PRIMITIVE.into(), values.into(), None).unwrap()
    }

With

impl<T: NativeType> PrimitiveArray<T> {
    #[must_use]
    pub fn into_inner(self) -> (DataType, Buffer<T>, Option<Bitmap>) {
        let Self {
            data_type,
            values,
            validity,
        } = self;
        (data_type, values, validity)
    }
}

a user reverts to its individual parts. Because both values and validity contain the individual offsets and physical information (via generic), Buffer behave like bytes::Bytes and Bitmap like a normal container (e.g. iter() skips by offset and runs up to length).

I am not saying we need to do this in arrow-rs, just pointing out that this is the appeal of arrow2 - imo it provides a very easy mental model of the data and corresponding invariants because every struct encapsulates the necessary information (via types or values) to use it regardless of its origin (e.g. using Buffer requires information from its parent ArrayData).

@tustvold
Copy link
Contributor

tustvold commented Feb 16, 2023

What did you think of putting the arrow2 arrays inside of ArrayData, this would achieve all of the above without changing the public array APIs, at least initially?

Edit: I've had a poke around in DataFusion and confirmed it makes fairly limited of ArrayData directly, mostly just using it to access the null buffer, length or downcast out of an Arc. This is some empirical evidence to my hunch that we can make quite aggressive changes at this level without it being overly painful for downstreams, certainly significantly less painful than altering the array abstractions.

@jorgecarleitao
Copy link
Member

@tustvold do you mean in line with #1799 ? I think that that would work. :)

I do wonder: to preserve arrow-rs public API, wouldn't making arrow-rs arrays newtypes of arrow2 arrays be enough?

@tustvold
Copy link
Contributor

wouldn't making arrow-rs arrays newtypes of arrow2 arrays be enough?

Something like that is definitely the intended end state, but we need some way to get there incrementally 😅

I also have a vague hope that the new ArrayData enumeration will allow moving away from trait object downcasting, which is confusing and ergonomically unfortunate

@jorgecarleitao
Copy link
Member

Something like that is definitely the intended end state, but we need some way to get there incrementally 😅

Got it. 👍

I also have a vague hope that the new ArrayData enumeration will allow moving away from trait object downcasting, which is confusing and ergonomically unfortunate

Interesting! That is not something I had thought about and makes a lot of sense. 👍

From my end, your proposal: :shipit:

@tustvold
Copy link
Contributor

Wonderful, I'll make a start getting the basic abstractions in place next week, and will then create some tickets for the various migration work that will fall out of this so that we can divide and conquer on this. Hopefully by the time the IP clearance completes we will have gotten arrow-rs into a state where the arrow2 arrays can just be dropped in.

@ritchie46
Copy link
Contributor

Can these new-types be added to a separate crate and thereby leaving arrow2 as an arrow-core crate? This seems like a good separation of concerns to me and will also ensure parallel compilation and probably less complexity.

@tustvold
Copy link
Contributor

Can these new-types be added to a separate crate and thereby leaving arrow2 as an arrow-core crate

I would expect the arrow2 arrays to largely replace what is currently in arrow-data, effectively arrow-data would become the arrow2 array abstractions. Any other ported functionality, e.g. IPC, would then go into the corresponding crate, e.g. arrow-ipc. I think that is what you are asking for?

@ritchie46
Copy link
Contributor

Any other ported functionality, e.g. IPC, would then go into the corresponding crate, e.g. arrow-ipc. I think that is what you are asking for?

I think so. I think it is a combination of arrow-data and arrow-buffer. Maybe I am missing another one. The IO and compute can definitely be done in subcrates.

What I also meant with the newtypes are the ArrayData abstractions. Users of current arrow2 don't use/need that abstraction, so I think it is worth adding that in a separate crate that is opt-in.

tustvold added a commit that referenced this issue Feb 23, 2023
* Remove RawPtrBox (#1811) (#1176)

* Clippy

* Extract get_offsets function
tustvold added a commit to tustvold/arrow-rs that referenced this issue Feb 23, 2023
tustvold added a commit to tustvold/arrow-rs that referenced this issue Feb 23, 2023
@tustvold
Copy link
Contributor

For those following along

Together these work towards allowing us to slowly deprecate and remove the untyped MutableBuffer and Buffer, in favor of arrow2-style Vec and ScalarBuffer respectively. If people have any spare cycles any feedback on these PRs and/or the general approach would be much appreciated.

@sundy-li
Copy link
Contributor

@alamb Glad to see arrow-rs and arrow2 could be merged together, I would like to contribute to help these processes.

@tustvold
Copy link
Contributor

What do people think of migrating the arrow2 arrays to be newtypes around the statically typed ArrayData once ready, i.e. much like we are doing for the arrow-rs arrays? This would have a few advantages:

  • Work can start on merging the codebases without having to wait for IP clearance, I hope to have a first cut of the ArrayData ready for the release in just under two weeks
  • Allows for incremental changes to arrow2, in much the same vein as we are doing for arrow-rs
  • We can find potential integration issues sooner
  • We can immediately divide and conquer
  • Avoids issues with overloaded method and type names, etc... each codebase can keep its current array abstractions as is, gradually migrating as/when desired

I don't really see a way to avoid doing something similar to this, without a break-the-world arrow2 release, and I think bringing this forward has some compelling advantages?

@ritchie46
Copy link
Contributor

Can we take PrimitiveArray as an example?

This currently is a small wrapper around a typed Buffer<T> and an optional bitmap Option<Bitmap>.

pub struct PrimitiveArray<T: NativeType> {
    data_type: DataType,
    values: Buffer<T>,
    validity: Option<Bitmap>,
}

As I understand you want to convert it to

struct PrimitiveArray<T: NativeType> {
  array_data: ArrayData
  phantom_data: Phantom<T>
}

enum ArrrayData {
   Primitive{..},
   Utf8{..},
   ...
}

This would cost an extra check upon random access of the array as the variant of the ArrayData needs to be unpacked, or there needs to be used extra unsafe and hope the compiler is able to compile away those branches that will not be hit. Besides that to me it feels the wrong way around to go from known type with an untyped buffer inside. The other way around makes much more sense to me.

@tustvold
Copy link
Contributor

Besides that to me it feels the wrong way around to go from known type with an untyped buffer inside

It won't be untyped it would contain PrimitiveArrayData ?

@ritchie46
Copy link
Contributor

It won't be untyped it would contain PrimitiveArrayData ?

Ah, I see. In that case it makes sense 👍 . By the look of it, it seems exactly the same as an arrow2 PrimitiveArray? Couldn't we do with one type in that case?

@tustvold
Copy link
Contributor

Couldn't we do with one type in that case

We theoretically could, I was proposing not to for the reasons outlined in #1176 (comment)

@ritchie46
Copy link
Contributor

Couldn't we do with one type in that case

We theoretically could, I was proposing not to for the reasons outlined in #1176 (comment)

I understand. For PrimitiveArray it seems fine as the newtype seems to be exactly the same for other type I can imagine it getting a bit more complicated, so no comment yet.

@tustvold
Copy link
Contributor

tustvold commented Feb 28, 2023

for other type I can imagine it getting a bit more complicated

Perhaps you might like to check out #3769 which contains the remaining abstractions, and let me know what you think there?

tustvold added a commit that referenced this issue Mar 1, 2023
* Zero-copy Vec conversion (#3516) (#1176)

* Fix doc

* More tests

* Review feedback

* More tests
@alamb
Copy link
Contributor Author

alamb commented Mar 1, 2023

As an update here, I am working on creating a summary of what I think the proposal is in this ticket that won't require reading all the context, and then I will create an "epic" style ticket that breaks the work down into more manageable pieces. I expect to be ready in the next few days

@alamb
Copy link
Contributor Author

alamb commented Mar 3, 2023

Here is a summary of what I think the plan is. I used Google Slides to make it easier to comment on diagrams:
https://docs.google.com/presentation/d/1cqQEpC-kJES2Mng152r_qZyaOqHjtb5YFuseSTWyulU/edit#slide=id.p

Here is a copy for anyone who prefers PDF: arrow-rs + arrow2.pdf

Very much looking forward to feedback. Thank you @tustvold and @ritchie46 who helped create this summary

I plan to make the tracking ticket next week

@alamb
Copy link
Contributor Author

alamb commented Mar 7, 2023

I have filed jorgecarleitao/arrow2#1429 with the proposal of how this could work and some alternatives. Please provide your feedback there.

Unless I hear otherwise I plan to close this particular issue in a few days and we can continue the discussion on jorgecarleitao/arrow2#1429

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests