Split up Arrow Crate #2594

tustvold · 2022-08-26T21:16:43Z

TL;DR: rather than fighting entropy, let's just brute-force compilation

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The arrow crate is getting rather large, and is starting to show up as a non-trivial bottleneck when compiling code, see #2170. There have been some efforts to reduce the amount of generated code, see #1858, but this is going to be a perpetual losing battle against new feature additions.

I think there are a couple of problems currently:

Limited build parallelism, especially if codegen-units is set low
Upstream crates have to "depend" on functionality they don't need, e.g. parquet depending on compute kernels
Minor changes force large amounts of recompilation, with incremental compilation only helping marginally
Codegen is rarely linear in complexity; consequently, larger codegen units take longer than the same amount of code in smaller units

All of these conspire to produce an arrow-shaped hole in compilation, where CPUs are left idle.
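To make the last point concrete, here is a toy cost model, not a measurement. It assumes the time to compile a unit grows as `size^exponent` with an exponent above 1; the function names and the exponent are purely illustrative assumptions and say nothing about how rustc or LLVM actually behave.

```rust
/// Toy model (assumption): compiling a unit of `size` costs `size^exponent`.
fn unit_cost(size: f64, exponent: f64) -> f64 {
    size.powf(exponent)
}

/// Total CPU time when the same code is split into `parts` equal units.
fn total_cost(size: f64, parts: u32, exponent: f64) -> f64 {
    let parts = f64::from(parts);
    parts * unit_cost(size / parts, exponent)
}

/// Wall-clock time if all units could compile fully in parallel.
/// This ignores dependencies between units, so it is a best case.
fn parallel_wall_clock(size: f64, parts: u32, exponent: f64) -> f64 {
    unit_cost(size / f64::from(parts), exponent)
}
```

Under this model with an exponent of 1.5, splitting 100 units of code into 4 parts halves the total cost (1000 down to 500) before any parallelism is counted, and the best-case wall-clock time drops to 125.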

Some numbers from my local machine

Release with default features: 232 seconds
Release with default features without comparison kernels: 150 seconds
Release with default features without compute kernels: 70 seconds
Release without default features without compute kernels: 60 seconds

The vast majority of the time all bar a single core is idle.
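As a rough reading of these numbers, the sketch below subtracts each reduced build from the full build. It assumes the timings are directly comparable and that the difference can be attributed to the removed component, which is only approximately true; the constant and function names are mine.

```rust
/// Release build times in seconds, as reported above.
const FULL: u32 = 232;
const WITHOUT_COMPARISON: u32 = 150;
const WITHOUT_COMPUTE: u32 = 70;

/// Seconds attributed to a component: full build minus the build without it.
fn attributable(full: u32, without: u32) -> u32 {
    full - without
}

/// The same quantity as a percentage of the full build.
fn share_percent(full: u32, without: u32) -> f64 {
    100.0 * f64::from(attributable(full, without)) / f64::from(full)
}
```

On this reading, `attributable(FULL, WITHOUT_COMPARISON)` is 82 seconds (about 35% of the build) and `attributable(FULL, WITHOUT_COMPUTE)` is 162 seconds (about 70%).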

Describe the solution you'd like

I would like to propose that we split up the arrow crate into a number of sub-crates that are then re-exported by the top-level arrow crate. Users can then choose to depend on the batteries-included arrow crate, or on more granular crates.

Initially I would propose the following split:

arrow-csv: CSV reader support
arrow-ipc: IPC support
arrow-json: JSON support (related to Make JSON support Optional via Feature Flag #2300)
arrow-compute: contents of compute module
arrow-test: arrow test_utils (not published)
arrow-core: everything else

There is definitely scope for splitting up the crates further after this. In particular, the comparison kernels might be a good candidate to live on their own, but I think we should start small and go from there. I suspect a fair amount of disentangling will be necessary to achieve this.
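A minimal sketch of the re-export idea, using modules in a single file to stand in for separate crates. The module contents (`Int32Array`, `sum`) and the facade paths are hypothetical placeholders for illustration, not the actual layout of any arrow crate.

```rust
/// Stands in for the proposed `arrow-core` crate (contents are hypothetical).
pub mod arrow_core {
    #[derive(Debug, Clone, PartialEq)]
    pub struct Int32Array(pub Vec<i32>);
}

/// Stands in for the proposed `arrow-compute` crate; it depends only on the core.
pub mod arrow_compute {
    use super::arrow_core::Int32Array;

    /// Sum all values, widening to i64 to avoid overflow.
    pub fn sum(array: &Int32Array) -> i64 {
        array.0.iter().map(|&v| i64::from(v)).sum()
    }
}

/// Stands in for the batteries-included `arrow` facade: it only re-exports.
pub mod arrow {
    pub use super::arrow_compute as compute;
    pub use super::arrow_core as array;
}
```

A user of the facade writes `arrow::compute::sum(&arrow::array::Int32Array(vec![1, 2, 3]))`, while a user who only needs the core type can name `arrow_core::Int32Array` directly. Both paths refer to the same type, so values can pass freely between code using either.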

Describe alternatives you've considered

Feature flags are another way this can be handled, however, they have a couple of limitations:

It is impractical to test the full combinatorial explosion of combinations, which allows for bugs to sneak through
They are unified for a target, which limits build parallelism: just because, say, DataFusion depends on arrow with CSV support, that shouldn't force the parquet crate to wait for CSV support to compile before it can start compiling
Poor UX:
- Discoverability is limited, it can be hard to determine what features gate what functionality
- Hard to determine if the feature flag set is minimal, no equivalent of cargo-udeps
- It can be a non-trivial detective exercise to determine why a given feature is being enabled
- Necessitate counter-intuitive hacks to play nicely in multi-crate workspaces - see workspace hack
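To illustrate the combinatorial point, the sketch below enumerates every on/off combination of a set of feature flags. The feature names used in the usage note are only examples, and the helper names are mine.

```rust
/// Number of distinct feature sets for `n` independent on/off flags: 2^n.
fn feature_combinations(n: u32) -> u64 {
    1u64 << n
}

/// Enumerate every subset of the given feature names via a bitmask.
fn all_feature_sets<'a>(features: &[&'a str]) -> Vec<Vec<&'a str>> {
    (0..(1usize << features.len()))
        .map(|mask| {
            features
                .iter()
                .enumerate()
                // Keep feature `i` when bit `i` of the mask is set.
                .filter(|&(i, _)| (mask & (1usize << i)) != 0)
                .map(|(_, name)| *name)
                .collect()
        })
        .collect()
}
```

With three flags such as `["csv", "ipc", "json"]` there are already 8 sets to build and test, and each additional flag doubles that number.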

Additional context

@jimexist recently drove an initiative to do something similar to DataFusion which has worked very well - apache/datafusion#1750

FYI @alamb @jhorstmann @nevi-me


* Split out integration test plumbing (#2594) (#2300) * Fix RAT

alamb · 2022-08-31T13:49:10Z

I like this idea, for what it is worth. 👍

tustvold · 2022-09-09T07:42:32Z

I've started work on this with #2693. I think the final split will likely end up being different from what was initially proposed, based on what components can easily be separated. I next plan to split out array data, followed by the arrays themselves. This should then allow splitting out some of the heavier kernels, e.g. sort, compare, cast, etc...

alamb · 2022-09-09T13:50:15Z

If the history with the datafusion split is any indication, this work is likely to end up generating lots of PRs

You can see how @jimexist broke that down and we tracked it in apache/datafusion#1750 -- perhaps something similar could be applied here.

I am happy to try to review these mechanical PRs promptly so we can get the project done more quickly

andygrove · 2022-09-09T13:51:20Z

How feasible would it be to move the type definitions (such as DataType, Field, and Schema) into an arrow-types crate? Maybe this is a path to arrow2 being able to use a common type system.

tustvold · 2022-09-09T13:53:47Z

I don't see an obvious reason why that would not be possible, though I'm not sure how generally useful the types will be without the array definitions...

andygrove · 2022-09-09T13:59:39Z

The datafusion-sql crate only uses arrow::datatypes for example, so just depending on arrow-types there would presumably help with compilation times.

andygrove · 2022-09-09T14:00:39Z

@jorgecarleitao Would arrow2 and its ecosystem benefit from having an arrow-types crate as discussed here?

jorgecarleitao · 2022-09-12T05:47:29Z

Hey, Thanks for the ping!

I think it would not benefit arrow2 directly right now as it has different declarations for Field (e.g. we do not have dict_id on it). Arrow2 also has an extension (DataType::Extension).

With that said, imo it is still a good design - there are systems that only require DataType, Field, Schema, and functionality to read them from a file. One example is a data catalog based on arrow logical types.

I think that datafusion's logical plans could also only depend on types, but I could be wrong (it depends on how List scalars are represented there?).

alamb · 2022-09-12T10:49:05Z

> I think that datafusion's logical plans could also only depend on types, but I could be wrong (it depends on how List scalars are represented there?).

List scalars are represented as Vec<Box<Field>>, so I agree that the logical plans may be able to depend only on types

https://github.com/apache/arrow-datafusion/blob/d16457a0ba129b077935078e5cf89d028f598e0b/datafusion/common/src/scalar.rs#L81

tustvold · 2022-09-12T17:08:59Z

I've created #2711, which splits out the schema definitions into a crate called arrow-schema. I thought this name conveyed the logical types more clearly than arrow-types would. PTAL 😄

maxburke · 2022-09-14T16:27:37Z

As a downstream user of Arrow, we find that we often need to fork Arrow-ecosystem crates to quickly integrate patches for missing features or bugs. What I'm dreading is having to do that with an exploded Arrow crate: forking half a dozen Arrow packages, Parquet, all of the DataFusion crates, ... seems like it'll be a royal pain.

tustvold · 2022-09-14T16:32:34Z

Hi @maxburke, the intention is to follow the work that was already performed for DataFusion, which would not involve splitting the repository. So I think you shouldn't have to maintain any more or fewer forks following this? The [patch.crates-io] directive would need an entry for every crate, though

* Split out arrow-string (#2594) * Doc * Clippy

* Split out arrow-ord (#2594) * Make LexicographicalComparator public * Tweak CI * Fix SIMD * Doc

alamb · 2022-12-19T21:10:23Z

I wonder is it done yet 🙏

tustvold · 2022-12-19T21:51:37Z

arrow-row and arrow-arith then yes, will likely do tomorrow

alamb · 2022-12-20T01:18:28Z

* Split out arrow-row (#2594) * Fix CI * Fix doc * More SortOptions to arrow_schema

* Split out arrow-arith (#2594) * Update CI * Fix clippy * Update docs * Feature flag * Fix CI * Cleanup dependencies

tustvold added the enhancement label Aug 26, 2022

tustvold mentioned this issue Aug 27, 2022

Add dyn_cmp_dict feature flag to gate dyn comparison of dictionary arrays #2596

Closed

tustvold added a commit to tustvold/arrow-rs that referenced this issue Aug 27, 2022

Split out integration test plumbing (apache#2594) (apache#2300)

e3fb3a1

tustvold added a commit to tustvold/arrow-rs that referenced this issue Aug 27, 2022

Split out integration test plumbing (apache#2594) (apache#2300)

426065a

tustvold mentioned this issue Aug 27, 2022

Split out integration test plumbing (#2594) (#2300) #2598

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Aug 27, 2022

Split out integration test plumbing (apache#2594) (apache#2300)

ea14618

tustvold added a commit to tustvold/arrow-rs that referenced this issue Sep 8, 2022

Deprecate RecordBatch::concat (apache#2594)

2fa4c47

tustvold mentioned this issue Sep 8, 2022

Deprecate RecordBatch::concat replace with concat_batches (#2594) #2683

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Sep 9, 2022

Split out arrow-buffer crate (apache#2594)

75582d5

tustvold added a commit to tustvold/arrow-rs that referenced this issue Sep 9, 2022

Split out arrow-buffer crate (apache#2594)

b9cc1fc

tustvold mentioned this issue Sep 9, 2022

Split out arrow-buffer crate (#2594) #2693

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Sep 12, 2022

Split out arrow-schema (apache#2594)

a2f6a72

tustvold mentioned this issue Sep 12, 2022

Split out arrow-schema (#2594) #2711

Merged

This was referenced Sep 13, 2022

Don't Derive Serialize/Deserialize Serde Implementations for Schema Types #2723

Closed

Move JSON Test Format To integration-testing #2724

Merged

tustvold mentioned this issue Sep 14, 2022

Ability to get "source" / and error stack from an ArrowError to help debugging #2725

Open

tustvold mentioned this issue Nov 30, 2022

Move nullif to arrow-select (#2594) #3241

Merged

tustvold mentioned this issue Dec 2, 2022

Add BooleanArray::from_unary and BooleanArray::from_binary #3258

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 8, 2022

Split out arrow-string (apache#2594)

9799439

tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 8, 2022

Split out arrow-string (apache#2594)

1b0ce96

tustvold mentioned this issue Dec 8, 2022

Split out arrow-string (#2594) #3295

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 8, 2022

Split out arrow-ord (apache#2594)

3357ebc

tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 8, 2022

Split out arrow-ord (apache#2594)

c59ccec

tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 8, 2022

Split out arrow-ord (apache#2594)

7c4222e

tustvold mentioned this issue Dec 8, 2022

Split out arrow-ord (#2594) #3299

Merged

tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 8, 2022

Split out arrow-ord (apache#2594)

d71f8eb

tustvold mentioned this issue Dec 19, 2022

refactor: Reduce how much code is instantiated for comparisons #2365

Closed

tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 20, 2022

Split out arrow-row (apache#2594)

2997b40

tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 20, 2022

Split out arrow-row (apache#2594)

f04d6b4

tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 20, 2022

Split out arrow-row (apache#2594)

9ad9d68

tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 20, 2022

Split out arrow-row (apache#2594)

5c6ce3a

tustvold added a commit to tustvold/arrow-rs that referenced this issue Dec 21, 2022

Split out arrow-arith (apache#2594)

d3137a8

tustvold mentioned this issue Dec 21, 2022

Split out arrow-arith (#2594) #3384

Merged

tustvold closed this as completed in #3384 Dec 21, 2022

tustvold added a commit to tustvold/arrow-rs that referenced this issue Feb 23, 2023

Update MIRI for split crates (apache#2594)

fe27061

tustvold added a commit to tustvold/arrow-rs that referenced this issue Feb 23, 2023

Update MIRI for split crates (apache#2594)

0ba9074

tustvold mentioned this issue Feb 23, 2023

Update MIRI for split crates (#2594) #3754

Merged
