persist: Migrate stats code to `arrow-rs` #27009

ParkMyCar · 2024-05-09T15:45:43Z

This PR refactors all of the Part/statistics related Persist code to use arrow-rs instead of arrow2.

Because of the switch there are a few mechanical changes that I needed to make:

- Non-optional primitive types (e.g. u32) now use Arrow arrays instead of buffers. This is because arrow-rs has both ArrowNativeType (e.g. u32) and ArrowPrimitiveType (e.g. UInt32Type). A ScalarBuffer is generic over a native type, while a PrimitiveArray is generic over a primitive type. Bridging the gap between the native type and the primitive type in a generic impl of StatsFrom was tricky, so using only PrimitiveArrays was easier.
- Arrays in arrow-rs do not impl From<*Builder> as they do in arrow2, so I introduced a ColumnFinish trait, which is used to finalize builders into their corresponding array type.
- Not all array builders implement Default, so I had to remove the generic impl of ColumnMut and instead manually add an impl.
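
For readers unfamiliar with the builder-to-array step, here is a minimal sketch of the idea behind a `ColumnFinish`-style trait. It is not the code from this PR: `U32Builder`, `U32Array`, and `build_column` are hypothetical stand-ins written against the standard library only, and the trait shape (an associated `Array` type plus a consuming `finish`) is an assumption.

```rust
// Hypothetical stand-ins for a builder/array pair. These are NOT arrow-rs types.
#[derive(Debug, PartialEq)]
struct U32Array(Vec<u32>);

struct U32Builder {
    values: Vec<u32>,
}

impl U32Builder {
    // This stand-in deliberately has no `Default` impl and needs a capacity up
    // front, mirroring the "not all builders implement Default" problem.
    fn with_capacity(capacity: usize) -> Self {
        U32Builder { values: Vec::with_capacity(capacity) }
    }

    fn append(&mut self, value: u32) {
        self.values.push(value);
    }
}

/// Assumed trait shape: finalize a builder into its corresponding array type.
trait ColumnFinish {
    type Array;
    fn finish(self) -> Self::Array;
}

impl ColumnFinish for U32Builder {
    type Array = U32Array;

    fn finish(self) -> U32Array {
        U32Array(self.values)
    }
}

/// Generic code can finalize any builder without relying on `From<Builder>`.
fn build_column<B: ColumnFinish>(builder: B) -> B::Array {
    builder.finish()
}

/// Small usage example: build a two-element column.
fn demo() -> U32Array {
    let mut builder = U32Builder::with_capacity(4);
    builder.append(1);
    builder.append(2);
    build_column(builder)
}
```

Because construction is not part of the trait, each builder can take whatever configuration it needs, which is one way to live without a blanket `Default` bound.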

I ran nightly and there are a few failures, but they seem to be unrelated. @bkirwi are there any specific tests I should look out for when changing stats-related code?

Motivation

We discovered a bug in arrow2 w.r.t. writing nested Parquet, which is what writing structured data in Persist relies on. Because of that we need to migrate all of our existing use cases of arrow2 to arrow-rs.

Related #24830

Checklist

This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
This PR includes the following user-facing behavior changes:
- N/a

bkirwi · 2024-05-14T15:15:15Z

> I ran nightly and there are few failures but they seem to be unrelated. @bkirwi are there any specific tests I should look out for when changing stats related code?

@danhhz is the expert on stats specifically! For filter pushdown in general... there's not much of a targeted suite in Nightly, but many of the randomized nightly tests do seem to reliably cover the feature, including parallel workload and some of the generative testing frameworks. (And the interpreter has a good set of targeted unit tests.)

bkirwi

I haven't looked into the individual implementation yet, but structurally this seems quite sensible. (And overall I like the new Arrow APIs.)

The commit name etc. is a bit vague - but feel free to ignore if you're planning on squashing of course.

src/persist-types/src/columnar.rs

src/persist-types/src/dyn_struct.rs

rjobanp

LGTM - I agree the API seems a lot clearer now!

rjobanp · 2024-05-15T13:49:27Z

src/persist-types/src/parquet.rs

+        .set_statistics_enabled(EnabledStatistics::None)
+        .set_compression(Compression::UNCOMPRESSED)
+        .set_writer_version(WriterVersion::PARQUET_2_0)
+        .set_data_page_size_limit(1024 * 1024)


does this match the default size previously set by arrow2? I wonder if this would have any encode/decode perf impact if different.

> does this match the default size previously set by arrow2?

Exactly!

It probably has some perf impact, but at the moment I wanted to keep things 1:1 with arrow2.

rjobanp · 2024-05-15T13:50:05Z

src/persist-types/src/parquet.rs

+
+    let schema = Arc::new(ArrowSchema::new(fields));
+    let props = WriterProperties::builder()
+        .set_dictionary_enabled(false)


why do we not use dictionary encoding?

I can't remember, but arrow2 might not support dictionary encoding for all types? Also, encodings and compression are something we haven't looked into yet, so for the time being we've defaulted to Encoding::Plain. But I'm hoping to experiment with this more soon!

danhhz

A big diff, but pretty straightforward! \o/

Let's wrap up discussion of what else we can do to derisk/test this in the slack thread we already have about it, but otherwise this seems good to go!

danhhz · 2024-05-15T17:00:45Z

Cargo.lock

@@ -5459,14 +5438,14 @@ name = "mz-persist-types"
 version = "0.0.0"
 dependencies = [
 "anyhow",
- "arrow2",
+ "arrow",


Any idea why arrow2 doesn't disappear from the dep graph completely?

We still have it for writing (k, v, t, d) to S3 if the persist_use_arrow_rs_library flag is off. I do plan to remove that flag soon, assuming the rollout goes well.

Ah, right! lol 🤦

danhhz · 2024-05-15T18:08:05Z

src/persist-types/src/stats.rs

-            let lower = arrow2::compute::aggregate::min_boolean(&array).unwrap_or_default();
-            let upper = arrow2::compute::aggregate::max_boolean(&array).unwrap_or_default();
+    impl StatsFrom<BooleanBuffer> for PrimitiveStats<bool> {
+        fn stats_from(col: &BooleanBuffer, validity: ValidityRef) -> Self {


Happy to leave this for a follow-up, but should we update all these "validity" names to match the arrow jargon, which seems to be something like "logical nulls"?

That sounds good to me! I'll do that in a follow-up
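
To make the role of the validity argument concrete, here is a rough sketch of computing boolean bounds while skipping null slots. It uses plain slices rather than arrow-rs types; `bool_bounds` is a hypothetical helper, and the convention that `true` in the mask means "valid" is an assumption. The `unwrap_or_default()` fallback mirrors the removed arrow2 lines in the diff above.

```rust
/// Returns `(lower, upper)` over the non-null values of a boolean column.
/// `validity` is an optional mask where `true` marks a valid (non-null) slot;
/// `None` means every slot is valid. Falls back to `false` when no valid
/// value exists, like `unwrap_or_default()` in the diff.
fn bool_bounds(values: &[bool], validity: Option<&[bool]>) -> (bool, bool) {
    let valid: Vec<bool> = values
        .iter()
        .enumerate()
        // Keep only the slots that the mask marks as valid.
        .filter(|(i, _)| validity.map_or(true, |mask| mask[*i]))
        .map(|(_, v)| *v)
        .collect();

    // `bool` is ordered with `false < true`, so min/max give the bounds.
    let lower = valid.iter().copied().min().unwrap_or_default();
    let upper = valid.iter().copied().max().unwrap_or_default();
    (lower, upper)
}
```

A null slot never influences the bounds, whatever value happens to be stored underneath it.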

danhhz · 2024-05-15T18:23:30Z

src/persist-types/src/codec_impls.rs

+impl ColumnMut<()> for BooleanBufferBuilder {
+    fn new(_cfg: &()) -> Self {
+        // Note(parkmycar): This capacity was picked arbitrarily.
+        BooleanBufferBuilder::new(128)


Looks like BooleanBuilder::default() below ends up using 1024 for the capacity. We could make this match to be less arbitrary?

Ahh great call! Let me update this

This PR adds some more testing around our `Part` statistics. Specifically it adds two new tests:

1. A `proptest` for correctness. We generate arbitrary `ColumnType`s, and use that `ColumnType` to generate an arbitrary `Vec<Row>`, then we calculate stats on that collection of `Row`s and assert that every `Row` would be included in the stats.
2. A test for stats stability. We use `proptest` with a constant seed to generate 1,000 instances of `RelationDesc`s with at most 4 columns, then a collection of at most 8 `Row`s for these `RelationDesc`s. We generate statistics for all 1,000 scenarios and then take a JSON snapshot of the stats. This test helps us track if any changes occur to our statistics generation.

I'm curious what folks thoughts are on the second test, I'm more than happy to not merge it and use it only to validate #27009, if we don't think it provides a ton of signal.

### Motivation

Protect against stats breaking, e.g. in changes like #27009

### Tips for reviewer

The PR is broken up into 2 commits:

1. Proptest strategies to generate `Datum`s from a `ColumnType`, and adding the first test.
2. The snapshot test.

### Checklist

- [ ] This PR has adequate test coverage / QA involvement has been duly considered. ([trigger-ci for additional test/nightly runs](https://trigger-ci.dev.materialize.com/))
- [ ] This PR has an associated up-to-date [design doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md), is a design doc ([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)), or is sufficiently small to not require a design.
- [ ] If this PR evolves [an existing `$T ⇔ Proto$T` mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md) (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label ([example](MaterializeInc/cloud#5021)).
- [x] This PR includes the following [user-facing behavior changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note):
  - N/a
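
The correctness property described in that PR (every row must fall inside the stats computed from it) can be sketched without any external crates. This is only an illustration under stated assumptions: `MinMax`, `stats_from`, `includes`, and the small deterministic generator are hypothetical stand-ins for the real stats types and for `proptest` with a constant seed.

```rust
/// Hypothetical min/max statistics for a column of `i64` values.
#[derive(Debug, Clone, Copy, PartialEq)]
struct MinMax {
    lower: i64,
    upper: i64,
}

/// Computes bounds over the column, or `None` for an empty column.
fn stats_from(values: &[i64]) -> Option<MinMax> {
    let lower = *values.iter().min()?;
    let upper = *values.iter().max()?;
    Some(MinMax { lower, upper })
}

/// Whether a value would be kept by a filter that consults the stats.
fn includes(stats: &MinMax, value: i64) -> bool {
    stats.lower <= value && value <= stats.upper
}

/// Tiny deterministic generator (a linear congruential step), standing in
/// for proptest with a fixed seed. Not suitable for anything but a demo.
fn pseudo_random_values(seed: u64, len: usize) -> Vec<i64> {
    let mut state = seed;
    (0..len)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            // Take the high bits and shift into a signed range.
            (state >> 33) as i64 - (1 << 30)
        })
        .collect()
}

/// The property under test: every generated value is included in the stats.
fn check_all_rows_included(seed: u64, len: usize) -> bool {
    let values = pseudo_random_values(seed, len);
    match stats_from(&values) {
        Some(stats) => values.iter().all(|v| includes(&stats, *v)),
        None => values.is_empty(),
    }
}
```

Because the generator is seeded, the same inputs are produced on every run, which is also what makes a snapshot of the resulting stats meaningful for the stability test.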

ParkMyCar force-pushed the persist/arrow-rs_columnar branch from 265ac32 to 37da153 Compare May 13, 2024 16:49

ParkMyCar mentioned this pull request May 13, 2024

persist: Write structured columnar data to S3 #26561

Open

5 tasks

ParkMyCar force-pushed the persist/arrow-rs_columnar branch 2 times, most recently from 26b85a4 to de8b9bd Compare May 13, 2024 21:29

ParkMyCar marked this pull request as ready for review May 13, 2024 21:29

ParkMyCar requested review from a team as code owners May 13, 2024 21:29

ParkMyCar requested review from danhhz and bkirwi May 13, 2024 21:38

bkirwi reviewed May 14, 2024

src/persist-types/src/columnar.rs (outdated)

src/persist-types/src/dyn_struct.rs

This was referenced May 14, 2024

[dnm] persist: explore removing DynStruct #27084

Draft

[dnm] persist: Set stats audit to 100% for arrow-rs refactor #27087

Closed

rjobanp approved these changes May 15, 2024

danhhz approved these changes May 15, 2024

ParkMyCar mentioned this pull request May 15, 2024

persist: Add more testing around stats #27117

Merged

5 tasks

ParkMyCar added 3 commits May 16, 2024 11:53

start

540bd01

merge ColumnFinish into ColumnPush

ca19580

update default capacity we use for BooleanBufferBuilder

84b7d1a

ParkMyCar force-pushed the persist/arrow-rs_columnar branch from 4de3ad4 to 84b7d1a Compare May 16, 2024 15:54

ParkMyCar merged commit 4ab17bf into MaterializeInc:main May 16, 2024
74 checks passed

materialize-bot mentioned this pull request May 16, 2024

release: v0.100.0 required reviews #27141

Closed

8 tasks
