Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

persist: Migrate stats code to arrow-rs #27009

Merged
merged 3 commits into from
May 16, 2024

Conversation

ParkMyCar
Copy link
Member

@ParkMyCar ParkMyCar commented May 9, 2024

This PR refactors all of the Part/statistics related Persist code to use arrow-rs instead of arrow2.

Because of the switch there are a few mechanical changes that I needed to make:

  1. Non-optional primitive types (e.g. u32) now use Arrow arrays instead of buffers. This is because arrow-rs has both ArrowNativeType (e.g. u32) and ArrowPrimitiveType (e.g. UInt32Type). A ScalarBuffer is generic over a native type, while a PrimitiveArray is generic over a primitive type. Trying to bridge the gap between the native type and the primitive type in a generic impl of StatsFrom was tricky, and using just PrimitiveArrays was easier.
  2. Arrays in arrow-rs do not impl From<*Builder> like in arrow2. So I introduced trait ColumnFinish which is used to finalize builders into their corresponding array type.
  3. Not all array builders implement Default, so I had to remove the generic impl of ColumnMut and instead manually add an impl.

I ran nightly and there are few failures but they seem to be unrelated. @bkirwi are there any specific tests I should look out for when changing stats related code?

Motivation

We discovered a bug in arrow2 w.r.t. writing nested Parquet, which is what writing structured data in Persist relies on. Because of that we need to migrate all of our existing use cases of arrow2 to arrow-rs.

Related #24830

Checklist

@ParkMyCar ParkMyCar force-pushed the persist/arrow-rs_columnar branch 2 times, most recently from 26b85a4 to de8b9bd Compare May 13, 2024 21:29
@ParkMyCar ParkMyCar marked this pull request as ready for review May 13, 2024 21:29
@ParkMyCar ParkMyCar requested review from a team as code owners May 13, 2024 21:29
@ParkMyCar ParkMyCar requested review from danhhz and bkirwi May 13, 2024 21:38
@bkirwi
Copy link
Contributor

bkirwi commented May 14, 2024

I ran nightly and there are few failures but they seem to be unrelated. @bkirwi are there any specific tests I should look out for when changing stats related code?

@danhhz is the expert on stats specifically! For filter pushdown in general... there's not much of a targeted suite in Nightly, but many of the randomized nightly tests do seem to reliably cover the feature, including parallel workload and some of the generative testing frameworks. (And the interpreter has a good set of targeted unit tests.)

Copy link
Contributor

@bkirwi bkirwi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked into the individual implementation yet, but structurally this seems quite sensible. (And overall I like the new Arrow APIs.)

The commit name etc. is a bit vague - but feel free to ignore if you're planning on squashing of course.

src/persist-types/src/columnar.rs Outdated Show resolved Hide resolved
src/persist-types/src/dyn_struct.rs Show resolved Hide resolved
Copy link
Contributor

@rjobanp rjobanp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I agree the API seems a lot clearer now!

.set_statistics_enabled(EnabledStatistics::None)
.set_compression(Compression::UNCOMPRESSED)
.set_writer_version(WriterVersion::PARQUET_2_0)
.set_data_page_size_limit(1024 * 1024)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this match the default size previously set by arrow2? I wonder if this would have any encode/decode perf impact if different.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this match the default size previously set by arrow2?
Exactly!

It probably has some perf impact, but at the moment wanted to keep things 1:1 with arrow2


let schema = Arc::new(ArrowSchema::new(fields));
let props = WriterProperties::builder()
.set_dictionary_enabled(false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we not use dictionary encoding?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remember but arrow2 might not support dictionary encoding for all types? Also though encodings and compression is something we haven't looked into yet so for the time being we've defaulted to Encoding::Plain. But I'm hoping to experiment with this more soon!

Copy link
Contributor

@danhhz danhhz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A big diff, but pretty straightforward! \o/

Let's wrap up discussion of what else we can do to derisk/test this in the slack thread we already have about it, but otherwise this seems good to go!

@@ -5459,14 +5438,14 @@ name = "mz-persist-types"
version = "0.0.0"
dependencies = [
"anyhow",
"arrow2",
"arrow",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea why arrow2 doesn't dissappear from the dep graph completely?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still have it for writing (k, v, t, d) to S3 if the persist_use_arrow_rs_library is off. I do plan to remove that flag soon, assuming the rollout goes well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right! lol 🤦

let lower = arrow2::compute::aggregate::min_boolean(&array).unwrap_or_default();
let upper = arrow2::compute::aggregate::max_boolean(&array).unwrap_or_default();
impl StatsFrom<BooleanBuffer> for PrimitiveStats<bool> {
fn stats_from(col: &BooleanBuffer, validity: ValidityRef) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to leave this for a follow-up, but should we update all these "validity" names to match the arrow jargon, which seems to be something like "logical nulls"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me! I'll do that in a follow-up

impl ColumnMut<()> for BooleanBufferBuilder {
fn new(_cfg: &()) -> Self {
// Note(parkmycar): This capacity was picked arbitrarily.
BooleanBufferBuilder::new(128)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like BooleanBuilder::default() below ends up using 1024 for the capacity. We could make this match to be less arbitrary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh great call! Let me update this

@ParkMyCar ParkMyCar merged commit 4ab17bf into MaterializeInc:main May 16, 2024
74 checks passed
ParkMyCar added a commit that referenced this pull request May 16, 2024
This PR adds some more testing around our `Part` statistics.
Specifically it adds two new tests:

1. A `proptest` for correctness. We generate arbitrary `ColumnType`s,
and use that `ColumnType` to generate an arbitrary `Vec<Row>`, then we
calculate stats on that collection of `Row`s and assert that every `Row`
would be included in the stats.
2. A test for stats stability. We use `proptest` with a constant seed to
generate 1,000 instances of `RelationDesc`s with at most 4 columns, then
a collection of at most 8 `Row`s for these `RelationDesc`s. We generate
statistics for all 1,000 scenarios and then take a JSON snapshot of the
stats. This test helps us track if any changes occur to our statistics
generation.

I'm curious what folks thoughts are on the second test, I'm more than
happy to not merge it and use it only to validate
#27009, if we don't
think it provides a ton of signal.

### Motivation

Protect against stats breaking, e.g. in changes like
#27009

### Tips for reviewer

The PR is broken up into 2 commits:

1. Proptest strategies to generate `Datum`s from a `ColumnType`, and
adding the first test.
2. The snapshot test.

### Checklist

- [ ] This PR has adequate test coverage / QA involvement has been duly
considered. ([trigger-ci for additional test/nightly
runs](https://trigger-ci.dev.materialize.com/))
- [ ] This PR has an associated up-to-date [design
doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md),
is a design doc
([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)),
or is sufficiently small to not require a design.
  <!-- Reference the design in the description. -->
- [ ] If this PR evolves [an existing `$T ⇔ Proto$T`
mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md)
(possibly in a backwards-incompatible way), then it is tagged with a
`T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests,
there is a companion cloud PR to account for those changes that is
tagged with the release-blocker label
([example](MaterializeInc/cloud#5021)).
<!-- Ask in #team-cloud on Slack if you need help preparing the cloud
PR. -->
- [x] This PR includes the following [user-facing behavior
changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note):
  - N/a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants