persist: Migrate stats code to arrow-rs
#27009
Conversation
@danhhz is the expert on stats specifically! For filter pushdown in general... there's not much of a targeted suite in Nightly, but many of the randomized nightly tests do seem to reliably cover the feature, including parallel workload and some of the generative testing frameworks. (And the interpreter has a good set of targeted unit tests.) |
I haven't looked into the individual implementation yet, but structurally this seems quite sensible. (And overall I like the new Arrow APIs.)
The commit name etc. is a bit vague - but feel free to ignore if you're planning on squashing of course.
LGTM - I agree the API seems a lot clearer now!
```rust
.set_statistics_enabled(EnabledStatistics::None)
.set_compression(Compression::UNCOMPRESSED)
.set_writer_version(WriterVersion::PARQUET_2_0)
.set_data_page_size_limit(1024 * 1024)
```
does this match the default size previously set by arrow2? I wonder if this would have any encode/decode perf impact if different.
> does this match the default size previously set by arrow2?

Exactly! It probably has some perf impact, but at the moment I wanted to keep things 1:1 with arrow2.
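For intuition on what the 1 MiB data page limit means, here is a std-only sketch (illustrative only; `approx_page_count` is a hypothetical helper, not an API of either parquet crate): a column chunk is split into a new data page roughly each time the accumulated bytes cross the limit, so the limit largely determines page count.

```rust
/// Hypothetical helper: approximate the number of data pages a column
/// chunk of `total_bytes` splits into under `page_limit` bytes per page.
/// Real writers also flush pages on row-count and encoding boundaries.
fn approx_page_count(total_bytes: usize, page_limit: usize) -> usize {
    total_bytes.div_ceil(page_limit).max(1)
}

fn main() {
    // With the 1024 * 1024 limit above, ~3 MiB of data is ~3 pages.
    assert_eq!(approx_page_count(3 * 1024 * 1024, 1024 * 1024), 3);
    // Even a tiny chunk still occupies one page.
    assert_eq!(approx_page_count(1, 1024 * 1024), 1);
}
```

Smaller pages improve the granularity of page-level pruning at the cost of more page headers, which is one way a different default could show up as an encode/decode perf difference.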
```rust
let schema = Arc::new(ArrowSchema::new(fields));
let props = WriterProperties::builder()
    .set_dictionary_enabled(false)
```
why do we not use dictionary encoding?
I can't remember, but arrow2 might not support dictionary encoding for all types? Also, encodings and compression are something we haven't looked into yet, so for the time being we've defaulted to `Encoding::Plain`. But I'm hoping to experiment with this more soon!
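As a std-only sketch of what that experiment is weighing (a toy, not the parquet crate's API; `dictionary_encode` and its shapes are illustrative): dictionary encoding stores each distinct value once and replaces the column with small indexes, which pays off on low-cardinality columns.

```rust
use std::collections::HashMap;

// Toy dictionary encoder: distinct values go into `dict`, and the
// column itself becomes indexes into it. Repeated values are stored once.
fn dictionary_encode(vals: &[&str]) -> (Vec<String>, Vec<usize>) {
    let mut dict: Vec<String> = Vec::new();
    let mut seen: HashMap<String, usize> = HashMap::new();
    let mut keys = Vec::with_capacity(vals.len());
    for v in vals {
        let idx = *seen.entry(v.to_string()).or_insert_with(|| {
            dict.push(v.to_string());
            dict.len() - 1
        });
        keys.push(idx);
    }
    (dict, keys)
}

fn main() {
    let (dict, keys) = dictionary_encode(&["us-east", "us-west", "us-east"]);
    assert_eq!(dict, vec!["us-east", "us-west"]);
    assert_eq!(keys, vec![0, 1, 0]);
}
```

The flip side is that high-cardinality columns gain little and pay for the dictionary, which is why per-column tuning (rather than a blanket default) is usually where this experimentation ends up.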
A big diff, but pretty straightforward! \o/
Let's wrap up discussion of what else we can do to derisk/test this in the slack thread we already have about it, but otherwise this seems good to go!
```diff
@@ -5459,14 +5438,14 @@ name = "mz-persist-types"
 version = "0.0.0"
 dependencies = [
  "anyhow",
- "arrow2",
+ "arrow",
```
Any idea why arrow2 doesn't disappear from the dep graph completely?
We still have it for writing `(k, v, t, d)` to S3 if the `persist_use_arrow_rs_library` flag is off. I do plan to remove that flag soon, assuming the rollout goes well.
Ah, right! lol 🤦
```rust
let lower = arrow2::compute::aggregate::min_boolean(&array).unwrap_or_default();
let upper = arrow2::compute::aggregate::max_boolean(&array).unwrap_or_default();

impl StatsFrom<BooleanBuffer> for PrimitiveStats<bool> {
    fn stats_from(col: &BooleanBuffer, validity: ValidityRef) -> Self {
```
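The `unwrap_or_default()` calls mean an empty or all-null column reports `(false, false)`. A std-only sketch of that min/max logic (hypothetical helper, not the arrow-rs or arrow2 API; `valid` stands in for the validity/null bitmap):

```rust
// Sketch of bool column stats: the min is false if any valid value is
// false, the max is true if any valid value is true, and a column with
// no valid values defaults to (false, false), matching
// `unwrap_or_default()` in the snippet above.
fn bool_min_max(vals: &[bool], valid: &[bool]) -> (bool, bool) {
    let mut any = false;
    let (mut lower, mut upper) = (true, false);
    for (v, ok) in vals.iter().zip(valid) {
        if *ok {
            any = true;
            lower &= *v;
            upper |= *v;
        }
    }
    if any { (lower, upper) } else { (false, false) }
}

fn main() {
    assert_eq!(bool_min_max(&[true, false, true], &[true, true, true]), (false, true));
    // All values null: falls back to the default.
    assert_eq!(bool_min_max(&[true, true], &[false, false]), (false, false));
}
```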
Happy to leave this for a follow-up, but should we update all these "validity" names to match the arrow jargon, which seems to be something like "logical nulls"?
That sounds good to me! I'll do that in a follow-up
src/persist-types/src/codec_impls.rs (Outdated)
```rust
impl ColumnMut<()> for BooleanBufferBuilder {
    fn new(_cfg: &()) -> Self {
        // Note(parkmycar): This capacity was picked arbitrarily.
        BooleanBufferBuilder::new(128)
```
Looks like `BooleanBuilder::default()` below ends up using `1024` for the capacity. We could make this match to be less arbitrary?
Ahh great call! Let me update this
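For what it's worth, the practical cost of the smaller capacity is a few extra reallocations while the builder grows. A std-only toy model (assuming doubling growth, which is typical but not something `BooleanBufferBuilder` guarantees):

```rust
// Toy model: count the capacity doublings needed to grow from
// `initial_cap` to `final_len`. Starting at 128 instead of 1024 costs
// a handful of extra grows (and copies) per builder.
fn reallocations(initial_cap: usize, final_len: usize) -> u32 {
    let mut cap = initial_cap.max(1);
    let mut count = 0;
    while cap < final_len {
        cap *= 2;
        count += 1;
    }
    count
}

fn main() {
    assert_eq!(reallocations(128, 1024), 3); // 128 -> 256 -> 512 -> 1024
    assert_eq!(reallocations(1024, 1024), 0);
}
```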
This PR adds some more testing around our `Part` statistics. Specifically it adds two new tests:

1. A `proptest` for correctness. We generate arbitrary `ColumnType`s, use each `ColumnType` to generate an arbitrary `Vec<Row>`, then calculate stats on that collection of `Row`s and assert that every `Row` would be included in the stats.
2. A test for stats stability. We use `proptest` with a constant seed to generate 1,000 instances of `RelationDesc`s with at most 4 columns, then a collection of at most 8 `Row`s for these `RelationDesc`s. We generate statistics for all 1,000 scenarios and then take a JSON snapshot of the stats. This test helps us track if any changes occur to our statistics generation.

I'm curious what folks' thoughts are on the second test; I'm more than happy to not merge it and use it only to validate #27009 if we don't think it provides a ton of signal.

### Motivation

Protect against stats breaking, e.g. in changes like #27009.

### Tips for reviewer

The PR is broken up into 2 commits:

1. Proptest strategies to generate `Datum`s from a `ColumnType`, and adding the first test.
2. The snapshot test.

### Checklist

- [ ] This PR has adequate test coverage / QA involvement has been duly considered. ([trigger-ci for additional test/nightly runs](https://trigger-ci.dev.materialize.com/))
- [ ] This PR has an associated up-to-date [design doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md), is a design doc ([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)), or is sufficiently small to not require a design.
- [ ] If this PR evolves [an existing `$T ⇔ Proto$T` mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md) (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label ([example](MaterializeInc/cloud#5021)).
- [x] This PR includes the following [user-facing behavior changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note):
  - N/a
This PR refactors all of the `Part`/statistics related Persist code to use `arrow-rs` instead of `arrow2`. Because of the switch there are a few mechanical changes that I needed to make:

* Stats for primitive types (e.g. `u32`) now use Arrow arrays instead of buffers. This is because `arrow-rs` has both `ArrowNativeType` (e.g. `u32`) and `ArrowPrimitiveType` (e.g. `UInt32Type`). A `ScalarBuffer` is generic over a native type, while a `PrimitiveArray` is generic over a primitive type. Trying to bridge the gap between the native type and the primitive type in a generic impl of `StatsFrom` was tricky, and using just `PrimitiveArray`s was easier.
* Builders in `arrow-rs` do not impl `From<*Builder>` like in `arrow2`. So I introduced `trait ColumnFinish`, which is used to finalize builders into their corresponding array type.
* Builders also do not impl `Default`, so I had to remove the generic impl of `ColumnMut` and instead manually add an impl.

I ran nightly and there are a few failures, but they seem to be unrelated. @bkirwi are there any specific tests I should look out for when changing stats related code?
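To make the `ColumnFinish` idea concrete, here is a minimal std-only sketch (names and shapes are illustrative stand-ins, not the actual mz-persist-types definitions):

```rust
// Illustrative stand-in for the real trait: a mutable builder is
// consumed ("finished") into its immutable array type, filling the gap
// left by arrow-rs builders not offering `From<*Builder>` conversions.
trait ColumnFinish {
    type Array;
    fn finish(self) -> Self::Array;
}

// Toy builder standing in for e.g. a `BooleanBufferBuilder`.
struct ToyBoolBuilder(Vec<bool>);

impl ColumnFinish for ToyBoolBuilder {
    type Array = Vec<bool>; // stand-in for an Arrow `BooleanArray`
    fn finish(self) -> Self::Array {
        self.0
    }
}

fn main() {
    let mut b = ToyBoolBuilder(Vec::new());
    b.0.push(true);
    b.0.push(false);
    assert_eq!(b.finish(), vec![true, false]);
}
```

Because `finish` takes `self` by value, the builder cannot be reused after finalization, which the compiler enforces for free.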
### Motivation

We discovered a bug in `arrow2` w.r.t. writing nested Parquet, which is what writing structured data in Persist relies on. Because of that we need to migrate all of our existing use cases of `arrow2` to `arrow-rs`.

Related: #24830
### Checklist

- [ ] If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.