feat: support reading and writing StringView and BinaryView in parquet (part 2) #5557
base: master
Conversation
Force-pushed from 17695d3 to 1c7260e
The parquet wasm32 build check always failed, but it works on my machine...
Thanks @ariesdevil -- I think @tustvold is away for a few days. I will try and give this PR a look over the next day or two. Very exciting!
Force-pushed from d2f1a30 to fdd2e48
StringView and BinaryView in parquet
Thank you @ariesdevil -- I think this is looking close to something we can merge 🙏
To merge this, I think we need:
- Benchmarks for reading/writing StringView / BinaryViews
- Cover a few more cases for round tripping
I think there is quite a bit more performance to be had as well by avoiding string copies. This PR copies the data twice, I think (once out of the parquet buffer and once out of the offset buffer).
It would be nice to avoid at least one of those copies prior to merging, but I also think that once we have the test coverage it would be OK to merge this PR and then do additional optimizations as follow-on PRs.
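For context on where one of those copies comes from, here is a minimal, self-contained sketch (hypothetical names, plain std Rust rather than the actual arrow-rs builders) of how an offset-based array is materialized: every value's bytes are copied into one new contiguous data buffer.

```rust
// Hypothetical std-only sketch (not the actual arrow-rs builders): building
// an offset-based (StringArray-style) layout copies every value's bytes
// into one new contiguous data buffer.
fn build_offsets(values: &[&[u8]]) -> (Vec<i32>, Vec<u8>) {
    let mut offsets = vec![0i32];
    let mut data = Vec::new();
    for v in values {
        // This is the copy a view-based layout can avoid: the bytes move
        // out of their source buffer into the new contiguous one.
        data.extend_from_slice(v);
        offsets.push(data.len() as i32);
    }
    (offsets, data)
}

fn main() {
    let (offsets, data) = build_offsets(&[b"ab".as_slice(), b"cde".as_slice()]);
    assert_eq!(offsets, vec![0, 2, 5]);
    assert_eq!(data, b"abcde".to_vec());
}
```

A view-based array can instead record (offset, length) references into a shared buffer, which is what makes skipping this copy possible.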
We can add benchmarks here:
- reading: `arrow-rs/parquet/benches/arrow_reader.rs`, lines 867 to 868 in 7e134f4:

```rust
// string benchmarks
//==============================
```

- writing: `arrow-rs/parquet/benches/arrow_writer.rs`, lines 382 to 392 in 7e134f4:

```rust
let batch = create_string_bench_batch(4096, 0.25, 0.75).unwrap();
group.throughput(Throughput::Bytes(
    batch
        .columns()
        .iter()
        .map(|f| f.get_array_memory_size() as u64)
        .sum(),
));
group.bench_function("4096 values string", |b| {
    b.iter(|| write_batch(&batch).unwrap())
});
```
End to end round trip tests:
This is a really nice step forward -- thank you @ariesdevil
Force-pushed from f37c3dd to f65807f
Thank you @ariesdevil -- I think this PR looks like a really nice incremental step.
As I think we have identified, it can be significantly improved performance-wise, but with the basic tests and benchmarks in place we are in a good position to optimize it.
I left some other smaller style / documentation comments that might be nice to clean up, but I don't think they are strictly required.
Let's wait another day for @tustvold to see if he wants to review (I think he is still not back yet), but otherwise we'll merge this PR and add additional features as a follow-on PR.
Force-pushed from a22ee14 to d286521
I plan to give this another review today
The integration test https://github.com/apache/arrow-rs/actions/runs/8553472637/job/23436722309?pr=5557 seems to have failed for reasons unrelated to this PR. I have restarted it.
I am sorry @ariesdevil -- I need more time and a fresh set of eyes to review the new implementation in this PR. I will find time over the next few days.
Hi @alamb, the new implementation has indeed changed a lot. It should have been implemented in another PR, but the original performance was really poor. Thanks for your patience and have a nice weekend.
Thank you @ariesdevil
I left some comments on this PR
I would like to propose merging in the first version you had (with tests and benchmarks) as a separate PR: #5618. Then we can proceed with this PR (or a new one) for the optimization
I think this will help improve the cycle time for reviews as the PRs are smaller and the reviews can be kept more focused. Please let me know what you think
Title changed from "StringView and BinaryView in parquet" to "StringView and BinaryView in parquet (part 2)"
Perhaps we can either rebase this PR against main or maybe start a new PR
From the discussion here ( #5618 (comment) ), the Views type would keep the views schema. However, I think writing as StringView/BinaryView doesn't mean the read type is BinaryView/StringView. Should we just use Binary/String? Otherwise, we need extra checking to handle data written as a view type but read as a string. If StringView/BinaryView is kept in the metadata, I hope a parquet-testing file for this could be added, to test that a legacy parquet reader can read this as string/binary without casting.
Force-pushed from a76d47e to 4a543e2
It's a design choice that requires @alamb's and @tustvold's judgment.
I hope to review this PR later today, and if not I will do it tomorrow. I need to make sure I have enough contiguous time.
I agree with @mapleFU's observation that special handling of StringView / BinaryView will be substantially more code. The reason it would be valuable is that I think it can save a copy. For example, as I understand it from the parquet encodings doc:
So the data looks like this (length prefix, followed by the bytes):
To make a StringArray, those bytes must be copied to a new buffer so they are contiguous:
However, for a StringView array, the raw bytes can be used without copying
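As a rough illustration of that point, here is a hypothetical, std-only sketch (not the actual arrow-rs reader) of decoding PLAIN-encoded BYTE_ARRAY values, each a 4-byte little-endian length prefix followed by the bytes, into views that reference the page buffer by offset and length instead of copying the bytes out:

```rust
// Hypothetical sketch (std-only, not the actual arrow-rs reader): decode a
// PLAIN-encoded BYTE_ARRAY page, where each value is a 4-byte little-endian
// length prefix followed by its bytes, into (offset, len) views that
// reference the page buffer instead of copying the bytes out.
#[derive(Debug, PartialEq)]
struct View {
    offset: usize, // where the value's bytes start inside the page buffer
    len: usize,    // value length in bytes
}

fn decode_plain_views(page: &[u8]) -> Vec<View> {
    let mut views = Vec::new();
    let mut pos = 0;
    while pos + 4 <= page.len() {
        let len = u32::from_le_bytes(page[pos..pos + 4].try_into().unwrap()) as usize;
        pos += 4;
        views.push(View { offset: pos, len });
        pos += len;
    }
    views
}

fn main() {
    // Encode two values, "hello" and "parquet", the way PLAIN would.
    let mut page = Vec::new();
    for v in ["hello", "parquet"] {
        page.extend_from_slice(&(v.len() as u32).to_le_bytes());
        page.extend_from_slice(v.as_bytes());
    }
    let views = decode_plain_views(&page);
    // No value bytes were copied: each view just slices the original page.
    assert_eq!(&page[views[0].offset..views[0].offset + views[0].len], b"hello");
    assert_eq!(&page[views[1].offset..views[1].offset + views[1].len], b"parquet");
}
```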
So we should write without the views schema and use a view reader to read into a view array?
I am not sure -- I am trying to figure out what the current PR does. I am not as familiar with this code as I would like.
OK, if you have any questions you want a quick answer to, you can DM me on the Discord channel.
I spent a while playing around with this PR -- I think it is a really nice improvement
Here is what I think is needed to merge this PR:
- Validate utf8 data (I left comments). I think we need a test for that too -- I'll make a PR to make that easier to add.
- Add a test for the remaining code in `parquet/src/arrow/array_reader/byte_view_array.rs`, or throw a "not implemented" error.
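On the utf8 validation point, here is a minimal sketch of what per-value validation could look like (the `try_push` name and signature are hypothetical, std-only, not the actual arrow-rs API): StringView values must be valid UTF-8, while BinaryView values can accept arbitrary bytes.

```rust
// Minimal hypothetical sketch of per-value UTF-8 validation: StringView
// values must be valid UTF-8, while BinaryView can skip the check.
fn try_push(value: &[u8], validate_utf8: bool) -> Result<(), String> {
    if validate_utf8 {
        // Reject invalid bytes instead of silently exposing them as a string.
        std::str::from_utf8(value)
            .map_err(|e| format!("invalid UTF-8 in byte array: {e}"))?;
    }
    Ok(())
}

fn main() {
    assert!(try_push("héllo".as_bytes(), true).is_ok());
    // 0xC0 can never appear in well-formed UTF-8.
    assert!(try_push(&[0x66, 0xC0], true).is_err());
    // With validation disabled (the BinaryView case), any bytes are accepted.
    assert!(try_push(&[0x66, 0xC0], false).is_ok());
}
```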
Benchmarks
According to the benchmarks, this PR improves performance substantially (like 100x). Really nice 👏 (though I expect some of that will be reduced when we do utf8 validation)
```text
$ cargo bench --features="arrow test_common experimental" --bench arrow_reader -- StringViewArray
...
Running benches/arrow_reader.rs (target/release/deps/arrow_reader-757988025cac113a)
arrow_array_reader/StringViewArray/plain encoded, mandatory, no NULLs
    time:   [161.81 µs 163.56 µs 165.11 µs]
    change: [-99.642% -99.639% -99.635%] (p = 0.00 < 0.05)
    Performance has improved.
arrow_array_reader/StringViewArray/plain encoded, optional, no NULLs
    time:   [166.35 µs 168.61 µs 170.69 µs]
    change: [-99.630% -99.625% -99.621%] (p = 0.00 < 0.05)
    Performance has improved.
arrow_array_reader/StringViewArray/plain encoded, optional, half NULLs
    time:   [114.39 µs 114.97 µs 115.56 µs]
    change: [-98.851% -98.834% -98.820%] (p = 0.00 < 0.05)
    Performance has improved.
Found 24 outliers among 100 measurements (24.00%)
    19 (19.00%) low severe
    3 (3.00%) low mild
    2 (2.00%) high mild
arrow_array_reader/StringViewArray/dictionary encoded, mandatory, no NULLs
    time:   [154.80 µs 156.25 µs 157.57 µs]
    change: [-99.133% -99.128% -99.122%] (p = 0.00 < 0.05)
    Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
    2 (2.00%) low mild
    2 (2.00%) high mild
arrow_array_reader/StringViewArray/dictionary encoded, optional, no NULLs
    time:   [155.55 µs 156.45 µs 157.34 µs]
    change: [-99.190% -99.179% -99.170%] (p = 0.00 < 0.05)
    Performance has improved.
arrow_array_reader/StringViewArray/dictionary encoded, optional, half NULLs
    time:   [99.371 µs 100.55 µs 101.76 µs]
    change: [-97.896% -97.868% -97.836%] (p = 0.00 < 0.05)
    Performance has improved.
Found 32 outliers among 100 measurements (32.00%)
    14 (14.00%) low severe
    11 (11.00%) high mild
    7 (7.00%) high severe
```
Test coverage
I also verified that the code is tested by the roundtrip using llvm-cov:

```shell
$ cargo llvm-cov --html test --package parquet --lib 'arrow::arrow_writer::tests::arrow_writer_binary_view'
```
Full report is here: coverage.zip
This shows the view_buffer is well covered:
However, it seems like `byte_view_array` isn't very well covered:
```rust
let offset = self.buffers.len() as u32;
self.buffers.extend_from_slice(data);
```
This still requires copying the data once (into `self.buffers`). We could make it even faster in a follow-on PR if the buffer could be kept entirely.
For example, if we could add the buffer from the outside `buf`:
https://github.com/apache/arrow-rs/blob/d713d9e221b2f88d3a48624fc95bea3a1f6182a5/parquet/src/arrow/array_reader/byte_view_array.rs#L341-L340
So instead of

```rust
while self.offset < self.buf.len() && read != to_read {
    output.try_push(&buf[start_offset..end_offset], self.validate_utf8)?;
    ...
}
```

something like

```rust
// remember all data, zero copy
output.add_buffer(buf.clone());
while self.offset < self.buf.len() && read != to_read {
    output.try_push(start_offset, end_offset, self.validate_utf8)?;
    ...
}
```
I think this just works for `Encoding::Plain`; if we use `Encoding::DELTA`, the page data does not contain the full values.
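To illustrate why DELTA-style encodings break zero copy, here is a simplified, hypothetical sketch of DELTA_BYTE_ARRAY-style decoding (not the actual arrow-rs decoder): each value is a prefix of the previous value plus a stored suffix, so the complete value bytes never exist contiguously in the page and must be materialized into a new buffer.

```rust
// Simplified, hypothetical sketch of DELTA_BYTE_ARRAY-style decoding: each
// value is reconstructed as a prefix of the previous value plus a stored
// suffix, so the complete bytes never exist contiguously in the page.
fn decode_delta(prefix_lens: &[usize], suffixes: &[&[u8]]) -> Vec<Vec<u8>> {
    let mut out: Vec<Vec<u8>> = Vec::new();
    for (i, (&plen, suffix)) in prefix_lens.iter().zip(suffixes).enumerate() {
        // Copy the shared prefix from the previously decoded value ...
        let mut value = if i == 0 {
            Vec::new()
        } else {
            out[i - 1][..plen].to_vec()
        };
        // ... then append this value's suffix from the page.
        value.extend_from_slice(suffix);
        out.push(value);
    }
    out
}

fn main() {
    // "hello", "help", "helium" share prefixes of length 0, 3, and 3.
    let values = decode_delta(
        &[0, 3, 3],
        &[b"hello".as_slice(), b"p".as_slice(), b"ium".as_slice()],
    );
    assert_eq!(
        values,
        vec![b"hello".to_vec(), b"help".to_vec(), b"helium".to_vec()]
    );
}
```

Because every output value has to be materialized this way, a zero-copy approach that reuses the page buffer could only apply to `Encoding::Plain` pages.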
I tried this approach in 71f51aa, but ran into some problems: if we don't read the whole page data at once, we add the same page data to the output every time a read occurs, so I reverted it.
I wrote new tests in #5639
Force-pushed from fef3c9f to cbf08f4
I am on vacation this week but I hope to find some time later today or tomorrow to give this another look. Thanks @ariesdevil
No need to rush, enjoy your vacation; that also gives me time to rethink how to continue optimizing 😃
I see some new updates -- I plan to look at this PR later this week (probably won't be until Wednesday)
Which issue does this PR close?
Closes #5530
Rationale for this change
Use `view_buffer` instead of `offset_buffer` for the view type parquet reader.
What changes are included in this PR?
Use `view_buffer` instead of `offset_buffer` for the view type parquet reader.
Are there any user-facing changes?
Yes