feat: support reading and writing StringView and BinaryView in parquet (part 2) #5557
base: master
Conversation
Force-pushed from 17695d3 to 1c7260e
The parquet wasm32 build check always failed, but it works on my machine...
Thanks @ariesdevil -- I think @tustvold is away for a few days. I will try and give this PR a look over the next day or two. Very exciting!
Force-pushed from d2f1a30 to fdd2e48
StringView and BinaryView in parquet
Thank you @ariesdevil -- I think this is looking close to something we can merge 🙏
To merge this, I think we need:
- Benchmarks for reading/writing StringView / BinaryViews
- Cover a few more cases for round tripping
I think there is quite a bit more performance to be had as well by avoiding string copies. This PR copies the data twice, I think (once out of the parquet buffer and once out of the offset buffer).
It would be nice to avoid at least one of those copies prior to merging, but I also think that once we have the test coverage it would be OK to merge this PR and then do additional optimizations as follow-on PRs.
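For context on where one of those copies comes from, here is a minimal, self-contained sketch (hypothetical names, plain std Rust rather than the actual arrow-rs builders) of how an offset-based array is materialized: every value's bytes are copied into one new contiguous data buffer.

```rust
// Hypothetical std-only sketch (not the actual arrow-rs builders): building
// an offset-based (StringArray-style) layout copies every value's bytes
// into one new contiguous data buffer.
fn build_offsets(values: &[&[u8]]) -> (Vec<i32>, Vec<u8>) {
    let mut offsets = vec![0i32];
    let mut data = Vec::new();
    for v in values {
        // This is the copy a view-based layout can avoid: the bytes move
        // out of their source buffer into the new contiguous one.
        data.extend_from_slice(v);
        offsets.push(data.len() as i32);
    }
    (offsets, data)
}

fn main() {
    let (offsets, data) = build_offsets(&[b"ab".as_slice(), b"cde".as_slice()]);
    assert_eq!(offsets, vec![0, 2, 5]);
    assert_eq!(data, b"abcde".to_vec());
}
```

A view-based array can instead record (offset, length) references into a shared buffer, which is what makes skipping this copy possible.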
We can add benchmarks here:
- reading: `arrow-rs/parquet/benches/arrow_reader.rs`, lines 867 to 868 in 7e134f4:

```rust
// string benchmarks
//==============================
```

- writing: `arrow-rs/parquet/benches/arrow_writer.rs`, lines 382 to 392 in 7e134f4:

```rust
let batch = create_string_bench_batch(4096, 0.25, 0.75).unwrap();
group.throughput(Throughput::Bytes(
    batch
        .columns()
        .iter()
        .map(|f| f.get_array_memory_size() as u64)
        .sum(),
));
group.bench_function("4096 values string", |b| {
    b.iter(|| write_batch(&batch).unwrap())
});
```
End to end round trip tests:
This is a really nice step forward -- thank you @ariesdevil
Force-pushed from f37c3dd to f65807f
Thank you @ariesdevil -- I think this PR looks like a really nice incremental step.
As I think we have identified, it can be significantly improved performance-wise, but with the basic tests and benchmarks in place we are in a good position to optimize it.
I left some other smaller style / documentation comments that might be nice to clean up, but I don't think they are strictly required.
Let's wait another day for @tustvold to see if he wants to review (I think he is still not back yet), but otherwise we'll merge this PR and add additional features as a follow-on PR.
Force-pushed from a22ee14 to d286521
I plan to give this another review today
The integration test https://github.com/apache/arrow-rs/actions/runs/8553472637/job/23436722309?pr=5557 seems to have failed for reasons unrelated to this PR. I have restarted it.
I am sorry @ariesdevil -- I need more time and a fresh set of eyes to review the new implementation in this PR. I will find time over the next few days.
Hi @alamb, the new implementation has indeed changed a lot. It should have been implemented in another PR, but the original performance was really poor. Thanks for your patience and have a nice weekend.
Thank you @ariesdevil
I left some comments on this PR
I would like to propose merging in the first version you had (with tests and benchmarks) as a separate PR: #5618. Then we can proceed with this PR (or a new one) for the optimization
I think this will help improve the cycle time for reviews as the PRs are smaller and the reviews can be kept more focused. Please let me know what you think
Title changed from "StringView and BinaryView in parquet" to "StringView and BinaryView in parquet (part 2)"
Perhaps we can either rebase this PR against main or maybe start a new PR
From the discussion here ( #5618 (comment) ), the Views type would keep the views schema. However, I think writing as StringView/BinaryView doesn't mean the read type is BinaryView/StringView. Should we just use Binary/String? Otherwise, we need extra checking to handle data written as a view type but read as a string. If StringView/BinaryView is kept in the metadata, I hope a parquet-testing file for this could be added, to test that a legacy parquet reader can read this as string/binary without casting.
Force-pushed from a76d47e to 4a543e2
It's a design choice that requires @alamb's and @tustvold's judgment.
I hope to review this PR later today, and if not I will do it tomorrow. I need to make sure I have enough contiguous time.
I agree with @mapleFU's observation that special handling of StringView / BinaryView will be substantially more code. The reason it would be valuable is that I think it can save a copy. For example, as I understand it from the parquet encodings doc:
So the data looks like this (length prefix, followed by the bytes):
To make a StringArray, those bytes must be copied to a new buffer so they are contiguous:
However, for a StringView array, the raw bytes can be used without copying
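As a rough illustration of that point, here is a hypothetical, std-only sketch (not the actual arrow-rs reader) of decoding PLAIN-encoded BYTE_ARRAY values, each a 4-byte little-endian length prefix followed by the bytes, into views that reference the page buffer by offset and length instead of copying the bytes out:

```rust
// Hypothetical sketch (std-only, not the actual arrow-rs reader): decode a
// PLAIN-encoded BYTE_ARRAY page, where each value is a 4-byte little-endian
// length prefix followed by its bytes, into (offset, len) views that
// reference the page buffer instead of copying the bytes out.
#[derive(Debug, PartialEq)]
struct View {
    offset: usize, // where the value's bytes start inside the page buffer
    len: usize,    // value length in bytes
}

fn decode_plain_views(page: &[u8]) -> Vec<View> {
    let mut views = Vec::new();
    let mut pos = 0;
    while pos + 4 <= page.len() {
        let len = u32::from_le_bytes(page[pos..pos + 4].try_into().unwrap()) as usize;
        pos += 4;
        views.push(View { offset: pos, len });
        pos += len;
    }
    views
}

fn main() {
    // Encode two values, "hello" and "parquet", the way PLAIN would.
    let mut page = Vec::new();
    for v in ["hello", "parquet"] {
        page.extend_from_slice(&(v.len() as u32).to_le_bytes());
        page.extend_from_slice(v.as_bytes());
    }
    let views = decode_plain_views(&page);
    // No value bytes were copied: each view just slices the original page.
    assert_eq!(&page[views[0].offset..views[0].offset + views[0].len], b"hello");
    assert_eq!(&page[views[1].offset..views[1].offset + views[1].len], b"parquet");
}
```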
So we should write without the views schema and use a view reader to read into a view array?
I am not sure -- I am trying to figure out what the current PR does. I am not as familiar with this code as I would like.
OK, if you have any questions you want a quick answer to, you can DM me on the Discord channel.
I spent a while playing around with this PR -- I think it is a really nice improvement
Here is what I think is needed to merge this PR:
- Validate utf8 data (I left comments). I think we need a test for that too -- I'll make a PR to make that easier to add.
- Add a test for the remaining code in `parquet/src/arrow/array_reader/byte_view_array.rs`, or throw a "not implemented" error.
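On the utf8 validation point, here is a minimal sketch of what per-value validation could look like (the `try_push` name and signature are hypothetical, std-only, not the actual arrow-rs API): StringView values must be valid UTF-8, while BinaryView values can accept arbitrary bytes.

```rust
// Minimal hypothetical sketch of per-value UTF-8 validation: StringView
// values must be valid UTF-8, while BinaryView can skip the check.
fn try_push(value: &[u8], validate_utf8: bool) -> Result<(), String> {
    if validate_utf8 {
        // Reject invalid bytes instead of silently exposing them as a string.
        std::str::from_utf8(value)
            .map_err(|e| format!("invalid UTF-8 in byte array: {e}"))?;
    }
    Ok(())
}

fn main() {
    assert!(try_push("héllo".as_bytes(), true).is_ok());
    // 0xC0 can never appear in well-formed UTF-8.
    assert!(try_push(&[0x66, 0xC0], true).is_err());
    // With validation disabled (the BinaryView case), any bytes are accepted.
    assert!(try_push(&[0x66, 0xC0], false).is_ok());
}
```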
Benchmarks
According to the benchmarks, this PR improves performance substantially (like 100x). Really nice 👏 (though I expect some of that will be reduced when we do utf8 validation)
```text
$ cargo bench --features="arrow test_common experimental" --bench arrow_reader -- StringViewArray
...
Running benches/arrow_reader.rs (target/release/deps/arrow_reader-757988025cac113a)
arrow_array_reader/StringViewArray/plain encoded, mandatory, no NULLs
    time:   [161.81 µs 163.56 µs 165.11 µs]
    change: [-99.642% -99.639% -99.635%] (p = 0.00 < 0.05)
    Performance has improved.
arrow_array_reader/StringViewArray/plain encoded, optional, no NULLs
    time:   [166.35 µs 168.61 µs 170.69 µs]
    change: [-99.630% -99.625% -99.621%] (p = 0.00 < 0.05)
    Performance has improved.
arrow_array_reader/StringViewArray/plain encoded, optional, half NULLs
    time:   [114.39 µs 114.97 µs 115.56 µs]
    change: [-98.851% -98.834% -98.820%] (p = 0.00 < 0.05)
    Performance has improved.
Found 24 outliers among 100 measurements (24.00%)
    19 (19.00%) low severe
    3 (3.00%) low mild
    2 (2.00%) high mild
arrow_array_reader/StringViewArray/dictionary encoded, mandatory, no NULLs
    time:   [154.80 µs 156.25 µs 157.57 µs]
    change: [-99.133% -99.128% -99.122%] (p = 0.00 < 0.05)
    Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
    2 (2.00%) low mild
    2 (2.00%) high mild
arrow_array_reader/StringViewArray/dictionary encoded, optional, no NULLs
    time:   [155.55 µs 156.45 µs 157.34 µs]
    change: [-99.190% -99.179% -99.170%] (p = 0.00 < 0.05)
    Performance has improved.
arrow_array_reader/StringViewArray/dictionary encoded, optional, half NULLs
    time:   [99.371 µs 100.55 µs 101.76 µs]
    change: [-97.896% -97.868% -97.836%] (p = 0.00 < 0.05)
    Performance has improved.
Found 32 outliers among 100 measurements (32.00%)
    14 (14.00%) low severe
    11 (11.00%) high mild
    7 (7.00%) high severe
```
Test coverage
I also verified that the code is tested by the roundtrip using llvm-cov:

```shell
$ cargo llvm-cov --html test --package parquet --lib 'arrow::arrow_writer::tests::arrow_writer_binary_view'
```
Full report is here: coverage.zip
This shows the view_buffer is well covered:
However, it seems like `byte_view_array` isn't very well covered:
```rust
let offset = self.buffers.len() as u32;
self.buffers.extend_from_slice(data);
```
This still requires copying the data once (into `self.buffers`). We could make it even faster in a follow-on PR if the buffer could be kept entirely.
For example, if we could add the buffer from the outside `buf`:
https://github.com/apache/arrow-rs/blob/d713d9e221b2f88d3a48624fc95bea3a1f6182a5/parquet/src/arrow/array_reader/byte_view_array.rs#L341-L340
So instead of

```rust
while self.offset < self.buf.len() && read != to_read {
    output.try_push(&buf[start_offset..end_offset], self.validate_utf8)?;
    ...
}
```

something like

```rust
// remember all data, zero copy
output.add_buffer(buf.clone());
while self.offset < self.buf.len() && read != to_read {
    output.try_push(start_offset, end_offset, self.validate_utf8)?;
    ...
}
```
I think this just works for `Encoding::Plain`; if we use `Encoding::DELTA`, the page data does not contain the full values.
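To illustrate why DELTA-style encodings break zero copy, here is a simplified, hypothetical sketch of DELTA_BYTE_ARRAY-style decoding (not the actual arrow-rs decoder): each value is a prefix of the previous value plus a stored suffix, so the complete value bytes never exist contiguously in the page and must be materialized into a new buffer.

```rust
// Simplified, hypothetical sketch of DELTA_BYTE_ARRAY-style decoding: each
// value is reconstructed as a prefix of the previous value plus a stored
// suffix, so the complete bytes never exist contiguously in the page.
fn decode_delta(prefix_lens: &[usize], suffixes: &[&[u8]]) -> Vec<Vec<u8>> {
    let mut out: Vec<Vec<u8>> = Vec::new();
    for (i, (&plen, suffix)) in prefix_lens.iter().zip(suffixes).enumerate() {
        // Copy the shared prefix from the previously decoded value ...
        let mut value = if i == 0 {
            Vec::new()
        } else {
            out[i - 1][..plen].to_vec()
        };
        // ... then append this value's suffix from the page.
        value.extend_from_slice(suffix);
        out.push(value);
    }
    out
}

fn main() {
    // "hello", "help", "helium" share prefixes of length 0, 3, and 3.
    let values = decode_delta(
        &[0, 3, 3],
        &[b"hello".as_slice(), b"p".as_slice(), b"ium".as_slice()],
    );
    assert_eq!(
        values,
        vec![b"hello".to_vec(), b"help".to_vec(), b"helium".to_vec()]
    );
}
```

Because every output value has to be materialized this way, a zero-copy approach that reuses the page buffer could only apply to `Encoding::Plain` pages.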
I tried this approach in 71f51aa, but ran into some problems: if we don't read the whole page data at once, we add the same page data to the output every time a read occurs, so I reverted it.
I wrote new tests in #5639
Force-pushed from fef3c9f to cbf08f4
I am on vacation this week but I hope to find some time later today or tomorrow to give this another look. Thanks @ariesdevil
No need to rush, enjoy your vacation; that also gives me time to rethink how to continue optimizing 😃
I see some new updates -- I plan to look at this PR later this week (probably won't be until Wednesday)
Which issue does this PR close?
Closes #5530
Rationale for this change
Use `view_buffer` instead of `offset_buffer` for the view type parquet reader.
What changes are included in this PR?
Use `view_buffer` instead of `offset_buffer` for the view type parquet reader.
Are there any user-facing changes?
Yes