feat: support reading and writing StringView and BinaryView in parquet (part 2) #5557

Open
wants to merge 4 commits into
base: master

ariesdevil
Contributor

@ariesdevil ariesdevil commented Mar 26, 2024

Which issue does this PR close?

Closes #5530

Rationale for this change

Use view_buffer instead of offset_buffer for view type parquet reader

What changes are included in this PR?

Use view_buffer instead of offset_buffer for view type parquet reader

Are there any user-facing changes?

Yes

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Mar 26, 2024
@ariesdevil ariesdevil force-pushed the parquet branch 5 times, most recently from 17695d3 to 1c7260e Compare March 28, 2024 16:23
@ariesdevil
Contributor Author

The wasm32 build check for parquet always fails, but it works on my machine...

@ariesdevil ariesdevil marked this pull request as ready for review March 28, 2024 16:33
@ariesdevil
Contributor Author

Hi @alamb @tustvold, this PR is not really ready for review yet, but I need some advice before I continue.

@alamb
Contributor

alamb commented Mar 28, 2024

Thanks @ariesdevil -- I think @tustvold is away for a few days. I will try and give this PR a look over the next day or two. Very exciting!

cc @XiangpengHao

@ariesdevil ariesdevil force-pushed the parquet branch 2 times, most recently from d2f1a30 to fdd2e48 Compare March 31, 2024 15:18
@alamb alamb changed the title feat: support string and binary view read write parquet feat: support reading and writing StringView and BinaryView in parquet Apr 1, 2024
Contributor

@alamb alamb left a comment

Thank you @ariesdevil -- I think this is looking close to something we can merge 🙏

To merge this, I think we need:

  1. Benchmarks for reading/writing StringView / BinaryViews
  2. Cover a few more cases for round tripping

I think there is quite a bit more performance to be had as well by avoiding string copies. I think this PR copies the data twice (once out of the parquet buffer and once out of the offset buffer).

It would be nice to avoid at least one of those copies prior to merge, but I also think that once we have the test coverage it would be OK to merge this PR and then do additional optimizations as follow-on PRs.

We can add benchmarks here:

  • reading
    // string benchmarks
    //==============================
  • writing:
    let batch = create_string_bench_batch(4096, 0.25, 0.75).unwrap();
    group.throughput(Throughput::Bytes(
        batch
            .columns()
            .iter()
            .map(|f| f.get_array_memory_size() as u64)
            .sum(),
    ));
    group.bench_function("4096 values string", |b| {
        b.iter(|| write_batch(&batch).unwrap())
    });

End to end round trip tests:

parquet/src/arrow/buffer/offset_buffer.rs (outdated, resolved)
parquet/src/arrow/arrow_writer/mod.rs (outdated, resolved)
parquet/src/arrow/arrow_writer/mod.rs (outdated, resolved)
@alamb
Contributor

alamb commented Apr 1, 2024

This is a really nice step forward -- thank you @ariesdevil

@ariesdevil ariesdevil force-pushed the parquet branch 2 times, most recently from f37c3dd to f65807f Compare April 2, 2024 15:17
alamb
alamb previously approved these changes Apr 2, 2024
Contributor

@alamb alamb left a comment

Thank you @ariesdevil -- I think this PR looks like a really nice incremental step.

As I think we have identified, its performance can be improved significantly, but with the basic tests and benchmarks in place we are in a good position to optimize it.

I left some other smaller style / documentation comments that might be nice to clean up, but I don't think they are strictly required.

Let's wait another day for @tustvold to see if he wants to review (I think he is still not back yet); otherwise we'll merge this PR and add additional features in follow-on PRs.

arrow-array/src/array/byte_view_array.rs (outdated, resolved)
arrow/src/util/bench_util.rs (outdated, resolved)
parquet/benches/arrow_writer.rs (outdated, resolved)
parquet/src/arrow/buffer/offset_buffer.rs (outdated, resolved)
@ariesdevil ariesdevil force-pushed the parquet branch 3 times, most recently from a22ee14 to d286521 Compare April 4, 2024 10:56
@alamb
Contributor

alamb commented Apr 4, 2024

I plan to give this another review today

@alamb
Contributor

alamb commented Apr 5, 2024

The integration test https://github.com/apache/arrow-rs/actions/runs/8553472637/job/23436722309?pr=5557 seems to have failed for reasons unrelated to this PR. I have restarted it

################# FAILURES #################
FAILED TEST: union Rust producing,  Rust consuming
<class 'RuntimeError'>: Command failed: /build/rust/debug/flight-test-integration-client --host localhost --port=34799 --path /tmp/arrow-integration-92iwbbih/generated_union.json
With output:
--------------
Error: tonic::transport::Error(Transport, hyper::Error(Connect, ConnectError("tcp connect error", Os { code: 99, kind: AddrNotAvailable, message: "Cannot assign requested address" })))

@alamb
Contributor

alamb commented Apr 5, 2024

I am sorry @ariesdevil -- I need more time and a fresh set of eyes to review the new implementation in this PR. I will find time over the next few days.

@ariesdevil
Contributor Author

Hi @alamb, the new implementation has indeed changed a lot. It should have gone into a separate PR, but the performance of the original version was really poor.

Thanks for your patience and have a nice weekend.

Contributor

@alamb alamb left a comment

Thank you @ariesdevil

I left some comments on this PR

I would like to propose merging in the first version you had (with tests and benchmarks) as a separate PR: #5618. Then we can proceed with this PR (or a new one) for the optimization

I think this will help improve the cycle time for reviews as the PRs are smaller and the reviews can be kept more focused. Please let me know what you think

parquet/src/arrow/buffer/view_buffer.rs (resolved)
parquet/src/arrow/buffer/view_buffer.rs (resolved)
parquet/src/arrow/buffer/view_buffer.rs (resolved)
parquet/src/arrow/buffer/view_buffer.rs (resolved)
@alamb alamb changed the title feat: support reading and writing StringView and BinaryView in parquet feat: support reading and writing StringView and BinaryView in parquet (part 2) Apr 9, 2024
@alamb
Contributor

alamb commented Apr 9, 2024

Perhaps we can either rebase this PR against main or maybe start a new PR

@mapleFU
Member

mapleFU commented Apr 10, 2024

From the discussion here ( #5618 (comment) ), view types would keep the view schema. However, I don't think writing from StringView/BinaryView means the read type must be BinaryView/StringView. Should we just use Binary/String?

Otherwise, we need extra checks to handle data that was written as a view type and is read as a string. If StringView/BinaryView is kept in the metadata, I hope a parquet-testing file can be added to verify that a legacy parquet reader can read it as string/binary without casting.

@ariesdevil ariesdevil force-pushed the parquet branch 2 times, most recently from a76d47e to 4a543e2 Compare April 10, 2024 08:51
@ariesdevil
Copy link
Contributor Author

From the discussion here ( #5618 (comment) ), view types would keep the view schema. However, I don't think writing from StringView/BinaryView means the read type must be BinaryView/StringView. Should we just use Binary/String?

Otherwise, we need extra checks to handle data that was written as a view type and is read as a string. If StringView/BinaryView is kept in the metadata, I hope a parquet-testing file can be added to verify that a legacy parquet reader can read it as string/binary without casting.

It's a design choice that requires @alamb's and @tustvold's judgment.

@alamb
Contributor

alamb commented Apr 11, 2024

I hope to review this PR later today and if not I will do it tomorrow. I need to make sure I have enough contiguous time

@alamb
Contributor

alamb commented Apr 12, 2024

From the discussion here ( #5618 (comment) ), view types would keep the view schema. However, I don't think writing from StringView/BinaryView means the read type must be BinaryView/StringView. Should we just use Binary/String?

Otherwise, we need extra checks to handle data that was written as a view type and is read as a string. If StringView/BinaryView is kept in the metadata, I hope a parquet-testing file can be added to verify that a legacy parquet reader can read it as string/binary without casting.

I agree with @mapleFU's observation that special handling for StringView / BinaryView will be substantially more code.

The reason it would be valuable is that I think it can save a copy.

For example, as I understand it from the parquet encodings doc:

...
BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained in the array
...

So the data looks like this (length prefix, followed by the bytes):

\3\0\0\0foo\26\0\0\0abcdefghijklmnopqrstuvwxyz

To make a StringArray, those bytes must be copied to a new buffer so they are contiguous:

offsets: [0, 3, 29]
data: fooabcdefghijklmnopqrstuvwxyz

However, for a StringView array, the raw bytes can be used without copying:

views: [(len: 3, data: "foo"), (len: 26, offset: 11)]
\3\0\0\0foo\26\0\0\0abcdefghijklmnopqrstuvwxyz
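
The difference between the two layouts can be sketched with only the Rust standard library. This is a minimal illustration, not the arrow-rs or parquet implementation: `ValueRange`, `scan_plain_byte_array`, and `to_offsets_and_data` are hypothetical names introduced here. The first function only records where each value sits inside the PLAIN-encoded page (which is all a view needs), while the second performs the copy that a contiguous offsets-plus-data layout requires.

```rust
/// Location of one value's bytes inside a PLAIN-encoded BYTE_ARRAY page.
/// Hypothetical helper type for illustration only.
#[derive(Debug, PartialEq)]
struct ValueRange {
    offset: usize,
    len: usize,
}

/// Walk the 4-byte little-endian length prefixes and record where each
/// value lives. No value bytes are copied.
fn scan_plain_byte_array(page: &[u8]) -> Result<Vec<ValueRange>, String> {
    let mut ranges = Vec::new();
    let mut pos = 0;
    while pos < page.len() {
        let prefix = page.get(pos..pos + 4).ok_or("truncated length prefix")?;
        let len = u32::from_le_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]) as usize;
        let start = pos + 4;
        let end = start.checked_add(len).ok_or("length overflow")?;
        if end > page.len() {
            return Err("truncated value".to_string());
        }
        ranges.push(ValueRange { offset: start, len });
        pos = end;
    }
    Ok(ranges)
}

/// Build the contiguous offsets + data layout used by offset-based arrays.
/// Every value is copied into `data` so that the bytes become contiguous.
fn to_offsets_and_data(page: &[u8], ranges: &[ValueRange]) -> (Vec<i32>, Vec<u8>) {
    let mut offsets = vec![0i32];
    let mut data = Vec::new();
    for r in ranges {
        data.extend_from_slice(&page[r.offset..r.offset + r.len]);
        offsets.push(data.len() as i32);
    }
    (offsets, data)
}
```

For a page holding "foo" followed by the 26-letter alphabet, the scan reports the values at byte offsets 4 and 11 of the page, and the copying path produces offsets [0, 3, 29].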

@ariesdevil
Contributor Author

So we should write without views schema and use a view reader to read to view array?

@alamb
Contributor

alamb commented Apr 12, 2024

So we should write without views schema and use a view reader to read to view array?

I am not sure -- I am trying to figure it out / what the current PR does. I am not as familiar with this code as I would like

@ariesdevil
Contributor Author

So we should write without views schema and use a view reader to read to view array?

I am not sure -- I am trying to figure it out / what the current PR does. I am not as familiar with this code as I would like

OK, if you have any questions that need a quick answer, you can DM me on the Discord channel.

Contributor

@alamb alamb left a comment

I spent a while playing around with this PR -- I think it is a really nice improvement

Here is what I think is needed to merge this PR:

  1. Validate utf8 data (I left comments). I think we need a test for that too -- I'll make a PR to make that easier to add.
  2. Add a test for the remaining code in parquet/src/arrow/array_reader/byte_view_array.rs, or throw a "not implemented" error.

Benchmarks

According to the benchmarks, this PR improves performance substantially (like 100x). Really nice 👏 (though I expect some of that will be reduced when we do utf8 validation)

cargo bench --features="arrow test_common experimental" --bench arrow_reader  -- StringViewArray
...

     Running benches/arrow_reader.rs (target/release/deps/arrow_reader-757988025cac113a)
arrow_array_reader/StringViewArray/plain encoded, mandatory, no NULLs
                        time:   [161.81 µs 163.56 µs 165.11 µs]
                        change: [-99.642% -99.639% -99.635%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_array_reader/StringViewArray/plain encoded, optional, no NULLs
                        time:   [166.35 µs 168.61 µs 170.69 µs]
                        change: [-99.630% -99.625% -99.621%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_array_reader/StringViewArray/plain encoded, optional, half NULLs
                        time:   [114.39 µs 114.97 µs 115.56 µs]
                        change: [-98.851% -98.834% -98.820%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 24 outliers among 100 measurements (24.00%)
  19 (19.00%) low severe
  3 (3.00%) low mild
  2 (2.00%) high mild
arrow_array_reader/StringViewArray/dictionary encoded, mandatory, no NULLs
                        time:   [154.80 µs 156.25 µs 157.57 µs]
                        change: [-99.133% -99.128% -99.122%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
arrow_array_reader/StringViewArray/dictionary encoded, optional, no NULLs
                        time:   [155.55 µs 156.45 µs 157.34 µs]
                        change: [-99.190% -99.179% -99.170%] (p = 0.00 < 0.05)
                        Performance has improved.
arrow_array_reader/StringViewArray/dictionary encoded, optional, half NULLs
                        time:   [99.371 µs 100.55 µs 101.76 µs]
                        change: [-97.896% -97.868% -97.836%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 32 outliers among 100 measurements (32.00%)
  14 (14.00%) low severe
  11 (11.00%) high mild
  7 (7.00%) high severe
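
To relate the "change" percentages above to the "like 100x" estimate, the conversion can be written out. This is a small sketch that assumes the reported change is the relative difference `(new - old) / old`; `speedup_from_change` is a hypothetical helper, not part of criterion.

```rust
/// Convert a relative change in percent (negative means faster) into a
/// speedup factor. Assumes change = (new_time - old_time) / old_time.
fn speedup_from_change(change_percent: f64) -> f64 {
    // new_time = old_time * (1 + change / 100), so old / new = 1 / (1 + change / 100).
    1.0 / (1.0 + change_percent / 100.0)
}
```

Under that assumption, the central estimates reported above (from -97.868% to -99.639%) correspond to speedups of roughly 47x to 277x.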

Test coverage

I also verified that the code is tested by the roundtrip using llvm-cov

$ cargo llvm-cov --html test --package parquet --lib 'arrow::arrow_writer::tests::arrow_writer_binary_view'

Full report is here: coverage.zip

This shows the view_buffer is well covered:

Screenshot 2024-04-12 at 10 17 41 AM

However, it seems like byte_view_array isn't very well covered:

Screenshot 2024-04-12 at 10 21 15 AM

parquet/src/arrow/buffer/view_buffer.rs (outdated, resolved)
parquet/src/arrow/buffer/view_buffer.rs (resolved)
}

let offset = self.buffers.len() as u32;
self.buffers.extend_from_slice(data);
Contributor

This still requires copying the data once (into self.buffers). We could make it even faster in a follow-on PR if the buffer could be kept entirely.

For example, if we could add the buffer from the outside buf https://github.com/apache/arrow-rs/blob/d713d9e221b2f88d3a48624fc95bea3a1f6182a5/parquet/src/arrow/array_reader/byte_view_array.rs#L341-L340

So instead of

        while self.offset < self.buf.len() && read != to_read {
            output.try_push(&buf[start_offset..end_offset], self.validate_utf8)?;
            ...
        }

Something like

        // remember all data, zero copy
        output.add_buffer(buf.clone());
        while self.offset < self.buf.len() && read != to_read {
            output.try_push(start_offset, end_offset, self.validate_utf8)?;
            ...
        }
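
A self-contained sketch of that idea, under stated assumptions: `ViewBufferSketch` and `make_view` are hypothetical names, and `Arc<[u8]>` stands in for whatever reference-counted buffer type the reader actually uses. The 16-byte view layout follows the Arrow variable-size binary view format (a 4-byte length, then either up to 12 inline bytes, or a 4-byte prefix, a 4-byte buffer index, and a 4-byte offset).

```rust
use std::sync::Arc;

/// Pack one 16-byte view, little-endian, into a u128.
fn make_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    if data.len() <= 12 {
        // Short values are stored inline, zero padded.
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        // Long values keep a 4-byte prefix and point into a shared buffer.
        bytes[4..8].copy_from_slice(&data[..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Hypothetical stand-in for the PR's view buffer; not the arrow-rs type.
#[derive(Default)]
struct ViewBufferSketch {
    views: Vec<u128>,
    /// Shared page buffers. Cloning an `Arc` bumps a refcount and copies no bytes.
    buffers: Vec<Arc<[u8]>>,
}

impl ViewBufferSketch {
    /// Register a whole page once and return its buffer index.
    fn add_buffer(&mut self, buf: Arc<[u8]>) -> u32 {
        self.buffers.push(buf);
        (self.buffers.len() - 1) as u32
    }

    /// Append a view for `buffers[buffer_index][start..end]`, optionally
    /// validating UTF-8 first. Long values are referenced, not copied.
    fn try_push(
        &mut self,
        buffer_index: u32,
        start: usize,
        end: usize,
        validate_utf8: bool,
    ) -> Result<(), String> {
        let buf = self
            .buffers
            .get(buffer_index as usize)
            .ok_or("unknown buffer index")?;
        let data = buf.get(start..end).ok_or("value range out of bounds")?;
        if validate_utf8 {
            std::str::from_utf8(data).map_err(|e| format!("invalid UTF-8: {}", e))?;
        }
        let view = make_view(data, buffer_index, start as u32);
        self.views.push(view);
        Ok(())
    }
}
```

The design choice here is that the page is registered once and each value becomes a 16-byte view, so only values of 12 bytes or fewer have their bytes duplicated (inline in the view).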

Contributor Author

I think this only works for Encoding::Plain. If we use Encoding::DELTA, the page data does not contain the full values.
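
To illustrate why, here is a minimal sketch assuming the comment refers to Parquet's DELTA_BYTE_ARRAY encoding, which stores each value as a shared-prefix length plus a suffix. `decode_prefix_suffix` is a hypothetical helper that takes already-decoded (prefix length, suffix) pairs; it is not the parquet crate's decoder. Because only the suffixes are present in the page, the full values have to be materialized somewhere before a view can point at them.

```rust
/// Rebuild full values from (prefix_len, suffix) pairs, where each prefix is
/// taken from the previously decoded value (incremental / front coding).
fn decode_prefix_suffix(entries: &[(usize, &[u8])]) -> Result<Vec<Vec<u8>>, String> {
    let mut values: Vec<Vec<u8>> = Vec::with_capacity(entries.len());
    for (prefix_len, suffix) in entries {
        let mut value = Vec::with_capacity(prefix_len + suffix.len());
        if let Some(prev) = values.last() {
            // The prefix bytes exist only in the previous decoded value,
            // not in the page itself.
            let prefix = prev
                .get(..*prefix_len)
                .ok_or("prefix longer than previous value")?;
            value.extend_from_slice(prefix);
        } else if *prefix_len != 0 {
            return Err("first value cannot have a prefix".to_string());
        }
        value.extend_from_slice(suffix);
        values.push(value);
    }
    Ok(values)
}
```

For the pairs (0, "apple"), (3, "ly"), (5, "ing") this yields "apple", "apply", and "applying"; the second and third values never appear contiguously in the encoded data.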

Contributor Author

I tried this approach in 71f51aa, but ran into a problem: if we don't read the whole page at once, the same page data gets added to the output every time a read occurs. So I reverted this.
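
One possible way around the repeated registration, offered only as a sketch and not as what the PR does: before adding the page, check whether the output's most recent buffer is already the same allocation. `OutputSketch`, `PlainDecoderSketch`, and `buffer_index_for` are hypothetical names, and `Arc<[u8]>` again stands in for the reader's real buffer type.

```rust
use std::sync::Arc;

/// Hypothetical output: shared page buffers plus (buffer index, start, end) per value.
#[derive(Default)]
struct OutputSketch {
    buffers: Vec<Arc<[u8]>>,
    values: Vec<(u32, usize, usize)>,
}

impl OutputSketch {
    /// Return the index of `page`, registering it only if it is not
    /// already the most recently added buffer.
    fn buffer_index_for(&mut self, page: &Arc<[u8]>) -> u32 {
        let already_last = self
            .buffers
            .last()
            .map_or(false, |b| Arc::ptr_eq(b, page));
        if !already_last {
            self.buffers.push(Arc::clone(page));
        }
        (self.buffers.len() - 1) as u32
    }
}

/// Hypothetical PLAIN decoder that may be asked to read a page in several calls.
struct PlainDecoderSketch {
    page: Arc<[u8]>,
    offset: usize,
}

impl PlainDecoderSketch {
    /// Read up to `to_read` values, recording ranges instead of copying bytes.
    fn read(&mut self, output: &mut OutputSketch, to_read: usize) -> Result<usize, String> {
        if to_read == 0 || self.offset >= self.page.len() {
            return Ok(0);
        }
        let index = output.buffer_index_for(&self.page);
        let mut read = 0;
        while self.offset < self.page.len() && read != to_read {
            let prefix = self
                .page
                .get(self.offset..self.offset + 4)
                .ok_or("truncated length prefix")?;
            let len = u32::from_le_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]) as usize;
            let start = self.offset + 4;
            let end = start.checked_add(len).ok_or("length overflow")?;
            if end > self.page.len() {
                return Err("truncated value".to_string());
            }
            output.values.push((index, start, end));
            self.offset = end;
            read += 1;
        }
        Ok(read)
    }
}
```

With this check, reading a three-value page in two calls against the same output registers the page once. A fresh output still gets its own registration, since its buffer list starts empty.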

@alamb
Contributor

alamb commented Apr 12, 2024

  • Validate utf8 data (I left comments). I think we need a test for that too -- I'll make a PR to make that easier to add.

I wrote new tests in #5639

@ariesdevil ariesdevil force-pushed the parquet branch 3 times, most recently from fef3c9f to cbf08f4 Compare April 15, 2024 13:13
@alamb
Contributor

alamb commented Apr 16, 2024

I am on vacation this week but I hope to find some time later today or tomorrow to give this another look. Thanks @ariesdevil

@ariesdevil
Contributor Author

No need to rush, enjoy your vacation. This gives me time to rethink how to continue the optimization 😃

@alamb
Contributor

alamb commented Apr 22, 2024

I see some new updates -- I plan to look at this PR later this week (probably won't be until Wednesday)

Labels
arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement StringViewArray and BinaryViewArray reading/writing in parquet
5 participants