
add bytes_estimate for binary push in parquet deserialize #1308

Merged
merged 8 commits into main on Dec 12, 2022

Conversation

@sundy-li (Collaborator) commented Nov 30, 2022

[image]

```rust
values: Vec::with_capacity(capacity * 24),
```

Currently, Binary<O> allocates too much memory even when each binary value is small.

  1. Reserve the values capacity when pushing the 101st item, using the byte size per row measured over the first 100 items (see the sketch below).
  2. Call shrink_to_fit() in the finish method.
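
A minimal sketch of the idea, with illustrative names rather than the actual arrow2 types: sample the average value size from the first 100 rows, reserve based on that estimate for the remaining rows, and release the slack when finishing.

```rust
// Sketch only; `BinaryValues` and its methods are assumptions, not arrow2's code.
struct BinaryValues {
    values: Vec<u8>,
    pushed: usize,
}

impl BinaryValues {
    fn new() -> Self {
        Self { values: Vec::new(), pushed: 0 }
    }

    /// `remaining_rows`: how many values are still expected after this one.
    fn push(&mut self, item: &[u8], remaining_rows: usize) {
        self.pushed += 1;
        if self.pushed == 101 {
            // After 100 values the average byte size per row is known,
            // so reserve a realistic capacity instead of a fixed `capacity * 24` guess.
            let avg_size = self.values.len() / 100;
            self.values.reserve(avg_size * remaining_rows);
        }
        self.values.extend_from_slice(item);
    }

    fn finish(mut self) -> Vec<u8> {
        // Drop whatever slack the estimate left behind.
        self.values.shrink_to_fit();
        self.values
    }
}
```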

codecov bot commented Nov 30, 2022

Codecov Report

Base: 83.12% // Head: 83.12% // Decreases project coverage by -0.00% ⚠️

Coverage data is based on head (145f406) compared to base (1417f88).
Patch coverage: 64.28% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1308      +/-   ##
==========================================
- Coverage   83.12%   83.12%   -0.01%     
==========================================
  Files         370      370              
  Lines       40158    40169      +11     
==========================================
+ Hits        33383    33390       +7     
- Misses       6775     6779       +4     
| Impacted Files | Coverage Δ |
| --- | --- |
| src/io/parquet/read/deserialize/binary/utils.rs | 65.30% <37.50%> (-2.83%) ⬇️ |
| src/io/parquet/read/deserialize/binary/basic.rs | 80.43% <100.00%> (+0.24%) ⬆️ |
| src/io/ipc/read/stream_async.rs | 75.34% <0.00%> (-1.37%) ⬇️ |
| src/bitmap/utils/slice_iterator.rs | 97.56% <0.00%> (-1.22%) ⬇️ |
| src/io/ipc/read/file_async.rs | 60.82% <0.00%> (-0.38%) ⬇️ |
| src/offset.rs | 85.29% <0.00%> (-0.37%) ⬇️ |
| src/io/ipc/read/file.rs | 97.76% <0.00%> (+1.33%) ⬆️ |
| src/chunk.rs | 90.47% <0.00%> (+7.14%) ⬆️ |


@ritchie46 (Collaborator) commented

> [image]

Is the image before or after? Have you also got the counterpart? 😋

@sundy-li (Collaborator, Author) commented Nov 30, 2022

> Have you also got the counterpart?

It depends on the data. For a binary column with URL data, it improves the profile percentage from 19.42% to 15.69%.

Even with this PR, reading and decoding large binary columns in arrow2 is still slow.

I'll share a script and the data later, which show that duckdb outperforms arrow2 and arrow-rs by 2x when reading the same parquet files (I still can't find the reason).

@sundy-li (Collaborator, Author) commented Nov 30, 2022

Here is the bench script that reads the url column from parquet files using arrow2 and arrow-rs, compared with duckdb.

The result on my 16-core machine is:

arrow-rs: 501 ms ~ 550 ms
arrow2: 816 ms ~ 832 ms
duckdb: 390 ms ~ 430 ms

With this commit, arrow2 goes down to ~628 ms.

cc @ritchie46, you might be interested in the result.

@ritchie46 (Collaborator) commented

> I'll share a script and the data later, which show that duckdb outperforms arrow2 and arrow-rs by 2x when reading the same parquet files (I still can't find the reason).

Yes, I also found those differences with duckdb. One easy win would be eliding the offset checks, but there is definitely more to gain.
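
As a rough illustration of what eliding those checks could mean (my reading, not necessarily the change that later landed in main): compute the offsets with plain additions and validate once at the end, instead of checking every pushed value.

```rust
// Hypothetical helper; the function and its checks are assumptions for illustration.
fn extend_offsets(offsets: &mut Vec<i32>, lengths: &[i32]) {
    let mut last = *offsets.last().unwrap_or(&0);
    offsets.reserve(lengths.len());
    for &len in lengths {
        // No checked_add or monotonicity check per value.
        last += len;
        offsets.push(last);
    }
    // A single validation at the end replaces the per-value checks.
    assert!(last >= 0, "offsets overflowed i32");
}
```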

@sundy-li (Collaborator, Author) commented Dec 4, 2022

@ritchie46
DuckDB is faster at reading parquet because it does not convert the parquet data into an arrow array. If the parquet column is plain-encoded, it references the plain bytes buffer and creates a vector that points into this buffer, which avoids copying the whole binary data:

```cpp
void StringColumnReader::PlainReference(shared_ptr<ByteBuffer> plain_data, Vector &result) {
    StringVector::AddBuffer(result, make_buffer<ParquetStringVectorBuffer>(move(plain_data)));
}

string_t StringParquetValueConversion::PlainRead(ByteBuffer &plain_data, ColumnReader &reader) {
    auto &scr = ((StringColumnReader &)reader);
    uint32_t str_len = scr.fixed_width_string_length == 0 ? plain_data.read<uint32_t>() : scr.fixed_width_string_length;

    plain_data.available(str_len);
    auto actual_str_len = ((StringColumnReader &)reader).VerifyString(plain_data.ptr, str_len);
    auto ret_str = string_t(plain_data.ptr, actual_str_len);
    plain_data.inc(str_len);
    return ret_str;
}
```
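
For readers more comfortable with Rust, a hypothetical translation of the same zero-copy idea (types, names, and the length-prefix handling here are my assumptions, not duckdb's or arrow2's actual code): keep the plain-encoded page alive behind an Arc and store views into it instead of copying each value.

```rust
use std::sync::Arc;

// A view into the shared page buffer: no bytes are copied per value.
struct StringView {
    buffer: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl StringView {
    fn as_bytes(&self) -> &[u8] {
        &self.buffer.as_slice()[self.offset..self.offset + self.len]
    }
}

// Plain encoding stores a 4-byte little-endian length prefix before each value.
fn plain_read(buffer: &Arc<Vec<u8>>, pos: &mut usize) -> StringView {
    let bytes = buffer.as_slice();
    let len = u32::from_le_bytes(bytes[*pos..*pos + 4].try_into().unwrap()) as usize;
    let view = StringView { buffer: Arc::clone(buffer), offset: *pos + 4, len };
    *pos += 4 + len;
    view
}
```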

@jorgecarleitao (Owner) left a comment


Thanks @sundy-li! Looks good to me. I do agree that we can probably do better. Could you rebase against main and re-run the bench? The offset check has been fixed in main, so we should see some differences.

@sundy-li (Collaborator, Author) commented Dec 10, 2022

latest main: 770 ms ~ 796 ms
this PR rebased on latest main: 610 ms ~ 651 ms

But it would be better to introduce a vectorized decode path and reuse the decode buffer, as arrow-rs does (#1324).

The current approach is a streaming decode, pushing values row by row.
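
A tiny sketch of what reusing a decode buffer could look like (hypothetical structure, not arrow-rs's actual implementation): keep one scratch allocation alive across pages instead of growing a fresh one each time.

```rust
// Hypothetical decoder holding a reusable scratch buffer.
struct BinaryDecoder {
    scratch: Vec<u8>,
}

impl BinaryDecoder {
    fn new() -> Self {
        Self { scratch: Vec::new() }
    }

    /// Decodes one page; the scratch allocation is retained between calls.
    fn decode_page(&mut self, page: &[u8]) -> &[u8] {
        self.scratch.clear(); // keeps capacity, drops contents
        // Stand-in for the real decoding work:
        self.scratch.extend_from_slice(page);
        &self.scratch
    }
}
```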

@jorgecarleitao jorgecarleitao merged commit 1fcfd7c into jorgecarleitao:main Dec 12, 2022
ritchie46 pushed a commit to ritchie46/arrow2 that referenced this pull request Apr 5, 2023