Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let the StringBuilder use BinaryBuilder #2181

Merged
merged 2 commits into from Jul 28, 2022

Conversation

HaoYang670
Copy link
Contributor

@HaoYang670 HaoYang670 commented Jul 27, 2022

Which issue does this PR close?

Closes #2156.

Rationale for this change

Using less memory and faster.

Benchmark

Tested on Intel Ubuntu

Benchmarking bench_primitive/bench_string: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.8s, enable flat sampling, or reduce sample count to 50.
bench_primitive/bench_string                                                                             
                        time:   [1.7409 ms 1.7420 ms 1.7432 ms]
                        thrpt:  [3.6413 GiB/s 3.6440 GiB/s 3.6462 GiB/s]
                 change:
                        time:   [-7.9510% -7.7647% -7.5385%] (p = 0.00 < 0.05)
                        thrpt:  [+8.1532% +8.4184% +8.6378%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

What changes are included in this PR?

  1. Update the String Builder to use Binary Builder
  2. Add some methods to Binary Builder because we use them in the String Builder.
  3. Update tests.
  4. Add from<binary_array> for string array.

Are there any user-facing changes?

Delete the pub method StringBuilder::append(bool).

Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
@@ -385,12 +406,6 @@ pub type StringArray = GenericStringArray<i32>;
/// ```
pub type LargeStringArray = GenericStringArray<i64>;

impl<T: OffsetSizeTrait> From<GenericListArray<T>> for GenericStringArray<T> {
fn from(v: GenericListArray<T>) -> Self {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move, not remove

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 27, 2022
@codecov-commenter
Copy link

Codecov Report

Merging #2181 (7c843ca) into master (37dd037) will decrease coverage by 0.49%.
The diff coverage is 47.29%.

@@            Coverage Diff             @@
##           master    #2181      +/-   ##
==========================================
- Coverage   82.87%   82.38%   -0.50%     
==========================================
  Files         237      239       +2     
  Lines       61465    62096     +631     
==========================================
+ Hits        50940    51155     +215     
- Misses      10525    10941     +416     
Impacted Files Coverage Δ
object_store/src/aws.rs 0.00% <0.00%> (ø)
object_store/src/azure.rs 0.00% <0.00%> (ø)
object_store/src/gcp.rs 0.00% <0.00%> (ø)
object_store/src/multipart.rs 0.00% <0.00%> (ø)
object_store/src/throttle.rs 94.63% <0.00%> (-1.89%) ⬇️
parquet/src/arrow/async_reader.rs 0.00% <0.00%> (ø)
parquet/src/column/page.rs 83.33% <0.00%> (-15.36%) ⬇️
arrow/src/array/builder/boolean_builder.rs 84.26% <66.66%> (-2.47%) ⬇️
arrow/src/array/array_string.rs 92.05% <71.42%> (-5.01%) ⬇️
object_store/src/local.rs 79.47% <72.78%> (-3.42%) ⬇️
... and 21 more

Help us with your feedback. Take ten seconds to tell us how you rate us.

impl<OffsetSize: OffsetSizeTrait> From<GenericBinaryArray<OffsetSize>>
for GenericStringArray<OffsetSize>
{
fn from(v: GenericBinaryArray<OffsetSize>) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would allow non-utf8 data within a StringArray using safe APIs, which would break our safety guarantees

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. But this should not be done in this PR. Because it follows the style of the currentStringArray::from_list(https://github.com/apache/arrow-rs/blob/master/arrow/src/array/array_string.rs#L122-L143) which is also unsafe.

We could file a follow-up issue to track this. Maybe implementing both safe and unsafe styles is a good choice:

impl StringArray {
    unsafe fn from_list_unchecked(...) {...}
    unsafe fn from_binary_unchecked (...) {...}
}

impl From<ListArray> for StringArray {
    /// safe method with utf-8 checking
    fn from(...) {...}
}

impl From<BinaryArray> for StringArray {
    /// safe method with utf-8 checking
    fn from(...) {...}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #2205 to track this.

@tustvold tustvold merged commit 9e47779 into apache:master Jul 28, 2022
@ursabot
Copy link

ursabot commented Jul 28, 2022

Benchmark runs are scheduled for baseline = 82e0512 and contender = 9e47779. 9e47779 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@HaoYang670 HaoYang670 deleted the str_builder_uses_binary_builder branch July 28, 2022 10:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The GenericStringBuilder should use GenericBinaryBuilder
4 participants