Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Substring_by_char #1784

Merged
merged 8 commits into from Jun 6, 2022
Merged

Add Substring_by_char #1784

merged 8 commits into from Jun 6, 2022

Conversation

HaoYang670
Copy link
Contributor

@HaoYang670 HaoYang670 commented Jun 4, 2022

Which issue does this PR close?

Closes #1768.

Rationale for this change

Support substring by char

What changes are included in this PR?

  1. new pub function substring_by_char
  2. tests
  3. benchmark
  4. update docs

Performance

substring utf8 (start = 0, length = None)                                                                            
                        time:   [20.035 ms 20.056 ms 20.081 ms]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) high mild
  7 (7.00%) high severe

substring utf8 (start = 1, length = str_len - 1)                                                                            
                        time:   [21.334 ms 21.358 ms 21.388 ms]
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

substring utf8 by char  time:   [117.18 ms 117.29 ms 117.41 ms]                                   
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

1/6 of the speed of substring(by byte)

Are there any user-facing changes?

Yes.

Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 4, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jun 4, 2022

Codecov Report

Merging #1784 (fcc0ee3) into master (1a64677) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1784      +/-   ##
==========================================
+ Coverage   83.39%   83.42%   +0.02%     
==========================================
  Files         198      198              
  Lines       56142    56303     +161     
==========================================
+ Hits        46822    46969     +147     
- Misses       9320     9334      +14     
Impacted Files Coverage Δ
arrow/src/compute/kernels/substring.rs 99.51% <100.00%> (+0.09%) ⬆️
parquet/src/util/cursor.rs 62.18% <0.00%> (-1.69%) ⬇️
parquet/src/file/serialized_reader.rs 94.46% <0.00%> (-1.17%) ⬇️
arrow/src/datatypes/datatype.rs 65.42% <0.00%> (-0.38%) ⬇️
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (-0.23%) ⬇️
parquet/src/encodings/encoding.rs 93.46% <0.00%> (-0.20%) ⬇️
arrow/src/ipc/reader.rs 90.73% <0.00%> (-0.11%) ⬇️
parquet/src/file/writer.rs 92.84% <0.00%> (-0.02%) ⬇️
parquet/src/arrow/arrow_writer.rs 97.76% <0.00%> (-0.02%) ⬇️
parquet/src/arrow/schema.rs 96.81% <0.00%> (-0.01%) ⬇️
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1a64677...fcc0ee3. Read the comment docs.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good, left some comments on how it could be improved - but happy for this to be a follow up PR

arrow/src/compute/kernels/substring.rs Outdated Show resolved Hide resolved
arrow/src/compute/kernels/substring.rs Outdated Show resolved Hide resolved
arrow/src/compute/kernels/substring.rs Outdated Show resolved Hide resolved
arrow/src/compute/kernels/substring.rs Outdated Show resolved Hide resolved
length.to_usize().unwrap().min(char_count - start)
});

val.chars().skip(start).take(length).collect::<String>()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allocating a string temporary, only to copy out of it, is likely a significant portion of the slow-down. That combined with the null handling.

This could definitely be handled as a separate PR, but you might want to consider doing something like (not tested).

let nulls = // align bitmap to 0 offset, copying if already aligned
let mut vals = BufferBuilder::<u8>::new(array.value_data().len());
let mut indexes = BufferBuilder::<OffsetSize>::new(array.len() + 1);
indexes.append(0);

for val in array.iter() {
  let char_count = val.chars().count();
  let start = if start >= 0 {
      start.to_usize().unwrap().min(char_count)
  } else {
      char_count - (-start).to_usize().unwrap().min(char_count)
  };
  let length = length.map_or(char_count - start, |length| {
      length.to_usize().unwrap().min(char_count - start)
  });

  let mut start_byte = 0;
  let mut end_byte = val.len();
  for ((idx, (byte_idx, _)) in val.char_indices().enumerate() {
    if idx == start {
      start_byte = byte_idx;
    } else if idx == start + length {
      end_byte = byte_idx;
      break
    }
  }
  // Could even be unchecked
  vals.append_slice(&val[start_byte..end_byte]);
  indexes.append(vals.len() as _);
}

let data = ArrayDataBuilder::new(array.data_type()).len(array.len()).add_buffer(vals.finish())
  .add_buffer(indexes.finish()).add_buffer(vals.finish());

Ok(GenericStringArray::<OffsetSize>::from(unsafe {data.build_unchecked()}))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked by #1800

(input_vals.clone(), -4, Some(4), vec!["ello", "", "⊢x:T"]),
];

cases.into_iter().try_for_each::<_, Result<()>>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could extract this as a function instead of duplicating it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked by #1801

}

fn generic_string_by_char_with_non_zero_offset<O: OffsetSizeTrait>() -> Result<()> {
let values = "S→T = Πx:S.T";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be easier to follow to just construct the regular array, and then call array.slice(1, 2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked by #1801

@@ -1083,6 +1135,164 @@ mod tests {
generic_string_with_non_zero_offset::<i64>()
}

fn with_nulls_generic_string_by_char<O: OffsetSizeTrait>() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loving the test coverage 👍

HaoYang670 and others added 4 commits June 6, 2022 20:24
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
@tustvold tustvold merged commit dac2132 into apache:master Jun 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add substring_by_char kernels for slicing on character boundaries
3 participants