Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Epic: enhance the substring kernel #1531

Open
9 of 12 tasks
Tracked by #1546
HaoYang670 opened this issue Apr 10, 2022 · 12 comments
Open
9 of 12 tasks
Tracked by #1546

Epic: enhance the substring kernel #1531

HaoYang670 opened this issue Apr 10, 2022 · 12 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@HaoYang670
Copy link
Contributor

HaoYang670 commented Apr 10, 2022

This is an epic issue to enhance the substring kernel from performance, safety and supported types.
You can find more info from the discussion: #1529 (comment)

List of individual tasks is below:

Performance

Safety

New Features

@HaoYang670 HaoYang670 added the enhancement Any new improvement worthy of a entry in the changelog label Apr 10, 2022
@alamb
Copy link
Contributor

alamb commented Apr 11, 2022

Thanks @HaoYang670

I suggest:

  • Mark the current substring function as unsafe AND RENAME TO substring_bytes_unchecked
  • Implement safe substring for string array AND NAME substring_bytes
  • Implement substring_by_char for string array AND NAME substring

@HaoYang670
Copy link
Contributor Author

Hi @alamb. After thinking twice about the renaming, I suggest we leave the substring by byte as the implicit version, and make substring by char as an explicit version, so that we can make minimize API change (because the current substring is by byte) and easily extend the function:

function name Supported array
substring StringArray(by byte)(slow), BinaryArray, FixedSizeBinary, FixedSizedListArray, ListArray
unsafe substring StringArray(by byte)(fast), and more if we find invalid outputs in the future
substring_by_char StringArray

In this way, the back compatibility will not be broken.

@alamb
Copy link
Contributor

alamb commented Apr 13, 2022

Thank you for your thoughts @HaoYang670

I still think switching substring to use chars is preferable because:

  1. For ascii (single byte utf8) text, bytes and chars are equivalent
  2. If someone has multi-byte utf8 string data the substring calculations are likely subtlety incorrect and they are in danger if creating invalid utf8 when using substring
  3. If we make substring (by bytes) safe, the performance will regress which some people will regard as backwards incompatible as well

Perhaps other maintainers such as @nevi-me @viirya @sunchao, @tustvold and @jhorstmann have some thoughts about if/how we should change substring to handle utf8 ?

@tustvold
Copy link
Contributor

tustvold commented Apr 13, 2022

Coming at the problem from a Rust perspective, as opposed to potentially a query engine perspective, I personally would expect it to behave similarly to the standard library. That is:

That being said I don't really understand the use-cases of substring within a query engine context, i.e. if there is some SQL primitive it maps onto or something, and so if there is some standard practice here...

With regards to the checked bytes API, whatever it is called, there is a bit-twiddling trick you can do to make it relatively cheap - see here. Assuming the contents of the string before were valid UTF-8, it is sufficient to just verify you aren't splitting a codepoint - you do not need to revalidate the entire array.

@alamb
Copy link
Contributor

alamb commented Apr 13, 2022

I think @tustvold 's perspective matches @HaoYang670 's proposal #1531 (comment) . I am convinced by @HaoYang670 's proposal

@HaoYang670
Copy link
Contributor Author

HaoYang670 commented Apr 13, 2022

there is a bit-twiddling trick you can do to make it relatively cheap - see here.

This is a fantastic point! Maybe the cost of checking utf-8 validation can be so cheap that we don't need the unsafe substring at all!

@HaoYang670
Copy link
Contributor Author

HaoYang670 commented Apr 13, 2022

So I think the first step is to add the utf-8 validation checking into the current substring function and test the performance.

@viirya
Copy link
Member

viirya commented Apr 13, 2022

Hmm, @tustvold and @HaoYang670 's proposal looks good. I think It is good idea to follow up the standard library.

Bytes by default, with validation that it isn't splitting a codepoint
Additional Unsafe unchecked bytes APIs
Separate APIs for operating on chars.

About the proposal, my question is, do we really need two APIs for substring on bytes? Is the validation is necessary? As we have separate API for operating on chars, users should use it for utf8 cases, I think. What use cases it could be that substring is called to operate on bytes, but it still needs to make sure not messing up utf8 content?

@HaoYang670
Copy link
Contributor Author

Hi @viirya. In general:
Speed: substring by byte unchecked >= substring by byte checked >> substring by char
Safety: substring by char > substring by byte checked > substring by byte unchecked. (If the input array has an invalid utf8 string, substring by byte checked might not find it, but substring by char can always find it)
User-friendly: substring by char > substring by byte checked > substring by byte unchecked

That's why we need 3 versions of substring.

@viirya
Copy link
Member

viirya commented Apr 14, 2022

Thanks @HaoYang670 . It is clear from the above list, substring by byte checked is always in the middle in many aspects. So it makes me wondering if it is really needed? For safety case, we want to make sure valid utf8 result, we definitely choose substring by char. If we don't care about it, we use substring by byte unchecked. I'm not sure what substring by byte checked can provide us.

@HaoYang670
Copy link
Contributor Author

HaoYang670 commented Apr 14, 2022

We can design a situation where substring by byte checked is the best choice:

Suppose we want to get the substrings with the byte length of the shortest string in the array. We could do in this way by using the length kernel and the aggregation kernel:

substring(string_array, start = 0, length = Some(min(length(string_array))))

Then, give an input string_array like this:

let input  = [
    "14 bytes  🗻",
    "This string has 24 bytes",
    "🗻 13 bytes"
];

We will get an invalid string array if using substring by byte unchecked, which is dangerous.
By using substring by byte checked we can find the error as early as possible.

@HaoYang670
Copy link
Contributor Author

We don't need to implement the unsafe substring based on the benchmark result of #1577. The utf-8 validation checking only decreases the speed by about 10%.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

4 participants