Epic: enhance the `substring` kernel #1531

HaoYang670 · 2022-04-10T11:52:33Z

alamb · 2022-04-11T21:52:44Z

I suggest:

Mark the current substring function as unsafe AND RENAME TO substring_bytes_unchecked
Implement safe substring for string array AND NAME substring_bytes
Implement substring_by_char for string array AND NAME substring

HaoYang670 · 2022-04-13T06:36:10Z

Hi @alamb. After thinking twice about the renaming, I suggest we leave the substring by byte as the implicit version, and make substring by char as an explicit version, so that we can make minimize API change (because the current substring is by byte) and easily extend the function:

function name	Supported array
substring	StringArray(by byte)(slow), BinaryArray, FixedSizeBinary, FixedSizedListArray, ListArray
unsafe substring	StringArray(by byte)(fast), and more if we find invalid outputs in the future
substring_by_char	StringArray

In this way, the back compatibility will not be broken.

alamb · 2022-04-13T12:43:27Z

Thank you for your thoughts @HaoYang670

I still think switching substring to use chars is preferable because:

For ascii (single byte utf8) text, bytes and chars are equivalent
If someone has multi-byte utf8 string data the substring calculations are likely subtlety incorrect and they are in danger if creating invalid utf8 when using substring
If we make substring (by bytes) safe, the performance will regress which some people will regard as backwards incompatible as well

Perhaps other maintainers such as @nevi-me @viirya @sunchao, @tustvold and @jhorstmann have some thoughts about if/how we should change substring to handle utf8 ?

tustvold · 2022-04-13T12:56:22Z

Coming at the problem from a Rust perspective, as opposed to potentially a query engine perspective, I personally would expect it to behave similarly to the standard library. That is:

Bytes by default, with validation that it isn't splitting a codepoint
Additional Unsafe unchecked bytes APIs
Separate APIs for operating on chars.

That being said I don't really understand the use-cases of substring within a query engine context, i.e. if there is some SQL primitive it maps onto or something, and so if there is some standard practice here...

With regards to the checked bytes API, whatever it is called, there is a bit-twiddling trick you can do to make it relatively cheap - see here. Assuming the contents of the string before were valid UTF-8, it is sufficient to just verify you aren't splitting a codepoint - you do not need to revalidate the entire array.

alamb · 2022-04-13T13:05:37Z

I think @tustvold 's perspective matches @HaoYang670 's proposal #1531 (comment) . I am convinced by @HaoYang670 's proposal

HaoYang670 · 2022-04-13T13:06:24Z

there is a bit-twiddling trick you can do to make it relatively cheap - see here.

This is a fantastic point! Maybe the cost of checking utf-8 validation can be so cheap that we don't need the unsafe substring at all!

HaoYang670 · 2022-04-13T13:12:01Z

So I think the first step is to add the utf-8 validation checking into the current substring function and test the performance.

viirya · 2022-04-13T17:21:40Z

Hmm, @tustvold and @HaoYang670 's proposal looks good. I think It is good idea to follow up the standard library.

Bytes by default, with validation that it isn't splitting a codepoint
Additional Unsafe unchecked bytes APIs
Separate APIs for operating on chars.

About the proposal, my question is, do we really need two APIs for substring on bytes? Is the validation is necessary? As we have separate API for operating on chars, users should use it for utf8 cases, I think. What use cases it could be that substring is called to operate on bytes, but it still needs to make sure not messing up utf8 content?

HaoYang670 · 2022-04-14T01:54:07Z

Hi @viirya. In general:
Speed: substring by byte unchecked >= substring by byte checked >> substring by char
Safety: substring by char > substring by byte checked > substring by byte unchecked. (If the input array has an invalid utf8 string, substring by byte checked might not find it, but substring by char can always find it)
User-friendly: substring by char > substring by byte checked > substring by byte unchecked

That's why we need 3 versions of substring.

viirya · 2022-04-14T02:06:40Z

Thanks @HaoYang670 . It is clear from the above list, substring by byte checked is always in the middle in many aspects. So it makes me wondering if it is really needed? For safety case, we want to make sure valid utf8 result, we definitely choose substring by char. If we don't care about it, we use substring by byte unchecked. I'm not sure what substring by byte checked can provide us.

HaoYang670 · 2022-04-14T06:17:11Z

We can design a situation where substring by byte checked is the best choice:

Suppose we want to get the substrings with the byte length of the shortest string in the array. We could do in this way by using the length kernel and the aggregation kernel:

substring(string_array, start = 0, length = Some(min(length(string_array))))

Then, give an input string_array like this:

let input  = [
    "14 bytes  🗻",
    "This string has 24 bytes",
    "🗻 13 bytes"
];

We will get an invalid string array if using substring by byte unchecked, which is dangerous.
By using substring by byte checked we can find the error as early as possible.

HaoYang670 · 2022-04-20T01:25:07Z

We don't need to implement the unsafe substring based on the benchmark result of #1577. The utf-8 validation checking only decreases the speed by about 10%.

HaoYang670 added the enhancement Any new improvement worthy of a entry in the changelog label Apr 10, 2022

HaoYang670 mentioned this issue Apr 10, 2022

Improve doc string for substring kernel #1529

Merged

HaoYang670 mentioned this issue Apr 12, 2022

Mark the current substring function as unsafe and rename it. #1541

Closed

HaoYang670 mentioned this issue Apr 13, 2022

Do not merge! Rename the substring function and mark it as unsafe #1543

Closed

alamb mentioned this issue Apr 14, 2022

Add CI check for full validation mode #1546

Merged

4 tasks

HaoYang670 mentioned this issue Apr 17, 2022

Add utf-8 validation checking to substring kernel #1575

Closed

HaoYang670 mentioned this issue Apr 20, 2022

Add support for BinaryArray in substring kernel #1593

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Epic: enhance the `substring` kernel #1531

Epic: enhance the `substring` kernel #1531

HaoYang670 commented Apr 10, 2022 •

edited

alamb commented Apr 11, 2022

HaoYang670 commented Apr 13, 2022

alamb commented Apr 13, 2022

tustvold commented Apr 13, 2022 •

edited

alamb commented Apr 13, 2022 •

edited

HaoYang670 commented Apr 13, 2022 •

edited

HaoYang670 commented Apr 13, 2022 •

edited

viirya commented Apr 13, 2022

HaoYang670 commented Apr 14, 2022

viirya commented Apr 14, 2022 •

edited

HaoYang670 commented Apr 14, 2022 •

edited

HaoYang670 commented Apr 20, 2022

Epic: enhance the substring kernel #1531

Epic: enhance the substring kernel #1531

Comments

HaoYang670 commented Apr 10, 2022 • edited

List of individual tasks is below:

Performance

Safety

New Features

alamb commented Apr 11, 2022

HaoYang670 commented Apr 13, 2022

alamb commented Apr 13, 2022

tustvold commented Apr 13, 2022 • edited

alamb commented Apr 13, 2022 • edited

HaoYang670 commented Apr 13, 2022 • edited

HaoYang670 commented Apr 13, 2022 • edited

viirya commented Apr 13, 2022

HaoYang670 commented Apr 14, 2022

viirya commented Apr 14, 2022 • edited

HaoYang670 commented Apr 14, 2022 • edited

HaoYang670 commented Apr 20, 2022

Epic: enhance the `substring` kernel #1531

Epic: enhance the `substring` kernel #1531

HaoYang670 commented Apr 10, 2022 •

edited

tustvold commented Apr 13, 2022 •

edited

alamb commented Apr 13, 2022 •

edited

HaoYang670 commented Apr 13, 2022 •

edited

HaoYang670 commented Apr 13, 2022 •

edited

viirya commented Apr 14, 2022 •

edited

HaoYang670 commented Apr 14, 2022 •

edited