
Add fast function to characterize a UTF-8 string #423

Open
lemire opened this issue May 9, 2024 · 3 comments
Comments

@lemire
Member

lemire commented May 9, 2024

We should quickly return whether the maximal code point value is no larger than...

  • 127
  • 255
  • 65535
  • 1114111 (fallback case).

This is useful for the PyUnicode_New function in Python.

The function should return 1114111 as soon as a value exceeding 65535 is found.
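The proposal can be sketched in scalar code. Assuming the input is already known to be valid UTF-8, the lead byte of each sequence is enough to classify the code point, so no full decoding is needed. This is an illustrative sketch (the function name `utf8_max_codepoint_class` is hypothetical, not part of simdutf):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch, not simdutf API: classify the maximal code point of a
// string assumed to be valid UTF-8. Returns 127, 255, 65535 or 1114111.
uint32_t utf8_max_codepoint_class(const char *buf, size_t len) {
  uint32_t result = 127;
  for (size_t i = 0; i < len;) {
    uint8_t lead = uint8_t(buf[i]);
    if (lead < 0x80) {            // ASCII byte
      i += 1;
    } else if (lead < 0xE0) {     // 2-byte sequence: U+0080..U+07FF
      // leads 0xC2..0xC3 encode U+0080..U+00FF (Latin-1)
      uint32_t cls = (lead <= 0xC3) ? 255u : 65535u;
      if (result < cls) result = cls;
      i += 2;
    } else if (lead < 0xF0) {     // 3-byte sequence: U+0800..U+FFFF
      if (result < 65535) result = 65535;
      i += 3;
    } else {                      // 4-byte sequence: above U+FFFF
      return 1114111;             // early exit, as proposed
    }
  }
  return result;
}
```

A real implementation in this library would of course vectorize the scan rather than branch per byte.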

@lemire
Member Author

lemire commented May 9, 2024

@maksverver

For the purpose of implementing the ASCII-only fast path you described in your blog post, the proposed function does too much work.

I think there are two potential functions that are useful:

  1. A function that determines if a string is ASCII-only. This is much easier to implement than categorizing max_val, which in the general case requires decoding UTF-8 sequences, and it can return even earlier (as soon as any byte with the top bit set is found).

  2. A function that categorizes max_val and also the decoded length in Unicode codepoints. (These could be two separate functions as well, but it seems likely it's more efficient to combine them.) The length is necessary because PyUnicode_New() needs both size and max_val to allocate the destination buffer.
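The second option might look like the following single-pass sketch over valid UTF-8, returning both values PyUnicode_New needs. The struct and function names are illustrative assumptions, not simdutf API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct utf8_stats {
  size_t codepoints;   // decoded length in Unicode code points
  uint32_t max_class;  // 127, 255, 65535 or 1114111
};

// Hypothetical sketch: one pass over valid UTF-8 that counts code points and
// buckets the maximal code point, so the caller can size and type its buffer.
utf8_stats utf8_characterize(const char *buf, size_t len) {
  utf8_stats s{0, 127};
  for (size_t i = 0; i < len; ++s.codepoints) {
    uint8_t lead = uint8_t(buf[i]);
    if (lead < 0x80) {            // ASCII
      i += 1;
    } else if (lead < 0xE0) {     // 2-byte; leads 0xC2..0xC3 stay within Latin-1
      uint32_t cls = (lead <= 0xC3) ? 255u : 65535u;
      if (s.max_class < cls) s.max_class = cls;
      i += 2;
    } else if (lead < 0xF0) {     // 3-byte: up to U+FFFF
      if (s.max_class < 65535) s.max_class = 65535;
      i += 3;
    } else {                      // 4-byte: above U+FFFF
      s.max_class = 1114111;     // class is settled, but keep counting length
      i += 4;
    }
  }
  return s;
}
```

Note that, unlike the max_val-only proposal, this version cannot stop early on a 4-byte sequence because the code point count is still needed.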

Also note that function 2 is only useful if you are ready to implement your own UTF-8 decoding (or hook heavily into the Python internals to fill the allocated buffer). That is trickier than it initially sounds, because there are many edge cases that Python currently handles in an implementation-specific way: invalid UTF-8 sequences, surrogate code points (which are illegal in UTF-8), code points above 1114111 (which are encodable in UTF-8's scheme but invalid in Unicode strings), and so on.

Anyway, I think the function you proposed either does too much work, or too little.

@lemire
Member Author

lemire commented May 15, 2024

@maksverver We already have an ascii checker...

/**
 * Validate the ASCII string and stop on error.
 *
 * Overridden by each implementation.
 *
 * @param buf the ASCII string to validate.
 * @param len the length of the string in bytes.
 * @return a result pair struct (of type simdutf::error containing the two fields error and count) with an error code and either position of the error (in the input in code units) if any, or the number of code units validated if successful.
 */
simdutf_warn_unused result validate_ascii_with_errors(const char *buf, size_t len) noexcept;
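For reference, the semantics described in that doc comment amount to the following scalar check. This is a plain illustration of the contract (the `ascii_result` struct here mimics the error/count pair, but is not the library's `simdutf::result` type):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar sketch of what an ASCII validator with error reporting computes:
// either "all bytes validated" or the position of the first non-ASCII byte.
struct ascii_result {
  bool ok;       // true when every byte is ASCII
  size_t count;  // bytes validated, or index of the first offending byte
};

ascii_result check_ascii(const char *buf, size_t len) {
  for (size_t i = 0; i < len; ++i) {
    if (uint8_t(buf[i]) >= 0x80) return {false, i};  // stop on error
  }
  return {true, len};
}
```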

Your second proposal is interesting. Let us do that!!!
