
Add fast function to characterize a UTF-8 string #423

Open
lemire opened this issue May 9, 2024 · 3 comments
Comments

@lemire
Member

lemire commented May 9, 2024

We should quickly return whether the maximal code point value is no larger than...

  • 127
  • 255
  • 65535
  • 1114111 (fallback case).

This is useful for the PyUnicode_New function in Python.

The function should return 1114111 as soon as a value exceeding 65535 is found.
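The proposal can be sketched in scalar code. Assuming the input is already known to be valid UTF-8, the lead byte of each sequence is enough to classify the code point, so no full decoding is needed. This is an illustrative sketch (the function name `utf8_max_codepoint_class` is hypothetical, not part of simdutf):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch, not simdutf API: classify the maximal code point of a
// string assumed to be valid UTF-8. Returns 127, 255, 65535 or 1114111.
uint32_t utf8_max_codepoint_class(const char *buf, size_t len) {
  uint32_t result = 127;
  for (size_t i = 0; i < len;) {
    uint8_t lead = uint8_t(buf[i]);
    if (lead < 0x80) {            // ASCII byte
      i += 1;
    } else if (lead < 0xE0) {     // 2-byte sequence: U+0080..U+07FF
      // leads 0xC2..0xC3 encode U+0080..U+00FF (Latin-1)
      uint32_t cls = (lead <= 0xC3) ? 255u : 65535u;
      if (result < cls) result = cls;
      i += 2;
    } else if (lead < 0xF0) {     // 3-byte sequence: U+0800..U+FFFF
      if (result < 65535) result = 65535;
      i += 3;
    } else {                      // 4-byte sequence: above U+FFFF
      return 1114111;             // early exit, as proposed
    }
  }
  return result;
}
```

A real implementation in this library would of course vectorize the scan rather than branch per byte.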

@lemire
Member Author

lemire commented May 9, 2024

@maksverver

For the purpose of implementing the ASCII-only fast path you described in your blog post, the proposed function does too much work.

I think there are two potential functions that are useful:

  1. A function that determines if a string is ASCII-only. This is much easier to implement than categorizing max_val, which in the general case requires decoding UTF-8 sequences, and it can return even earlier (as soon as any byte with the top bit set is found).

  2. A function that categorizes max_val and also the decoded length in Unicode codepoints. (These could be two separate functions as well, but it seems likely it's more efficient to combine them.) The length is necessary because PyUnicode_New() needs both size and max_val to allocate the destination buffer.
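The second option might look like the following single-pass sketch over valid UTF-8, returning both values PyUnicode_New needs. The struct and function names are illustrative assumptions, not simdutf API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct utf8_stats {
  size_t codepoints;   // decoded length in Unicode code points
  uint32_t max_class;  // 127, 255, 65535 or 1114111
};

// Hypothetical sketch: one pass over valid UTF-8 that counts code points and
// buckets the maximal code point, so the caller can size and type its buffer.
utf8_stats utf8_characterize(const char *buf, size_t len) {
  utf8_stats s{0, 127};
  for (size_t i = 0; i < len; ++s.codepoints) {
    uint8_t lead = uint8_t(buf[i]);
    if (lead < 0x80) {            // ASCII
      i += 1;
    } else if (lead < 0xE0) {     // 2-byte; leads 0xC2..0xC3 stay within Latin-1
      uint32_t cls = (lead <= 0xC3) ? 255u : 65535u;
      if (s.max_class < cls) s.max_class = cls;
      i += 2;
    } else if (lead < 0xF0) {     // 3-byte: up to U+FFFF
      if (s.max_class < 65535) s.max_class = 65535;
      i += 3;
    } else {                      // 4-byte: above U+FFFF
      s.max_class = 1114111;     // class is settled, but keep counting length
      i += 4;
    }
  }
  return s;
}
```

Note that, unlike the max_val-only proposal, this version cannot stop early on a 4-byte sequence because the code point count is still needed.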

Also note that function 2 is only useful if you are ready to implement your own UTF-8 decoding (or hook heavily into the Python internals to fill the allocated buffer). That is trickier than it initially sounds, because there are many edge cases that Python currently handles in an implementation-specific way: invalid UTF-8 sequences, surrogate code points (which are illegal in UTF-8), code points above 1114111 (which are encodable in UTF-8's scheme but invalid in Unicode strings), and so on.

Anyway, I think the function you proposed either does too much work, or too little.

@lemire
Member Author

lemire commented May 15, 2024

@maksverver We already have an ascii checker...

/**
 * Validate the ASCII string and stop on error.
 *
 * Overridden by each implementation.
 *
 * @param buf the ASCII string to validate.
 * @param len the length of the string in bytes.
 * @return a result pair struct (of type simdutf::error containing the two fields error and count) with an error code and either position of the error (in the input in code units) if any, or the number of code units validated if successful.
 */
simdutf_warn_unused result validate_ascii_with_errors(const char *buf, size_t len) noexcept;
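For reference, the semantics described in that doc comment amount to the following scalar check. This is a plain illustration of the contract (the `ascii_result` struct here mimics the error/count pair, but is not the library's `simdutf::result` type):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar sketch of what an ASCII validator with error reporting computes:
// either "all bytes validated" or the position of the first non-ASCII byte.
struct ascii_result {
  bool ok;       // true when every byte is ASCII
  size_t count;  // bytes validated, or index of the first offending byte
};

ascii_result check_ascii(const char *buf, size_t len) {
  for (size_t i = 0; i < len; ++i) {
    if (uint8_t(buf[i]) >= 0x80) return {false, i};  // stop on error
  }
  return {true, len};
}
```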

Your second proposal is interesting. Let us do that!!!
