Add fast function to characterize a UTF-8 string #423
Comments
For the purpose of implementing the ASCII-only fast path you described in your blog post, the proposed function does too much work. I think there are two potential functions that are useful:
Also note that function 2 is only useful if you are ready to implement your own UTF-8 decoding (or hook heavily into the Python internals to fill the allocated buffer). That's a lot trickier than it sounds initially, because there are a ton of edge cases that Python currently handles in an implementation-specific way: invalid UTF-8 sequences, encoded surrogates (which are illegal in UTF-8), code points above 1114111 (which fit the UTF-8 bit pattern but are invalid in Unicode strings), etc. Anyway, I think the function you proposed does either too much work or too little.
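To make two of those edge cases concrete, here is a minimal sketch, not taken from simdutf or CPython. The names `scalar_check` and `classify_utf8_sequence` are hypothetical. It decodes a single 3- or 4-byte sequence by bit pattern alone and reports whether the result is an encoded surrogate or lies above 1114111 (0x10FFFF). It deliberately ignores overlong encodings and 1- and 2-byte sequences, so it is not a validator.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical classification result, for illustration only.
enum class scalar_check { ok, surrogate, too_large, malformed };

// Decodes exactly one 3- or 4-byte sequence by its bit pattern and checks
// the two range conditions discussed above. Overlong forms are NOT detected.
inline scalar_check classify_utf8_sequence(const unsigned char *p, size_t len) {
  const bool three = (len == 3) && ((p[0] & 0xF0) == 0xE0);
  const bool four = (len == 4) && ((p[0] & 0xF8) == 0xF0);
  if (!three && !four) {
    return scalar_check::malformed;
  }
  // Every trailing byte must look like 10xxxxxx.
  for (size_t i = 1; i < len; i++) {
    if ((p[i] & 0xC0) != 0x80) {
      return scalar_check::malformed;
    }
  }
  // Payload bits of the lead byte: 4 bits for a 3-byte lead, 3 bits for a 4-byte lead.
  uint32_t cp = three ? uint32_t(p[0] & 0x0F) : uint32_t(p[0] & 0x07);
  for (size_t i = 1; i < len; i++) {
    cp = (cp << 6) | uint32_t(p[i] & 0x3F);
  }
  if (cp >= 0xD800 && cp <= 0xDFFF) {
    return scalar_check::surrogate; // e.g. ED A0 80 decodes to U+D800
  }
  if (cp > 0x10FFFF) {
    return scalar_check::too_large; // e.g. F4 90 80 80 decodes to 0x110000
  }
  return scalar_check::ok;
}
```

Both problem inputs have a perfectly regular lead/continuation shape, which is why a decoder that only looks at byte patterns would accept them.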
@maksverver We already have an ASCII checker:

/**
* Validate the ASCII string and stop on error.
*
* Overridden by each implementation.
*
* @param buf the ASCII string to validate.
* @param len the length of the string in bytes.
* @return a result pair struct (of type simdutf::error containing the two fields error and count) with an error code and either position of the error (in the input in code units) if any, or the number of code units validated if successful.
*/
simdutf_warn_unused result validate_ascii_with_errors(const char *buf, size_t len) noexcept;

Your second proposal is interesting. Let us do that!
We should quickly return whether the maximal code point value is no larger than...
This is useful for the PyUnicode_New function in Python.
The function should return 1114111 as soon as a value exceeding 65535 is found.
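A scalar sketch of such a function might look like the following. It is an assumption-laden illustration, not the simdutf implementation: the name `utf8_max_code_point_bucket` is hypothetical, the input is assumed to be already-validated UTF-8, and the four return values (127, 255, 65535, 1114111) are taken from the `maxchar` ranges documented for `PyUnicode_New`. Only lead bytes need to be inspected, because the lead byte alone determines which range the code point falls in.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of simdutf. Assumes `buf` holds VALID UTF-8.
// Returns the smallest of 127, 255, 65535, 1114111 that is >= every code
// point in the input (127 for an empty or pure-ASCII input).
inline uint32_t utf8_max_code_point_bucket(const char *buf, size_t len) {
  uint32_t bucket = 127;
  for (size_t i = 0; i < len; i++) {
    const uint8_t b = static_cast<uint8_t>(buf[i]);
    if (b >= 0xF0) {
      // 4-byte lead: code point is at least U+10000, so exit immediately.
      return 1114111;
    }
    if (b >= 0xC4) {
      // 2-byte leads 0xC4..0xDF encode U+0100..U+07FF and
      // 3-byte leads 0xE0..0xEF encode U+0800..U+FFFF.
      bucket = 65535;
    } else if (b >= 0xC2 && bucket < 255) {
      // Leads 0xC2 and 0xC3 encode U+0080..U+00FF.
      bucket = 255;
    }
    // ASCII bytes and continuation bytes (0x80..0xBF) never raise the bucket.
  }
  return bucket;
}
```

For example, `utf8_max_code_point_bucket("\xC3\xA9", 2)` (the letter é) falls in the 255 range, while any input containing a 4-byte sequence returns 1114111 without scanning the rest of the buffer. A vectorized version would presumably apply the same lead-byte comparisons to a whole register at a time.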