ISO <-> UTF transcoding #159
Comments
We've talked about this before. It would be interesting to have a transcoder for the general case of a “single-byte, ASCII-based encoding.” I can try to do that once I'm done with the writeup.
Let me add that the idea should be credited to @clausecker.
@clausecker If you assume good AVX-512 support, it seems that vpermi2b would go a long way on this problem. Supporting it efficiently with AVX/NEON is a fun challenge.
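For readers following along, here is a scalar sketch of the table lookup that a byte-permute instruction could plausibly vectorize for a single-byte, ASCII-based encoding. It is not code from the project: `SingleByteTable`, `make_identity_table`, and `transcode_single_byte` are hypothetical names, and the table simply starts from the identity mapping (ISO/IEC 8859-1) so that individual entries can be overridden for another encoding.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical type: one UTF-16 code unit per possible input byte.
using SingleByteTable = std::array<char16_t, 256>;

// Identity mapping: byte value == code point, which is ISO/IEC 8859-1.
// Other ASCII-based single-byte encodings would override entries >= 0x80.
SingleByteTable make_identity_table() {
    SingleByteTable table{};
    for (std::size_t i = 0; i < table.size(); ++i) {
        table[i] = static_cast<char16_t>(i);
    }
    return table;
}

// Scalar lookup: one table read per input byte, one UTF-16 unit out.
std::u16string transcode_single_byte(const std::string& in,
                                     const SingleByteTable& table) {
    std::u16string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        out.push_back(table[c]);
    }
    return out;
}
```

As an illustration, setting `table[0xA4] = 0x20AC` gives the euro sign mapping used by ISO/IEC 8859-15; a full table for that or any other encoding would need to be filled in from its specification.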
Bun would use this. JavaScript strings are either Latin-1 or UTF-16. We frequently need to convert from UTF-8 (from disk/network) to either Latin-1 or UTF-16. Currently, we validate ASCII with errors. If the input is ASCII, we do a memcpy; if not, we convert to UTF-16 starting at the first non-ASCII character. This works okay.
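The flow described above can be sketched roughly as follows. This is not Bun's code: `JsString`, `first_non_ascii`, `append_utf8_as_utf16`, and `to_js_string` are hypothetical names, the loops are scalar, and the UTF-8 decoder assumes the input is already valid UTF-8 (it only guards against a truncated final sequence).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical result type: either a Latin-1 string or a UTF-16 string.
struct JsString {
    bool is_latin1;
    std::string latin1;
    std::u16string utf16;
};

// Index of the first byte >= 0x80, or in.size() if the input is pure ASCII.
std::size_t first_non_ascii(const std::string& in) {
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (static_cast<unsigned char>(in[i]) >= 0x80) return i;
    }
    return in.size();
}

// Decodes in[start..] as UTF-8 and appends UTF-16 to out.
// ASSUMPTION: the input is valid UTF-8; no full validation is done here.
void append_utf8_as_utf16(const std::string& in, std::size_t start,
                          std::u16string& out) {
    std::size_t i = start;
    while (i < in.size()) {
        std::uint32_t b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        if (b0 < 0x80)      { cp = b0;        len = 1; }
        else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }
        else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }
        else                { cp = b0 & 0x07; len = 4; }
        if (i + len > in.size()) break;  // truncated sequence: stop
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {  // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
}

JsString to_js_string(const std::string& utf8) {
    JsString result{true, std::string(), std::u16string()};
    std::size_t n = first_non_ascii(utf8);
    if (n == utf8.size()) {
        result.latin1 = utf8;  // pure ASCII is valid Latin-1: plain copy
        return result;
    }
    result.is_latin1 = false;
    result.utf16.reserve(utf8.size());
    for (std::size_t i = 0; i < n; ++i) {  // widen the ASCII prefix
        result.utf16.push_back(static_cast<char16_t>(utf8[i]));
    }
    append_utf8_as_utf16(utf8, n, result.utf16);  // decode from first non-ASCII byte
    return result;
}
```

Note that this always falls back to UTF-16 once a non-ASCII byte is seen, even when every code point would fit in Latin-1; a direct UTF-8 to Latin-1 transcoder would let such inputs stay in the one-byte representation.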
Feedback on the motivation for a feature is important to us.
Computing the UTF-8 size of a Latin 1 string quickly (AVX edition): https://lemire.me/blog/2023/02/16/computing-the-utf-8-size-of-a-latin-1-string-quickly-avx-edition/
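For context, the quantity in question has a simple scalar definition: each Latin-1 byte below 0x80 becomes one UTF-8 byte, and each byte at or above 0x80 becomes two. A minimal sketch follows; `latin1_utf8_size` is a hypothetical name, and this is the plain loop, not the AVX version from the post.

```cpp
#include <cstddef>
#include <cstdint>

// UTF-8 size of a Latin-1 buffer: the input length plus one extra byte
// for every input byte with the high bit set.
std::size_t latin1_utf8_size(const std::uint8_t* data, std::size_t len) {
    std::size_t extra = 0;
    for (std::size_t i = 0; i < len; ++i) {
        extra += data[i] >> 7;  // 1 if data[i] >= 0x80, else 0
    }
    return len + extra;
}
```

Knowing this size up front lets a caller allocate the output buffer exactly before transcoding.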
We currently fully support Latin1 (ISO/IEC 8859-1), the most popular ISO format, in our main branch. It is unclear whether we should extend this to other European ISO formats. My suspicion is that it would see little use. I am thinking about closing this issue.
The different ISO encodings can be transcoded to/from UTF formats.
https://en.m.wikipedia.org/wiki/ISO/IEC_8859-1
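As a concrete case, ISO/IEC 8859-1 maps each byte directly to the Unicode code point of the same value, so transcoding to UTF-8 needs at most two output bytes per input byte. The following is a minimal scalar sketch under that fact; `latin1_to_utf8` here is a hypothetical helper written for illustration, not the library's API.

```cpp
#include <string>

// Transcodes ISO/IEC 8859-1 (Latin-1) to UTF-8.
// Bytes below 0x80 are copied; bytes 0x80..0xFF become a two-byte sequence.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    out.reserve(latin1.size() * 2);  // worst case: every byte doubles
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // leading byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation byte
        }
    }
    return out;
}
```

Other ISO/IEC 8859 parts differ only in which code points the upper 128 byte values map to, so they would need a lookup table rather than this direct arithmetic.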