ISO <-> UTF transcoding #159

Open
lemire opened this issue Aug 1, 2022 · 7 comments

@lemire
Member

lemire commented Aug 1, 2022

The different ISO encodings can be transcoded to/from UTF formats.

https://en.m.wikipedia.org/wiki/ISO/IEC_8859-1

@clausecker
Collaborator

We've talked about this before. It would be interesting to have a transcoder for the general case “single byte ASCII based encoding.” I can try to do that once I'm done with the writeup.
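To make the "single byte ASCII based encoding" idea concrete, here is a minimal scalar sketch, not the library's implementation. It assumes the encoding agrees with ASCII below 0x80 and is fully described by a 128-entry table for the upper half. The names `HighTable`, `latin1_table`, and `single_byte_to_utf16` are hypothetical and exist only for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical type: UTF-16 code units for bytes 0x80..0xFF of a
// single-byte, ASCII-based encoding (index = byte - 0x80).
using HighTable = std::array<char16_t, 128>;

// ISO/IEC 8859-1 maps every byte to the code point of the same value,
// so its table is the identity on 0x80..0xFF.
inline HighTable latin1_table() {
    HighTable t{};
    for (std::size_t i = 0; i < t.size(); ++i) {
        t[i] = static_cast<char16_t>(0x80 + i);
    }
    return t;
}

// Transcode one byte at a time: ASCII passes through unchanged,
// everything else goes through the table.
inline std::u16string single_byte_to_utf16(std::string_view in,
                                           const HighTable& high) {
    std::u16string out;
    out.reserve(in.size());  // exactly one code unit per input byte
    for (unsigned char c : in) {
        out.push_back(c < 0x80 ? static_cast<char16_t>(c) : high[c - 0x80]);
    }
    return out;
}
```

Another encoding only needs a different table. For example, ISO/IEC 8859-15 places the euro sign (U+20AC) at byte 0xA4, so starting from the Latin-1 table and overriding a handful of entries would cover it.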

@lemire
Member Author

lemire commented Aug 1, 2022

Let me add that the idea should be credited to @clausecker

@lemire
Member Author

lemire commented Aug 2, 2022

@clausecker If you assume good AVX-512 support, it seems that vpermi2b would go a long way on this problem.

Supporting it efficiently with AVX/NEON is a fun challenge.

@Jarred-Sumner

Bun would use this. JavaScript strings are either Latin-1 or UTF-16. We frequently need to convert from UTF-8 (from disk/network) to either Latin-1 or UTF-16. Currently, we validate ASCII with errors. If the input is ASCII, we do a memcpy; if not, we convert to UTF-16 starting at the first non-ASCII character. This works okay.
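The flow described above could be sketched roughly as follows. This is a scalar illustration, not Bun's code: `JsString`, `first_non_ascii`, and `from_utf8` are hypothetical names, and the decoder assumes the input is already valid UTF-8 (no error handling).

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <variant>

// Hypothetical result type: either 8-bit (ASCII/Latin-1) or UTF-16 storage.
using JsString = std::variant<std::string, std::u16string>;

// Index of the first byte >= 0x80, or in.size() if the input is pure ASCII.
inline std::size_t first_non_ascii(std::string_view in) {
    std::size_t i = 0;
    while (i < in.size() && static_cast<unsigned char>(in[i]) < 0x80) ++i;
    return i;
}

// Assumption: `in` is valid UTF-8. Invalid input is not handled here.
inline JsString from_utf8(std::string_view in) {
    const std::size_t pos = first_non_ascii(in);
    if (pos == in.size()) return std::string(in);  // pure ASCII: plain copy

    std::u16string out;
    out.reserve(in.size());
    // The ASCII prefix is widened byte by byte.
    for (std::size_t i = 0; i < pos; ++i) {
        out.push_back(static_cast<char16_t>(in[i]));
    }
    // Decode the rest, starting at the first non-ASCII byte.
    std::size_t i = pos;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (b < 0x80)      { cp = b;        len = 1; }
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }
        else               { cp = b & 0x07; len = 4; }
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {  // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

A direct UTF-8 to Latin-1 path would let inputs whose code points all fit below U+0100 stay in 8-bit storage instead of being widened to UTF-16.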

@lemire
Member Author

lemire commented Dec 18, 2022

Feedback as to the motivation of a feature is important to us.

@lemire
Member Author

lemire commented Feb 16, 2023

Computing the UTF-8 size of a Latin 1 string quickly (AVX edition) https://lemire.me/blog/2023/02/16/computing-the-utf-8-size-of-a-latin-1-string-quickly-avx-edition/
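For reference, a scalar baseline of the same computation (not the AVX code from the blog post): each Latin-1 byte below 0x80 becomes one UTF-8 byte and every other byte becomes two, so the UTF-8 size is the input length plus the number of bytes with the high bit set. The function names below are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// UTF-8 size of a Latin-1 string: length plus the count of bytes >= 0x80.
inline std::size_t utf8_length_from_latin1(std::string_view in) {
    std::size_t n = in.size();
    for (unsigned char c : in) n += c >> 7;  // adds 1 when the high bit is set
    return n;
}

// Matching conversion, using the size above to allocate exactly once.
inline std::string latin1_to_utf8(std::string_view in) {
    std::string out;
    out.reserve(utf8_length_from_latin1(in));
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            // Code points U+0080..U+00FF take two bytes: 110000xx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```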

@lemire
Member Author

lemire commented Oct 5, 2023

We currently fully support Latin1 (ISO/IEC 8859-1), the most popular ISO format, in our main branch.

It is unclear whether we should extend this to other European ISO formats. My suspicion is that it would see little use.

I am thinking about closing this issue.
