Any plan to add string iterator? #259

Open
windowsair opened this issue Jul 2, 2023 · 4 comments
Comments

@windowsair

It would be great if there was some way to iterate over something like each UTF8 character.

Are there plans to add this in the future?

@lemire
Member

lemire commented Jul 2, 2023

Can you elaborate? If you mean iterating over the code point values, we provide fast transcoding to UTF-32. After transcoding to UTF-32, iteration is trivial.

For large inputs, it might make sense to do block-wise UTF-32 decoding to avoid overwhelming the cache.

Byte-by-byte processing is inefficient and not something we want to encourage.
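The block-wise idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: `decode_block` and `for_each_code_point` are hypothetical names, the input is assumed to be valid UTF-8 already, and `decode_block` is a plain scalar stand-in so the example compiles on its own. In a real program it would be replaced by a bulk transcoder such as the library's `convert_utf8_to_utf32`.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <vector>

// Scalar stand-in for a bulk UTF-8 -> UTF-32 transcoder (hypothetical helper).
// Assumes `in` is valid UTF-8 and ends on a code point boundary.
// Returns the number of code points written to `out`.
inline std::size_t decode_block(std::string_view in, char32_t* out) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        // Payload bits of the lead byte: 7, 5, 4 or 3 bits depending on length.
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out[n++] = cp;
        i += len;
    }
    return n;
}

// Decode `utf8` in blocks of roughly `block_bytes` bytes and call `visit`
// once per code point, so the UTF-32 buffer stays small and cache-friendly.
template <typename F>
void for_each_code_point(std::string_view utf8, std::size_t block_bytes, F&& visit) {
    block_bytes = std::max<std::size_t>(block_bytes, 4);  // room for one full sequence
    std::vector<char32_t> buf(block_bytes);  // at most one code point per input byte
    while (!utf8.empty()) {
        std::size_t take = std::min(block_bytes, utf8.size());
        // Do not cut a code point in half: move the cut back while the byte
        // just after it is a continuation byte (10xxxxxx).
        while (take > 0 && take < utf8.size() &&
               (static_cast<unsigned char>(utf8[take]) & 0xC0) == 0x80)
            --take;
        if (take == 0) return;  // only reachable on malformed input
        std::size_t n = decode_block(utf8.substr(0, take), buf.data());
        for (std::size_t i = 0; i < n; ++i) visit(buf[i]);
        utf8.remove_prefix(take);
    }
}
```

A caller would pass a lambda, for example `for_each_code_point(text, 4096, [](char32_t cp) { /* use cp */ });`. The block size is a tuning knob, not a recommendation.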

@windowsair
Author

Yes, I think what I want is to iterate over code point values or something like that.

char buf[] = "Foo © bar 𝌆 baz ☃ qux";

...

for (char8_t x: itor) {
    std::cout << x;  // and we can get each UTF8 character like "©", "𝌆", "☃"
}

It is easy to implement a simple iterator. However, taking advantage of features such as SIMD may be difficult with byte-by-byte iteration.
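For reference, a simple scalar iterator of the kind described here might look like the sketch below. It is a hypothetical helper, not something the library provides: `utf8_chars` is an invented name, the input is assumed to be valid UTF-8, and each character is yielded as a `std::string_view` slice because a single `char8_t` cannot hold a multi-byte character such as "©".

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string_view>

// Hypothetical range over the characters (code points) of a UTF-8 string.
// Each element is a view of the 1 to 4 bytes that encode one code point.
// Assumes the underlying bytes are valid UTF-8.
class utf8_chars {
public:
    explicit utf8_chars(std::string_view s) : s_(s) {}

    class iterator {
    public:
        using value_type = std::string_view;
        using difference_type = std::ptrdiff_t;
        using iterator_category = std::input_iterator_tag;

        iterator(std::string_view s, std::size_t pos) : s_(s), pos_(pos) {}

        std::string_view operator*() const { return s_.substr(pos_, len()); }
        iterator& operator++() { pos_ += len(); return *this; }
        bool operator==(const iterator& o) const { return pos_ == o.pos_; }
        bool operator!=(const iterator& o) const { return pos_ != o.pos_; }

    private:
        // Sequence length from the lead byte, clamped so a truncated tail
        // never moves the position past the end of the input.
        std::size_t len() const {
            unsigned char b = static_cast<unsigned char>(s_[pos_]);
            std::size_t n = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
            return std::min(n, s_.size() - pos_);
        }
        std::string_view s_;
        std::size_t pos_;
    };

    iterator begin() const { return iterator(s_, 0); }
    iterator end() const { return iterator(s_, s_.size()); }

private:
    std::string_view s_;
};
```

With this in place, `for (std::string_view ch : utf8_chars(buf)) std::cout << ch;` would print one character per iteration. It walks the input one lead byte at a time, so it gains nothing from SIMD.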

@lemire
Member

lemire commented Jul 3, 2023

Interesting.

@WojciechMula
Collaborator

I think a character-wise iterator can still use a SIMD backend: it can silently transcode bigger chunks of the input. The question is whether performance is key in such cases. If you want to analyse the input char by char, you need some extra processing of each character code anyway.

In any case, we need to define whether it yields char32_t, uint32_t, or uint64_t (the last one for completeness).

It would be good to take a look at how the Go runtime handles iteration over UTF-8.


3 participants