Any plan to add string iterator? #259

windowsair · 2023-07-02T03:18:40Z

It would be great to have some way to iterate over each UTF-8 character.

Are there plans to add this in the future?

lemire · 2023-07-02T20:26:07Z

Can you elaborate? If you mean iterating over the code point values, we do fast transcoding to UTF-32. After UTF-32 transcoding, iteration is trivial.

For large inputs, it might make sense to do block-wise UTF-32 decoding to avoid overwhelming the cache.

Byte-by-byte processing is inefficient and not something we want to encourage.
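To make the block-wise idea concrete, here is a minimal sketch, not the library's implementation. The names `to_utf32_scalar` and `for_each_code_point` are hypothetical, and the scalar decoder is only a stand-in for wherever a fast UTF-8 to UTF-32 transcoder would be called. It assumes the input has already been validated as UTF-8.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical scalar stand-in for a fast UTF-8 to UTF-32 transcoder.
// Assumes `in` is valid UTF-8 that does not end in the middle of a character.
// Returns the number of code points written to `out`.
inline std::size_t to_utf32_scalar(std::string_view in, char32_t* out) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        int len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        // Keep the payload bits of the lead byte (all 8 bits for ASCII).
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (int k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out[n++] = cp;
        i += len;
    }
    return n;
}

// Calls `visit` on every code point, decoding at most `block_bytes` bytes
// at a time so the UTF-32 buffer stays small.
template <typename Visit>
void for_each_code_point(std::string_view text, Visit&& visit,
                         std::size_t block_bytes = 4096) {
    if (block_bytes < 4) block_bytes = 4;      // room for one full character
    std::vector<char32_t> block(block_bytes);  // at most one code point per byte
    while (!text.empty()) {
        std::size_t take = std::min(text.size(), block_bytes);
        // Do not cut a character in half: back up while the first byte
        // after the block is a continuation byte (10xxxxxx).
        while (take < text.size() &&
               (static_cast<unsigned char>(text[take]) & 0xC0) == 0x80)
            --take;
        std::size_t n = to_utf32_scalar(text.substr(0, take), block.data());
        for (std::size_t i = 0; i < n; ++i) visit(block[i]);
        text.remove_prefix(take);
    }
}
```

The only subtle part is trimming each block back to a character boundary, so the transcoder never sees a truncated sequence.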

windowsair · 2023-07-03T11:56:08Z

Yes, I think what I want is to iterate over code point values or something like that.

char buf[] = "Foo © bar 𝌆 baz ☃ qux";

...

for (char8_t x: itor) {
    std::cout << x;  // and we can get each UTF8 character like "©", "𝌆", "☃"
}

A simple iterator is easy to implement. However, taking advantage of features such as SIMD may be difficult with byte-by-byte iteration.

lemire · 2023-07-03T13:09:54Z

Interesting.

WojciechMula · 2024-03-23T21:14:54Z

I think a character-wise iterator can still use a SIMD backend by silently transcoding bigger chunks of the input. The question is whether performance is key in such cases: if you want to analyse the input char by char, you need some extra processing of each character code anyway.

Anyway, we need to define whether it yields char32_t, uint32_t, or uint64_t (the last one for completeness).
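As a rough illustration of that design, the sketch below wraps chunked decoding behind a range that yields `char32_t`, so a range-based `for` sees one code point at a time. `CodePointRange` is a hypothetical name, not an existing API, and the scalar loop inside `refill_if_empty` only marks the spot where a SIMD transcoder could be called instead. It assumes valid UTF-8 input and C++17.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical range type: hands out char32_t values one by one while
// decoding the UTF-8 input chunk by chunk behind the scenes.
class CodePointRange {
public:
    explicit CodePointRange(std::string_view utf8, std::size_t chunk_bytes = 64)
        : rest_(utf8), chunk_(chunk_bytes < 4 ? 4 : chunk_bytes) {}

    struct Sentinel {};

    class Iterator {
    public:
        explicit Iterator(CodePointRange& r) : r_(r) { r_.refill_if_empty(); }
        char32_t operator*() const { return r_.buf_[r_.pos_]; }
        Iterator& operator++() {
            ++r_.pos_;
            r_.refill_if_empty();
            return *this;
        }
        // Iteration ends once the buffer is drained and no input remains.
        bool operator!=(Sentinel) const { return r_.pos_ < r_.buf_.size(); }

    private:
        CodePointRange& r_;
    };

    Iterator begin() { return Iterator(*this); }
    Sentinel end() { return {}; }

private:
    static unsigned char byte_at(std::string_view s, std::size_t i) {
        return static_cast<unsigned char>(s[i]);
    }

    void refill_if_empty() {
        if (pos_ < buf_.size() || rest_.empty()) return;
        std::size_t take = rest_.size() < chunk_ ? rest_.size() : chunk_;
        // Trim the chunk so it ends on a character boundary.
        while (take < rest_.size() && (byte_at(rest_, take) & 0xC0) == 0x80)
            --take;
        buf_.clear();
        pos_ = 0;
        // Scalar stand-in; a SIMD UTF-8 to UTF-32 routine could fill buf_ here.
        for (std::size_t i = 0; i < take;) {
            unsigned char b = byte_at(rest_, i);
            int len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
            char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
            for (int k = 1; k < len; ++k)
                cp = (cp << 6) | (byte_at(rest_, i + k) & 0x3F);
            buf_.push_back(cp);
            i += len;
        }
        rest_.remove_prefix(take);
    }

    std::string_view rest_;       // input not yet decoded
    std::size_t chunk_;           // maximum bytes decoded per refill
    std::vector<char32_t> buf_;   // code points of the current chunk
    std::size_t pos_ = 0;         // next code point to yield from buf_
};
```

A caller would write `for (char32_t cp : CodePointRange(text)) { ... }`. The range is single-pass, since the buffer is overwritten on each refill.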

It would be good to take a look at how the Go runtime handles iteration over UTF-8.
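For reference, ranging over a string in Go yields each code point together with its byte offset, and an invalid byte produces U+FFFD and advances by exactly one byte. The sketch below mimics that behaviour in C++. `Rune` and `decode_rune` are hypothetical names chosen for illustration, not part of any existing API.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical result type: the decoded code point and the bytes it consumed.
struct Rune {
    char32_t value;
    std::size_t width;
};

// Decodes the first code point of `s`, following Go's convention:
// empty input gives {U+FFFD, 0}, invalid input gives {U+FFFD, 1}.
inline Rune decode_rune(std::string_view s) {
    const Rune bad{0xFFFD, 1};
    if (s.empty()) return {0xFFFD, 0};
    auto byte = [&](std::size_t i) { return static_cast<unsigned char>(s[i]); };
    unsigned char b0 = byte(0);
    if (b0 < 0x80) return {b0, 1};  // ASCII fast path

    std::size_t len;
    char32_t cp;
    char32_t min;  // smallest value allowed for this length (rejects overlongs)
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return bad;  // stray continuation byte or invalid lead byte

    if (s.size() < len) return bad;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        if ((byte(k) & 0xC0) != 0x80) return bad;
        cp = (cp << 6) | (byte(k) & 0x3F);
    }
    // Reject overlong forms, values past U+10FFFF, and surrogates.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return bad;
    return {cp, len};
}
```

Iteration then becomes `for (std::size_t i = 0; i < s.size(); i += r.width)` with `r = decode_rune(s.substr(i))` at the top of each step. Unlike the transcoding approach, this tolerates invalid input without a separate validation pass.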
