Grapheme cluster, normalization, collation #80

pr8x · 2021-08-04T19:21:28Z

Hey, are there any plans to implement more advanced text processing facilities like the ones mentioned above?

lemire · 2021-08-04T19:41:34Z

Yes!!! Absolutely.

MBkkt · 2023-07-09T10:36:53Z

@lemire
Hello, would you be interested in discussing an appropriate API/implementation for this?
I think I would be interested in implementing this for simdutf during my weekends, especially since there are no other volunteers in sight :)

I'm working on the iresearch library (an alternative to Lucene) and its integration into ArangoDB. Currently, for UTF-8 normalization, stemming, etc., we use ~~boost~~ text, which is very slow :(
We've made some patches for it, but I'm still not happy with its performance.

References for discussion:
https://github.com/tzlaine/text ~~boost~~ text
https://github.com/unicode-rs a set of Rust libraries
ICU, though it's already well known
