Grapheme cluster, normalization, collation #80

Open
pr8x opened this issue Aug 4, 2021 · 2 comments

Comments


pr8x commented Aug 4, 2021

Hey, are there any plans to implement more advanced text processing facilities like the ones mentioned in the title (grapheme clusters, normalization, collation)?

Member

lemire commented Aug 4, 2021

Yes!!! Absolutely.


MBkkt commented Jul 9, 2023

@lemire
Hello, would you be interested in discussing an appropriate API/implementation for this?
I think I would be interested in implementing this for simdutf on my weekends, especially since there are no other volunteers in sight :)

I'm working on the iresearch library (an alternative to Lucene) and its integration into ArangoDB. Currently, for UTF-8 normalization, stemming, etc., we use boost text, which is very slow :(
We've made some patches for it, but I'm still not happy with its performance.

References for discussion:
https://github.com/tzlaine/text (boost text)
https://github.com/unicode-rs (a set of Rust libraries)
ICU, though it's already well known
