Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using generic function to compute alphabet score for Cyrillic and latin #115

Closed
wants to merge 11 commits into from

Conversation

egorgrachev
Copy link

#111

it degrades performance a bit, mostly because we re-compute alphabet lang map instead of having static variable. It's probably possible to optimise

on my pc cargo bench went from 5kk to 8kk ns

@greyblake
Copy link
Owner

Ah, damn. Sorry! I've created that issue mostly as a reminder for myself.
I've addressed it in #116, I will take a look at your PR when I have time and fresh, but likely I will have to bounce it, because the whole idea of that refactoring was actually to make cyrillic::alphabet_calculate_scores to improve performance by leveraging optimization that was implemented for latin langs.

@egorgrachev
Copy link
Author

No problem, I didn't notice #116
I see that you pre-compute lang_map, I thought about the same when I mentioned that "It's probably possible to optimise"

Most likely no need to merge my PR then :)

Also for my understanding - how important the performance for you in this code?
The algo code is not very easy to understand, but it may be done way cleaner with HashSet / HashMap (or BTreeSet / TreeMap) for char-langs, char_scores and lang_scores.
I profiled the code a bit and for me most of the time were spent in get_all_chars_in_langs and get_alphabet_lang_map. I can take a look at refactoring to simplify / improve readability if you think it make sense to do so

@greyblake
Copy link
Owner

@egorgrachev Thank you for understanding.

Also for my understanding - how important the performance for you in this code?
It's crucial. There were number of optimization from different people since the project started in 2016.
The performance is crucial, because whatlang kind of competes with https://github.com/pemistahl/lingua-rs. Due to the model lingua uses it has better accuracy, so what whatlang can offer is:

Generally I would prefer readability over performance and avoid premature optimizations.
But this project is a rare project of mine, where I'd like to squeeze the maxium from hardware. That said I'd like to keep the codebase well documented and understandable.

Whatlang is also used by Sonic search engine.
Guys from meilisearch also recently contributed so I'd assume they're using Whatlang already or considering using it for Melisearch.

I am closing this PR in favor of #116

@greyblake greyblake closed this May 1, 2022
@greyblake
Copy link
Owner

@egorgrachev If you're interested in further contributions, let's stay in touch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants