[Proposal] Performance improvements in loops #111

Closed
adbar opened this issue Sep 20, 2021 · 3 comments · Fixed by #113
Labels
enhancement New feature or request

Comments

adbar (Contributor) commented Sep 20, 2021

Is your feature request related to a problem? Please describe.
Hi, I was wondering if it would be possible to improve the performance of certain loops. For example, you do use list comprehensions, but not everywhere. Since you have a speed benchmark, you could check whether it pays off in the comparison with chardet.

Describe the solution you'd like
Here are loops where things could be improved:

Additional context
I could help work on a PR if you're interested.
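
For illustration, the kind of rewrite I have in mind looks like this (a minimal sketch with made-up names, not actual charset_normalizer code):

```python
def printable_chars_loop(decoded: str) -> list:
    # Explicit loop with repeated .append() calls.
    results = []
    for character in decoded:
        if character.isprintable():
            results.append(character)
    return results


def printable_chars_comprehension(decoded: str) -> list:
    # Same result as a list comprehension: fewer bytecode operations per iteration.
    return [character for character in decoded if character.isprintable()]
```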

adbar added the enhancement (New feature or request) label on Sep 20, 2021
Ousret (Collaborator) commented Sep 20, 2021

Hi,

I am open to suggestions/PRs. The current performance is already pretty good for its category.
That said, I am thinking of ways to improve it further; I have a few ideas.

In my opinion, list comprehensions lower readability when used everywhere.
The samples you extracted are not really a concern performance-wise.

30 to 40 % of the time is spent in the charset_normalizer.md (mess-detector) module.
Either optimize the call tree starting from the mess_ratio function, find ways to reduce the dependence on it, or update the current MD plugin(s). Whatever I try outside of that scope results in meaningless performance gains.

I am convinced that with a substantial effort we could make it up to two times faster while keeping the same accuracy.
On another note, performance optimization won't matter if I am not convinced that we can ensure stability.
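
For reference, the 30-40 % figure can be reproduced with a quick profiling run along these lines (a sketch only; it assumes the public from_bytes() entry point and a hypothetical sample.html payload):

```python
import cProfile
import pstats

from charset_normalizer import from_bytes

with open("sample.html", "rb") as fp:  # any reasonably large byte payload
    payload = fp.read()

profiler = cProfile.Profile()
profiler.enable()
from_bytes(payload).best()
profiler.disable()

# Sort by cumulative time; functions from charset_normalizer/md.py (mess_ratio
# and the MD plugins) should account for a large share of the total.
pstats.Stats(profiler).sort_stats("cumulative").print_stats("charset_normalizer")
```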

adbar (Contributor, Author) commented Sep 21, 2021

Hi @Ousret, I totally get your point. I have started refactoring the code in #113; feel free to amend the PR.

A clear problem in md.py, IMHO, is the long series of separate variable assignments (e.g. the resets). They could be collapsed into a single multiple assignment, which may be a bit faster, or handled some other way; what do you think? See the sketch below for what I mean.
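
Roughly this, with made-up counter names rather than the actual md.py attributes:

```python
class MessPluginSketch:
    """Toy stand-in for an md.py plugin; the counter names are hypothetical."""

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        # Instead of one statement per counter:
        #   self._punctuation_count = 0
        #   self._symbol_count = 0
        #   self._character_count = 0
        # collapse the resets into a single multiple (tuple) assignment.
        self._punctuation_count, self._symbol_count, self._character_count = 0, 0, 0
```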
Concerning character counts, maybe len() over a list of characters or something similar?
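
Something along these lines, again with a stand-in predicate rather than the real md.py logic:

```python
def count_non_ascii_loop(decoded: str) -> int:
    # Manual counter incremented in an explicit loop.
    count = 0
    for character in decoded:
        if not character.isascii():
            count += 1
    return count


def count_non_ascii_len(decoded: str) -> int:
    # Same count expressed as len() over a list comprehension.
    return len([character for character in decoded if not character.isascii()])
```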

Ousret (Collaborator) commented Sep 23, 2021

They could be collapsed into a single multiple assignment, which may be a bit faster, or handled some other way; what do you think?

I do not think so. I tried many alternatives that failed to bring any substantial improvement.

Concerning character counts, maybe len() over a list of characters or something similar?

I could be wrong, but that would not matter: Python's len() is already pretty well optimized.
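
A quick way to double-check would be a timeit comparison of the two counting styles (toy data and predicate, not the real detector code):

```python
import timeit

SAMPLE = "Ceci n'est pas une chaîne ASCII. " * 1000

def count_with_loop() -> int:
    count = 0
    for character in SAMPLE:
        if not character.isascii():
            count += 1
    return count

def count_with_len() -> int:
    return len([c for c in SAMPLE if not c.isascii()])

for fn in (count_with_loop, count_with_len):
    print(fn.__name__, timeit.timeit(fn, number=200))
```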

Ousret added a commit that referenced this issue Sep 24, 2021
* reviewed encoding language associations: caches and sets defined

* use list comprehension for language association (#111)

* use list comprehension and filter in char analysis (#111)

* refactored variable inits in md.py

* models: move regex compilation to constants

* detection of Japanese characters: simplify syntax

* amend detected_ranges semantics

Co-authored-by: Aarni Koskela <akx@iki.fi>
Co-authored-by: Ahmed TAHRI <ahmed.tahri@cloudnursery.dev>
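
For context, the "move regex compilation to constants" item boils down to this pattern (illustrative names only, not the actual models.py code):

```python
import re
from typing import Optional

# Compiled once at import time instead of inside the hot function.
ENCODING_MARKER_RE = re.compile(r"charset=[\"']?([a-zA-Z0-9_-]+)", re.IGNORECASE)

def find_declared_encoding(html_head: str) -> Optional[str]:
    # Reuses the precompiled pattern on every call; no per-call re.compile() cost.
    match = ENCODING_MARKER_RE.search(html_head)
    return match.group(1) if match else None
```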