[Proposal] Performance improvements in loops #111

Closed
adbar opened this issue Sep 20, 2021 · 3 comments · Fixed by #113
Labels
enhancement New feature or request

Comments

adbar (Contributor) commented Sep 20, 2021

Is your feature request related to a problem? Please describe.
Hi, I was wondering if it would be possible to improve the performance of certain loops. For example, you do use list comprehensions, but not everywhere. Since you have a speed benchmark, you could check whether it pays off in the comparison with chardet.

Describe the solution you'd like
Here are loops where things could be improved:

Additional context
I could help work on a PR if you're interested.
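
For illustration, the kind of rewrite I have in mind looks like this (a minimal sketch with made-up names, not actual charset_normalizer code):

```python
def printable_chars_loop(decoded: str) -> list:
    # Explicit loop with repeated .append() calls.
    results = []
    for character in decoded:
        if character.isprintable():
            results.append(character)
    return results


def printable_chars_comprehension(decoded: str) -> list:
    # Same result as a list comprehension: fewer bytecode operations per iteration.
    return [character for character in decoded if character.isprintable()]
```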

adbar added the enhancement (New feature or request) label on Sep 20, 2021
Ousret (Collaborator) commented Sep 20, 2021

Hi,

I am open to suggestions/PRs. The current performance is already pretty good for its category.
That said, I am thinking of ways to improve it further; I have a few ideas.

In my opinion, list comprehensions lower readability when used everywhere.
The samples you extracted are not really a concern performance-wise.

30 to 40 % of the time is spent in the charset_normalizer.md (mess-detector) module.
Either optimize the call tree starting from the mess_ratio function, find ways to reduce the dependence on it, or update the current MD plugin(s). Whatever I try outside of that scope results in meaningless performance gains.

I am convinced that with a substantial effort we could make it up to two times faster while keeping the same accuracy.
On another note, performance optimization won't matter if I am not convinced that we can ensure stability.
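
For reference, the 30-40 % figure can be reproduced with a quick profiling run along these lines (a sketch only; it assumes the public from_bytes() entry point and a hypothetical sample.html payload):

```python
import cProfile
import pstats

from charset_normalizer import from_bytes

with open("sample.html", "rb") as fp:  # any reasonably large byte payload
    payload = fp.read()

profiler = cProfile.Profile()
profiler.enable()
from_bytes(payload).best()
profiler.disable()

# Sort by cumulative time; functions from charset_normalizer/md.py (mess_ratio
# and the MD plugins) should account for a large share of the total.
pstats.Stats(profiler).sort_stats("cumulative").print_stats("charset_normalizer")
```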

adbar (Contributor, Author) commented Sep 21, 2021

Hi @Ousret, I totally get your point. I have started refactoring the code in #113; feel free to amend the PR.

A clear problem in md.py, IMHO, is the long series of separate variable assignments (e.g. the resets). They could be collapsed into a single multiple assignment, which may be a bit faster, or handled some other way; what do you think? See the sketch below for what I mean.
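
Roughly this, with made-up counter names rather than the actual md.py attributes:

```python
class MessPluginSketch:
    """Toy stand-in for an md.py plugin; the counter names are hypothetical."""

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        # Instead of one statement per counter:
        #   self._punctuation_count = 0
        #   self._symbol_count = 0
        #   self._character_count = 0
        # collapse the resets into a single multiple (tuple) assignment.
        self._punctuation_count, self._symbol_count, self._character_count = 0, 0, 0
```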
Concerning character counts, maybe len() over a list of characters or something similar?
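
Something along these lines, again with a stand-in predicate rather than the real md.py logic:

```python
def count_non_ascii_loop(decoded: str) -> int:
    # Manual counter incremented in an explicit loop.
    count = 0
    for character in decoded:
        if not character.isascii():
            count += 1
    return count


def count_non_ascii_len(decoded: str) -> int:
    # Same count expressed as len() over a list comprehension.
    return len([character for character in decoded if not character.isascii()])
```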

Ousret (Collaborator) commented Sep 23, 2021

They could be collapsed into a single multiple assignment, which may be a bit faster, or handled some other way; what do you think?

I do not think so. I tried many alternatives that failed to bring any substantial improvement.

Concerning character counts, maybe len() over a list of characters or something similar?

I could be wrong, but that would not matter: Python's len() is already pretty well optimized.
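
A quick way to double-check would be a timeit comparison of the two counting styles (toy data and predicate, not the real detector code):

```python
import timeit

SAMPLE = "Ceci n'est pas une chaîne ASCII. " * 1000

def count_with_loop() -> int:
    count = 0
    for character in SAMPLE:
        if not character.isascii():
            count += 1
    return count

def count_with_len() -> int:
    return len([c for c in SAMPLE if not c.isascii()])

for fn in (count_with_loop, count_with_len):
    print(fn.__name__, timeit.timeit(fn, number=200))
```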

Ousret added a commit that referenced this issue Sep 24, 2021
* reviewed encoding language associations: caches and sets defined

* use list comprehension for language association (#111)

* use list comprehension and filter in char analysis (#111)

* refactored variable inits in md.py

* models: move regex compilation to constants

* detection of Japanese characters: simplify syntax

* amend detected_ranges semantics

Co-authored-by: Aarni Koskela <akx@iki.fi>
Co-authored-by: Ahmed TAHRI <ahmed.tahri@cloudnursery.dev>
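
For context, the "move regex compilation to constants" item boils down to this pattern (illustrative names only, not the actual models.py code):

```python
import re
from typing import Optional

# Compiled once at import time instead of inside the hot function.
ENCODING_MARKER_RE = re.compile(r"charset=[\"']?([a-zA-Z0-9_-]+)", re.IGNORECASE)

def find_declared_encoding(html_head: str) -> Optional[str]:
    # Reuses the precompiled pattern on every call; no per-call re.compile() cost.
    match = ENCODING_MARKER_RE.search(html_head)
    return match.group(1) if match else None
```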