Re-use decoded buffer for short texts #175

nijel · 2022-03-24T12:11:45Z

This avoids issues with detecting string boundaries while improving
performance (avoids decoding the sequence multiple times).

Fixes #174
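
A minimal sketch of the boundary problem described above, not the code from this PR: the function names `chunks_by_redecoding` and `chunks_from_decoded` and the sample string are hypothetical. Slicing raw bytes can cut a multi-byte character in half, whereas decoding once and slicing the resulting `str` cannot. A later commit in this PR disables the re-use for multi-byte strings, so this only illustrates the general idea.

```python
from typing import Iterator


def chunks_by_redecoding(payload: bytes, encoding: str, size: int) -> Iterator[str]:
    """Decode each byte slice separately (hypothetical baseline).

    A slice may cut a multi-byte character in half; errors are ignored
    here, so the damaged character is silently dropped.
    """
    for start in range(0, len(payload), size):
        yield payload[start:start + size].decode(encoding, errors="ignore")


def chunks_from_decoded(payload: bytes, encoding: str, size: int) -> Iterator[str]:
    """Decode the whole payload once, then slice the resulting str."""
    decoded = payload.decode(encoding)
    for start in range(0, len(decoded), size):
        yield decoded[start:start + size]


text = "zażółć gęślą jaźń"
sample = text.encode("utf-8")

# Byte-level slicing loses the characters that straddle a chunk boundary.
redecoded = "".join(chunks_by_redecoding(sample, "utf-8", 5))

# Slicing the already-decoded buffer keeps every character intact.
reused = "".join(chunks_from_decoded(sample, "utf-8", 5))
```

Here `reused` round-trips to the original text, while `redecoded` comes out shorter because some characters were split across slices.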


Ousret

Thanks for the proposal, some initial quick thoughts.

charset_normalizer/api.py

data/sample-polish.txt

codecov-commenter · 2022-06-18T14:43:19Z

Codecov Report

Merging #175 (bca1033) into master (7cbd7fc) will increase coverage by 0.07%.
The diff coverage is 85.00%.

@@            Coverage Diff             @@
##           master     #175      +/-   ##
==========================================
+ Coverage   89.79%   89.86%   +0.07%     
==========================================
  Files          11       11              
  Lines        1205     1214       +9     
==========================================
+ Hits         1082     1091       +9     
  Misses        123      123

| Impacted Files | Coverage Δ | |
|---|---|---|
| charset_normalizer/api.py | `86.82% <66.66%> (-0.11%)` | ⬇️ |
| charset_normalizer/utils.py | `86.17% <92.59%> (+0.83%)` | ⬆️ |
| charset_normalizer/version.py | `100.00% <100.00%> (ø)` | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7cbd7fc...bca1033.


Ousret · 2022-06-18T15:33:11Z

This PR improves the overall quality and performance of the project and fixes an unexpected issue (in CPython).
👌

Ousret

LGTM

nijel force-pushed the master branch from 6534c01 to c201454 March 24, 2022 12:14

Re-use decoded buffer for short texts

4908082

This avoids issues with detecting string boundaries while improving performance (avoids multiple decoding of the sequence). Fixes Ousret#174

nijel force-pushed the master branch from c201454 to 4908082 March 24, 2022 12:18

Ousret self-requested a review March 24, 2022 12:22

Ousret requested changes Mar 24, 2022


Merge branch 'master' into master

0c10e94

Ousret added 7 commits June 18, 2022 16:54

🎨 move cut_sequence_chunks to utils.py

4acf225

not meant to be publicly exposed

🔖 Bump version to 2.1.0.dev0

c7c0c35

🎨 bit of simplification around cut_sequence_chunks

b646f28

plus disable re-use on mb strings

🐛 Workaround a potential bug in Python isspace table character

dea700d

Bug discovered in Python: Zero Width No-Break Space (U+FEFF), located in the Arabic Presentation Forms-B block and present since Unicode 1.1, is not acknowledged as a space.
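
The behavior mentioned in this commit message can be reproduced with the standard library alone. The sketch below is illustrative only: `is_space_like` is a hypothetical helper, not the function added by this commit, and it assumes the workaround amounts to special-casing U+FEFF.

```python
import unicodedata

ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, Arabic Presentation Forms-B block


def is_space_like(character: str) -> bool:
    """Hypothetical helper: str.isspace() plus an explicit U+FEFF case."""
    return character.isspace() or character == ZWNBSP


# The official Unicode name confirms which code point is involved.
name = unicodedata.name(ZWNBSP)

# str.isspace() does not treat U+FEFF as whitespace.
builtin_result = ZWNBSP.isspace()

# The explicit check covers it while leaving ordinary characters alone.
patched_result = is_space_like(ZWNBSP)
```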

📝 Add provisional entry in changelog

aed4bc7

🎨 reformat file isort, black

a5665c3

🎨 fix flake8 lint warn

bca1033

Ousret approved these changes Jun 18, 2022


Ousret merged commit 4846792 into Ousret:master Jun 18, 2022

Ousret mentioned this pull request Jun 19, 2022

Release 2.1.0 #195

Merged
