
🐛 Fix large (misleading) sequence giving UnicodeDecodeError #137

Merged
merged 8 commits into master from bugfix-lazy-str-decode-error on Nov 20, 2021

Conversation

Ousret
Owner

@Ousret Ousret commented Nov 9, 2021

Here is a minimal reproducible example.

from charset_normalizer import from_bytes
from charset_normalizer.constant import TOO_BIG_SEQUENCE

def test_misleading_large_sequence():
    content = (('hello simple ascii ' * TOO_BIG_SEQUENCE) + '我没有埋怨,磋砣的只是一些时间。 磋砣的只是一些时间。').encode('utf_8')

    guesses = from_bytes(content)

    assert len(guesses) > 0
    match = guesses.best()
    assert match is not None
    assert match.encoding == 'utf_8'
    assert str(match) is not None

This PR aims to fix that issue.

Close #136
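The failure mode can be sketched without charset_normalizer at all: when only the head of a very large payload is sniffed, a long ASCII prefix hides a non-ASCII tail, and a later full decode as ASCII raises. A minimal sketch of that mismatch (the `SAMPLE_SIZE` window and the repeat count are illustrative assumptions, not the library's actual internals):

```python
# Illustration only: a large ASCII prefix can hide a non-ASCII tail from
# any detector that samples just the head of the payload. SAMPLE_SIZE is
# a hypothetical sniffing window, not charset_normalizer's real constant.
SAMPLE_SIZE = 1_000_000

payload = (b"hello simple ascii " * 100_000) + "时间".encode("utf-8")

head = payload[:SAMPLE_SIZE]   # what a head-only sniffer would see
head.decode("ascii")           # succeeds: the head is pure ASCII

try:
    payload.decode("ascii")    # the full payload is not ASCII
    tail_raises = False
except UnicodeDecodeError:
    tail_raises = True

assert tail_raises  # this head/tail mismatch is the reported bug
```

This is why the example above multiplies by `TOO_BIG_SEQUENCE`: it pushes the non-ASCII characters past the sampled region, so the guessed encoding and the actual bytes disagree.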

@Ousret Ousret added bug Something isn't working detection Related to the charset detection mechanism, chaos/mess/coherence labels Nov 9, 2021
@codecov-commenter

codecov-commenter commented Nov 9, 2021

Codecov Report

Merging #137 (cc56e3e) into master (00ffea0) will decrease coverage by 0.23%.
The diff coverage is 75.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #137      +/-   ##
==========================================
- Coverage   90.23%   90.00%   -0.23%     
==========================================
  Files          11       11              
  Lines        1157     1171      +14     
==========================================
+ Hits         1044     1054      +10     
- Misses        113      117       +4     
Impacted Files Coverage Δ
charset_normalizer/api.py 88.78% <73.33%> (-1.28%) ⬇️
charset_normalizer/version.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 00ffea0...cc56e3e. Read the comment docs.

@aytey

aytey commented Nov 10, 2021

Fixes my issue -- thanks, @Ousret!

@Ousret Ousret merged commit b7fca11 into master Nov 20, 2021
@Ousret Ousret deleted the bugfix-lazy-str-decode-error branch November 20, 2021 22:30
@Ousret Ousret mentioned this pull request Nov 24, 2021
Successfully merging this pull request may close these issues.

[BUG] UnicodeDecodeError: 'ascii' codec can't decode byte when using from_path