
🐛 Fix large (misleading) sequence giving UnicodeDecodeError #137

Merged
merged 8 commits into master from bugfix-lazy-str-decode-error on Nov 20, 2021

Conversation

Ousret
Owner

@Ousret Ousret commented Nov 9, 2021

Here is a minimal reproducible example.

from charset_normalizer import from_bytes
from charset_normalizer.constant import TOO_BIG_SEQUENCE

def test_misleading_large_sequence():
    content = (('hello simple ascii ' * TOO_BIG_SEQUENCE) + '我没有埋怨,磋砣的只是一些时间。 磋砣的只是一些时间。').encode('utf_8')

    guesses = from_bytes(content)

    assert len(guesses) > 0
    match = guesses.best()
    assert match is not None
    assert match.encoding == 'utf_8'
    assert str(match) is not None

This PR aims to fix that issue.

Close #136
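The failure mode can be sketched without charset_normalizer at all: when only the head of a very large payload is sniffed, a long ASCII prefix hides a non-ASCII tail, and a later full decode as ASCII raises. A minimal sketch of that mismatch (the `SAMPLE_SIZE` window and the repeat count are illustrative assumptions, not the library's actual internals):

```python
# Illustration only: a large ASCII prefix can hide a non-ASCII tail from
# any detector that samples just the head of the payload. SAMPLE_SIZE is
# a hypothetical sniffing window, not charset_normalizer's real constant.
SAMPLE_SIZE = 1_000_000

payload = (b"hello simple ascii " * 100_000) + "时间".encode("utf-8")

head = payload[:SAMPLE_SIZE]   # what a head-only sniffer would see
head.decode("ascii")           # succeeds: the head is pure ASCII

try:
    payload.decode("ascii")    # the full payload is not ASCII
    tail_raises = False
except UnicodeDecodeError:
    tail_raises = True

assert tail_raises  # this head/tail mismatch is the reported bug
```

This is why the example above multiplies by `TOO_BIG_SEQUENCE`: it pushes the non-ASCII characters past the sampled region, so the guessed encoding and the actual bytes disagree.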

@Ousret Ousret added bug Something isn't working detection Related to the charset detection mechanism, chaos/mess/coherence labels Nov 9, 2021
@codecov-commenter

codecov-commenter commented Nov 9, 2021

Codecov Report

Merging #137 (cc56e3e) into master (00ffea0) will decrease coverage by 0.23%.
The diff coverage is 75.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #137      +/-   ##
==========================================
- Coverage   90.23%   90.00%   -0.23%     
==========================================
  Files          11       11              
  Lines        1157     1171      +14     
==========================================
+ Hits         1044     1054      +10     
- Misses        113      117       +4     
Impacted Files Coverage Δ
charset_normalizer/api.py 88.78% <73.33%> (-1.28%) ⬇️
charset_normalizer/version.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 00ffea0...cc56e3e. Read the comment docs.

@aytey

aytey commented Nov 10, 2021

Fixes my issue -- thanks, @Ousret!

@Ousret Ousret merged commit b7fca11 into master Nov 20, 2021
@Ousret Ousret deleted the bugfix-lazy-str-decode-error branch November 20, 2021 22:30
@Ousret Ousret mentioned this pull request Nov 24, 2021
Successfully merging this pull request may close these issues.

[BUG] UnicodeDecodeError: 'ascii' codec can't decode byte when using from_path