MD improvement on trailing data and long foreign (non-pure latin) data #124

Merged: Ousret merged 5 commits into master from patch-final-safe-md-improvement on Oct 23, 2021

Ousret (Owner) commented on Oct 17, 2021

Close #121

Addresses, as best as humanly (or bot-wise) possible, some remaining edge cases.

Those two files, which come from uchardet, caught my attention.

Both files contain the same content: "classical chinese" repeated in a loop, without any punctuation.
They were misdetected as cp1251 and cp037.
To be absolutely clear, as these are borderline cases, the goal is not to get them right but to get closer.
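
To illustrate why this kind of input is borderline: byte-oriented code pages accept almost any byte sequence, so a successful decode by itself says very little. The following is a minimal sketch using only Python's standard codecs. The sample text and the helper name `decodable_as` are hypothetical; this is not how charset_normalizer works internally, and the files discussed here are not reproduced.

```python
def decodable_as(payload, encodings):
    """Return the encodings under which `payload` decodes without error."""
    accepted = []
    for name in encodings:
        try:
            payload.decode(name)
        except UnicodeDecodeError:
            # This codec rejects the byte sequence outright.
            continue
        accepted.append(name)
    return accepted


# Hypothetical sample: a short Chinese phrase repeated without punctuation.
sample = ("文言文" * 50).encode("gb18030")
candidates = ["gb18030", "big5", "cp1251", "cp037", "utf_8"]

accepted = decodable_as(sample, candidates)
print(accepted)
```

Several unrelated code pages can survive this check, which is why a detector has to fall back on heuristics about how plausible the decoded text looks.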

With this patch, both are now detected as "big5". There is nothing more we can do without considerable additional effort.
The problematic case is gb18030 decoded as big5, which outputs 潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅. Google Translate seems to think this translates to "Extremely strong".
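
The gb18030 versus big5 confusion can be reproduced with the standard library alone. This is a rough sketch with an arbitrary, hypothetical sample string (the variable names are illustrative too). It only shows that bytes produced by one codec can be read by the other as unrelated characters.

```python
# Hypothetical sample text; the files discussed in this PR are not reproduced here.
original = "简体中文" * 3
raw = original.encode("gb18030")

# Decoding with the matching codec round-trips the text.
as_gb18030 = raw.decode("gb18030")

# Decoding the same bytes as big5 does not fail loudly; it yields other
# characters (mojibake). errors="replace" guards against unmapped byte pairs.
as_big5 = raw.decode("big5", errors="replace")

print(as_gb18030)
print(as_big5)
```

Because both readings consist of CJK characters, telling them apart requires judging how coherent the text is rather than whether it decodes.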

Nonetheless, It is a positive PR that increases the detector accuracy.

Ousret added the enhancement ("New feature or request") and detection ("Related to the charset detection mechanism, chaos/mess/coherence") labels on Oct 17, 2021
codecov-commenter commented on Oct 17, 2021

Codecov Report

Merging #124 (5772bc3) into master (8b52c35) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #124      +/-   ##
==========================================
+ Coverage   90.25%   90.27%   +0.01%     
==========================================
  Files          11       11              
  Lines        1160     1162       +2     
==========================================
+ Hits         1047     1049       +2     
  Misses        113      113              
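
The percentages in the diff can be recomputed from the Hits and Lines rows. The sketch below assumes the figures are truncated (not rounded) to two decimals. That rule is an assumption, not something the report states, but it does reproduce the numbers shown. The helper name `truncated_pct` is hypothetical.

```python
import math


def truncated_pct(value, digits=2):
    """Truncate (not round) a percentage to `digits` decimals (assumed rule)."""
    factor = 10 ** digits
    return math.floor(value * factor) / factor


before = 100 * 1047 / 1160  # hits / lines on master
after = 100 * 1049 / 1162   # hits / lines with #124 merged

print(truncated_pct(before), truncated_pct(after), truncated_pct(after - before))
# prints: 90.25 90.27 0.01
```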
| Impacted Files | Coverage Δ |
| charset_normalizer/md.py | 96.65% <100.00%> (+0.02%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8b52c35...5772bc3.

Ousret merged commit b34d2e3 into master on Oct 23, 2021
Ousret deleted the patch-final-safe-md-improvement branch on October 23, 2021 20:16
Ousret mentioned this pull request on Nov 24, 2021
Development

Successfully merging this pull request may close these issues.

[DETECTION] Issue with encodings used by Asian languages