MD improvement on trailing data and long foreign (non-pure latin) data #124

Ousret · 2021-10-17T16:27:51Z

Close #121

Address as best as humanly (or bot-wise) possible some remaining edge cases.

Coming from uchardet, these two files caught my attention.

Both files contain the same content: "classical chinese" repeated in a loop, without any punctuation.
They were misdetected as cp1251 and cp037.
To be absolutely clear, since these are borderline cases, the goal is not to get them right but to get closer.

Now, with this patch, they are both detected as "big5". There is nothing more we can do without considerable additional effort.
The problematic case is gb18030 --> big5, which outputs 潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅. Google Translate seems to think this translates to "Extremely strong".
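To illustrate why this case is borderline, here is a minimal sketch using only Python's standard codecs. The sample text and the helper name `decode_candidates` are assumptions made for illustration; they are not the content of the files from #121 nor part of charset_normalizer. The point is simply that a byte sequence produced with gb18030 may also decode under big5, yielding different text, so the bytes alone cannot rule out either candidate.

```python
from typing import Dict, Iterable, Optional


def decode_candidates(payload: bytes, encodings: Iterable[str]) -> Dict[str, Optional[str]]:
    """Try each candidate encoding; store the decoded text, or None if decoding fails."""
    results: Dict[str, Optional[str]] = {}
    for encoding in encodings:
        try:
            results[encoding] = payload.decode(encoding)
        except UnicodeDecodeError:
            # This candidate cannot represent the payload at all.
            results[encoding] = None
    return results


# Hypothetical sample: a short phrase repeated without punctuation (assumption).
sample = "简体中文" * 10
payload = sample.encode("gb18030")

candidates = decode_candidates(payload, ["gb18030", "big5"])
for name, text in candidates.items():
    print(name, "->", text)
```

When both candidates decode, only a judgment on the decoded text itself (mess, coherence) can separate them, which is what this PR tries to improve.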

Nonetheless, it is a positive PR that increases the detector's accuracy.

Some MD plugins require a "separator" character before assessing the current buffer.
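The effect of the missing separator can be sketched as follows. This is a simplified, hypothetical plugin (`TrailingWordCollector` and `collect_words` are illustrative names, not the actual classes in `charset_normalizer/md.py`): it buffers letters and only assesses the buffer once a non-letter arrives, so the last word is never assessed unless a final non-printable character is fed.

```python
from typing import List


class TrailingWordCollector:
    """Hypothetical plugin: buffers letters, assesses the buffer only on a separator."""

    def __init__(self) -> None:
        self._buffer = ""
        self.words: List[str] = []

    def feed(self, character: str) -> None:
        if character.isalpha():
            # Still inside a word: keep buffering, nothing is assessed yet.
            self._buffer += character
            return
        # Separator reached: assess (here, simply record) the buffered word.
        if self._buffer:
            self.words.append(self._buffer)
            self._buffer = ""


def collect_words(text: str, feed_final_separator: bool) -> List[str]:
    """Feed text to the plugin, optionally appending a final non-printable character."""
    plugin = TrailingWordCollector()
    payload = text + "\n" if feed_final_separator else text
    for character in payload:
        plugin.feed(character)
    return plugin.words


without_separator = collect_words("hello world", feed_final_separator=False)
with_separator = collect_words("hello world", feed_final_separator=True)
```

Without the final character, the trailing word stays in the buffer and is never assessed; with it, both words are.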

codecov-commenter · 2021-10-17T16:28:42Z

Codecov Report

Merging #124 (5772bc3) into master (8b52c35) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #124      +/-   ##
==========================================
+ Coverage   90.25%   90.27%   +0.01%     
==========================================
  Files          11       11              
  Lines        1160     1162       +2     
==========================================
+ Hits         1047     1049       +2     
  Misses        113      113

Impacted Files	Coverage Δ
charset_normalizer/md.py	`96.65% <100.00%> (+0.02%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Ousret added 2 commits October 17, 2021 18:16

❇️ Always feed a final non printable character to detect trailing mess

3f33bbd

some MD plugins require a "separator" character before assessing the current buffer.

❇️ Never ignore non-pure latin (foreign) word that are too long

512ccf4

Ousret added the labels enhancement (New feature or request) and detection (Related to the charset detection mechanism, chaos/mess/coherence) Oct 17, 2021

Ousret added 3 commits October 20, 2021 08:29

Merge branch 'master' into patch-final-safe-md-improvement

37ec4c3

Merge branch 'master' into patch-final-safe-md-improvement

25e9cb4

Merge branch 'master' into patch-final-safe-md-improvement

5772bc3

Ousret merged commit b34d2e3 into master Oct 23, 2021

Ousret deleted the patch-final-safe-md-improvement branch October 23, 2021 20:16

Ousret mentioned this pull request Nov 24, 2021

🔖 Bump version 2.0.8 #144

Merged
