MD improvement on trailing data and long foreign (non-pure latin) data #124
Close #121
Address some remaining edge cases as well as humanly (or bot-wise) possible.
Coming from uchardet, those two files caught my attention. Both files contain the same content: "classical chinese" repeated in a loop without any punctuation.
They were misdetected as cp1251 and cp037, respectively.
To be absolutely clear, as they are borderline cases, the goal is not to get the detection exactly right but to get closer.
Now, with this patch, they are both detected as "big5". There is nothing more that we can do without considerable additional effort.
The problematic case is the gb18030 --> big5 one, which outputs 潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅潠极笢恅. Google Translate seems to think that this is translatable to "Extremely strong".
Nonetheless, it is a positive PR that increases the detector's accuracy.
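To illustrate why these files are borderline, here is a minimal sketch using only the Python standard library. It is not the detector's code. The sample phrase is a hypothetical stand-in for the real test files, and `decodes_cleanly` is a helper name made up for this example. It checks which of the encodings mentioned above accept the same byte sequence without raising an error.

```python
def decodes_cleanly(payload: bytes, encoding: str) -> bool:
    """Return True when payload decodes under encoding without raising."""
    try:
        payload.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical stand-in for the test files: one short phrase, looped, no punctuation.
sample = "中文".encode("gb18030") * 50

# Encodings named in this PR description.
candidates = [
    name
    for name in ("gb18030", "big5", "cp1251", "cp037")
    if decodes_cleanly(sample, name)
]
print(candidates)
```

Since several code pages can accept the same bytes, validity alone cannot settle the question, and the detector has to judge how plausible each decoded text looks.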