More sensible default for extreme case of 1 non-ASCII char. #153
base: main
Conversation
For all who need a hotfix: that's why I think my pull request, which does a linear transformation only for mb_char == 1, is way better.
Is there any chance the build failures will be resolved, so we can use UTF-8 with a single non-ASCII character with confidence again?
The problem was that the confidence function is quadratic; for char hit = 1 it's too low. Examples for 1 up to 3 char hits (for 6 or more it defaults to 99%). So why not default for 1 char hit to 73.1% instead of 50.5%?

    1 - (0.5**1 * 0.99) = 0.505   -> new: 0.731
    1 - (0.5**2 * 0.99) = 0.7525
    1 - (0.5**3 * 0.99) = 0.87625
    6 or more chars     = 0.99

ISO-8859-1 is limited to 0.73.
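The numbers above can be reproduced with a short sketch of the current behavior. The function name is illustrative; `ONE_CHAR_PROB = 0.5` and the 0.99 cap are the constants discussed in this thread:

```python
# Sketch of chardet's UTF-8 confidence for small multi-byte char counts.
# ONE_CHAR_PROB = 0.5 and the 0.99 cap are the values discussed above;
# the function name is illustrative, not chardet's actual API.
ONE_CHAR_PROB = 0.5

def utf8_confidence(num_mb_chars: int) -> float:
    unlike = 0.99
    if num_mb_chars < 6:
        # exponential penalty: halves "unlike" per multi-byte char seen
        unlike *= ONE_CHAR_PROB ** num_mb_chars
        return 1.0 - unlike
    return unlike  # 6 or more chars: confidence is simply 0.99

for n in range(1, 7):
    # n=1 -> 0.505, n=2 -> 0.7525, n=3 -> 0.87625, ..., n=6 -> 0.99
    print(n, round(utf8_confidence(n), 5))
```

This makes the complaint concrete: a single multi-byte char only reaches 0.505, which ISO-8859-1 (capped at 0.73) easily beats.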
I rebased the same 3 commits as before on the latest master. It resolved the build issues.
    if self._num_mb_chars < 6:
        if self._num_mb_chars == 1:
            # guaranteed to be preferred over ISO-8859-1, evaluates to 0.731
            unlike *= self.ONE_CHAR_PROB * (1 - 0.731) / 0.99 / 0.5
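If I read the patch right, the scaling factor is constructed so the whole `unlike` term collapses to `1 - 0.731`. A quick check (assuming `ONE_CHAR_PROB == 0.5`, as in chardet):

```python
# Verify the patched one-char factor: starting from unlike = 0.99,
# multiplying by ONE_CHAR_PROB * (1 - 0.731) / 0.99 / 0.5 should leave
# a final confidence of exactly 0.731 (ONE_CHAR_PROB = 0.5 assumed).
ONE_CHAR_PROB = 0.5

unlike = 0.99
unlike *= ONE_CHAR_PROB * (1 - 0.731) / 0.99 / 0.5
confidence = 1.0 - unlike
print(round(confidence, 3))  # 0.731
```

Which also illustrates the reviewer's point below: the expression is a roundabout way of producing a constant.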
While I completely agree that making the default UTF-8 instead of ISO-8859-1 makes a lot of sense (since the world has changed dramatically since the original C code chardet is a port of was written 25+ years ago), I'm not sure that this is the best approach. First of all, it has a rather confusing equation in it that yields a constant value, instead of just returning the constant value. Second, it's not clear to most people why 0.731 will make it win out over ISO-8859-1. Finally, we should probably just reduce the confidence of the ISO-8859-1 prober (or modify what ONE_CHAR_PROB is here).
Thank you for the feedback.
We would have to reduce the maximum confidence for ISO-8859-1 to below 50%, which would hurt real-world detection: e.g. in Germany you still run into ISO-8859-1 every day (about 6% of websites, plus legacy software).
My motivation was to still allow the user to set ONE_CHAR_PROB, have it apply to the 1-char case as well, and not change the behavior for 2+ chars.
Agreed it should be simpler and the comment is confusing.
Do you have some context that justifies the confidence jump from 1 to 2 chars (50.5% to 75.25%)? In between, most other models have their normal confidence levels and get preferred.
If not i would propose a simpler solution:
    # Prefer UTF-8 over other models; can be modified with ONE_CHAR_PROB.
    if self._num_mb_chars == 1:
        unlike *= self.ONE_CHAR_PROB ** 1.99
This stays above the magic numbers of 0.73 and 0.75, yet below the 2-char value of 0.7525. And it is easy to read.
Or, as another alternative, just change the original code's exponent to max(self._num_mb_chars, 2).
The equivalent for other models seems to be this, which has some inheritance going on:

chardet/chardet/chardistribution.py, lines 92 to 93 in eb1a10a:

    r = (self._freq_chars / ((self._total_chars - self._freq_chars)
                             * self.typical_distribution_ratio))
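To make that ratio concrete, here is a simplified, self-contained sketch of how such a distribution-based confidence could be computed. The function name, the threshold handling, and the 0.99/0.01 caps are assumptions for illustration, not copied from chardet:

```python
# Simplified sketch of a distribution-based confidence: the ratio of
# "frequent" chars to the remaining chars, scaled by a typical
# distribution ratio, capped at a "sure yes" value. All names and
# constants here are illustrative assumptions, not chardet's exact code.
SURE_YES = 0.99
SURE_NO = 0.01

def distribution_confidence(freq_chars: int, total_chars: int,
                            typical_distribution_ratio: float) -> float:
    if total_chars <= 0 or freq_chars <= 0:
        return SURE_NO  # not enough data to say anything
    if total_chars != freq_chars:
        r = freq_chars / ((total_chars - freq_chars)
                          * typical_distribution_ratio)
        if r < SURE_YES:
            return r
    return SURE_YES  # ratio saturated: cap the confidence
```

Unlike the UTF-8 prober's exponential formula, this grows smoothly with the fraction of frequent characters, which is why those models don't show the same 1-char cliff.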
There are duplicate open issues, like #138 and #134, which should be fixed by this.