Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More sensible default for extreme case of 1 none ascii char. #153

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

elcolumbio
Copy link

@elcolumbio elcolumbio commented Jun 2, 2018

the problem was the confidence function is quadratic for char hit = 1 it's too low.
Examples for 1 up to 3 char hits. For 6 or more it default's to 99%.
So why not default for 1 char hit to 73,1% instead of 50.5%

1-(0.5**1*0.99) = 0.505     -> new 0.731
1-(0.5**2*0.99) =0.7525
1-(0.5**3*0.99) = 0.87625
6 or more = 0.99

ISO-8859-1 is limited to 0.73

there are duplicated open issues like: #138 #134
Which should be fixed.

@elcolumbio
Copy link
Author

For all who need an hotfix.
You can set a new value before you run detect:
chardet.utf8prober.UTF8Prober.ONE_CHAR_PROB = 0.26

That does a linear transformation for
mb_char one
but a quadratic transformation for
2 to 5 non ascii possible utf8 chars.
For me that still worked.

That's why i think my pull request doing a linear transformation only for mb_char == 1 is way better.

chardet/utf8prober.py Outdated Show resolved Hide resolved
@basvdheuvel
Copy link

Is there any chance the build failures will be resolved so we can use UTF8 with a single ASCII character with confidence again?

the problem was the confidence function is quadratic for char hit = 1 it's too low.
Examples for 1 up to 3 char hits. For 6 or more it default's to 99%.
So why not default for 1 char hit to 73,1% instead of 50.5%
1-(0.5**1*0.99) = 0.505     -> new 0.731
1-(0.5**2*0.99) = 0.7525
1-(0.5**3*0.99) = 0.87625
6 or more = 0.99

ISO-8859-1 is limited to 0.73
@elcolumbio
Copy link
Author

i rebased the same 3 commits as before on the latest master. It resolved the build issues.

if self._num_mb_chars < 6:
if self._num_mb_chars == 1:
# guaranteed to be preferred over ISO-8859-1, evaluates to 0.731
unlike *= self.ONE_CHAR_PROB * (1-0.731)/0.99/0.5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I completely agree that making the default UTF-8 instead of ISO-8859-1 makes a lot of sense (since the world has changed dramatically since the original C code chardet is a port of was written 25+ years ago), I'm not sure that this is the best approach. First of all, it has a rather confusing equation in it that yields a constant value, instead of just returning the constant value. Second, it's not clear to most people while 0.731 will make it win out of ISO-8859-1. Finally, we should probably just reduce the confidence of the ISO-8859-1 prober (or modify what ONE_CHAR_PROB is here).

Copy link
Author

@elcolumbio elcolumbio Dec 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the feedback.
We would have to reduce the maximum confidence for ISO-8859-1 to 50%. E.g. in Germany you will run into ISO-8859-1 everyday (6% of websites and legacy software).

My motivation was to still allow the user to set ONE_CHAR_PROB. Also for 1 char and not change the behavior for 2+ chars.
Agreed it should be simpler and the comment is confusing.
Do you have some context which qualifies the confidence jump from 1 to 2 chars (50,5% to 75,25%)? In between most other models have there normal confidence levels and get preferred.
If not i would propose a simpler solution:

# Prefer UTF-8 over other models, can be modified with ONE_CHAR_PROB.
if self._numb_chars == 1:
    unlike *= self.ONE_CHAR_PROB ** 1.99

above the magic numbers of 73 and 75 and smaller than 2 chars. And easy to read.
Or another alternative just change the original code to max(self._numb_chars, 2).

The equivalent for other models seems to be this, which has some inheritance ongoing:

r = (self._freq_chars / ((self._total_chars - self._freq_chars)
* self.typical_distribution_ratio))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants