More sensible default for extreme case of 1 none ascii char. #153

elcolumbio · 2018-06-02T13:09:02Z

The problem is that the confidence function raises 0.5 to the power of the number of char hits, so for char hit = 1 the result is too low.
Examples for 1 up to 3 char hits are below. For 6 or more it defaults to 99%.
So why not default to 73.1% for 1 char hit instead of 50.5%?

1-(0.5**1*0.99) = 0.505     -> new 0.731
1-(0.5**2*0.99) = 0.7525
1-(0.5**3*0.99) = 0.87625
6 or more = 0.99

ISO-8859-1 is limited to 0.73
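
The figures above can be reproduced with a few lines of Python. This is only a minimal sketch of the formula as quoted in this description, not chardet's actual code; the function name `utf8_confidence`, its parameters, and the constant `LATIN1_CAP` are illustrative assumptions.

```python
def utf8_confidence(num_mb_chars, one_char_prob=0.5):
    """Confidence as described in this thread (assumed form):
    1 - 0.99 * p**n for fewer than 6 hits, otherwise 0.99."""
    unlike = 0.99
    if num_mb_chars < 6:
        unlike *= one_char_prob ** num_mb_chars
        return 1.0 - unlike
    return unlike


LATIN1_CAP = 0.73  # ISO-8859-1 limit quoted in the description

for n in range(1, 7):
    conf = utf8_confidence(n)
    verdict = "above" if conf > LATIN1_CAP else "below"
    print(f"{n} hit(s): {conf:.5f} ({verdict} the ISO-8859-1 limit)")
```

Under these assumptions only the single-hit case falls below the ISO-8859-1 limit, which is the case this pull request targets.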

There are duplicate open issues, such as #138 and #134, which this should fix.

elcolumbio · 2018-06-03T08:09:06Z

For anyone who needs a hotfix:
You can set a new value before you run detect:
chardet.utf8prober.UTF8Prober.ONE_CHAR_PROB = 0.26

That changes the confidence linearly for one mb_char, but quadratically or more steeply for 2 to 5 possible non-ASCII UTF-8 chars.
That still worked for me.

That's why I think my pull request, which applies a linear transformation only for mb_char == 1, is much better.
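
To see what the hotfix does to the numbers, here is a rough sketch that reuses the formula from the pull request description rather than importing chardet. The helper `confidence` is hypothetical, and the formula is assumed from this thread.

```python
def confidence(num_mb_chars, one_char_prob):
    # Formula as quoted in this thread (an assumption, not copied from chardet).
    if num_mb_chars < 6:
        return 1.0 - 0.99 * one_char_prob ** num_mb_chars
    return 0.99


for n in range(1, 6):
    default = confidence(n, 0.5)   # current default ONE_CHAR_PROB
    hotfix = confidence(n, 0.26)   # value suggested in the hotfix
    print(f"{n} char(s): default={default:.4f} hotfix={hotfix:.4f}")
```

With 0.26, a single char lands just above the 0.73 limit of ISO-8859-1, but the values for 2 to 5 chars also rise sharply, which is the side effect described above.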


basvdheuvel · 2019-08-21T09:16:18Z

Is there any chance the build failures will be resolved so we can use UTF8 with a single non-ASCII character with confidence again?


elcolumbio · 2020-05-12T22:31:46Z

I rebased the same 3 commits as before onto the latest master. That resolved the build issues.

dan-blanchard · 2020-12-09T17:13:15Z

chardet/utf8prober.py

-        if self._num_mb_chars < 6:
+        if self._num_mb_chars == 1:
+            # guaranteed to be preferred over ISO-8859-1, evaluates to 0.731
+            unlike *= self.ONE_CHAR_PROB * (1-0.731)/0.99/0.5


While I completely agree that making the default UTF-8 instead of ISO-8859-1 makes a lot of sense (since the world has changed dramatically since the original C code that chardet is a port of was written 25+ years ago), I'm not sure that this is the best approach. First of all, it has a rather confusing equation in it that yields a constant value, instead of just returning the constant value. Second, it's not clear to most people why 0.731 will make it win out over ISO-8859-1. Finally, we should probably just reduce the confidence of the ISO-8859-1 prober (or modify what ONE_CHAR_PROB is here).
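
As a quick check of the expression in the diff above, the following sketch evaluates it in isolation. It assumes that `unlike` starts at 0.99 and that the branch returns `1.0 - unlike`, as the formula in the pull request description suggests; `one_char_confidence` is a hypothetical name, not part of chardet.

```python
def one_char_confidence(one_char_prob=0.5):
    # Assumed context: unlike starts at 0.99 and the branch returns 1.0 - unlike.
    unlike = 0.99
    unlike *= one_char_prob * (1 - 0.731) / 0.99 / 0.5
    return 1.0 - unlike


print(round(one_char_confidence(), 5))      # default ONE_CHAR_PROB of 0.5
print(round(one_char_confidence(0.26), 5))  # a user-supplied ONE_CHAR_PROB
```

At the default of 0.5 the factors 0.99 and 0.5 cancel, so the result is 0.731. The value only changes when ONE_CHAR_PROB is overridden, in which case it scales linearly with it.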

elcolumbio · 2020-12-09

Thank you for the feedback.
We would have to reduce the maximum confidence for ISO-8859-1 to 50%. For example, in Germany you will run into ISO-8859-1 every day (6% of websites and legacy software).

My motivation was to still let the user set ONE_CHAR_PROB, including for 1 char, without changing the behavior for 2+ chars.
Agreed, it should be simpler, and the comment is confusing.
Do you have any context that justifies the confidence jump from 1 to 2 chars (50.5% to 75.25%)? Most other models have their normal confidence levels in between and therefore get preferred.
If not, I would propose a simpler solution:

    # Prefer UTF-8 over other models, can be modified with ONE_CHAR_PROB.
    if self._num_mb_chars == 1:
        unlike *= self.ONE_CHAR_PROB ** 1.99

This gives a confidence above the magic numbers of 73% and 75% but below the value for 2 chars, and it is easy to read.
Another alternative is to just change the original code to use max(self._num_mb_chars, 2).
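
The two alternatives can be compared numerically. This is a hedged sketch under the same assumed formula (1 - 0.99 * p**n for fewer than 6 chars); `proposal_power` and `proposal_max` are hypothetical helper names, and only counts of 1 or more are considered.

```python
def proposal_power(n, p=0.5):
    # Exponent 1.99 for a single char, original exponent otherwise.
    if n >= 6:
        return 0.99
    exponent = 1.99 if n == 1 else n
    return 1.0 - 0.99 * p ** exponent


def proposal_max(n, p=0.5):
    # Treat a single char like two chars via max(n, 2).
    if n >= 6:
        return 0.99
    return 1.0 - 0.99 * p ** max(n, 2)


for n in (1, 2, 3):
    print(n, round(proposal_power(n), 4), round(proposal_max(n), 4))
```

With the default of 0.5, the exponent variant puts the single-char case slightly above 0.75 and below the two-char value of 0.7525, while the `max` variant makes one and two chars identical.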

The equivalent for other models seems to be this, which involves some inheritance:

chardet/chardet/chardistribution.py
Lines 92 to 93 in eb1a10a

    r = (self._freq_chars / ((self._total_chars - self._freq_chars)
         * self.typical_distribution_ratio))

raylu reviewed Apr 16, 2019


basvdheuvel mentioned this pull request Aug 21, 2019

Let user decide their preferred encoding method kakulukia/pypugjs#56

Open

elcolumbio added 3 commits May 13, 2020 00:22

small fix to allow use of ONE_CHAR_PROB parameter

257cca6

typos resolved

e0721c6

elcolumbio force-pushed the master branch from a156237 to e0721c6 Compare May 12, 2020 22:27

dan-blanchard reviewed Dec 9, 2020
