Detection and Tiny sequences #391

dpatryas-rtbhouse · 2023-12-08T15:39:09Z

Describe the bug
The charset_normalizer.detect() method fails to detect the encoding of short byte sequences correctly. Specifically, it identifies the sequence below as Big5 and decodes part of it as a Chinese character. The behavior can be reproduced with the following code snippet:

To Reproduce
Execute the code snippet, where the charset_normalizer.detect() method misidentifies the encoding of the sequence.

import charset_normalizer
text = b"What Actions Will Keep Us at 1.5-2\xbaC?"
text.decode(charset_normalizer.detect(text)["encoding"])

Expected behavior
The expected behavior can be demonstrated using the chardet library, which accurately recognizes the encoding as ISO 8859-1. The correct degree character is then returned from the sequence:

import chardet
text = b"What Actions Will Keep Us at 1.5-2\xbaC?"
text.decode(chardet.detect(text)["encoding"])

Logs

charset-normalizer:
What Actions Will Keep Us at 1.5-2慢?

chardet:
What Actions Will Keep Us at 1.5-2ºC?

{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}  # chardet
{'encoding': 'Big5', 'language': 'Chinese', 'confidence': 1.0}  # charset-normalizer
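The two outputs above can be reproduced without either detector by decoding the same bytes with Python's built-in codecs. This is only a sketch to show where the Chinese character comes from: under Big5, \xba acts as a lead byte and pairs with the following C, so two bytes collapse into one character. The variable names are illustrative.

```python
text = b"What Actions Will Keep Us at 1.5-2\xbaC?"

# Single-byte reading: every byte maps to exactly one character.
as_latin1 = text.decode("latin_1")

# Multi-byte reading: \xba is a Big5 lead byte and consumes the "C" after it.
as_big5 = text.decode("big5")

print(as_latin1)
print(as_big5)

# The Big5 reading is one character shorter because "\xbaC" became one glyph.
print(len(as_latin1) - len(as_big5))
```

Both decodings succeed without error, which is why the byte values alone cannot settle which one the author meant.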

Desktop (please complete the following information):

OS: Linux
Python version 3.10
Package version 3.3.2 / 2.1.1


Ousret · 2023-12-09T08:02:47Z

I understand you've encountered a frustrating case.
There may be a minor misunderstanding about how a charset detector works.

Let's analyze the input you've shared, and thanks for that:
text = b"What Actions Will Keep Us at 1.5-2\xbaC?"

We can agree that the debate centers on the \xba byte.
So charset-normalizer or chardet must decide: "what did the original writer/author mean?"

OK, so far?

chardet knows about 35-ish encodings, and charset-normalizer around 100.

\xba can be decoded using most of the available encodings (those that extend ASCII, of course).
Here is a tiny extract of what it could translate to:

º
ş
÷
Ί
ŗ
║
¬
[
╨

and so on...
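A minimal sketch of that ambiguity, using only Python's standard codecs. The list of codecs is an arbitrary selection for illustration, not the set that chardet or charset-normalizer actually probes, and decode_candidates is a hypothetical helper name.

```python
# Arbitrary sample of codecs shipped with Python (an assumption for this demo).
CANDIDATES = [
    "latin_1", "iso8859_2", "iso8859_7", "iso8859_8",
    "iso8859_13", "cp437", "cp500", "koi8_r",
]


def decode_candidates(raw, encodings):
    """Decode raw with each codec; store None when the codec rejects it."""
    readings = {}
    for name in encodings:
        try:
            readings[name] = raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            readings[name] = None
    return readings


readings = decode_candidates(b"\xba", CANDIDATES)
for name, char in readings.items():
    print(f"{name:12} {char!r}")

# Count how many different characters the very same byte can stand for.
distinct = {char for char in readings.values() if char is not None}
print(len(distinct), "distinct readings")
```

Every codec that accepts the byte produces a perfectly legal character, so a detector has to rank the readings by plausibility rather than by validity.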

As a human, you've concluded that ° was the author's original intention because of the sentence "What Actions Will Keep Us at 1.5-2".

How do you teach that to a machine without taking 10 seconds to answer today?

Know that we did our best to answer as accurately as possible.
A way to improve this must exist; it's just out of reach for me right now.

chardet only answered correctly by pure luck.

Take this for example: "Charset Detection, for Everyone 👋", which encodes to Charset Detection, for Everyone \xf0\x9f\x91\x8b

>>> chardet.detect(b'Charset Detection, for Everyone \xf0\x9f\x91\x8b')
{'encoding': 'Windows-1254', 'confidence': 0.4957960183590231, 'language': 'Turkish'}

>>> charset_normalizer.detect(b'Charset Detection, for Everyone \xf0\x9f\x91\x8b')
{'encoding': 'utf-8', 'language': '', 'confidence': 1.0}

Or "Je suis pas d'accord avec Ahméd", which encodes to Je suis pas d'accord avec Ahm\xc3\xa9d.

>>> chardet.detect(b"Je suis pas d'accord avec Ahm\xc3\xa9d")
{'encoding': 'ISO-8859-9', 'confidence': 0.5648588804140238, 'language': 'Turkish'}

>>> charset_normalizer.detect(b"Je suis pas d'accord avec Ahm\xc3\xa9d")
{'encoding': 'utf-8', 'language': '', 'confidence': 1.0}

Here, luck is out of the equation thanks to our ability to detect Unicode.
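To illustrate why UTF-8 is the easier case, here is a rough sketch that relies only on strict decoding. It is an assumption about the general idea, not a description of how charset-normalizer is implemented, and is_valid_utf8 is a hypothetical helper.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True when raw decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


samples = {
    "emoji": b"Charset Detection, for Everyone \xf0\x9f\x91\x8b",
    "french": b"Je suis pas d'accord avec Ahm\xc3\xa9d",
    "degrees": b"What Actions Will Keep Us at 1.5-2\xbaC?",
}

# Multi-byte UTF-8 sequences follow strict lead/continuation rules, so random
# high bytes rarely pass; a lone \xba is a continuation byte with no lead byte.
for label, raw in samples.items():
    print(label, is_valid_utf8(raw))
```

The first two samples pass the strict check, which is strong structural evidence. The third sample fails it, leaving only single-byte and legacy multi-byte candidates, where no such structure exists to break the tie.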

Hope that clarifies,

dpatryas-rtbhouse added the "bug" and "help wanted" labels Dec 8, 2023

Ousret closed this as not planned Dec 9, 2023

Ousret removed the "bug" and "help wanted" labels Dec 9, 2023

Ousret changed the title from "[BUG] Incorrect Encoding Detection for Expression Sequences in charset_normalizer.detect()" to "Detection and Tiny sequences" Dec 9, 2023

Ousret pinned this issue Dec 9, 2023
