html file is not reported as UTF8 after conversion #381

hrvoj3e · 2023-11-08T15:52:14Z

Verbose output
Using the CLI, run normalizer -v ./my-file.txt and paste the result in here.

❯ # rm+unzip

❯ normalizer -mvvv 110-original.htm
2023-11-08 16:42:49,817 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1250.
2023-11-08 16:42:49,821 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 0.783000 %
2023-11-08 16:42:49,821 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2023-11-08 16:42:49,830 | Level 5 | We detected language [('English', 0.6666), ('Indonesian', 0.5465), ('Dutch', 0.5131), ('Czech', 0.5052), ('Croatian', 0.4924), ('Slovak', 0.4878), ('Spanish', 0.4826), ('Italian', 0.4811), ('Slovene', 0.4773), ('Norwegian', 0.4647), ('Lithuanian', 0.458), ('Finnish', 0.458), ('Swedish', 0.4576), ('Romanian', 0.4563), ('Hungarian', 0.456), ('French', 0.4541), ('Danish', 0.4393), ('German', 0.4236), ('Polish', 0.4056), ('Portuguese', 0.4047), ('Vietnamese', 0.3819), ('Estonian', 0.3776), ('Turkish', 0.3677)] using cp1250
2023-11-08 16:42:49,830 | DEBUG | Encoding detection: cp1250 is most likely the one.
cp1250


❯ normalizer -rfnvvv 110-original.htm
2023-11-08 16:39:42,180 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1250.
2023-11-08 16:39:42,183 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 0.783000 %
2023-11-08 16:39:42,184 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2023-11-08 16:39:42,192 | Level 5 | We detected language [('English', 0.6666), ('Indonesian', 0.5465), ('Dutch', 0.5131), ('Czech', 0.5052), ('Croatian', 0.4924), ('Slovak', 0.4878), ('Spanish', 0.4826), ('Italian', 0.4811), ('Slovene', 0.4773), ('Norwegian', 0.4647), ('Lithuanian', 0.458), ('Finnish', 0.458), ('Swedish', 0.4576), ('Romanian', 0.4563), ('Hungarian', 0.456), ('French', 0.4541), ('Danish', 0.4393), ('German', 0.4236), ('Polish', 0.4056), ('Portuguese', 0.4047), ('Vietnamese', 0.3819), ('Estonian', 0.3776), ('Turkish', 0.3677)] using cp1250
2023-11-08 16:39:42,192 | DEBUG | Encoding detection: cp1250 is most likely the one.
{
    "path": "/home/adax/code/other/encoding/110-original.htm",
    "encoding": "cp1250",
    "encoding_aliases": [
        "1250",
        "windows_1250"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character",
        "General Punctuation",
        "Latin Extended-A",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.783,
    "coherence": 66.66,
    "unicode_path": "/home/adax/code/other/encoding/110-original.htm",
    "is_preferred": true
}

❯ normalizer -mvvv 110-original.htm
2023-11-08 16:41:07,958 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1250.
2023-11-08 16:41:07,961 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 1.267000 %
2023-11-08 16:41:07,962 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2023-11-08 16:41:07,970 | Level 5 | We detected language [('English', 0.7029), ('Indonesian', 0.572), ('Dutch', 0.51), ('Italian', 0.4949), ('Czech', 0.4862), ('Spanish', 0.4806), ('Croatian', 0.4724), ('Norwegian', 0.4692), ('Slovene', 0.4669), ('Romanian', 0.4632), ('Hungarian', 0.4624), ('Slovak', 0.4605), ('Finnish', 0.4565), ('German', 0.4533), ('Swedish', 0.4453), ('French', 0.443), ('Danish', 0.4366), ('Portuguese', 0.4116), ('Polish', 0.4113), ('Lithuanian', 0.3931), ('Estonian', 0.3828), ('Turkish', 0.3828), ('Vietnamese', 0.3795)] using cp1250
2023-11-08 16:41:07,970 | DEBUG | Encoding detection: cp1250 is most likely the one.
cp1250

enca, however, detects UTF-8 after the conversion, as it should:

❯ # rm+unzip

❯ enca -L hr 110-original.htm
Unrecognized encoding

❯ normalizer -rfnvvv 110-original.htm

❯ enca -L hr 110-original.htm
Universal transformation format 8 bits; UTF-8
  CRLF line terminators

Expected encoding
I expected normalizer to report UTF-8 once the file had been converted to UTF-8.
Am I wrong here?
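
For anyone who wants to check the converted file independently of both tools, a strict UTF-8 decode is a quick sanity test. This is only a minimal sketch using the Python standard library; the sample string is a hypothetical stand-in, not the content of 110-original.htm, and `is_valid_utf8` is a helper name made up for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the payload decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8")  # error handling is strict by default
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample text with Croatian letters (not the reporter's file).
text = "Čišćenje žutog đona"
utf8_bytes = text.encode("utf-8")
cp1250_bytes = text.encode("cp1250")

print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(cp1250_bytes))  # False

# The UTF-8 bytes of this sample also decode under cp1250 without raising;
# the result is simply mojibake. A byte-level check alone therefore cannot
# rule cp1250 out, which is one reason a declared charset can tip the balance.
mojibake = utf8_bytes.decode("cp1250")
print(mojibake != text)             # True
```

In practice the same check can be run on the file itself by passing `open(path, "rb").read()` to `is_valid_utf8`.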

Desktop (please complete the following information):

OS: Linux
Python version 3.11.5
Package version charset-normalizer==3.3.2

Additional context
I know HTML is not the same as plain text, but I want to document this here.

I think that "declarative mark" should not take over like that. But I am new to this encoding world....
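
To make the "declarative mark" concrete: the log line above suggests the file itself declares cp1250 somewhere near its start, and re-encoding the bytes does not change that declaration. The sketch below is an assumption-laden illustration, not charset-normalizer's actual detection code. The pattern, the `declared_charset` helper, the 4096-byte search window, and the sample markup are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical pattern: looks for charset=... or encoding=... declarations.
DECLARED = re.compile(
    rb'(?:charset|encoding)\s*=\s*["\']?([A-Za-z0-9_\-]+)',
    re.IGNORECASE,
)


def declared_charset(payload: bytes, search_zone: int = 4096) -> Optional[str]:
    """Return the charset name declared near the start of the payload, if any."""
    match = DECLARED.search(payload[:search_zone])
    return match.group(1).decode("ascii").lower() if match else None


# Hypothetical sample markup (the real file's declaration is not shown here).
html = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1250"><p>Čaj</p>'

original = html.encode("cp1250")
# Re-encoding changes the bytes of the text, not the declaration inside it.
converted = original.decode("cp1250").encode("utf-8")

print(declared_charset(original))   # windows-1250
print(declared_charset(converted))  # windows-1250
```

So after conversion the content is UTF-8 while the markup still claims windows-1250, and a detector that gives priority to the declaration can be led back to cp1250.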


Ousret · 2023-11-11T06:21:27Z

Yes, you are correct.
What you have is somewhat of an edge case, but problematic nonetheless.

> I think that "declarative mark" should not take over like that. But I am new to this encoding world....

Not entirely true, it's more complicated than that.

Fortunately, I know how to fix this. I don't know exactly when, but soon.
The idea is to do a regex replacement (in the spirit of PHP's preg_replace) within the normalizer CLI when a declarative mark is present.
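
As a rough illustration of that idea (not the eventual fix, whose details are not given in this thread), a regex substitution could rewrite the declared charset once the content has been converted. The pattern and the `rewrite_declared_charset` helper below are hypothetical, and the sketch only handles `charset=` declarations.

```python
import re

# Hypothetical pattern: captures the "charset=" prefix (with an optional
# opening quote) separately from the declared name.
CHARSET_DECL = re.compile(
    r'(charset\s*=\s*["\']?)([A-Za-z0-9_\-]+)',
    re.IGNORECASE,
)


def rewrite_declared_charset(markup: str, new_charset: str = "utf-8") -> str:
    """Replace the first declared charset name, keeping the surrounding syntax."""
    return CHARSET_DECL.sub(lambda m: m.group(1) + new_charset, markup, count=1)


# Hypothetical sample markup, decoded from cp1250 before rewriting.
source = '<meta charset="windows-1250"><p>Čaj</p>'.encode("cp1250")
markup = source.decode("cp1250")

fixed = rewrite_declared_charset(markup)
print(fixed)  # <meta charset="utf-8"><p>Čaj</p>

# Write-out step: the declaration now agrees with the actual encoding.
utf8_payload = fixed.encode("utf-8")
```

Using a callable replacement avoids any surprises with backslashes or group references in the new charset name, and `count=1` limits the change to the first declaration.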

hrvoj3e added the detection (Related to the charset detection mechanism, chaos/mess/coherence) and help wanted (Extra attention is needed) labels Nov 8, 2023

hrvoj3e changed the title ~~html file is not UTF8 after conversion~~ html file is not reported as UTF8 after conversion Nov 8, 2023
