html file is not reported as UTF8 after conversion #381

Open
hrvoj3e opened this issue Nov 8, 2023 · 1 comment
Labels
detection: Related to the charset detection mechanism, chaos/mess/coherence
help wanted: Extra attention is needed

Comments


hrvoj3e commented Nov 8, 2023

Provide the file
110-original.zip

Verbose output
Using the CLI, run normalizer -v ./my-file.txt and paste the result here.

❯ # rm+unzip

❯ normalizer -mvvv 110-original.htm
2023-11-08 16:42:49,817 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1250.
2023-11-08 16:42:49,821 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 0.783000 %
2023-11-08 16:42:49,821 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2023-11-08 16:42:49,830 | Level 5 | We detected language [('English', 0.6666), ('Indonesian', 0.5465), ('Dutch', 0.5131), ('Czech', 0.5052), ('Croatian', 0.4924), ('Slovak', 0.4878), ('Spanish', 0.4826), ('Italian', 0.4811), ('Slovene', 0.4773), ('Norwegian', 0.4647), ('Lithuanian', 0.458), ('Finnish', 0.458), ('Swedish', 0.4576), ('Romanian', 0.4563), ('Hungarian', 0.456), ('French', 0.4541), ('Danish', 0.4393), ('German', 0.4236), ('Polish', 0.4056), ('Portuguese', 0.4047), ('Vietnamese', 0.3819), ('Estonian', 0.3776), ('Turkish', 0.3677)] using cp1250
2023-11-08 16:42:49,830 | DEBUG | Encoding detection: cp1250 is most likely the one.
cp1250


❯ normalizer -rfnvvv 110-original.htm
2023-11-08 16:39:42,180 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1250.
2023-11-08 16:39:42,183 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 0.783000 %
2023-11-08 16:39:42,184 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2023-11-08 16:39:42,192 | Level 5 | We detected language [('English', 0.6666), ('Indonesian', 0.5465), ('Dutch', 0.5131), ('Czech', 0.5052), ('Croatian', 0.4924), ('Slovak', 0.4878), ('Spanish', 0.4826), ('Italian', 0.4811), ('Slovene', 0.4773), ('Norwegian', 0.4647), ('Lithuanian', 0.458), ('Finnish', 0.458), ('Swedish', 0.4576), ('Romanian', 0.4563), ('Hungarian', 0.456), ('French', 0.4541), ('Danish', 0.4393), ('German', 0.4236), ('Polish', 0.4056), ('Portuguese', 0.4047), ('Vietnamese', 0.3819), ('Estonian', 0.3776), ('Turkish', 0.3677)] using cp1250
2023-11-08 16:39:42,192 | DEBUG | Encoding detection: cp1250 is most likely the one.
{
    "path": "/home/adax/code/other/encoding/110-original.htm",
    "encoding": "cp1250",
    "encoding_aliases": [
        "1250",
        "windows_1250"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character",
        "General Punctuation",
        "Latin Extended-A",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.783,
    "coherence": 66.66,
    "unicode_path": "/home/adax/code/other/encoding/110-original.htm",
    "is_preferred": true
}

❯ normalizer -mvvv 110-original.htm
2023-11-08 16:41:07,958 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1250.
2023-11-08 16:41:07,961 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 1.267000 %
2023-11-08 16:41:07,962 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2023-11-08 16:41:07,970 | Level 5 | We detected language [('English', 0.7029), ('Indonesian', 0.572), ('Dutch', 0.51), ('Italian', 0.4949), ('Czech', 0.4862), ('Spanish', 0.4806), ('Croatian', 0.4724), ('Norwegian', 0.4692), ('Slovene', 0.4669), ('Romanian', 0.4632), ('Hungarian', 0.4624), ('Slovak', 0.4605), ('Finnish', 0.4565), ('German', 0.4533), ('Swedish', 0.4453), ('French', 0.443), ('Danish', 0.4366), ('Portuguese', 0.4116), ('Polish', 0.4113), ('Lithuanian', 0.3931), ('Estonian', 0.3828), ('Turkish', 0.3828), ('Vietnamese', 0.3795)] using cp1250
2023-11-08 16:41:07,970 | DEBUG | Encoding detection: cp1250 is most likely the one.
cp1250

enca, however, detects UTF-8 after the conversion, as it should:

❯ # rm+unzip

❯ enca -L hr 110-original.htm
Unrecognized encoding

❯ normalizer -rfnvvv 110-original.htm

❯ enca -L hr 110-original.htm
Universal transformation format 8 bits; UTF-8
  CRLF line terminators

Expected encoding
Expected normalizer to show UTF-8 encoding after conversion to UTF-8.
Am I wrong here?
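
For a check that does not depend on any detector, a strict decode is enough to tell whether the converted bytes are valid UTF-8. This is a minimal sketch using only the Python standard library; the helper name `is_valid_utf8` and the sample text are illustrative assumptions, not part of charset-normalizer or enca:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without errors."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample: Croatian text, encoded two ways.
sample = "Čišćenje žutog đumbira"
as_cp1250 = sample.encode("cp1250")
as_utf8 = sample.encode("utf-8")

print(is_valid_utf8(as_cp1250))
print(is_valid_utf8(as_utf8))
```

A strict decode only proves that the bytes are well-formed UTF-8. It says nothing about which of several valid encodings a detector will rank first.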

Desktop (please complete the following information):

  • OS: Linux
  • Python version 3.11.5
  • Package version charset-normalizer==3.3.2

Additional context
I know HTML is not the same as plain text, but I will document this here anyway.

I think that "declarative mark" should not take over like that. But I am new to this encoding world....
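
One plausible contributing factor, sketched here under assumptions: the charset declaration in the HTML is plain ASCII, so it survives the conversion unchanged, and the UTF-8 bytes for Croatian letters also happen to be decodable as cp1250. The HTML snippet below is hypothetical (the real file is in the attached zip), and this is not a description of how charset-normalizer scores candidates internally:

```python
# Hypothetical HTML resembling a windows-1250 page.
html = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1250">'
    '</head><body>Čišćenje žutog đumbira</body></html>'
)

original = html.encode("cp1250")                       # bytes as originally saved
converted = original.decode("cp1250").encode("utf-8")  # payload re-encoded as UTF-8

# The declaration is pure ASCII, so it is still present after conversion.
stale_declaration = b"charset=windows-1250" in converted

# The UTF-8 bytes decode as cp1250 without raising; the result is just mojibake.
misread = converted.decode("cp1250")

print(stale_declaration)
print(misread != html)
```

In this sketch both encodings accept the converted bytes, so a declaration that still says windows-1250 can plausibly tip the decision.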

hrvoj3e added the detection and help wanted labels Nov 8, 2023
hrvoj3e changed the title from "html file is not UTF8 after conversion" to "html file is not reported as UTF8 after conversion" Nov 8, 2023
Owner

Ousret commented Nov 11, 2023

Yes, you are correct.
What you have is somewhat of an edge case, but problematic nonetheless.

> I think that "declarative mark" should not take over like that. But I am new to this encoding world....

Not entirely true, it's more complicated than that.

Fortunately, I know how to fix this. I don't know exactly when, but soon.
The idea is to do a regex replace (preg_replace-style) on the declared charset within the normalizer CLI if there is a declarative mark.
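
A rough sketch of what such a replacement could look like in Python. The helper name `rewrite_declared_charset` and the pattern are assumptions made for illustration; this is not the fix that ships (or will ship) in charset-normalizer:

```python
import re

# Matches charset=<name> with an optional opening quote, as found in <meta> tags.
# Deliberately simple; an assumption, not the pattern charset-normalizer uses.
_CHARSET_RE = re.compile(
    r'''(charset\s*=\s*["']?)([A-Za-z0-9_\-:.]+)''',
    re.IGNORECASE,
)


def rewrite_declared_charset(text: str, new_charset: str = "utf-8") -> str:
    """Replace the first declared charset in text with new_charset."""
    return _CHARSET_RE.sub(lambda m: m.group(1) + new_charset, text, count=1)


before = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1250">'
after = rewrite_declared_charset(before)
print(after)
```

Only the first match is rewritten, on the assumption that the declaration sits near the top of the document; text without a declaration is returned unchanged.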
