1 sentence utf-8 detected as Windows-1252 #185

EralpB · 2019-12-14T10:49:02Z

Thanks for coming up with this utility, it fills a real need. But it fails even on the simplest examples, so I can't see how I'd trust it in production.

>>> import chardet
>>> bs_2 = b'Carter\xe2\x80\x99s Janitorial'
>>> chardet.detect(bs_2)
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
>>> bs_2.decode('cp1252')
'Carterâ€™s Janitorial'
>>> bs_2.decode('utf-8')
'Carter’s Janitorial'

Am I doing something wrong? It's just a couple of characters, and it's clearly UTF-8.
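One workaround for short inputs like this is to try a strict UTF-8 decode before asking a detector at all, since non-ASCII bytes that happen to form valid UTF-8 sequences are rarely anything else. The following is only a minimal sketch of that idea, not chardet's behavior: the function name `decode_utf8_first` and the `cp1252` fallback are assumptions made for illustration.

```python
from typing import Tuple


def decode_utf8_first(data: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used), preferring strict UTF-8.

    Hypothetical helper: the fallback encoding is an assumption and
    should be replaced by whatever legacy encoding your data uses.
    """
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Only fall back when the bytes cannot be UTF-8.
        return data.decode(fallback, errors="replace"), fallback


text, used = decode_utf8_first(b"Carter\xe2\x80\x99s Janitorial")
print(used, text)
```

The trade-off is that a legacy-encoded file whose bytes happen to be valid UTF-8 would be misread, which is unlikely for real text but possible.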

Ousret · 2020-01-03T09:42:22Z

You can try https://github.com/Ousret/charset_normalizer :)

yomajo · 2020-02-12T20:40:54Z

I am currently working with CSV files and have personally tried 2 detection libraries. Both failed to detect that the contents are Windows-encoded: cp1257.

In some StackOverflow answers I found:

def get_encoding(csvfile_path):
    with open(csvfile_path) as f:
        print(f)

which outputs:
<_io.TextIOWrapper name='Output/Headlines_data.csv' mode='r' encoding='cp1257'>

So I wrote a basic extractor myself, with a fallback to the library solution:
from re import findall

def get_encoding(csvfile_path):
    # Open in text mode and read the encoding out of the file object's repr.
    with open(csvfile_path) as f:
        raw_result = str(f)
    try:
        matches = findall(r"encoding='(.+)'>", raw_result)
        return matches[0]
    except IndexError:
        # No encoding in the repr: fall back to a detection library.
        # get_encoding_via_lib is my own helper (not shown here).
        return get_encoding_via_lib(csvfile_path)

I'm kinda new, sorry for indentation stuff, code block ignores my spaces...
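A caveat on this approach: when `open()` is called in text mode without an `encoding` argument, Python picks a default from the platform or locale settings and does not inspect the file's bytes, so the value in the repr describes the machine rather than the file. The sketch below illustrates this by writing the same text in two different encodings and comparing what `f.encoding` reports. The helper name `reported_encoding` is hypothetical and exists only for this demonstration.

```python
import os
import tempfile


def reported_encoding(raw: bytes) -> str:
    """Write raw bytes to a temp file, reopen it in text mode,
    and return the encoding that open() reports (hypothetical helper)."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(raw)
        # No encoding argument and nothing is read from the file.
        with open(path) as f:
            return f.encoding
    finally:
        os.remove(path)


utf8_bytes = "ąčę".encode("utf-8")
cp1257_bytes = "ąčę".encode("cp1257")

# Both calls report the same default, although the bytes on disk differ.
print(reported_encoding(utf8_bytes), reported_encoding(cp1257_bytes))
```

On a Windows machine with a Baltic locale that default can be `cp1257`, which would explain the output above matching the files by coincidence of locale rather than by detection.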

chomechome · 2020-05-12T01:45:08Z

I also haven't been able to successfully use chardet in production. So, I've recently built a more accurate encoding detection library myself: https://github.com/chomechome/charamel

It is better maintained and detects encodings more reliably. It avoids the chardet issue of UTF-8 content being mistakenly reported as Windows-1252; see chardet/chardet#185.

mcarans mentioned this issue Apr 8, 2020

Issue with change to chardet frictionlessdata/tabulator-py#305

Closed

nijel mentioned this issue Feb 19, 2021

UnicodeEncodeError: 'charmap' codec can't encode character '\u0119' WeblateOrg/weblate#5475

Closed

nijel mentioned this issue Feb 12, 2022

formats: Switch to charset-normalizer from chardet translate/translate#4572

Merged

guillermogf mentioned this issue Feb 27, 2022

Non-ASCII characters not shown properly on text preview ranger/ranger#1948

Open
