Switch from CRLF to LF line feed made the detector return different guess #232

Closed
BLKSerene opened this issue Nov 9, 2022 · 5 comments · Fixed by #233

Comments

@BLKSerene
Contributor

BLKSerene commented Nov 9, 2022

Describe the bug
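The issue was found while testing Charset Normalizer on CI across different OSes: the same UTF-16-BE text is detected as 'utf_16_be' when the file is written with CRLF line endings, but as 'utf_16_le' when it is written with LF line endings.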

To Reproduce

>>> import charset_normalizer
>>> text = '''English is a West Germanic language of the Indo-European language family, with its earliest forms spoken by the inhabitants of early medieval England.[3][4][5] It is named after the Angles, one of the ancient Germanic peoples that migrated to the island of Great Britain. English is genealogically West Germanic, closest related to the Low Saxon and Frisian languages; however, its vocabulary is also distinctively influenced by dialects of French (about 29% of modern English words) and Latin (also about 29%), plus some grammar and a small amount of core vocabulary influenced by Old Norse (a North Germanic language).[6][7][8] Speakers of English are called Anglophones.

The earliest forms of English, collectively known as Old English, evolved from a group of West Germanic (Ingvaeonic) dialects brought to Great Britain by Anglo-Saxon settlers in the 5th century and further mutated by Norse-speaking Viking settlers starting in the 8th and 9th centuries. Middle English began in the late 11th century after the Norman conquest of England, when considerable French (especially Old Norman) and Latin-derived vocabulary was incorporated into English over some three hundred years.[9][10] Early Modern English began in the late 15th century with the start of the Great Vowel Shift and the Renaissance trend of borrowing further Latin and Greek words and roots into English, concurrent with the introduction of the printing press to London. This era notably culminated in the King James Bible and plays of William Shakespeare.[11][12]

Modern English grammar is the result of a gradual change from a typical Indo-European dependent-marking pattern, with a rich inflectional morphology and relatively free word order, to a mostly analytic pattern with little inflection, and a fairly fixed subject–verb–object word order.[13] Modern English relies more on auxiliary verbs and word order for the expression of complex tenses, aspect and mood, as well as passive constructions, interrogatives and some negation.

Modern English has spread around the world since the 17th century as a consequence of the worldwide influence of the British Empire and the United States of America. Through all types of printed and electronic media of these countries, English has become the leading language of international discourse and the lingua franca in many regions and professional contexts such as science, navigation and law.[3] English is the most spoken language in the world[14] and the third-most spoken native language in the world, after Standard Chinese and Spanish.[15] It is the most widely learned second language and is either the official language or one of the official languages in 59 sovereign states. There are more people who have learned English as a second language than there are native speakers. As of 2005, it was estimated that there were over 2 billion speakers of English.[16] English is the majority native language in the United Kingdom, the United States, Canada, Australia, New Zealand and the Republic of Ireland (see Anglosphere), and is widely spoken in some areas of the Caribbean, Africa, South Asia, Southeast Asia, and Oceania.[17] It is a co-official language of the United Nations, the European Union and many other world and regional international organisations. It is the most widely spoken Germanic language, accounting for at least 70% of speakers of this Indo-European branch.'''# From wikipedia

>>> open('test.txt', 'w', encoding = 'utf_16_be', newline = '\r\n').write(text) # Windows-style line endings
3409
>>> charset_normalizer.from_path('test.txt').best().encoding # Correct!
'utf_16_be'
>>> open('test.txt', 'w', encoding = 'utf_16_be', newline = '\n').write(text) # Unix/Linux-style line endings
3409
>>> charset_normalizer.from_path('test.txt').best().encoding # Wrong!
'utf_16_le'
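The only thing that changes between the two runs above is the newline translation applied on write. As a minimal, standard-library-only sketch (the helper name `write_utf16_be` and the short sample text are made up for illustration, and Charset Normalizer itself is not involved), the following shows that the CRLF file is the LF file with two extra bytes per line break, so the detector receives a payload of a different length and layout in each case:

```python
import os
import tempfile


def write_utf16_be(text: str, newline: str) -> bytes:
    """Write `text` as UTF-16-BE using the given newline translation
    and return the raw bytes that ended up on disk.

    Hypothetical helper for illustration only.
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf_16_be", newline=newline) as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


# Short stand-in for the Wikipedia excerpt used in the report.
sample = "first paragraph\n\nsecond paragraph\n"

crlf_bytes = write_utf16_be(sample, "\r\n")  # Windows-style line endings
lf_bytes = write_utf16_be(sample, "\n")      # Unix/Linux-style line endings

# Each "\n" becomes "\r\n", i.e. one extra UTF-16 code unit (2 bytes).
extra = len(crlf_bytes) - len(lf_bytes)
print(extra, 2 * sample.count("\n"))
```

This also explains why `write` returns the same number in both runs: it reports the number of characters passed in, not the number of bytes written after newline translation.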

Expected behavior
Always return 'utf_16_be', regardless of the line endings or the OS.

Desktop (please complete the following information):

  • OS: Windows 11 x64
  • Python version: 3.8.10 x64
  • Package version: 3.0.0
@BLKSerene BLKSerene added the bug and help wanted labels Nov 9, 2022
@Ousret
Owner

Ousret commented Nov 12, 2022

The title is a bit misleading; in the end, I understood that switching from CRLF to LF made the detector return something else.
I took the time to try to reproduce your issue and could not. I initially tested on Python 3.11, then, out of pure curiosity, set up 3.8.10.
I used Windows 11 and Ubuntu. Nothing seems wrong: I got UTF-16-BE every time.

If your reproduction script was not accurate, please re-verify it and re-open this issue with additional information.

@Ousret Ousret closed this as not planned Nov 12, 2022
@Ousret Ousret changed the title from "[BUG] Inconsistent results on different operating systems" to "Switch from CRLF to LF line feed made the detector return different guess" Nov 12, 2022
@Ousret Ousret added the question label and removed the bug and help wanted labels Nov 12, 2022
@BLKSerene
Contributor Author

@Ousret Sorry for the confusion, the text was missing some sentences. I've updated the code (the return value of write() should be exactly 3409 now).

@BLKSerene
Contributor Author

I can't reopen this issue myself (or should I open a new one?). If you can re-verify this, please re-open it.

@Ousret
Owner

Ousret commented Nov 12, 2022

OK. The reproducer script now outputs what you encountered. I have narrowed it down to utils.cut_sequence_chunks, which did not cut chunks correctly.
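
The thread does not show how the cut went wrong, and the following is not the code of utils.cut_sequence_chunks. It is only a sketch of one way a badly cut chunk can flip a UTF-16 guess, under the assumption that a chunk starts on an odd byte offset: for mostly-ASCII text, a slice of UTF-16-BE bytes that is off by one byte decodes cleanly as UTF-16-LE, while decoding it as UTF-16-BE yields unrelated characters.

```python
text = "English is a West Germanic language"
payload = text.encode("utf_16_be")  # 2 bytes per character, no BOM

# A chunk cut on a code-unit boundary (even offset).
aligned = payload[0:20]

# A chunk of the same size cut one byte too late (odd offset).
# Illustrative assumption only; not taken from the library.
misaligned = payload[1:21]

print(repr(aligned.decode("utf_16_be")))     # readable text
print(repr(misaligned.decode("utf_16_le")))  # also readable text
print(repr(misaligned.decode("utf_16_be")))  # unrelated characters
```

Because the misaligned chunk looks like perfectly good UTF-16-LE, a detector that scores chunks independently could plausibly prefer 'utf_16_le' for it. Whether this is what happened here is not stated in the thread; the actual fix is in #233.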

@Ousret Ousret reopened this Nov 12, 2022
@Ousret
Owner

Ousret commented Nov 12, 2022

See #233

@Ousret Ousret removed the question label Nov 13, 2022