[BUG] Windows-1252 encoding is not detected in turkish text #407

Open
milahu opened this issue Dec 30, 2023 · 2 comments
Labels
detection Related to the charset detection mechanism, chaos/mess/coherence

Comments

milahu commented Dec 30, 2023

charset_normalizer returns None for this file, while chardetect and file both report an encoding:

$ chardetect star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt
star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt: Windows-1252 with confidence 0.73

$ file -i star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt
star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt: application/x-subrip; charset=iso-8859-1

$ python -c "import charset_normalizer; print(charset_normalizer.from_path('star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt').best())"
None

Who is right? chardetect is: the expected encoding is Windows-1252.

Decoding as iso-8859-1 maps byte 0x85 to the control character U+0085, which shows up as an ugly <U+0085> when piped to less (0085 as UTF-16 hex, c2 85 as UTF-8 hex bytes).

unicode-explorer.com/c/0085

U+0085: The "Next Line" (NEL) control character was used in the 1970s for controlling printers and displays (e.g. VT100). Moves to the first position of the next line.

--- star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt.iso-8859-1
+++ star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt.Windows-1252
@@ -2242,7 +2242,7 @@
 
 505
 00:43:04,098 --> 00:43:05,428
-Adil davranmaktan bahsetmiþken<U+0085>
+Adil davranmaktan bahsetmiþken…
 
 506
 00:43:06,771 --> 00:43:09,777
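The difference shown in the diff can be checked with Python's standard codecs. This is a minimal sketch: the byte string `LINE` is reconstructed from the quoted subtitle line (assuming 0xFE for the þ and 0x85 for the ellipsis), not read from the attached file. It also prints the cp1254 (Windows-1254, Turkish) decoding for comparison.

```python
# Byte 0x85 under the encodings discussed in this issue (stdlib codecs only).
RAW = b"\x85"

def decode_all(data: bytes) -> dict:
    """Decode the same bytes with each candidate encoding."""
    return {enc: data.decode(enc) for enc in ("iso-8859-1", "cp1252", "cp1254")}

decoded = decode_all(RAW)
for enc, text in decoded.items():
    # Show the resulting code point and its UTF-8 bytes.
    print(f"{enc:>10}: U+{ord(text):04X}  utf-8={text.encode('utf-8').hex()}")

# Assumption: bytes rebuilt from the subtitle line quoted in the diff.
LINE = b"Adil davranmaktan bahsetmi\xfeken\x85"
print(LINE.decode("cp1252"))  # 0xFE decodes to the letter thorn
print(LINE.decode("cp1254"))  # 0xFE decodes to s with cedilla
```

Only iso-8859-1 turns 0x85 into the NEL control character; both Windows code pages map it to the horizontal ellipsis U+2026.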

input file

star_trek_tng_-_season_1_ep_03_-_the_naked_now.srt

milahu added the "bug" and "help wanted" labels on Dec 30, 2023.
Ousret added the "detection" label and removed the "bug" and "help wanted" labels on Jan 2, 2024.
Ousret (Owner) commented Jan 2, 2024

OK, noted. Will try to improve this case for the next minor.

milahu (Author) commented Jan 2, 2024

Adil davranmaktan bahsetmiþken…

It really is just that one byte (0x85) that breaks charset_normalizer:

$ printf '\x85' | iconv -f cp1254 -t utf8
…

When I remove that byte, the encoding cp1254 is detected.
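The with/without comparison can be scripted roughly as follows. This is only a sketch: `strip_byte` and `compare_detection` are hypothetical helper names, and `detect_with_charset_normalizer` assumes the `charset_normalizer` package is installed and uses its `from_bytes(...).best()` call, the bytes counterpart of the `from_path` call used in the report.

```python
from typing import Callable, Optional, Tuple

SUSPECT = 0x85  # the single byte singled out in this thread

def strip_byte(data: bytes, byte: int = SUSPECT) -> bytes:
    """Return data with every occurrence of the given byte removed."""
    return data.replace(bytes([byte]), b"")

def compare_detection(data: bytes, detect: Callable[[bytes], object]) -> Tuple[object, object]:
    """Run a detector on the original bytes and on the bytes without 0x85."""
    return detect(data), detect(strip_byte(data))

def detect_with_charset_normalizer(data: bytes) -> Optional[str]:
    """Best-guess encoding name from charset_normalizer, or None."""
    from charset_normalizer import from_bytes  # third-party, imported lazily
    best = from_bytes(data).best()
    return best.encoding if best is not None else None

# Usage (path is whatever file you want to check):
# with open(path, "rb") as fh:
#     print(compare_detection(fh.read(), detect_with_charset_normalizer))
```

For the attached file, the thread reports no match for the original bytes and cp1254 once the byte is removed; other inputs or library versions may behave differently.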
