Detection of windows-1250 #87
Comments
@dan-blanchard I see the message "Our ISO-8859-2 and windows-1250 (Hungarian) probers have been temporarily disabled until we can retrain the models." in the wiki. I'd like to offer my help with re-enabling these probers, especially since windows-1250 is not only Hungarian but Central and East European in general, covering Polish, Czech, Slovak, Croatian and so on. Windows is still very popular in this region, and so are text files encoded in windows-1250. |
Thank you for this message.
Please let me know when you retrain the models.
Best regards,
Artur Šilić
On Thu, 12 Jan 2017 at 10:42, Bartosz Grabski wrote:
@dan-blanchard I see "Our ISO-8859-2
and windows-1250 (Hungarian) probers have been temporarily disabled until
we can retrain the models." message in the wiki.
I'd like to offer my help with enabling the probers, especially since
windows-1250 is not only Hungarian, but in general Central & East European,
including Polish, Czech, Slovak, Croatian and so on. And Windows is still
very popular in this region, and so are text files encoded in windows-1250.
|
I haven't quite had the time to finish this up, but I actually have a local branch where I'm working on retraining all our models so that this issue goes away. |
I used a chardet version from around 1.5 years ago. When you give it Latin-2
text, it almost consistently marks it as Latin-1. Since I only had to
disambiguate between Latin-2, windows-1250 and UTF-8, this hack works, but of
course it would be better to have a more general model. There is still a
lot of material from Eastern Europe in old encodings.
Best regards,
Artur
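For readers facing the same three-way choice, here is a minimal sketch of one closed-set heuristic. It is not chardet's method and not necessarily the hack described above, whose details are not given in this thread. The function name `guess_encoding` and its `default` parameter are hypothetical. It assumes the input can only be UTF-8, ISO-8859-2 or windows-1250, and relies on two facts: bytes 0x80-0x9F are C1 control codes in ISO-8859-2 but mostly printable characters in windows-1250, and decoding with the wrong table tends to turn letters into symbols.

```python
def guess_encoding(raw: bytes, default: str = "windows-1250") -> str:
    """Pick among UTF-8, ISO-8859-2 and windows-1250 only (heuristic sketch)."""
    # Step 1: legacy 8-bit text rarely happens to be valid UTF-8.
    # Note that pure ASCII input is also reported as "utf-8".
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # Step 2: bytes 0x80-0x9F are control codes in ISO-8859-2 but carry
    # letters and punctuation in windows-1250.
    if any(0x80 <= b <= 0x9F for b in raw):
        return "windows-1250"

    # Step 3: decode both ways and prefer the reading with more letters,
    # since a wrong table tends to produce symbols instead of letters.
    letters_latin2 = sum(ch.isalpha() for ch in raw.decode("iso-8859-2"))
    letters_cp1250 = sum(
        ch.isalpha() for ch in raw.decode("cp1250", errors="replace")
    )
    if letters_latin2 > letters_cp1250:
        return "iso-8859-2"
    if letters_cp1250 > letters_latin2:
        return "windows-1250"
    return default  # tie: the bytes alone do not settle it


sample = "Zażółć gęślą jaźń"
for codec in ("utf-8", "iso-8859-2", "cp1250"):
    print(codec, "->", guess_encoding(sample.encode(codec)))
```

Ties in step 3 are real (some bytes map to letters in both tables), so the fallback is a guess and short inputs deserve extra caution.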
On Thu, 12 Jan 2017 at 16:34, Dan Blanchard wrote:
I haven't quite had the time to finish this up, but I actually have a
local branch where I'm working on retraining all our models so that this
issue goes away.
|
#99 will re-enable these with newly trained models. |
Thanks, I'll give it a try.
On Tue, 11 Apr 2017 at 22:34, Dan Blanchard wrote:
#99 will re-enable these
with newly trained models.
|
Hi!
Is it possible to detect windows-1250? The current implementation returns "windows-1252" for text encoded in windows-1250. The same question applies to "ISO-8859-2" vs. "ISO-8859-1".
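A small sketch of why this confusion arises, using only Python's standard codecs: the same bytes decode without error under both windows-1250 and windows-1252, so a detector has to tell them apart statistically. The sample sentence is an arbitrary choice for illustration. The `chardet.detect` call at the end is optional and only runs if chardet is installed; its result depends on the chardet version, so no expected output is shown for it.

```python
# chardet is optional here; the rest uses only the standard library.
try:
    import chardet
except ImportError:
    chardet = None

text = "Zażółć gęślą jaźń"      # Polish sample with windows-1250 letters
raw = text.encode("cp1250")

as_1250 = raw.decode("cp1250")  # the intended reading
as_1252 = raw.decode("cp1252")  # also decodes, but yields mojibake

print(as_1250)
print(as_1252)

if chardet is not None:
    # Returns a dict with keys such as "encoding" and "confidence".
    print(chardet.detect(raw))
```

Because both decodes succeed, a byte-validity check alone cannot separate the two encodings.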