Detection of windows-1250 #87

7artur · 2016-04-14T11:25:22Z

Hi!
Is it possible to detect windows-1250? The current implementation returns "windows-1252" for text encoded in windows-1250. The same question applies to "ISO-8859-2" vs. "ISO-8859-1".
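A minimal sketch of why this confusion is easy to hit, using only Python's standard codecs (not chardet itself). The sample sentence is an assumption chosen for illustration: its windows-1250 bytes are all valid windows-1252 bytes too, so decoding with the wrong codec raises no error and only produces the wrong characters.

```python
# Illustrative sample text (an assumption, not taken from the report).
text = "Zażółć gęślą jaźń"

# windows-1250 bytes read as windows-1252: decoding succeeds, text is wrong.
cp1250_bytes = text.encode("cp1250")
as_cp1252 = cp1250_bytes.decode("cp1252")

# ISO-8859-2 bytes read as ISO-8859-1: ISO-8859-1 accepts every byte value,
# so this can never fail, whatever the real encoding was.
latin2_bytes = text.encode("iso-8859-2")
as_latin1 = latin2_bytes.decode("iso-8859-1")

print(as_cp1252)
print(as_latin1)
```

Because no decoding error occurs, a detector has to rely on character statistics to tell these encodings apart.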

bartoszgrabski · 2017-01-12T09:42:14Z

@dan-blanchard I see the message "Our ISO-8859-2 and windows-1250 (Hungarian) probers have been temporarily disabled until we can retrain the models." in the wiki.

I'd like to offer my help with re-enabling these probers, especially since windows-1250 is not only Hungarian but Central and Eastern European in general, including Polish, Czech, Slovak, Croatian and so on. Windows is still very popular in this region, and so are text files encoded in windows-1250.

7artur · 2017-01-12T12:32:37Z

Thank you for this message. Please let me know when you retrain the models. Best regards, Artur Šilić

dan-blanchard · 2017-01-12T15:34:30Z

I haven't quite had the time to finish this up, but I actually have a local branch where I'm working on retraining all our models so that this issue goes away.

7artur · 2017-01-12T15:41:45Z

I used a chardet version from around 1.5 years ago. When you give it latin2 text, it almost consistently marks it as latin1 text. Since I only had to disambiguate between latin2, windows-1250 and UTF-8, this hack works, but of course it would be better to have a more general model. There is still a lot of material from Eastern Europe in old encodings. Best regards, Artur
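The workaround itself is not shown in the thread. As a rough illustration only, one way to choose among just these three encodings without chardet might look like the sketch below. The function name `guess_encoding` and the byte-range heuristic are assumptions, not the code referred to above.

```python
def guess_encoding(data: bytes) -> str:
    """Guess among UTF-8, windows-1250 and ISO-8859-2 only (hypothetical helper)."""
    # Non-ASCII text in a legacy single-byte encoding is rarely valid UTF-8,
    # so a successful strict decode is treated as UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Bytes 0x80-0x9F are C1 control codes in ISO-8859-2 but printable
    # characters in windows-1250 (for example 0x9A is "š", 0x9E is "ž").
    if any(0x80 <= b <= 0x9F for b in data):
        return "windows-1250"
    return "iso-8859-2"


sample = "Šilić"
for codec in ("utf-8", "cp1250", "iso-8859-2"):
    print(codec, "->", guess_encoding(sample.encode(codec)))
```

This is a crude heuristic: windows-1250 text that happens to contain no bytes in the 0x80-0x9F range is reported as ISO-8859-2, which is why a trained model is the better general solution.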

dan-blanchard · 2017-04-11T20:34:30Z

#99 will re-enable these with newly trained models.

7artur · 2017-04-12T06:33:53Z

Thanks, I'll give it a try.

