
Merge with cChardet? #19

Open
dan-blanchard opened this issue Jan 7, 2014 · 21 comments

Comments

@dan-blanchard
Member

I only recently discovered that there's a substantially faster version of chardet, cChardet, that's just a Cython wrapper around uchardet-enhanced.

According to their benchmarks it's about 2800 times faster, so if we're only doing the same things they are, maybe we should recommend that people who are using CPython use that instead.

@sigmavirus24
Member

chardet is extraordinarily slow, so I second the recommendation.

@dan-blanchard
Member Author

Another idea is that we could try to merge cChardet with chardet so that there's one package: if you're on CPython (or potentially PyPy, if cChardet works through cpyext), it uses cChardet; otherwise it uses the pure Python version.
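
A minimal sketch of that dispatch, assuming both packages keep exposing a detect() with the same result shape (illustrative only; neither package ships this shim):

```python
# Prefer the compiled backend when importable, else pure Python.
try:
    import cchardet as _impl  # C extension: CPython, PyPy via cpyext
except ImportError:
    import chardet as _impl   # pure-Python fallback

def detect(data: bytes) -> dict:
    # Both backends return a dict like {'encoding': ..., 'confidence': ...}
    return _impl.detect(data)
```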

@Diaoul

Diaoul commented Jan 21, 2014

+1
Using cChardet as an optional C-extension speed-up for chardet is a great idea.

@Diaoul

Diaoul commented Jan 29, 2014

Any progress on this? Also, the ability to "help" chardet by supplying a language would be nice.

@dan-blanchard
Member Author

I haven't had a chance to look into this more, as chardet is very low on my priority list at the moment.

Really, we should talk to @PyYoshi about such a potential merger, but as far as I'm aware no one has reached out to them yet, although I guess @-mentioning them in this message sort of counts. 😄

@dan-blanchard dan-blanchard changed the title Check feature parity with cChardet Merge with cChardet Sep 9, 2014
@zougloub

zougloub commented Oct 4, 2014

Hi,

I'm just providing some input. I like what chardet does, but I think it could be improved, and the first step is to use a native backend for performance reasons (its source code is nice to read, but really slow). I'm sorry if I diverge a little from the original topic.

  • The new chardet project's health looks good
  • Python is not good at doing what chardet does
  • A split of libchardetect out of the cChardet source tree would be nice, to restore some "order", i.e., have a portable C library that non-Pythonistas can use
  • I stumbled upon an error in the SJIS hiragana detection which is also in the "Dive Into Python 3" code (I was catching up on news and opened a bunch of tabs in my browser, and saw http://www.diveintopython3.net/case-study-porting-chardet-to-python-3.html#unorderabletypes where '\202' is changed into 202 instead of 0x82; see the snippet after this list). Fixed in zougloub@ab176dc, by the way.
  • I also found differences in behaviour between chardet and cChardet
  • I propose exposing the probabilities for all the decoders in the results (even if pruning happens at some point for most of them) so that API clients can make further decisions (monkey-patching of C code is impossible)
  • There are various forks/wrappers of NS universalchardetector; the original certainly cannot be used as is, since it sits deep inside the Gecko source tree and carries the superfluous NSPR dependency. It would be nice if there were a reference version that is unarguably the right one.
  • As of 2014, universalchardetector could use a rework. At the very least, utf-8 validation should be attempted earlier: most documents are now utf-8, and chardet can fail horribly at utf-8 detection. A lot of short utf-8 texts are detected as latin1 because they don't contain many multibyte utf-8 sequences, so I think at least that heuristic is broken. A lot of tables are hard-coded in the sources and do not represent the "preferred form of the work for making modifications to it"; they correspond to frequency analysis over corpora of text, which could be updated.
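
The '\202' slip above is easy to demonstrate: in a Python string literal, \202 is an octal escape, so the correct integer is 0x82, not decimal 202:

```python
# '\202' is octal: 0o202 == 0x82 == 130. Reading the digits as
# decimal (202) silently corrupts the character-frequency table.
assert ord('\202') == 0o202 == 0x82 == 130
assert 202 != 0x82
```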

I'll hang around in #chardet on Freenode; maybe I'll get something done over the weekend, maybe not.

@sigmavirus24
Member

Hi @zougloub, your comment is very hard to follow, especially in the context of this issue. If I reply to something and it isn't what you meant, I apologize in advance.

The new chardet project's health looks good

I disagree. There are almost no contributors and neither @dan-blanchard nor I have much time for active development or maintenance of it.

Python is not good at doing what chardet does

We all agree on that.

A split of libchardetect out of the cChardet source tree would be nice, to restore some "order", i.e., have a portable C library that non-Pythonistas can use

I'm not familiar with cChardet or libchardetect, and I'm not sure what you mean about restoring some "order". As I understand it, libchardetect can already be used by non-Pythonistas.

I stumbled upon an error in the SJIS hiragana detection which is also in the "Dive Into Python 3" code (I was catching up on news and opened a bunch of tabs in my browser, and saw http://www.diveintopython3.net/case-study-porting-chardet-to-python-3.html#unorderabletypes where '\202' is changed into 202 instead of 0x82). Fixed in zougloub/chardet@ab176dc, by the way.

If you found a bug (which is what I believe you're saying) please open a separate issue or (better yet) pull request regarding that. If you already have a fix (which it seems you do), please only open a pull request.

I also found differences in behaviour between chardet and cChardet

They're separate projects. Were we to merge with cChardet, those differences would likely be resolved. Alternatively, if we simply added bindings to libchardetect, those differences would not be problematic.

I propose exposing the probabilities for all the decoders in the results (even if pruning happens at some point for most of them) so that API clients can make further decisions (monkey-patching of C code is impossible)

This would have to be exposed differently. Our current API is stable and, in my opinion, sufficient. There should be a separate issue to discuss expanding the API to return the n most likely encodings.
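
For illustration only, here is one possible shape for such an expanded API (detect_ranked() is hypothetical and not part of chardet):

```python
from typing import Dict, List

import chardet

def detect_ranked(data: bytes, n: int = 3) -> List[Dict]:
    # Today's stable API returns only the single best guess; a real
    # implementation would surface every prober's confidence before
    # pruning, then sort by it.
    results = [chardet.detect(data)]
    return sorted(results, key=lambda r: r['confidence'] or 0.0,
                  reverse=True)[:n]
```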

There are various forks/wrappers of NS universalchardetector; the original certainly cannot be used as is, since it sits deep inside the Gecko source tree and carries the superfluous NSPR dependency. It would be nice if there were a reference version that is unarguably the right one.

As I already said, neither of us who are working on this has the time to do it. If someone else does, that would be great.

As of 2014, universalchardetector could use a rework. At the very least, utf-8 validation should be attempted earlier: most documents are now utf-8, and chardet can fail horribly at utf-8 detection. A lot of short utf-8 texts are detected as latin1 because they don't contain many multibyte utf-8 sequences, so I think at least that heuristic is broken. A lot of tables are hard-coded in the sources and do not represent the "preferred form of the work for making modifications to it"; they correspond to frequency analysis over corpora of text, which could be updated.

Yes, this is already planned. @dan-blanchard has already graciously volunteered to spend some time on this. I think you'll also find that our utf-8 detection is better now (although it hasn't been released to PyPI just yet).
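
For reference, the "attempt strict UTF-8 validation first" step being discussed is cheap to sketch (illustrative, not the actual detector code):

```python
# A strict decode is a reliable validity check that could run
# before any statistical detection.
def looks_like_utf8(data: bytes) -> bool:
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

assert looks_like_utf8('héllo'.encode('utf-8'))
assert not looks_like_utf8('héllo'.encode('latin-1'))
```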

@zougloub

zougloub commented Oct 4, 2014

Thanks for the answer @sigmavirus24. I've been using chardet/charade/chardet for a while (every day, through "less") and it saves me time, but it sometimes gets things wrong, so when I went to look into the source code I got a little lost, which should not necessarily be the case (and this goes all the way up to the NS code). I'm taking a few hours to start a new library (https://github.com/zougloub/libchardet); at first I'll try to make it expandable and able to proxy to existing implementations of the NS code, but in the end it would be nice to have a limpid library that does character encoding detection and that is used by the Python chardet (or cChardet). Sorry for hijacking the thread.

@dan-blanchard
Member Author

@sigmavirus24 I've actually been thinking that we could switch to using the latest code from upstream directly via ctypes, since ctypes is supported by more than just CPython (unlike Cython, which cChardet uses). Do you have any objections to such an approach?
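
A hedged sketch of what that ctypes wrapping could look like, assuming a shared build of uchardet(-enhanced) exposing the public C API from uchardet.h (the library name varies per platform; this is not shipped chardet code):

```python
import ctypes

lib = ctypes.CDLL("libuchardet.so")  # platform-specific name
lib.uchardet_new.restype = ctypes.c_void_p
lib.uchardet_handle_data.argtypes = [ctypes.c_void_p, ctypes.c_char_p,
                                     ctypes.c_size_t]
lib.uchardet_data_end.argtypes = [ctypes.c_void_p]
lib.uchardet_get_charset.argtypes = [ctypes.c_void_p]
lib.uchardet_get_charset.restype = ctypes.c_char_p
lib.uchardet_delete.argtypes = [ctypes.c_void_p]

def detect(data: bytes) -> str:
    ud = lib.uchardet_new()
    try:
        lib.uchardet_handle_data(ud, data, len(data))
        lib.uchardet_data_end(ud)
        return lib.uchardet_get_charset(ud).decode("ascii")
    finally:
        lib.uchardet_delete(ud)
```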

@sigmavirus24
Member

CFFI is the far better option; it has better support in PyPy and other implementations.
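
The CFFI (ABI-mode) counterpart of the ctypes sketch above, with the same caveats; the cdef lines follow uchardet's public header, assumed unchanged in the enhanced fork:

```python
from cffi import FFI

ffi = FFI()
ffi.cdef("""
    typedef void *uchardet_t;
    uchardet_t uchardet_new(void);
    int uchardet_handle_data(uchardet_t ud, const char *data, size_t len);
    void uchardet_data_end(uchardet_t ud);
    void uchardet_delete(uchardet_t ud);
    const char *uchardet_get_charset(uchardet_t ud);
""")
lib = ffi.dlopen("libuchardet.so")  # ABI mode: no compiler at run time

def detect(data: bytes) -> str:
    ud = lib.uchardet_new()
    try:
        lib.uchardet_handle_data(ud, data, len(data))
        lib.uchardet_data_end(ud)
        return ffi.string(lib.uchardet_get_charset(ud)).decode("ascii")
    finally:
        lib.uchardet_delete(ud)
```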

@dan-blanchard
Member Author

Wow, CFFI is really nice. Somehow I missed that before and I've just been using ctypes.

Hmm... maybe a CFFI refactoring for DRMAA Python is in order too.

@dan-blanchard
Member Author

As I've discussed in other issues and PRs (namely #42 and #48), I'm more concerned at this point with making chardet more accurate and Pythonic. After all the refactoring is done and we have the ability to train new models, then we can work on making it faster, but at that point we'll be doing things quite differently from cChardet.

@dan-blanchard
Member Author

😬 It is really astonishing how wrong I can be sometimes. Since cChardet wraps uchardet-enhanced, and not the original Mozilla code, it actually has all of the fixes we want as far as improved accuracy and whatnot are concerned. Maybe a CFFI wrapper around uchardet-enhanced is the way to go after all.

@dan-blanchard dan-blanchard added this to the 3.0 milestone Jan 9, 2015
@dan-blanchard dan-blanchard changed the title Merge with cChardet Switch to wrapping uchardet-enhanced via CFFI (or merge with cChardet) Jan 9, 2015
@dan-blanchard
Member Author

A slight wrinkle in this plan is that there is one encoding we support that cChardet/uchardet-enhanced does not: CP932, a variant of SHIFT_JIS.
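
For concreteness, here is a byte sequence that decodes only under CP932, not under plain SHIFT_JIS (CP932 adds the NEC special characters, among others):

```python
# 0x8740 is defined in CP932 (Windows-31J) but not in JIS X 0208,
# so the strict shift_jis codec rejects it.
data = b'\x87\x40'
print(data.decode('cp932'))  # '①'
try:
    data.decode('shift_jis')
except UnicodeDecodeError:
    print('not valid SHIFT_JIS')
```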

We also have a longstanding PR to add MacRoman support.

@dan-blanchard
Member Author

I am now completely against this, because I want chardet to be able to detect encodings and languages that neither cChardet nor uchardet-enhanced can. That's coming in the 4.0 release (#99).

@dan-blanchard
Member Author

I'm re-opening this issue because I'm beginning to think it would make more sense to just try to extend cChardet (and the forked version of uchardet-enhanced it now uses) to support additional languages than to ignore the substantial speed differences here.

@dan-blanchard dan-blanchard reopened this Oct 19, 2017
@dan-blanchard dan-blanchard changed the title Switch to wrapping uchardet-enhanced via CFFI (or merge with cChardet) Merge with cChardet? Oct 19, 2017
@sigmavirus24
Member

This is definitely reasonable. If we do this correctly with CFFI we can get the speed without having to dive too much into C or restrict ourselves to CPython.

@dan-blanchard
Member Author

I actually checked after my most recent comment, and cChardet appears to work on PyPy as is. It does not work on Jython or IronPython from what I could tell, which I guess would be the only advantage of using CFFI, since the CFFI docs say it provides "a reasonable path for other Python implementations like IronPython and Jython."

@dan-blanchard
Member Author

dan-blanchard commented Dec 12, 2018

Latest update on this: cChardet actually supports many of the encodings now that I wanted to add in #99, so it makes even more sense to try to merge.

@jayvdb

jayvdb commented Jul 30, 2019

From a distro packaging perspective, this needs to be done carefully or the result may be a lot of confusion.

Distros typically provide uchardet, now hosted by freedesktop, which is behind many of the forks that have sprung up. I created https://github.com/PyYoshi/uchardet/issues/5 and https://gitlab.freedesktop.org/uchardet/uchardet/issues/13 about this. It looks like @PyYoshi's fork is one of the most advanced, so it would be a significant amount of work to get those patches into the main repo.

Distros will want to build cChardet using the uchardet library that they already have from freedesktop, which has a different feature set.

Note also PyYoshi/cChardet#26, which is a fairly significant detection difference from chardet, given the prevalence of the Windows curly smart quotes. Obviously the chardet test suite should pass when using the optimised merged cChardet version, which should hopefully catch regressions like that.
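
To make the curly-quote case concrete: 0x93/0x94 are the Windows-1252 smart quotes, and as raw bytes they are invalid UTF-8, so short inputs like this force detectors back onto statistics, where chardet and cChardet can disagree:

```python
data = b'\x93Hello\x94'
print(data.decode('windows-1252'))  # “Hello”
try:
    data.decode('utf-8')  # 0x93 is a bare continuation byte
except UnicodeDecodeError:
    print('not valid UTF-8')
```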

@Ousret

Ousret commented Aug 27, 2019

Hi,

Merging with cChardet is still a good idea, and still relevant nowadays.

There is still hope regarding chardet: I have worked on an alternative that is fully native to Python and faster than chardet, though certainly slower than uchardet.
Maybe it's worth a look: https://github.com/Ousret/charset_normalizer/

I don't use a specific probe for each encoding; around 90 encodings are detectable.
I'm seeking feedback.
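
A minimal usage sketch, assuming the from_bytes()/best() API that charset_normalizer documents today (which may differ from the version announced here):

```python
from charset_normalizer import from_bytes

payload = 'Всеки човек има право на образование.'.encode('cp1251')
best = from_bytes(payload).best()
if best is not None:
    print(best.encoding)  # a cp1251-compatible guess
```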
