
Merge with cChardet? #19

Open
dan-blanchard opened this issue Jan 7, 2014 · 21 comments

Comments

@dan-blanchard
Member

I only recently discovered that there's a substantially faster version of chardet, cChardet, that's just a Cython wrapper around uchardet-enhanced.

According to their benchmarks it's about 2800 times faster, so if we're only doing the same things they are, maybe we should recommend that people who are using CPython use that instead.

@sigmavirus24
Member

chardet is extraordinarily slow, so I second the recommendation.

@dan-blanchard
Member Author

Another idea is that we could try to merge cChardet with chardet so that there's one package: if you're on CPython (or potentially PyPy, if cChardet works through cpyext), it uses cChardet; otherwise it uses the pure Python version.
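
A minimal sketch of that dispatch, assuming both packages keep exposing a detect() with the same result shape (illustrative only; neither package ships this shim):

```python
# Prefer the compiled backend when importable, else pure Python.
try:
    import cchardet as _impl  # C extension: CPython, PyPy via cpyext
except ImportError:
    import chardet as _impl   # pure-Python fallback

def detect(data: bytes) -> dict:
    # Both backends return a dict like {'encoding': ..., 'confidence': ...}
    return _impl.detect(data)
```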

@Diaoul

Diaoul commented Jan 21, 2014

+1
Using cChardet as an optional C-extension speed-up for chardet is a great idea.

@Diaoul

Diaoul commented Jan 29, 2014

Any progress on this? Also, the ability to "help" chardet by supplying a language would be nice.

@dan-blanchard
Member Author

I haven't had a chance to look into this more, as chardet is very low on my priority list at the moment.

Really, we should talk to @PyYoshi about such a potential merger, but as far as I'm aware no one has reached out to them yet, although I guess @-mentioning them in this message sort of counts. 😄

@dan-blanchard dan-blanchard changed the title Check feature parity with cChardet Merge with cChardet Sep 9, 2014
@zougloub

zougloub commented Oct 4, 2014

Hi,

I'm just providing some input. I like what chardet does, but I think it could be improved, and the first step is to use a native backend for performance reasons (its source code is nice to read, but really slow). I'm sorry if I diverge a little from the original topic.

  • The new chardet project's health looks good
  • Python is not good at doing what chardet does
  • A split of libchardetect out of the cChardet source tree would be nice, to restore some "order", i.e., have a portable C library that non-Pythonistas can use
  • I stumbled upon an error in the SJIS hiragana detection which is also in the "Dive Into Python 3" code (I was catching up on news and opened a bunch of tabs in my browser, and saw http://www.diveintopython3.net/case-study-porting-chardet-to-python-3.html#unorderabletypes where '\202' is changed into 202 instead of 0x82; see the snippet after this list). Fixed in zougloub@ab176dc, by the way.
  • I also found differences in behaviour between chardet and cChardet
  • I propose exposing the probabilities for all the decoders in the results (even if pruning happens at some point for most of them) so that API clients can make further decisions (monkey-patching of C code is impossible)
  • There are various forks/wrappers of NS universalchardetector; the original certainly cannot be used as is, since it sits deep inside the Gecko source tree and carries the superfluous NSPR dependency. It would be nice if there were a reference version that is unarguably the right one.
  • As of 2014, universalchardetector could use a rework. At the very least, utf-8 validation should be attempted earlier: most documents are now utf-8, and chardet can fail horribly at utf-8 detection. A lot of short utf-8 texts are detected as latin1 because they don't contain many multibyte utf-8 sequences, so I think at least that heuristic is broken. A lot of tables are hard-coded in the sources and do not represent the "preferred form of the work for making modifications to it"; they correspond to frequency analysis over corpora of text, which could be updated.
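
The '\202' slip above is easy to demonstrate: in a Python string literal, \202 is an octal escape, so the correct integer is 0x82, not decimal 202:

```python
# '\202' is octal: 0o202 == 0x82 == 130. Reading the digits as
# decimal (202) silently corrupts the character-frequency table.
assert ord('\202') == 0o202 == 0x82 == 130
assert 202 != 0x82
```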

I'll hang around in #chardet on Freenode; maybe I'll get something done over the weekend, maybe not.

@sigmavirus24
Member

Hi @zougloub, your comment is very hard to follow, especially in the context of this issue. If I reply to something and it isn't what you meant, I apologize in advance.

The new chardet project's health looks good

I disagree. There are almost no contributors and neither @dan-blanchard nor I have much time for active development or maintenance of it.

Python is not good at doing what chardet does

We all agree on that.

A split of libchardetect out of the cChardet source tree would be nice, to restore some "order", i.e., have a portable C library that non-Pythonistas can use

I'm not familiar with cChardet or libchardetect, and I'm not sure what you mean about restoring some "order". As I understand it, libchardetect can already be used by non-Pythonistas.

I stumbled upon an error in the SJIS hiragana detection which is also in the "Dive Into Python 3" code (I was catching up on news and opened a bunch of tabs in my browser, and saw http://www.diveintopython3.net/case-study-porting-chardet-to-python-3.html#unorderabletypes where '\202' is changed into 202 instead of 0x82). Fixed in zougloub/chardet@ab176dc, by the way.

If you found a bug (which is what I believe you're saying) please open a separate issue or (better yet) pull request regarding that. If you already have a fix (which it seems you do), please only open a pull request.

I also found differences in behaviour between chardet and cChardet

They're separate projects. Were we to merge with cChardet, those differences would likely be resolved. Alternatively, if we simply added bindings to libchardetect, those differences would not be problematic.

I propose exposing the probabilities for all the decoders in the results (even if pruning happens at some point for most of them) so that API clients can make further decisions (monkey-patching of C code is impossible)

This would have to be exposed differently. Our current API is stable and, in my opinion, sufficient. There should be a separate issue to discuss expanding the API to return the n most likely encodings.
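
For illustration only, here is one possible shape for such an expanded API (detect_ranked() is hypothetical and not part of chardet):

```python
from typing import Dict, List

import chardet

def detect_ranked(data: bytes, n: int = 3) -> List[Dict]:
    # Today's stable API returns only the single best guess; a real
    # implementation would surface every prober's confidence before
    # pruning, then sort by it.
    results = [chardet.detect(data)]
    return sorted(results, key=lambda r: r['confidence'] or 0.0,
                  reverse=True)[:n]
```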

There are various forks/wrappers of NS universalchardetector; the original certainly cannot be used as is, since it sits deep inside the Gecko source tree and carries the superfluous NSPR dependency. It would be nice if there were a reference version that is unarguably the right one.

As I already said, neither of us who are working on this has the time to do it. If someone else does, that would be great.

As of 2014, universalchardetector could use a rework. At the very least, utf-8 validation should be attempted earlier: most documents are now utf-8, and chardet can fail horribly at utf-8 detection. A lot of short utf-8 texts are detected as latin1 because they don't contain many multibyte utf-8 sequences, so I think at least that heuristic is broken. A lot of tables are hard-coded in the sources and do not represent the "preferred form of the work for making modifications to it"; they correspond to frequency analysis over corpora of text, which could be updated.

Yes, this is already planned. @dan-blanchard has already graciously volunteered to spend some time on this. I think you'll also find that our utf-8 detection is better now (although it hasn't been released to PyPI just yet).
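
For reference, the "attempt strict UTF-8 validation first" step being discussed is cheap to sketch (illustrative, not the actual detector code):

```python
# A strict decode is a reliable validity check that could run
# before any statistical detection.
def looks_like_utf8(data: bytes) -> bool:
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

assert looks_like_utf8('héllo'.encode('utf-8'))
assert not looks_like_utf8('héllo'.encode('latin-1'))
```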

@zougloub

zougloub commented Oct 4, 2014

Thanks for the answer @sigmavirus24. I've been using chardet/charade/chardet for a while (every day, through "less") and it saves me time, but it sometimes gets things wrong, so when I went to look into the source code I got a little lost, which should not necessarily be the case (and this goes all the way up to the NS code). I'm taking a few hours to start a new library (https://github.com/zougloub/libchardet); at first I'll try to make it expandable and able to proxy to existing implementations of the NS code, but in the end it would be nice to have a limpid library that does character encoding detection and that is used by the Python chardet (or cChardet). Sorry for hijacking the thread.

@dan-blanchard
Member Author

@sigmavirus24 I've actually been thinking that we could switch to using the latest code from upstream directly via ctypes, since ctypes is supported by more than just CPython (unlike Cython, which cChardet uses). Do you have any objections to such an approach?
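
A hedged sketch of what that ctypes wrapping could look like, assuming a shared build of uchardet(-enhanced) exposing the public C API from uchardet.h (the library name varies per platform; this is not shipped chardet code):

```python
import ctypes

lib = ctypes.CDLL("libuchardet.so")  # platform-specific name
lib.uchardet_new.restype = ctypes.c_void_p
lib.uchardet_handle_data.argtypes = [ctypes.c_void_p, ctypes.c_char_p,
                                     ctypes.c_size_t]
lib.uchardet_data_end.argtypes = [ctypes.c_void_p]
lib.uchardet_get_charset.argtypes = [ctypes.c_void_p]
lib.uchardet_get_charset.restype = ctypes.c_char_p
lib.uchardet_delete.argtypes = [ctypes.c_void_p]

def detect(data: bytes) -> str:
    ud = lib.uchardet_new()
    try:
        lib.uchardet_handle_data(ud, data, len(data))
        lib.uchardet_data_end(ud)
        return lib.uchardet_get_charset(ud).decode("ascii")
    finally:
        lib.uchardet_delete(ud)
```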

@sigmavirus24
Member

CFFI is the far better option; it has better support in PyPy and other implementations.
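
The CFFI (ABI-mode) counterpart of the ctypes sketch above, with the same caveats; the cdef lines follow uchardet's public header, assumed unchanged in the enhanced fork:

```python
from cffi import FFI

ffi = FFI()
ffi.cdef("""
    typedef void *uchardet_t;
    uchardet_t uchardet_new(void);
    int uchardet_handle_data(uchardet_t ud, const char *data, size_t len);
    void uchardet_data_end(uchardet_t ud);
    void uchardet_delete(uchardet_t ud);
    const char *uchardet_get_charset(uchardet_t ud);
""")
lib = ffi.dlopen("libuchardet.so")  # ABI mode: no compiler at run time

def detect(data: bytes) -> str:
    ud = lib.uchardet_new()
    try:
        lib.uchardet_handle_data(ud, data, len(data))
        lib.uchardet_data_end(ud)
        return ffi.string(lib.uchardet_get_charset(ud)).decode("ascii")
    finally:
        lib.uchardet_delete(ud)
```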

@dan-blanchard
Member Author

Wow, CFFI is really nice. Somehow I missed that before and I've just been using ctypes.

Hmm... maybe a CFFI refactoring for DRMAA Python is in order too.

@dan-blanchard
Member Author

As I've discussed in other issues and PRs (namely #42 and #48), I'm more concerned at this point with making chardet more accurate and Pythonic. After all the refactoring is done and we have the ability to train new models, then we can work on making it faster, but at that point we'll be doing things quite differently from cChardet.

@dan-blanchard
Member Author

😬 It is really astonishing how wrong I can be sometimes. Since cChardet wraps uchardet-enhanced, and not the original Mozilla code, it actually has all of the fixes we want as far as improved accuracy and whatnot are concerned. Maybe a CFFI wrapper around uchardet-enhanced is the way to go after all.

@dan-blanchard dan-blanchard added this to the 3.0 milestone Jan 9, 2015
@dan-blanchard dan-blanchard changed the title Merge with cChardet Switch to wrapping uchardet-enhanced via CFFI (or merge with cChardet) Jan 9, 2015
@dan-blanchard
Member Author

A slight wrinkle in this plan is that there is one encoding we support that cChardet/uchardet-enhanced does not: CP932, a variant of SHIFT_JIS.
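
For concreteness, here is a byte sequence that decodes only under CP932, not under plain SHIFT_JIS (CP932 adds the NEC special characters, among others):

```python
# 0x8740 is defined in CP932 (Windows-31J) but not in JIS X 0208,
# so the strict shift_jis codec rejects it.
data = b'\x87\x40'
print(data.decode('cp932'))  # '①'
try:
    data.decode('shift_jis')
except UnicodeDecodeError:
    print('not valid SHIFT_JIS')
```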

We also have a longstanding PR to add MacRoman support.

@dan-blanchard
Member Author

I am now completely against this, because I want chardet to be able to detect encodings and languages that neither cChardet nor uchardet-enhanced can. That's coming in the 4.0 release (#99).

@dan-blanchard
Member Author

I'm re-opening this issue because I'm beginning to think it would make more sense to just try to extend cChardet (and the forked version of uchardet-enhanced it now uses) to support additional languages than to ignore the substantial speed differences here.

@dan-blanchard dan-blanchard reopened this Oct 19, 2017
@dan-blanchard dan-blanchard changed the title Switch to wrapping uchardet-enhanced via CFFI (or merge with cChardet) Merge with cChardet? Oct 19, 2017
@sigmavirus24
Member

This is definitely reasonable. If we do this correctly with CFFI we can get the speed without having to dive too much into C or restrict ourselves to CPython.

@dan-blanchard
Member Author

I actually checked after my most recent comment, and cChardet appears to work on PyPy as is. It does not work on Jython or IronPython from what I could tell, which I guess would be the only advantage of using CFFI, since the CFFI docs say it provides "a reasonable path for other Python implementations like IronPython and Jython."

@dan-blanchard
Member Author

dan-blanchard commented Dec 12, 2018

Latest update on this: cChardet actually supports many of the encodings now that I wanted to add in #99, so it makes even more sense to try to merge.

@jayvdb

jayvdb commented Jul 30, 2019

From a distro packaging perspective, this needs to be done carefully or the result may be a lot of confusion.

Distros typically provide uchardet, now hosted by freedesktop, which is behind many of the forks that have sprung up. I created https://github.com/PyYoshi/uchardet/issues/5 and https://gitlab.freedesktop.org/uchardet/uchardet/issues/13 about this. It looks like @PyYoshi's fork is one of the most advanced, so it would be a significant amount of work to get those patches into the main repo.

Distros will want to build cChardet using the uchardet library that they already have from freedesktop, which has a different feature set.

Note also PyYoshi/cChardet#26, which is a fairly significant detection difference from chardet, given the prevalence of the Windows curly smart quotes. Obviously the chardet test suite should pass when using the optimised merged cChardet version, which should hopefully catch regressions like that.
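
To make the curly-quote case concrete: 0x93/0x94 are the Windows-1252 smart quotes, and as raw bytes they are invalid UTF-8, so short inputs like this force detectors back onto statistics, where chardet and cChardet can disagree:

```python
data = b'\x93Hello\x94'
print(data.decode('windows-1252'))  # “Hello”
try:
    data.decode('utf-8')  # 0x93 is a bare continuation byte
except UnicodeDecodeError:
    print('not valid UTF-8')
```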

@Ousret

Ousret commented Aug 27, 2019

Hi,

Merging with cChardet is still a good idea, and still relevant nowadays.

There is still hope regarding chardet: I have worked on an alternative that is fully native to Python and faster than chardet, though certainly slower than uchardet.
Maybe it's worth a look: https://github.com/Ousret/charset_normalizer/

I don't use a specific probe for each encoding; around 90 encodings are detectable.
I'm seeking feedback.
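
A minimal usage sketch, assuming the from_bytes()/best() API that charset_normalizer documents today (which may differ from the version announced here):

```python
from charset_normalizer import from_bytes

payload = 'Всеки човек има право на образование.'.encode('cp1251')
best = from_bytes(payload).best()
if best is not None:
    print(best.encoding)  # a cp1251-compatible guess
```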
