Merge with cChardet? #19
Comments
Chardet is extraordinarily slow, so I second the recommendation.
Another idea is that we could try to merge cChardet with chardet so that there's one package, and if you're on CPython (or potentially PyPy, if cChardet works through cpyext), it uses cChardet, and otherwise it uses the pure Python version.
+1
Any progress on this? Also the ability to "help" chardet by supplying a language would be nice.
I haven't had a chance to look into this more, as chardet is very low on my priority list at the moment. Really, we should talk to @PyYoshi about such a potential merger, but no one has reached out to them yet as far as I'm aware; although, I guess @-mentioning in this message sort of counts. 😄
Hi, I'm just providing some input: I like what chardet does, but I think it could be improved, and the first thing is to use a native backend for performance reasons (it's nice to read its source code, but it's really slow). I'm sorry if I diverge a little from the original topic.
I'll hang around in #chardet on Freenode; maybe I'll get something done over the weekend, maybe not.
Hi @zougloub, your comment is very hard to follow, especially in the context of this issue. If my reply misses what you actually meant, I apologize in advance.
I disagree. There are almost no contributors and neither @dan-blanchard nor I have much time for active development or maintenance of it.
We all agree on that.
I'm not familiar with cChardet or libchardetect and I'm not sure what you mean about
If you found a bug (which is what I believe you're saying) please open a separate issue or (better yet) pull request regarding that. If you already have a fix (which it seems you do), please only open a pull request.
They're separate projects. Were we to merge with cChardet, those differences would likely be resolved. Alternatively if we simply added bindings to libchardetect, those differences would not be problematic.
This would have to be exposed differently. Currently our API is stable and, in my opinion, sufficient. There should be a separate issue to discuss expanding the API to return the
As I already said, neither of us working on this has the time to do this. If someone else does, that would be great.
Yes this is already planned. @dan-blanchard has already graciously volunteered to spend some time on this. I think you'll also find that our utf-8 detection is better (although it hasn't been released to PyPI just yet).
Thanks for the answer @sigmavirus24; I've been using chardet/charade/chardet for a while (every day, through "less") and it saves me time, but it sometimes gets it wrong, so when I went to look into the source code I got a little lost, which should not necessarily be the case (and this goes up to the NS code). I'm taking a few hours to start a new library (https://github.com/zougloub/libchardet). I'll try at first to make it expandable and proxyable with some implementations of the NS code, but in the end it would be nice to have a limpid library that does character encoding detection and that is used by the Python chardet (or cChardet). Sorry for hijacking the thread.
@sigmavirus24 I've actually been thinking that we could switch to using the latest code from upstream directly via
CFFI is the far better option; it has better support in PyPy and other implementations.
Wow, CFFI is really nice. Somehow I missed that before and I've just been using ctypes. Hmm... maybe a CFFI refactoring for DRMAA Python is in order too.
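For comparison, the ctypes route mentioned above looks roughly like this. I'm using libc's strlen as a stand-in here, since libchardet's actual entry points aren't settled yet; a real binding would load that library and declare its own (currently hypothetical) detector function the same way.

```python
import ctypes
import ctypes.util

# Load the C library. A libchardet binding would instead do
# something like ctypes.CDLL("libchardet.so") -- hypothetical name.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the signature explicitly: ctypes defaults restype to
# c_int, which silently truncates size_t/pointer results on 64-bit.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello"))  # 5
```

The CFFI equivalent would declare the same prototype as a C string via `ffi.cdef()` instead of Python-side attribute assignments, which is what makes it pleasant to maintain against a real header file.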
As I've discussed in other issues and PRs (namely #42 and #48), I'm more concerned at this point with making chardet more accurate and Pythonic. After all the refactoring is done, and we have the ability to train new models, then we can work on making it faster, but at that point we'll be doing things quite differently from cChardet.
😬 It is really astonishing how wrong I can be sometimes. Since cChardet wraps uchardet-enhanced, and not the original Mozilla code, it actually has all of the fixes we want as far as improved accuracy and whatnot are concerned. Maybe a CFFI wrapper around uchardet-enhanced is the way to go after all.
A slight wrinkle in this plan is that there is one encoding we support that cChardet/uchardet-enhanced does not, CP932. It's a variant of SHIFT_JIS. We also have a longstanding PR to add MacRoman support.
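The CP932 difference is easy to demonstrate with Python's own codecs: CP932 is a superset of Shift JIS that adds NEC special characters and IBM extensions, which the strict shift_jis codec rejects.

```python
# 0x87 0x40 encodes the circled digit one (U+2460) in CP932
# (Windows-31J), but is not a valid strict Shift JIS sequence.
data = b"\x87\x40"

print(data.decode("cp932"))  # ①

try:
    data.decode("shift_jis")
except UnicodeDecodeError:
    print("not valid shift_jis")
```

So a detector that only knows SHIFT_JIS will either reject or mislabel real-world Windows Japanese text that uses these extension rows.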
I am now completely against this, because I want chardet to be able to detect encodings and languages that neither cChardet nor uchardet-enhanced do. That's coming in the 4.0 release (#99).
I'm re-opening this issue because I'm beginning to think it would make more sense to just try to extend cChardet (and the forked version of uchardet-enhanced it now uses) to support additional languages than to ignore the substantial speed differences here.
This is definitely reasonable. If we do this correctly with CFFI we can get the speed without having to dive too much into C or restrict ourselves to CPython.
I actually checked after my most recent comment, and cChardet appears to work on PyPy as-is. It does not work on Jython or IronPython from what I could tell, which I guess would be the only advantage of using CFFI, since the CFFI docs say it provides "a reasonable path for other Python implementations like IronPython and Jython."
Latest update on this: cChardet actually supports many of the encodings now that I wanted to add in #99, so it makes even more sense to try to merge.
From a distro packaging perspective, this needs to be done carefully or the result may be a lot of confusion. Distros typically provide uchardet, now hosted by freedesktop, which is behind many of the forks that have sprung up. I created https://github.com/PyYoshi/uchardet/issues/5 and https://gitlab.freedesktop.org/uchardet/uchardet/issues/13 about this. It looks like @PyYoshi's fork is one of the most advanced, so it would be a significant amount of work to get those patches into the main repo. Distros will want to build cChardet using the uchardet library that they already have from freedesktop, which has a different feature set. Note also PyYoshi/cChardet#26, which is a fairly significant difference in detection from
Hi, merging with cChardet is still a good idea, and still relevant nowadays. There is still hope regarding chardet: I have worked on an alternative that is fully native to Python and faster than chardet, but slower than uchardet for sure. I don't use a specific probe per encoding. Around 90 detectable encodings.
I only recently discovered that there's a substantially faster version of chardet, cChardet, that's just a Cython wrapper around uchardet-enhanced. According to their benchmarks it's about 2800 times faster, so if we're only doing the same things they are, maybe we should recommend that people who are using CPython use that.
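If anyone wants to sanity-check the claimed speedup locally, a rough best-of-N timing harness is enough. The harness below is generic; the usage sketch in the comments assumes both chardet and cChardet are installed, which this issue obviously can't guarantee.

```python
import time

def bench(detect, payload: bytes, runs: int = 5) -> float:
    """Return the best-of-N wall time, in seconds, for one detect() call."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        detect(payload)
        best = min(best, time.perf_counter() - start)
    return best

# Usage sketch -- assumes both packages are installed:
#   import chardet, cchardet
#   data = open("big_file.txt", "rb").read()
#   print(bench(chardet.detect, data) / bench(cchardet.detect, data))
```

Best-of-N rather than mean keeps one cold-cache run from skewing the ratio; the payload should be a reasonably large real-world file, since tiny inputs mostly measure call overhead.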