Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use cChardet fork #7126

Closed
1 task done
serjflint opened this issue Dec 10, 2022 · 13 comments · Fixed by #7561
Closed
1 task done

Use cChardet fork #7126

serjflint opened this issue Dec 10, 2022 · 13 comments · Fixed by #7561
Milestone

Comments

@serjflint
Copy link

Is your feature request related to a problem?

cChardet is probably abandoned and charset-normalizer seems to be slower according to #6819.

Describe the solution you'd like

Several days ago cChardet was forked and patches for 3.10 and 3.11 were applied.

Describe alternatives you've considered

Check out original issue #6819 and PyYoshi/cChardet#81

Related component

Server

Additional context

No response

Code of Conduct

  • I agree to follow the aio-libs Code of Conduct
@OmLanke
Copy link

OmLanke commented Jan 1, 2023

Any updates on this?

@Dreamsorcerer Dreamsorcerer added this to the 3.9 milestone Jan 1, 2023
@socketpair
Copy link
Contributor

I suggest dropping all character set guessing technologies.

@wbarnha
Copy link
Member

wbarnha commented Feb 14, 2023

I forgot this PR was opened. I'm the current maintainer of https://github.com/faust-streaming/cChardet and the only issue I have at the moment is support for macOS ARM builds is unavailable. Other than that I think this is a safe merge.

@bitnom
Copy link

bitnom commented Mar 22, 2023

I suggest dropping all character set guessing technologies.

I forgot this PR was opened. I'm the current maintainer of https://github.com/faust-streaming/cChardet and the only issue I have at the moment is support for macOS ARM builds is unavailable. Other than that I think this is a safe merge.

I suggest using faust-streaming/cChardet for the speedup, but giving the user the option to disable charset guessing technologies altogether.

@socketpair
Copy link
Contributor

@asvetlov I think, character set guessing technologies are obsolete nowadays. I vote for removal from aiohttp. What you think?

@wbarnha
Copy link
Member

wbarnha commented Mar 22, 2023

@asvetlov I think, character set guessing technologies are obsolete nowadays. I vote for removal from aiohttp. What you think?

For my own edification, what replaced character set guessing techniques? It'd be useful to know in order to understand cChardet's purpose in this project to begin with, and what would replace it.

@socketpair
Copy link
Contributor

socketpair commented Mar 22, 2023

@wbarnha Nothing replaces. Guessing is not required anymore. Everyone has to setup correct content-types. Also, UTF-8 is a winner. If one wants something else, he MUST specify it explicitly. No implicit encodings anymore.

That's my thoughts.

@Dreamsorcerer
Copy link
Member

Dreamsorcerer commented Mar 22, 2023

To be honest, that sounds pretty reasonable. Just looking through the code, it appears that it is only used for ClientResponse.text() and ClientResponse.json(). The latter is almost always going to be utf-8, and if a user needs codec guessing for the former, they can use chardet with ClientResponse.read() directly (or try/except first maybe).

@Dreamsorcerer
Copy link
Member

Also, if the mimetype is set correctly (which ClientResponse.json() requires by default), then codec guessing is not done for ClientResponse.json() anyway. So, it's really just ClientResponse.text() we're looking at.

@john-parton
Copy link
Contributor

All of the major web browsers have some faculty for character set guessing. Here's an interesting blog post on the matter: https://hsivonen.fi/chardetng/

@Dreamsorcerer
Copy link
Member

Yes, but we're not a browser, so legacy compatibility is a little less relevant, while performance is more relevant. We shouldn't penalise performance for something that is probably needed for 1 in a million requests or less. It should be trivial for a user to add charset guessing directly in their code, so I'm still thinking that we should still remove this library.

@socketpair
Copy link
Contributor

socketpair commented Aug 27, 2023

Guessing anything is definitely a bad thing. For example, if a Russian text starts with an English phrase, a typical guesser says it's English. I strongly consider, earlier or sooner, these installations must fail. I think the time has come.

Absolutely agreed about performance. Moreover, cchardet is an extra dependency. “Supply chain” problem is also important.

@Dreamsorcerer
Copy link
Member

We've removed charset guessing from aiohttp itself, but added a new fallback_charset_resolver parameter to ClientSession which can be used to call a charset guesser (which means the choice of library etc. is totally up to the user now). It simply defaults to lambda *_: "utf-8".

Once backports are done, we should have a transitionary release in 3.8.6, where the new parameter will be present, but defaults to the older behaviour of trying charset-normalizer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment