Use cChardet fork #7126

serjflint · 2022-12-10T06:23:57Z

Is your feature request related to a problem?

cChardet is probably abandoned and charset-normalizer seems to be slower according to #6819.

Describe the solution you'd like

Several days ago cChardet was forked and patches for 3.10 and 3.11 were applied.

Describe alternatives you've considered

Check out original issue #6819 and PyYoshi/cChardet#81

Related component

Server

Additional context

No response

Code of Conduct

I agree to follow the aio-libs Code of Conduct

OmLanke · 2023-01-01T18:47:58Z

Any updates on this?

socketpair · 2023-02-04T15:25:06Z

I suggest dropping all character set guessing technologies.

wbarnha · 2023-02-14T18:15:48Z

I forgot this PR was opened. I'm the current maintainer of https://github.com/faust-streaming/cChardet, and the only issue I have at the moment is that macOS ARM builds are unavailable. Other than that, I think this is a safe merge.

bitnom · 2023-03-22T18:13:48Z

> I suggest dropping all character set guessing technologies.

> I forgot this PR was opened. I'm the current maintainer of https://github.com/faust-streaming/cChardet, and the only issue I have at the moment is that macOS ARM builds are unavailable. Other than that, I think this is a safe merge.

I suggest using faust-streaming/cChardet for the speedup, but giving the user the option to disable charset guessing technologies altogether.

socketpair · 2023-03-22T18:32:27Z

@asvetlov I think character set guessing technologies are obsolete nowadays. I vote for removing them from aiohttp. What do you think?

wbarnha · 2023-03-22T18:33:43Z

> @asvetlov I think character set guessing technologies are obsolete nowadays. I vote for removing them from aiohttp. What do you think?

For my own edification, what replaced character set guessing techniques? It'd be useful to know in order to understand cChardet's purpose in this project to begin with, and what would replace it.

socketpair · 2023-03-22T18:38:42Z

@wbarnha Nothing replaces it. Guessing is not required anymore. Everyone has to set up correct content types. Also, UTF-8 is the winner. If one wants something else, they MUST specify it explicitly. No implicit encodings anymore.

Those are my thoughts.

Dreamsorcerer · 2023-03-22T18:52:46Z

To be honest, that sounds pretty reasonable. Just looking through the code, it appears that it is only used for ClientResponse.text() and ClientResponse.json(). The latter is almost always going to be utf-8, and if a user needs codec guessing for the former, they can use chardet with ClientResponse.read() directly (or try/except first maybe).
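For anyone who wants to see what that could look like in user code, here is a minimal sketch of the "try UTF-8 first, guess only on failure" idea. The helper names `decode_with_fallback` and `chardet_guess` are made up for illustration, and the final `latin-1` fallback is just an assumption to keep the function from raising.

```python
from typing import Callable, Optional


def decode_with_fallback(
    raw: bytes,
    guess_encoding: Callable[[bytes], Optional[str]],
) -> str:
    """Decode raw bytes as UTF-8, calling a guesser only if that fails."""
    try:
        # Fast path: no guessing cost for the common case.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Slow path: ask the user-supplied guesser; latin-1 is an
        # arbitrary last resort (assumption) because it never fails.
        encoding = guess_encoding(raw) or "latin-1"
        return raw.decode(encoding, errors="replace")


def chardet_guess(raw: bytes) -> Optional[str]:
    """Hypothetical guesser backed by chardet (imported lazily)."""
    import chardet

    return chardet.detect(raw)["encoding"]


# Possible usage with aiohttp (not executed here):
#     raw = await resp.read()
#     text = decode_with_fallback(raw, chardet_guess)
```

Passing the guesser in as a callable keeps the choice of library (chardet, charset-normalizer, or something else) entirely in the caller's hands.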

Dreamsorcerer · 2023-03-22T19:05:50Z

Also, if the mimetype is set correctly (which ClientResponse.json() requires by default), then codec guessing is not done for ClientResponse.json() anyway. So, it's really just ClientResponse.text() we're looking at.
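To make the order of decisions concrete, here is a rough sketch of the logic as described above. It is not aiohttp's actual code: `pick_encoding` is a hypothetical helper, and it uses the standard library's `email.message.Message` to parse the Content-Type header.

```python
import codecs
from email.message import Message
from typing import Callable


def pick_encoding(
    content_type: str,
    body: bytes,
    fallback: Callable[[bytes], str] = lambda body: "utf-8",
) -> str:
    """Choose an encoding: explicit charset, then JSON rule, then fallback."""
    msg = Message()
    msg["Content-Type"] = content_type

    # 1. An explicit, known charset in the header always wins.
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        try:
            codecs.lookup(charset)
            return charset
        except LookupError:
            pass  # unknown charset name: fall through

    # 2. JSON is treated as UTF-8 without any guessing.
    if msg.get_content_type() == "application/json":
        return "utf-8"

    # 3. Only now is the (possibly expensive) fallback consulted.
    return fallback(body)
```

With this ordering, a guesser is only ever reached for non-JSON responses that do not declare a usable charset.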

john-parton · 2023-08-26T21:13:00Z

All of the major web browsers have some faculty for character set guessing. Here's an interesting blog post on the matter: https://hsivonen.fi/chardetng/

Dreamsorcerer · 2023-08-27T16:38:22Z

Yes, but we're not a browser, so legacy compatibility is a little less relevant, while performance is more relevant. We shouldn't penalise performance for something that is probably needed for 1 in a million requests or fewer. It should be trivial for a user to add charset guessing directly in their code, so I still think we should remove this library.

socketpair · 2023-08-27T16:49:01Z

Guessing anything is definitely a bad thing. For example, if a Russian text starts with an English phrase, a typical guesser says it's English. I firmly believe that, sooner or later, these installations must fail. I think the time has come.

Absolutely agreed about performance. Moreover, cchardet is an extra dependency. The “supply chain” problem is also important.

Dreamsorcerer · 2023-08-31T13:47:42Z

We've removed charset guessing from aiohttp itself, but added a new fallback_charset_resolver parameter to ClientSession which can be used to call a charset guesser (which means the choice of library etc. is totally up to the user now). It simply defaults to lambda *_: "utf-8".

Once backports are done, we should have a transitional release in 3.8.6, where the new parameter will be present but will default to the older behaviour of trying charset-normalizer.
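As an illustration of how the new parameter might be used, here is a minimal sketch. It assumes the resolver is called with the response and the raw body and returns an encoding name; `bom_charset_resolver` and `fetch_text` are hypothetical names, and the resolver only sniffs a byte order mark rather than calling a real guessing library.

```python
import codecs


def bom_charset_resolver(response, body: bytes) -> str:
    """Return an encoding based on a leading BOM, else UTF-8.

    The response argument is unused here; a real resolver could
    inspect its headers or URL.
    """
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # UTF-32 must be checked before UTF-16: the UTF-32 LE BOM
    # begins with the same two bytes as the UTF-16 LE BOM.
    if body.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8"


async def fetch_text(url: str) -> str:
    """Fetch a URL and decode it using the resolver as a fallback."""
    import aiohttp  # imported lazily so the resolver works without aiohttp

    async with aiohttp.ClientSession(
        fallback_charset_resolver=bom_charset_resolver
    ) as session:
        async with session.get(url) as resp:
            return await resp.text()
```

A library-backed guesser could be swapped in the same way, for example by returning the result of chardet or charset-normalizer from the resolver.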

serjflint added the enhancement label Dec 10, 2022

Dreamsorcerer added this to the 3.9 milestone Jan 1, 2023

Dreamsorcerer mentioned this issue Aug 11, 2023

Performance issue in case of repeating text()/json() calls for single response instance #7516

Open

1 task

john-parton mentioned this issue Aug 26, 2023

Replace cChardet with chardetng_py. #7559

Closed

5 tasks

This was referenced Aug 27, 2023

Remove charset-normalizer and cchardet as dependencies. Update docs #7560

Closed

Remove chardet/charset-normalizer. Add fallback_charset_resolver ClientSession parameter. #7561

Merged

Dreamsorcerer closed this as completed in #7561 Aug 29, 2023
