
apparent_encoding should be cached since chardet can be slow #6250

Open
mlissner opened this issue Sep 29, 2022 · 3 comments

Comments

@mlissner
Contributor

We have some scraper code that sometimes gets back PDFs and other times gets back HTML. Today we learned that if you access r.text on a large-ish PDF response (40MB), chardet is called, which uses a lot of CPU (and a ton of memory):

r = requests.get(some_url)
r.text

That's more or less fine (best not to try to get the text of a PDF this way), but if you access r.text more than once, chardet gets run over and over.

We have code like this that performs horribly:

r = requests.get(some_url)
if bad_text in r.text:
    continue
if other_bad_text in r.text:
    continue
# ...many more tests...

When you access r.text, it checks whether the encoding can be taken from the HTTP headers. If not, it falls back to the apparent_encoding property, which looks like this:

    @property
    def apparent_encoding(self):
        """The apparent encoding, provided by the charset_normalizer or chardet libraries."""
        return chardet.detect(self.content)["encoding"]

I think that property should probably be cached since it's slow, so that repeated calls to r.text don't hurt so badly.
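To illustrate the idea, here's a rough sketch as a hypothetical subclass (not a proposed patch; the class name is made up, and functools.cached_property just memoizes the detection result on the instance):

import functools

import chardet
import requests


class CachedEncodingResponse(requests.Response):
    """Hypothetical subclass for illustration only; not a proposed patch."""

    @functools.cached_property
    def apparent_encoding(self):
        # cached_property stores the result on the instance, so
        # chardet.detect runs at most once per response.
        return chardet.detect(self.content)["encoding"]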

Expected Result

I expected the calls to the text property to only calculate the encoding once per request.

Actual Result

Each call to the text property re-calculates the encoding, which is slow and uses a lot of memory (this is probably a bug in chardet, but it uses hundreds of MB on a 40MB PDF right now).

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "5.0.0"
  },
  "charset_normalizer": {
    "version": "2.0.12"
  },
  "cryptography": {
    "version": "36.0.2"
  },
  "idna": {
    "version": "2.10"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.10.7"
  },
  "platform": {
    "release": "5.4.209-116.363.amzn2.x86_64",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "101010ef",
    "version": "20.0.1"
  },
  "requests": {
    "version": "2.28.1"
  },
  "system_ssl": {
    "version": "101010ef"
  },
  "urllib3": {
    "version": "1.26.5"
  },
  "using_charset_normalizer": false,
  "using_pyopenssl": true
}
mlissner added a commit to freelawproject/juriscraper that referenced this issue Sep 29, 2022
It turns out that r.text calls chardet every time it is accessed. That's
not great because chardet can be slow and use a lot of memory,
particularly when checking PDFs.

Instead of doing that or checking if things are PDFs all the time, simply use
the binary content instead of the text.

Fixes: #564
Relates to: psf/requests#6250
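In practice that change amounts to comparing bytes against r.content instead of strings against r.text, roughly like this (a sketch mirroring the excerpt from the issue; urls, some_url, and bad_text are stand-ins):

for some_url in urls:
    r = requests.get(some_url)
    # Searching the raw bytes never decodes the body, so chardet never runs.
    if bad_text.encode("utf-8") in r.content:
        continue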
@sigmavirus24
Contributor

You can do this today by setting r.encoding = r.apparent_encoding
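For example, applied to the excerpt from the original report (a sketch; only the second line is new):

r = requests.get(some_url)
r.encoding = r.apparent_encoding  # detection runs exactly once, here
if bad_text in r.text:            # r.text now reuses r.encoding on every access
    continue
if other_bad_text in r.text:
    continue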

@mlissner
Contributor Author

Yeah, or by overriding the text or apparent_encoding methods. All of those approaches feel like workarounds though, right — or are you suggesting that the apparent_encoding method should set the encoding property when it finishes, and that could serve as the cache?

One other thought: I noticed that the content property actually has caching. Maybe instead of caching the apparent_encoding property, the text property should be cached to parallel how the content one is.
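Roughly something like this, as a hypothetical sketch (TextCachingResponse and _cached_text are made-up names, not part of requests):

import requests


class TextCachingResponse(requests.Response):
    """Hypothetical illustration; _cached_text is an invented attribute."""

    _cached_text = None

    @property
    def text(self):
        # Decode (and detect the encoding, if needed) only on the first access,
        # loosely paralleling how the content property caches self._content.
        if self._cached_text is None:
            self._cached_text = super().text
        return self._cached_text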

@sigmavirus24
Contributor

No, I'm saying that while character detection can be slow, it usually isn't. If its speed is a problem, the documented way of avoiding it is to set the encoding attribute yourself. That is, if you want it cached, set the encoding attribute yourself. It's easy and available today.
