We have a lot of code in Juriscraper that looks like this:
r = requests.get(some_link)
if "some phrase" in r.text:
    ...  # throw an error
The thing is, when we're downloading large PDFs, r.text is expensive to produce. To build it, requests calls out to chardet, which is slow and uses a lot of memory on big files.
From models.py in the requests library, here's r.text:
@property
def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using
    ``charset_normalizer`` or ``chardet``.

    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """

    # Try charset from content-type
    content = None
    encoding = self.encoding

    if not self.content:
        return ""

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        encoding = self.apparent_encoding  # <---- HERE'S THE CALL

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors="replace")
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # A TypeError can be raised if encoding is None
        #
        # So we try blindly encoding.
        content = str(self.content, errors="replace")

    return content
apparent_encoding is NOT cached (but probably should be), and thus each time we call r.text on a response with no declared encoding, we're hitting the CPU and memory pretty hard.
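To make the cost concrete, here is a minimal sketch of the pattern using stand-in classes rather than requests itself. Everything in it is hypothetical: `detect_encoding` plays the role of an expensive detector such as chardet, `UncachedResponse` mimics the property shown above, and `CachedResponse` shows what caching the detected encoding per instance might look like.

```python
from functools import cached_property


def detect_encoding(data: bytes) -> str:
    """Hypothetical stand-in for an expensive detector such as chardet."""
    detect_encoding.calls += 1
    return "utf-8"


detect_encoding.calls = 0  # counts how often detection runs


class UncachedResponse:
    """Mimics the pattern above: detection runs on every .text access."""

    def __init__(self, content: bytes, encoding=None):
        self.content = content
        self.encoding = encoding

    @property
    def apparent_encoding(self) -> str:
        return detect_encoding(self.content)

    @property
    def text(self) -> str:
        encoding = self.encoding
        if encoding is None:
            # Falls back to detection each time, as in the snippet above.
            encoding = self.apparent_encoding
        return str(self.content, encoding, errors="replace")


class CachedResponse(UncachedResponse):
    """Same behavior, but the detected encoding is computed once per instance."""

    @cached_property
    def apparent_encoding(self) -> str:
        return detect_encoding(self.content)
```

Reading `.text` three times on an `UncachedResponse` runs the detector three times; on a `CachedResponse` it runs once. With a real detector and a multi-megabyte PDF body, that difference is where the CPU and memory go.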
It turns out that r.text makes calls to chardet each time it is called. That's
not great because chardet can be slow and use a lot of memory, particularly
when checking PDFs.
Instead of doing that or checking if things are PDFs all the time, simply use
the binary content instead of the text.
Fixes: #564
Relates to: psf/requests#6250
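As a rough illustration of the approach described in this commit, the check can be done against the raw bytes so that no decoding or encoding detection happens at all. The helper name `contains_phrase` and its `encoding` parameter are assumptions made for this sketch, not names taken from Juriscraper.

```python
def contains_phrase(content: bytes, phrase: str, encoding: str = "utf-8") -> bool:
    """Return True if `phrase` appears in the raw response body.

    Encodes the short phrase once instead of decoding the whole body,
    so a large PDF is never run through encoding detection.
    Assumes the phrase is encoded in the body the same way as `encoding`
    (true for plain ASCII phrases in most common encodings).
    """
    return phrase.encode(encoding) in content
```

In scraper code this would be called as `contains_phrase(r.content, "some phrase")` in place of `"some phrase" in r.text`. For a fixed ASCII phrase, a bytes literal such as `b"some phrase" in r.content` does the same job inline.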