Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate use of cgi #377

Merged
merged 9 commits into from Jan 6, 2024
Merged
Show file tree
Hide file tree
Changes from 8 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
1 change: 1 addition & 0 deletions changelog.d/355.bugfix.rst
@@ -0,0 +1 @@
:mod:`treq.content.text_content()` no longer generates deprecation warnings due to use of the ``cgi`` module.
1 change: 1 addition & 0 deletions setup.py
Expand Up @@ -35,6 +35,7 @@
"Twisted[tls] >= 22.10.0", # For #11635
"attrs",
"typing_extensions >= 3.10.0",
"multipart",
],
extras_require={
"dev": [
Expand Down
31 changes: 24 additions & 7 deletions src/treq/content.py
@@ -1,7 +1,7 @@
import cgi
import json
from typing import Any, Callable, List, Optional, cast
from typing import Any, Callable, FrozenSet, List, Optional, cast

import multipart # type: ignore
from twisted.internet.defer import Deferred, succeed
from twisted.internet.protocol import Protocol, connectionDone
from twisted.python.failure import Failure
Expand All @@ -11,21 +11,38 @@
from twisted.web.iweb import IResponse


"""Characters that are valid in a charset name per RFC 2978.

See https://www.rfc-editor.org/errata/eid5433
"""
_MIME_CHARSET_CHARS: FrozenSet[str] = frozenset(
"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" # ALPHA
"0123456789" # DIGIT
"!#$%&+-^_`~" # symbols
)


def _encoding_from_headers(headers: Headers) -> Optional[str]:
content_types = headers.getRawHeaders("content-type")
if content_types is None:
return None

# This seems to be the choice browsers make when encountering multiple
# content-type headers.
content_type, params = cgi.parse_header(content_types[-1])
media_type, params = multipart.parse_options_header(content_types[-1])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More of a FYI.

In Twisted we/I went for stdlib usage https://github.com/twisted/twisted/pull/12016/files#diff-3093a8f64dec64c2305f6322f276a5cee1437f0037efc660ce6524ffa58017a1R229-R234


My reasoning for using stdlib, is that this is quite common usage and if there is a bug, it's best to have it fixed in stdlib rather than rely on 3rd party packages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning for using stdlib, is that this is quite common usage and if there is a bug, it's best to have it fixed in stdlib rather than rely on 3rd party packages.

I'm curious what experience this is based on. In my experience, known bugs often remain in the stdlib for years (perhaps because there are relatively many barriers to landing a change in the stdlib, as well as the fact that the stdlib can only change with a release which happens relatively infrequently - plus such a release can not fix the issue for already-released versions of python).

If there is a third-party module of reasonable quality that is practical to use then I would generally prefer this over something from the stdlib.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on the source of the 3rd-party module. More dependencies = more attack surface, both in the literal sense (more people whose PyPI upload credentials can ruin your day), and in the metaphoric sense of unanticipated changes (what happens if a dependency goes AGPL3).

In this sense "the stdlib" or any existing dependency are roughly equivalent, and I'd probably prefer to depend more on our existing deps. Taking on a new dependency ought to give us a moment of pause, but really only a moment, especially if we know the maintainers or it's sufficiently widely used that we are not entirely on the hook for due diligence on its continued maintenance.

Copy link
Member

@glyph glyph Jan 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case this is a blocker to 3.13 support and the dependency is only for tests, so we should probably not block on this minor concern :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, the dependency here is not just for tests. In 0a3d17b I switched to using the equivalent of cgi.parse_header() provided by multipart to avoid vendoring code under the PSF license as that seemed to give @glyph pause.

I picked multipart because it has a few familiar faces attached, e.g. @cjwatson, who has PyPI upload permissions. It is hosted under an individual's GitHub account rather than an org, so that's a downside.

I also considered pulling the relevant bits of cgi into a separate package and putting it on PyPI. I'd be happy to do that under the Twisted org if that's preferable to an external dependency.

I can only echo @exarkun's skepticism of using the stdlib for this. The stdlib has an anti-bytes bias that is often incorrect and has led to a lot of makework for us in Twisted. The cgi module has seen some baffling refactors, then deprecation and removal. Now we're told to use the email package, which isn't explicitly implementing the HTTP RFCs. Seems risky. Also, it was recently rewritten! It's possible for code to be too maintained.

charset = params.get("charset")
if charset:
return charset.strip("'\"")

if content_type == "application/json":
return "UTF-8"
assert isinstance(charset, str) # for MyPy
charset = charset.strip("'\"").lower()
if not charset:
return None
if not set(charset).issubset(_MIME_CHARSET_CHARS):
return None
return charset

if media_type == "application/json":
return "utf-8"

return None

Expand Down
60 changes: 58 additions & 2 deletions src/treq/test/test_content.py
@@ -1,4 +1,6 @@
import unittest
from unittest import mock
from typing import Optional

from twisted.python.failure import Failure

Expand All @@ -11,6 +13,7 @@
from twisted.web.server import NOT_DONE_YET

from treq import collect, content, json_content, text_content
from treq.content import _encoding_from_headers
from treq.client import _BufferedResponse
from treq.testing import StubTreq

Expand Down Expand Up @@ -267,6 +270,59 @@ def error(data):
# being closed.
stub.flush()
self.assertEqual(len(resource.request_finishes), 1)
self.assertIsInstance(
resource.request_finishes[0].value, ConnectionDone
self.assertIsInstance(resource.request_finishes[0].value, ConnectionDone)


class EncodingFromHeadersTests(unittest.TestCase):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a docstring to describe the criteria for grouping tests into this class.
This should help future devs know if they should add a new tests here or add it somewhere else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! In the future it's fine to remind me of the Twisted coding standard, no need to write so much. :)

def _encodingFromContentType(self, content_type: str) -> Optional[str]:
"""
Invoke `_encoding_from_headers()` for a header value.

:param content_type: A Content-Type header value.
:returns: The result of `_encoding_from_headers()`
"""
h = Headers({"Content-Type": [content_type]})
return _encoding_from_headers(h)

def test_rfcExamples(self):
"""
The examples from RFC 9110 § 8.3.1 are normalized to
canonical (lowercase) form.
"""
for example in [
"text/html;charset=utf-8",
'Text/HTML;Charset="utf-8"',
'text/html; charset="utf-8"',
"text/html;charset=UTF-8",
]:
self.assertEqual("utf-8", self._encodingFromContentType(example))

def test_multipleParams(self):
"""The charset parameter is extracted even if mixed with other params."""
for example in [
"a/b;c=d;charSet=ascii",
"a/b;c=d;charset=ascii; e=f",
"a/b;c=d; charsEt=ascii;e=f",
"a/b;c=d; charset=ascii; e=f",
]:
self.assertEqual("ascii", self._encodingFromContentType(example))

def test_quotedString(self):
"""Any quotes that surround the value of the charset param are removed."""
self.assertEqual(
"ascii", self._encodingFromContentType("foo/bar; charset='ASCII'")
)
self.assertEqual(
"shift_jis", self._encodingFromContentType('a/b; charset="Shift_JIS"')
)

def test_noCharset(self):
"""None is returned when no valid charset parameter is found."""
for example in [
"application/octet-stream",
"text/plain;charset=",
"text/plain;charset=''",
"text/plain;charset=\"'\"",
"text/plain;charset=🙃",
]:
self.assertIsNone(self._encodingFromContentType(example))
35 changes: 16 additions & 19 deletions src/treq/test/test_multipart.py
@@ -1,12 +1,11 @@
# Copyright (c) Twisted Matrix Laboratories.
# See LICENSE for details.

import cgi
import sys
from typing import cast, AnyStr

from io import BytesIO

from multipart import MultipartParser # type: ignore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's ok to use multipart in the testing code, to double check the implementation... even if the code under tests would have use stdlib.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it's worth, the cgi docs suggest the multipart package:

Deprecated since version 3.11, will be removed in version 3.13: This function, like the rest of the cgi module, is deprecated. It can be replaced with the functionality in the email package (e.g. email.message.EmailMessage/email.message.Message) which implements the same MIME RFCs, or with the multipart PyPI project.

While it suggests the email package too, I do not think that the email package is representative of real-world multipart/form-data parsers, in particular because it targets email-dialect MIME that involves nesting never seen in HTTP. (As just one example, Django's multipart parser doesn't support nesting. I guarantee no browser ever generates it.)

A more ideal test suite would test against a variety of real-world multipart parsers, but I don't think that'd have great ROI.

from twisted.trial import unittest
from zope.interface.verify import verifyObject

Expand Down Expand Up @@ -588,9 +587,10 @@ def test_newLinesInParams(self):
--heyDavid--
""".encode("utf-8")), output)

def test_worksWithCgi(self):
def test_worksWithMultipart(self):
"""
Make sure the stuff we generated actually parsed by python cgi
Make sure the stuff we generated can actually be parsed by the
`multipart` module.
"""
output = self.getOutput(
MultiPartProducer([
Expand All @@ -612,23 +612,20 @@ def test_worksWithCgi(self):
)
)

form = cgi.parse_multipart(BytesIO(output), {
"boundary": b"heyDavid",
"CONTENT-LENGTH": str(len(output)),
})
form = MultipartParser(
stream=BytesIO(output),
boundary=b"heyDavid",
content_length=len(output),
)

# Since Python 3.7, the value for a non-file field is now a list
# of strings, not bytes.
if sys.version_info >= (3, 7):
self.assertEqual(set(['just a string\r\n', 'another string']),
set(form['cfield']))
else:
self.assertEqual(set([b'just a string\r\n', b'another string']),
set(form['cfield']))
self.assertEqual(
[b'just a string\r\n', b'another string'],
[f.raw for f in form.get_all('cfield')],
)

self.assertEqual(set([b'my lovely bytes2']), set(form['efield']))
self.assertEqual(set([b'my lovely bytes219']), set(form['xfield']))
self.assertEqual(set([b'my lovely bytes22']), set(form['afield']))
self.assertEqual(b'my lovely bytes2', form.get('efield').raw)
self.assertEqual(b'my lovely bytes219', form.get('xfield').raw)
self.assertEqual(b'my lovely bytes22', form.get('afield').raw)


class LengthConsumerTestCase(unittest.TestCase):
Expand Down