Eliminate use of `cgi` #377

twm · 2024-01-02T04:54:19Z

Stop using cgi, which was deprecated in Python 3.11 and dropped in Python 3.13. Replace it with the multipart package, which becomes a dependency.

Fixes #355.

glyph · 2024-01-02T05:20:32Z

Does vendoring a stdlib function have license implications? I'd rather deal with the possibility of tinkering in the email module than copy the PSF license.

twm · 2024-01-02T05:23:19Z

@glyph I thought we've vendored stuff from the stdlib in Twisted before... is that not the case?

twm · 2024-01-02T05:59:49Z

Okay, I switched to using multipart for everything.

adiroiban

Thanks for the update.

I just left a very generic review.
I have never used treq so far.

It looks good. Only minor comments.

I prefer stdlib here... but multipart is also ok.

Regards

adiroiban · 2024-01-02T12:12:08Z

src/treq/content.py

 def _encoding_from_headers(headers: Headers) -> Optional[str]:
     content_types = headers.getRawHeaders("content-type")
     if content_types is None:
         return None

     # This seems to be the choice browsers make when encountering multiple
     # content-type headers.
-    content_type, params = cgi.parse_header(content_types[-1])
+    media_type, params = multipart.parse_options_header(content_types[-1])


More of an FYI.

In Twisted we/I went with the stdlib: https://github.com/twisted/twisted/pull/12016/files#diff-3093a8f64dec64c2305f6322f276a5cee1437f0037efc660ce6524ffa58017a1R229-R234

My reasoning for using the stdlib is that this is quite common usage, and if there is a bug, it's best to have it fixed in the stdlib rather than rely on third-party packages.
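For reference, the stdlib route named in the `cgi` deprecation notice can be sketched roughly as follows. This is a minimal illustration of the `email.message.EmailMessage` approach, not the code from the linked Twisted PR; the helper name `parse_content_type` is made up for this example.

```python
from email.message import EmailMessage
from typing import Dict, Tuple


def parse_content_type(value: str) -> Tuple[str, Dict[str, str]]:
    """Split a Content-Type header value into (media type, parameters).

    Hypothetical helper: a stdlib-only stand-in for cgi.parse_header().
    """
    msg = EmailMessage()
    # Assigning the header makes the default policy parse it into a
    # structured header object that exposes the parameters.
    msg["content-type"] = value
    params = msg["content-type"].params
    return msg.get_content_type(), dict(params)


media_type, params = parse_content_type('application/json; charset="utf8"')
print(media_type, params)
```

Note that this parser follows the MIME (email) grammar, which is the concern raised later in the thread about whether it matches HTTP's rules in every edge case.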

exarkun · 2024-01-02

> My reasoning for using the stdlib is that this is quite common usage, and if there is a bug, it's best to have it fixed in the stdlib rather than rely on third-party packages.

I'm curious what experience this is based on. In my experience, known bugs often remain in the stdlib for years (perhaps because there are relatively many barriers to landing a change in the stdlib, and because the stdlib can only change with a release, which happens relatively infrequently; moreover, such a release cannot fix the issue for already-released versions of Python).

If there is a third-party module of reasonable quality that is practical to use, then I would generally prefer it over something from the stdlib.

glyph · 2024-01-02

It depends on the source of the third-party module. More dependencies = more attack surface, both in the literal sense (more people whose PyPI upload credentials can ruin your day) and in the metaphorical sense of unanticipated changes (what happens if a dependency goes AGPL3?).

In this sense "the stdlib" and any existing dependency are roughly equivalent, and I'd probably prefer to depend more on our existing deps. Taking on a new dependency ought to give us a moment of pause, but really only a moment, especially if we know the maintainers or it's sufficiently widely used that we are not entirely on the hook for due diligence on its continued maintenance.

In any case this is a blocker to 3.13 support and the dependency is only for tests, so we should probably not block on this minor concern :)

adiroiban · 2024-01-02T12:13:04Z

src/treq/test/test_multipart.py

 from typing import cast, AnyStr

 from io import BytesIO

+from multipart import MultipartParser  # type: ignore


It's OK to use multipart in the test code to double-check the implementation, even if the code under test were to use the stdlib.

adiroiban · 2024-01-02T12:13:52Z

src/treq/test/test_content.py

+        self.assertIsInstance(resource.request_finishes[0].value, ConnectionDone)
+
+
+class EncodingFromHeadersTests(unittest.TestCase):


Maybe add a docstring to describe the criteria for grouping tests into this class.
This should help future devs know whether they should add a new test here or somewhere else.

twm

Thanks for the review @adiroiban! I'll give a little time for folks to discuss the third-party dependency situation.

twm · 2024-01-03T05:54:27Z

src/treq/content.py

 def _encoding_from_headers(headers: Headers) -> Optional[str]:
     content_types = headers.getRawHeaders("content-type")
     if content_types is None:
         return None

     # This seems to be the choice browsers make when encountering multiple
     # content-type headers.
-    content_type, params = cgi.parse_header(content_types[-1])
+    media_type, params = multipart.parse_options_header(content_types[-1])


To be clear, the dependency here is not just for tests. In 0a3d17b I switched to using the equivalent of cgi.parse_header() provided by multipart to avoid vendoring code under the PSF license as that seemed to give @glyph pause.

I picked multipart because it has a few familiar faces attached, e.g. @cjwatson, who has PyPI upload permissions. It is hosted under an individual's GitHub account rather than an org, so that's a downside.

I also considered pulling the relevant bits of cgi into a separate package and putting it on PyPI. I'd be happy to do that under the Twisted org if that's preferable to an external dependency.

I can only echo @exarkun's skepticism of using the stdlib for this. The stdlib has an anti-bytes bias that is often incorrect and has led to a lot of makework for us in Twisted. The cgi module has seen some baffling refactors, then deprecation and removal. Now we're told to use the email package, which isn't explicitly implementing the HTTP RFCs. Seems risky. Also, it was recently rewritten! It's possible for code to be too maintained.
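To illustrate the "last content-type header wins" rule from the hunk above, here is a rough stdlib-only sketch. It is not what this PR ships (the PR calls `multipart.parse_options_header`), and `encoding_from_content_types` is a hypothetical helper that takes a plain list of header values instead of Twisted's `Headers` object.

```python
from email.message import Message
from typing import List, Optional


def encoding_from_content_types(content_types: List[str]) -> Optional[str]:
    """Return the charset of the last Content-Type value, if it has one.

    Hypothetical stand-in for the header handling in the hunk above.
    """
    if not content_types:
        return None
    msg = Message()
    # Mirror the comment in the diff: when several content-type headers
    # are present, only the last one is considered.
    msg["content-type"] = content_types[-1]
    # get_content_charset() lowercases the value and returns None when
    # the header carries no charset parameter.
    return msg.get_content_charset()


print(encoding_from_content_types(
    ["text/plain; charset=latin-1", "text/html; charset=UTF-8"]
))
```

The sketch leans on the `email` package purely for brevity; the objections to that package raised in this thread apply to it as well.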

twm · 2024-01-03T05:54:53Z

src/treq/test/test_content.py

+        self.assertIsInstance(resource.request_finishes[0].value, ConnectionDone)
+
+
+class EncodingFromHeadersTests(unittest.TestCase):


Sure! In the future it's fine to remind me of the Twisted coding standard, no need to write so much. :)

twm · 2024-01-03T06:08:52Z

src/treq/test/test_multipart.py

 from typing import cast, AnyStr

 from io import BytesIO

+from multipart import MultipartParser  # type: ignore


For what it's worth, the cgi docs suggest the multipart package:

> Deprecated since version 3.11, will be removed in version 3.13: This function, like the rest of the cgi module, is deprecated. It can be replaced with the functionality in the email package (e.g. email.message.EmailMessage/email.message.Message) which implements the same MIME RFCs, or with the multipart PyPI project.

While it suggests the email package too, I do not think that the email package is representative of real-world multipart/form-data parsers, in particular because it targets email-dialect MIME that involves nesting never seen in HTTP. (As just one example, Django's multipart parser doesn't support nesting. I guarantee no browser ever generates it.)

A more ideal test suite would test against a variety of real-world multipart parsers, but I don't think that'd have great ROI.
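For comparison, decoding a flat multipart/form-data body with the stdlib `email` package might look like the sketch below. It is only an illustration of the alternative mentioned in the deprecation notice, not the test code in this PR (which uses `MultipartParser` from multipart); `parse_form_data` is a hypothetical helper, and it assumes un-nested parts whose `Content-Disposition` carries a `name` parameter.

```python
from email.parser import BytesParser
from email.policy import HTTP
from typing import Dict


def parse_form_data(content_type: str, body: bytes) -> Dict[str, bytes]:
    """Map field names to raw values for a flat multipart/form-data body.

    Hypothetical helper: the email parser needs the Content-Type header
    (which carries the boundary) prepended to the body.
    """
    header = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n"
    message = BytesParser(policy=HTTP).parsebytes(header + body)
    fields: Dict[str, bytes] = {}
    for part in message.iter_parts():
        # The field name lives in the Content-Disposition parameters.
        name = part.get_param("name", header="content-disposition")
        fields[str(name)] = part.get_payload(decode=True)
    return fields


body = (
    b"--frontier\r\n"
    b'Content-Disposition: form-data; name="field"\r\n'
    b"\r\n"
    b"value\r\n"
    b"--frontier--\r\n"
)
print(parse_form_data("multipart/form-data; boundary=frontier", body))
```

A parser like this would also accept nested multipart messages, which is part of why it may not be representative of the HTTP-oriented parsers discussed above.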

adiroiban · 2024-01-05T11:40:25Z

Thanks for the feedback. Much appreciated!

I have approved the PR and I am OK with using multipart for treq.

My comment about multipart vs. the stdlib was more of an FYI about what was done in twisted/twisted to fix this issue.

Also, I remember some discussions about getting the treq code included in Twisted as a high-level API.

From what I can see in this PR, multipart is used in "production" only to parse the Content-Type header. I was hoping that HTTP and MIME share the same standard for the Content-Type header format.

I think that HTTP handling should be a core feature of any library, and this is why I would prefer to have HTTP code implemented in the stdlib.

For example, in Twisted, HTTP protocol handling is part of the main Twisted package, while for LDAP we have a separate package.

Something similar applies to the stdlib: I would expect Python to have good enough HTTP handling and to use a separate package for LDAP handling.

twm added 6 commits January 1, 2024 21:05

Eliminate use of cgi.parse_multipart()

431173d

Vendor cgi.parse_header()

c1c1dc4

Eliminate use of cgi.parse_header()

b4fc499

Add change fragment

cd4c7f8

Reject empty quoted charset

5cc43b8

Make MyPy happy

852e343

twm force-pushed the bye-cgi-355 branch from bec1ce2 to 852e343 Compare January 2, 2024 05:05

Fix Python 3.7 compat

e03348f

twm requested a review from a team January 2, 2024 05:28

Avoid vendoring anything

0a3d17b

twm force-pushed the bye-cgi-355 branch from d8c8ccd to 0a3d17b Compare January 2, 2024 05:57

adiroiban approved these changes Jan 2, 2024

twm commented Jan 3, 2024

Merge branch 'trunk' into bye-cgi-355

c2210ca

glyph merged commit 8ccd453 into trunk Jan 6, 2024
16 checks passed

glyph deleted the bye-cgi-355 branch January 6, 2024 01:51
