Fix parsing in Header::normalize() #447

TimWolla · 2021-10-05T08:01:25Z

No description provided.

Simply splitting at commas will result in an incorrect output, as each list items might itself contain a comma within a quoted-string. While ETags are not technically a quoted-string (specifically backslash escapes are not supported) they can serve as a good example: If-None-Match: "foo", "foo,bar", "bar" Refers to three ETags, not four. Simply reuse the existing array implementation that is able to correctly handle quoted-strings without any backslash escapes.

The code did not correctly handle a backslash-escaped quote, misparsing headers such as: private, community="Guzzle\"Psr7"

RFC 7230#7: > For compatibility with legacy list rules, a recipient MUST parse and ignore a > reasonable number of empty list elements: and > Empty elements do not contribute to the count of elements present.

TimWolla · 2021-10-05T08:05:42Z

src/Header.php

        $result = [];
-        foreach ($header as $value) {
+        foreach ((array) $header as $value) {


This first loop is a bit odd, because it appears to allow passing in an array of arrays. It is not useful to parse more than one type of header at once, as the list syntax is not defined for every header. i.e. simply passing in the full ->getHeaders() array is not valid.

All in all this normalize() function is not documented super-well, the method name is not really helpful and I don't really trust the parsing logic based on regular expressions.

I suggest that this method is deprecated and replaced by a public static function splitList(array|string $header) that either accepts a string or a string[], properly validating the input. The parser should be handwritten (simply handling quoted strings is easy), because that regular expression is pretty magic.

I can perform this change, but would like to hear your opinion first.

I am reluctant to make big changes here. Can we go with whatever the minimum changes are to get your use case to work?

The current diff as it is in this PR contains the minimally necessary changes to correctly implement parsing of headers containing lists.

My proposed change with the deprecation of the normalize() method would just be to tighten up the API and to clearly define the method's behavior.

I just noticed your comment over in WoltLab/WCF (notifications for that repository go to $companymail):

This PR is not fixing an actual issue with our use case. This PR primarily is a matter of standard's correctness. The only real bug is when a string instead of an array is passed to normalize(), because then the simply explode() will be used.

I've also re-added the php-http Slack to my Slack Desktop Client. Feel free to message me there if you feel that it would help.

Nyholm

Could we also add the date test case from #172

How come you do not trust the regex? Should we add more tests to make sure all headers are parsed properly?

I think we can change the doc block to show the intended input of this method.

TimWolla · 2021-10-06T21:17:10Z

Could we also add the date test case from #172

No, this is not useful: The set-cookie header is not a header that is defined as a using the list syntax.

https://httpwg.org/specs/rfc7230.html#rfc.section.3.2.2

Note: In practice, the "Set-Cookie" header field ([RFC6265]) often appears multiple times in a response message and does not use the list syntax,

Using the normalize() function on set-cookie result in undefined behavior!

How come you do not trust the regex?

With the look-ahead it's a pretty complicated regular expression that is hard to verify for correctness in edge cases. I think it's easier to verify the correctness of a simply character based parser, especially one that only needs to handle a very small number of cases. Of course your mileage might vary.

Should we add more tests to make sure all headers are parsed properly?

I've added all the edge cases I could think of as tests.

I think we can change the doc block to show the intended input of this method.

Expanding the doc block would certainly be helpful. However I also think that normalize() is a pretty generic method name for a method with a limited scope. The implementation is meant to split headers that contain lists of values into each of the values. It does not perform any other normalization (e.g. lowercasing the values)

That's why I proposed splitList() as a replacement method with a clearly defined scope to not break any existing users.

To have something actual to look at, my proposed replacement method would look like this. It passes the same tests:

/**
     * Splits a HTTP header defined to contain comma-separated list into
     * each individual value. Empty values will be removed.
     * 
     * Example headers include 'accept', 'cache-control' and 'if-none-match'.
     * 
     * This method must not be used to parse headers that are not defined as
     * a list, such as 'user-agent' or 'set-cookie'.
     * 
     * @param string|string[] $values Header value as returned by MessageInterface::getHeader()
     * @return string[] 
     */
    public static function splitList($values): array
    {
        if (!\is_array($values)) {
            $values = [$values];
        }

        $result = [];
        foreach ($values as $value) {
            if (!\is_string($value)) {
                throw new TypeError('$header must either be a string or an array containing strings.');
            }

            $v = '';
            $isQuoted = false;
            $isEscaped = false;
            for ($i = 0, $max = \strlen($value); $i < $max; $i++) {
                if ($isEscaped) {
                    $v .= $value[$i];
                    $isEscaped = false;

                    continue;
                }

                if (!$isQuoted && $value[$i] === ',') {
                    $v = \trim($v);
                    if ($v !== '') {
                        $result[] = $v;
                    }

                    $v = '';
                    continue;
                }

                if ($isQuoted && $value[$i] === '\\') {
                    $isEscaped = true;
                    $v .= $value[$i];

                    continue;
                }
                if ($value[$i] === '"') {
                    $isQuoted = !$isQuoted;
                    $v .= $value[$i];

                    continue;
                }

                $v .= $value[$i];
            }

            $v = \trim($v);
            if ($v !== '') {
                $result[] = $v;
            }
        }

        return $result;
    }

TimWolla · 2021-10-21T13:52:20Z

Friendly ping 😃

Can this get looked at? I don't need this fix in a release immediately. I'd like to see this PR merged so that I can rely on this arriving in a release eventually, allowing me to build on top of it.

Summarizing:

The change in this PR fixes an actual bug in parsing quoted strings and includes tests to verify this behavior. The PR performs the minimally necessary change to perform this fix. I've split my fixes into single commits, each fixing a small part of the behavior.
The API of the normalize() method is ugly and should be fixed. But this is not directly related to (1). I'm happy to propose this in a follow-up PR.

I'm happy to answer your questions here in this PR, or in PHP HTTP Slack.

stale · 2022-02-19T11:36:42Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 2 weeks if no further activity occurs. Thank you for your contributions.

TimWolla · 2022-02-19T11:44:41Z

This still is an issue and I'd still like to see this merged.

Nyholm

This sure is complex. Sorry for the delay.

I do agree with the tests and I am happy with the changes.

Nyholm · 2022-03-13T15:12:05Z

@GrahamCampbell added an additional test in #476. You will get the credit, for this work. Thank you

TimWolla · 2022-03-13T15:31:35Z

@GrahamCampbell added an additional test in #476. You will get the credit, for this work. Thank you

That works for me, thank you. I've created a PR with my suggested cleaned up replacement API: #477. It's not as critical, because at least normalize() now properly parses everything that I am aware of.

Thanks!

TimWolla added 3 commits October 5, 2021 09:40

Correctly handle backslash escapes in Header::normalize()

c668720

The code did not correctly handle a backslash-escaped quote, misparsing headers such as: private, community="Guzzle\"Psr7"

Correctly suppress empty list items in Header::normalize()

f7d9989

RFC 7230#7: > For compatibility with legacy list rules, a recipient MUST parse and ignore a > reasonable number of empty list elements: and > Empty elements do not contribute to the count of elements present.

TimWolla commented Oct 5, 2021

View reviewed changes

TimWolla mentioned this pull request Oct 5, 2021

Potential issue with parse_header() #172

Closed

TimWolla force-pushed the header-normalize branch from ad75894 to 8b7f198 Compare October 5, 2021 10:07

Add extensive tests for Header::normalize()

36e879f

TimWolla force-pushed the header-normalize branch from 8b7f198 to 36e879f Compare October 5, 2021 10:08

TimWolla mentioned this pull request Oct 5, 2021

Delegate header list parsing to \GuzzleHttp\Psr7\Header::normalize() WoltLab/WCF#4534

Closed

3 tasks

GrahamCampbell requested a review from Nyholm October 6, 2021 15:28

Nyholm reviewed Oct 6, 2021

View reviewed changes

TimWolla requested a review from GrahamCampbell October 13, 2021 16:05

GrahamCampbell requested a review from Nyholm October 13, 2021 16:08

TimWolla mentioned this pull request Nov 1, 2021

optimized some ignore rules in the phpstan-baseline.neon #456

Closed

stale bot added the lifecycle/stale label Feb 19, 2022

stale bot removed the lifecycle/stale label Feb 19, 2022

Nyholm approved these changes Mar 13, 2022

View reviewed changes

GrahamCampbell mentioned this pull request Mar 13, 2022

Fix parsing in Header::normalize() #476

Merged

Nyholm closed this in #476 Mar 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix parsing in Header::normalize() #447

Fix parsing in Header::normalize() #447

TimWolla commented Oct 5, 2021

TimWolla Oct 5, 2021

GrahamCampbell Oct 5, 2021

TimWolla Oct 5, 2021

TimWolla Oct 5, 2021

Nyholm left a comment

TimWolla commented Oct 6, 2021 •

edited

TimWolla commented Oct 21, 2021 •

edited

stale bot commented Feb 19, 2022

TimWolla commented Feb 19, 2022

Nyholm left a comment

Nyholm commented Mar 13, 2022

TimWolla commented Mar 13, 2022

Fix parsing in Header::normalize() #447

Fix parsing in Header::normalize() #447

Conversation

TimWolla commented Oct 5, 2021

TimWolla Oct 5, 2021

Choose a reason for hiding this comment

GrahamCampbell Oct 5, 2021

Choose a reason for hiding this comment

TimWolla Oct 5, 2021

Choose a reason for hiding this comment

TimWolla Oct 5, 2021

Choose a reason for hiding this comment

Nyholm left a comment

Choose a reason for hiding this comment

TimWolla commented Oct 6, 2021 • edited

TimWolla commented Oct 21, 2021 • edited

stale bot commented Feb 19, 2022

TimWolla commented Feb 19, 2022

Nyholm left a comment

Choose a reason for hiding this comment

Nyholm commented Mar 13, 2022

TimWolla commented Mar 13, 2022

TimWolla commented Oct 6, 2021 •

edited

TimWolla commented Oct 21, 2021 •

edited