Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix parsing in Header::normalize() #447

Closed
wants to merge 4 commits into from

Conversation

TimWolla
Copy link
Contributor

@TimWolla TimWolla commented Oct 5, 2021

No description provided.

Simply splitting at commas will result in an incorrect output, as each list
items might itself contain a comma within a quoted-string.

While ETags are not technically a quoted-string (specifically backslash escapes
are not supported) they can serve as a good example:

    If-None-Match: "foo", "foo,bar", "bar"

Refers to three ETags, not four.

Simply reuse the existing array implementation that is able to correctly handle
quoted-strings without any backslash escapes.
The code did not correctly handle a backslash-escaped quote, misparsing headers
such as:

    private, community="Guzzle\"Psr7"
RFC 7230#7:

> For compatibility with legacy list rules, a recipient MUST parse and ignore a
> reasonable number of empty list elements:

and

> Empty elements do not contribute to the count of elements present.
$result = [];
foreach ($header as $value) {
foreach ((array) $header as $value) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This first loop is a bit odd, because it appears to allow passing in an array of arrays. It is not useful to parse more than one type of header at once, as the list syntax is not defined for every header. i.e. simply passing in the full ->getHeaders() array is not valid.

All in all this normalize() function is not documented super-well, the method name is not really helpful and I don't really trust the parsing logic based on regular expressions.

I suggest that this method is deprecated and replaced by a public static function splitList(array|string $header) that either accepts a string or a string[], properly validating the input. The parser should be handwritten (simply handling quoted strings is easy), because that regular expression is pretty magic.

I can perform this change, but would like to hear your opinion first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am reluctant to make big changes here. Can we go with whatever the minimum changes are to get your use case to work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current diff as it is in this PR contains the minimally necessary changes to correctly implement parsing of headers containing lists.

My proposed change with the deprecation of the normalize() method would just be to tighten up the API and to clearly define the method's behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed your comment over in WoltLab/WCF (notifications for that repository go to $companymail):

This PR is not fixing an actual issue with our use case. This PR primarily is a matter of standard's correctness. The only real bug is when a string instead of an array is passed to normalize(), because then the simply explode() will be used.

I've also re-added the php-http Slack to my Slack Desktop Client. Feel free to message me there if you feel that it would help.

Copy link
Member

@Nyholm Nyholm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also add the date test case from #172

How come you do not trust the regex? Should we add more tests to make sure all headers are parsed properly?

I think we can change the doc block to show the intended input of this method.

@TimWolla
Copy link
Contributor Author

TimWolla commented Oct 6, 2021

Could we also add the date test case from #172

No, this is not useful: The set-cookie header is not a header that is defined as a using the list syntax.

https://httpwg.org/specs/rfc7230.html#rfc.section.3.2.2

Note: In practice, the "Set-Cookie" header field ([RFC6265]) often appears multiple times in a response message and does not use the list syntax,

Using the normalize() function on set-cookie result in undefined behavior!

How come you do not trust the regex?

With the look-ahead it's a pretty complicated regular expression that is hard to verify for correctness in edge cases. I think it's easier to verify the correctness of a simply character based parser, especially one that only needs to handle a very small number of cases. Of course your mileage might vary.

Should we add more tests to make sure all headers are parsed properly?

I've added all the edge cases I could think of as tests.

I think we can change the doc block to show the intended input of this method.

Expanding the doc block would certainly be helpful. However I also think that normalize() is a pretty generic method name for a method with a limited scope. The implementation is meant to split headers that contain lists of values into each of the values. It does not perform any other normalization (e.g. lowercasing the values)

That's why I proposed splitList() as a replacement method with a clearly defined scope to not break any existing users.

To have something actual to look at, my proposed replacement method would look like this. It passes the same tests:

/**
     * Splits a HTTP header defined to contain comma-separated list into
     * each individual value. Empty values will be removed.
     * 
     * Example headers include 'accept', 'cache-control' and 'if-none-match'.
     * 
     * This method must not be used to parse headers that are not defined as
     * a list, such as 'user-agent' or 'set-cookie'.
     * 
     * @param string|string[] $values Header value as returned by MessageInterface::getHeader()
     * @return string[] 
     */
    public static function splitList($values): array
    {
        if (!\is_array($values)) {
            $values = [$values];
        }

        $result = [];
        foreach ($values as $value) {
            if (!\is_string($value)) {
                throw new TypeError('$header must either be a string or an array containing strings.');
            }

            $v = '';
            $isQuoted = false;
            $isEscaped = false;
            for ($i = 0, $max = \strlen($value); $i < $max; $i++) {
                if ($isEscaped) {
                    $v .= $value[$i];
                    $isEscaped = false;

                    continue;
                }

                if (!$isQuoted && $value[$i] === ',') {
                    $v = \trim($v);
                    if ($v !== '') {
                        $result[] = $v;
                    }

                    $v = '';
                    continue;
                }

                if ($isQuoted && $value[$i] === '\\') {
                    $isEscaped = true;
                    $v .= $value[$i];

                    continue;
                }
                if ($value[$i] === '"') {
                    $isQuoted = !$isQuoted;
                    $v .= $value[$i];

                    continue;
                }

                $v .= $value[$i];
            }

            $v = \trim($v);
            if ($v !== '') {
                $result[] = $v;
            }
        }

        return $result;
    }

@TimWolla
Copy link
Contributor Author

TimWolla commented Oct 21, 2021

Friendly ping 😃

Can this get looked at? I don't need this fix in a release immediately. I'd like to see this PR merged so that I can rely on this arriving in a release eventually, allowing me to build on top of it.

Summarizing:

  1. The change in this PR fixes an actual bug in parsing quoted strings and includes tests to verify this behavior. The PR performs the minimally necessary change to perform this fix. I've split my fixes into single commits, each fixing a small part of the behavior.
  2. The API of the normalize() method is ugly and should be fixed. But this is not directly related to (1). I'm happy to propose this in a follow-up PR.

I'm happy to answer your questions here in this PR, or in PHP HTTP Slack.

@stale
Copy link

stale bot commented Feb 19, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 2 weeks if no further activity occurs. Thank you for your contributions.

@TimWolla
Copy link
Contributor Author

This still is an issue and I'd still like to see this merged.

@stale stale bot removed the lifecycle/stale label Feb 19, 2022
Copy link
Member

@Nyholm Nyholm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sure is complex. Sorry for the delay.

I do agree with the tests and I am happy with the changes.

@Nyholm
Copy link
Member

Nyholm commented Mar 13, 2022

@GrahamCampbell added an additional test in #476. You will get the credit, for this work. Thank you

@Nyholm Nyholm closed this in #476 Mar 13, 2022
@TimWolla
Copy link
Contributor Author

@GrahamCampbell added an additional test in #476. You will get the credit, for this work. Thank you

That works for me, thank you. I've created a PR with my suggested cleaned up replacement API: #477. It's not as critical, because at least normalize() now properly parses everything that I am aware of.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants