Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ClassMapGenerator: stabilize the heredoc/nowdoc stripping #10072

Merged
merged 7 commits into from Aug 29, 2021

Conversation

jrfnl
Copy link
Contributor

@jrfnl jrfnl commented Aug 21, 2021

ClassMapGeneratorTest: add test with consecutive duplicate heredoc markers

... as well as a test with heredoc markers with only a newline character between the start and end marker.

ClassMapGenerator: stabilize the heredoc/nowdoc stripping

I've looked into #10067 and have come to the conclusion that using a single regex to strip the heredoc/nowdocs is always going to run into trouble as:

We cannot solve both within a single regex.

So, I'm proposing a slightly different solution which should support both and should also improve performance for files containing large heredoc/nowdocs.

The stripHereNowDocs() function will find a start marker and remember the offset of the start marker.
It will then find the end marker and strip the contents between the two (replace with null).
The function will then recurse onto itself until all heredocs/nowdocs in a file have been removed.

ClassMapGeneratorTest: merge two tests

As hitting the backtrace limit is now no longer an issue, the tests with the long heredoc/nowdocs can be merged into the testCreateMap() test method.

I've verified that the long heredoc from the original issue #10037 with the updated fix no longer throws the PHP 8.1 deprecation notice and still gets indexed correctly.

@jrfnl
Copy link
Contributor Author

jrfnl commented Aug 21, 2021

Note: the failing build is due to a PHPStan issue in a file not touched in this PR, so is unrelated.

…rkers

... as well as a test with heredoc markers with only a newline character between the start and end marker.
I've looked into 10067 and have come to the conclusion that using a single regex to strip the heredoc/nowdocs is always going to run into trouble as:
* Either the matching will be too greedy (issue 10067);
* Or the matching will run into backtrace limits for large heredoc/nowdocs.

We cannot solve both within a single regex.

So, I'm proposing a slightly different solution which should support both and should also improve performance for files containing large heredoc/nowdocs.

The `stripHereNowDocs()` function will find a start marker and remember the offset of the start marker.
It will then find the end marker and strip the contents between the two (replace with `null`).
The function will then recurse onto itself until all heredocs/nowdocs in a file have been removed.
@Seldaek Seldaek force-pushed the feature/10067-fix-classmap-regression branch from f15eaf6 to f44a254 Compare August 21, 2021 15:45
@jrfnl
Copy link
Contributor Author

jrfnl commented Aug 21, 2021

@Seldaek I see your regex change, but that's something I already tried and with that the original problem identified in #10037 still exists.

Can also be seen when you undo the third commit (where I merge the tests as a separate test for the backtrace issue was no longer needed with my fix).

@Seldaek
Copy link
Member

Seldaek commented Aug 21, 2021

Yup, I see that the backtracking is an issue still there. I'm investigating if I can't improve this, because I'd rather keep it in a single regex for perf and simplicity reasons, but if not I'll kill off my commit :) Anyway thanks for providing a fix, and no worries at all for introducing a regression, I do it all the time ;)

@jrfnl
Copy link
Contributor Author

jrfnl commented Aug 21, 2021

@Seldaek I appreciate you doing a thorough check of the proposed fix and please continue to do so, who knows what you'll come up with.
I did try all sorts of variations myself and even though my regex-fu is pretty good (it should be all things considering), I still couldn't find a way to keep it in one regex, while still solving both issues and making the regex performant.

@Seldaek Seldaek force-pushed the feature/10067-fix-classmap-regression branch 2 times, most recently from fe0b6ed to 54da245 Compare August 21, 2021 20:02
@Seldaek
Copy link
Member

Seldaek commented Aug 21, 2021

OK I got a fix I believe, 3500 vs 80000 backtracks.. documented the hell out of it too. https://regex101.com/r/JG4eT9/3/ vs old https://regex101.com/r/erysLg/1

Sorry for obsessing over this, I'm sure your solution was fine but it's mostly out of the fact I enjoy a good regex puzzle :D

Now if you'd like to try and break it that would be amazing because I'm not quite sure it's 100% yet, but it seems sensible to my tired brain..

@jrfnl
Copy link
Contributor Author

jrfnl commented Aug 21, 2021

@Seldaek Have a look when you use the StripNoise.php file as input - the MARKERINTEXT case is broken, which will lead to an unwanted not-actually-a-class being added to the classmap. Sorry to be a spoilsport.

@Seldaek
Copy link
Member

Seldaek commented Aug 22, 2021

Heh you are right, thanks for spotting that. I am a bit worried that this did not break any tests tho. I don't have much time to investigate but I did quickly fix the regex so it is hopefully now a complete solution.

@jrfnl
Copy link
Contributor Author

jrfnl commented Aug 22, 2021

Heh you are right, thanks for spotting that. I am a bit worried that this did not break any tests tho. I don't have much time to investigate but I did quickly fix the regex so it is hopefully now a complete solution.

@Seldaek If the regex is going to stay, my third commit needs to be reverted.
I only merged those two tests as my solution removed the need for a safeguard against backtrace limits being reached.

The testCreateMap() failed on this with an extra class previously as the MARKERTEXT contains a class of phrase after the first marker. Not sure why it didn't fail now. Would need to look into it.

I do still wonder if a singular regex is the best solution for performance as the more complex the regex and the larger the file, the performance decreases. The solution I proposed should be fast no matter what.

@Seldaek
Copy link
Member

Seldaek commented Aug 22, 2021

I tried earlier to do basic benchmarking by running a dumpautoload -o in a large project, all 3 versions (with the insane backtracking, fixed regex and your alternative parsing) perform about the same, no noticable difference, and those gigantic heredocs are rare enough that it doesn't seem to matter really perf wise.

Comment on lines 257 to 259
(?:
# non-word or non-space char, then anything goes for the rest of the line
[^\s\w][^\r\n]+
# white-space (possessive match) not followed by the end delimiter, then anything goes for the rest of the line
| \s*+(?!\\2 \s*[;,.)])[^\r\n]+
# white-space but no new lines
| [\t\f\v ]+
)
# end of line(s)
[\r\n]+
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking - As the "this or that" part of the regex is not optional and all options have a + quantifier for "eating" the line contents - does this handle completely blank lines (i.e. only a new line) in a heredoc/nowdoc correctly ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does handle them but I am not 100% sure I understand why, as per regex101 debugger they seem to be consumed by \v which I understood to only match vertical tab chars, but it seems to eat up empty newlines too. I guess I learned something. PCRE2 docs list \v as:

The vertical space characters are:
U+000A Linefeed (LF)
U+000B Vertical tab (VT)
U+000C Form feed (FF)
U+000D Carriage return (CR)
U+0085 Next line (NEL)
U+2028 Line separator
U+2029 Paragraph separator

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK thanks to figuring this out, and looking a bit closer, I realized half of it was nonsense and unnecessary, and now it's much simpler and actually more efficient for the huge string 2400 steps instead of 3500 (and 80K pre-PR) https://regex101.com/r/JG4eT9/4

I also removed your last commit merging the tests as you advised. Hoping it looks good now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a side-note: what's included in \v (and in \s for that matter) differs per regex engine and can be influenced by whether or not the PCRE was compiled with Unicode or not.
As we're only dealing with the regex engine in PHP here, this won't really be an issue, but just thought it was worth a mention anyway.

@Seldaek Seldaek force-pushed the feature/10067-fix-classmap-regression branch from c862aa3 to 6ab1b6a Compare August 23, 2021 20:18
src/Composer/Autoload/ClassMapGenerator.php Outdated Show resolved Hide resolved
src/Composer/Autoload/ClassMapGenerator.php Outdated Show resolved Hide resolved
src/Composer/Autoload/ClassMapGenerator.php Outdated Show resolved Hide resolved
[\r\n]+
)*
# end delimiter
[\t ]* \\2 (?=[\t \r\n]*[;,.)])
Copy link
Contributor

@mvorisek mvorisek Aug 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://3v4l.org/i2YXM (?=... it is very difficult to match every usecase with this approach, I belive this part can be dropped completely

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes now that we do consume the content lines correctly it is probably not needed anymore, thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok removing the whole thing fails some tests, but with \b it works, so I'm hopeful this is a good enough solution.

@@ -246,7 +246,22 @@ private static function findClasses($path)
}

// strip heredocs/nowdocs
$contents = preg_replace('{<<<[ \t]*([\'"]?)(\w+)\\1(?:\r\n|\n|\r)(?:.*(?=[\r\n]+[ \t]*\\2))[\r\n]+[ \t]*\\2(?=\s*[;,.)])}s', 'null', $contents);
$contents = preg_replace('{
Copy link
Contributor

@mvorisek mvorisek Aug 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All regexes must be combined in one, otherwise starting part of heredoc in a string will replace things very wrongly.

Ex:

'
<<<X
';

<<<X

X;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look this code has worked fine for 10 years now on normal code, barring the one bug we recently discovered.. There may be ways to abuse it for sure as it's not lexing completely, but in the worst case is means your classes don't get found. I don't think it's worth adding more complexity to it to fix hypotheticals.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments, simple strings and heredocs/nowdocs should really be replaced at once, it is easy and the only correct way

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK then let's wrap this PR up, and once it's merged feel free to give that a shot, if you can make it a reasonable looking solution I'm not strictly against it, but IMO it's not strictly needed in practice, no matter how correct in theory.

…st-heredoc syntax, fix test file syntax being invalid
@Seldaek Seldaek force-pushed the feature/10067-fix-classmap-regression branch from 58ff7b5 to d8054d1 Compare August 29, 2021 10:23
But, what are you blind McFly, it's there. How else do you explain that wreck out there? Doc, Doc. Oh, no. You're alive. Bullet proof vest, how did you know, I never got a chance to tell you. About all that talk about screwing up future events, the space time continuum. Okay, alright, I'll prove it to you.
MARKERINTEXT
. MARKERINTEXT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI @jrfnl I had to add a dot here (could've been anything really) because the test the way it was before was invalid PHP syntax, so I don't think it was testing any real case.

Copy link
Contributor Author

@jrfnl jrfnl Aug 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Seldaek Except that it was and that test has now been broken.

As per the first line of the heredoc:

In PHP < 7.3, the docblock marker could occur in the text as long as it did not occur at the very start of the line.

The test, as it was originally, will lint as valid in PHP < 7.3, but as a parse error in PHP 7.3.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so I presume that stripping heredocs containing a marker in PHP < 7.3 is now broken again ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh 🤦🏻‍♂️ sorry. I do wonder though if it makes sense to test for this, given it'll break on all supported PHP versions.. Is it really reasonable to have an expectation that we handle such code correctly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yes, it is broken as we do not check for the end delimiter being followed by some meaningful character anymore (i.e. [;,.)\]\}] lookahead would be needed on <7.3.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrfnl sorry just to make sure before I can make another release, do you agree with my assessment above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Seldaek Thanks for asking for my opinion, but to answer your question: I stand by the solution I originally pulled here.

Part of mastering regex is knowing when it's not the right solution and the fact that supporting both pre-PHP 7.3 as well as post-PHP 7.3 syntax correctly was neigh impossible (yes, I ran into that too) was a good enough reason for me to look at an alternative solution.

Believe me when I say, I understand enjoying the challenge of a good regex puzzle, but in this case, regex is the problem instead of the solution.

Happy to tweak the original solution some more (like adding the u modifier and such), but if the regex stays, this edge case is not viable to be supported IMO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK thanks for the feedback. Looking at your original fix here again though, it seems it also suffers from the same pitfalls as the regex version did when it was checking for the heredoc marker in text. i.e. 40bd4b0#diff-1180d0c164d5a8cb3ecb242c7e607ee86bd3a6a78e7c7c38d59ec317dc6eff44R323 has to lookahead for .;,) etc. So IMO it's probably not worth it supporting this edge case for <7.3, given that any piece of code at this point should reasonably be written to run on 7.3+, and the chances of problems here are anyway quite low.

Copy link
Contributor Author

@jrfnl jrfnl Sep 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, to support the "marker in text" case, you will always need a look-ahead. The .;,) could (should) probably be made more flexible in that case, but I agree, it's an edge case and choosing to explicitly not support that edge-case is a perfectly valid choice.

So IMO it's probably not worth it supporting this edge case for <7.3, given that any piece of code at this point should reasonably be written to run on 7.3+.

Well, for open source software: yes, nearly all will support PHP cross-version. For proprietary software: not so much and there is still a lot of legacy software out there. All the same, it's still an edge case.

@Seldaek Seldaek merged commit 922ba01 into composer:master Aug 29, 2021
@Seldaek Seldaek added this to the 2.1 milestone Aug 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants