ClassMapGenerator: stabilize the heredoc/nowdoc stripping #10072

jrfnl · 2021-08-21T13:32:10Z

ClassMapGeneratorTest: add test with consecutive duplicate heredoc markers

... as well as a test with heredoc markers with only a newline character between the start and end marker.

ClassMapGenerator: stabilize the heredoc/nowdoc stripping

I've looked into #10067 and have come to the conclusion that using a single regex to strip the heredoc/nowdocs is always going to run into trouble as:

* Either the matching will be too greedy (issue #10067, "Missing class in autoload_classmap as of 2.1.6");
* Or the matching will run into backtrack limits for large heredocs/nowdocs.

We cannot solve both within a single regex.
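To make the two failure modes concrete, here is a small illustration. The patterns are simplified stand-ins and not the regexes from this PR, and the backtrack limit is lowered artificially (with JIT switched off) so that a small input is enough to trigger it; all variable names are made up for the example.

```php
<?php
// Illustration only: simplified stand-in patterns, not the regexes discussed in this PR.
ini_set('pcre.jit', '0');                // assumption: use the interpreter so the backtrack limit applies directly
ini_set('pcre.backtrack_limit', '1000'); // artificially low so a small input hits the limit

// Two heredocs using the same marker, with a real class in between.
$twoHeredocs = "\$a = <<<EOT\nfoo\nEOT;\nclass Real {}\n\$b = <<<EOT\nbar\nEOT;\n";

// 1. Greedy body: runs on to the LAST end marker and swallows the real class.
$greedy = preg_replace('/<<<(\w+)\n.*\n\1;/s', '', $twoHeredocs);
$classSurvived = strpos($greedy, 'class Real') !== false;

// 2. Lazy body: stops at the right marker, but every body character costs a backtrack.
$large = "\$a = <<<EOT\n" . str_repeat('x', 20000) . "\nEOT;\n";
$lazy = preg_replace('/<<<(\w+)\n.*?\n\1;/s', '', $large);
$hitLimit = ($lazy === null && preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR);

var_dump($classSurvived); // whether "class Real" is still present after the greedy strip
var_dump($hitLimit);      // whether preg_replace() gave up with a backtrack limit error
```

When the limit is hit, `preg_replace()` returns `null` rather than a string, which matters for any code that passes the result on to other string functions.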

So, I'm proposing a slightly different solution which should support both and should also improve performance for files containing large heredoc/nowdocs.

The stripHereNowDocs() function will find a start marker and remember the offset of the start marker.
It will then find the end marker and strip the contents between the two (replace with null).
The function will then call itself recursively until all heredocs/nowdocs in a file have been removed.
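The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions and not the code from this PR: the function name `stripHereNowDocsSketch`, both patterns, and the choice to keep the markers while dropping only the body are made up for the example, and it uses a loop where the description mentions recursion.

```php
<?php
// Hypothetical sketch of offset-based heredoc/nowdoc stripping; not the PR's implementation.
function stripHereNowDocsSketch(string $contents): string
{
    $offset = 0;
    // Start marker: <<<ID, <<<"ID" or <<<'ID', followed by a line break.
    $startPattern = '/<<<[ \t]*([\'"]?)(\w+)\1\r?\n/';

    while (preg_match($startPattern, $contents, $start, PREG_OFFSET_CAPTURE, $offset) === 1) {
        // Remember where the body begins (right after the start marker line).
        $bodyStart = $start[0][1] + strlen($start[0][0]);

        // End marker: the same identifier at the start of a line (optionally indented),
        // not followed by another word character.
        $endPattern = '/^[ \t]*' . preg_quote($start[2][0], '/') . '(?!\w)/m';
        if (preg_match($endPattern, $contents, $end, PREG_OFFSET_CAPTURE, $bodyStart) !== 1) {
            break; // unterminated heredoc: leave the remainder untouched
        }

        // Drop everything between the two markers, keeping the markers themselves.
        $contents = substr($contents, 0, $bodyStart) . substr($contents, $end[0][1]);

        // Continue searching after the heredoc that was just emptied.
        $offset = $bodyStart;
    }

    return $contents;
}

$source = "<?php\n\$a = <<<EOT\nclass Fake {}\nEOT;\nclass Real {}\n"
    . "\$b = <<<'EOT'\n  class AlsoFake {}\nnot the end EOT\nEOT;\n";
echo stripHereNowDocsSketch($source);
```

Because each pattern only ever matches a single marker, the amount of backtracking does not grow with the size of the heredoc body, which is the property the description is after.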

ClassMapGeneratorTest: merge two tests

As hitting the backtrack limit is no longer an issue, the tests with the long heredocs/nowdocs can be merged into the testCreateMap() test method.

I've verified that the long heredoc from the original issue #10037 with the updated fix no longer throws the PHP 8.1 deprecation notice and still gets indexed correctly.

jrfnl · 2021-08-21T13:36:00Z

Note: the failing build is due to a PHPStan issue in a file not touched in this PR, so is unrelated.

jrfnl · 2021-08-21T15:50:15Z

@Seldaek I see your regex change, but that's something I already tried and with that the original problem identified in #10037 still exists.

This can also be seen when you undo the third commit (where I merged the tests, as a separate test for the backtrack issue was no longer needed with my fix).

Seldaek · 2021-08-21T15:55:59Z

Yup, I see that the backtracking is still an issue. I'm investigating if I can't improve this, because I'd rather keep it in a single regex for perf and simplicity reasons, but if not I'll kill off my commit :) Anyway thanks for providing a fix, and no worries at all for introducing a regression, I do it all the time ;)

jrfnl · 2021-08-21T16:05:42Z

@Seldaek I appreciate you doing a thorough check of the proposed fix and please continue to do so, who knows what you'll come up with.
I did try all sorts of variations myself and even though my regex-fu is pretty good (it should be, all things considered), I still couldn't find a way to keep it in one regex while still solving both issues and keeping the regex performant.

Seldaek · 2021-08-21T20:03:24Z

OK I got a fix I believe, 3500 vs 80000 backtracks.. documented the hell out of it too. https://regex101.com/r/JG4eT9/3/ vs old https://regex101.com/r/erysLg/1

Sorry for obsessing over this, I'm sure your solution was fine but it's mostly out of the fact I enjoy a good regex puzzle :D

Now if you'd like to try and break it that would be amazing because I'm not quite sure it's 100% yet, but it seems sensible to my tired brain..

jrfnl · 2021-08-21T20:25:07Z

@Seldaek Have a look when you use the StripNoise.php file as input - the MARKERINTEXT case is broken, which will lead to an unwanted not-actually-a-class being added to the classmap. Sorry to be a spoilsport.
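As a rough illustration of this kind of breakage: the input below is hypothetical and is not the actual StripNoise.php fixture, and both patterns are simplified stand-ins rather than the regex under review. It shows how accepting the end marker anywhere in the text ends the heredoc too early and lets a line that looks like a class declaration leak out.

```php
<?php
// Hypothetical input with the marker word appearing mid-line inside the heredoc body.
$source = "\$s = <<<MARKER\nThe word MARKER may appear mid-line.\nclass NotAClass\nMARKER;\nclass Real {}\n";

// Stand-in pattern 1: end marker accepted anywhere in the text.
$naive = preg_replace('/<<<(\w+)\n.*?\b\1\b/s', '', $source);

// Stand-in pattern 2: end marker accepted only at the start of a line (optionally indented).
$anchored = preg_replace('/<<<(\w+)\n.*?^[ \t]*\1\b/sm', '', $source);

// Check whether the fake class declaration is still visible after stripping.
$naiveLeaks = strpos($naive, 'class NotAClass') !== false;
$anchoredLeaks = strpos($anchored, 'class NotAClass') !== false;

var_dump($naiveLeaks, $anchoredLeaks);
```

A classmap scanner run over the first result would pick up `NotAClass`, which is the sort of unwanted entry described above.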

Seldaek · 2021-08-22T10:14:44Z

Heh you are right, thanks for spotting that. I am a bit worried that this did not break any tests tho. I don't have much time to investigate but I did quickly fix the regex so it is hopefully now a complete solution.

jrfnl · 2021-08-22T13:59:18Z

> Heh you are right, thanks for spotting that. I am a bit worried that this did not break any tests tho. I don't have much time to investigate but I did quickly fix the regex so it is hopefully now a complete solution.

@Seldaek If the regex is going to stay, my third commit needs to be reverted.
I only merged those two tests as my solution removed the need for a safeguard against backtrack limits being reached.

testCreateMap() previously failed on this with an extra class, as the MARKERINTEXT heredoc contains a class of phrase after the first marker. Not sure why it didn't fail now; I would need to look into it.

I do still wonder if a single regex is the best solution for performance, as the more complex the regex and the larger the file, the more performance decreases. The solution I proposed should be fast no matter what.

Seldaek · 2021-08-22T17:26:44Z

Earlier I did some basic benchmarking by running dumpautoload -o in a large project. All 3 versions (the insane backtracking, the fixed regex and your alternative parsing) perform about the same, with no noticeable difference, and those gigantic heredocs are rare enough that it doesn't seem to matter much perf-wise.

jrfnl · 2021-08-22T17:58:52Z