-
-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance improvement of StringContains #4095
Performance improvement of StringContains #4095
Conversation
283854f
to
d32073f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only way to normalize strings in unicode would be to use \Normalizer::normalize($string, Normalizer::FORM_C)
on both strings before passing them to the strpos
-function.
But TBH I would leave that to the person writing the test to be able to find issues with different normalizations (which can be a PITA especially with filenames on MacOS-systems - been there...)
And as that will only work on unicode-strings it would require a lot of guesswork before the actual strpos
regarding the current strings encoding.
Additionally while it might be possible to find a binary match in UTF16 or UTF32 when looking for a single character the possibility to find such a match decreases rapidly with the number of characters tried to match. This seems such an edge-case that the possible performance increase for the majority of the use-cases IMO verifies this risk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This modification would increase speed not only for binary safe comparisons but also for case-insensitive comparisons.
/* | ||
* We must use the multi byte safe version so we can accurately compare non latin upper characters with | ||
* their lowercase equivalents. | ||
*/ | ||
return \mb_stripos($other, $this->string) !== false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To speed the case-insensitive comparison up we could use $other = mb_strtolower($other)
(as used in line 45) here to convert the searched phrase to lower case which is much faster here than using mb_stripos
and then do the comparison binary safe in line 84.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to send a pull request for this.
Thank you for your review, @heiglandreas. |
Thanks @sebastianbergmann. |
I've created a faster version of StringContains, in particular for large haystacks.
I noticed a 4.5x speed up for asserting 1,000x that a 2MB string contains an other string.
We don't need the multi-byte ware function
mb_strpos
as it is approximately O(n^2) wheren
is the length of the haystack.In UTF-16 and UTF-32, it is theoretically possible that same byte sequences map to different characters, but as all PHPUnit's own output is not compatible with this encoding I don't think it will cause any issues in practice as I don't think anyone is running PHPUnit with PHP's internal encoding set to UTF-16/UTF-32.
NB. There is the issue that UTF-8 allows the same characters to be represented by different combinations of unicode code points (and thus bytes), but that wasn't supported by
mb_strpos
anyway as it doesn't normalize the string internally.