Performance improvement of StringContains #4095

willemstuursma · 2020-02-13T15:31:11Z

I've created a faster version of StringContains, in particular for large haystacks.

I measured a 4.5x speedup when asserting 1,000 times that a 2 MB string contains another string.
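A rough way to reproduce a measurement of this kind is sketched below. The helper name `timeContains`, the haystack contents, and the iteration count are assumptions made for illustration; this is not the script behind the 4.5x figure, and the numbers will vary with the PHP version and the mbstring build.

```php
<?php
// Hypothetical micro-benchmark helper; not part of PHPUnit.
function timeContains(callable $contains, string $haystack, string $needle, int $iterations): float
{
    $start = hrtime(true); // monotonic clock, in nanoseconds

    for ($i = 0; $i < $iterations; $i++) {
        $contains($haystack, $needle);
    }

    return (hrtime(true) - $start) / 1e9; // elapsed seconds
}

// Roughly 2 MB of multi-byte UTF-8 text, with the needle at the very end.
$haystack = str_repeat('äöü ', 300000) . 'needle';

$byteWise = function (string $h, string $n): bool {
    return strpos($h, $n) !== false;
};

$multiByte = function (string $h, string $n): bool {
    return mb_strpos($h, $n, 0, 'UTF-8') !== false;
};

printf("strpos:    %.4f s\n", timeContains($byteWise, $haystack, 'needle', 100));
printf("mb_strpos: %.4f s\n", timeContains($multiByte, $haystack, 'needle', 100));
```

Placing the needle at the end forces both functions to scan the whole haystack, which is the case where the difference should be most visible.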

We don't need the multi-byte aware function mb_strpos, which is approximately O(n^2) where n is the length of the haystack.

We don't care about the character position of the substring.
In most encodings, the same characters will always be represented by the same bytes. This holds for ISO-8859-1 and friends, and UTF-8.
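The reasoning above can be illustrated with a small sketch. The function names are hypothetical, and the source file is assumed to be saved as UTF-8. For valid UTF-8 input, a byte-wise `strpos` and a character-aware `mb_strpos` agree on whether the needle is present, even though they report different offsets.

```php
<?php
// Hypothetical helpers for illustration; not PHPUnit's implementation.
function containsByteWise(string $haystack, string $needle): bool
{
    return strpos($haystack, $needle) !== false;
}

function containsMultiByte(string $haystack, string $needle): bool
{
    return mb_strpos($haystack, $needle, 0, 'UTF-8') !== false;
}

$haystack = 'Grüße aus Köln';
$needle   = 'Köln';

// Both approaches agree on the yes/no answer.
var_dump(containsByteWise($haystack, $needle));  // bool(true)
var_dump(containsMultiByte($haystack, $needle)); // bool(true)

// Only the reported position differs: bytes versus characters.
var_dump(strpos($haystack, $needle));                 // int(12)
var_dump(mb_strpos($haystack, $needle, 0, 'UTF-8'));  // int(10)
```

This works for UTF-8 because lead bytes and continuation bytes occupy disjoint ranges, so a valid needle can only match at a character boundary.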

In UTF-16 and UTF-32, it is theoretically possible for the same byte sequence to map to different characters. However, PHPUnit's own output is not compatible with these encodings, so I don't expect issues in practice: I doubt anyone runs PHPUnit with PHP's internal encoding set to UTF-16 or UTF-32.
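One way such a false positive can arise is a byte-wise match that straddles two characters. The byte strings below are hand-constructed for illustration and assume UTF-16BE.

```php
<?php
// U+4100 twice in UTF-16BE: bytes 41 00 41 00.
$haystack = "\x41\x00\x41\x00";

// 'A' (U+0041) in UTF-16BE: bytes 00 41.
$needle = "\x00\x41";

// The byte-wise search matches at byte offset 1, across a character boundary.
var_dump(strpos($haystack, $needle)); // int(1)

// The character-aware search finds nothing: the haystack contains no 'A'.
var_dump(mb_strpos($haystack, $needle, 0, 'UTF-16BE')); // bool(false)
```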

NB: Unicode allows the same character to be represented by different sequences of code points (and thus different bytes in UTF-8), but mb_strpos never handled that either, as it doesn't normalize the strings internally.

be-heiglandreas

The only way to normalize Unicode strings would be to call \Normalizer::normalize($string, Normalizer::FORM_C) on both strings before passing them to strpos.

But TBH I would leave that to the person writing the test, so that they can find issues with different normalizations (which can be a PITA, especially with filenames on macOS systems; been there...).

And as that only works on Unicode strings, it would require a lot of guesswork about the strings' current encoding before the actual strpos call.
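The normalization point can be sketched as follows. The example strings are made up for illustration, and the code assumes the intl extension is loaded, since that is what provides the `Normalizer` class.

```php
<?php
// Requires the intl extension for the Normalizer class.
$precomposed = "caf\u{00E9}";  // 'é' as a single code point (NFC)
$decomposed  = "cafe\u{0301}"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT (NFD)

// Neither search normalizes its input, so both miss the match.
var_dump(strpos($precomposed, $decomposed) !== false);                // bool(false)
var_dump(mb_strpos($precomposed, $decomposed, 0, 'UTF-8') !== false); // bool(false)

// Normalizing both strings to the same form first makes the bytes comparable.
$a = Normalizer::normalize($precomposed, Normalizer::FORM_C);
$b = Normalizer::normalize($decomposed, Normalizer::FORM_C);

var_dump(strpos($a, $b) !== false); // bool(true)
```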

Additionally, while a binary match might occur in UTF-16 or UTF-32 when searching for a single character, the probability of such a match decreases rapidly with the number of characters in the needle. This seems like such an edge case that, IMO, the performance gain for the majority of use cases justifies the risk.

heiglandreas

This modification would increase speed not only for binary safe comparisons but also for case-insensitive comparisons.

heiglandreas · 2020-02-15T17:33:26Z

src/Framework/Constraint/StringContains.php

+            /*
+             * We must use the multi byte safe version so we can accurately compare non latin upper characters with
+             * their lowercase equivalents.
+             */
            return \mb_stripos($other, $this->string) !== false;


To speed up the case-insensitive comparison, we could use $other = mb_strtolower($other) here (as in line 45) to convert the string being searched to lower case, and then do the binary-safe comparison in line 84. That is much faster than using mb_stripos.
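A minimal sketch of that suggestion, assuming UTF-8 input and a source file saved as UTF-8. The function name `containsIgnoringCase` is hypothetical; this is not PHPUnit's implementation.

```php
<?php
// Hypothetical helper illustrating the suggestion; not PHPUnit's implementation.
function containsIgnoringCase(string $haystack, string $needle): bool
{
    // Lower-case both sides once, multi-byte aware, then search byte-wise.
    $haystack = mb_strtolower($haystack, 'UTF-8');
    $needle   = mb_strtolower($needle, 'UTF-8');

    return strpos($haystack, $needle) !== false;
}

var_dump(containsIgnoringCase('ÄPFEL UND BIRNEN', 'äpfel'));   // bool(true)
var_dump(containsIgnoringCase('ÄPFEL UND BIRNEN', 'Kirsche')); // bool(false)
```

Lower-casing is not always the same as case folding, so for some characters the result may differ from what mb_stripos returns.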

Feel free to send a pull request for this.

sebastianbergmann · 2020-02-16T06:15:47Z

Thank you for your review, @heiglandreas.

willemstuursma · 2020-02-16T13:12:55Z

Thanks @sebastianbergmann.

Performance improvement of StringContains

d32073f

willemstuursma force-pushed the fast-string-contains branch from 283854f to d32073f on February 13, 2020 15:32

sebastianbergmann added the feature/assertion (issues related to assertions and expectations) and type/performance (issues related to resource consumption: time and memory) labels Feb 13, 2020

sebastianbergmann self-assigned this Feb 13, 2020

sebastianbergmann added this to the PHPUnit 9.1 milestone Feb 13, 2020

be-heiglandreas approved these changes Feb 15, 2020

heiglandreas approved these changes Feb 15, 2020

heiglandreas reviewed Feb 15, 2020

sebastianbergmann merged commit ac017d1 into sebastianbergmann:master Feb 16, 2020

willemstuursma deleted the fast-string-contains branch February 16, 2020 13:12
