Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvement of StringContains #4095

Conversation

willemstuursma
Copy link
Contributor

I've created a faster version of StringContains, in particular for large haystacks.

I noticed a 4.5x speed up for asserting 1,000x that a 2MB string contains an other string.

We don't need the multi-byte ware function mb_strpos as it is approximately O(n^2) where n is the length of the haystack.

  • We don't care about the character position of the substring.
  • In most encodings, the same characters will always be represented by the same bytes. This holds for ISO-8859-1 and friends, and UTF-8.

In UTF-16 and UTF-32, it is theoretically possible that same byte sequences map to different characters, but as all PHPUnit's own output is not compatible with this encoding I don't think it will cause any issues in practice as I don't think anyone is running PHPUnit with PHP's internal encoding set to UTF-16/UTF-32.

NB. There is the issue that UTF-8 allows the same characters to be represented by different combinations of unicode code points (and thus bytes), but that wasn't supported by mb_strpos anyway as it doesn't normalize the string internally.

@sebastianbergmann sebastianbergmann added feature/assertion Issues related to assertions and expectations type/performance Issues related to resource consumption (time and memory) labels Feb 13, 2020
@sebastianbergmann sebastianbergmann self-assigned this Feb 13, 2020
@sebastianbergmann sebastianbergmann added this to the PHPUnit 9.1 milestone Feb 13, 2020
Copy link

@be-heiglandreas be-heiglandreas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only way to normalize strings in unicode would be to use \Normalizer::normalize($string, Normalizer::FORM_C) on both strings before passing them to the strpos-function.

But TBH I would leave that to the person writing the test to be able to find issues with different normalizations (which can be a PITA especially with filenames on MacOS-systems - been there...)

And as that will only work on unicode-strings it would require a lot of guesswork before the actual strpos regarding the current strings encoding.

Additionally while it might be possible to find a binary match in UTF16 or UTF32 when looking for a single character the possibility to find such a match decreases rapidly with the number of characters tried to match. This seems such an edge-case that the possible performance increase for the majority of the use-cases IMO verifies this risk.

Copy link

@heiglandreas heiglandreas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This modification would increase speed not only for binary safe comparisons but also for case-insensitive comparisons.

/*
* We must use the multi byte safe version so we can accurately compare non latin upper characters with
* their lowercase equivalents.
*/
return \mb_stripos($other, $this->string) !== false;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To speed the case-insensitive comparison up we could use $other = mb_strtolower($other) (as used in line 45) here to convert the searched phrase to lower case which is much faster here than using mb_stripos and then do the comparison binary safe in line 84.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to send a pull request for this.

@sebastianbergmann sebastianbergmann merged commit ac017d1 into sebastianbergmann:master Feb 16, 2020
@sebastianbergmann
Copy link
Owner

Thank you for your review, @heiglandreas.

@willemstuursma willemstuursma deleted the fast-string-contains branch February 16, 2020 13:12
@willemstuursma
Copy link
Contributor Author

Thanks @sebastianbergmann.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature/assertion Issues related to assertions and expectations type/performance Issues related to resource consumption (time and memory)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants