Optimize regular expressions #674

Open
colinodell opened this issue Jun 19, 2021 · 2 comments
Labels
do not close Issue which won't close due to inactivity hacktoberfest performance Something could be made faster or more efficient up-for-grabs Please feel free to take this one!

Comments

@colinodell
Member

colinodell commented Jun 19, 2021

This library makes heavy use of regular expressions. While most of them should be fairly performant, there is likely still room to make the library faster. Examples of improvements might include:

  1. Replacing non-regex parsing logic with regular expressions (if that's quicker)
  2. Replacing regex-based parsing with logic that doesn't use regular expressions (if that's quicker)
  3. Combining multiple regexes into one (if that's quicker)
  4. Fixing excessive backtracking in expressions
  5. Other improvements to existing expressions
  6. ???
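
As a rough illustration of item 4, the sketch below compares a pattern with nested quantifiers against a variant that uses possessive quantifiers. Both patterns are hypothetical and are not taken from this library. Nested quantifiers such as `(?:[a-z]+\s?)+` can backtrack exponentially on input that almost matches, while the possessive form never gives characters back. The two are only checked for agreement on the sample inputs listed; that is not a proof of equivalence.

```php
<?php
declare(strict_types=1);

// Hypothetical patterns for illustration only; neither is taken from the library.
// Nested quantifiers: on a near-miss input the engine retries many ways of
// splitting the letters between iterations of the outer group.
$nested = '/^(?:[a-z]+\s?)+$/';

// Possessive quantifiers (++) keep whatever they matched, so those retries
// are never attempted.
$possessive = '/^(?:[a-z]++\s?)++$/';

// Short inputs only, so the nested version stays well below pcre.backtrack_limit.
$subjects = ['foo bar', 'foo', 'foo bar baz ', 'foo  bar', 'foobarbazqux!', ''];

$results = [];
foreach ($subjects as $subject) {
    $results[$subject] = [
        preg_match($nested, $subject),
        preg_match($possessive, $subject),
    ];
}

foreach ($results as $subject => [$a, $b]) {
    printf("%-16s nested=%d possessive=%d\n", var_export($subject, true), $a, $b);
}
```

On much longer near-miss inputs, `preg_match()` may return `false` for the nested pattern instead of `0`, so a real comparison should also check `preg_last_error()`.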

Tools that could help here include:

A partial list of areas where regex is used in this library includes:

I will accept (almost) any PR that aims to improve performance, though I would ask that you keep the following in mind:

  • The performance improvement should be measurable, using either our performance benchmark or some other means
  • Improvements that don't break BC are preferred, though substantial improvements requiring a major version bump would be considered
  • The rationale behind the improvements should either be obvious or have a description in the PR explaining what you did and why
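
For the "some other means" case in the first bullet, a minimal timing sketch might look like the following. It is not the project's performance benchmark; the `timePattern()` helper, the sample patterns, and the iteration count are all assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: average time in nanoseconds for one preg_match() call.
 */
function timePattern(string $pattern, string $subject, int $iterations = 10000): float
{
    // Run once up front so compiling the pattern is not part of the measurement,
    // and so an invalid pattern or PCRE error is reported instead of timed.
    if (preg_match($pattern, $subject) === false) {
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }

    $start = hrtime(true); // monotonic clock, nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        preg_match($pattern, $subject);
    }

    return (hrtime(true) - $start) / $iterations;
}

// Example comparison of two hypothetical candidate patterns on the same input.
$subject = str_repeat('word ', 50);
$before  = timePattern('/^(?:[a-z]+\s?)+$/', $subject);
$after   = timePattern('/^(?:[a-z]++\s?)++$/', $subject);

printf("before: %.1f ns, after: %.1f ns\n", $before, $after);
```

Single runs like this are noisy, so it would be worth repeating the measurement several times and testing against realistic Markdown input before drawing conclusions.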
@colinodell colinodell added performance Something could be made faster or more efficient do not close Issue which won't close due to inactivity labels Jun 19, 2021
@colinodell colinodell added this to the v2.1 milestone Jun 19, 2021
@colinodell colinodell self-assigned this Jun 19, 2021
@colinodell colinodell added hacktoberfest up-for-grabs Please feel free to take this one! labels Oct 1, 2021
@colinodell colinodell removed their assignment Oct 1, 2021
@colinodell
Member Author

I'm removing the v2.1 milestone as I've already tested a number of expressions and am fairly happy with the current state of things. However, I'll keep this open in case any regex experts want to dig deeper and maybe find something that I missed.

@colinodell colinodell removed this from the v2.1 milestone Nov 7, 2021
@live627

live627 commented Mar 22, 2023

Regexes with lots of alternations, like the one I link to, could be optimized:

public const PARTIAL_BLOCKTAGNAME = '(?:address|article|aside|base|basefont|blockquote|body|caption|center|col|colgroup|dd|details|dialog|dir|div|dl|dt|fieldset|figcaption|figure|footer|form|frame|frameset|h1|head|header|hr|html|iframe|legend|li|link|main|menu|menuitem|nav|noframes|ol|optgroup|option|p|param|section|source|summary|table|tbody|td|tfoot|th|thead|title|tr|track|ul)';

Several alternations could be reduced by combining similar ones into optional atomic groups, but readability and maintainability go down the toilet and break the sewers. However, I cannot find where that specific regex is used.
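
A sketch of that kind of rewrite, using only the tag names from the constant above that start with "t", might look like this. The factored pattern is an illustration, not a proposed replacement. It uses plain non-capturing groups rather than atomic ones, and the `^...$` anchors are an assumption added so that both forms must match the whole name. Without such a boundary the two forms can differ, because the flat list tries `th` before `thead` while the factored one tries the longer name first.

```php
<?php
declare(strict_types=1);

// Subset of the names in PARTIAL_BLOCKTAGNAME that start with "t".
$names = ['table', 'tbody', 'td', 'tfoot', 'th', 'thead', 'title', 'tr', 'track'];

// Flat alternation, anchored so the whole subject must be a tag name.
$flat = '/^(?:' . implode('|', $names) . ')$/';

// Same names with the shared "t" prefix factored out and optional suffixes
// for th/thead and tr/track.
$factored = '/^t(?:able|body|d|foot|h(?:ead)?|itle|r(?:ack)?)$/';

// Every listed name plus a few near misses.
$subjects = array_merge($names, ['t', 'tab', 'thea', 'tracks', 'div', '']);

$mismatches = [];
foreach ($subjects as $subject) {
    if (preg_match($flat, $subject) !== preg_match($factored, $subject)) {
        $mismatches[] = $subject;
    }
}

echo $mismatches === [] ? "patterns agree on all samples\n" : "patterns differ\n";
```

PCRE already applies some optimizations of its own, so whether a rewrite like this is actually faster would still need to be measured.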
