
Simplify markdown strikethrough regex and leave text parsing to lower cases #1618

Closed
wants to merge 3 commits into from

Conversation

gerner
Contributor

@gerner gerner commented Dec 1, 2020

The regex r'([^~]*)(~~[^~]+~~)' ends up being expensive: it looks for an essentially arbitrary string of non-"~" characters of any length preceding the strikethrough. When this fails on a long string, the lexer may fall back to consuming a single character and then checking this regex again, only to fail again. In the worst case we repeat this O(n) regex for every character in the file, leading to O(n^2) behavior.

#1617 illustrates how this can cause poor performance in practice on an otherwise "normal" markdown file.

This fix eliminates the [^~]* group so the rule only triggers when the next character is "~". I believe this is OK because the other cases that advance lexing are either specific patterns with reasonable starting and terminating delimiters (e.g. image tags or code spans), or the non-space, non-escape text case, or the single-character text fallback. In those cases we consume a minimal (in some sense) amount of content before we can possibly return to the start of the strikethrough. In fact, other than the non-space text case, this was the only "inline" pattern that consumed a nearly arbitrary string before matching a specific pattern.
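The difference between the two patterns can be sketched standalone with the `re` module (the variable names here are illustrative, not the lexer's actual rule names):

```python
import re

# Old pattern: a greedy [^~]* prefix group before the strikethrough body.
# On a long line with no "~" at all, every match attempt scans (and
# backtracks over) the whole remainder before failing.
OLD = re.compile(r'([^~]*)(~~[^~]+~~)')

# Proposed pattern: fires only when the lexer is already sitting on "~",
# so a match attempt on non-"~" input fails immediately.
NEW = re.compile(r'(~~[^~]+~~)')

text = "this is text " * 1000  # long fragment with no strikethrough

assert OLD.match(text) is None      # fails, but only after scanning far ahead
assert NEW.match(text) is None      # fails on the very first character
assert NEW.match("~~gone~~ rest") is not None  # still matches real strikethrough
```

The lexing work that the old pattern's prefix group did is instead left to the ordinary text rules, which consume content in small, bounded steps.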

Existing tests pass, and I added another simple, happy-path test case just to make sure text processing isn't broken. I also tested the old and new logic on the file from #1617 and got the same terminal-formatted result, only in 400ms instead of 21s.

@gerner
Contributor Author

gerner commented Dec 3, 2020

There's a slight problem with this in that the following won't get the right tokens:

this is not striked~~but this should be striked but is not~~

which GitHub renders:

this is not strikedbut this should be striked but is not

I have another forthcoming change that fixes this.

@gerner
Contributor Author

gerner commented Dec 3, 2020

The key to handling inline formatting embedded in a word is specializing the text rule to not match the inline formatting characters "~*`" (notably excluding "_"). This allows inline formatting to trigger when embedded within words, without being delimited by spaces. There's still a fallback rule that matches any single character when those characters don't actually trigger formatting (e.g. in foo*bar, bar should not be emphasized).
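A minimal sketch of the two rules described above (the names TEXT and FALLBACK are hypothetical, not the actual Pygments rule names):

```python
import re

# Word text: stop at whitespace, escapes, and the inline formatting
# characters that can fire mid-word ("~", "*", "`"). Note that "_" is
# deliberately NOT excluded, since underscores don't trigger formatting
# inside a word on GitHub.
TEXT = re.compile(r'[^\s\\~*`]+')

# Fallback: consume a single character when no other rule applies,
# e.g. a lone "*" in foo*bar that should not start emphasis.
FALLBACK = re.compile(r'.')

assert TEXT.match('striked~~rest').group() == 'striked'      # stops before "~~"
assert TEXT.match('snake_case_word').group() == 'snake_case_word'  # "_" stays in-word
assert TEXT.match('*bar') is None                            # defer to other rules
assert FALLBACK.match('*bar').group() == '*'                 # then consume one char
```

Stopping the text rule at "~", "*", and "`" is what gives the strikethrough, emphasis, and code-span rules a chance to match mid-word.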

"_" doesn't trigger formatting when embedded in words, at least on GitHub. I have a GitHub test gist here:

https://gist.github.com/gerner/8351bd8abd709ef892808f041584fae2

This change is necessary to prevent a regression in this case for strikethrough, and at the same time I think it fixes previously unknown bugs in the other cases. I added test cases that check for those cases, as well as for the strikethrough case that already worked because of how the strikethrough regex was written, but would have been broken by my change if I hadn't caught the issue.

I think this also produces a more sensible stream of tokens: each word turns into its own Text token, instead of a whole string of words becoming a single Text token preceding a strikethrough string.

@gerner
Contributor Author

gerner commented Dec 3, 2020

I added a test case that covers the original perf issue. Pygmentizing a file with that fragment used to take 25 seconds; now it takes 400ms (much of which, I think, is just spinning up the Python process).

The fragment is 104KB on a single line: "this is text " repeated over and over. The same pattern occurs if there are newlines throughout (e.g. 104KB of "this is text\n").

Note that the original file from #1617 was 79KB. I believe that's above average, perhaps even in the top 20% of markdown files by size, but not outlandishly large. In fact, I converted doc/docs/lexerdevelopment.rst (about 28KB) to markdown using pandoc and ran pygmentize on it with the old code and the new: it used to take just over 1 second and now takes 200ms. Because of the O(n^2) behavior, things get much worse very quickly. So I think it's reasonable to expect this is at best annoying for many markdown docs and at worst totally blocking for some.
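The quadratic blowup described above can be reproduced standalone with a sketch of the lexer's worst-case loop (this is not the actual lexer code, and the fragment is kept small so it runs quickly):

```python
import re
import time

# The old pattern, with its greedy [^~]* prefix group.
OLD = re.compile(r'([^~]*)(~~[^~]+~~)')

# A stand-in for the perf-test fragment: lots of text, no strikethrough.
text = "this is text " * 500

# Simulated worst case: attempt the regex, fail, consume one character,
# retry. Every attempt rescans the entire remainder, so total work grows
# quadratically with input size.
start = time.perf_counter()
pos = 0
attempts = 0
while pos < len(text):
    attempts += 1
    if OLD.match(text, pos) is None:
        pos += 1  # fallback: consume a single character and try again
elapsed = time.perf_counter() - start
print(f"{attempts} O(n) regex attempts over {len(text)} chars in {elapsed:.3f}s")
```

Doubling the fragment size roughly quadruples the time, which is why a 79KB file that "should" take milliseconds ran for tens of seconds.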

@gerner
Contributor Author

gerner commented Dec 3, 2020

Ouch, it looks like there are other perf issues with pypy that make my perf test too aggressive. I've slimmed it down substantially. I don't think it provides much perf regression coverage now, but fixing the other perf issues is probably out of scope for this PR.

@gerner
Contributor Author

gerner commented Dec 11, 2020

Just to be clear, I think this change is good to go. The perf issue with pypy is not a regression; I'm just pointing out that there are other perf issues in pypy that make a timed test hard to run across different Python interpreters. This change doesn't make things worse; in fact, I think it improves things. That improvement is very dramatic on CPython; on pypy it's more subtle.

@Anteru
Collaborator

Anteru commented Jan 6, 2021

Obsoleted by #1623.

@Anteru Anteru closed this Jan 6, 2021