
Simplify markdown strikethrough regex and leave text parsing to lower cases #1618

Closed
wants to merge 3 commits into from

Conversation

gerner
Contributor

@gerner gerner commented Dec 1, 2020

The regex r'([^~]*)(~~[^~]+~~)' ends up being expensive: it looks for an essentially arbitrary string of non-"~" characters of any length preceding the strikethrough. When this fails on a long string, the lexer may fall back to consuming a single character and then checking this regex again, only to fail again. In the worst case we repeat this O(n) regex for every character in the file, leading to O(n^2) behavior.

#1617 illustrates how this can cause poor performance in practice on an otherwise "normal" markdown file.

This fix eliminates the [^~]* group so the rule only triggers when the next character is "~". I believe this is OK because the other cases that advance lexing are either specific patterns with reasonable starting and terminating delimiters (e.g. image tags or code spans), or the non-space, non-escape text case, or the single-character text fallback. In those cases we consume a minimal (in some sense) amount of content before we can possibly return to the start of the strikethrough. In fact, other than the non-space text case, this was the only "inline" pattern that consumed a nearly arbitrary string before matching a specific pattern.
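The difference between the two patterns can be sketched standalone with the `re` module (the variable names here are illustrative, not the lexer's actual rule names):

```python
import re

# Old pattern: a greedy [^~]* prefix group before the strikethrough body.
# On a long line with no "~" at all, every match attempt scans (and
# backtracks over) the whole remainder before failing.
OLD = re.compile(r'([^~]*)(~~[^~]+~~)')

# Proposed pattern: fires only when the lexer is already sitting on "~",
# so a match attempt on non-"~" input fails immediately.
NEW = re.compile(r'(~~[^~]+~~)')

text = "this is text " * 1000  # long fragment with no strikethrough

assert OLD.match(text) is None      # fails, but only after scanning far ahead
assert NEW.match(text) is None      # fails on the very first character
assert NEW.match("~~gone~~ rest") is not None  # still matches real strikethrough
```

The lexing work that the old pattern's prefix group did is instead left to the ordinary text rules, which consume content in small, bounded steps.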

Existing tests pass, and I added another simple, happy-path test case just to make sure text processing isn't broken. I also tested the old and new logic on the file from #1617 and got the same terminal-formatted result, only in 400ms instead of 21s.

@gerner
Contributor Author

gerner commented Dec 3, 2020

There's a slight problem with this in that the following won't get the right tokens:

this is not striked~~but this should be striked but is not~~

which GitHub renders:

this is not strikedbut this should be striked but is not

I have another forthcoming change that fixes this.

@gerner
Contributor Author

gerner commented Dec 3, 2020

The key to handling inline formatting embedded in a word is specializing the text rule to not match the inline formatting characters "~*`" (notably excluding "_"). This allows inline formatting to trigger when embedded within words, without being delimited by spaces. There's still a fallback rule that matches any single character when those characters don't actually trigger formatting (e.g. in foo*bar, bar should not be emphasized).
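A minimal sketch of the two rules described above (the names TEXT and FALLBACK are hypothetical, not the actual Pygments rule names):

```python
import re

# Word text: stop at whitespace, escapes, and the inline formatting
# characters that can fire mid-word ("~", "*", "`"). Note that "_" is
# deliberately NOT excluded, since underscores don't trigger formatting
# inside a word on GitHub.
TEXT = re.compile(r'[^\s\\~*`]+')

# Fallback: consume a single character when no other rule applies,
# e.g. a lone "*" in foo*bar that should not start emphasis.
FALLBACK = re.compile(r'.')

assert TEXT.match('striked~~rest').group() == 'striked'      # stops before "~~"
assert TEXT.match('snake_case_word').group() == 'snake_case_word'  # "_" stays in-word
assert TEXT.match('*bar') is None                            # defer to other rules
assert FALLBACK.match('*bar').group() == '*'                 # then consume one char
```

Stopping the text rule at "~", "*", and "`" is what gives the strikethrough, emphasis, and code-span rules a chance to match mid-word.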

"_" doesn't trigger formatting when embedded in words, at least on GitHub. I have a GitHub test gist here:

https://gist.github.com/gerner/8351bd8abd709ef892808f041584fae2

This change is necessary to prevent a regression in this case for strikethrough, and at the same time I think it fixes previously unknown bugs in the other cases. I added test cases that check for those cases, as well as for the strikethrough case that already worked because of how the strikethrough regex was written, but would have been broken by my change if I hadn't caught the issue.

I think this also produces a more sensible stream of tokens: each word turns into its own Text token, instead of a whole string of words becoming a single Text token preceding a strikethrough string.

@gerner
Contributor Author

gerner commented Dec 3, 2020

I added a test case that covers the original perf issue. Pygmentizing a file with that fragment used to take 25 seconds; now it takes 400ms (much of which, I think, is just spinning up the Python process).

The fragment is 104KB on a single line: "this is text " repeated over and over. The same pattern occurs if there are newlines throughout (e.g. 104KB of "this is text\n").

Note that the original file from #1617 was 79KB. I believe that's above average, perhaps even in the top 20% of markdown files by size, but not outlandishly large. In fact, I converted doc/docs/lexerdevelopment.rst (about 28KB) to markdown using pandoc and ran pygmentize on it with the old code and the new: it used to take just over 1 second and now takes 200ms. Because of the O(n^2) behavior, things get much worse very quickly. So I think it's reasonable to expect this is at best annoying for many markdown docs and at worst totally blocking for some.
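The quadratic blowup described above can be reproduced standalone with a sketch of the lexer's worst-case loop (this is not the actual lexer code, and the fragment is kept small so it runs quickly):

```python
import re
import time

# The old pattern, with its greedy [^~]* prefix group.
OLD = re.compile(r'([^~]*)(~~[^~]+~~)')

# A stand-in for the perf-test fragment: lots of text, no strikethrough.
text = "this is text " * 500

# Simulated worst case: attempt the regex, fail, consume one character,
# retry. Every attempt rescans the entire remainder, so total work grows
# quadratically with input size.
start = time.perf_counter()
pos = 0
attempts = 0
while pos < len(text):
    attempts += 1
    if OLD.match(text, pos) is None:
        pos += 1  # fallback: consume a single character and try again
elapsed = time.perf_counter() - start
print(f"{attempts} O(n) regex attempts over {len(text)} chars in {elapsed:.3f}s")
```

Doubling the fragment size roughly quadruples the time, which is why a 79KB file that "should" take milliseconds ran for tens of seconds.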

@gerner
Contributor Author

gerner commented Dec 3, 2020

Ouch, it looks like there are other perf issues with pypy that make my perf test too aggressive. I've slimmed it down substantially. I don't think it provides much perf regression coverage now, but fixing the other perf issues is probably out of scope for this PR.

@gerner
Contributor Author

gerner commented Dec 11, 2020

Just to be clear, I think this change is good to go. The perf issue with pypy is not a regression; I'm just pointing out that there are other perf issues in pypy that make a timed test hard to run across different Python interpreters. This change doesn't make things worse; in fact, I think it improves things. That improvement is very dramatic on CPython; on pypy it's more subtle.

@Anteru
Collaborator

Anteru commented Jan 6, 2021

Obsoleted by #1623.

@Anteru Anteru closed this Jan 6, 2021