extremely poor performance on certain markdown file #1617

gerner · 2020-11-28T22:31:16Z

This particular markdown takes a long time (21 seconds on my laptop) to parse:
https://github.com/date-fns/date-fns/blob/a9fc0c7b715883349555bfb94daa1059430eda52/src/locale/en-US/snapshot.md

$ time pygmentize -f terminal -l md -o /dev/null /tmp/snapshot.md

real    0m21.318s
user    0m21.306s
sys     0m0.012s

I've seen slow-ish parsing performance on markdown with tables before; however, I'm not certain the tables are what's causing the issue here.

gerner · 2020-11-28T22:32:35Z

I do see Error tokens showing up for '\n' on its own line which I think is because the rule here doesn't match a newline:

https://github.com/pygments/pygments/blob/master/pygments/lexers/markup.py#L601

That seems like a separate issue. If I add a rule to match r'\n', most of the errors go away, but that doesn't speed up processing at all.

gerner · 2020-11-28T22:49:16Z

I think the tables are a red herring. I pulled out the table rules and performance didn't change.

However, this rule looks expensive:

# strikethrough
(r'([^~]*)(~~[^~]+~~)', bygroups(Text, Generic.Deleted)),

I don't know why, but if I comment it out the file is processed in 400ms down from 21s. Note that this file doesn't have the character "~" anywhere in it.

It looks like the leading ([^~]*) group is the culprit here. Changing the rule to this gets the same improvement in performance (400ms, down from 21s):

# strikethrough
(r'(~~[^~]+~~)', bygroups(Generic.Deleted)),

I don't think we need that leading group, since there's already a catch-all single-character rule that will eat up text characters that don't match.

Also, why the heavy use of bygroups throughout?

gerner · 2020-11-28T22:58:07Z

Note, if I change the rule as suggested above the output of pygmentize doesn't change and all test cases still pass. It just happens in 400ms instead of 21s.
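A likely explanation for the slowdown, sketched outside of Pygments: when the input has no "~", the leading ([^~]*) group consumes everything up to the end of the text, fails to find "~~", and backtracks, and that repeats at every position the rule is tried from, which is roughly quadratic in the file size. The `scan` helper below is a hypothetical stand-in for the lexer loop (it is not Pygments code); only the two regexes are taken from the rules above.

```python
import re
import time

# The two strikethrough patterns quoted in this thread.
WITH_PREFIX = re.compile(r'([^~]*)(~~[^~]+~~)')
WITHOUT_PREFIX = re.compile(r'(~~[^~]+~~)')


def scan(pattern, text):
    """Hypothetical stand-in for a lexer loop (not Pygments itself):
    try the pattern at the current position and, if it fails, consume
    one character, like a catch-all single-character rule would."""
    deleted = []
    pos = 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if m:
            # The last group is the ~~...~~ span in both patterns.
            deleted.append(m.group(m.lastindex))
            pos = m.end()
        else:
            pos += 1
    return deleted


def timed(pattern, text):
    """Return the scan result and the wall-clock time it took."""
    start = time.perf_counter()
    result = scan(pattern, text)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Assumed sample input: table-like rows with no "~" anywhere.
    no_tilde = "| a | b |\n" * 500
    for name, pattern in [("with prefix", WITH_PREFIX),
                          ("without prefix", WITHOUT_PREFIX)]:
        _, elapsed = timed(pattern, no_tilde)
        print(f"{name}: {elapsed:.4f}s")
```

On input without any "~" both patterns find nothing, and on input with strikethrough they pick out the same ~~...~~ spans; the only difference in this sketch is how much text the prefixed pattern rescans before giving up at each position.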

gerner · 2021-01-06T18:33:59Z

This is fixed by #1623

gerner changed the title ~~extremely poor performance on table-heavy markdown file~~ extremely poor performance on certain markdown file Nov 28, 2020

gerner mentioned this issue Dec 1, 2020

Simplify markdown strikethrough regex and leave text parsing to lower cases #1618

Closed

gerner mentioned this issue Dec 17, 2020

Markdown lexer improvements #1623

Merged

gerner closed this as completed Jan 6, 2021

Anteru added the changelog-update label (items which need to get mentioned in the changelog) Jan 6, 2021

Anteru added this to the 2.7.4 milestone Jan 6, 2021

Anteru removed the changelog-update label (items which need to get mentioned in the changelog) Jan 12, 2021
