Speed up JSON and reduce HTML formatter consumption #1569

kurtmckee · 2020-10-11T03:21:10Z

In #1425, the author of an app that depends on Pygments reported slowness when running Pygments against large JSON files.

I investigated by generating a 118MB JSON file as input, using the inputs reported by the author of that app. I found that the regex parser's .get_tokens_unprocessed() took ~63 seconds to lex the entire file. I then rewrote the parser in Python and found that the time was reduced to only ~22 seconds.

I also found that Pygments was consuming ~3GB of memory when formatting the 118MB JSON file in HTML. It appears that the buffered file I/O is caching everything in memory before finally writing to the file all at once as Pygments exits. While I wasn't able to stop the buffering from waiting until the entire file was in memory, I was able to shave off an entire gigabyte of memory consumption by caching the opening span classes that are generated per-token (like <span class="s2"> or <span style="whatever">). I didn't find similar memory consumption issues in the Terminal256 formatter.
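To illustrate the span-caching idea described above, here is a minimal sketch. It is not the actual Pygments formatter code: `format_tokens`, `css_class_for`, and the string token types are hypothetical names chosen for the example, and the token values are assumed to be HTML-escaped already. The point is only that the opening tag for a given token type is built once and the same string object is reused for every later token of that type.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple


def format_tokens(
    tokens: Iterable[Tuple[str, str]],
    css_class_for: Callable[[str], str],
) -> Iterator[str]:
    """Yield HTML fragments, building each opening <span> once per token type.

    `tokens` yields (token_type, already_escaped_value) pairs.
    `css_class_for` is a hypothetical lookup returning "" for unstyled types.
    """
    span_cache: Dict[str, str] = {}  # token type -> opening tag ("" if unstyled)
    for token_type, value in tokens:
        opening = span_cache.get(token_type)
        if opening is None:
            # First time this token type is seen: build and remember the tag.
            css_class = css_class_for(token_type)
            opening = f'<span class="{css_class}">' if css_class else ""
            span_cache[token_type] = opening
        if opening:
            yield f"{opening}{value}</span>"
        else:
            yield value
```

In a JSON file with millions of tokens but only a handful of token types, the lookup function runs a handful of times instead of once per token.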

For the Terminal256 formatter, this patch cuts the total runtime almost in half, dropping from 2:02 to 1:09 total processing time (as measured by PowerShell's Measure-Command).

For the HTML formatter, the gains are more significant:

* Memory consumption drops by ~33%
* The HTML formatter with default options drops from 1:56 to 1:04 total processing time.
* The HTML formatter with the noclasses option drops from 2:59 to 1:03 total processing time.

All output from the HTML and Terminal256 formatters is 100% byte-for-byte identical between master branch and this branch.
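The PR does not say how the byte-for-byte comparison was done. One way to check it for outputs this large is to hash both files in chunks, as in the sketch below. The function names are hypothetical, and the file paths you pass in would be whatever the two branches wrote.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_identical(before: Path, after: Path) -> bool:
    """True when both files have the same size and the same SHA-256 digest."""
    if before.stat().st_size != after.stat().st_size:
        return False  # cheap early exit before hashing anything
    return file_digest(before) == file_digest(after)
```

Reading in binary mode matters here: text mode could hide newline differences between platforms.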

Changes in this patch:

* Update the JSON-LD URL to HTTPS
* Update the list of JSON-LD keywords
* Make the JSON-LD parser less dependent on the JSON lexer implementation
* Add unit tests for the JSON-LD lexer

This includes:

* Testing valid literals
* Testing valid string escapes
* Testing that object keys are tokenized differently from string values

Related to pygments#1425

Included in this change:

* The JSON parser is rewritten
* The JSON bare object parser no longer requires additional code
* `get_tokens_unprocessed()` returns as much as it can to reduce yields (for example, side-by-side punctuation is not returned separately)
* The unit tests were updated
* Add unit tests based on Hypothesis test results
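The "reduce yields" point can be sketched with a small hand-written scanner. This is not the lexer from the PR: `batched_tokens` and the string token types are hypothetical, it does not distinguish object keys from string values, and it does not validate numbers or constants. It only shows how a run of adjacent punctuation such as `]}]` can be yielded as one `(index, token_type, value)` tuple rather than three.

```python
from typing import Iterator, Tuple

PUNCTUATION = set("{}[],:")
WHITESPACE = set(" \t\r\n")
DELIMITERS = PUNCTUATION | WHITESPACE | {'"'}


def batched_tokens(text: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (start_index, token_type, value), merging runs of same-type characters.

    Simplified: strings are double-quoted with backslash escapes; everything
    else (numbers, true/false/null) is one unvalidated "Literal" run.
    """
    pos, length = 0, len(text)
    while pos < length:
        start = pos
        char = text[pos]
        if char in PUNCTUATION:
            # Consume the whole run of punctuation and yield it once.
            while pos < length and text[pos] in PUNCTUATION:
                pos += 1
            yield start, "Punctuation", text[start:pos]
        elif char in WHITESPACE:
            while pos < length and text[pos] in WHITESPACE:
                pos += 1
            yield start, "Whitespace", text[start:pos]
        elif char == '"':
            pos += 1
            while pos < length and text[pos] != '"':
                # Skip the escaped character after a backslash.
                pos += 2 if text[pos] == "\\" else 1
            pos = min(pos + 1, length)  # include the closing quote if present
            yield start, "String", text[start:pos]
        else:
            while pos < length and text[pos] not in DELIMITERS:
                pos += 1
            yield start, "Literal", text[start:pos]
```

For the input `[{"k": [1, true]}]` this yields 10 tokens, where a one-token-per-character treatment of punctuation would yield 13.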

Related to pygments#1425

Tested on a 118MB JSON file. Memory consumption tops out at ~3GB before this patch and drops to only ~2GB with this patch.

These were the command lines used:

    python -m pygments -l json -f html -o .\new-code-classes.html .\jc-output.txt
    python -m pygments -l json -f html -O "noclasses" -o .\new-code-styles.html .\jc-output.txt

Add an LRU cache to the HTML formatter's HTML-escaping and line-splitting

For a 118MB JSON input file, this reduces memory consumption by ~500MB and reduces formatting time by ~15 seconds.

kurtmckee · 2020-10-11T19:00:23Z

I added an LRU cache that further reduces the HTML formatter's memory consumption and processing time. The total run-time for the 118MB JSON test has dropped to ~51 seconds and the memory consumption has dropped to ~1.7GB (down from ~3.2GB).
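A minimal sketch of the LRU-cache idea, assuming the cached step is "escape a token's text, then split it into lines". It uses the standard library's `html.escape` rather than Pygments's own escaping, and `escape_and_split` and the `maxsize` value are hypothetical choices for the example, not taken from the patch.

```python
import html
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=100)  # arbitrary size for the sketch
def escape_and_split(value: str) -> Tuple[str, ...]:
    """HTML-escape a token's text and split it into lines, memoized by input text.

    A tuple is returned so the cached result cannot be mutated by callers.
    """
    return tuple(html.escape(value, quote=True).split("\n"))
```

JSON input repeats a small set of short tokens (braces, commas, colons, common keys) very often, so most calls are cache hits that return an existing object instead of allocating new strings. `escape_and_split.cache_info()` reports the hit and miss counts if you want to measure this on your own input.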

Anteru

Ok, that's quite an impressive JSON parser there :) Well done -- my only concern is reduced test coverage. Yes, the parser is much less prone to some problems, and running those tests won't yield any new insights, but it ensures we don't regress. Unless a test is invalid, it should be kept around.

tests/test_data.py

Anteru

I wasn't aware of pytest-timeout, but it does indeed sound very useful, given that backtracking is typically going to time out. @birkenfeld Any concerns with adding pytest-timeout as a dependency?

pygments/lexers/data.py

Anteru · 2020-10-26T20:33:16Z

Thanks a lot, excellent work!

kurtmckee · 2020-10-26T20:45:54Z

Thanks! I really had a blast working on this. =)

birkenfeld · 2020-10-28T07:41:18Z

@kurtmckee thanks a lot! I am surprised that the lru_cache made such a big difference. I guess it's kind of specific to some languages with lots of similar small tokens like braces etc.

gerner · 2020-11-20T00:42:05Z

@kurtmckee nice work. I'm looking forward to using the parser and I'm glad that the backtracking test is still in there :) I can't wait for this to get released.

kurtmckee added 5 commits October 10, 2020 21:52

Update the JSON-LD keyword list to match JSON-LD 1.1

61de3c3

Changes in this patch:

* Update the JSON-LD URL to HTTPS
* Update the list of JSON-LD keywords
* Make the JSON-LD parser less dependent on the JSON lexer implementation
* Add unit tests for the JSON-LD lexer

Add unit tests for the JSON parser

271be39

This includes:

* Testing valid literals
* Testing valid string escapes
* Testing that object keys are tokenized differently from string values

Add an LRU cache to the HTML formatter's HTML-escaping and line-splitting

69e4882

For a 118MB JSON input file, this reduces memory consumption by ~500MB and reduces formatting time by ~15 seconds.

Anteru requested changes Oct 12, 2020

tests/test_data.py

Anteru added this to the 2.8 milestone Oct 24, 2020

Anteru self-assigned this Oct 24, 2020

JSON: Add a catastrophic backtracking test back to the test suite

a219f27

kurtmckee requested a review from Anteru October 25, 2020 16:01

Anteru approved these changes Oct 25, 2020

pygments/lexers/data.py

kurtmckee added 2 commits October 25, 2020 20:22

JSON: Update the comment that documents the internal queue

8800a92

JSON: Document in comments that ints/floats/constants are not validated

e086e48

Anteru removed this from the 2.8 milestone Oct 26, 2020

Anteru added the changelog-update label (Items which need to get mentioned in the changelog) Oct 26, 2020

Anteru added this to the 2.7.3 milestone Oct 26, 2020

Anteru merged commit 164dcb5 into pygments:master Oct 26, 2020

Anteru removed the changelog-update label (Items which need to get mentioned in the changelog) Dec 5, 2020

asottile mentioned this pull request Dec 25, 2020

pygments 2.7.3 broke pygments-ansi-color's html formatter subclass #1644

Closed

kurtmckee deleted the speedy-json branch March 3, 2021 14:19
