Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not compress the outer parts of a SyntaxSet #398

Merged
merged 3 commits into from Dec 23, 2021

Conversation

Enselic
Copy link
Collaborator

@Enselic Enselic commented Nov 25, 2021

Here is the last PR in the lazy-loading series.

Since the lazy-loaded syntaxes already has their data compressed, it is only
detrimental to performance to compress another time. So stop compressing the
SyntaxSet and add APIs for clients too.

To summarize impact of lazy-loading work, here are criterion benchmark numbers with v4.6.0 as baseline (with some back-ported benches to make it possible to compare):

Much faster:

  • from_dump_file takes 1.3263 ms instead of 42.257 ms, an improvement with -96.859%
  • load_internal_dump show similar numbers as from_dump_file

Significant improvements:

  • load_and_highlight/*: Each test is ~41 ms faster compared to before (due to the load_internal_dump improvement). For example, load_and_highlight/"parser.rs" takes ~434 ms instead of ~467 ms, and load_and_highlight/"highlight_test.erb" takes ~32 ms instead of ~62 ms.

No practical difference in performance compared to v4.6.0:

  • parse/*
  • highlight/*
  • add_from_folder
  • load_internal_themes
  • stack_matching
  • highlight_html

Much slower:

  • link_syntaxes takes ~200 ms instead of ~17 ms since it now involves serialization and compression. It's for the greater good though of course since linking is a one time thing while loading dumps happens over and over, and affects startup time in a much more fundamental way.

Another example of something that has gotten much faster is syncat. The improvement numbers I gave in the prototype code is still applicable.

Overview of the syntax lazy-loading work:

(Note that MSRV CI will fail due to #397 not being merged)

(Note that projects that have their own code for dumps (like bat) is not affected by this PR, but for "normal" clients this PR is important.)

Since the lazy-loaded syntaxes already has their data compressed, it is only
detrimental to performance to compress another time. So stop compressing the
SyntaxSet and add APIs for clients too.
/// data that has been embedded in your own binary with the [`include_bytes!`]
/// macro.
#[cfg(any(feature = "dump-load", feature = "dump-load-rs"))]
fn from_uncompressed_data<T: DeserializeOwned>(v: &[u8]) -> Result<T> {
Copy link
Collaborator Author

@Enselic Enselic Nov 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the spirit of #98 I chose to return a Result<T> here instead of T as from_binary. I also chose to call it _data instead of _binary since I think _data is clearer. But I am of course willing to make any changes that my reviewer(s) request.

@Enselic
Copy link
Collaborator Author

Enselic commented Dec 23, 2021

I'm not entirely comfortable to merge this PR without any review, so I'll leave this open for at least another month.

After it is merged, it would be fantastic with a new syntect release, because we can then practically immediately make a bat release that makes use of lazy-loading of syntaxes.

Copy link
Owner

@trishume trishume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, this isn't that big or weird of a change so I would have been fine with you merging this on your own after a month, for reference.

I'm happy to do a release now! Do you know offhand whether any important public APIs changed in a breaking way? I remember one minor change to something public that I'm pretty sure nobody uses, that I'm fine doing in a non-major release, but was there anything other than that? If you don't remember I can look through myself.

@trishume trishume merged commit b041cb6 into trishume:master Dec 23, 2021
@Enselic Enselic deleted the stop-compressing-outer-syntax-set branch December 24, 2021 06:38
@Enselic
Copy link
Collaborator Author

Enselic commented Dec 24, 2021

Looks good, this isn't that big or weird of a change so I would have been fine with you merging this on your own after a month, for reference.

Good to know, I'll take this into account when evaluating complexity/risk of PRs in the future 👍

Looking forward to the release! I quickly skimmed through git diff v4.6.0..origin/master and git log v4.6.0..origin/master --no-merges --oneline and came up with this set of changes since v4.6.0:

## [Version 4.7.0](https://github.com/trishume/syntect/compare/v4.6.0...v4.7.0) (2021-12-xx)

- Add ScopeRangeIterator
- Add CI check for Minimum Supported Rust Version. This is currently Rust 1.51.
- Fix lots of build warnings and lints
- Make 'plist' dependency (used for loading themes) optional via new 'plist-load' feature
- Make looking up a syntax by extension use case-insensitive comparison
- Add Criterion benchmarks for a whole syntect pipeline and for from_dump_file()
- Make from_dump_file() ~15% faster
- Blend alpha value on converting colors to ANSI color sequences
- Remove ContextId::new() from public API to support lazy-loading of syntaxes
- Lazy-load syntaxes to significantly improve startup time
- Replace lazycell with once_cell to fix crash on lazy initialization
- Fix sample code in documentation to avoid double newlines

In terms of changes to the public API, I checked the diff using my home-cooked method:

#!/usr/bin/env bash
set -o errexit -o nounset -o pipefail

RUSTDOCFLAGS='-Z unstable-options --output-format json' cargo +nightly doc --no-deps

jq  '.index | .[] | select(.crate_id == 0) | select(.visibility == "public") | .name' target/doc/syntect.json | sort > public-api-symbols.txt

Diffing the output of v4.6.0 with origin/master produces the following list of changes:

  • ScopeRangeIterator and ScopeRangeIterator::new() added
  • ContextId::new() removed
  • dump_to_uncompressed_file() added
  • from_uncompressed_dump_file() added

To be precise; the diff of adding fn new() and remvoving fn new() cancels out, so I need to polish that script. But it is a good sanity check at least. Anyway, it was "ContextId::new() removed" that you were thinking of. Apart from that there should be no change that is "risky".

I did all this in a bit of a rush since now I need to prepare for Christmas, but it should be a good baseline for you when you double-check the diff yourself :) Feel free to remove and add and do whatever changes you like to my draft!

@trishume
Copy link
Owner

Thanks so much for figuring out the changelog for me, that made things much quicker! I just pushed the release and published v4.7.0 on crates.io: https://github.com/trishume/syntect/releases/tag/v4.7.0

Thanks again for all your great contributions and your patience with how spottily I've put time into maintenance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants