
Discuss: File size of distributable for the browser #2846

Closed

joshgoebel opened this issue Nov 11, 2020 · 8 comments
Labels

  • autoclose: Flag things to future autoclose.
  • big picture: Policy or high level discussion
  • discuss/propose: Proposal for a new feature/direction
  • help welcome: Could use help from community

Comments


joshgoebel commented Nov 11, 2020

I wanted to open a discussion about the direction we're headed with regard to file size. It's come up recently in talking to StackOverflow, but otherwise I don't hear much about it.

First some history

In absolute terms we've grown quite a bit in the past year or so. All numbers below are gzipped sizes.

  • When I came on board in Sept 2019: ~20kb (:common subset, gzipped)
  • Today: ~37kb (:common, gzipped)

We've close to doubled the size of our common distributable. In that time, though, we also added several new languages to :common:

  • Go
  • Kotlin
  • Less and SCSS
  • Lua
  • Rust
  • Swift
  • Typescript

33% of our size increase comes from these new languages alone (~20kb to 28kb). The rest comes from numerous grammar improvements, parser improvements, and so on.

Here and Now / My Thoughts

We have our new "higher fidelity" initiative in #2500. Both 37kb and 20kb seem tiny to me. Yes, it's possible to create much larger builds: the full library (with every grammar) weighs in at a whopping 272kb.

All the feedback I see here on issues is of the "please, better highlighting, highlight more, highlight better" variety... I can't remember anyone pushing back with "the library is too large, make it smaller". I wanted to open the topic to see if anyone has any thoughts on this.

Personally I feel that our situation now is good and that increasing the bundle even 30-40% would be a win if we end up with much more nuanced highlighting as a result. (I don't think the size will actually increase that much, though.) I don't see how we can keep the size the same as we pursue higher fidelity and more nuanced highlighting. Many of the recent "language reboots" (LaTeX, Mathematica, etc.) have seen huge improvements in those grammars, but also a significant increase in grammar size.

I still think a very popular use for Highlight.js is on a small website/blog where one is using a subset of the languages, not a full build (or anything even close). Then on the other end you have huge sites like Discourse and StackOverflow building larger bundles. In those cases I think the right solution (if size becomes a problem) is to lazy-load the grammars on demand, which we've always made easy to do and which just got easier with my PR eliminating all run-time dependencies between languages.

A good portion of our size is keyword bundles... There has been talk recently of whether (in some languages) we could detect CamelCaseClassThingy with a pattern rather than a hard-coded list... and while we could do that, removing some keywords would have detrimental effects on our auto-detection capabilities, which for many languages are highly dependent on large keyword lists.
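As a rough illustration of the pattern-based alternative (this regex is purely illustrative, not a shipped grammar rule):

```javascript
// Illustrative only: match identifiers that look like CamelCase class names
// (an initial capitalized word followed by at least one more capitalized
// word), instead of keeping a hard-coded keyword list.
const CAMEL_CASE_CLASS = /\b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+\b/g;

function findClassLikeNames(code) {
  return code.match(CAMEL_CASE_CLASS) || [];
}
```

The trade-off described above still applies: a pattern like this can find unknown class names, but it carries none of the per-language signal that large keyword lists give auto-detection.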

Also, there are other highlighters. It's always been my advice that if small size is a key requirement, Prism might be a better choice, as they tend to rely a lot more on tight regexes, simpler grammars, and dependency stacking, which helps them keep the size of each grammar smaller.

So I see our library continuing to grow slowly in size with every new release... and continuing to highlight with more nuance... with of course continued improvements to the parser and auto-detect when possible.

  • Does anyone think this is the wrong direction?
  • Should we have some sort of size cap on 1st party languages?
  • Any other thoughts?

Note: Currently every language is built as a stand-alone module - which hurts our non-compressed size - since some dependency modules end up being duplicated in the source... this should have less bearing on the final gzip size though and there are also plans to fix this in the future (when using the official build system to build a monolithic distributable).

joshgoebel added the enhancement, help welcome, and big picture labels and removed the enhancement label on Nov 11, 2020
joshgoebel added the discuss/propose label on Jan 7, 2021
joshgoebel added the autoclose label on Feb 14, 2021

aduh95 commented Feb 17, 2021

I have a take on this (it might be ignorant of how people actually use highlight.js, but here it is): recommend using ESM and stop caring about the bundled size. With a promise-based approach, we could use dynamic import (https://caniuse.com/es6-module-dynamic-import) to lazy-load only the language parsers the user is actually using, and have a pretty clean API:

<link rel="stylesheet" href="/path/to/styles/default.css">
<!-- IE compatibility -->
<script src="/path/to/promise-polyfill.min.js" nomodule ></script>
<script src="/path/to/highlight.min.js" nomodule ></script>
<script nomodule >hljs.highlightAll().catch(console.error);</script>
<!-- Evergreen browsers -->
<script type="module">
import hljs from '/path/to/highlight.mjs';
hljs.highlightAll().catch(console.error);
</script>

Using this approach, only IE users would suffer from the bundle size.

We would have to make some changes in the underlying API to deal with Promise and import():

import hljs from "./highlight.js/lib/core.js";

// registerLanguage should accept lazy loaded `Language`:
hljs.registerLanguage('c', () => import('./highlight.js/lib/languages/c.js'));
hljs.registerLanguage('d', () => import('./highlight.js/lib/languages/d.js'));

// the actual fetching of the language module can be triggered later by a user
// event – or they are never fetched if they are not actually needed.
someForm.addEventListener('submit', (ev) => {
  ev.preventDefault();
  // highlight now returns a Promise
  hljs.highlight(someForm.language.value, someForm.code.value)
    // the correct module is loaded only when we need it
    .then(result => {
      // update the highlighted HTML…
    })
    .catch(err => {
      // the Promise rejects if someForm.language.value doesn't correspond to a
      // registered language or an alias.
      alert('Unknown language?');
      console.error(err);
    });
});

I might be wrong, but I am under the impression that most highlight.js users use a subset of the languages it provides (and sometimes only one). If we are moving to ESM in v11, I think we should do the extra work of providing a Promise-based API; users would benefit from it.


joshgoebel commented Feb 18, 2021

Thanks for chiming in. This is definitely all very interesting. I had originally added lazy loading as a feature request but no one seemed to care about it, so I let it go.

A few thoughts:

  • We don't support IE at all, so it's a non-concern.
  • Auto-detection requires all the bundled languages to be loaded and many installs that highlight user facing content (forums, etc) use auto-detection. So the benefit for those users would be far less (though perhaps interlacing the highlighting with the downloading would be a bit faster). So someone setting this up "wrong" in the worst case could be adding 180+ fetch requests to their project without even realizing it...
  • Language aliases and other metadata (which are needed at runtime) require the language to first be loaded (to resolve the aliases). I suppose this could be moved to registration time, but it seems annoying.
  • The biggest benefit here would be say a blog with a single language in a single post... in that case the single language needed would be downloaded and nothing more.
  • I think a lot of people bundle because it's a familiar and reliable system (if it ain't broke...) - so I'm in no hurry to break those setups. So I feel like we'd need to support both sync and async strategies, which adds complexity.
  • I really wish we had a beta channel (that was active) to allow us to test large architectural changes like this prior to just publishing and hoping for the best.
  • Prior history: Proposal: JIT loading of Javascript grammars via CDN #2405

Questions:

  • Is there a story for how import would mix with Subresource integrity protection?
  • What do bundlers (like Rollup, or others) do with all this dynamism? Do they just flatten it so that promises immediately resolve and everything is still static and resolved during compile?
  • There are inter-grammar dependencies. For example a large HTML file with a tiny snippet of JS at the end would not know it needed the JS grammar until it had parsed almost the entire file... would that be fine, or would those dependencies need to be bubbled to the surface so they could be fetched ASAP when an HTML highlight began? I also suppose not every HTML file includes JS... so perhaps you could argue the other direction also.

The original idea of upgrading npm while leaving the browser side alone (for now) was appealing in that:

  • it makes a pretty MAJOR change on a single side only (the server)
  • it allows users to opt-in to using the npm modules in the browser via import
  • everyone else using our pre-built CDN JS assets notices nothing (because they are identical)

Being a tiny team with very limited resources, we tend to be a bit risk averse, and typically that serves us well.


aduh95 commented Feb 18, 2021

Note that I'm not suggesting we remove the bundled CDN browser script: I think we should keep it (for people who don't want to deal with ESM, or want to support older browsers, or don't want to deal with lazy-loading). What I'm suggesting is that we let its size grow, consider it a non-concern, and nudge people toward the ESM version if they want to save some of their users' bandwidth.

  • Auto-detection requires all the bundled languages to be loaded and many installs that highlight user facing content (forums, etc) use auto-detection. So the benefit for those users would be far less (though perhaps interlacing the highlighting with the downloading would be a bit faster). So someone setting this up "wrong" in the worst case could be adding 180+ fetch requests to their project without even realizing it...

My feeling is that web pages that include snippets of all 190 languages must be pretty rare, and that most pages have something like 10 different languages tops, though I have no data to back this up. Anyway, if your website is likely to contain all the languages, do use the bundled version; I agree lazy loading is not for you.

  • Languages aliases and other meta-data (which are needed at runtime) require the language to first be loaded (to resolve the aliases). I suppose this could be moved to registration time, but it seems annoying.

True, although that's something we can solve at build time.

  • Is there a story for how import would mix with Subresource integrity protection?

Not really: https://stackoverflow.com/questions/45804660/is-it-possible-to-use-subresource-integrity-with-es6-module-imports. The Import Assertions proposal has reached stage 3, and SRI may be added to the spec at a later time, but currently dynamic import()s are unfiltered. <script type="module"> has the same SRI implementation as the nomodule one.

  • What do bundlers (like Rollup, or others) do with all this dynamism? Do they just flatten it so that promises immediately resolve and everything is still static and resolved during compile?

They usually have an option to let the user choose (e.g. for Rollup it's called inlineDynamicImports).
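For reference, a minimal Rollup config sketch using that option (inlineDynamicImports is a real Rollup output option; file paths here are placeholders):

```javascript
// rollup.config.js (sketch): with inlineDynamicImports enabled, Rollup folds
// the targets of dynamic import() calls into the single output bundle instead
// of emitting separate chunks, so the promises resolve against inlined code.
export default {
  input: 'src/index.js',
  output: {
    file: 'dist/bundle.js',
    format: 'iife',
    inlineDynamicImports: true,
  },
};
```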

  • There are inter-grammar dependencies. For example a large HTML file with a tiny snippet of JS at the end would not know it needed the JS grammar until it had parsed almost the entire file... would that be fine, or would those dependencies need to be bubbled to the surface so they could be fetched ASAP when an HTML highlight began? I also suppose not every HTML file includes JS... so perhaps you could argue the other direction also.

My take on this would be to be as lazy as possible, because it's easier 😅 for us to implement, and for the users to understand what the library is doing. We should document that if the user knows what languages they are going to use, they can preload the relevant modules:

<link rel="modulepreload" href="/path/to/highlight.js/languages/js.js">
<link rel="modulepreload" href="/path/to/highlight.js/languages/html.js">

<!-- or -->

<script type="module" src="/path/to/highlight.js/languages/js.js" integrity=""></script>
<script type="module" src="/path/to/highlight.js/languages/html.js" integrity=""></script>

If they can't be bothered, they can either accept the tradeoff of lazy-loading, or the file size of the browser bundle.


joshgoebel commented Feb 18, 2021

My feeling is that web pages that do includes snippets of all 190 languages must be pretty rare, and that most pages must have something like 10 different languages tops – but I have no data to back this up though. But anyways, if your website is likely to contain all the languages, do use the bundled version, I agree lazy loading is not for you.

I wasn't referring entirely to actual real-life webpages or use cases, but rather to implementor behavior, including the possibility of confusion and mistakes.

  • There are over 100 million downloads of our default (common) build today (via a single CDN), which is 39 languages and 41kb gzipped.
  • This will be getting smaller with v11 as we rip out some languages from common.
  • We prevent this 190 language "mistake" now by simply not shipping an "all_languages.js"
  • If someone really wants to build/package this manually (from source), then they can.
  • I have no idea how many of those 39 are "used at a time", but 41kb to me is not "heavy" at all IMHO.
  • StackOverflow adds quite a few additional languages to this and still comes in at only around 80kb, I think, which is not onerous (to most implementors)
  • And of course the penalty for loading all 190 with Node.js is much smaller since the code is all local and bandwidth is no issue.

So it seems the actual need here is for a tiny fraction of use cases:

  • Those that need LOTS of languages (where download size really starts to matter)
  • PLUS they aren't using auto-detect (so they could truly take advantage of lazy loading)

Given all that, I feel right now lazy-loading is best handled outside of core.

This issue was originally created following some discussions with Stack Overflow, who are extremely size sensitive. It took 3 months before anyone else chimed in on the topic. I just don't think many people actually need this functionality (or care about size super strongly) - and even if I'm mistaken, I don't (at this time) see huge advantages to it being in core vs a plug-in/add-on.

It seems (esp. after we release an ESM npm package) one could very easily write a small "wrapper" package such as highlightjs-async that provided a custom index (with additional metadata and async registration calls) and then replaced/wrapped key API functions with async versions:

  • highlight
  • highlightBlock
  • registerLanguage

I'd suggest this is even quite possible today without much effort using fetch instead of modules.
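A minimal sketch of how such a wrapper could manage lazy registration (all names here are hypothetical; this models the pattern, not the real highlight.js API):

```javascript
// Hypothetical "highlightjs-async"-style wrapper: languages are registered as
// loader functions, and the first highlight() call for a language awaits its
// loader (which in practice would be a dynamic import() or a fetch()).
class AsyncHighlighter {
  constructor() {
    this.loaders = new Map();   // language name -> () => Promise<grammar>
    this.grammars = new Map();  // language name -> loaded grammar
  }

  registerLanguage(name, loader) {
    this.loaders.set(name, loader);
  }

  async highlight(name, code) {
    if (!this.grammars.has(name)) {
      const loader = this.loaders.get(name);
      if (!loader) throw new Error(`Unknown language: ${name}`);
      this.grammars.set(name, await loader()); // network fetch happens here, once
    }
    // A real wrapper would now run the core highlighter with this grammar;
    // here we just return a stand-in result object.
    return { language: name, value: code, grammar: this.grammars.get(name) };
  }
}
```

The same shape works whether the loader uses import() or fetch(); the wrapper only cares that it resolves to a grammar.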


aduh95 commented Feb 18, 2021

I just don't think many people actually need this functionality (or care about size super strongly) - and even if I'm mistaken, I don't (at this time) see huge advantages to it being in core vs a plug-in/add-on.

Fair enough; I think I share your POV on this: most people are fine with highlight.js the way it is. That doesn't mean we shouldn't try to improve it (I mean, that's why you opened this issue, right?), but apologies if my comments were off-topic.

I don't (at this time) see huge advantages to it being in core vs a plug-in/add-on.

What if a magic PR landed in your inbox that added support for lazy-loading in core? Are you open to being convinced, or would you rather wait for another time?

@joshgoebel

but apologies if my comments were off topic.

No need for an apology - I'd say you were definitely on-topic.

What if a magic PR came down your inbox that added support for lazy-loading in core? Are you open to be convinced, or would you rather wait for another time?

Well, it's not just about who does the initial work (and believe me I appreciate the willingness). That's often only the tip of the iceberg. Often the long-term support/maintenance of a feature can far outweigh any initial outlay of development time. Our current trend has been towards shrinking and simplifying the core codebase and public API - while extending the plug-in API as it makes sense... things not critical to "core" functionality are then pulled out into plugins, etc. The old but time tested "less is more" approach.

With our current tiny team a vibrant plugin community with individual authors who are passionate about maintaining and supporting their own particular plugins has a lot of benefits.


jimmywarting commented Apr 13, 2022

I like the code-splitting idea in #2846 (comment). Did anything like this ever get implemented?

@joshgoebel

Not fully. Perhaps also see my thoughts just added to #3199. We do currently ship everything as separate ESM modules: core, all the grammars, etc.... You can dynamically import them and register them if you write the code for it... What doesn't exist is the glue to tie it all together into some type of highlightjs-async library...
