Policy: Discuss extras repository for additional languages #2149

Closed
joshgoebel opened this issue Oct 4, 2019 · 64 comments
Labels
big picture Policy or high level discussion

Comments

@joshgoebel
Member

I see we have a lot of requests for new languages in the PRs... if the issue is time/maintenance over time, etc... might we perhaps consider an "extra" repository or something with community/unsupported syntaxes? That way the criteria to be approved could be lessened a little and obscure languages that might not really make sense in Highlight.js proper could still have a home?

Or is the idea that eventually we'll get to them?

@joshgoebel
Member Author

Are we still moving languages into repositories? Is that ALL the languages, or only "additional" ones?

@marcoscaceres
Contributor

Only new ones because then we can make submitters the repo maintainers.

@joshgoebel
Member Author

So what is the policy on new (or very old) PRs already in the queue? Is there some cut off date or pretty much the answer now is always "separate repo"?

@marcoscaceres
Contributor

Yep, separate repo.

@jf990
Contributor

jf990 commented Oct 5, 2019

The separate language repos concept addresses a maintenance problem that has caused consternation here since @isagalaev slowed down the regular maintenance. I think the separate repo idea is going to address that but we need to discuss the many problems it introduces. We should decide how to handle these things because they change the way this library has been managed since its inception.

  1. Discovery. Available languages and how to use them are documented on highlightjs.org. By putting new languages in separate repos we are effectively invalidating this entire page. We could no longer have a single source of truth as to what languages are available and how to get them. Even worse, old languages that were part of the original highlight.js will be handled and documented differently than the new ones. This will make the library difficult to use, and documentation will be all over the place as different language authors choose their own way of doing things without the oversight of the maintainers. We should figure out and document some minimum standards for language contributions. To do this right, we should even move all the existing languages into this new format so that all languages are handled the same way, making use of the library consistent. We should also come up with a metadata file that each language repo is responsible for maintaining so that we can automate some aspects of discovery, testing, detection, and packaging.

  2. Auto-detection. I'm not sure we are going to easily have language auto-detection work as well as it does now. With the separate repos there's no easy way to run the test against all the other languages and determine if relevance is working.

  3. Testing. Current testing methodology is well defined and there is a process to handle it. This method is no longer valid and individual repos are on their own. So far with the new languages definitions we have not established any requirement for testing, and we see that some do and some don't.

  4. Packaging and deployment. We probably can no longer support a page like Getting highlight.js, or not for the new languages. This may or may not be a real problem for some as packagers and modern workflows may reduce this requirement, but some developers are still loading a custom package in a script tag and would be required to completely change their method if they wanted to include an individual language. It would also be not so nice to offer that page and not include the newer languages, so we have some thinking to do there.

  5. Quality. Since we are now deferring individual language maintenance to the individual authors (or those interested in contributing) and not the central highlightjs maintainers, going forward the quality and maintenance of languages will be indeterminate at best. Our current process at least has the maintainers as the gatekeepers making sure we don't include poorly implemented, undocumented, or buggy language contributions. But the new way will remove that oversight. For a community driven project I guess this is the expectation but in the past this repo has been very well maintained, and I think that is part of the reason it gained such traction and popularity.

Most of this is solvable with new documentation on the language contributors page and some new development such as a language registry and supporting process scripts like test and build to support that. But it's all a lot of work.
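To make the metadata-file idea from item 1 a little more concrete, here is a minimal sketch of a check that central tooling could run against each language repo. The field names (name, entry, tests, maintainers) are hypothetical and only illustrate the shape; nothing here is an agreed format.

```javascript
// Hypothetical required fields for a per-repo language metadata file.
const REQUIRED_FIELDS = {
  name: "string",       // language identifier, e.g. "pancakes"
  entry: "string",      // path to the grammar module inside the repo
  tests: "string",      // path to the repo's test directory
  maintainers: "array", // GitHub handles responsible for the grammar
};

// Returns a list of human-readable problems; an empty list means the metadata passes.
function validateLanguageMetadata(meta) {
  const problems = [];
  for (const [field, kind] of Object.entries(REQUIRED_FIELDS)) {
    const value = meta[field];
    const ok = kind === "array"
      ? Array.isArray(value) && value.length > 0
      : typeof value === kind && value.length > 0;
    if (!ok) problems.push(`missing or invalid "${field}"`);
  }
  return problems;
}

// Example metadata (all values are placeholders).
const exampleMetadata = {
  name: "pancakes",
  entry: "src/pancakes.js",
  tests: "test/",
  maintainers: ["some-maintainer"],
};
console.log(validateLanguageMetadata(exampleMetadata)); // []
console.log(validateLanguageMetadata({ name: "pancakes" }));
```

A registry build could then refuse to list any repo whose metadata returns a non-empty problem list.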

I still haven't been able to figure out how to update the documentation. Have we documented how stuff in docs gets built and deployed to highlightjs.org and highlightjs.readthedocs.io?

BTW, really nice work @yyyc514 going through all those issues 🚀

@joshgoebel
Member Author

I'm not sure we are going to easily have language auto-detection work as well as it does now. With the separate repos there's no easy way to run the test against all the other languages and determine if relevance is working.

Well, the "test it" part is just a tooling problem but I think logically this will prove impossible with the infinite range of possibilities as languages grow and grow, yes.

Testing

I think to be "semi-official" or included in some type of global list that there should be some minimal amount of specs that a syntax is required to pass.

But obviously if someone just wants to rip a language down from somewhere and use it then we can't stop them. We do have power to decide what gets hosted at highlightjs org though, so that's something.

It would also be not so nice to offer that page and not include the newer languages, so we have some thinking to do there.

Well, I wasn't going to say it publicly, but I guess now I will. I can't speak for everyone but I can't imagine this ban on new languages in core is ABSOLUTE. It's to prevent a proliferation of 100 tiny languages stagnating that no one has time or inclination to maintain (or to babysit PRs, tests, QA, etc). If the next Swift comes around and 50% of the world is writing code in it, I imagine we'd consider adding it to core and someone would make time to maintain it.

Although if we figure out this whole "separate repo" thing perhaps eventually none of the languages will be in core... but seems that's a bit far away at the moment.

Our current process at least has the maintainers as the gatekeepers making sure we don't include poorly implemented, undocumented, or buggy language contributions. But the new way will remove that oversight.

Yes, this is definitely a concern, and it's why I think there should be some gateway between "one of the core contributors has read this, or agreed it passes 'reasonable' specs" and "who knows, I just found this lying around somewhere". It would be very bad if someone installs a shiny new "Pancakes 1.0" syntax that locks up their website and then blames Highlight.js instead.

I still haven't been able to figure out how to update the documentation. Have we documented how stuff in docs gets built and deployed to highlightjs.org and highlightjs.readthedocs.io?

No idea. Someone who knows how needs to find time to write up something ROUGH... and then we need to find people who have the time and inclination to help keep docs updated. They could iterate on the rough docs and push them forward.

BTW, really nice work @yyyc514 going through all those issues 🚀

Thanks.

@jf990
Contributor

jf990 commented Oct 6, 2019

@yyyc514 I had offered to pitch in on the docs in some other issue here, and in particular to help rewrite the contribute a new language guide. I had worked on redoing the languages I help maintain as separate repos to work out the way to do it.

I set up a GitHub template project; we can review this effort and see if it's on the right track. It sets up a template project with a unit test to get started with a new language.

https://github.com/jf990/highlightjs-language-template

@joshgoebel
Member Author

joshgoebel commented Oct 6, 2019

@jf990 See my thoughts on this thread regarding auto-detection:

#1213

I actually think there are some pretty great ideas (or the beginnings of ideas) there. In this context, instead of saying "Yes, all 10,000 languages from all maintainers have NO conflicts!", we'd run the tests and then say:

Hey, you, maintainer of "Pancake 1". Your language is 95% "sticky", it thinks almost any language is Pancake... our recommended threshold is Y... you need to tune your relevancy scores to prevent false positives against other languages.

Need a better word than sticky, lol.

And then we'd have a metric we could use for including in a master list, including on the "main website" for build packs, etc...
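As a rough illustration of such a metric, the sketch below computes, for each language, the fraction of snippets written in other languages that auto-detection wrongly attributes to it. The function name `stickiness` and the sample data are hypothetical; `detect` is injected and is assumed to wrap an auto-detection call such as `hljs.highlightAuto(code).language`.

```javascript
// samples: array of { language, code } where language is the known language.
// detect: function (code) => detected language name (injected, see lead-in).
function stickiness(samples, detect) {
  const languages = [...new Set(samples.map((s) => s.language))];
  const foreign = new Map(); // language -> snippets written in other languages
  const claimed = new Map(); // language -> how many of those it wrongly claimed
  for (const lang of languages) {
    foreign.set(lang, 0);
    claimed.set(lang, 0);
  }

  for (const { language, code } of samples) {
    const detected = detect(code);
    for (const lang of languages) {
      if (lang === language) continue; // only count snippets from other languages
      foreign.set(lang, foreign.get(lang) + 1);
      if (detected === lang) claimed.set(lang, claimed.get(lang) + 1);
    }
  }

  const result = {};
  for (const lang of languages) {
    const total = foreign.get(lang);
    result[lang] = total === 0 ? 0 : claimed.get(lang) / total;
  }
  return result;
}

// Toy data and a toy detector that calls everything except PHP "pancake".
const detectionSamples = [
  { language: "pancake", code: "flip stack" },
  { language: "php", code: "<?php echo 1;" },
  { language: "sql", code: "SELECT 1" },
];
const toyDetect = (code) => (code.startsWith("<?php") ? "php" : "pancake");
console.log(stickiness(detectionSamples, toyDetect));
// { pancake: 0.5, php: 0, sql: 0 }
```

A maintainer could then be asked to tune relevance until their score drops below whatever threshold is recommended.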

@joshgoebel joshgoebel changed the title Discuss extras repository for additional languages Policy: Discuss extras repository for additional languages Oct 7, 2019
@joshgoebel joshgoebel self-assigned this Oct 7, 2019
@joshgoebel joshgoebel added the big picture Policy or high level discussion label Oct 7, 2019
@jaredlll08
Contributor

So I have been looking at a few of the issues over the past few days, and feel I can maybe provide some outsider input.

Before I start, I want to make it clear, I am not a proper JavaScript Developer, I primarily work in other back-end languages, so my approach may be naive or going against best practices, but I do believe it will solve most issues.

The approach is very similar to what GitHub does with their Linguist library, and that is using Git Submodules.

I have made a proof of concept for everything I am going to say below, which can be found here. Like I said above, I'm not that well versed in JS build tools (and it is currently 5AM), so I did a dirty hack to get it to work. In reality it will need to be all or nothing, with all languages following the same format (being in a folder with the language name, for example), but having a standard format for languages is good in my opinion.

Things I changed in the proof of concept:

  • Languages are now loaded from a folder, so src/languages/$LANGUAGE_NAME/*.js
  • Reading snippets uses the name of the folder the language is in (so $LANGUAGE_NAME) instead of the js file (could be changed, I did it because my file was called index.js, you could enforce a proper name when making the repo)
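The folder layout described in these two points can be sketched as a small helper that groups grammar files by their language folder. This is not the proof-of-concept code; `groupByLanguage` is a hypothetical name, and it works on a plain list of paths so that it stays independent of any build tool.

```javascript
// Groups grammar files by the folder they live in, assuming the
// src/languages/<LANGUAGE_NAME>/<file>.js layout described above.
function groupByLanguage(paths, root = "src/languages/") {
  const byLanguage = new Map();
  for (const path of paths) {
    if (!path.startsWith(root) || !path.endsWith(".js")) continue;
    const parts = path.slice(root.length).split("/");
    if (parts.length !== 2) continue; // skip files that are not exactly one folder deep
    const [language, file] = parts;
    if (!byLanguage.has(language)) byLanguage.set(language, []);
    byLanguage.get(language).push(file);
  }
  return byLanguage;
}

const grouped = groupByLanguage([
  "src/languages/php/index.js",
  "src/languages/php/helpers.js",
  "src/languages/sql.js", // old flat layout, ignored by this sketch
  "README.md",
]);
console.log(grouped); // Map with a single key, "php"
```

The language name comes from the folder, so a file called index.js is fine, which matches the second point above.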

Right now, based on issues I have read on this repo, if someone wants a new language, they make an issue / PR on this repo, then a @highlightjs member creates a repo for their language, and adds the original author of the issue / PR to that repo for their language to live in.

As @jf990 said, while it helps with maintenance, it comes with a few drawbacks, I believe Git Submodules can fix most of those, like so:

  1. Discovery

This is the whole reason why I am even commenting here: I made support for a language, and now trying to get products to use it is an uphill battle. Most developers feel that I should try to get my language supported in HighlightJS itself, which right now isn't an option.

So the solution: since manual work is already required when someone wants to add language support (making a repo for them and adding them as a collaborator), it shouldn't be an issue to make a commit to this repo, adding that newly created repo as a submodule, like I do in this commit of the PoC.

This will only add the submodule to the src/languages folder, so I do not have a solution for files in the test folder, I can look more into it if this solution is something that will actually be considered and would work for this project.

So with the 3rd party languages now in the languages folder, they are treated exactly the same as a "first party" language: they are built when running node tools/build and show up on build/demo/index.html, and I believe tests are still run on them; I don't see a reason why they wouldn't be.

  2. Auto-detection

See above about third party languages being treated as first party, so this would no longer be an issue.

  3. Testing

Like I said above, I don't see why tests wouldn't run on these third party languages, besides the issue of getting other files into the test directory.

  4. Packaging and deployment

Once again this is all handled because the files are physically there when the commands are run. I'm not sure how the Getting highlight.js page is generated, but if it is generated based on the src/languages folder, it should be trivial to move to the new src/languages/$LANGUAGE_NAME format.

The one thing that would need doing, or at least would be a quality of life feature, would be having the build script update all the submodules to their latest commit, but this can be done manually in the case of broken commits on submodules, or just forcing a submodule to use a specific commit.

  5. Quality

I have no solution for this. The only thing that comes to mind is implementing a set of requirements and guidelines for new languages. GitHub Linguist (example), for instance, requires the language to be used in "hundreds of repositories". For them, being GitHub, that makes sense, since the PR would affect those repositories. For HighlightJS it gets a bit tricky, but you could use the same metric as GitHub, and that way you could help ensure that:

  1. A language is used (I don't know how highlightJS members feel about this, the docs do say any and all languages are allowed).
  2. There are open source developers using the language (if the maintainer of the highlightJS language support disappears, it is possible a new maintainer who has experience with the language could step up).

Or highlightJS could work on a different metric, I will admit that this doesn't seem like an easy issue to solve.

As for the documentation, I have used ReadTheDocs in the past and have some experience with it, so I am happy to help figure out how it works, and from there document any of the changes I listed above (if implemented) to help ensure that everyone knows what the new protocol is.

I hope this all makes sense, I am happy to go into more detail on git submodules if need be, or even brainstorm a different solution (possibly having the build script traverse the highlightJS organization and pull the languages from there, that way there are no submodules).

@joshgoebel
Member Author

joshgoebel commented Oct 12, 2019

From another thread (I'm replying to @egor-rogov) (#1829):

Egor: This language is already in the core. Why should we move it away?

Well, already in or not is a very weird (meaningless?) metric (IMHO) to decide whether that's where they BELONG. I thought the whole idea of not letting more languages in [to core] had to do with developer time/maintenance/responsibility/who is in the best position to maintain the language long-term, etc... so surely the right way to think about existing languages ALSO is how they fare on those exact same metrics...

This "already in core" vs "sorry, you just missed the cut-off!" is a VERY weird and arbitrary line.

@joshgoebel
Member Author

Not a fan of git submodules, though I haven't worked with them in years. It's possible they have improved. Back then all I heard was whining about how annoying they were.

I do see the advantage of "just works" (other than for tests, which you didn't go into in great detail)... but I don't think the paths is the hard part...

Having a languages.toml or languages.json file that anyone could contribute to and a smart build tool (doesn't have to be that smart) could accomplish the same thing.

Our build pipeline is crazy old and needs replacing anyways - so keeping it "as-is" isn't a priority.

@joshgoebel
Member Author

joshgoebel commented Oct 12, 2019

Another possible suggestion:

  • Make your own repo following a template with your own tests etc.
  • Add your grammar to a "blessed" languages.json in the master repo, make a PR
  • If it looks even semi-reasonable, we merge. (the goal being discoverability, not policing)
  • Fix tooling so ./tools/build -t node javascript cpp some_weird_3rd_party "just works"
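A minimal sketch of how a build tool might resolve the names passed on that command line, assuming a languages.json that maps third-party names to the repos hosting them. The manifest shape, the `resolveTargets` name, and the example.com URL are all hypothetical.

```javascript
// Hypothetical contents of a "blessed" languages.json in the main repo.
const thirdPartyManifest = {
  some_weird_3rd_party: { repo: "https://example.com/some_weird_3rd_party.git" },
};
// Stand-in for the grammars that already live in src/languages.
const coreLanguages = new Set(["javascript", "cpp"]);

// Splits requested names into core grammars, external grammars to fetch,
// and names the tool does not know about.
function resolveTargets(names, core, manifest) {
  const plan = { core: [], external: [], unknown: [] };
  for (const name of names) {
    if (core.has(name)) {
      plan.core.push(name);
    } else if (Object.prototype.hasOwnProperty.call(manifest, name)) {
      plan.external.push({ name, repo: manifest[name].repo });
    } else {
      plan.unknown.push(name);
    }
  }
  return plan;
}

const buildPlan = resolveTargets(
  ["javascript", "cpp", "some_weird_3rd_party", "typo_lang"],
  coreLanguages,
  thirdPartyManifest
);
console.log(buildPlan);
```

Reporting unknown names up front keeps a typo from silently producing a build without the requested language.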

I'm also a fan of a shared language repository, highlightjs-grammars. I think that has a LOT of advantages that aren't being considered yet... like higher visibility and more likelihood that the community will pitch in - one consolidated place for grammar issues, etc... I think the fear is that no one will "own it" and issues will go unanswered, etc...

@egor-rogov
Collaborator

The idea about submodules is very interesting. I use them (in another project, not on GitHub) and didn't find them annoying or something.
The huge advantage I see is no distinction between "in core or not" languages.
I think we can turn the test directory into test subdirectories for each language, so that repo owners have full access to all relevant contents.
It requires some more thought and experimenting, of course.

@egor-rogov
Collaborator

I'm also a fan of a shared language repository.

What do you mean by this, @yyyc514?

@egor-rogov
Collaborator

This "already in core" vs "sorry, you just missed the cut-off!" is a VERY weird and arbitrary line.

Absolutely! I'd really like to remove this barrier.

@joshgoebel
Member Author

Just another SHARED repository... so you have "core" then you have "extras" (which has a bunch of languages)... and we'd "police" core more carefully than "extras"... (if we plan to keep a distinction at all long-term).

The idea about submodules is very interesting.

Not opposed to trying if you've had good experiences. How do they work when the submodule just drops off the planet? or someone disappears from GitHub and takes their work with them? Easy to fix?

I think we can turn the test directory into test subdirectories for each language, so that repo owners have full access to all relevant contents.

Yeah, I think the tests would move into the languages and then "running the full suite" would have to be taught to look there if you truly wanted to run EVERYTHING.

@joshgoebel
Member Author

Submodules + tests is going to require fixing those silly annoying relevancy tests. ;-) Or would we simply not run them for submodule languages? I need to go back to giving that a little more thought.

It's bad enough with 184 languages it'd be even worse with 250...

@jaredlll08
Contributor

How do they work when the submodule just drops off the planet? or someone disappears from GitHub and takes their work with them? Easy to fix?

Funnily enough, when I was making my PR to linguist, this exact thing happened: someone deleted a repo that was being used as a submodule. Thankfully they were active and got GitHub support to restore it, but that is really not ideal.

I think a requirement for the submodules should be that they need to be under the highlightJS organization, that way only a team member can delete the repo and no one (in theory) can take their work with them.

@joshgoebel
Member Author

Oh so it works poorly? LOL.

@jaredlll08
Contributor

If that is what you want to take from what I said, sure, they work poorly when you have a submodule of someone else's repository, and they decide to delete the repository.

Which is what I addressed in the second paragraph.

If the repositories are under the @highlightjs organization (like this repository https://github.com/highlightjs/highlightjs-robots-txt for example, or any of the other repositories that have been made for third-party language support), then the only people who can actually delete that repository, are people in the @highlightjs organization, so in this case, it should be fine.

@joshgoebel
Member Author

joshgoebel commented Oct 12, 2019

Yeah, I followed that. It just sounded a lot better when the only thing we had to do was accept PRs to "link" them... rather than host them all as well. :-)

@jaredlll08
Contributor

Well you are currently hosting them, so it isn't much of a difference.

I myself wouldn't be too worried about people deleting repositories and taking their code, Linguist has over 300 submodules and if it happened that often, I'm sure they would have found a different solution.

If it does happen however, all that would need doing is to just remove the submodule, which would remove the language support, but this would most likely only happen on more third party languages, as you said:

if the next Swift comes around and 50% of the world is writing code in it, I imagine we'd consider adding it to core and someone would make time to maintain it.

So the "common" languages would probably have "official" highlightJS support, and you would only need to worry about the more "uncommon" language repositories getting deleted.

So if a submodule was deleted, there are a few things that could be done:

  1. Depending on the license, rehost the support. As long as someone has the language on their computer (which would be pulled when pulling from this repo), they could upload the language to a new repository and have highlightJS pull from that repository. This only works if the license permits this though.

  2. Write support for that language. If it is a must have language that somehow got deleted, a community member could write a new support library for it (this is the least ideal solution in my opinion)

  3. Remove the language. Just make it clear in the changelog why the language was removed, a simple: "Language X was removed because the author of the package, Y, deleted the repository holding it". You are then shifting blame onto user Y, and if people were actually using the language, a new maintainer could step up and write a new support library.

@joshgoebel
Member Author

Does using submodules become an issue when people want duplicates or have differing opinion on core style choices? How do we handle that? Someone has a PHP grammar that is MUCH better than ours (but perhaps it's too colorful, or it's too large, ours is more "minimal", etc)... do we just give it a different name and then let people build it by name?

IE, php-super?

I'd imagined some way that such things could "grow organically" over time then one day when it turns out everyone prefers php-super perhaps it would become php default, etc.

@jaredlll08
Contributor

In a scenario like that, whoever made php-super would probably have already run into the issue of highlightJS using the php name, so I would imagine they would have used a different name already.

I really do think this scenario is a bit out of scope for this issue. The way I would deal with something like this would be to add variants to languages, so instead of the class name being hljs php it would be something like hljs php super.

or better yet, add support for language replacement (not sure if this is already a thing), so the person who made php-super would register their language like:

hljs.replaceLanguage("php", hljsPHPSuper);

So instead of registering it as a new language, they register it as a replacement, and highlightJS will use their grammar instead.
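To show the intended semantics, here is a minimal sketch using a stand-in registry class. It is not the highlight.js internals, and `replaceLanguage` is the hypothetical method proposed above rather than an existing API; the placeholder grammar objects are likewise made up.

```javascript
// Minimal stand-in for a highlighter's language registry.
class LanguageRegistry {
  constructor() {
    this.languages = new Map();
  }

  registerLanguage(name, definition) {
    this.languages.set(name, definition);
  }

  // Hypothetical replaceLanguage: swaps an existing grammar and returns the
  // old one so callers can restore it; refuses names never registered.
  replaceLanguage(name, definition) {
    if (!this.languages.has(name)) {
      throw new Error(`cannot replace unknown language "${name}"`);
    }
    const previous = this.languages.get(name);
    this.languages.set(name, definition);
    return previous;
  }

  getLanguage(name) {
    return this.languages.get(name);
  }
}

// Placeholder grammar objects standing in for real definitions.
const corePhp = { flavor: "core" };
const hljsPHPSuper = { flavor: "super" };

const registry = new LanguageRegistry();
registry.registerLanguage("php", corePhp);
const previousPhp = registry.replaceLanguage("php", hljsPHPSuper);
console.log(registry.getLanguage("php").flavor, previousPhp.flavor); // super core
```

Failing on unknown names keeps a replacement from silently turning into a new registration under a misspelled name.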

In my opinion, grammars for already existing "core" languages would be denied, and would be opt in, if someone really wanted php-super they could pull the package in from NPM / CDN.

If in a few months everyone is using php-super, then maybe host a poll on what people want, if they want php-super to be native, or if they want the current php style, and go from there.

@joshgoebel
Member Author

For sure I'd support replaceLanguage (or re-register)... I was just thinking of what an 'open' ecosystem might look like. We have some opinions (that we aren't even consistent about) that would seem to be holding some grammars back... I'd like to see it even easier than what you describe for someone to decide they like a different flavor of PHP say, and just checkout that repository, build, and then they are using a custom package...

If in a few months everyone is using php-super, then maybe host a poll on what people want, if they want php-super to be native, or if they want the current php style, and go from there.

Sure, I guess I was just imagining it might happen more organically than that... let people vote with what they build - but then I'm not sure how many people build this themselves vs just use a packaged version...

Obviously if they're just using the default set then they're getting what core wants them to have in any case - and would have to plug in things on top of that.

@jaredlll08
Contributor

jaredlll08 commented Oct 12, 2019

The problem I see with voting with what they build, is how can you actually track that? Unless there is analytics code built into the build process, I don't see a feasible way to actually know what people are using.

What I do know though, is that a good number of people are using "what the core wants them to have", since they are just pulling it from NPM (based on weekly downloads) and not building it themselves.

The whole reason I am here is because I wanted Discord to support my language, but after speaking with people, the general consensus is that I should try and get my language into highlightJS itself, since they don't want to have to add another library, they just want to pull highlightJS and ideally have the language without any extra hassle.

I know I may have some bias, but I honestly think that the language situation should be sorted out in general, before worrying about someone making a different flavour of PHP, the people who care about having a different flavour of PHP, are most likely the same people who would be willing to build the package themselves to get that flavour.

Since right now new languages aren't being used, and to echo what was said in this comment

Creating a GitHub repo on your own that no one is ever going to find doesn't feel much like "contributing" to the project... It feels little bit like we're telling them "frack off, we don't really care you and your style contribution".

that applies to anything in this project, not only a style, a language grammar as well.

@joshgoebel
Member Author

joshgoebel commented Dec 22, 2019

I think the main problem is language support delivery. For example, I added support for a new language. How will developers add it to their current apps? I added a language years ago and I still do not see support for it in many of the apps I use. I have to manually load the language and process code through it. That is not plug-and-play.

There are several problems with this; you might want to see the related discussion here. (I've simply gone ahead and moved your thoughts as I've responded to them.)

But the hard problems such as lack of developer time and security concerns might be hard to solve.
Are these apps you refer to mostly "auto-detect" or "named" usage? With named usage there would be better opportunities to solve this with JIT dynamic loading of language modules... so just so long as the language modules were compiled on a known and trusted CDN it would "just work".

Security is still a very real concern though since obviously just loading any random files from a CDN that no one validates is a huge code injection attack waiting to happen.

Auto-loading is harder since right now the official languages are over 1 MB of JavaScript - not even counting 3rd party languages. Some will point out that only a few are really large, and there is some truth to this but then we're right back to having some party who plays god picking and choosing which languages are "blessed" and which are not. :-)

So from where I sit it seems like JIT loading really only works when you know the language you need to highlight in advance.

The way I see it, when HLJS finds a new language tag it first looks for this language support in a subfolder, let's say ./langs/lang-tag; if it does not find it there, it looks in a public CDN and then loads that language dynamically.

I think we'd be open to a PR that supported "just in time" loading of languages via CDN, where the CDN was configured at load time when HLJS was initialized. That could definitely be a small piece of the puzzle, allowing someone else to step in and run a "trusted" CDN source for a broader set of community languages.

This is what it means to contribute to language highlight. For instance, I go to one site where they publish articles like gitbook and start publishing my article and find that my language is not supported. I create the repository, add support, get approved by hljs, go back to my book and my code examples are highlighted.

I think this is a LONG way off. It would first require a blessed community repo and it would require everyone changing their configurations to automatically trust that repo. But if people are willing to put in the time and work towards that goal, that'd be awesome.

It might be easier to first add this support and then enable it for all core languages - such that if a language isn't compiled in we first try to fetch it from the official core library CDN (and make that easy to configure). That would instantly increase the # of languages available for highlighting.
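The lookup order discussed here (local subfolder first, then a CDN configured at initialization) could look roughly like the sketch below. The function names, the ".min.js" file naming, and the cdn.example.com base URL are assumptions made for illustration; the loader is injected and synchronous only to keep the example short.

```javascript
// Builds the ordered list of places to look for a language: the local
// subfolder first, then the CDN configured at initialization (if any).
function createLanguageLocator({ localPath = "./langs/", cdnBase = null } = {}) {
  return function candidateUrls(languageTag) {
    const file = `${encodeURIComponent(languageTag)}.min.js`;
    const urls = [`${localPath}${file}`];
    if (cdnBase) urls.push(`${cdnBase}${file}`);
    return urls;
  };
}

// Tries each candidate in order. tryLoad is injected and returns a grammar or
// null; a real browser loader would be asynchronous (script tag or import()).
function loadLanguage(languageTag, candidateUrls, tryLoad) {
  for (const url of candidateUrls(languageTag)) {
    const grammar = tryLoad(url);
    if (grammar) return { url, grammar };
  }
  return null;
}

// Usage with a fake loader that only "finds" the file on the CDN.
const locate = createLanguageLocator({ cdnBase: "https://cdn.example.com/hljs/" });
const fakeLoad = (url) => (url.startsWith("https://") ? { name: "pancakes" } : null);
console.log(loadLanguage("pancakes", locate, fakeLoad).url);
// https://cdn.example.com/hljs/pancakes.min.js
```

Leaving `cdnBase` unset disables the remote lookup entirely, so a site has to opt in to trusting a CDN.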

@jaredlll08
Contributor

Auto-loading is harder since right now the official languages are over 1 MB of JavaScript - not even counting 3rd party languages. Some will point out that only a few are really large, and there is some truth to this but then we're right back to having some party who plays god picking and choosing which languages are "blessed" and which are not. :-)

Honestly I have always had an issue with your usage of "blessed", at what point do you put your foot down and remove older, larger languages that are a detriment to the project and are unfair for new languages? Just because you have to remove a language from the officially supported list doesn't mean that the language is bad or that the maintainers of HLJS don't like the language.

Like I said ages ago:

There are 3 other languages that are all more than 10x the average language size, (mathematica (95kb), 1c (64kb) and gml (59kb)), I'm not sure how used those languages are, but removing them takes the final build down to 461kb, reducing the final build size by about 40%.

4 languages take up over 40% of the whole file size.

That isn't even counting compression such as GZIP and Brotli.

Using GZIP (which is supported by all major browsers) takes the total size down to 478KB, taking 66% of the file size.

Still using GZIP, but excluding ISBL, takes the compressed size down to 423KB, taking the size down to 32% of the original file size.

Still using GZIP, but excluding the "big 4" that I mentioned earlier, the file size is down to 278KB, taking the size down to only 24% of the original file size.


That is just with GZIP, using brotli (disclaimer, I know that this is still fairly new, and isn't implemented everywhere, or in all the browsers, but it is still a compression algo that is being used), the file sizes go down even more.

With Brotli, all the languages are only 35% of the original size (244.6KB).

With Brotli, all the languages excluding ISBL, are only 32% of the original size (227.4KB).

With Brotli, all the languages excluding the "big 4" are only 24% of the original size (171.2KB).
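For anyone wanting to reproduce this kind of measurement, the sketch below uses Node's built-in zlib module to compare raw, gzip, and Brotli sizes. The input is a synthetic stand-in, not the highlight.js bundle, so its ratios say nothing about the figures quoted above; a real measurement would read the built file from disk.

```javascript
const zlib = require("zlib");

// Reports raw, gzip, and Brotli sizes (in bytes) for a chunk of source text.
function compressedSizes(source) {
  const raw = Buffer.from(source, "utf8");
  return {
    raw: raw.length,
    gzip: zlib.gzipSync(raw).length,
    brotli: zlib.brotliCompressSync(raw).length,
  };
}

// Percentage of the original size, rounded to one decimal place.
function percentOfOriginal(part, whole) {
  return Math.round((part / whole) * 1000) / 10;
}

// Synthetic, highly repetitive stand-in input.
const sampleSource = "function highlight(code) { return code; }\n".repeat(500);
const measured = compressedSizes(sampleSource);
console.log(
  measured,
  percentOfOriginal(measured.gzip, measured.raw),
  percentOfOriginal(measured.brotli, measured.raw)
);
```

Running the same function over builds with and without the largest grammars would give comparable numbers for each scenario.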


CDNJS has brotli enabled by default btw, so you are already getting these benefits:

The "core set" of languages that get shipped with HLJS on its own is already being compressed to just 38.6% of its original size.


Maybe you should have "some party who plays god picking and choosing which languages are "blessed" and which are not.", since having 4 languages that take up over 60% of the project's file size is a bit ridiculous in general, regardless of whether it is stopping new languages being accepted or not.

In my honest opinion those 4 languages should be made 3rd party, you could fit over 70 other languages in the same space they took (based on my calculation of the average earlier, which is being skewed by these 4 languages, so it is possible you could fit even more)

With named usage there would be better opportunities to solve this with JIT dynamic loading of language modules... so just so long as the language modules was compiled on a known and trusted CDN it would "just work".

There are a ton of issues with this.

Unless you control what is being posted on the CDN, then you have no control. Let's use CDNJS for example.

Let's say that you guys implemented this, and made it so that if a language isn't found, it looks on CDNJS for highlightjs-${language}.

Sure that would work, it would find the language (if it existed) and load it.

However, what happens if I come along, upload a package to CDNJS named highlightjs-mylanguage, and have it include hidden malware?

Then I go onto a forum, and post a message using mylanguage as the named syntax, causing everyone who sees that message to suddenly load highlightjs-mylanguage and now my malware steals their credentials or something.

You're just creating an even bigger security risk.

Security is still a very real concern though since obviously just loading any random files from a CDN that no one validates is a huge code injection attack waiting to happen.

And here we have the actual issue with languages, developers don't want to have to read the source code of every library that they're using, but at the same time they don't want to just add every NPM package they see for whatever language one exists for. Even if they did do that, are developers expected to monitor NPM like a hawk looking for new languages to add? They would maybe do it once, and then never again, leaving new languages to never be noticed.

I think we'd be open to a PR that supported "just in time" loading of languages via CDN, where the CDN was configured at load time when HLJS was initialized. That could definitely be a small piece of the puzzle, allowing someone else to step in and run a "trusted" CDN source for a broader set of community languages.
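To make the idea concrete, here is a rough sketch of what such a loader could look like. None of this is existing highlight.js code: `createLanguageLoader`, the `baseUrl` option, and the `<name>.min.js` file layout are all assumptions. The module fetcher and the registration function are injected so the sketch does not assume a particular module system.

```javascript
// Sketch of a "just in time" language loader with a configurable source.
// All names and the URL layout are hypothetical.
function createLanguageLoader({ baseUrl, importModule, register }) {
  const pending = new Map(); // language name -> Promise, so each file loads once

  return function loadLanguage(name) {
    if (!pending.has(name)) {
      const url = `${baseUrl}/${encodeURIComponent(name)}.min.js`;
      const promise = importModule(url).then((definition) => {
        register(name, definition);
        return name;
      });
      // Forget failed loads so a later call can retry.
      promise.catch(() => pending.delete(name));
      pending.set(name, promise);
    }
    return pending.get(name);
  };
}
```

In a browser, `importModule` might be something like `(url) => import(url).then((m) => m.default)` and `register` might be `hljs.registerLanguage.bind(hljs)`, assuming the language file exports its definition function as the default export. Who runs and vets the source behind `baseUrl` is still the open question from this thread.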

This is a very closed-minded view. If someone made an Electron app that relied on JIT being a thing, and only loaded the core set of languages as a base, what happens when that app is used offline?

Not everything is web based, so while this solution could maybe work for web-based usage, for offline usage it doesn't do much. (Sure, the developer of the app could write an offline mode that downloaded all the languages, but then we're just back where we started: how does that developer know what all the languages are, and where to get them?)

I honestly don't think there is a solution to this problem using the current code base.

At this point I would be pushing for going to version 10.0.0: start with the core set of languages and have people PR new languages in, with the PR having a checkbox saying something like: - [] I will maintain this language / <xyz> will be maintaining this language. Sure, they could check it and then never do anything, but I don't think there is a fix for that unfortunately (the same can happen with new repository languages; there is no guarantee that people are going to maintain them).

By increasing the Major version number, it tells people that this is incompatible with previous versions, and it gives you a clean slate to add new languages and set proper guidelines, so you don't have things like languages that take up 60% of the file size.

Also things won't just break for old projects, but new, active projects could take advantage of 10.0.0 and the new languages that it could provide.

@joshgoebel
Member Author

joshgoebel commented Dec 22, 2019

4 languages take up over 40% of the whole file size.

I'm not sure what you're ranting about for half that. This is why we don't distribute a FULL monolithic build. We already bless some languages for "common" and that currently looks like:

highlight.js        : 178917 bytes
highlight.min.js    : 86955 bytes
highlight.min.js.gz : 28808 bytes

This is the default library we publish. 28kb gzipped.

In my honest opinion those 4 languages should be made 3rd party, you could fit over 70 other languages in the same space they took

I'm not sure I completely disagree (about removing some), but I don't understand your accounting of space, since we don't include them by default... and the user ultimately decides if they are worthwhile or not... or if they just use the prepackaged library then they don't get them at all - or can fetch them from CDN.

Personally I worry more about which languages require the most MAINTENANCE. A huge language that just sits there and everyone is happy and requires no maintenance and isn't in the default "common" set is something, but it's NOT a huge concern.

regardless of if it is stopping new languages being accepted or not.

It's not. We're not "out of space". :-)

There are a ton of issues with this. Unless you control what is being posted on the CDN, then you have no control. Let's use CDNJS for example.

Well someone would have to either trust their CDN or host it themselves. This is already an issue for anyone using CDNs. If you don't trust your CDN, then you shouldn't be using it... If it's easy enough for someone to add another file I don't see why they couldn't also easily just change the core file. And you're screwed either way.

Let's say that you guys implemented this, and made it so that if a language isn't found, it looks on CDNJS for highlightjs-${language}.

I don't think anyone (certainly not myself) ever suggested we load RANDOM files from a HUGE CDN that collects massive libraries... the idea would be that you could point to a SPECIFIC CDN build of highlight.js and fetch from there, or you could add languages one off by URL (as you can already do). I would oppose a feature in core that randomly loaded almost random URLs, as would hopefully any sane person. :-)
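One way to picture the difference is sketched below: the language name coming from page content is never interpolated into a URL unless it is on an explicit list tied to one pinned source. The base URL, the list contents, and the function name `resolveLanguageUrl` are placeholders for illustration, not an existing highlight.js feature.

```javascript
// Sketch: resolve a language name to a URL only if it is explicitly listed.
// The base URL and the language names are placeholders, not real endpoints.
const TRUSTED_BASE = "https://cdn.example/highlightjs-build/languages";
const ALLOWED_LANGUAGES = new Set(["javascript", "python", "rust"]);

function resolveLanguageUrl(name) {
  // Names come from page content (e.g. a forum post), so treat them as untrusted.
  if (typeof name !== "string" || !ALLOWED_LANGUAGES.has(name)) {
    return null; // unknown languages are left unhighlighted, never fetched
  }
  return `${TRUSTED_BASE}/${name}.min.js`;
}

console.log(resolveLanguageUrl("rust"));
console.log(resolveLanguageUrl("mylanguage"));
```

With this shape, a post tagged with an unlisted name such as `mylanguage` simply resolves to `null`, so nothing is fetched on its behalf.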

When that has been discussed it's been in the context of BUILD time (with people picking and choosing manually), not automatically at run-time.

Even if they did do that, are developers expected to monitor NPM like a hawk looking for new languages to add? They would maybe do it once, and then never again, leaving new languages to never be noticed.

No, check our README of known 3rd party languages. :-) IF someone wants to be on the list they make a PR. We already do this.

This is a very closed-minded view. If someone made an Electron app that relied on JIT being a thing, and only loaded the core set of languages as a base, what happens when that app is used offline?

I'm not sure how this is a new problem. It wouldn't be "CDN or nothing". If someone wants a monolithic build they can still always do that. I'm talking about web usage here when we're talking about CDNs and JIT.

have people PR new languages in (with the PR having a checkbox saying like: - [] I will maintain this language / <xyz> will be maintaining this language, sure they could check it and then never do anything, but I don't think there is a fix for that unfortunately (the same can happen with new repository languages, there is no guarantee that people are going to maintain it).

But if it's a 3rd party language it's really not our problem - so it's a huge difference. If it becomes OBVIOUS a 3rd party language is completely dead, dead and benefiting no one then it can always be removed from the README.

A checkbox means nothing.

so you don't have things like languages that take up 60% of the file size.

Again, not sure why you keep coming back to this. :-) If you're building a monolith with all 1mb of languages you should stop doing that. ;-)

@joshgoebel
Member Author

@jaredlll08 The first step in removing ANY languages from core is making 3rd party language support silly smooth, so if you really want to force some languages to "walk the plank" I'd suggest finding a way to contribute to the 3rd party language support. :-)

@jaredlll08
Contributor

jaredlll08 commented Dec 22, 2019

This is why we don't distribute a FULL monolithic build

I'm not sure I completely disagree (about removing some), but I don't understand your accounting of space, since we don't include them by default... and the user ultimately decides if they are worthwhile or not... or if they just use the prepackaged library then they don't get them at all - or can fetch them from CDN.

It's not. We're not "out of space". :-)

Again, not sure why you keep coming back to this. :-) If you're building a monolith with all 1mb of languages you should stop doing that. ;-)

You are the one that said:

Auto-loading is harder since right now the official languages are over 1 mb of Javascript

I'm just saying:

  1. It isn't actually 1mb of JavaScript being shipped to browsers, it is actually much less
  2. 4 languages are the reason you have nearly 1 mb of JavaScript, so if you wanted to do something like Auto-loading, then it could be worth reevaluating the current languages and deciding what is more valuable, Auto-loading or 4 huge languages.

I don't think anyone (certainly not myself) ever suggested we load RANDOM files from a HUGE CDN that collects massive libraries...

You did suggest loading libraries here though:

With named usage there would be better opportunities to solve this with JIT dynamic loading of language modules...

Regardless of where it is posted, if I had an official highlightJS repo, I assume I would be the one doing builds and releasing them, so once accepted with a nice "safe" language, I could push nasty code, have that be built and pushed to the safe CDN and do the exact same thing.

This could be solved by having a core maintainer looking over the changes before pushing, but at that point you're just adding more work for the core maintainers and that still doesn't ensure that a core maintainer didn't just glance at the commit names and approve it based on that.

A checkbox means nothing.

sure they could check it and then never do anything, but I don't think there is a fix for that unfortunately (the same can happen with new repository languages, there is no guarantee that people are going to maintain it).

Again, not sure why you keep coming back to this. :-) If you're building a monolith with all 1mb of languages you should stop doing that. ;-)

That is what you are pushing to NPM. Please don't stop making NPM builds.

The first step in removing ANY languages from core is making 3rd party language support silly smooth

Like I said, I don't think languages should have to walk the plank (unless all languages besides the core languages are removed and a new system is in place, like I said in the 10.0.0 suggestion).

I'd suggest finding a way to contribute to the 3rd party language support. :-)

I did try and find a way to contribute to the 3rd party language support, I suggested using submodules, and made a Proof Of Concept of using them.

Which apparently "no one is really in FAVOR of that". Looking back though, on this:

I don't want 100 separate issue trackers, etc... so far it seems no one is really in FAVOR of that

https://github.com/issues?utf8=%E2%9C%93&q=is%3Aopen+is%3Aissue+user%3Ahighlightjs
That is a link to every single issue for all projects under the @highlightjs organization. You can filter that search to only include specific repositories and more (cheatsheet), so that wouldn't even be an issue.
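As a sketch of how that filtering works, the search URL can be built from GitHub's search qualifiers (`is:open`, `is:issue`, `user:`, `repo:`). The helper name `issueSearchUrl` is made up for this example.

```javascript
// Sketch: build a GitHub issue-search URL for an organization, optionally
// narrowed to specific repositories with the `repo:` qualifier.
// `issueSearchUrl` is a hypothetical helper name.
function issueSearchUrl(org, repos = []) {
  const scope = repos.length > 0
    ? repos.map((repo) => `repo:${org}/${repo}`)
    : [`user:${org}`];
  const query = ["is:open", "is:issue", ...scope].join(" ");
  return `https://github.com/issues?${new URLSearchParams({ q: query })}`;
}

console.log(issueSearchUrl("highlightjs"));
console.log(issueSearchUrl("highlightjs", ["highlight.js"]));
```

The first call produces the same query as the link above (minus the `utf8` parameter); the second narrows it to a single repository.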

@joshgoebel
Member Author

You are the one that said:

Ah I think I led you down a confusing road. I meant to say auto-detection, or at least that's the realm in which I was talking about auto-loading... If you only have a few languages there isn't any need to auto-load them in my opinion. But if you run a blog where different pages might use say ALL of the languages (over time) then it might be VERY useful to only load the one you need for a given article, etc...

So my point about size really has nothing to do with the size of individual packages and rather more to do with the size of the total or perhaps even the quantity. For auto-detection all the languages have to be loaded in advance (so we can scan them and see which one "wins"). So having 1000 small languages that add up to 1mb is just as bad as a few big ones... you have to load them all to auto-detect them all.

Loading "as needed" from a CDN just doesn't really work for auto-detect because you can't detect until AFTER you've loaded... but on demand loading could be great if you know the languages in advance, since then you can only load what you need. So that's what I was talking about. The individual size of any language doesn't really matter so much.
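A toy model shows why: detection can only pick among grammars that are already loaded. The keyword-count scoring below is a deliberate simplification and not how highlight.js actually computes relevance; `registerToyLanguage` and `detect` are made-up names.

```javascript
// Toy model of auto-detection: every registered grammar scores the text and the
// highest score wins. Real highlight.js relevance scoring is more involved.
const registry = new Map();

function registerToyLanguage(name, keywords) {
  registry.set(name, keywords);
}

function detect(code) {
  let best = { language: null, score: 0 };
  const tokens = code.split(/\W+/);
  for (const [name, keywords] of registry) {
    const score = tokens.filter((t) => keywords.includes(t)).length;
    if (score > best.score) best = { language: name, score };
  }
  return best;
}

registerToyLanguage("python", ["def", "import", "lambda"]);
const snippet = "fn main() { let x = 1; }";
console.log(detect(snippet)); // the rust grammar is not loaded, so it cannot win

registerToyLanguage("rust", ["fn", "let", "mut"]);
console.log(detect(snippet)); // now rust can be scored and selected
```

With a named language the grammar to fetch is known up front, so it is the only one that has to be loaded.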

If one language is TRULY 100kb but you NEED it, well that's 100kb you have to download and just deal with... shrugs

@joshgoebel
Member Author

That is what you are pushing to NPM. Please don't stop making NPM builds.

No plans to. True, but for MOST people (AFAIK) the size of our NPM build is not really an issue. If you don't want to parse all that JS our README has easy instructions for loading only the languages you need. Or you could build a custom NPM package yourself. If someone is living in an environment where a 1mb server-side package is a real issue they need to be doing a custom build.

unless all languages besides the core languages

Well by the very definition (currently) core languages are only those that are included. :-) If we drop one, it's no longer a core language. Perhaps you meant "common" language or popular or some other metric?

You can filter that search to only include specific repositories and more (cheatsheet), so that wouldn't even be an issue.

Interesting. Though I don't know how easy it is to manage permissions at that level, but still interesting.

I did try and find a way to contribute to the 3rd party language support, I suggested using submodules, and made a Proof Of Concept of using them.

But I think (trying to remember) the problem there was that you were still talking about putting them in OUR repo, yes? I believe we desire a bit more isolation than that. One logical conclusion to what you were trying to do is to make a whole highlightjs-community repository (with submodules, or maybe look at subtrees too) and organize it and throw in some build scripts (probably on top of the new build stuff).

Once we have a nice build system, to me the next logical step is for someone to tie it all together with a bow, but I'm just not sure who that person is going to be - or if that's what the community really wants. :-)

@joshgoebel
Member Author

joshgoebel commented Dec 23, 2019

@jaredlll08 To me the fun/scary/interesting thing here is that the 3rd party stuff exists OUTSIDE core. That is a limitation in some ways, sure, but also a huge freedom. If you have a plan and the time feel free to build a larger proof of concept... and share it with people. See if it works, see if they like it, see if you get contributors. You don't necessarily need core's explicit blessing to do whatever you want.

One could imagine a parent repo where you had like:

  • /vendor/[highlightjs repo]
  • /languages/[many child repos]

I'm very open to making the "extra language" path configurable somehow... so then you could just check everything out... point the extra language source to [ROOT]/languages/, run the default build script... and push your results anywhere. You could build an NPM package, publish to a CDN, commit releases back into GitHub, or all of the above. :-)

Actually after the new build stuff works the only thing you might actually have to do here is create the repo that ties it all together (and of course figure out how/where to share/publish).

@jaredlll08
Contributor

So having 1000 small languages that add up to 1mb is just as bad as a few big ones...

While it may be just as bad in terms of performance, it would be better (my opinion) to have 1000 languages being supported, compared to 180 languages.

From what I understand of what you've said:
it would take 5 seconds to auto detect with 180 languages (with big language files)
it would take 5 seconds to auto detect with 1000 languages (with small language files).

If that is correct (using my hypothetical numbers), then would it not be better to be able to detect those 1000 languages instead of having the big language files limit the project (in theory removing them would make the current project faster?)

Well by the very definition (currently) core languages are only those that are included. :-) If we drop one, it's no longer a core language. Perhaps you meant "common" language or popular or some other metric?

You're right, I did mean "common", I forgot what name you used for them, sorry for the confusion.

Interesting. Though I don't know how easy it is to manage permissions at that level, but still interesting.

What permissions? It just gives a link to all the issues, so you could go into them and it should be fine.

If you're talking about things like setting author/labels/assignee without having to go into the issue itself or on multiple issues, then yes, that wouldn't be possible with that link unfortunately.

But I think (trying to remember) the problem there was that you were still talking about putting them in OUR repo, yes?

I just gave a simple POC, it could be implemented in any repo, so doing sub modules in a community repo would work.

@joshgoebel
Member Author

(in theory removing them would make the current project faster?)

That's why we have common groups of languages. We do "remove" them for 95% of the users of Highlight.js who are just using the default distribution.

If that is correct (using my hypothetical numbers), then would it not be better to be able to detect those 1000 languages instead of having the big language files limit the project (in theory removing them would make the current project faster?)

Maybe, that's why we leave it up to the user how they build the library. If max speed or max languages is a criterion for you then you'd simply build the library with a cap on the size and only include languages below a threshold.
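As a sketch of that selection step: filter the language list by a per-language size cap before handing the names to a custom build. The first three sizes are the figures quoted earlier in this thread; the other entries and the helper name `selectUnderCap` are made-up placeholders, and this is not how the actual build script is configured.

```javascript
// Sketch: pick languages for a custom build by capping per-language size.
// The first three sizes are the figures quoted in this thread; the others are
// made-up placeholders for illustration.
const languageSizesKb = [
  { name: "mathematica", kb: 95 },
  { name: "1c", kb: 64 },
  { name: "gml", kb: 59 },
  { name: "example-small-a", kb: 3 }, // hypothetical
  { name: "example-small-b", kb: 5 }, // hypothetical
];

function selectUnderCap(languages, maxKb) {
  const kept = languages.filter((lang) => lang.kb <= maxKb);
  const totalKb = kept.reduce((sum, lang) => sum + lang.kb, 0);
  return { names: kept.map((lang) => lang.name), totalKb };
}

console.log(selectUnderCap(languageSizesKb, 50));
```

The resulting `names` list is what would be passed to whatever custom build process is in use.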

In practice this has been a non-issue since people just use the default build, which gives you a nice common platform to work from. If we added JIT loading then the "common" build would gain support for all 185 languages (via auto-load). And that's about as good as we can get (citing the previous issues with having to load a language in order to auto-detect it).

Going broader than that requires someone stepping up to maintain a 3rd party CDN and solve all the related issues in doing so. You could then use it as your JIT source and support who knows how many languages via JIT.

Auto-detection (at least for the present time) is always going to be limited by the languages you've decided to build into the library.

@joshgoebel
Member Author

What permissions? It just gives a link to all the issues, so you could go into them and it should be fine.

I was referring to managing them, etc. I don't think I magically have permission to administrate all of the issues for the highlightjs organization. That's probably possible, but it's another nuance of having things in lots of different places.

If you're talking about things like setting author/labels/assignee without having to go into the issue itself or on multiple issues, then yes, that wouldn't be possible with that link unfortunately.

Exactly, I do that stuff ALL the time. :-)

I just gave a simple POC, it could be implemented in any repo, so doing sub modules in a community repo would work.

Not opposed to seeing someone try that. :-) I'd suggest you consider subtrees though as I mentioned I've read they are a lot more sane. :-) But maybe you're a die hard submodule believer, which is ok too. :-)

@joshgoebel
Member Author

joshgoebel commented Dec 23, 2019

@jaredlll08 There is no need to discuss removing languages from "core" further here. None of the maintainers is in a hurry to do that, so it's simply not going to happen soon - if ever. I'd personally like to remove some, but I'm not in a hurry either... as for most people they have no day to day impact on their usage of the library.

As far as removing from "common", I don't think we need to... there is a thread on that and no one really spoke up, plus it'd be a breaking change for people. If we were going to... the transition to v10 would be a good time, but honestly I don't think there are any TRULY worth removing. The current gzip size is 28kb and I'm pretty happy with that.

So let's try and move past the fact that we have a few large languages and focus on the other things here.

@joshgoebel
Member Author

@jaredlll08 If you felt strongly and would like to publish a npm-highlight-js-small package or something that'd be understandable... but as said elsewhere I think the 1mb installed size of our npm package also isn't really an issue for 99% of people. And that's really the only place most people actually "feel" the impact of those extra languages (npm package size). Or CDN size if they hosted the CDN files, but 1mb of assets is really nothing for web hosting.

@joshgoebel
Member Author

Closing. Inactive thread.
