Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(docs) Explain the 1st party vs 3rd party situation a little better #2823

Merged
merged 3 commits into from
Nov 13, 2020

Conversation

bradcray
Copy link
Contributor

@bradcray bradcray commented Nov 3, 2020

Changes

As someone interested in contributing a new language to Highlight.js, I ended up very confused by the distinct messages in language-contribution.rst and the GitHub PR template (each of which seemed to be saying "sure you can contribute new languages, no problem, come on in!") versus extra/3RD_PARTY_QUICK_START.md which seemed to be all about "here's how you can add your own language without contributing it back to us" (which led me to wonder "Why would anyone prefer to do that?" Only after some interaction with the team, did I learn that this is the only option that's really available.

To that end, I drafted some text updating item 1 in language-contribution.rst to try and clarify the situation for future developers from the outset rather than having people be confused and requiring clarification from the team to settle things out. I'm sure it could be improved further, but I think this would have set my expectations and understanding better from the very start.

…pectations

---
Signed-off-by: Brad Chamberlain <bradcray@users.noreply.github.com>
@joshgoebel
Copy link
Member

We do try to explain this is several different places. Something like this could be merged, but I'll edit it to remove a bit of the negativity. Also, turning a 1st party grammar you mistakenly added to the main repo into a 3rd party grammar really just requires moving the files... so the biggest problem here is misalignment of expectations.

The following steps describe how a core developer would add a new
language to the code base, but they can also be used as a reference
for creating support for your language within your third-party repo.

I'll look carefully but I don't see why this needs to be said at all... making a language is literally the exact same process 1st party or 3rd party (designed to make it equally easy), it's just a matter of which subdirectory the files go into... highlightjs or extra/your_grammar. Or maybe it should be stated more along those lines?

@joshgoebel
Copy link
Member

@allejo @egor-rogov Thoughts?

@joshgoebel
Copy link
Member

and the GitHub PR template (each of which seemed to be saying "sure you can contribute new languages, no problem, come on in!")

Could you elaborate more on this - and perhaps how the PR template could be improved? We merge new PRs all the time, just not new language grammar PRs. Although at the point someone is making a PR it's kind of a little bit too late to tell them.... It's no secret that we don't merge new language grammars but I also think there is a problem with hitting users over the head with that fact every 5 minutes. A lot of people have still found lots of amazing ways to contribute - 3rd party languages, plugins, first party bug fixes, etc. More than a few people originally "burned" by the fact that we don't merge new language PRs have then gone on to contribute by publishing their grammar as a 3rd party language modules (with very little effort)... it's not as terrible as it seems at first thought. :-)

So I think how this issue is communicated is pretty important. That said I'm sure we can currently do a better job of setting expectation. I just worry about placing a warning everywhere we can possibly think of - because it sets the wrong tone.

@joshgoebel
Copy link
Member

joshgoebel commented Nov 3, 2020

We could probably also do more to explain to people why getting their language merged often isn't what they think it will be. Some of the most upset people (early on) mostly wanted their language to magically work on StackOverflow/Discourse/[insert big site]... and since you can't talk to those places (so easily) they just thought if they could only get their language merged upstream... boom magic...

But that's not how it works at all.

These large sites typically (despite using Highlight.js) have a very tight list of curated languages that they support, bandwidth and CDN costs to keep in mind, etc. They only support a small fraction of our 1st party languages and are often quite hesitant to add new ones, and they have many reasons for doing so.


That said Discourse now allows injection of custom JS/HTML though, so it's now very easy to add a 3rd party grammar (or 1st) that Discourse doesn't support out of the box. :-) So things are slowly improving. StackExchange will be a harder nut to crack.

@joshgoebel joshgoebel changed the title Proposed edits to language-contribution.rst to avoid setting false expectations / reduce confusion chore(docs) Explain the 1st party vs 3rd party situation a little better Nov 3, 2020
@bradcray
Copy link
Contributor Author

bradcray commented Nov 3, 2020

I'll edit it to remove a bit of the negativity.

That's completely fine of course. It wasn't my intention to have it read negatively so much as directly/bluntly/unambiguously. After I left work yesterday, I realized that a better title for step 1 might be something like "manage your expectations".

making a language is literally the exact same process 1st party or 3rd party (designed to make it equally easy), it's just a matter of which subdirectory the files go into... highlightjs or extra/your_grammar. Or maybe it should be stated more along those lines?

I think making that clearer would be beneficial. I.e., saying something like "for core developers, the paths that follow are relative to $HIGHLIGHT_JS_HOME while for third-party developers they are relative to $HIGHLIGHT_JS_HOME/extras/ (or $HIGHLIGHT_JS_HOME/extras/your-3rd-party-repo-name?) would go a long way. For me, a lot of my confusion was due to the fact that the two files (this one and the 3rd party README) seemed to be saying very similar things, so it wasn't entirely clear why both existed, or what the difference. This paragraph was my best guess as to why they both existed after our exchange (before our exchange, I thought that some people simply wouldn't want to contribute their grammars back to a repo that they didn't have merge permissions on, so thought 3rd-party languages were a way to have them make progress without opening a PR containing their code against the main repo).

(each of which seemed to be saying "sure you can contribute new languages, no problem, come on in!")

Could you elaborate more on this

My first question when I found highlight.js was "is it possible to extend it to support a new language?" where the answer was very clearly "yes"; then the next was "is it possible for my language to be incorporated into the source base?" where the answer turned out to be a pretty clear "no", though that wasn't obvious to me from the documentation, only from your notes in the issue I opened. Specifically, I surfed to the main website, clicked on the "Contributing" ribbon (which took me awhile to find, admittedly), clicked on Developer Docs, and found myself at "Language Contributor Checklist" fairly quickly, which, by virtue of its existence seemed to imply that it was possible to contribute new languages. I also read the extras/ 3rd party readme and understood it to describe a way to create a language without contributing it back, but nothing suggested to me that this was really the only avenue available.

Now that I understand the state of things better, I can see that a literal reading of the Language Contributor Checklist says to create a PR only to add your language to the list of languages, not to add the rest of the code developed in the previous steps, but my reading of it was "of course you'd commit all the code developed in the previous steps in the PR as well" and I viewed adding the language to the list as simply being the icing on the cake. Specifically, nothing seemed to preclude that one could/would open a PR to introduce a new language definition.

The PR template's text around adding tests being a pre-requisite for merging also suggested to me that it was possible to add a new language so long as one added tests for it as well. Now I understand that text to be for changes to existing grammars which need tests to lock in those changes and make sure they don't regress, but at the time it didn't seem to rule out checking in code defining a new language and code to do it. One way to finesse this might be to take advantage of GitHub's support for multiple PR templates to have a distinct template for "add a new language" that only talked about adding it to the list and CHANGES.md and a second for "general code improvements" that talked about adding tests, CHANGES entries, etc.

Overall, I'd be inclined to combine the "Language Contributor Checklist" and 3rd party README into a single document, because a lot of my confusion was around wondering why both of these documents existed when they seemed to describe a lot of the same steps. My (incorrect) conclusion was that the former was for people who wanted to submit their code back to the project as a PR and the latter was for those who didn't or couldn't. This current PR was my quick attempt at adding text that would've put me onto a better track to begin with (or encourage someone else to do similarly). I thought about a PR to try merging the two docs into one, but that felt more significant than I had the time/interest for (and also isn't something I'd start into without getting your team's go-ahead, obviously).

but I also think there is a problem with hitting users over the head with that fact every 5 minutes.

Where do you say it now? (Because again, I don't feel like I saw that message until you responded to my issue). I suppose the "On requesting new languages" page might be considered such a place, but I didn't read it this way to me. I read it as saying "we're not going to do the language for you; you're welcome to do it yourself; you can do it as a 3rd party package if you want" but again, nothing in it suggested that if someone did show up with a new language implementation it couldn't / wouldn't be considered for inclusion in the project proper. I.e., it read to me as "if you're someone with gumption and can put such a highlighter together, that's great" but nothing said "don't expect it to be merged to the main repo, though."

[This "On requesting new languages" document also feels as though it could be integrated into, or at least linked from, the language contributor checklist, perhaps in the "managing expectations" bullet).

wanted their language to magically work on StackOverflow/Discourse/[insert big site]...

Yeah, that'd be another good message to try and get across in the "managing expectations" bullet. It's another one that I didn't fully understand until our exchange on my issue (though I was pleased that you turned out to be wrong about our Discourse site :) , which is the main thing that finally got me off my butt and into this effort; StackOverflow having been a previous one, but one that I knew would be a heavy lift in any case).

Speaking of which, if you have the interest and I have the time, I could draft some "How to get your language enabled on a Discourse site that you administrate" instructions for consideration if you have suggestions for where to place that in a PR.

Speaking of StackOverflow, since highlight.js is still "new" over there, I posted a question about third-party highlight.js languages yesterday to see whether it would get any traction, and was surprised that the main concern (though only in the comments, granted) was not one of the performance impact of supporting additional languages (which had been my guess) so much as the security concerns. I realize that it's a longshot that they'll do anything here in any case, but didn't see any harm in asking.

Anyway that made me curious whether highlight.js does anything to try and ensure that grammars don't contain malicious code (where I could imagine many third-party grammars wouldn't need to support any executable code at all, so much as lists of keywords and regexps). Is it OK if I were to open an issue here asking about this topic?

Sorry this turned out to be such a long response. Hopefully I addressed the questions you asked.

@joshgoebel
Copy link
Member

Overall, I'd be inclined to combine the "Language Contributor Checklist" and 3rd party README into a single document,

Probably not a bad idea. The second document only exists as a stop-gap to get people started on 3rd party development. It was always intended that over time it could be much improved.

saying something like "for core developers, the paths that follow are relative to $HIGHLIGHT_JS_HOME while for third-party developers they are relative to $HIGHLIGHT_JS_HOME/extras/ (or $HIGHLIGHT_JS_HOME/extras/your-3rd-party-repo-name?) would go a long way.

I'm not sure why this distinction matters though... "core developers" already know this, and I think it just muddles the water - creating confusion where none needs to exist... if 3rd party grammars is the happy path, the the docs should focus on that path and just explaining it. What "core" does shouldn't really matter.

Where do you say it now?

As you've said (and noticed) it's several places, perhaps it still is just not explicit enough.

I could draft some "How to get your language enabled on a Discourse site that you administrate" instructions for consideration if you have suggestions for where to place that in a PR.

That would be awesome, I just dunno where we'd put it either... I really think we need an "Awesome Highlight.js" type site/docs and I think that might be the landing place for a lot of that kind of thing.

so much as the security concerns.

Security concerns are a real concern - putting Highlight.js (and language grammars) on your website is injecting code into a site... you need to trust us and any 3rd party grammars you're using... a malicious grammar could potentially do anything a user could do on a website...

whether highlight.js does anything to try and ensure that grammars don't contain malicious code

The core team reviews all PRs that go into core, of course. I'm not sure how we could do anything about 3rd party grammars since (by design) grammars are executable code. Any site with serious security concerns using Highlight.js should run it in a background worker thread to isolate it from their primary website code.

Is it OK if I were to open an issue here asking about this topic?

What would the question be exactly? I think I've covered a bit above and some of this is also covered in the long, long discussion on why we no longer accept language PRs (you can find it in the long-lived discussion issue).

@joshgoebel
Copy link
Member

joshgoebel commented Nov 3, 2020

If you're interested on really working these docs you should have a brief chat with @binyamin over on #2753 also. Definitely some overlap here.

@bradcray
Copy link
Contributor Author

bradcray commented Nov 3, 2020

if 3rd party grammars is the happy path, the the docs should focus on that path and just explaining it.

That would be fine / clearer to me as well. The challenge I had was that by not saying which root they were relative to (and, again, being under the impression up until our interaction that I could contribute a new language back to the repo), the vagueness had me continue going back and forth between "wait, should it be in extras/ or where I clearly see every other language living?" So if the file had gotten me out of the mindset that I might be contributing a new language back at the outset and then either made the paths include extras/ or else stated in the prelude that they were relative to extras/, that would've removed a lot of churn from my process.

As you've said (and noticed) it's several places, perhaps it still is just not explicit enough.

Yeah, it definitely was not clear enough for me, more like hinted at from a few angles. And granted, I'm not always the brightest or most patient of readers, but I think if the first bullet were to have addressed it, it would've saved some time, churn, and eventual disappointment.

That would be awesome, I just dunno where we'd put it either...

If I should read this as "yes, please write this up and we'll figure out where to put it", I will attempt to do so. If I should read it as "it'd be great if you wrote it up, but then we'll probably drop it on the floor because we honestly don't know where we'd put it" then I probably won't. (If it were me, I'd make the first bullet about managing expectations, explaining that you'll be developing a third-party package, and talking about the uphill battle that you might face getting a StackExchange-like site to start using their package; but then to say that integrating into a Discourse site that you admin can easily be done and linking off to wherever else in the docs hierarchy this were to land).

a malicious grammar could potentially do anything a user could do on a website...
grammars are executable code.

Understood. My thinking was (as I think you implied on the SO question) that it seems like a human can scan most grammars and clearly determine that they are "inert" without much effort (e.g., it defines some key-value pairs that define some keywords and regexes). This made me wonder whether Highlight.js did (or could) do something similar (if javascript's code reflection capabilities were rich enough?) to distinguish between a more 'inert/static' vs. 'active/dynamic' grammar, if you take my meaning. I understand now that it doesn't, and am not particularly surprised / disappointed, but that's what I was going to ask about.

What would the question be exactly?

I think you've addressed my first-level curiosities here and on the SO question, thanks.

@joshgoebel
Copy link
Member

I could draft some "How to get your language enabled on a Discourse site that you administrate" instructions for consideration if you have suggestions for where to place that in a PR.

@allejo @egor-rogov @binyamin Any idea where something like that should perhaps live?

@joshgoebel
Copy link
Member

joshgoebel commented Nov 3, 2020

it seems like a human can scan most grammars and clearly determine that they are "inert" without much effort

Yes.

(This made me wonder whether Highlight.js did (or could) do something similar (if javascript's code reflection capabilities were rich enough?) to distinguish between a more 'inert/static' vs. 'active/dynamic' grammar, if you take my meaning.

You could of course easily (at run-time) determine if a grammar TELLS you it's using callbacks or some of the more dynamic features... but it could still contain a lot of code doing who knows what and then still return the simplest actual grammar definition. I think protecting ourself from malicious 3rd party grammars is out of scope... to do so we'd really need static-code analysis (to read the JS WITHOUT running it)... or we'd need to make the grammars entirely static... but even then if we aren't in charge of the WHOLE loading process, users can still shoot themselves in the foot.

We'd have to go out of our way to fetch 3rd party grammar files as JSON txt or some other non-executable format and then parse them, etc... (can't fetch them as JS because then someone could just sneak executable code back into them, etc)... just not really a problem that seems to need solving vs someone just paying closer attention to the code they are running. It's certainly an interested problem, just not one it seems anyone cares about that much right now. :)

@bradcray
Copy link
Contributor Author

bradcray commented Nov 4, 2020

it could still contain a lot of code doing who knows what and then still return the simplest actual grammar definition.

Yeah, I was imagining that such cases would be flagged as potentially dangerous by any sort of automated security scanner. So sufficiently simple third-party grammars would pass and anything more complicated would not even if it could be proven to benign.

I think protecting ourself from malicious 3rd party grammars is out of scope...

That's definitely reasonable. If you'd already done something like this and it gave external sites more confidence in using third-party grammars, that'd be cool, but since not, I agree that this is a big project to take on.

@allejo
Copy link
Member

allejo commented Nov 13, 2020

I could draft some "How to get your language enabled on a Discourse site that you administrate" instructions for consideration if you have suggestions for where to place that in a PR.

@allejo @egor-rogov @binyamin Any idea where something like that should perhaps live?

How about creating a new section in RTD along the lines of "Highlight.js Integrations"? It can have two sections "Services Using Highlight.js" and "Self-Hosted Applications"

"Services Using Highlight.js"

Here we can briefly summarize that SO + Discord use highlight.js but they don't use all of the languages. Here we can point them to resources (if we have any) on how to request them to add support for a language.

"Self-Hosted Applications"

This is where the Discourse integration can be discussed 😄

@joshgoebel
Copy link
Member

How about creating a new section in RTD along the lines of "Highlight.js Integrations"? It can have two sections "Services Using Highlight.js" and "Self-Hosted Applications"

You mean the readme? I think it's already a bit long , but perhaps if we add a TOC... or a heading that only says "Hey go read out docs on Integrations and then a matching .rst file in our documentation could work.

Here we can briefly summarize that SO + Discord use highlight.js but they don't use all of the languages.

Why, just for marketing? I'm not sure this is our support burden to handle...

Here we can point them to resources (if we have any) on how to request them to add support for a language.

I don't think there is a way (for Discourse), or no useful way... for SO there is a process, and it's documented on SO... and true we could point them there... but I think that'd be a pretty round-about route for someone to take... it's far more likely they just ask on SO and never find our README or docs at all.

So I guess the first use case (how do I use custom languages with Discourse) sounds a lot more helpful to me personally... though I don't know that I'm opposed to the latter, just not sure I see the usefulness yet.

@joshgoebel
Copy link
Member

@allejo Do you find my changes an improvement to this file though? If so we should merge it until we have something better.

@allejo
Copy link
Member

allejo commented Nov 13, 2020

So I guess the first use case (how do I use custom languages with Discourse) sounds a lot more helpful to me personally... though I don't know that I'm opposed to the latter, just not sure I see the usefulness yet.

Personally, I think the "how to use custom languages with Discourse" should live in the Discourse documentation. Same with SO's and Discord's integration. But since the question was brought up on where it'd live in highlight.js' docs, I'd think it belongs in RTD and not the README. Like you said, the README is long enough and it's not something the average user who wants to highlight their code would care about.

@allejo Do you find my changes an improvement to this file though? If so we should merge it until we have something better.

I think this PR as-is is good and clears up enough 👍

@joshgoebel
Copy link
Member

I'd think it belongs in RTD and not the README

Ah, Read the Docs, that just clicked with me, LOL. @bradcray Does Discourse have any type of place to perhaps naturally put information like that, wiki, etc?

@joshgoebel joshgoebel merged commit fb007ff into highlightjs:master Nov 13, 2020
@bradcray
Copy link
Contributor Author

Does Discourse have any type of place to perhaps naturally put information like that, wiki, etc?

I don't really know Discourse well enough to know for sure. The only things I've found are their discussion forums on meta.discourse.org (which is where I figured out it was possible for an admin to enable to new highlight.js languages in a given discourse instance, though the instructions were not particularly clear, particularly for discourse n00bs) and their API documentation (which is a lot lower level from what I've seen).

To be clear, I'm fine with dropping the knowledge on the floor as well. I've got it in my README, so I'm good and was only offering it up in case you wanted to make it available to others through your docs.

@joshgoebel
Copy link
Member

Where is your README?

@bradcray
Copy link
Contributor Author

Here, though note that it's written up as notes to myself / my team rather than anything intended for consumption by the general public:

https://github.com/chapel-lang/highlightjs-chapel#to-integrate-the-chapel-grammar-into-a-discourse-theme

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants