File Extension as Aliases #2371

taufik-nurrohman · 2020-01-28T01:52:53Z

I suggest adding each language's common file extensions to its aliases. That would make things easier for developers who want to pick the language from the file extension themselves instead of depending on the built-in language detection. For example, apache should have an alias htaccess too.

Human Readable Name	File Extension
Apache	HTACCESS
JavaScript	JS
Markdown	MD, MKD
Text	TXT
YAML	YML, YAML

let path = './path/to/file.js',
    container = document.querySelector('pre > code');

fetch(path)
    .then(response => response.text()) // response.text() returns a Promise, so resolve it first
    .then(source => {
        let x = path.split('.').pop(), // Take file extension as the language name
            out = hljs.highlight(x, source);
        container.className += ' hljs ' + out.language;
        container.innerHTML = out.value;
    });

Imagine someone makes a git repository viewer application and then uses highlight.js to color the code syntax in their files. Detecting the language automatically from the file extension would be more robust than reading the file contents over and over with every available language package until the most relevant match is found.

This also opens up various possibilities to load language packages asynchronously based on file extensions, so the amount of data transferred will be much smaller considering that JavaScript works on the client side.
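The lazy-loading idea above could be sketched roughly as follows. Everything here is an assumption for illustration: the `EXTENSION_TO_LANGUAGE` table, `extensionOf`, and `createLazyLoader` are hypothetical names, not part of highlight.js, and the caller supplies the actual loading function (for example a dynamic `import()` of a language module).

```javascript
// Hypothetical extension table built from the examples in this issue;
// highlight.js does not ship such a table.
const EXTENSION_TO_LANGUAGE = new Map([
  ['htaccess', 'apache'],
  ['js', 'javascript'],
  ['md', 'markdown'],
  ['mkd', 'markdown'],
  ['txt', 'plaintext'],
  ['yml', 'yaml'],
  ['yaml', 'yaml'],
]);

// Returns the lowercased extension of the last path segment, or '' if none.
function extensionOf(path) {
  const name = path.split('/').pop();
  const dot = name.lastIndexOf('.');
  return dot === -1 ? '' : name.slice(dot + 1).toLowerCase();
}

// Wraps a caller-supplied loader so each language is requested at most once.
function createLazyLoader(loadLanguage) {
  const cache = new Map(); // language name -> Promise of the loaded grammar
  return function load(path) {
    const language = EXTENSION_TO_LANGUAGE.get(extensionOf(path));
    if (!language) return Promise.resolve(null); // unknown extension
    if (!cache.has(language)) {
      cache.set(language, Promise.resolve(loadLanguage(language)));
    }
    return cache.get(language);
  };
}
```

Only the languages whose extensions are actually seen get requested, which is the data-transfer saving described above.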


joshgoebel · 2020-01-29T07:49:19Z

Detecting the language automatically from the file extension would be more robust than reading the file contents over and over with every available language package until the most relevant match is found.

Of course you wouldn't actually ever do it over and over. If performance mattered at all you'd only do the auto-detect a single time and then cache that result for future use. That said if you can trust the extensions, that could be much faster.

Of course historically we don't care about extensions since we aren't dealing with them, we're only dealing with source.

My suggestion here is that we make an actual extensions key so that extensions can be registered differently than aliases. This would also help the stated use case here... as file.javascript would technically be a FALSE positive as javascript is not a valid extension for JavaScript. And god forbid the confusion that ensues when one language's extension happens to be the name (or common alias) of another language - though hopefully that's only a hypothetical problem.

So something like:

// plaintext.js
{
  aliases: ['text'],
  extensions: ['txt'],
  ...
}
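To make the false-positive point concrete, here is a minimal sketch of how separate lookups might behave. The `extensions` key is only the proposal from this comment, not an existing highlight.js field, and `buildIndexes` and the sample definitions are hypothetical.

```javascript
// Hypothetical definitions using the proposed `extensions` key.
const definitions = {
  plaintext: { aliases: ['text'], extensions: ['txt'] },
  javascript: { aliases: ['js'], extensions: ['js', 'mjs'] },
};

// Builds two independent lookup tables so that names/aliases and
// extensions never leak into each other.
function buildIndexes(defs) {
  const byAlias = new Map();
  const byExtension = new Map();
  for (const [name, def] of Object.entries(defs)) {
    byAlias.set(name, name); // the language name always resolves to itself
    for (const alias of def.aliases || []) byAlias.set(alias, name);
    for (const ext of def.extensions || []) byExtension.set(ext, name);
  }
  return { byAlias, byExtension };
}

const { byAlias, byExtension } = buildIndexes(definitions);
console.log(byAlias.get('javascript'));     // resolves as a language name
console.log(byExtension.get('javascript')); // not an extension, so no match
```

With the tables kept apart, a file named file.javascript no longer matches by extension, while class names such as "javascript" or "js" still resolve as before.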

joshgoebel · 2020-01-29T07:53:35Z

@isagalaev Do you have any historical context on what the actual original intention was here - that might guide us or add to the discussion?

I'm pretty sure this wasn't the original intent because we're sorely lacking some obvious extensions (like htaccess, as mentioned)... yet for many languages it sure appears we are already doing this.

@egor-rogov Any thoughts?

joshgoebel · 2020-01-29T07:56:37Z

The advantage of keeping them separate:

Easier to split them now (before we have all the data) than later
Easy to have different behavior for each, or user customizable behavior (impossible if they are just in one huge list)
Also easy to have exactly the same behavior as one big list just by registering them all as aliases.

The only real downside I see is that we need to go over existing data and split out the extensions from the aliases, and this would likely for the short-term mean we needed to behaviorally treat them as the same...

Although it does lead to the question: should an extension SOMETIMES also be an alias? For example, JS is often used in place of JavaScript... making it a true alias, not just the extension. Does that distinction matter? Splitting them out means we have to answer sticky questions like that.

One thing I'd love to know is a rough count of how many extensions we already have as aliases.

If it's already quite high, perhaps we just roll with it... but if it's pretty low, then I think this is worth a moment of thought.

egor-rogov · 2020-01-29T12:58:55Z

And god forbid the confusion that ensues when one language's extension happens to be the name (or common alias) of another language - though hopefully that's only a hypothetical problem.

I'm afraid it's not that hypothetical. For example, *.sql file can easily be from PostgreSQL, Oracle or any other database.
Highlightjs deals with code blocks and knows nothing about files nor their extensions. Since it's the application responsibility to find out extensions, I feel like it also should be the application responsibility to map extensions to HLJS language names/aliases.

Of course we can think about how to make it easier, but I'm afraid it's not straightforward. We can, for example, provide the separate list of extensions (as you suggested), but allow different languages to have same extensions etc., and let the application make the final decision.

joshgoebel · 2020-01-29T13:16:02Z

I'm afraid it's not that hypothetical. For example, *.sql file can easily be from PostgreSQL, Oracle or any other database.

Good point.

Highlightjs deals with code blocks and knows nothing about files nor their extensions. Since it's the application responsibility to find out extensions, I feel like it also should be the application responsibility to map extensions to HLJS language names/aliases.

This feels right at first glance.

I think aliases like "js" and "rb" are pretty common (I use them all the time on Github) but I think the KEY here might be that they are used because they are SHORTCUTS, not because they are extensions. So one wouldn't write htaccess when "apache" was shorter... although we currently also have "apacheconf" as an alias for apache right now. Honestly that's probably backwards though... "apacheconf" is really the name and apache might be the alias. Though then you could argue "htaccess" is shorter...

I guess right now it's all a bit muddled, which is why my mind instantly thought about creating the separation, but I'm afraid then it will be hard to "prove" something is or isn't an alias... extensions are pretty well defined though.

We can, for example, provide the separate list of extensions (as you suggested), but allow different languages to have same extensions etc., and let the application make the final decision.

Sounds reasonable but how does that work when the block is class="lang-sql"... does it just find all potential sql matches and then run those thru the auto-detect?

And if we're TRULY going to add extensions, does that mean we need an ext-sql type namespace for people who'd like to syntax highlight based on extension? I think that would be preferable to lang-x where x is ambiguous.

joshgoebel · 2020-01-29T13:18:08Z

And we have ridiculous things like:

aliases: ['php', 'php3', 'php4', 'php5', 'php6', 'php7'],

Which seem entirely unnecessary as aliases... I don't think they are extensions either.

joshgoebel · 2020-01-29T13:26:24Z

Then you have categories also like assembly... .s, .S, .asm seems to be the trend, but is that x86 assembly? ARM assembly? mips? And "subtypes" like Arduino, which has the same extensions as CPP.

joshgoebel · 2020-01-29T13:36:40Z

In a quick review it seems the ship might have already sailed on adding extensions to aliases... so I'm leaning towards approving things (like the PR to add htaccess to apacheconf), at least when they seem clear and unambiguous... but that still does leave a lot of the questions above open.

Honestly though I wonder if anyone wants to put in the work to split the existing aliases out into extensions... I'm not too excited about doing it... perhaps we just soldier on with aliases until it becomes a larger problem?

Right now someone who wanted to load up a bunch of conflicting aliases would have to deal with it by hand, or simply not rely on the aliases to work since really the last language loaded would be the "winner"...
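The "last language loaded wins" behavior can be illustrated with a simplified model. This is an assumption for illustration only, not the actual highlight.js internals; `createAliasTable` and the alias assignments below are hypothetical.

```javascript
// Simplified model of an alias table where later registrations overwrite
// earlier ones. Not the real highlight.js implementation.
function createAliasTable() {
  const owners = new Map(); // alias -> language that currently wins it
  const claims = new Map(); // alias -> every language that registered it
  return {
    register(language, aliases) {
      for (const alias of aliases) {
        const claimed = claims.get(alias) || [];
        claimed.push(language);
        claims.set(alias, claimed);
        owners.set(alias, language); // the most recent registration wins
      }
    },
    resolve(alias) {
      return owners.get(alias);
    },
    // Lists [alias, languages] pairs claimed by more than one language.
    conflicting() {
      return [...claims].filter(([, languages]) => languages.length > 1);
    },
  };
}

const table = createAliasTable();
table.register('x86asm', ['asm']);        // hypothetical alias assignment
table.register('armasm', ['asm', 'arm']); // registered later, so it wins 'asm'
console.log(table.resolve('asm'));
console.log(table.conflicting());
```

A report like `conflicting()` is the sort of thing an application would otherwise have to work out by hand before loading overlapping languages.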

egor-rogov · 2020-01-29T13:38:58Z

Extension-based autodetection looks reasonable.
We can support two use cases:

The application uses an API we provide to get the list of languages for an extension. Then the app decides which one is appropriate and passes it to HLJS (class="..." or class="lang-...").
The application passes the extension right to HLJS (class="ext-...") and we run autodetection within the list of possible languages.
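The two use cases above might look roughly like this. The candidate table, `languagesForExtension`, and `highlightByExtension` are hypothetical names and the extension assignments are illustrative; the only library call used is `hljs.highlightAuto(code, languageSubset)`, and the `hljs` object is passed in by the caller.

```javascript
// Hypothetical table: one extension may map to several candidate languages.
const EXTENSION_CANDIDATES = new Map([
  ['sql', ['sql', 'pgsql']],
  ['s', ['x86asm', 'armasm', 'mipsasm']],
  ['asm', ['x86asm', 'armasm', 'mipsasm']],
  ['js', ['javascript']],
]);

// Use case 1: the application asks for the candidates and decides itself.
function languagesForExtension(ext) {
  return EXTENSION_CANDIDATES.get(ext.toLowerCase()) || [];
}

// Use case 2: the application hands over the extension and auto-detection
// runs only within the candidate languages.
function highlightByExtension(hljs, ext, code) {
  const candidates = languagesForExtension(ext);
  return candidates.length > 0
    ? hljs.highlightAuto(code, candidates) // restricted auto-detection
    : hljs.highlightAuto(code);            // unknown extension: full detection
}
```

Either way the existing auto-detection does the real work; the extension only narrows the list it has to consider.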

if we're TRULY going to add extensions

Looks like something useful to me, but surely not the first priority.

egor-rogov · 2020-01-29T13:39:52Z

I'm leaning towards approving things (like the PR to add htaccess to apacheconf), at least when they seem clear and unambiguous

Agree.

joshgoebel · 2020-01-29T13:40:52Z

Extension-based autodetection looks reasonable.

This starts to smell a little like shebang lines though (just another way to detect/categorize)... and I don't think you were super encouraging of that as a core feature. What would make this different?

Well, maybe it's a little different since we already seem to do it via aliases. :-)

egor-rogov · 2020-01-29T13:59:14Z

Well, "extensions feature" doesn't change the way HLJS works. It's just the matter of narrowing down the list of languages for autodetection and passing it to the existing API.

On the other hand, shebang is inside the code, and it is grammar that we use to deal with the code. It doesn't look right to teach HLJS to look into the code using means other than the grammar. (It's okay for the application to sniff the code to be highlighted, find shebang, parse it somehow, and pass the language to the existing HLJS API, though.)
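For reference, application-side shebang sniffing of the kind described in the parenthesis could be as small as the sketch below. `languageFromShebang` and the interpreter table are hypothetical and live entirely in the application; the result would then be passed to the existing HLJS API as a language name.

```javascript
// Hypothetical interpreter-to-language table maintained by the application.
const INTERPRETER_TO_LANGUAGE = new Map([
  ['sh', 'bash'],
  ['bash', 'bash'],
  ['python', 'python'],
  ['python3', 'python'],
  ['node', 'javascript'],
  ['ruby', 'ruby'],
]);

// Returns a language name for the shebang on the first line, or null.
function languageFromShebang(code) {
  const firstLine = code.split('\n', 1)[0];
  const match = /^#!\s*(\S+)(?:\s+(\S+))?/.exec(firstLine);
  if (!match) return null;
  let interpreter = match[1].split('/').pop(); // '/bin/sh' -> 'sh'
  // '#!/usr/bin/env python3': the real interpreter is the next word.
  if (interpreter === 'env' && match[2]) interpreter = match[2];
  return INTERPRETER_TO_LANGUAGE.get(interpreter) || null;
}
```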

It's just how I feel it, of course. Perhaps I'm wrong.

joshgoebel · 2020-01-29T14:05:58Z

It's just the matter of narrowing down the list of languages for autodetection and passing it to the existing API.

I think you could potentially say the same for shebang data... really aren't we talking about whether a grammar can host data that we don't use DIRECTLY, but rather plugins or the source application could use indirectly to help correctly categorize a particular file/snippet?

If we are ok hosting extension data, BUT we don't use them directly, then why not host shebang lines... or any other "per-language" meta-data that might prove useful in general? And if we're not such a repository, then perhaps we shouldn't host extension data at all? Say "that is external to us, we only look at code"....

You could even argue shebang analysis is more in-scope than extensions... since shebang is part of the code itself... where-as extensions (and filenames in general) exist completely outside that sphere. :-)

joshgoebel · 2020-01-29T14:07:46Z

It would be great to know if anyone is currently categorizing snippets by extension alone like in the use case mentioned in the first post here.

egor-rogov · 2020-01-29T14:31:51Z

really aren't we talking about whether a grammar can host data that we don't use DIRECTLY, but rather plugins or the source application could use indirectly to help correctly categorize a particular file/snippet?

Hmm. If we're talking about storing some metadata... you almost convinced me. I think we shall return to the discussion later in more detail.
(original shebang proposal: #2174)

taufik-nurrohman · 2020-01-29T15:21:08Z

Looks like something useful to me, but surely not the first priority.

At least it is standardized. Some web server configurations use file extensions as a reference to automatically determine MIME types for the browsers. I think this is pretty standard as the configuration data will be served for any browser, anywhere. And so file extensions can be used as language categories too.

joshgoebel · 2020-01-30T01:12:42Z

At least it is standardized. Some web server configurations use file extensions as a reference to automatically determine MIME types for the browsers.

But the standard you're talking about there is extension to mime type mapping... that doesn't directly help us since we don't have a list of canonical extensions or a list of mime types. If you're merely saying it's helpful to be able to map from an extension to knowing what the file is, there is no disagreement on that point. :-)

I think the long-term question here is HOW we should encode that data... whether to continue using alias or to split the data out into its own field.

anwarhahjjeffersongeorge · 2020-02-05T21:28:50Z

It would be great to know if anyone is currently categorizing snippets by extension alone like in the use case mentioned in the first post here.

ObservableHQ uses highlight.js. I made an Observable notebook that lets people include code snippets in their Observables by referencing their URLs and parsing the URL contents into Markdown. Since the user supplies a URL, I automatically get the file extension with it, and I've been using this to figure out what kind of tag to put in the generated Markdown block, which is the alias.

The problem is that as it stands, I have to hard-code in a bunch of file extensions and their aliases, and that seems like the wrong way to do it.

If there was an independent mapping between the language full names and the extensions, this would help.

joshgoebel · 2020-02-06T01:20:19Z

@anwarhahjjeffersongeorge Why don't you share your list of mappings just so we see what that looks like.

anwarhahjjeffersongeorge · 2020-02-06T06:54:53Z

@yyyc514 no problem. it isn't anything fancy:

    codeexts: { // these are the code extensions I'm dealing with
      javascript: ['mjs', 'ejs', 'js', 'jscad'],
      typescript: ['ts'],
      openscad: ['scad'],
      processing: ['pde'],
      arduino: ['ino'],
      c: ['c', 'h'],
      cpp: ['cpp', 'hpp', 'cxx', 'hxx'],
      bash: ['sh'],
      python: ['py'],
    }

For my use case, I just do something like

// `ext` is the file extension taken from the user-supplied URL
let codetypename = ''
for (let key in codeexts) {
  if (codeexts[key].includes(ext)) {
    codetypename = key
    break
  }
}

But I can see this might not work for the whole highlight.js library since I have to add custom ones, like jscad, anyway. I don't know if this helps your discussion at all.
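If the mapping grew much larger, one option would be to invert it once into an extension-to-language table instead of scanning every list on each lookup. This is only a sketch: `invertExtensions` is a hypothetical helper and the sample data is a subset of the list above.

```javascript
// Subset of the `codeexts` mapping from the comment above.
const codeexts = {
  javascript: ['mjs', 'ejs', 'js', 'jscad'],
  typescript: ['ts'],
  c: ['c', 'h'],
  cpp: ['cpp', 'hpp', 'cxx', 'hxx'],
  bash: ['sh'],
};

// Builds extension -> language once. The first language listed keeps an
// extension if two languages claim it, matching the loop-with-break above.
function invertExtensions(mapping) {
  const byExtension = new Map();
  for (const [language, extensions] of Object.entries(mapping)) {
    for (const ext of extensions) {
      if (!byExtension.has(ext)) byExtension.set(ext, language);
    }
  }
  return byExtension;
}

const languageByExtension = invertExtensions(codeexts);
const codetypename = languageByExtension.get('jscad') || '';
console.log(codetypename);
```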

joshgoebel · 2020-02-06T16:52:46Z

I think for now we'll keep adding these aliases unless someone wanted to make a PR and do all the work of splitting them out. Not a high priority for me.

If we did split them out I think originally we'd still have to merge extensions and aliases for the same behavior as we have now (to not break anything). So it would really just be a data enhancement to the library to let people work with extensions/query them, etc. if they wanted to.

joshgoebel · 2020-02-06T16:54:39Z

it isn't anything fancy:

Pretty sure we already have most of those as aliases.

joshgoebel · 2020-02-06T17:32:50Z

@taufik-nurrohman Any chance you want to do this work and make a PR? :-)

taufik-nurrohman · 2020-02-06T22:15:27Z

@yyyc514 It requires editing the core then. I could do it but maybe need to be very careful.

If we did split them out I think originally we'd still have to merge extensions and aliases for the same behavior as we have now (to not break anything).

Hotfix:

scope.aliases = (scope.aliases || []).concat(scope.extensions || []); // also works for languages that define no aliases

joshgoebel · 2020-02-06T22:17:18Z

Essentially, but I was really asking about the work of going thru all the 285 files and trying to make sense of extensions for all of them. :-)

Writing the one-liner is the easy part. :-)

joshgoebel · 2020-02-06T22:18:35Z

This type of work would pair well with:

#2394

joshgoebel · 2020-05-07T03:37:36Z

No one seems to be really pushing for this. Closing due to inactivity.

taufik-nurrohman · 2020-05-07T05:32:56Z

Would like to see this kind of request #2523 being added so that I can make a plugin related to this outside of highlight.js core.

joshgoebel · 2020-05-07T05:34:44Z

@taufik-nurrohman If you wanted to whip up a PR for #2523 it'd probably be pretty simple...

taufik-nurrohman mentioned this issue Feb 1, 2020

enh(php) Add additional keywords, built-in classes, and '<?=' syntax #2372

Merged

joshgoebel mentioned this issue Feb 25, 2020

add(php-template) Explicit language to detect PHP templates #2417

Merged

joshgoebel closed this as completed May 7, 2020

taufik-nurrohman mentioned this issue May 7, 2020

Add hljs.registerAlias Method #2540

Merged
