File Extension as Aliases #2371

Closed
taufik-nurrohman opened this issue Jan 28, 2020 · 29 comments

Comments

@taufik-nurrohman
Member

taufik-nurrohman commented Jan 28, 2020

I suggest adding each language's common file extensions to its aliases. This would make things easier for developers who want to detect the language from the file extension instead of depending on the built-in language detection. For example, apache should have an alias htaccess too.

Human Readable Name    File Extension
Apache                 HTACCESS
JavaScript             JS
Markdown               MD, MKD
Text                   TXT
YAML                   YML, YAML
let path = './path/to/file.js',
    container = document.querySelector('pre > code');

fetch(path)
    .then(response => response.text()) // `text()` returns a promise, so resolve it before highlighting
    .then(source => {
        let x = path.split('.').pop(), // Take file extension as the language name
            out = hljs.highlight(x, source);
        container.className += ' hljs ' + out.language;
        container.innerHTML = out.value;
    });

Imagine someone makes a git repository viewer application and then uses highlight.js to color the code syntax in their files. Detecting the language automatically from file extensions will be more robust than reading the file contents over and over with the available language packages until the most relevant match is found.

This also opens up various possibilities to load language packages asynchronously based on file extensions, so the amount of data transferred will be much smaller considering that JavaScript works on the client side.
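The lazy-loading idea could look roughly like the sketch below. It is only an illustration: `EXTENSION_TO_LANGUAGE`, `loadLanguageForPath` and the `loadGrammar` callback are hypothetical names, not part of highlight.js, and the application would decide how a grammar is actually fetched (a dynamic import, an injected script tag, and so on).

```javascript
// Hypothetical extension-to-language table owned by the application.
const EXTENSION_TO_LANGUAGE = new Map([
  ['js', 'javascript'],
  ['md', 'markdown'],
  ['yml', 'yaml'],
  ['yaml', 'yaml'],
  ['htaccess', 'apache'],
]);

// Cache of grammars that are loaded or currently loading: language name -> promise.
const loadedGrammars = new Map();

// `loadGrammar(languageName)` is supplied by the caller (assumption);
// it may return the grammar directly or a promise for it.
function loadLanguageForPath(path, loadGrammar) {
  const ext = path.split('.').pop().toLowerCase();
  const language = EXTENSION_TO_LANGUAGE.get(ext);
  if (!language) return Promise.resolve(null); // unknown extension: nothing to load

  if (!loadedGrammars.has(language)) {
    // Each grammar is requested at most once, however many files use it.
    loadedGrammars.set(language, Promise.resolve(loadGrammar(language)));
  }
  return loadedGrammars.get(language).then(grammar => ({ language, grammar }));
}
```

With something like this, only the grammars for extensions that actually appear on the page would be transferred.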

@joshgoebel
Member

joshgoebel commented Jan 29, 2020

Detecting the language automatically from file extensions will be more robust than reading the file contents over and over with the available language packages until the most relevant match is found.

Of course you wouldn't actually ever do it over and over. If performance mattered at all you'd only do the auto-detect a single time and then cache that result for future use. That said if you can trust the extensions, that could be much faster.

Of course historically we don't care about extensions since we aren't dealing with them, we're only dealing with source.


My suggestion here is that we make an actual extensions key so that extensions can be registered differently than aliases. This would also help the stated use case here... as file.javascript would technically be a FALSE positive as javascript is not a valid extension for JavaScript. And god forbid the confusion that ensues when one language's extension happens to be the name (or common alias) of another language - though hopefully that's only a hypothetical problem.

So something like:

// plaintext.js
{
  aliases: ['text'],
  extensions: ['txt'],
  ...
}
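To make the distinction concrete, here is a minimal sketch of a lookup that keeps the two lists apart. `createRegistry` and its methods are hypothetical and are not how highlight.js stores languages internally; the point is only that `javascript` resolves as an alias but not as an extension.

```javascript
// Hypothetical registry that keeps aliases and extensions in separate maps.
function createRegistry() {
  const aliasToName = new Map();      // name or alias -> language name
  const extensionToNames = new Map(); // extension -> list of language names

  function register(name, { aliases = [], extensions = [] } = {}) {
    for (const key of [name, ...aliases]) {
      aliasToName.set(key.toLowerCase(), name);
    }
    for (const ext of extensions) {
      const key = ext.toLowerCase();
      if (!extensionToNames.has(key)) extensionToNames.set(key, []);
      extensionToNames.get(key).push(name);
    }
  }

  return {
    register,
    // Lookup by name or alias only; extensions are ignored here.
    byAlias: key => aliasToName.get(key.toLowerCase()) ?? null,
    // Lookup by extension only; names and aliases are ignored here.
    byExtension: ext => extensionToNames.get(ext.toLowerCase()) ?? [],
  };
}

const registry = createRegistry();
registry.register('plaintext', { aliases: ['text'], extensions: ['txt'] });
registry.register('javascript', { aliases: ['js'], extensions: ['js', 'mjs'] });
```

Here `registry.byExtension('javascript')` returns an empty list, so `file.javascript` would no longer be a false positive, while `registry.byAlias('javascript')` still works.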

@joshgoebel
Member

@isagalaev Do you have any historical context on what the actual original intention was here - that might guide us or add to the discussion?

I'm pretty sure this wasn't the original intent because we're sorely lacking some obvious extensions (like htaccess, as mentioned)... yet for many languages it sure appears we are already doing this.

@egor-rogov Any thoughts?

@joshgoebel
Member

joshgoebel commented Jan 29, 2020

The advantage of keeping them separate:

  • Easier to split them now (before we have all the data) than later
  • Easy to have different behavior for each, or user customizable behavior (impossible if they are just in one huge list)
  • Also easy to have exactly the same behavior as one big list just by registering them all as aliases.

The only real downside I see is that we need to go over existing data and split out the extensions from the aliases, and in the short term this would likely mean we need to treat them as behaviorally the same...

Although it does lead to the question: should an extension SOMETIMES also be an alias? Like js is often used in place of javascript... making it a true alias, not just the extension. Does that distinction matter? Splitting them out means we have to answer sticky questions like that.

One thing I'd love to know is a rough count of how many extensions we already have as aliases.

If it's already quite high, perhaps we just roll with it... but if it's pretty low, then I think this is worth a moment of thought.

@egor-rogov
Collaborator

And god forbid the confusion that ensues when one language's extension happens to be the name (or common alias) of another language - though hopefully that's only a hypothetical problem.

I'm afraid it's not that hypothetical. For example, *.sql file can easily be from PostgreSQL, Oracle or any other database.
Highlight.js deals with code blocks and knows nothing about files or their extensions. Since it's the application's responsibility to find out extensions, I feel like it should also be the application's responsibility to map extensions to HLJS language names/aliases.

Of course we can think about how to make it easier, but I'm afraid it's not straightforward. We can, for example, provide the separate list of extensions (as you suggested), but allow different languages to have same extensions etc., and let the application make the final decision.

@joshgoebel
Member

I'm afraid it's not that hypothetical. For example, *.sql file can easily be from PostgreSQL, Oracle or any other database.

Good point.

Highlight.js deals with code blocks and knows nothing about files or their extensions. Since it's the application's responsibility to find out extensions, I feel like it should also be the application's responsibility to map extensions to HLJS language names/aliases.

This feels right at first glance.

I think aliases like "js" and "rb" are pretty common (I use them all the time on GitHub) but I think the KEY here might be that they are used because they are SHORTCUTS, not because they are extensions. So one wouldn't write htaccess when "apache" was shorter... although we currently also have "apacheconf" as an alias for apache. Honestly that's probably backwards though... "apacheconf" is really the name and apache might be the alias. Though then you could argue "htaccess" is shorter...

I guess right now it's all a bit muddled, which is why my mind instantly thought about creating the separation, but I'm afraid then it will be hard to "prove" something is or isn't an alias... extensions are pretty well defined though.

We can, for example, provide the separate list of extensions (as you suggested), but allow different languages to have same extensions etc., and let the application make the final decision.

Sounds reasonable but how does that work when the block is class="lang-sql"... does it just find all potential sql matches and then run those thru the auto-detect?

And if we're TRULY going to add extensions, does that mean we need an ext-sql type namespace for people who'd like to syntax highlight based on extension? I think that would be preferable to lang-x where x is ambiguous.

@joshgoebel
Member

And we have ridiculous things like:

aliases: ['php', 'php3', 'php4', 'php5', 'php6', 'php7'],

Which seem entirely unnecessary as aliases... I don't think they are extensions either.

@joshgoebel
Member

Then you have categories also like assembly... .s, .S, .asm seems to be the trend, but is that x86 assembly? ARM assembly? mips? And "subtypes" like Arduino, which has the same extensions as CPP.

@joshgoebel
Member

In a quick review it seems the ship might have already sailed on adding extensions to aliases... so I'm leaning towards approving things (like the PR to add htaccess to apacheconf), at least when they seem clear and unambiguous... but that still does leave a lot of the questions above open.

Honestly though I wonder if anyone wants to put in the work to split the existing aliases out into extensions... I'm not too excited about doing it... perhaps we just soldier on with aliases until it becomes a larger problem?

Right now someone who wanted to load up a bunch of conflicting aliases would have to deal with it by hand, or simply not rely on the aliases to work since really the last language loaded would be the "winner"...
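As a rough model of that "last one wins" behavior (a simplified stand-in, not the actual highlight.js registration code), a flat alias table can be built while recording every overwrite. The `asm` alias on both languages below is a hypothetical choice made only to force a collision.

```javascript
// Simplified model of a flat alias table where later registrations overwrite earlier ones.
function buildAliasTable(languages) {
  const table = new Map();
  const conflicts = [];
  for (const { name, aliases } of languages) {
    for (const alias of aliases) {
      if (table.has(alias) && table.get(alias) !== name) {
        // Record the collision so the application can resolve it by hand.
        conflicts.push({ alias, previous: table.get(alias), winner: name });
      }
      table.set(alias, name); // the last registration wins
    }
  }
  return { table, conflicts };
}

// Hypothetical alias lists: both languages claim `asm`.
const { table, conflicts } = buildAliasTable([
  { name: 'x86asm', aliases: ['asm'] },
  { name: 'armasm', aliases: ['asm', 'arm'] },
]);
```

After this runs, `table.get('asm')` is `'armasm'` and `conflicts` holds one entry naming `x86asm` as the overwritten owner.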

@egor-rogov
Collaborator

Extension-based autodetection looks reasonable.
We can support two use cases:

  1. The application uses an API we provide to get the list of languages by extension. Then the app decides which one is appropriate and passes it to HLJS (class="..." or class="lang-...").
  2. The application passes the extension right to HLJS (class="ext-...") and we run autodetection within the list of possible languages.
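The two use cases could be sketched as follows. `LANGUAGES_BY_EXTENSION`, `languagesForExtension` and `highlightByExtension` are hypothetical names; the `detect` callback stands in for a call such as `hljs.highlightAuto(code, languageSubset)`, which already accepts an optional list of languages to restrict auto-detection to.

```javascript
// Hypothetical table: one extension may map to several languages.
const LANGUAGES_BY_EXTENSION = new Map([
  ['sql', ['sql', 'pgsql']],
  ['ino', ['arduino', 'cpp']],
  ['js', ['javascript']],
]);

// Use case 1: return the candidates and let the application pick one.
function languagesForExtension(ext) {
  return LANGUAGES_BY_EXTENSION.get(ext.toLowerCase()) || [];
}

// Use case 2: run auto-detection only within the candidates.
// `detect(code, subset)` is supplied by the caller, for example
// (code, subset) => hljs.highlightAuto(code, subset).
function highlightByExtension(ext, code, detect) {
  const candidates = languagesForExtension(ext);
  if (candidates.length === 0) {
    return detect(code); // no hint available: fall back to full auto-detection
  }
  return detect(code, candidates);
}
```

A call might look like `highlightByExtension('sql', source, (code, subset) => hljs.highlightAuto(code, subset))`, assuming `hljs` is loaded on the page.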

if we're TRULY going to add extensions

Looks like something useful to me, but surely not the first priority.

@egor-rogov
Collaborator

I'm leaning towards approving things (like the PR to add htaccess to apacheconf), at least when they seem clear and unambiguous

Agree.

@joshgoebel
Member

joshgoebel commented Jan 29, 2020

Extension-based autodetection looks reasonable.

This starts to smell a little like shebang lines though (just another way to detect/categorize)... and I don't think you were super encouraging of that as a core feature. What would make this different?

Well, maybe it's a little different since we already seem to do it via aliases. :-)

@egor-rogov
Collaborator

Well, the "extensions feature" doesn't change the way HLJS works. It's just a matter of narrowing down the list of languages for autodetection and passing it to the existing API.

On the other hand, the shebang is inside the code, and it is the grammar that we use to deal with the code. It doesn't look right to teach HLJS to look into the code using means other than the grammar. (It's okay for the application to sniff the code to be highlighted, find the shebang, parse it somehow, and pass the language to the existing HLJS API, though.)
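On the application side, that kind of shebang sniffing might look like the sketch below. The interpreter table and the `languageFromShebang` helper are hypothetical; the result would simply be passed to the existing highlighting API as a language name.

```javascript
// Hypothetical interpreter-to-language table maintained by the application.
const INTERPRETER_TO_LANGUAGE = new Map([
  ['sh', 'bash'],
  ['bash', 'bash'],
  ['python', 'python'],
  ['python3', 'python'],
  ['node', 'javascript'],
  ['ruby', 'ruby'],
]);

// Returns a language name, or null when there is no usable shebang line.
function languageFromShebang(code) {
  const firstLine = code.split('\n', 1)[0].trim();
  if (!firstLine.startsWith('#!')) return null;

  const parts = firstLine.slice(2).trim().split(/\s+/);
  let interpreter = parts[0].split('/').pop(); // "/usr/bin/python3" -> "python3"

  // "#!/usr/bin/env python3": the interpreter is the first non-option argument.
  if (interpreter === 'env') {
    interpreter = parts.slice(1).find(arg => !arg.startsWith('-')) || '';
  }
  return INTERPRETER_TO_LANGUAGE.get(interpreter) || null;
}
```

When this returns null the application can fall back to whatever it did before, such as full auto-detection.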

It's just how I feel it, of course. Perhaps I'm wrong.

@joshgoebel
Member

joshgoebel commented Jan 29, 2020

It's just a matter of narrowing down the list of languages for autodetection and passing it to the existing API.

I think you could potentially say the same for shebang data... really aren't we talking about whether a grammar can host data that we don't use DIRECTLY, but rather plugins or the source application could use indirectly to help correctly categorize a particular file/snippet?

If we are ok hosting extension data, BUT we don't use them directly, then why not host shebang lines... or any other "per-language" meta-data that might prove useful in general? And if we're not such a repository, then perhaps we shouldn't host extension data at all? Say "that is external to us, we only look at code"....

You could even argue shebang analysis is more in-scope than extensions... since shebang is part of the code itself... where-as extensions (and filenames in general) exist completely outside that sphere. :-)

@joshgoebel
Member

It would be great to know if anyone is currently categorizing snippets by extension alone like in the use case mentioned in the first post here.

@egor-rogov
Collaborator

really aren't we talking about whether a grammar can host data that we don't use DIRECTLY, but rather plugins or the source application could use indirectly to help correctly categorize a particular file/snippet?

Hmm. If we're talking about storing some metadata... you almost convinced me. I think we shall return to the discussion later in more detail.
(original shebang proposal: #2174)

@taufik-nurrohman
Member Author

Looks like something useful to me, but surely not the first priority.

At least it is standardized. Some web server configurations use file extensions as a reference to automatically determine MIME types for the browsers. I think this is pretty standard as the configuration data will be served for any browser, anywhere. And so file extensions can be used as language categories too.

@joshgoebel
Member

joshgoebel commented Jan 30, 2020

At least it is standardized. Some web server configurations use file extensions as a reference to automatically determine MIME types for the browsers.

But the standard you're talking about there is extension to mime type mapping... that doesn't directly help us since we don't have a list of canonical extensions or a list of mime types. If you're merely saying it's helpful to be able to map from an extension to knowing what the file is, there is no disagreement on that point. :-)

I think the long-term question here is HOW we should encode that data... whether to continue using aliases or to split the data out into its own field.

@anwarhahjjeffersongeorge

anwarhahjjeffersongeorge commented Feb 5, 2020

It would be great to know if anyone is currently categorizing snippets by extension alone like in the use case mentioned in the first post here.

ObservableHQ uses highlight.js. I made an Observable notebook that lets people include code snippets in their Observables by referencing their URLs and parsing the URL contents into Markdown. Since the user supplies a URL, I automatically get the file extension with it, and I've been using that to figure out which tag (the alias) to put in the generated Markdown block.

The problem is that as it stands, I have to hard-code in a bunch of file extensions and their aliases, and that seems like the wrong way to do it.

If there was an independent mapping between the language full names and the extensions, this would help.

@joshgoebel
Member

@anwarhahjjeffersongeorge Why don't you share your list of mappings just so we see what that looks like.

@anwarhahjjeffersongeorge

anwarhahjjeffersongeorge commented Feb 6, 2020

@yyyc514 no problem. it isn't anything fancy:

    codeexts: { // these are the code extensions I'm dealing with
      javascript: ['mjs', 'ejs', 'js', 'jscad'],
      typescript: ['ts'],
      openscad: ['scad'],
      processing: ['pde'],
      arduino: ['ino'],
      c: ['c', 'h'],
      cpp: ['cpp', 'hpp', 'cxx', 'hxx'],
      bash: ['sh'],
      python: ['py'],
    }

For my use case, I just do something like

let codetypename = ''
for (let key in codeexts) {
  if (codeexts[key].includes(ext)) {
    codetypename = key
    break
  }
}

But I can see this might not work for the whole highlight.js library since I have to add custom ones, like jscad, anyway. I don't know if this helps your discussion at all.
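One possible variation on the loop above, shown only as a sketch, is to invert the table once so that each lookup is a single `Map` access. The `codeexts` object here is a shortened copy of the one posted above, and `languageByExt` and `languageForExt` are hypothetical names.

```javascript
// Shortened copy of the `codeexts` table from the comment above.
const codeexts = {
  javascript: ['mjs', 'ejs', 'js', 'jscad'],
  cpp: ['cpp', 'hpp', 'cxx', 'hxx'],
  bash: ['sh'],
};

// Invert once: extension -> language name.
const languageByExt = new Map();
for (const [language, exts] of Object.entries(codeexts)) {
  for (const ext of exts) {
    languageByExt.set(ext, language);
  }
}

// Returns '' for unknown extensions, matching the default in the loop version.
function languageForExt(ext) {
  return languageByExt.get(ext.toLowerCase()) || '';
}
```

Custom entries such as `jscad` stay in the application's own table, so nothing in highlight.js has to know about them.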

@joshgoebel
Member

I think for now we'll keep adding these aliases unless someone wanted to make a PR and do all the work of splitting them out. Not a high priority for me.

If we did split them out I think originally we'd still have to merge extensions and aliases for the same behavior as we have now (to not break anything). So it would really just be a data enhancement to the library to let people work with extensions/query them, etc. if they wanted to.

@joshgoebel
Member

it isn't anything fancy:

Pretty sure we already have most of those as aliases.

@joshgoebel
Member

@taufik-nurrohman Any chance you want to do this work and make a PR? :-)

@taufik-nurrohman
Member Author

@yyyc514 It requires editing the core then. I could do it but maybe need to be very careful.

If we did split them out I think originally we'd still have to merge extensions and aliases for the same behavior as we have now (to not break anything).

Hotfix:

scope.aliases = scope.aliases.concat(scope.extensions || []);

@joshgoebel
Member

Essentially, but I was really asking about the work of going thru all the 285 files and trying to make sense of extensions for all of them. :-)

Writing the one-liner is the easy part. :-)

@joshgoebel
Member

This type of work would pair well with:

#2394

@joshgoebel
Member

No one seems to be really pushing for this. Closing due to inactivity.

@taufik-nurrohman
Member Author

taufik-nurrohman commented May 7, 2020

I would like to see this kind of request (#2523) added so that I can make a plugin for this outside of highlight.js core.

@joshgoebel
Member

@taufik-nurrohman If you wanted to whip up a PR for #2523 it'd probably be pretty simple...


4 participants