enh(autodetect) multiple autodetect fixes #2745

joshgoebel · 2020-10-08T15:34:30Z

These changes were made using a modified test suite from https://github.com/andreasjansson/language-detection.el

Before:

67% detection is correct. 80% first OR second detected language is correct.

After:

72% detection is correct.  84% first OR second detected language is correct.

So a 4-5% measurable improvement, and there are probably additional improvements "hidden" in matches that are still not correct but are likely closer than they were before, though that's hard to measure. (It's very hard to tell Lisp-like languages apart, for example.)
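For anyone wanting to reproduce numbers of this kind, here is a minimal sketch of how the two percentages could be tallied. The record shape (`expected`, `best`, `secondBest`) and the sample data are hypothetical; with highlight.js 10.x the last two would presumably come from the `language` and `second_best` fields of the `hljs.highlightAuto` result. This is not the modified test suite itself.

```javascript
// Tally "first correct" and "first OR second correct" percentages.
// ASSUMPTION: each record holds the labelled language plus the detector's
// best and second-best guesses; the field names are illustrative only.
function tallyAccuracy(results) {
  let first = 0;
  let firstOrSecond = 0;
  for (const { expected, best, secondBest } of results) {
    if (best === expected) {
      first += 1;
      firstOrSecond += 1;
    } else if (secondBest === expected) {
      firstOrSecond += 1;
    }
  }
  // Whole-number percentage; an empty sample set reports 0.
  const pct = (n) =>
    results.length === 0 ? 0 : Math.round((100 * n) / results.length);
  return { first: pct(first), firstOrSecond: pct(firstOrSecond) };
}

// Made-up sample data, only to show the shape of the calculation.
const demo = tallyAccuracy([
  { expected: "ruby", best: "ruby", secondBest: "crystal" },
  { expected: "lisp", best: "clojure", secondBest: "lisp" },
  { expected: "sql", best: "ruby", secondBest: "perl" },
  { expected: "css", best: "css", secondBest: "scss" },
]);
console.log(demo); // { first: 50, firstOrSecond: 75 }
```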

There were a few common areas of focus:

- Remove incorrect `illegal` rules so auto-detect can work for languages that were wrongly being flagged as illegal.
- Remove the double-scoring element of many `beginKeywords` rules... I'm wondering if this shouldn't be done at the parser level itself.
- Add `illegal` in a few places to make false positives less likely.
- Increase the match precision of some rules (which also guards against false positives).
- Reduce the relevance of rules that could match almost anything (name rules for Lisp-like languages, `||` in Ruby matching params when it could be OR or a concat operator in SQL, etc.).
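To make these focus areas concrete, here is a rough sketch of what such tuning can look like in a grammar definition. The grammar below is hypothetical and is not one of the grammars touched by this PR; it only uses the standard mode attributes `illegal`, `beginKeywords` and `relevance`, and it assumes that a mode without an explicit `relevance` scores 1 when it matches.

```javascript
// HYPOTHETICAL grammar used purely for illustration.
function sketchGrammar() {
  return {
    name: "Sketch",
    keywords: "def end",
    // Text that should never appear in this language: seeing it lets
    // auto-detect rule the language out instead of scoring it.
    illegal: /<\//,
    contains: [
      {
        className: "function",
        beginKeywords: "def",
        end: /$/,
        // The keyword itself already earns relevance, so the rule is
        // zeroed to avoid counting the same evidence twice.
        relevance: 0,
      },
      {
        className: "name",
        begin: /[a-zA-Z_][\w-]*/,
        // Matches nearly any identifier in any language, so a match
        // says nothing about which language this is.
        relevance: 0,
      },
    ],
  };
}

const grammar = sketchGrammar();
console.log(grammar.contains.map((mode) => mode.relevance));
```

The design choice here is that relevance should only be granted for constructs that actually distinguish a language; generic rules still highlight, they just stop voting.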

joshgoebel · 2020-10-16T16:45:35Z

Moving this to 10.4 also since it'll probably conflict with the UTF8 stuff and I still want to merge in our long lost CSS branch also...

joshgoebel · 2020-11-03T03:20:00Z

This is definitely going to make more sense if you view the commits one at a time vs as a whole.

egor-rogov

Extensive work! Generally looks good, see some comments.

src/languages/bash.js

egor-rogov · 2020-11-14T21:39:12Z

src/languages/css.js

@@ -49,7 +49,7 @@ export default function(hljs) {
  var AT_PROPERTY_RE = /@-?\w[\w]*(-\w+)*/ // @-webkit-keyframes
  var IDENT_RE = '[a-zA-Z-][a-zA-Z0-9_-]*';
  var RULE = {
-    begin: /(?:[A-Z_.-]+|--[a-zA-Z0-9_-]+)\s*:/, returnBegin: true, end: ';', endsWithParent: true,
+    begin: /([*]\s?)?(?:[A-Z_.\-\\]+|--[a-zA-Z0-9_-]+)\s*(\/\*\*\/)?:/, returnBegin: true, end: ';', endsWithParent: true,


What is /**/ meant for?

Common css hack from the old days.
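As a quick illustration of the two hacks, the snippet below runs the old and new `begin` patterns from the diff against one example of each. The `i` flag is an assumption standing in for the grammar being case-insensitive, and the `matchesAtStart` helper and sample declarations are made up for this sketch.

```javascript
// The two `begin` patterns from the diff. ASSUMPTION: the `i` flag stands in
// for the CSS grammar being compiled as case-insensitive.
const OLD_RULE = /(?:[A-Z_.-]+|--[a-zA-Z0-9_-]+)\s*:/i;
const NEW_RULE = /([*]\s?)?(?:[A-Z_.\-\\]+|--[a-zA-Z0-9_-]+)\s*(\/\*\*\/)?:/i;

// Hypothetical helper: does the pattern match starting at the first character?
function matchesAtStart(re, text) {
  const m = re.exec(text);
  return m !== null && m.index === 0;
}

// Star hack: a leading `*` before the property name.
console.log(matchesAtStart(OLD_RULE, "*zoom: 1")); // false (match starts at "zoom")
console.log(matchesAtStart(NEW_RULE, "*zoom: 1")); // true

// Comment hack: an empty comment between the property name and the colon.
console.log(OLD_RULE.test("color/**/: red")); // false
console.log(matchesAtStart(NEW_RULE, "color/**/: red")); // true
```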

src/languages/php.js

egor-rogov · 2020-11-14T22:15:13Z

src/languages/tcl.js

-            begin: '\\$(\\{)?(::)?[a-zA-Z_]((::)?[a-zA-Z0-9_])*\\(([a-zA-Z0-9_])*\\)',
-            end: '[^a-zA-Z0-9_\\}\\$]'
+            begin: '\\$(::)?[a-zA-Z_]+((::)?[a-zA-Z0-9_]+)*' + regex.optional(ARRAY_ACCESS),
          },
          {
-            begin: '\\$(\\{)?(::)?[a-zA-Z_]((::)?[a-zA-Z0-9_])*',
-            end: '(\\))?[^a-zA-Z0-9_\\}\\$]'
+            begin: '\\$\\{(::)?[a-zA-Z_]((::)?[a-zA-Z0-9_])*' + regex.optional(ARRAY_ACCESS) + '\\}',


Hmm... This doesn't look right to me. Consider e.g. $foo($bar): the array index can be a full-blown expression. You can have more fun:

set foo bar
set ${foo}(1) 42 ; # this is not captured by the grammar
puts $bar(1)

It has nothing to do with autodetection fixes though.

Do you actually know Tcl, or did you just look it up? I'll go back and review this again because I think maybe I see some issues now, but I'm not sure. It'd be really helpful to have some additional tests to show what's supposed to be possible here. On some of these languages I'm really shooting blind, just looking at the examples I have and what the existing rules seem to do.

I used to know it... wrote several programs in Tcl/Tk long ago. Now I can hardly remember anything.
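The cases raised in this thread can be checked against the two new patterns with a small sketch. Note the assumptions: `ARRAY_ACCESS` is not shown in the diff excerpt, so the definition below is a guess modelled on the removed `\(([a-zA-Z0-9_])*\)` fragment, and `optional` is a local stand-in for `regex.optional`, assumed to wrap its argument in an optional group.

```javascript
// Local stand-in for the `regex.optional` helper (ASSUMED behaviour).
const optional = (source) => "(" + source + ")?";

// ASSUMPTION: not visible in the diff excerpt; modelled on the removed rule.
const ARRAY_ACCESS = "\\(([a-zA-Z0-9_])*\\)";

// The two new `begin` patterns from the diff, built as RegExp objects.
const PLAIN_VAR = new RegExp(
  "\\$(::)?[a-zA-Z_]+((::)?[a-zA-Z0-9_]+)*" + optional(ARRAY_ACCESS)
);
const BRACED_VAR = new RegExp(
  "\\$\\{(::)?[a-zA-Z_]((::)?[a-zA-Z0-9_])*" + optional(ARRAY_ACCESS) + "\\}"
);

// Hypothetical helper: text of the first match, or null.
function firstMatch(re, text) {
  const m = re.exec(text);
  return m ? m[0] : null;
}

console.log(firstMatch(PLAIN_VAR, "puts $bar(1)")); // "$bar(1)"
// The index is itself a variable, so only the name is captured.
console.log(firstMatch(PLAIN_VAR, "$foo($bar)")); // "$foo"
// The index after the closing brace is left out of the match.
console.log(firstMatch(BRACED_VAR, "set ${foo}(1) 42")); // "${foo}"
```

Under these assumptions, a simple literal index is captured, while an expression index or an index following `${...}` is not, which is the gap described above.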

src/languages/vbscript.js

- add `__FILE__` to keywords
- add `proc` and `lambda` Kernel methods to build_ins
- stricter rule for identifying method definition
- highlight variables
- `|` style params now get no relevance (can be too many other things)
- add SHEBANG rule
- make Ruby REPL matching a little stricter

joshgoebel changed the title ~~enh(autodetect) multiple autodetect fixes~~ WIP: enh(autodetect) multiple autodetect fixes Oct 8, 2020

joshgoebel force-pushed the autodetect_fixes branch from 119031e to f9f3bc8 Compare October 9, 2020 17:08

joshgoebel modified the milestones: 10.3, 10.4 Oct 14, 2020

joshgoebel added the WIP label Oct 20, 2020

joshgoebel changed the title ~~WIP: enh(autodetect) multiple autodetect fixes~~ enh(autodetect) multiple autodetect fixes Oct 29, 2020

joshgoebel removed the WIP label Oct 29, 2020

joshgoebel requested a review from egor-rogov October 29, 2020 12:47

joshgoebel force-pushed the autodetect_fixes branch from 8297e73 to d64762e Compare October 29, 2020 12:47

joshgoebel requested a review from allejo October 29, 2020 12:47

joshgoebel force-pushed the autodetect_fixes branch 2 times, most recently from 6478b01 to e18cd94 Compare November 3, 2020 03:09

egor-rogov approved these changes Nov 14, 2020

joshgoebel added 14 commits November 14, 2020 21:54

fix(autodetect) swift should not get double relevance for import

ca540b0

fix(autodetect) css can include a forward slash

d6e9d22

fix(autodetect) css class selectors must be valid identifiers

f55bc37

fix(autodetect) css: allow extra ;

dd02953

fix(autodetect) improve rule matcher

5bdd357

- can start with `*` (css hacks)
- can include a comment after attribute name before : (css hacks)

enh(autodetect) csharp: improve autodetection

dc6bbaf

- `value` is too common variable name to score points as keyword
- reduce 2x relevance for beginKeywords
- bump csharp relevance slightly

enh(autodetect) clojure: reduce runaway relevance

8e7b582

- operators get 0 relevance (consistency: no other grammars score them)
- "name" gets 0 relevance since almost any identifier will match

This reduces false positives in the language-detection.el rosetta data set significantly.

enh(autodetect) matlab: remove relevancy from i and j

aa45193

enh(autodetect) groovy

ad2aea4

- Add relevance for groovy shebang line
- Ternary should not grant extra relevance

enh(autodetect) lisp: tune relevancy

659f794

- "name" gets 0 relevance since almost any identifier will match

Applying same logic as used with Clojure.

enh(autodetect) php: improve auto-detection

c4062f3

- only count => in `fn` context
- prevent beginKeywords double relevancy
- reduce relevance of `match`

enh(autodetect) add additional common keywords

36ab9a2

enh(autodetect) java: relevance boost for import java.*.

e36c161

enh(autodetect) python: self is super common convention

250fa62

joshgoebel added 16 commits November 14, 2020 22:35

enh(autodetect) groovy: reduce @meta tags relevance

e4319db

enh(autodetect) vbscript: improve auto-detection

2421e24

- built-ins should only match if they are a call
- fix function detection

enh(autodetect) r: detect <-, illegal: ->

656402c

enh(autodetect) fewer false positives on variables

de11ad1

For languages with $ident and @Ident style variables this attempts to prevent false positives for $ident$ and @Ident@ type expressions, which are likely something else entirely.

- bash
- perl
- php
- ruby
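One way such a guard could be written is with a negative lookahead after the identifier. The pattern below is a simplified sketch, not the rule used in any of the four grammars; the identifier syntax and the choice of trailing characters to reject are assumptions.

```javascript
// SKETCH ONLY: a sigil, an identifier, and then a lookahead that refuses to
// match when another word character or sigil follows immediately.
// ASSUMPTION: simplified identifier syntax; real grammars differ per language.
const VARIABLE = /[$@][A-Za-z_]\w*(?![\w$@])/;

// Hypothetical helper: text of the first match, or null.
function findVariable(text) {
  const m = VARIABLE.exec(text);
  return m ? m[0] : null;
}

console.log(findVariable("echo $HOME")); // "$HOME"
console.log(findVariable("@count + 1")); // "@count"
console.log(findVariable("$ident$")); // null
console.log(findVariable("@Ident@")); // null
```

Because the lookahead also rejects word characters, the engine cannot fall back to a shorter prefix such as `$iden`, so delimiter-style text is skipped entirely instead of being partially matched.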

fix(autodetect) yaml: simple numbers should not add relevancy

49f430a

fix(autodetect) crystal: lower function relevance (5 -> 2)

de74ab8

fix(autodetect) hy/scheme: bring in line with new name relevance from clojure

f92cc46

fix(autodetect) protobuf: tighten enum item rule

c8c52e8

fix(ocaml) => does not actually seem to be a part of language

b60b1c4

- I looked but couldn't find any reference to this.

fix(parser) add value to common keywords (0 relevance)

d39c107

fix(n1ql) do not hobble relevancy of strings

7c60e51

- There is no reason to do this; every other language gets credit for simple strings.

fix(javascript) remove relevance of ident =>

8e55681

- This is found in other languages and isn't a strong signal.

fix(angelscript/lsl) no relevance for simple numbers

50a0483

fix(properties) auto-detect: no points for ident[space]ident

b0de56d

add comment, fix typos

9d06fe9

- also fix pgsql markup test

joshgoebel mentioned this pull request Nov 15, 2020

enh(autodetect) tcl: improve autodetection #2865

Merged

3 tasks

joshgoebel force-pushed the autodetect_fixes branch from cc6b05e to 9d06fe9 Compare November 15, 2020 03:38

joshgoebel merged commit 8acfeeb into highlightjs:master Nov 15, 2020

joshgoebel deleted the autodetect_fixes branch November 15, 2020 03:50
