Refactor/Rewrite analyze_text #1702

Anteru · 2021-02-06T13:16:28Z

analyze_text is a source of problems and has an impossible task to do -- it's supposed to decide locally how well a lexer matches a given piece of code, and that local score is then used to make a global decision among all lexers. This is leading to tons of issues, and we have a large number of outstanding PRs related to it:

as well as issues like (those have been closed as they are all related to this issue):

There are two "right" solutions to this, and they can be implemented incrementally. The first one is to have analyze_text by default invoke only those lexers which support the given file extension. This will dramatically improve the accuracy, as there will be only 2-3 lexers that need to be coordinated instead of having every lexer participate (and some lexers do accept nearly everything). This can be implemented right now, but it requires touching every single lexer and cleaning up its weights relative to the other candidates.
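To make the first step concrete, here is a minimal sketch of the difference between scoring every lexer and scoring only the ones registered for the file's extension. It is not the Pygments implementation: `ToyLexer`, the three example lexers, their scores, `ALL_LEXERS`, `guess_globally` and `guess_for_filename` are all hypothetical names invented for illustration. Only the method name `analyse_text` and the idea of a `filenames` pattern list mirror what Pygments lexers expose.

```python
from fnmatch import fnmatch


class ToyLexer:
    """Hypothetical stand-in for a lexer class (not the real Pygments API)."""
    name = ""
    filenames = ()  # glob patterns this lexer claims

    @staticmethod
    def analyse_text(text):
        # Local score in [0.0, 1.0]; each lexer decides it in isolation.
        return 0.0


class ToyCLexer(ToyLexer):
    name = "C"
    filenames = ("*.c", "*.h")

    @staticmethod
    def analyse_text(text):
        return 0.3 if "#include" in text else 0.0


class ToyCppLexer(ToyLexer):
    name = "C++"
    filenames = ("*.cpp", "*.hpp", "*.h")

    @staticmethod
    def analyse_text(text):
        return 0.4 if "std::" in text else 0.1


class ToyGreedyLexer(ToyLexer):
    name = "Greedy"
    filenames = ("*.greedy",)

    @staticmethod
    def analyse_text(text):
        # Models a lexer that "accepts nearly everything".
        return 0.9


ALL_LEXERS = [ToyCLexer, ToyCppLexer, ToyGreedyLexer]


def guess_globally(text):
    """Current behaviour in spirit: the highest local score wins, among all lexers."""
    return max(ALL_LEXERS, key=lambda lexer: lexer.analyse_text(text))


def guess_for_filename(filename, text):
    """Proposed behaviour: only lexers matching the filename are scored."""
    candidates = [
        lexer for lexer in ALL_LEXERS
        if any(fnmatch(filename, pattern) for pattern in lexer.filenames)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda lexer: lexer.analyse_text(text))
```

With a C snippet such as `#include <stdio.h>`, `guess_globally` picks the over-eager toy lexer, whereas `guess_for_filename("main.h", ...)` only has to rank the C and C++ candidates against each other. That is the point of the proposal: the weights only need to be consistent within a small group sharing an extension, not across every lexer.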

The second step -- which is more involved -- is to turn to machine learning and train a model that can make global decisions by knowing all languages and being able to weight the probabilities globally. This would be the "endgame" solution for analyze_text, with the hope that we can have a reinforcement process in place to teach the model whenever an issue occurs. That's a much larger task though, similar in scope to linguist (https://github.com/github/linguist).


alanhamlett · 2021-02-10T01:56:07Z

Re-using linguist's training data would be ideal for the second, machine-learning solution.

Anteru added the "help wanted" label (Community help appreciated!) on Mar 5, 2021

amitkummer mentioned this issue Sep 29, 2021

C comments and preprocessor directives not being properly highlighted #1901

Open

Anteru mentioned this issue Dec 29, 2021

Better Lexer.analyse_text #2005

Closed

Anteru mentioned this issue Jan 23, 2022

Priority-based lexer selection in get_lexer_by_name et al. #1603

Open

jeanas mentioned this issue Mar 29, 2022

Please add a JSON5 lexer #1880

Open
