Refactor/Rewrite analyze_text #1702

Open
Anteru opened this issue Feb 6, 2021 · 1 comment
Labels
help wanted Community help appreciated!

Comments

@Anteru
Collaborator

Anteru commented Feb 6, 2021

analyze_text is a source of problems and has an impossible task to do -- it's supposed to decide locally how well a lexer matches a given piece of code, and that score is then used to make a global decision among all lexers. This is leading to tons of issues, and we have a large number of outstanding PRs related to it:

as well as issues like the following (these have been closed because they all trace back to this issue):

There are two "right" solutions to this, and they can be implemented incrementally. The first is to make analyze_text, by default, invoke only the lexers that support the given file extension. This would dramatically improve accuracy, as only 2-3 lexers would need to be coordinated instead of having every lexer participate (and some lexers do accept nearly everything). This can be implemented right now, but it requires touching every single lexer and cleaning up its weights relative to the other candidates.
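To make the first step concrete, here is a minimal sketch of extension-restricted guessing. It is only an illustration under stated assumptions: `SketchLexer`, the three toy lexers, `candidates_for` and `guess_by_extension` are hypothetical names invented for this example and are not the Pygments API, and the weights are made up.

```python
from fnmatch import fnmatch


class SketchLexer:
    """Hypothetical stand-in for a lexer class (not the Pygments base class)."""
    name = "base"
    filenames = []  # glob patterns this lexer claims

    @staticmethod
    def analyze_text(text):
        return 0.0


class CLexer(SketchLexer):
    name = "C"
    filenames = ["*.c", "*.h"]

    @staticmethod
    def analyze_text(text):
        # Made-up weights, for illustration only.
        return 0.5 if "#include" in text else 0.1


class ObjCLexer(SketchLexer):
    name = "Objective-C"
    filenames = ["*.m", "*.h"]

    @staticmethod
    def analyze_text(text):
        return 0.8 if "@interface" in text else 0.0


class GreedyLexer(SketchLexer):
    """A lexer that 'accepts nearly everything' and would win a global vote."""
    name = "Greedy"
    filenames = ["*.grd"]

    @staticmethod
    def analyze_text(text):
        return 0.9


ALL_LEXERS = [CLexer, ObjCLexer, GreedyLexer]


def candidates_for(filename, lexers=ALL_LEXERS):
    """Keep only the lexers whose filename patterns match."""
    return [lx for lx in lexers
            if any(fnmatch(filename, pat) for pat in lx.filenames)]


def guess_by_extension(filename, text, lexers=ALL_LEXERS):
    """Let analyze_text break ties only among extension-matched lexers."""
    pool = candidates_for(filename, lexers)
    if not pool:
        return None  # the caller decides what to do without a match
    return max(pool, key=lambda lx: lx.analyze_text(text))
```

With this shape, the weights of `CLexer` and `ObjCLexer` only need to be consistent with each other for `*.h` files; `GreedyLexer` never enters the comparison, no matter how high its score is.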

The second step -- which is more involved -- is to turn to machine learning and train a model that can make global decisions, because it knows all languages and can weigh the probabilities globally. This would be the "endgame" solution for analyze_text, with the hope that we can have a reinforcement process in place to teach the model whenever an issue occurs. That's a much larger task though, similar in scope to https://github.com/github/linguist
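As a rough illustration of what "deciding globally" could mean, here is a toy sketch of a single model that scores all languages at once and normalises the scores against each other. The issue does not name a model; a naive Bayes token classifier is swapped in here purely as an assumption, and `tokenize` and `TinyNaiveBayes` are hypothetical names, not part of Pygments or linguist.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Very crude tokenizer: identifier-like words and single symbols."""
    return re.findall(r"[A-Za-z_]+|[^\sA-Za-z_]", text)


class TinyNaiveBayes:
    """Toy multinomial naive Bayes over tokens, with add-one smoothing."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token counts
        self.doc_counts = Counter()               # language -> samples seen
        self.vocab = set()

    def train(self, language, text):
        tokens = tokenize(text)
        self.token_counts[language].update(tokens)
        self.doc_counts[language] += 1
        self.vocab.update(tokens)

    def scores(self, text):
        """Return probabilities over all known languages (they sum to 1)."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for lang, counts in self.token_counts.items():
            total = sum(counts.values())
            logp = math.log(self.doc_counts[lang] / total_docs)
            for tok in tokens:
                logp += math.log((counts[tok] + 1) / (total + len(self.vocab)))
            log_scores[lang] = logp
        # Normalise across languages; this is the "global" step that
        # per-lexer analyze_text scores cannot do on their own.
        top = max(log_scores.values())
        exp = {lang: math.exp(s - top) for lang, s in log_scores.items()}
        norm = sum(exp.values())
        return {lang: v / norm for lang, v in exp.items()}

    def predict(self, text):
        probs = self.scores(text)
        return max(probs, key=probs.get)


model = TinyNaiveBayes()
model.train("python", "def foo(self):\n    return self.x\n")
model.train("c", "#include <stdio.h>\nint main(void) { return 0; }\n")
guess = model.predict("def bar(self): return 1")
```

Feeding a misclassified sample back in through `train` is the simplest form of the "teach the model whenever an issue occurs" loop; a real system would need far more training data and a proper evaluation set.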

@alanhamlett
Contributor

Re-using linguist's training would be ideal for the machine learning second solution.
