`analyze_text` is a source of problems and has an impossible task: it's supposed to decide locally how well a lexer matches a given piece of code, and those local scores are then used to make a global decision among all lexers. This is leading to tons of issues, and we have a large number of outstanding PRs related to it, as well as issues that have since been closed because they are all related to this one.
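To make the mismatch concrete, here's a minimal sketch of the scheme as described (hypothetical names, not the actual lexer API): each lexer returns a local 0.0-1.0 confidence with no knowledge of what any other lexer will return, and the guesser just takes the maximum, so every weight implicitly has to be calibrated against every other lexer in existence.

```python
# A sketch of the current scheme (hypothetical names, not the real API).
class PythonLexer:
    filenames = ["*.py"]

    def analyze_text(self, text):
        # Purely local heuristic; knows nothing about other lexers.
        return 0.8 if "def " in text or "import " in text else 0.0

class RubyLexer:
    filenames = ["*.rb"]

    def analyze_text(self, text):
        # "def " is also Ruby, so this score overlaps with PythonLexer's,
        # and only careful relative weighting keeps the winner correct.
        return 0.6 if "def " in text or "require " in text else 0.0

ALL_LEXERS = [PythonLexer(), RubyLexer()]

def guess_lexer(text):
    # A global decision made from uncoordinated local scores.
    return max(ALL_LEXERS, key=lambda lexer: lexer.analyze_text(text))
```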
There are two "right" solutions to this, and they can be implemented incrementally. The first one is to default analyze_text to only invoke lexers which support the given file extension. This will dramatically improve the accuracy, as there will be only 2-3 lexers that need to get coordinated instead of having any lexer participate (and some lexers do accept nearly everything.) This can be implemented right now, but it requires touching every single lexer and clean up the weights relative to other candidates.
The second step -- which is more involved -- is to turn to machine learning and train a model that knows all languages and can therefore weigh the probabilities globally. This would be the "endgame" solution for `analyze_text`, with the hope that we can put a reinforcement process in place to teach the model whenever an issue occurs. That's a much larger task though, similar in scope to https://github.com/github/linguist
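As a rough illustration of that direction, a single classifier trained on labeled snippets produces probabilities that are globally comparable by construction. The toy below uses scikit-learn and three hand-written samples; a real model would be trained on a large corpus such as the samples linguist ships.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data; a real model would learn from a large labeled
# corpus (e.g. the samples that github/linguist ships).
snippets = [
    "def main():\n    import os\n    print(os.name)",
    "def main\n  require 'json'\n  puts JSON.generate({})\nend",
    'package main\n\nimport "fmt"\n\nfunc main() { fmt.Println(1) }',
]
labels = ["python", "ruby", "go"]

# Character n-grams are robust to unfamiliar identifiers and formatting.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    MultinomialNB(),
)
model.fit(snippets, labels)

# One model, one probability distribution over *all* languages at once.
probs = model.predict_proba(["import sys\n\nsys.exit(0)\n"])[0]
print(dict(zip(model.classes_, probs.round(3))))
```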