Duplicate token types and token documentation #1816
Comments
Hi, thanks, that's actually a really nice test. Would you mind adding this to make check and highlighting "unicorn" types that are only used by a single lexer? I think that would be the best approach here. I'll see if we can get them unified, because it's quite clear those one-time uses are not really useful. It would probably make sense to require language-specific token types to be named as
OK, here is a script that greps the source code of the lexers for references to tokens.

#!/usr/bin/env python
"""
Count how often each token is used by the lexers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This makes it possible to find typos in token names,
as those tokens are only used by one lexer.
"""
import sys, argparse, re, pathlib
from pygments import token, lexers
def lookup_all_lexers():
"""
Iterate through all lexers and fetch them.
This should create all tokens that any of the lexers produce.
"""
count = 0
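    # For each registered lexer, look it up via its first alias; if it has no
    # aliases, fall back to its first filename pattern, and failing that to its
    # first MIME type (the else branches below only run when the corresponding
    # loop had nothing to iterate over).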
for (name, aliases, patterns, mimetypes) in lexers.get_all_lexers():
for a in aliases:
l = lexers.get_lexer_by_name(a)
break
else:
for p in patterns:
l = lexers.get_lexer_for_filename(p)
break
else:
for m in mimetypes:
l = lexers.get_lexer_for_mimetype(m)
break
count += 1
return count
def fetch_lexer_sources():
"""
Return the source code of all lexers as a dictionary, mapping filenames
to a list of lines.
"""
lexer_dir = (pathlib.Path(__file__).parent / "../pygments/lexers").resolve()
lexer_sources = {fn: fn.read_text().splitlines(keepends=False) for fn in lexer_dir.glob("*.py")}
return lexer_sources
def sub_tokens(token):
"""
Generator that yields a token and all of its sub-tokens recursively.
"""
yield token
for subtoken in token.subtypes:
yield from sub_tokens(subtoken)
class FileCount:
"""
Stores information about line numbers in a file.
This is used to store from which lines in a files a certain token is
referenced.
"""
def __init__(self, filename):
self.filename = filename
self.lines = []
def __str__(self):
if len(self.lines) > 3:
lines = ", ".join(f"{line:,}" for line in self.lines[:5])
lines = f"{lines}, ... ({len(lines):,} lines)"
else:
lines = ", ".join(f"{line:,}" for line in self.lines)
return f"{self.filename.name}[{lines}]"
def add(self, linenumber):
self.lines.append(linenumber)
def count_lines(self):
return len(self.lines)
class TokenCount:
"""
Stores information about a token and in which files it is referenced.
"""
def __init__(self, token):
self.token = token
self.files = {}
def add(self, filename, linenumber):
if filename not in self.files:
self.files[filename] = FileCount(filename)
self.files[filename].add(linenumber)
def __str__(self):
if len(self.files) > 3:
files = []
for (i, filecount) in enumerate(self.files.values()):
files.append(str(filecount))
if i >= 5:
break
files = ", ".join(files) + f", ... ({len(self.files):,} files)"
else:
files = ", ".join(str(filecount) for filecount in self.files.values())
return f"{self.count_files():,} files, {self.count_lines():,} locations: {files}"
def count_files(self):
return len(self.files)
def count_lines(self):
return sum(fc.count_lines() for fc in self.files.values())
def find_token_references(lexer_sources, args):
"""
Find all references to all tokens in the source code of all lexers.
Note that this can't be 100% reliable, as it searches the source code for
certain patterns: It searches for the last two components of a token name,
i.e. to find references to the token ``Token.Literal.Number.Integer.Long``
it searches for the regular expression ``\\bInteger.Long\\b``. This
    won't work reliably for top-level tokens like ``Token.String``, since these
    are often referred to simply as ``String``, but searching for ``\\bString\\b``
    yields too many false positives.
"""
# Maps token to :class:`TokenCount` objects.
token_references = {}
# Search for each token in each lexer source file and record in which file
# and in which line they are referenced
for t in sub_tokens(token.Token):
parts = list(t)[-2:]
if len(parts) == 0:
name = "Token"
elif len(parts) == 1:
name = f"Token.{parts[0]}"
else:
name = ".".join(parts)
token_references[t] = tokencount = TokenCount(t)
if name != "Token":
pattern = re.compile(f"\\b{name}\\b")
for (filename, sourcelines) in lexer_sources.items():
for (i, line) in enumerate(sourcelines, 1):
if pattern.search(line) is not None:
tokencount.add(filename, i)
if args.subtoken:
t2 = t
while t2 is not token.Token:
t2 = t2.parent
tokencount2 = token_references[t2]
tokencount2.add(filename, i)
return token_references
def print_result(token_references, args):
def key(item):
return (item[1].count_files(), item[1].count_lines())
for (token, locations) in sorted(token_references.items(), key=key):
if args.minfiles <= locations.count_files() <= args.maxfiles and \
args.minlines <= locations.count_lines() <= args.maxlines:
print(f"{token}: {locations}")
def main(args=None):
p = argparse.ArgumentParser(description="Count how often each token is used by the lexers")
p.add_argument("-v", "--verbose", dest="verbose", help="Give more output.", default=False, action="store_true")
p.add_argument("--minfiles", dest="minfiles", metavar="COUNT", type=int, help="Report all tokens referenced by at least COUNT lexer source files (default %(default)s)", default=1)
p.add_argument("--maxfiles", dest="maxfiles", metavar="COUNT", type=int, help="Report all tokens referenced by at most COUNT lexer source files (default %(default)s)", default=2)
p.add_argument("--minlines", dest="minlines", metavar="COUNT", type=int, help="Report all tokens referenced by at least COUNT lexer source lines (default %(default)s)", default=1)
p.add_argument("--maxlines", dest="maxlines", metavar="COUNT", type=int, help="Report all tokens referenced by at most COUNT lexer source lines (default %(default)s)", default=10)
p.add_argument("-s", "--subtoken", dest="subtoken", help="Include count of references to subtokens in the count for each token (default %(default)s)", default=False, action="store_true")
args = p.parse_args(args)
if args.verbose:
print("Looking up all lexers ... ", end="", flush=True)
count = lookup_all_lexers()
if args.verbose:
print(f"found {count:,} lexers")
if args.verbose:
print("Fetching lexer source code ... ", end="", flush=True)
lexer_sources = fetch_lexer_sources()
if args.verbose:
print(f"found {len(lexer_sources):,} lexer source files")
if args.verbose:
print("Finding token references ... ", end="", flush=True)
token_references = find_token_references(lexer_sources, args)
if args.verbose:
print(f"found references to {len(token_references):,} tokens")
if args.verbose:
print()
print("Result:")
print_result(token_references, args)
if __name__ == "__main__":
    sys.exit(main())

When I call it with

Is this what we are aiming for? If yes, I can put it into
Yes, that looks super useful. Thanks a lot for preparing this! That should definitely go into the scripts folder, and I'll try to clean up the affected lexers (right away looking at this I see
Very nice, thanks! Apart from obvious fixes, singular token types should still be allowed, since some people use specialized styles for them. The form Token..Foo would only make sense if the token cannot be classified into one of the basic semantic meanings though.
I was wondering if we should require a language-specific part in that case, though. E.g. Token.foo.Name.Spaceship, where foo would be the lexer name. This way it's obvious it's a unicorn, and it creates a whole namespace. Any thoughts?
Well, that would preclude it from being highlighted as Name on a non-specialized style.
The syntax highlighting of TextMate and Sublime Text uses language-specific parts only as the final component of the token type (e.g.
Right, so that would work for us as well for the unicorn styles and make it obvious what it's good for.
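For what it's worth, the placement matters because Pygments resolves both subtype membership and style fallback along a token's parent chain. A minimal sketch of the difference (Foo stands in for a hypothetical lexer-specific part and Spaceship is a made-up token; note that token name components have to start with an uppercase letter, so a lowercase prefix like foo would not actually create a subtype):

from pygments.token import Token, Name

prefixed = Token.Foo.Name.Spaceship   # lexer part first: Token -> Foo -> Name -> Spaceship
suffixed = Name.Spaceship.Foo         # lexer part last:  Token -> Name -> Spaceship -> Foo

print(prefixed in Name)   # False: not a subtype of Token.Name
print(suffixed in Name)   # True:  still a subtype of Token.Name

# Style lookup walks up the parent chain until it finds a defined style, so a
# style that only defines Name renders the suffixed form like any other Name,
# while the prefixed form falls all the way back to the plain Token style.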
OK, here is the pull request: #1819

It's unfortunate that the script isn't 100% reliable: a token might be referenced without the reference appearing in exactly the form the script searches for, or the script might find a "reference" that isn't a real one (for example, when it only appears in a comment). But I can think of no better way to implement this.

It would be great if every lexer provided an example source file that contains all the tokens the lexer uses. This might serve multiple purposes:
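One such purpose could be a coverage-style check: lex each lexer's example file and compare the token types it actually produces against the ones its source references (e.g. as collected by the script above). A rough sketch, assuming per-lexer example files exist (the helper name tokens_exercised and the path are made up):

from pygments import lexers

def tokens_exercised(lexer_alias, example_path):
    """Lex an example file and return the set of token types it produces."""
    lexer = lexers.get_lexer_by_name(lexer_alias)
    with open(example_path, encoding="utf-8") as f:
        code = f.read()
    return {tokentype for tokentype, _value in lexer.get_tokens(code)}

# Hypothetical usage; any token type referenced by the lexer's source but
# missing from this set would point to a gap in the example file:
# print(sorted(tokens_exercised("python", "tests/examplefiles/example.py")))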
I've used the following script to get all token types that are defined in any of the lexers:
The output is the following:
Some of those tokens seem to be duplicates or typos:

- Token.Comment.SingleLine (used by BSTLexer in lexers/bibtex.py), Token.Comment.Singleline (used by FloScriptLexer in lexers/floscript.py) and Token.Comment.Single (used everywhere else).
- Token.Comment.Multi (used by CleanLexer in lexers/clean.py) and Token.Comment.Multiline (used everywhere else).
- Token.Literal.Number.Dec (used by NuSMVLexer in lexers/smv.py) and Token.Literal.Number.Decimal (used everywhere else).
- Token.Literal.Number.Int (used by CddlLexer in lexers/cddl.py) and Token.Literal.Number.Integer (used everywhere else).
- Token.Literal.Number.Octal (used by ThingsDBLexer in lexers/thingsdb.py) and Token.Literal.Number.Oct (used everywhere else).

The following are inconclusive:

- Token.Literal.String.Character (used by UniconLexer and IconLexer in lexers/unicon.py and by DelphiLexer in lexers/pascal.py) and Token.Literal.String.Char (used everywhere else).
- Token.Literal.String.Interp (used by ColdfusionLexer in lexers/templates.py) and Token.Literal.String.Interpol (used everywhere else). But here "Interp" might mean interpreted, not interpolated.

Would it make sense to consolidate those tokens that are clearly typos?
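One reason such near-duplicates go unnoticed is that Pygments creates token types lazily on attribute access, so a misspelled name never raises an error; it silently becomes a new, distinct type. A small sketch using the names from the list above:

from pygments.token import Token

# Attribute access creates the type on first use, so the typo is silent:
print(Token.Comment.Singleline is Token.Comment.Single)   # False: two distinct types
print(Token.Comment.Singleline in Token.Comment)          # True:  still a Comment subtype
print(Token.Comment.Singleline in Token.Comment.Single)   # False: unrelated to .Single

# A style that only defines Token.Comment.Single therefore never matches the
# misspelled type, which falls back to the generic Token.Comment style instead.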
Also, it would be really helpful to have documentation on what each token is supposed to represent (for example, what is Token.Operator.DBS?). A __doc__ attribute on each token object would really help.
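Regarding the documentation request: token types currently carry no __doc__, so the sketch below is purely hypothetical (TOKEN_DOCS and describe are made-up names, not an existing Pygments API). It shows how descriptions could be attached to token types and looked up with fallback to the nearest documented ancestor:

from pygments.token import Token

# Hypothetical mapping from token types to human-readable descriptions.
TOKEN_DOCS = {
    Token.Comment.Single: "A comment that ends at the end of the line.",
    Token.Comment.Multiline: "A comment that may span multiple lines.",
    Token.Literal.Number.Integer: "An integer literal.",
}

def describe(tok):
    """Return the description of the nearest documented ancestor token type."""
    while tok is not Token:
        if tok in TOKEN_DOCS:
            return TOKEN_DOCS[tok]
        tok = tok.parent
    return "Undocumented token type."

print(describe(Token.Literal.Number.Integer.Long))  # falls back to the Integer entry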