Duplicate token types and token documentation #1816

Closed
doerwalter opened this issue May 24, 2021 · 9 comments

Comments

@doerwalter
Contributor

I've used the following script to get all token types that are defined in any of the lexers:

from pygments import token, lexers

# Fetch every lexer once (by alias, filename pattern, or mimetype) so that
# all token types any of the lexers produce get created as a side effect:
for (name, aliases, patterns, mimetypes) in lexers.get_all_lexers():
	for a in aliases:
		l = lexers.get_lexer_by_name(a)
		break
	else:
		for p in patterns:
			l = lexers.get_lexer_for_filename(p)
			break
		else:
			for m in mimetypes:
				l = lexers.get_lexer_for_mimetype(m)
				break

# Recursively walk the token hierarchy; each token type records the
# subtypes that have been created in its .subtypes attribute.
def list_tokens(t):
	yield t
	for st in t.subtypes:
		yield from list_tokens(st)

tokens = list(list_tokens(token.Token))

for t in sorted(tokens):
	print(t)

The output is the following:

Token
Token.Comment
Token.Comment.Directive
Token.Comment.Doc
Token.Comment.Hashbang
Token.Comment.Multi
Token.Comment.Multiline
Token.Comment.Preproc
Token.Comment.PreprocFile
Token.Comment.Single
Token.Comment.Singleline
Token.Comment.SingleLine
Token.Comment.Special
Token.Error
Token.Escape
Token.Generic
Token.Generic.Deleted
Token.Generic.Emph
Token.Generic.Error
Token.Generic.Heading
Token.Generic.Inserted
Token.Generic.Output
Token.Generic.Prompt
Token.Generic.Strong
Token.Generic.Subheading
Token.Generic.Traceback
Token.Generic.Whitespace
Token.Keyword
Token.Keyword.Builtin
Token.Keyword.Constant
Token.Keyword.Control
Token.Keyword.Declaration
Token.Keyword.Keyword
Token.Keyword.Namespace
Token.Keyword.PreProc
Token.Keyword.Pseudo
Token.Keyword.Removed
Token.Keyword.Reserved
Token.Keyword.Token
Token.Keyword.Tokens
Token.Keyword.Type
Token.Keyword.Word
Token.Literal
Token.Literal.Char
Token.Literal.Date
Token.Literal.Number
Token.Literal.Number.Attribute
Token.Literal.Number.Bin
Token.Literal.Number.Dec
Token.Literal.Number.Decimal
Token.Literal.Number.Float
Token.Literal.Number.Hex
Token.Literal.Number.Int
Token.Literal.Number.Integer
Token.Literal.Number.Integer.Long
Token.Literal.Number.Oct
Token.Literal.Number.Octal
Token.Literal.Number.Radix
Token.Literal.Other
Token.Literal.Scalar
Token.Literal.Scalar.Plain
Token.Literal.String
Token.Literal.String.Affix
Token.Literal.String.Atom
Token.Literal.String.Backtick
Token.Literal.String.Boolean
Token.Literal.String.Char
Token.Literal.String.Character
Token.Literal.String.Delimiter
Token.Literal.String.Doc
Token.Literal.String.Double
Token.Literal.String.Escape
Token.Literal.String.Heredoc
Token.Literal.String.Interp
Token.Literal.String.Interpol
Token.Literal.String.Moment
Token.Literal.String.Name
Token.Literal.String.Other
Token.Literal.String.Regex
Token.Literal.String.Single
Token.Literal.String.Symbol
Token.Name
Token.Name.Attribute
Token.Name.Attribute.Variable
Token.Name.Attributes
Token.Name.Builtin
Token.Name.Builtin.Pseudo
Token.Name.Builtin.Type
Token.Name.Builtins
Token.Name.Class
Token.Name.Class.DBS
Token.Name.Class.Start
Token.Name.Classes
Token.Name.Constant
Token.Name.Decorator
Token.Name.Entity
Token.Name.Entity.DBS
Token.Name.Exception
Token.Name.Field
Token.Name.Function
Token.Name.Function.Magic
Token.Name.Keyword
Token.Name.Keyword.Tokens
Token.Name.Label
Token.Name.Namespace
Token.Name.Operator
Token.Name.Other
Token.Name.Other.Member
Token.Name.Property
Token.Name.Pseudo
Token.Name.Quoted
Token.Name.Quoted.Escape
Token.Name.Symbol
Token.Name.Tag
Token.Name.Type
Token.Name.Variable
Token.Name.Variable.Anonymous
Token.Name.Variable.Class
Token.Name.Variable.Global
Token.Name.Variable.Instance
Token.Name.Variable.Magic
Token.Operator
Token.Operator.DBS
Token.Operator.Word
Token.Other
Token.OutPrompt
Token.OutPromptNum
Token.Prompt
Token.PromptNum
Token.Punctuation
Token.Punctuation.Indicator
Token.Text
Token.Text.Symbol
Token.Text.Whitespace

Some of those tokens seem to be duplicates or typos:

  • Token.Comment.SingleLine (used by BSTLexer in lexers/bibtex.py), Token.Comment.Singleline (used by FloScriptLexer in lexers/floscript.py) and Token.Comment.Single (used everywhere else).
  • Token.Comment.Multi (used by CleanLexer in lexers/clean.py) and Token.Comment.Multiline (used everywhere else).
  • Token.Literal.Number.Dec (used by NuSMVLexer in lexers/smv.py) and Token.Literal.Number.Decimal (used everywhere else).
  • Token.Literal.Number.Int (used by CddlLexer in lexers/cddl.py) and Token.Literal.Number.Integer (used everywhere else).
  • Token.Literal.Number.Octal (used by ThingsDBLexer in lexers/thingsdb.py) and Token.Literal.Number.Oct (used everywhere else).

The following are inconclusive:

  • Token.Literal.String.Character (used by UniconLexer and IconLexer in lexers/unicon.py and DelphiLexer in lexers/pascal.py) and Token.Literal.String.Char (used everywhere else).
  • Token.Literal.String.Interp (used by ColdfusionLexer in lexers/templates.py) and Token.Literal.String.Interpol (used everywhere else). But here "Interp" might mean interpreted, not interpolated.

Would it make sense to consolidate those tokens that are clearly typos?
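
Until such a cleanup happens, a downstream user could also normalize the duplicates themselves with a small filter. Here is a minimal sketch; NormalizeTokenTypes is not part of Pygments, and the choice of canonical names simply follows the "used everywhere else" variants listed above:

from pygments.filter import Filter
from pygments.token import Comment, Number

# Map the (presumed) typo variants onto the canonical token types listed above.
CANONICAL = {
    Comment.SingleLine: Comment.Single,
    Comment.Singleline: Comment.Single,
    Comment.Multi:      Comment.Multiline,
    Number.Dec:         Number.Decimal,
    Number.Int:         Number.Integer,
    Number.Octal:       Number.Oct,
}

class NormalizeTokenTypes(Filter):
    """Rewrite duplicate/typo token types to their canonical counterparts."""
    def filter(self, lexer, stream):
        for ttype, value in stream:
            yield CANONICAL.get(ttype, ttype), value

# Usage: call lexer.add_filter(NormalizeTokenTypes()) on any lexer instance.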

Also, it would be really helpful to have documentation of what each token type is supposed to represent (for example, what is Token.Operator.DBS?). A __doc__ attribute on each token object would really help.
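
As a rough sketch of the kind of lookup I have in mind (the descriptions and the describe() helper below are made up for illustration; nothing like this exists in Pygments today):

from pygments.token import Token

# Hypothetical descriptions; Pygments does not currently ship any of these.
TOKEN_DOCS = {
    Token.Comment:        "Any kind of comment.",
    Token.Comment.Single: "A comment that ends at the end of the line.",
    Token.Name.Builtin:   "A name that is built into the language.",
}

def describe(ttype):
    """Return the description of a token type, falling back to its parents."""
    while ttype is not None:
        if ttype in TOKEN_DOCS:
            return TOKEN_DOCS[ttype]
        ttype = ttype.parent
    return "Undocumented token type."

# describe(Token.Comment.Hashbang) falls back to the Token.Comment entry.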

@Anteru
Collaborator

Anteru commented May 25, 2021

Hi, thanks, that's actually a really nice test. Would you mind adding this to make check and highlighting "unicorn" types that are only used by a single lexer? I think that would be the best approach here. I'll see if we can get them unified, because it's quite clear those one-time uses are not really useful. It would probably make sense to require language-specific token types to be named as Token.<Language>. ... or something if someone ever wants to highlight them.

@doerwalter
Contributor Author

OK, here is a script that greps the source code of the lexers for references to tokens.

#!/usr/bin/env python
"""
    Count how often each token is used by the lexers
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    This makes it possible to find typos in token names, as such tokens
    are typically only used by one lexer.
"""

import sys, argparse, re, pathlib

from pygments import token, lexers


def lookup_all_lexers():
    """
    Iterate through all lexers and fetch them.
    This should create all tokens that any of the lexers produce.
    """
    count = 0
    for (name, aliases, patterns, mimetypes) in lexers.get_all_lexers():
        for a in aliases:
            l = lexers.get_lexer_by_name(a)
            break
        else:
            for p in patterns:
                l = lexers.get_lexer_for_filename(p)
                break
            else:
                for m in mimetypes:
                    l = lexers.get_lexer_for_mimetype(m)
                    break
        count += 1
    return count


def fetch_lexer_sources():
    """
    Return the source code of all lexers as a dictionary, mapping filenames
    to a list of lines.
    """
    lexer_dir = (pathlib.Path(__file__).parent / "../pygments/lexers").resolve()
    lexer_sources = {fn: fn.read_text().splitlines(keepends=False) for fn in lexer_dir.glob("*.py")}
    return lexer_sources


def sub_tokens(token):
    """
    Generator that yields a token and all of its sub-tokens recursively.
    """
    yield token
    for subtoken in token.subtypes:
        yield from sub_tokens(subtoken)


class FileCount:
    """
    Stores information about line numbers in a file.

    This is used to store from which lines in a file a certain token is
    referenced.
    """
    def __init__(self, filename):
        self.filename = filename
        self.lines = []

    def __str__(self):
        if len(self.lines) > 3:
            lines = ", ".join(f"{line:,}" for line in self.lines[:5])
            lines = f"{lines}, ... ({len(self.lines):,} lines)"
        else:
            lines = ", ".join(f"{line:,}" for line in self.lines)
        return f"{self.filename.name}[{lines}]"

    def add(self, linenumber):
        self.lines.append(linenumber)

    def count_lines(self):
        return len(self.lines)


class TokenCount:
    """
    Stores information about a token and in which files it is referenced.
    """
    def __init__(self, token):
        self.token = token
        self.files = {}

    def add(self, filename, linenumber):
        if filename not in self.files:
            self.files[filename] = FileCount(filename)
        self.files[filename].add(linenumber)

    def __str__(self):
        if len(self.files) > 3:
            files = []
            for (i, filecount) in enumerate(self.files.values()):
                files.append(str(filecount))
                if i >= 5:
                    break
            files = ", ".join(files) + f", ... ({len(self.files):,} files)"
        else:
            files = ", ".join(str(filecount) for filecount in self.files.values())
        return f"{self.count_files():,} files, {self.count_lines():,} locations: {files}"

    def count_files(self):
        return len(self.files)

    def count_lines(self):
        return sum(fc.count_lines() for fc in self.files.values())


def find_token_references(lexer_sources, args):
    """
    Find all references to all tokens in the source code of all lexers.

    Note that this can't be 100% reliable, as it searches the source code for
    certain patterns: It searches for the last two components of a token name,
    i.e. to find references to the token ``Token.Literal.Number.Integer.Long``
    it searches for the regular expression ``\\bInteger.Long\\b``. This
    won't work reliably for top-level tokens like ``Token.String`` since they
    are often referred to as just ``String``, but searching for ``\\bString\\b``
    yields too many false positives.
    """

    # Maps token to :class:`TokenCount` objects.
    token_references = {}

    # Search for each token in each lexer source file and record in which file
    # and in which line they are referenced
    for t in sub_tokens(token.Token):
        parts = list(t)[-2:]
        if len(parts) == 0:
            name = "Token"
        elif len(parts) == 1:
            name = f"Token.{parts[0]}"
        else:
            name = ".".join(parts)

        token_references[t] = tokencount = TokenCount(t)

        if name != "Token":
            pattern = re.compile(f"\\b{name}\\b")

            for (filename, sourcelines) in lexer_sources.items():
                for (i, line) in enumerate(sourcelines, 1):
                    if pattern.search(line) is not None:
                        tokencount.add(filename, i)
                        if args.subtoken:
                            t2 = t
                            while t2 is not token.Token:
                                t2 = t2.parent
                                tokencount2 = token_references[t2]
                                tokencount2.add(filename, i)

    return token_references


def print_result(token_references, args):
    def key(item):
        return (item[1].count_files(), item[1].count_lines())

    for (token, locations) in sorted(token_references.items(), key=key):
        if args.minfiles <= locations.count_files() <= args.maxfiles and \
           args.minlines <= locations.count_lines() <= args.maxlines:
            print(f"{token}: {locations}")


def main(args=None):
    p = argparse.ArgumentParser(description="Count how often each token is used by the lexers")
    p.add_argument("-v", "--verbose", dest="verbose", help="Give more output.", default=False, action="store_true")
    p.add_argument("--minfiles", dest="minfiles", metavar="COUNT", type=int, help="Report all tokens referenced by at least COUNT lexer source files (default %(default)s)", default=1)
    p.add_argument("--maxfiles", dest="maxfiles", metavar="COUNT", type=int, help="Report all tokens referenced by at most COUNT lexer source files (default %(default)s)", default=2)
    p.add_argument("--minlines", dest="minlines", metavar="COUNT", type=int, help="Report all tokens referenced by at least COUNT lexer source lines (default %(default)s)", default=1)
    p.add_argument("--maxlines", dest="maxlines", metavar="COUNT", type=int, help="Report all tokens referenced by at most COUNT lexer source lines (default %(default)s)", default=10)
    p.add_argument("-s", "--subtoken", dest="subtoken", help="Include count of references to subtokens in the count for each token (default %(default)s)", default=False, action="store_true")

    args = p.parse_args(args)

    if args.verbose:
        print("Looking up all lexers ... ", end="", flush=True)
    count = lookup_all_lexers()
    if args.verbose:
        print(f"found {count:,} lexers")

    if args.verbose:
        print("Fetching lexer source code ... ", end="", flush=True)
    lexer_sources = fetch_lexer_sources()
    if args.verbose:
        print(f"found {len(lexer_sources):,} lexer source files")

    if args.verbose:
        print("Finding token references ... ", end="", flush=True)
    token_references = find_token_references(lexer_sources, args)
    if args.verbose:
        print(f"found references to {len(token_references):,} tokens")

    if args.verbose:
        print()
        print("Result:")
    print_result(token_references, args)


if __name__ == "__main__":
    sys.exit(main())

When I call it with python count_token_references.py -s --maxfiles=1 --maxlines=1 I get the following output:

Token.Literal.Char: 1 files, 1 locations: clean.py[148]
Token.Literal.String.Boolean: 1 files, 1 locations: templates.py[2,228]
Token.Literal.String.Interp: 1 files, 1 locations: templates.py[1,514]
Token.Literal.Number.Attribute: 1 files, 1 locations: templates.py[1,793]
Token.Literal.Number.Dec: 1 files, 1 locations: smv.py[71]
Token.Literal.Number.Int: 1 files, 1 locations: cddl.py[167]
Token.Literal.Number.Octal: 1 files, 1 locations: thingsdb.py[40]
Token.Literal.Number.Radix: 1 files, 1 locations: javascript.py[1,464]
Token.Operator.DBS: 1 files, 1 locations: javascript.py[1,315]
Token.Keyword.Builtin: 1 files, 1 locations: mosel.py[417]
Token.Keyword.Keyword: 1 files, 1 locations: haxe.py[912]
Token.Keyword.Removed: 1 files, 1 locations: d.py[58]
Token.Comment.Singleline: 1 files, 1 locations: floscript.py[62]
Token.Comment.SingleLine: 1 files, 1 locations: bibtex.py[157]
Token.Text.Symbol: 1 files, 1 locations: theorem.py[338]
Token.Name.Attributes: 1 files, 1 locations: teal.py[76]
Token.Name.Field: 1 files, 1 locations: javascript.py[1,370]
Token.Name.Builtins: 1 files, 1 locations: usd.py[64]
Token.Name.Symbol: 1 files, 1 locations: javascript.py[1,412]
Token.Name.Entity.DBS: 1 files, 1 locations: javascript.py[1,313]
Token.Name.Class.Start: 1 files, 1 locations: javascript.py[1,295]
Token.Name.Class.DBS: 1 files, 1 locations: javascript.py[1,311]
Token.Name.Classes: 1 files, 1 locations: boa.py[91]
Token.Name.Pseudo: 1 files, 1 locations: css.py[433]

Is this what we are aiming for? If yes, I can put it into scripts/count_token_references.py, add a call to it in the Makefile, and create a pull request.

@Anteru
Collaborator

Anteru commented May 25, 2021

Yes, that looks super useful. Thanks a lot for preparing this! That should definitely go into the scripts folder, and I'll try to clean up the affected lexers (right away, looking at this, I see Token.Name.Builtins, which seems dubious :) )

@birkenfeld
Member

Very nice, thanks! Apart from obvious fixes, singular token types should still be allowed, since some people use specialized styles for them.

The form Token.<Language>.Foo would only make sense if the token cannot be classified into one of the basic semantic meanings, though.
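
For illustration, a minimal sketch of such a specialized style; the token type and colours are arbitrary examples, not a recommendation:

from pygments.style import Style
from pygments.token import Keyword

class SpecializedStyle(Style):
    # Keyword.Removed is only emitted by a single lexer (see the list above),
    # but a specialized style can still give it its own look; styles that
    # don't mention it fall back to whatever they define for Keyword.
    styles = {
        Keyword:         'bold #0000aa',
        Keyword.Removed: 'italic #aa0000',
    }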

@Anteru
Collaborator

Anteru commented May 26, 2021

I was wondering if we should require a language-specific part in that case, though. I.e. Token.foo.Name.Spaceship, where foo would be the lexer name. This way it's obvious it's a unicorn, and it creates a whole namespace. Any thoughts?

@birkenfeld
Member

Well, that would preclude it from being highlighted as Name by a non-specialized style.

@doerwalter
Contributor Author

The syntax highlighting of TextMate and Sublime Text uses language-specific parts only as the final component of the token type (e.g. string.quoted.double.json). (Matching works the same way as in Pygments, i.e. a token string.quoted.double.json can be matched by string or string.quoted, etc.)
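
For completeness, the fallback on the Pygments side can be checked directly with the in operator on token types, which, as far as I can tell, implements exactly this kind of subsumption:

from pygments.token import Token, String

# A specific token type is "contained in" each of its ancestors, so a style
# entry for String also applies to String.Double, String.Heredoc, etc.
assert String.Double in String
assert String.Double in Token.Literal
assert String not in String.Double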

@Anteru
Collaborator

Anteru commented May 26, 2021

Right, so that would work for us as well for the unicorn styles and make it obvious what it's good for.

@doerwalter
Contributor Author

OK, here is the pull request: #1819

It's unfortunate that the script isn't 100% reliable, because a token might be referenced without the reference appearing in exactly the form the script searches for, or the script might find a reference that isn't a real one (for example, if the "reference" is in a comment). But I can think of no better way to implement it.

It would be great if every lexer provided an example source file that contains all the tokens the lexer uses. This might serve multiple purposes:

  • It would make this script more reliable (see the sketch after this list);
  • It would serve as some kind of documentation for the language;
  • It could function as a testbed for developing Pygments themes.
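
A minimal sketch of how such example files could be used to collect the token types a lexer actually emits; the lexer name and path in the usage comment are placeholders:

from pygments.lexers import get_lexer_by_name

def emitted_token_types(lexer_name, example_path):
    """Lex an example file and return the set of token types it produces."""
    lexer = get_lexer_by_name(lexer_name)
    with open(example_path, encoding="utf-8") as f:
        code = f.read()
    return {ttype for ttype, value in lexer.get_tokens(code)}

# Hypothetical usage:
# print(sorted(str(t) for t in emitted_token_types("python", "example.py")))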

@Anteru Anteru closed this as completed in 59481ba Jun 20, 2021