
Comparison With Pygments

Nathan Youngman edited this page Oct 28, 2012 · 3 revisions

General differences

  • CodeRay is a Ruby library, Pygments is written in Python.
  • CodeRay supports 19 languages, while Pygments supports over 90.
  • CodeRay has handwritten scanners. In Pygments, scanners are defined with a scanner DSL.

Handwritten vs. DSL, Pro & Contra

The last two differences in the list above are very much related.

Handwritten scanners (CodeRay)

Pro:

  • faster
    • lots of fine tuning is possible
    • no overhead for DSL transformation and interpretation
  • more flexible

Contra:

  • writing scanners is a lot of work
  • almost nobody understands how to create good scanners
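To make the trade-off concrete, here is a minimal sketch of what "handwritten" means, written in Python for illustration rather than in CodeRay's Ruby. It scans a diff-like input by branching on the first character of each line, with no rule table in between. The function name `scan_diff` and the kind names are hypothetical and not taken from CodeRay.

```python
def scan_diff(source):
    """Handwritten line scanner for a diff-like format (illustrative only)."""
    for line in source.splitlines(keepends=True):
        first = line[:1]
        # Every decision is spelled out by hand; this is where fine
        # tuning is possible, and also where the work goes.
        if first == "+":
            kind = "insert"
        elif first == "-":
            kind = "delete"
        elif first == "@":
            kind = "change"
        elif first == "=" or line.startswith("Index"):
            kind = "head"
        else:
            kind = "text"
        yield kind, line


if __name__ == "__main__":
    for kind, text in scan_diff("Index: a.rb\n+new\n-old\n same\n"):
        print(kind, repr(text))
```

Nothing is interpreted at run time here, but every new language needs its own hand-built control flow.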

Scanner definition (Pygments)

(Note: In Pygments, scanners are called “lexers”.)

Pro:

  • easier to write, read, and maintain
    • less code
    • even beginners can write decent scanners
  • DSL interpreter can be optimized/changed independently
  • porting scanners is easier
  • use of higher-level features (like token groups or stacks) is simple

Contra:

Thoughts: LexDL

A common scanner/lexer definition language, which could be read by both Pygments and hypothetical ports in other languages, would be most useful. The definitions could be maintained in a common code repository.

Here’s a spontaneous example of a possible JSON representation:

{
  "name": "Diff",
  "aliases": ["diff"],
  "filenames": ["*.diff"],
  "tokens": {
    "root": [
      [" .*\n", "Text"],
      ["\\+.*\n", "Generic.Inserted"],
      ["-.*\n", "Generic.Deleted"],
      ["@.*\n", "Generic.Subheading"],
      ["Index.*\n", "Generic.Heading"],
      ["=.*\n", "Generic.Heading"],
      [".*\n", "Text"]
    ],
    ...
  }
}
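As a rough sketch of how a port might consume such a definition, the following Python code loads the "root" rules from a JSON string and applies them first-match-wins. LexDL does not exist, so the loader, the function names (`load_rules`, `tokenize`), and the "Error" fallback are all assumptions. The backslash before `+` is doubled because JSON string escaping requires it, and the elided part of the definition is left out.

```python
import json
import re

# Hypothetical LexDL definition, following the Diff example.
LEXDL = r'''
{
  "name": "Diff",
  "tokens": {
    "root": [
      [" .*\n", "Text"],
      ["\\+.*\n", "Generic.Inserted"],
      ["-.*\n", "Generic.Deleted"],
      ["@.*\n", "Generic.Subheading"],
      ["Index.*\n", "Generic.Heading"],
      ["=.*\n", "Generic.Heading"],
      [".*\n", "Text"]
    ]
  }
}
'''


def load_rules(text, state="root"):
    """Compile the (pattern, token type) pairs of one state."""
    definition = json.loads(text)
    return [(re.compile(pattern), token_type)
            for pattern, token_type in definition["tokens"][state]]


def tokenize(source, rules):
    """Yield (token type, text) pairs; the first matching rule wins."""
    pos = 0
    while pos < len(source):
        for regex, token_type in rules:
            match = regex.match(source, pos)
            if match and match.end() > pos:
                yield token_type, match.group()
                pos = match.end()
                break
        else:
            # No rule matched (e.g. a last line without a newline):
            # emit a single character so the loop always advances.
            yield "Error", source[pos]
            pos += 1


if __name__ == "__main__":
    rules = load_rules(LEXDL)
    for token in tokenize("Index: a.rb\n+new\n-old\n same\n", rules):
        print(token)
```

The interpreter is only a few lines long and knows nothing about diffs, which is what would make shared definitions portable.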

Other differences

Regular expressions engine

Python’s regexps are more powerful than the regexps of Ruby 1.8, and less powerful than the new Ruby 1.9 ones. However, most expressions used in the scanners can be interpreted by all engines. Ruby’s StringScanner has some limitations in the use of regexps.

Token kinds vs. token types

CodeRay represents each token's type with a token kind, which is just a Ruby :symbol (source).

Pygments uses a hierarchical token type/subtype system (source), which is more complex to implement (and slower), but more flexible and easier to understand for authors of new language definitions.
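The difference can be modeled in a few lines of Python. This is a simplified model, not the actual implementation of either library: flat kinds are plain strings, hierarchical types are tuples of path segments, and `style_for` is a hypothetical lookup that falls back to the parent type.

```python
# Flat kinds (CodeRay style): one lookup, no relationships between kinds.
FLAT_STYLES = {"insert": "green", "delete": "red"}

# Hierarchical types (Pygments style), modeled here as tuple paths.
GENERIC = ("Generic",)
INSERTED = GENERIC + ("Inserted",)
DELETED = GENERIC + ("Deleted",)


def is_subtype(token_type, ancestor):
    """True if token_type equals ancestor or lies below it."""
    return token_type[:len(ancestor)] == ancestor


def style_for(token_type, styles):
    """Find the most specific style, walking up towards the root."""
    for end in range(len(token_type), -1, -1):
        style = styles.get(token_type[:end])
        if style is not None:
            return style
    return None


if __name__ == "__main__":
    styles = {GENERIC: "italic", INSERTED: "green"}
    print(FLAT_STYLES.get("insert"))
    print(style_for(INSERTED, styles))
    print(style_for(DELETED, styles))
```

The fallback loop is the extra cost of the hierarchy. In return, a theme only has to style a parent type to cover all of its subtypes.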

Token groups

CodeRay supports token groups, which map nicely to SPANs in the HTML output. A token group has a token kind and can contain tokens and other token groups. The final color of a token depends on the groups it is nested in (for example, string/delimiter has a different color than regexp/delimiter). Groups are represented with special :open and :close tokens.

Token groups allow CSS-style color definitions, which are most useful for HTML output. Pygments doesn’t have a comparable feature; you can see that strings are usually a single token in Pygments, while the delimiting quotes are usually separate tokens in CodeRay.
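A small Python sketch shows how open/close markers in a token stream turn into nested SPANs. The sentinels `OPEN` and `CLOSE` stand in for CodeRay's :open and :close; the kind names and the `to_html` function are hypothetical and do not reproduce CodeRay's actual encoder.

```python
from html import escape

# Sentinels standing in for CodeRay's :open and :close markers.
OPEN, CLOSE = object(), object()

# Hypothetical token stream for the source text "hello" (with quotes).
TOKENS = [
    (OPEN, "string"),
    ('"', "delimiter"),
    ("hello", "content"),
    ('"', "delimiter"),
    (CLOSE, "string"),
]


def to_html(tokens):
    """Render tokens as SPANs; groups become enclosing SPANs."""
    out = []
    depth = 0
    for text, kind in tokens:
        if text is OPEN:
            out.append('<span class="%s">' % kind)
            depth += 1
        elif text is CLOSE:
            out.append("</span>")
            depth -= 1
        else:
            out.append('<span class="%s">%s</span>' % (kind, escape(text)))
    # Close any groups the stream left open.
    out.append("</span>" * depth)
    return "".join(out)


if __name__ == "__main__":
    print(to_html(TOKENS))
```

With this nesting, a stylesheet can use descendant selectors such as `.string .delimiter` and `.regexp .delimiter` to color the same kind differently depending on its group.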

CodeRay is optimized for HTML/CSS output. The concept of token groups may be ported to LaTeX or console output, but it’s not trivial.

Filters

Pygments has filters, which manipulate the token stream in some way. You can do some cool tricks with these. CodeRay currently lacks such a feature.
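The idea can be sketched with plain Python generators that take a token stream and yield a modified one. This models the concept only and does not use Pygments' real filter API; the function names and the string token types are assumptions.

```python
def upcase_keywords(stream):
    """Yield the stream with the text of keyword tokens upper-cased."""
    for token_type, text in stream:
        if token_type == "Keyword":
            text = text.upper()
        yield token_type, text


def merge_adjacent(stream):
    """Merge consecutive tokens that share the same type."""
    pending_type, pending_text = None, ""
    for token_type, text in stream:
        if token_type == pending_type:
            pending_text += text
        else:
            if pending_type is not None:
                yield pending_type, pending_text
            pending_type, pending_text = token_type, text
    if pending_type is not None:
        yield pending_type, pending_text


if __name__ == "__main__":
    stream = [("Keyword", "def"), ("Text", " "), ("Text", " "), ("Name", "f")]
    # Filters compose like a pipeline: each one wraps the previous stream.
    for token in merge_adjacent(upcase_keywords(stream)):
        print(token)
```

Because each filter only sees a stream of tokens, filters are independent of the language being highlighted and can be chained freely.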

Plugins

Both Pygments and CodeRay can be extended via plugins. The details differ, but writing a plugin is simple in either library.