Nathan Youngman edited this page Oct 27, 2012 · 2 revisions

Scanners are the heart of CodeRay. They split input code into tokens and classify them.

Each language has its own scanner. You can see which languages are currently supported in the repository.
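
To make "split input code into tokens and classify them" concrete, here is a minimal sketch that uses only Ruby's StringScanner. The toy_tokens helper and its tiny arithmetic language are hypothetical and not part of CodeRay; a real scanner sends its tokens to an encoder instead of collecting them in an array.

```ruby
require 'strscan'

# Hypothetical helper, not part of CodeRay: splits a tiny arithmetic language
# into [text, kind] pairs using Ruby's StringScanner.
def toy_tokens(code)
  scanner = StringScanner.new(code)
  tokens = []
  until scanner.eos?
    if (text = scanner.scan(/\s+/))
      tokens << [text, :space]
    elsif (text = scanner.scan(/\d+/))
      tokens << [text, :integer]
    elsif (text = scanner.scan(/[-+*\/()]/))
      tokens << [text, :operator]
    else
      # Unknown input is consumed one character at a time and marked as :error.
      tokens << [scanner.getch, :error]
    end
  end
  tokens
end

p toy_tokens('1 + 2?')
```

Joining the text parts of all tokens gives back the original input, which is a handy sanity check for any scanner.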

Why is the CodeRay language support list so short?

CodeRay development is a slow process because there is only one active developer, and he insists on high-quality software.

Special attention is paid to the scanners: every CodeRay scanner is tested carefully against lots of example source code, as well as randomized and junk code, to make it safe. A CodeRay scanner is not officially released unless it highlights very, very well.

I need a new Scanner – What can I do?

Here’s what you can do to speed up the development of a new scanner:

  1. Request it! File a new ticket unless one already exists; if it does, add a +1 or a comment to the existing ticket to show your interest.
  2. Upload or link to example code in the ticket discussion.
    • Typical code in large quantities is very helpful, also for benchmarking.
    • But we also need the weirdest and strangest code you can find to make the scanner safe.
  3. Provide links to useful information about the language's lexical structure, such as:
    • a list of reserved words (Did you know that “void” is a JavaScript keyword?)
    • rules for string and number literals (Can a double quoted string contain a newline?)
    • rules for comments and other token types (Does the language have a special syntax for multiline comments?)
    • a description of any unusual syntactic features (There’s this weird %w() thing in Ruby…)
    • If there are different versions / implementations / dialects of this language: How do they differ?
  4. Give examples of good and bad highlighters / syntax definitions for the language (usually from editors or other libraries).
  5. Find more example code!
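
To illustrate why the answers to such questions matter, here is a small sketch of how one lexical detail (whether a double-quoted string may contain a newline) changes the regular expression a scanner would use. Both patterns and the scan_string helper are hypothetical and are not taken from any CodeRay scanner.

```ruby
require 'strscan'

# Two hypothetical rules for a double-quoted string literal. Which one is
# right depends on the language being scanned.
STRING_ONE_LINE  = / " (?: [^"\\\n] | \\. )* " /x   # a raw newline is not allowed
STRING_MULTILINE = / " (?: [^"\\]   | \\. )* " /mx  # a newline is plain content

# Hypothetical helper: tries to match a string literal at the start of code.
def scan_string(code, rule)
  StringScanner.new(code).scan(rule)
end

code = "\"a\nb\""
p scan_string(code, STRING_ONE_LINE)   # no match under the one-line rule
p scan_string(code, STRING_MULTILINE)  # the whole literal under the other rule
```

The same input is either a valid string or an error, depending on a rule that is easy to overlook without a reference for the language.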

Also, read the next section.

I want to write a Scanner myself

Wow, you’re brave! Writing CodeRay scanners is not an easy task because:

  • You need excellent knowledge about the language you want to scan. Every language has a dark side!
  • You need good knowledge of (Ruby) regular expressions.
  • There’s no documentation to speak of.
    • But this is a wiki hint hint ;o)

But it has been done before, so go and try it!

  1. You should still request the scanner (as described above) and announce that you are working on a patch yourself.
  2. Check out the Repository and try the Test Suite.
  3. Copy a scanner of your choice as a base. You would know what language comes closest.
  4. Make sure you have run rake test:scanners to get the scanner test suite.
  5. Create a test case directory in test/scanners/<lang> and add example files for your language.
  6. Run your test cases with rake test:scanner:<lang> and write your scanner!
  7. Also, look into lib/coderay/scanners/_map.rb and lib/coderay/helpers/file_type.rb.
  8. Make a patch (scanner, test cases and other changes) and upload it to the ticket.
  9. Follow the ensuing discussion.
  10. Prepare to be added to the THX list.

Contact me (murphy rubychan de) if you have any questions.

How does a Scanner look?

For example, the JSON scanner:

# Namespace; use this form instead of CodeRay::Scanners to avoid messages like
# "uninitialized constant CodeRay" when testing it.
module CodeRay
module Scanners

  # Always inherit from CodeRay::Scanners::Scanner.
  #
  # Scanner inherits directly from StringScanner, the Ruby class for fast
  # string scanning. Read the documentation to understand what's going on here:
  #
  #   http://www.ruby-doc.org/stdlib/libdoc/strscan/rdoc/index.html
  class JSON < Scanner

    # Deprecation notice: The Streamable module is gone.

    # Scanners are plugins and must be registered like this:
    register_for :json

    # You can provide a file extension associated with this language.
    file_extension 'json'

    # List all token kinds that are not considered to be running code
    # in this language. For a typical language, this would just be
    # :comment, but for a data or markup language like JSON, no tokens
    # should count as lines of code.
    KINDS_NOT_LOC = [
      :float, :char, :content, :delimiter,
      :error, :integer, :operator, :value,
    ]  # :nodoc:

    # See the WordList documentation.
    CONSTANTS = %w( true false null )
    IDENT_KIND = WordList.new(:key).add(CONSTANTS, :value)

    ESCAPE = / [bfnrt\\"\/] /x
    UNICODE_ESCAPE = / u[a-fA-F0-9]{4} /x

    # This is the only method you need to define. It scans code.
    #
    # encoder is an object which encodes tokens. It provides the following API:
    # * encoder.text_token(text, kind) for tokens
    # * encoder.begin_group(kind) and encoder.end_group(kind) for token groups
    # * encoder.begin_line(kind) and encoder.end_line(kind) for line tokens
    #
    # options is a hash. Standard options are:
    # * keep_state: Try to save the current scanner state and restore it in the
    #   next call of scan_tokens.
    #
    # scan_tokens must return the encoder variable it was given.
    #
    # You are completely free to use any style you want, just make sure encoder
    # gets what it needs. But typically, a Scanner follows the following scheme:
    def scan_tokens encoder, options

      # The scanner is always in a certain state, which is :initial by default.
      # We use local variables and symbols to maximize speed.
      state = :initial

      # Sometimes, you need a stack. Ruby arrays are perfect for this.
      stack = []

      # Define more flags and variables as you need them.
      key_expected = false

      # The main loop; eos? is true when the end of the code is reached.
      until eos?

        # Deprecation notice: The use of the local variables kind and match is no
        # longer recommended.

        # Depending on the state, we want to do different things.
        case state

        # Normally, we use this case.
        when :initial
          # I like the / ... /x style regexps because white space makes them more
          # readable. x means white space is ignored.
          if match = scan(/ \s+ /x)
            # White space and masked line ends are :space.
            # Make sure you never send an empty token! /\s*/ for example would be
            # very bad (actually creating an infinite loop).
            encoder.text_token match, :space
          elsif match = scan(/ [:,\[{\]}] /x)
            # Operators of JSON. stack is used to determine where we are. stack and
            # key_expected are set depending on which operator was found.
            # key_expected is used to decide whether a "quoted" thing should be
            # classified as key or string.
            encoder.text_token match, :operator
            case match
            when '{' then stack << :object; key_expected = true
            when '[' then stack << :array
            when ':' then key_expected = false
            when ',' then key_expected = true if stack.last == :object
            when '}', ']' then stack.pop  # no error recovery, but works for valid JSON
            end
          elsif match = scan(/ true | false | null /x)
            # These are the only idents that are allowed in JSON. Normally, IDENT_KIND
            # would be used to tell keywords and idents apart.
            encoder.text_token match, IDENT_KIND[match]
          elsif match = scan(/ -? (?: 0 | [1-9]\d* ) /x)
            # Pay attention to the details: JSON doesn't allow numbers like 00.
            if scan(/ \.\d+ (?:[eE][-+]?\d+)? | [eE][-+]? \d+ /x)
              match << matched
              encoder.text_token match, :float
            else
              encoder.text_token match, :integer
            end
          elsif match = scan(/"/)
            # A "quoted" token was found, and we know whether it is a key or a string.
            state = key_expected ? :key : :string
            # This opens a token group and encodes the delimiter token.
            encoder.begin_group state
            encoder.text_token match, :delimiter
          else
            # Don't forget to add this case: If we reach invalid code, we try to discard
            # chars one by one and mark them as :error.
            encoder.text_token getch, :error
          end

        # String scanning is a bit more complicated, so we use another state for it.
        # The scanner stays in :string state until the string ends or an error occurs.
        #
        # JSON uses the same notation for strings and keys. We want keys to be in a
        # different color, but the lexical rules are the same. This is why we use this
        # case also for the :key state.
        when :string, :key
          # Another if-elsif-else-switch, for strings this time.
          if match = scan(/[^\\"]+/)
            # Everything that is not \ or " is just string content.
            encoder.text_token match, :content
          elsif match = scan(/"/)
            # A " is found, which means this string or key is ending here.
            # A special token class, :delimiter, is used for tokens like this one.
            encoder.text_token match, :delimiter
            # Always close your token groups using the right token kind!
            encoder.end_group state
            # We're going back to normal scanning here.
            state = :initial
            # Deprecation notice: Don't use "next" any more.
          elsif match = scan(/ \\ (?: #{ESCAPE} | #{UNICODE_ESCAPE} ) /mox)
            # A valid special character should be classified as :char.
            encoder.text_token match, :char
          elsif match = scan(/\\./m)
            # Anything else that is escaped (including \n, we use the m modifier) is
            # just content.
            encoder.text_token match, :content
          elsif match = scan(/ \\ | $ /x)
            # A string that suddenly ends in the middle, or reaches the end of the
            # line. This is an error; we go back to :initial now.
            encoder.end_group state
            encoder.text_token match, :error
            state = :initial
          else
            # Nice for debugging. Should never happen.
            raise_inspect "else case \" reached; %p not handled." % [peek(1)], encoder
          end

        else
          # Nice for debugging. Should never happen.
          raise_inspect 'Unknown state: %p' % [state], encoder

        end

        # Deprecation notice: The block using the match local variable is gone.
      end

      # If we still have a string or key token group open, close it.
      if [:string, :key].include? state
        encoder.end_group state
      end

      # Return the encoder.
      encoder
    end

  end

end
end
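
Two remarks from the comments in the scanner above can be checked in isolation with plain StringScanner: that JSON numbers cannot start with 00, and that a pattern like /\s*/ produces empty tokens. This is only a sketch; INTEGER is copied from the scanner's integer pattern, while scan_integer and scan_optional_space are hypothetical helpers written for this illustration.

```ruby
require 'strscan'

# Integer part as written in the JSON scanner above.
INTEGER = / -? (?: 0 | [1-9]\d* ) /x

# Hypothetical helper: returns the matched integer text and the unscanned rest.
def scan_integer(code)
  scanner = StringScanner.new(code)
  [scanner.scan(INTEGER), scanner.rest]
end

# Hypothetical helper: returns what / \s* /x matched and the position afterwards.
def scan_optional_space(code)
  scanner = StringScanner.new(code)
  [scanner.scan(/ \s* /x), scanner.pos]
end

p scan_integer('007')        # only the first zero is taken as a number
p scan_optional_space('x')   # an empty match; the position has not moved
```

Because the empty match is still truthy in Ruby, an until eos? loop built on such a pattern would take that branch forever without consuming any input, which is the infinite loop the scanner comments warn about.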