
Adding a new lexer

Jakub Klímek edited this page Jul 21, 2016 · 1 revision

This is a newbie's guide to adding a new lexer to Rouge, written because I did not find one while working on my Turtle lexer. Note that I had never seen Ruby or its related tools before, so some parts of the guide may seem obvious. This works on Linux and on Windows using Cygwin. Thanks to @mjclemente for his useful blog posts on setting up Rouge and creating a lexer; this guide is based on them.

Prerequisites

  1. Fork Rouge
  2. Clone your fork, i.e. your own repository, not $ git clone https://github.com/jneen/rouge.git
  3. Follow Setting up Ruby or @mjclemente's blog post to set up the environment.
  4. If rackup cannot be found, add it to PATH like this: export PATH=$PATH:~/.gem/ruby/gems/rack-1.6.4/bin/ (e.g. in .bashrc in your home directory).

Checklist

  1. You can run the rougify script on a file
  2. You can run rackup and see all available lexers and their demos on http://localhost:9292 and a specific one (e.g. XML) on http://localhost:9292/xml
  3. Think of a name for your new lexer. I was writing a lexer for Turtle, so I chose turtle.

First steps

Yes, we are going to copy & paste an existing lexer and iteratively make it our own. I use turtle; you will use your own lexer name. I start with the xml lexer; you should start with the lexer for the language closest to yours. However, if your language is very close to one that already has a lexer, consider extending that lexer instead of creating a new one.

  1. Copy /spec/lexers/lexername_spec.rb (xml_spec.rb when starting from the XML lexer) to /spec/lexers/turtle_spec.rb. This is basically just an outside description (like an interface) of the lexer. Change Rouge::Lexers::XML to Rouge::Lexers::Turtle on line 3 and Rouge::Lexers::XML.new to Rouge::Lexers::Turtle.new on line 4. Rouge guesses the input file format based on the filename extension, MIME type and content, so adjust the three corresponding blocks by adding or removing lines to match your format's extensions and MIME types.
  2. Copy /spec/visual/samples/xml to /spec/visual/samples/turtle. This is a longer input file that gets lexed on http://localhost:9292/xml. Change it to be a longer file in your language, using as many of the language constructs as possible, ideally all.
  3. Copy /lib/rouge/demos/lexername (xml when starting from the XML lexer) to /lib/rouge/demos/turtle. This is the short language demo shown in the list on http://localhost:9292. Again, provide a short input in your language, showing as much of the language as possible.
  4. Copy /lib/rouge/lexers/xml.rb to /lib/rouge/lexers/turtle.rb. This is the code of the lexer itself. Change class XML < RegexLexer to class Turtle < RegexLexer on line 5, and change the title, description, filenames (extensions) and MIME types to match those from the spec file. Finally, adjust the def self.analyze_text(text) method, which takes the beginning of the input file (e.g. the first 1000 characters) and matches it against a regex. In case of a match, it returns a number expressing the probability that the input is in your language.
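To make the content-guessing idea from step 4 more concrete, here is a minimal sketch in plain Ruby that does not need Rouge at all. The function name turtle_score, the regex and the 0.5 score are my own assumptions for illustration; inside a real lexer this logic would live in self.analyze_text(text), and the exact checks are up to you.

```ruby
# Hypothetical stand-in for `self.analyze_text(text)`; plain Ruby, no Rouge needed.
def turtle_score(text)
  head = text[0, 1000] # only look at the beginning of the input
  # Assumption: Turtle documents often start with @prefix or @base directives.
  return 0.5 if head =~ /^\s*@(prefix|base)\s/
  0 # no evidence that this is Turtle
end

puts turtle_score("@prefix ex: <http://example.org/> .\nex:a ex:b ex:c .")
puts turtle_score("<?xml version=\"1.0\"?>\n<root/>")
```

A higher number means more confidence, so return a modest value unless the match is really specific to your language.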

If you are new to Ruby and its regexes, read the specification, especially if in doubt about %r, /i, /b, etc.
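As a quick orientation, this small sketch shows the regex notations I would expect to come up in a lexer. I am assuming that "/b" above refers to the \b word-boundary anchor; the patterns and the sample line are made up for illustration.

```ruby
# %r{...} is an alternative regex literal, handy when the pattern contains slashes.
iri = %r{<https?://[^>\s]*>}
# A trailing i makes the match case-insensitive.
prefix = /prefix/i
# \b matches at a word boundary, so this only matches "a" as a whole word.
keyword_a = /\ba\b/

line = "PREFIX ex: <http://example.org/> ex:alice a ex:Person ."
p line[iri]                   # the first substring matching the pattern
p line =~ prefix              # index of the first match, or nil
p line.scan(keyword_a).length # number of standalone "a" words
```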

Now, when you access http://localhost:9292, you should see your language (turtle in my case) listed with a demo, and on http://localhost:9292/turtle you should see the longer sample. Of course, the highlighting still comes from the untouched source lexer, which probably means lots of errors (red highlights) in your file.

Also, you should be able to run the test without errors.

Implementing the lexer

The work on the lexer usually goes like this:

  1. With rackup running, in one browser window you have http://localhost:9292 to see the demo file and http://localhost:9292/turtle to see the sample file.
  2. In your favorite text editor, ideally with Ruby syntax highlight, you have the lexer /lib/rouge/lexers/turtle.rb, which contains a set of rules.
  3. In another window you have the list of tokens produced by the rules which annotate the text.
  4. You change the rules in the lexer (a few tips are in the next section), save, refresh the browser, and repeat until done.

After you are done with your lexer, commit it and push it to your forked repository (it should be the 4 files), then create a pull request. After a while, check whether it passes the tests.

Lexer implementation tips

OK, with the Turtle lexer, I did a simple thing with no custom states (all rules in :root). If you need something more complex, it is up to you. This is a list of documents I used:

And a few tips:

  • Order of the rules matters!
  • Start with something simple :)
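To show why the order of the rules matters, here is a toy tokenizer in plain Ruby using StringScanner from the standard library. It is not Rouge code, just a sketch of the same first-match-wins idea: at each position the rules are tried from top to bottom and the first one that matches is used. The tokenize helper, the rule lists and the token names are all hypothetical.

```ruby
require 'strscan'

# Try each [pattern, token_name] rule in order at the current position;
# the first pattern that matches wins.
def tokenize(rules, input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    rule = rules.find { |pattern, _name| scanner.scan(pattern) }
    if rule
      tokens << [rule[1], scanner.matched]
    else
      tokens << [:error, scanner.getch] # nothing matched: consume one character
    end
  end
  tokens
end

keyword = [/@prefix\b/, :keyword] # specific rule
name    = [/@?\w+/, :name]        # general rule that also matches "@prefix"
space   = [/\s+/, :text]

specific_first = [keyword, name, space]
general_first  = [name, keyword, space]

input = "@prefix ex"
p tokenize(specific_first, input) # "@prefix" is tagged :keyword
p tokenize(general_first, input)  # "@prefix" is swallowed by the general rule
```

With the general rule first, the keyword rule never gets a chance, so put the more specific rules above the more general ones.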