
The parser is a "push" design throughout. The net or script component sends buffers to the tokenizer, which sends tokens to the tree builder, which sends tree construction ops to script. Each "send" is an ordinary method call, except that the tree op method performs a real message send in the case of off-thread parsing. Because each consumer is abstracted behind a trait, other consumers of tokens and tree ops can be swapped in, e.g. a test harness.
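
A minimal sketch of that pipeline, with illustrative trait and type names (`TokenSink`, `TreeOpSink`, and so on) rather than the library's actual API:

```rust
/// Consumer of tokens emitted by the tokenizer.
trait TokenSink {
    fn process_token(&mut self, token: Token);
}

/// Consumer of tree construction ops emitted by the tree builder.
trait TreeOpSink {
    fn process_op(&mut self, op: TreeOp);
}

enum Token {
    Chars(String),
    StartTag(String),
    Eof,
}

enum TreeOp {
    CreateElement(String),
    AppendText(String),
}

/// The tree builder consumes tokens and pushes tree ops onward.
/// Its sink could be the real document, a channel to another
/// thread, or a test harness that records the ops.
struct TreeBuilder<S: TreeOpSink> {
    sink: S,
}

impl<S: TreeOpSink> TokenSink for TreeBuilder<S> {
    fn process_token(&mut self, token: Token) {
        match token {
            Token::Chars(text) => self.sink.process_op(TreeOp::AppendText(text)),
            Token::StartTag(name) => self.sink.process_op(TreeOp::CreateElement(name)),
            Token::Eof => {}
        }
    }
}
```

Because the tree builder only knows its downstream consumer through the trait, a test harness can drop in a sink that records ops for comparison against expected output, and off-thread parsing can drop in one that forwards them over a channel.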

The tokenizer holds a queue of uniquely-owned buffers. Using a queue rather than a single contiguous buffer avoids intermediate copies of the input. The points at which input is broken into discrete buffers must have no effect on the output; this is tested by the tokenizer test runner. At any time the tokenizer may get stuck waiting on additional input, so all of its state lives in a struct that persists across method calls.
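
A sketch of that persistent state under assumed names (`feed`, `next_char`); the real struct carries much more, such as the current state machine state and various temporary buffers:

```rust
use std::collections::VecDeque;

struct Tokenizer {
    /// Uniquely-owned input buffers, consumed front to back.
    /// Queueing whole buffers avoids copying them into one big string.
    input: VecDeque<String>,
    /// Byte offset of the next unread character in the front buffer.
    pos: usize,
    // ... current state, temporary buffers, etc. ...
}

impl Tokenizer {
    /// The push entry point: the caller hands over a buffer whenever
    /// one arrives, and the tokenizer runs as far as it can.
    fn feed(&mut self, buf: String) {
        self.input.push_back(buf);
        while let Some(c) = self.next_char() {
            // ... drive the state machine with `c` ...
            let _ = c;
        }
        // Out of input: simply return. Because all state lives in
        // `self`, the next call to `feed` resumes where we stopped.
    }

    /// Pull the next character, or None if we are out of input.
    fn next_char(&mut self) -> Option<char> {
        loop {
            let front = self.input.front()?;
            if let Some(c) = front[self.pos..].chars().next() {
                self.pos += c.len_utf8();
                return Some(c);
            }
            // Front buffer exhausted; drop it and try the next one.
            self.input.pop_front();
            self.pos = 0;
        }
    }
}
```

Nothing here depends on where the buffer boundaries fall, which is exactly the invariant the test runner checks.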

This would be much cleaner using tasks as coroutines, but that would impose extra requirements on the library consumer.

Buffers will eventually be tagged with the IDs of running scripts, so that document.write can insert characters at the correct point in the input stream.
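
One hypothetical shape for that tagging; `ScriptId` and the field names here are assumptions, since this is described as future work:

```rust
struct ScriptId(u32);

struct InputBuffer {
    /// The running script (if any) that produced this buffer via
    /// document.write, so the tokenizer can splice it in at that
    /// script's insertion point rather than at the back of the queue.
    origin: Option<ScriptId>,
    data: String,
}
```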

Input encoding detection is not part of this codebase yet. It seems pretty orthogonal and will probably happen after the new parser lands in Servo.

The tokenizer is coded as a very direct translation of the state machine in the spec, using macros to condense the common state machine actions. Fast paths (e.g. pop_except_from) handle long runs of characters that don't leave the current state.
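
A condensed illustration of both ideas, with made-up names: a `go!` macro standing in for the real action macros, and a single Data-state step showing the pop_except_from-style fast path.

```rust
// Made-up macro condensing the common "emit and switch state" /
// "switch state" action sequences into one line each.
macro_rules! go {
    ($me:expr, emit $c:expr, to $state:ident) => {{
        $me.emit_char($c);
        $me.state = State::$state;
    }};
    ($me:expr, to $state:ident) => {{
        $me.state = State::$state;
    }};
}

#[derive(Clone, Copy)]
enum State { Data, TagOpen, CharRef }

struct DataStateDemo { state: State, output: String }

impl DataStateDemo {
    fn emit_char(&mut self, c: char) { self.output.push(c); }

    /// One step of the Data state; returns the number of bytes consumed.
    fn step(&mut self, input: &str) -> usize {
        // Fast path: emit the longest prefix containing no character
        // that is special in this state as a single run, instead of
        // taking one state machine step per character.
        let run = input.find(|c| c == '<' || c == '&' || c == '\0').unwrap_or(input.len());
        if run > 0 {
            self.output.push_str(&input[..run]);
            return run;
        }
        // Slow path: one character, one spec-defined transition.
        match input.chars().next() {
            Some('<') => go!(self, to TagOpen),
            Some('&') => go!(self, to CharRef),
            Some(c) => go!(self, emit c, to Data), // NUL: parse error, emit as-is
            None => return 0,
        }
        1
    }
}
```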

Character references have their own state machine within the tokenizer. It uses rust-phf to build a static map covering the several thousand character reference names and all of their prefixes, so the matcher can tell at each step whether a longer name might still match.
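
A minimal sketch of such a lookup using phf's `phf_map!` macro (today's interface; the 2014-era crate built its maps differently). The entries shown are a tiny illustrative subset, and the prefix handling is elided:

```rust
use phf::phf_map;

// The real table is generated from the spec's full named character
// reference list. Real entries expand to up to two code points, and
// legacy names without the trailing semicolon are included too.
static NAMED_CHARS: phf::Map<&'static str, char> = phf_map! {
    "amp;" => '&',
    "lt;" => '<',
    "gt;" => '>',
    "notin;" => '\u{2209}',
};

fn lookup(name: &str) -> Option<char> {
    NAMED_CHARS.get(name).copied()
}
```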
