Update docs after lexer changes

nikic · Sep 17, 2023 · 21ead39
1 parent b11fca0

Showing 4 changed files with 78 additions and 124 deletions.
diff --git a/doc/2_Usage_of_basic_components.markdown b/doc/2_Usage_of_basic_components.markdown
@@ -206,11 +206,13 @@ without the `PhpParser\Node\` prefix and `\` replaced with `_`. It also does not
 It is possible to associate custom metadata with a node using the `setAttribute()` method. This data
 can then be retrieved using `hasAttribute()`, `getAttribute()` and `getAttributes()`.
 
-By default, the lexer adds the `startLine`, `endLine` and `comments` attributes. `comments` is an array
-of `PhpParser\Comment[\Doc]` instances.
+By default, the parser adds the `startLine`, `endLine`, `startTokenPos`, `endTokenPos`,
+`startFilePos`, `endFilePos` and `comments` attributes. `comments` is an array of
+`PhpParser\Comment[\Doc]` instances.
 
-The start line can also be accessed using `getStartLine()` (instead of `getAttribute('startLine')`).
-The last doc comment from the `comments` attribute can be obtained using `getDocComment()`.
+The pre-defined attributes can also be accessed using `getStartLine()` instead of
+`getAttribute('startLine')`, and so on. The last doc comment from the `comments` attribute can be
+obtained using `getDocComment()`.
 
 Pretty printer
 --------------

diff --git a/doc/component/Error_handling.markdown b/doc/component/Error_handling.markdown
@@ -4,29 +4,12 @@ Error handling
 Errors during parsing or analysis are represented using the `PhpParser\Error` exception class. In addition to an error
 message, an error can also store additional information about the location the error occurred at.
 
-How much location information is available depends on the origin of the error and how many lexer attributes have been
-enabled. At a minimum the start line of the error is usually available.
+How much location information is available depends on the origin of the error. At a minimum the start line of the error
+is usually available.
 
 Column information
 ------------------
 
-In order to receive information about not only the line, but also the column span an error occurred at, the file
-position attributes in the lexer need to be enabled:
-
-```php
-$lexerOptions = array(
-    'usedAttributes' => array('comments', 'startLine', 'endLine', 'startFilePos', 'endFilePos'),
-);
-$parser = (new PhpParser\ParserFactory())->createForHostVersion($lexerOptions);
-
-try {
-    $stmts = $parser->parse($code);
-    // ...
-} catch (PhpParser\Error $e) {
-    // ...
-}
-```
-
 Before using column information, its availability needs to be checked with `$e->hasColumnInfo()`, as the precise
 location of an error cannot always be determined. The methods for retrieving column information also have to be passed
 the source code of the parsed file. An example for printing an error:

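The column accessors take the source string because a column can only be derived from a file offset by locating the preceding newline. A minimal, library-independent sketch of that derivation is shown here. The helper name `offsetToColumn()` is hypothetical and not part of PhpParser's API; it only illustrates the idea, assuming 0-based offsets and 1-based columns.

```php
<?php

/**
 * Convert a 0-based file offset into a 1-based column number.
 * Hypothetical helper for illustration only, not a PhpParser function.
 */
function offsetToColumn(string $code, int $pos): int {
    if ($pos < 0 || $pos > strlen($code)) {
        throw new OutOfRangeException('Offset is outside of the source string');
    }
    // Position of the last newline before $pos, or false on the first line.
    $lastNewline = strrpos(substr($code, 0, $pos), "\n");
    return $lastNewline === false ? $pos + 1 : $pos - $lastNewline;
}

$code = "<?php\nfoo(;";
echo offsetToColumn($code, strpos($code, ';')), "\n"; // prints 5
```

Without the source string there is no way to know where the current line begins, which is why the offset alone is not enough.
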
diff --git a/doc/component/Lexer.markdown b/doc/component/Lexer.markdown
@@ -1,41 +1,79 @@
 Lexer component documentation
 =============================
 
-The lexer is responsible for providing tokens to the parser. The project comes with two lexers: `PhpParser\Lexer` and
-`PhpParser\Lexer\Emulative`. The latter is an extension of the former, which adds the ability to emulate tokens of
-newer PHP versions and thus allows parsing of new code on older versions.
+The lexer is responsible for providing tokens to the parser. Typical use of the library does not require direct
+interaction with the lexer, as an appropriate lexer is created by `PhpParser\ParserFactory`. The tokens produced
+by the lexer can then be retrieved using `PhpParser\Parser::getTokens()`.
 
-This documentation discusses options available for the default lexers and explains how lexers can be extended.
+Emulation
+---------
 
-Lexer options
--------------
+While this library implements a custom parser, it relies on PHP's `ext/tokenizer` extension to perform lexing. However,
+this extension only supports lexing code for the PHP version you are running on, while this library also wants to support
+parsing newer code. For that reason, the lexer performs additional "emulation" in three layers:
 
-The two default lexers accept an `$options` array in the constructor. Currently only the `'usedAttributes'` option is
-supported, which allows you to specify which attributes will be added to the AST nodes. The attributes can then be
-accessed using `$node->getAttribute()`, `$node->setAttribute()`, `$node->hasAttribute()` and `$node->getAttributes()`
-methods. A sample options array:
+First, PhpParser uses the `PhpToken`-based representation introduced in PHP 8.0, rather than the array-based tokens from
+previous versions. The `PhpParser\Token` class either extends `PhpToken` (on PHP 8.0 and later) or a polyfill implementation. The
+polyfill implementation will also perform two emulations that are required by the parser and cannot be disabled:
+
+ * Single-line comments use the PHP 8.0 representation that does not include a trailing newline. The newline will be
+   part of a following `T_WHITESPACE` token.
+ * Namespaced names use the PHP 8.0 representation using `T_NAME_FULLY_QUALIFIED`, `T_NAME_QUALIFIED` and
+   `T_NAME_RELATIVE` tokens, rather than the previous representation using a sequence of `T_STRING` and `T_NS_SEPARATOR`.
+   This means that certain code that is legal on older versions (namespaced names including whitespace, such as `A \ B`)
+   will not be accepted by the parser.
+
+Second, the `PhpParser\Lexer` base class will convert `&` tokens into the PHP 8.1 representation of either
+`T_AMPERSAND_FOLLOWED_BY_VAR_OR_VARARG` or `T_AMPERSAND_NOT_FOLLOWED_BY_VAR_OR_VARARG`. This is required by the parser
+and cannot be disabled.
+
+Finally, `PhpParser\Lexer\Emulative` performs other, optional emulations. This lexer is parameterized by `PhpVersion`
+and will try to emulate `ext/tokenizer` output for that version. This is done using separate `TokenEmulator`s for each
+emulated feature.
+
+Emulation is usually used to support newer PHP versions, but there is also very limited support for reverse emulation to
+older PHP versions, which can make keywords from newer versions non-reserved.
+
+Tokens, positions and attributes
+--------------------------------
+
+The `Lexer::tokenize()` method returns an array of `PhpParser\Token`s. The most important parts of the interface can be
+summarized as follows:
 
 ```php
-$lexer = new PhpParser\Lexer(array(
-    'usedAttributes' => array(
-        'comments', 'startLine', 'endLine'
-    )
-));
+class Token {
+    /** @var int Token ID, either T_* or ord($char) for single-character tokens. */
+    public int $id;
+    /** @var string The textual content of the token. */
+    public string $text;
+    /** @var int The 1-based starting line of the token (or -1 if unknown). */
+    public int $line;
+    /** @var int The 0-based starting position of the token (or -1 if unknown). */
+    public int $pos;
+
+    /** @param int|string|(int|string)[] $kind Token ID or text (or array of them) */
+    public function is($kind): bool;
+}
 ```
 
-The attributes used in this example match the default behavior of the lexer. The following attributes are supported:
+Unlike PHP's own `PhpToken::tokenize()` output, the token array is terminated by a sentinel token with ID 0.
+
+The lexer is normally invoked implicitly by the parser. In that case, the tokens for the last parse can be retrieved
+using `Parser::getTokens()`.
 
- * `comments`: Array of `PhpParser\Comment` or `PhpParser\Comment\Doc` instances, representing all comments that occurred
-   between the previous non-discarded token and the current one. Use of this attribute is required for the
-   `$node->getComments()` and `$node->getDocComment()` methods to work. The attribute is also needed if you wish the pretty
-   printer to retain comments present in the original code.
- * `startLine`: Line in which the node starts. This attribute is required for the `$node->getLine()` to work. It is also
-   required if syntax errors should contain line number information.
- * `endLine`: Line in which the node ends. Required for `$node->getEndLine()`.
- * `startTokenPos`: Offset into the token array of the first token in the node. Required for `$node->getStartTokenPos()`.
- * `endTokenPos`: Offset into the token array of the last token in the node. Required for `$node->getEndTokenPos()`.
- * `startFilePos`: Offset into the code string of the first character that is part of the node. Required for `$node->getStartFilePos()`.
- * `endFilePos`: Offset into the code string of the last character that is part of the node. Required for `$node->getEndFilePos()`.
+Nodes in the AST produced by the parser always correspond to some range of tokens. The parser adds a number of
+positioning attributes to allow mapping nodes back to lines, tokens or file offsets:
+
+ * `startLine`: Line in which the node starts. Used by `$node->getStartLine()`.
+ * `endLine`: Line in which the node ends. Used by `$node->getEndLine()`.
+ * `startTokenPos`: Offset into the token array of the first token in the node. Used by `$node->getStartTokenPos()`.
+ * `endTokenPos`: Offset into the token array of the last token in the node. Used by `$node->getEndTokenPos()`.
+ * `startFilePos`: Offset into the code string of the first character that is part of the node. Used by `$node->getStartFilePos()`.
+ * `endFilePos`: Offset into the code string of the last character that is part of the node. Used by `$node->getEndFilePos()`.
+
+Note that `start`/`end` here are closed rather than half-open ranges. This means that a node consisting of a single
+token will have `startTokenPos == endTokenPos` rather than `startTokenPos + 1 == endTokenPos`. This also means that a
+zero-length node will have `startTokenPos - 1 == endTokenPos`.
 
 ### Using token positions
 
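The closed-range convention for `startTokenPos`/`endTokenPos` matters when slicing the token array. The following sketch uses PHP's built-in `PhpToken::tokenize()` (PHP 8.0+) as a stand-in for the parser's token array, with hard-coded positions in place of real node attributes. The helper name `tokenRangeText()` is an assumption made for illustration.

```php
<?php

/**
 * Concatenate the text of the tokens in the closed range [$startPos, $endPos].
 * Hypothetical helper for illustration only.
 *
 * @param PhpToken[] $tokens
 */
function tokenRangeText(array $tokens, int $startPos, int $endPos): string {
    // Closed range: both ends are included, hence the "+ 1".
    // A zero-length node has $startPos - 1 == $endPos, which gives length 0.
    $length = $endPos - $startPos + 1;
    $text = '';
    foreach (array_slice($tokens, $startPos, $length) as $token) {
        $text .= $token->text;
    }
    return $text;
}

$tokens = PhpToken::tokenize('<?php echo 1 + 2;');
// Token offsets: 0 "<?php ", 1 "echo", 2 " ", 3 "1", 4 " ", 5 "+", 6 " ", 7 "2", 8 ";"
echo tokenRangeText($tokens, 3, 7), "\n"; // the expression: 1 + 2
echo tokenRangeText($tokens, 3, 3), "\n"; // a single-token range: 1
```

With real nodes, the two positions would come from `$node->getStartTokenPos()` and `$node->getEndTokenPos()`, and the array from `$parser->getTokens()`.
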
@@ -73,83 +111,16 @@ class MyNodeVisitor extends PhpParser\NodeVisitorAbstract {
     }
 }
 
-$lexerOptions = array(
-    'usedAttributes' => array(
-        'comments', 'startLine', 'endLine', 'startTokenPos', 'endTokenPos'
-    )
-);
 $parser = (new PhpParser\ParserFactory())->createForHostVersion();
 
 $visitor = new MyNodeVisitor();
 $traverser = new PhpParser\NodeTraverser($visitor);
 
 try {
     $stmts = $parser->parse($code);
-    $visitor->setTokens($lexer->getTokens());
+    $visitor->setTokens($parser->getTokens());
     $stmts = $traverser->traverse($stmts);
 } catch (PhpParser\Error $e) {
     echo 'Parse Error: ', $e->getMessage();
 }
 ```
-
-The same approach can also be used to perform specific modifications in the code, without changing the formatting in
-other places (which is the case when using the pretty printer).
-
-Lexer extension
----------------
-
-The primary public interface of the lexer consists of the following methods:
-
-```php
-function startLexing(string $code, ErrorHandler $errorHandler = null): void;
-function getTokens(): array;
-function getNextToken(string &$value = null, array &$startAttributes = null, array &$endAttributes = null): int;
-```
-
-The `startLexing()` method is invoked whenever the `parse()` method of the parser is called and is passed the source
-code that is to be lexed (including the opening tag). It can be used to reset state or preprocess the source code or tokens. The
-passed `ErrorHandler` should be used to report lexing errors.
-
-The `getTokens()` method returns the current array of `PhpParser\Token`s, which are compatible with the PHP 8 `PhpToken`
-class. This method is not used by the parser (which uses `getNextToken()`), but is useful in combination with the token
-position attributes.
-
-The `getNextToken()` method returns the ID of the next token (in the sense of `Token::$id`). If no more
-tokens are available it must return `0`, which is the ID of the `EOF` token. Furthermore, the string content of the
-token should be written into the by-reference `$value` parameter (which will then be available as `$n` in the parser).
-
-### Attribute handling
-
-The other two by-ref variables `$startAttributes` and `$endAttributes` define which attributes will eventually be
-assigned to the generated nodes: The parser will take the `$startAttributes` from the first token which is part of the
-node and the `$endAttributes` from the last token that is part of the node.
-
-E.g. if the tokens `T_FUNCTION T_STRING ... '{' ... '}'` constitute a node, then the `$startAttributes` from the
-`T_FUNCTION` token will be taken and the `$endAttributes` from the `'}'` token.
-
-An application of custom attributes is storing the exact original formatting of literals: While the parser does retain
-some information about the formatting of integers (like decimal vs. hexadecimal) or strings (like used quote type), it
-does not preserve the exact original formatting (e.g. leading zeros for integers or escape sequences in strings). This
-can be remedied by storing the original value in an attribute:
-
-```php
-use PhpParser\Lexer;
-
-class KeepOriginalValueLexer extends Lexer // or Lexer\Emulative
-{
-    public function getNextToken(&$value = null, &$startAttributes = null, &$endAttributes = null) {
-        $tokenId = parent::getNextToken($value, $startAttributes, $endAttributes);
-
-        if ($tokenId == \T_CONSTANT_ENCAPSED_STRING   // non-interpolated string
-            || $tokenId == \T_ENCAPSED_AND_WHITESPACE // interpolated string
-            || $tokenId == \T_LNUMBER                 // integer
-            || $tokenId == \T_DNUMBER                 // floating point number
-        ) {
-            // could also use $startAttributes, doesn't really matter here
-            $endAttributes['originalValue'] = $value;
-        }
-
-        return $tokenId;
-    }
-}
-```
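The PHP 8.0 token representation that the new "Emulation" section refers to can be inspected with the built-in tokenizer alone. This sketch assumes PHP 8.0 or later and uses `PhpToken` directly rather than `PhpParser\Token`; the helper name `tokenSummary()` is hypothetical.

```php
<?php

/**
 * Return [token name, token text] pairs as produced by ext/tokenizer.
 * Hypothetical helper for illustration only.
 */
function tokenSummary(string $code): array {
    $summary = [];
    foreach (PhpToken::tokenize($code) as $token) {
        $summary[] = [$token->getTokenName(), $token->text];
    }
    return $summary;
}

// The single-line comment ends before the newline; the newline itself
// becomes part of the following T_WHITESPACE token.
print_r(tokenSummary("<?php // note\n\$x;"));

// Namespaced names are single tokens (T_NAME_QUALIFIED, T_NAME_FULLY_QUALIFIED,
// T_NAME_RELATIVE), while `A \ B` still lexes as T_STRING / T_NS_SEPARATOR / T_STRING.
print_r(tokenSummary('<?php A\B; \A\B; namespace\A; A \ B;'));
```

According to the documentation change above, the polyfill is meant to produce the same representation on older PHP versions; this sketch does not exercise the polyfill itself.
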
diff --git a/doc/component/Pretty_printing.markdown b/doc/component/Pretty_printing.markdown
@@ -64,21 +64,19 @@ code which has been modified or newly inserted.
 Use of the formatting-preservation functionality requires some additional preparatory steps:
 
 ```php
-use PhpParser\{Lexer, NodeTraverser, NodeVisitor, ParserFactory, PrettyPrinter};
+use PhpParser\{NodeTraverser, NodeVisitor, ParserFactory, PrettyPrinter};
 
 $parser = (new ParserFactory())->createForHostVersion();
-
-$traverser = new NodeTraverser(new NodeVisitor\CloningVisitor());
-
-$printer = new PrettyPrinter\Standard();
-
 $oldStmts = $parser->parse($code);
 $oldTokens = $parser->getTokens();
 
+// Run CloningVisitor before making changes to the AST.
+$traverser = new NodeTraverser(new NodeVisitor\CloningVisitor());
 $newStmts = $traverser->traverse($oldStmts);
 
 // MODIFY $newStmts HERE
 
+$printer = new PrettyPrinter\Standard();
 $newCode = $printer->printFormatPreserving($newStmts, $oldStmts, $oldTokens);
 ```