Faster tokenizer lookahead #13341

Merged
merged 8 commits into from May 26, 2021
@@ -0,0 +1,22 @@
import Benchmark from "benchmark";
import baseline from "@babel-baseline/parser";
import current from "../../lib/index.js";
import { report } from "../util.mjs";

const suite = new Benchmark.Suite();
function createInput(length) {
  return "type A = " + "| (x) => void".repeat(length);
}
function benchCases(name, implementation, options) {
  for (const length of [256, 512, 1024, 2048]) {
    const input = createInput(length);
    suite.add(`${name} ${length} arrow function types`, () => {
      implementation.parse(input, options);
    });
  }
}

benchCases("baseline", baseline, { plugins: ["flow"] });
benchCases("current", current, { plugins: ["flow"] });

suite.on("cycle", report).run();
3 changes: 1 addition & 2 deletions packages/babel-parser/src/plugins/flow/index.js
@@ -9,7 +9,6 @@ import type Parser from "../../parser";
import { types as tt, type TokenType } from "../../tokenizer/types";
import * as N from "../../types";
import type { Pos, Position } from "../../util/location";
import type State from "../../tokenizer/state";
import { types as tc } from "../../tokenizer/context";
import * as charCodes from "charcodes";
import { isIteratorStart, isKeyword } from "../../util/identifier";
@@ -154,7 +153,7 @@ function hasTypeImportKind(node: N.Node): boolean {
return node.importKind === "type" || node.importKind === "typeof";
}

function isMaybeDefaultImport(state: State): boolean {
function isMaybeDefaultImport(state: { type: TokenType, value: any }): boolean {
return (
(state.type === tt.name || !!state.type.keyword) && state.value !== "from"
);
12 changes: 12 additions & 0 deletions packages/babel-parser/src/plugins/jsx/index.js
@@ -15,6 +15,10 @@ import { isIdentifierChar, isIdentifierStart } from "../../util/identifier";
import type { Position } from "../../util/location";
import { isNewLine } from "../../util/whitespace";
import { Errors, makeErrorTemplates, ErrorCodes } from "../../parser/error";
import type { LookaheadState } from "../../tokenizer/state";
import State from "../../tokenizer/state";

type JSXLookaheadState = LookaheadState & { inPropertyName: boolean };

const HEX_NUMBER = /^[\da-fA-F]+$/;
const DECIMAL_NUMBER = /^\d+$/;
@@ -573,6 +577,14 @@ export default (superClass: Class<Parser>): Class<Parser> =>
}
}

createLookaheadState(state: State): JSXLookaheadState {
const lookaheadState = ((super.createLookaheadState(
state,
): any): JSXLookaheadState);
lookaheadState.inPropertyName = state.inPropertyName;
return lookaheadState;
}

getTokenFromCode(code: number): void {
if (this.state.inPropertyName) return super.getTokenFromCode(code);

11 changes: 2 additions & 9 deletions packages/babel-parser/src/tokenizer/context.js
@@ -7,22 +7,15 @@
import { types as tt } from "./types";

export class TokContext {
constructor(
token: string,
isExpr?: boolean,
preserveSpace?: boolean,
override?: ?Function, // Takes a Tokenizer as a this-parameter, and returns void.
) {
constructor(token: string, isExpr?: boolean, preserveSpace?: boolean) {
this.token = token;
this.isExpr = !!isExpr;
this.preserveSpace = !!preserveSpace;
this.override = override;
}

token: string;
isExpr: boolean;
preserveSpace: boolean;
override: ?Function;
}

export const types: {
@@ -34,7 +27,7 @@ export const types: {
templateQuasi: new TokContext("${", false),
parenStatement: new TokContext("(", false),
parenExpression: new TokContext("(", true),
template: new TokContext("`", true, true, p => p.readTmplToken()),
template: new TokContext("`", true, true),
functionExpression: new TokContext("function", true),
functionStatement: new TokContext("function", false),
};
77 changes: 59 additions & 18 deletions packages/babel-parser/src/tokenizer/index.js
@@ -19,6 +19,7 @@ import {
skipWhiteSpace,
} from "../util/whitespace";
import State from "./state";
import type { LookaheadState } from "./state";

const VALID_REGEX_FLAGS = new Set(["g", "m", "s", "i", "y", "u"]);

@@ -144,11 +145,9 @@ export default class Tokenizer extends ParserErrors {
// Move to the next token

next(): void {
if (!this.isLookahead) {
this.checkKeywordEscapes();
if (this.options.tokens) {
this.pushToken(new Token(this.state));
}
this.checkKeywordEscapes();
@JLHwung (Contributor Author), May 20, 2021:

The `isLookahead` condition is removed because `lookahead()` now calls `nextToken()` instead of `next()`.

if (this.options.tokens) {
this.pushToken(new Token(this.state));
}

this.state.lastTokEnd = this.state.end;
@@ -175,14 +174,51 @@ export default class Tokenizer extends ParserErrors {
return this.state.type === type;
}

// TODO
/**
* Create a LookaheadState from current parser state
*
* @param {State} state
* @returns {LookaheadState}
* @memberof Tokenizer
*/
createLookaheadState(state: State): LookaheadState {
return {
pos: state.pos,
value: null,
type: state.type,
start: state.start,
end: state.end,
lastTokEnd: state.end,
context: [this.curContext()],
Contributor Author:

We don't have to copy the whole context stack because lookahead never updates it.

Reply:

`state.end` is read twice here. It would be better to store it in a local variable, which avoids the duplicated code and a second property access on each run.
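
To make the two points above concrete, here is a minimal sketch of lookahead with a lightweight state object. It is a toy whitespace tokenizer, not Babel's implementation: `ToyTokenizer`, its `comments` field, and the string context names are assumptions made for illustration. It shows the peek swapping in a small object that holds only the top context entry, reading `state.end` once, and restoring the full state afterwards.

```javascript
// Toy tokenizer (hypothetical, not Babel's): tokens are whitespace-separated words.
class ToyTokenizer {
  constructor(input) {
    this.input = input;
    this.state = {
      pos: 0,
      start: 0,
      end: 0,
      lastTokEnd: 0,
      value: null,
      context: ["braceStatement"],
      comments: [], // stands in for heavier fields a full clone would copy
    };
  }

  curContext() {
    const { context } = this.state;
    return context[context.length - 1];
  }

  // Scans one token into this.state; touches only the lightweight fields.
  nextToken() {
    const { input } = this;
    let pos = this.state.pos;
    while (pos < input.length && input[pos] === " ") pos++;
    this.state.start = pos;
    while (pos < input.length && input[pos] !== " ") pos++;
    this.state.pos = this.state.end = pos;
    this.state.value = input.slice(this.state.start, pos) || null;
  }

  // Full advance: bookkeeping first, then the scan.
  next() {
    this.state.lastTokEnd = this.state.end;
    this.nextToken();
  }

  createLookaheadState(state) {
    const end = state.end; // read once, reused for both fields below
    return {
      pos: state.pos,
      value: null,
      start: state.start,
      end,
      lastTokEnd: end,
      context: [this.curContext()], // top entry only; the peek never pushes or pops
    };
  }

  // Peeks the next token without the bookkeeping done by next().
  lookahead() {
    const old = this.state;
    this.state = this.createLookaheadState(old);
    this.nextToken();
    const peeked = this.state;
    this.state = old; // the real state was never mutated
    return peeked;
  }
}

const tokenizer = new ToyTokenizer("type A = B");
tokenizer.next();
const peeked = tokenizer.lookahead();
console.log(tokenizer.state.value, peeked.value);
```

The peeked object is only valid for the fields it lists; anything else a caller needs would have to be added to `createLookaheadState`, which is the trade-off the PR's `LookaheadState` type makes explicit.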

exprAllowed: state.exprAllowed,
inType: state.inType,
};
}

lookahead(): State {
/**
* lookahead peeks the next token, skipping changes to token context and
* comment stack. For performance it returns a limited LookaheadState
* instead of full parser state.
*
* The { column, line } Loc info is not included in lookahead since such usage
* is rare. Although it may return other location properties e.g. `curLine` and
* `lineStart`, these properties are not listed in the LookaheadState interface
* and thus the returned value is _NOT_ reliable.
*
* The tokenizer should make best efforts to avoid using any parser state
* other than those defined in LookaheadState
*
* @returns {LookaheadState}
* @memberof Tokenizer
*/
lookahead(): LookaheadState {
const old = this.state;
this.state = old.clone(true);
// For performance we use a simplified tokenizer state structure
// $FlowIgnore
this.state = this.createLookaheadState(old);

this.isLookahead = true;
this.next();
this.nextToken();
this.isLookahead = false;

const curr = this.state;
@@ -247,17 +283,16 @@ export default class Tokenizer extends ParserErrors {

nextToken(): void {
const curContext = this.curContext();
if (!curContext?.preserveSpace) this.skipSpace();
if (!curContext.preserveSpace) this.skipSpace();
Contributor Author:

The default value of context is [ct.braceStatement] so curContext is always non-nullish.
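
A small sketch of why the optional chaining can go, under the assumption that the stack is seeded with one entry and never popped empty. `ContextStack` and the two context objects are hypothetical stand-ins for the tokenizer state and `ct.braceStatement` / `ct.template`. It also shows the identity check that replaces the removed `override` callback.

```javascript
// Hypothetical miniature of the tokenizer's context handling.
const braceStatement = { token: "{", preserveSpace: false };
const template = { token: "`", preserveSpace: true };

class ContextStack {
  constructor() {
    // Seeded with one entry, so the top is defined as long as pops never empty it.
    this.context = [braceStatement];
  }
  curContext() {
    return this.context[this.context.length - 1];
  }
}

const stack = new ContextStack();
// No `?.` needed: curContext() returns an object even before anything is pushed.
const skipsSpaceByDefault = !stack.curContext().preserveSpace;

stack.context.push(template);
const skipsSpaceInTemplate = !stack.curContext().preserveSpace;
// Identity comparison instead of a per-context override function.
const readsTemplateToken = stack.curContext() === template;
```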

this.state.start = this.state.pos;
this.state.startLoc = this.state.curPosition();
if (!this.isLookahead) this.state.startLoc = this.state.curPosition();
if (this.state.pos >= this.length) {
this.finishToken(tt.eof);
return;
}

const override = curContext?.override;
if (override) {
override(this);
if (curContext === ct.template) {
this.readTmplToken();
} else {
this.getTokenFromCode(this.codePointAtPos(this.state.pos));
}
@@ -285,7 +320,8 @@ export default class Tokenizer extends ParserErrors {
}

skipBlockComment(): void {
const startLoc = this.state.curPosition();
let startLoc;
if (!this.isLookahead) startLoc = this.state.curPosition();
const start = this.state.pos;
const end = this.input.indexOf("*/", this.state.pos + 2);
if (end === -1) throw this.raise(start, Errors.UnterminatedComment);
@@ -304,6 +340,7 @@ export default class Tokenizer extends ParserErrors {
// If we are doing a lookahead right now we need to advance the position (above code)
// but we do not want to push the comment to the state.
if (this.isLookahead) return;
/*:: invariant(startLoc) */

this.pushComment(
true,
@@ -317,7 +354,8 @@ export default class Tokenizer extends ParserErrors {

skipLineComment(startSkip: number): void {
const start = this.state.pos;
const startLoc = this.state.curPosition();
let startLoc;
if (!this.isLookahead) startLoc = this.state.curPosition();
let ch = this.input.charCodeAt((this.state.pos += startSkip));
if (this.state.pos < this.length) {
while (!isNewLine(ch) && ++this.state.pos < this.length) {
@@ -328,6 +366,7 @@ export default class Tokenizer extends ParserErrors {
// If we are doing a lookahead right now we need to advance the position (above code)
// but we do not want to push the comment to the state.
if (this.isLookahead) return;
/*:: invariant(startLoc) */

this.pushComment(
false,
@@ -398,12 +437,14 @@ export default class Tokenizer extends ParserErrors {

finishToken(type: TokenType, val: any): void {
this.state.end = this.state.pos;
this.state.endLoc = this.state.curPosition();
const prevType = this.state.type;
this.state.type = type;
this.state.value = val;

if (!this.isLookahead) this.updateContext(prevType);
if (!this.isLookahead) {
this.state.endLoc = this.state.curPosition();
this.updateContext(prevType);
}
}

// ### Token reading
11 changes: 11 additions & 0 deletions packages/babel-parser/src/tokenizer/state.js
@@ -178,3 +178,14 @@ export default class State {
return state;
}
}

export type LookaheadState = {
pos: number,
value: any,
type: TokenType,
start: number,
end: number,
/* Used only in readSlashToken */
exprAllowed: boolean,
Contributor Author:

This can be 5% faster if we get rid of `exprAllowed` and `inType`, so that they don't have to be copied.

Member:

Does copying a couple of booleans affect performance that much?

Contributor Author:

Yeah. Unlike in C, a boolean in V8 occupies 64 bits of memory: it is essentially a pointer to the true/false built-ins. A sequence of booleans is not compressed into a bit array, so two booleans take 16 bytes.

In Babel 7.7 we improved the traverser performance by compressing 3 booleans into a bit array: https://hackmd.io/UMdqwvVgQGaofjCZHfGKrA?view#Compress-boolean-flags

Note that the benchmark is specifically constructed to highlight the performance impact of `lookahead()`, so in the real world the gains are less significant. However, performance improvement is a long-term task and eventually every tiny improvement counts.
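
For reference, a minimal sketch of the boolean-compression idea mentioned above: several flags packed into one integer so that copying them is a single assignment. The flag names and helper functions are hypothetical and are not taken from Babel's source.

```javascript
// Hypothetical flag names; each flag occupies one bit of a single integer.
const EXPR_ALLOWED = 1 << 0;
const IN_TYPE = 1 << 1;
const IN_PROPERTY_NAME = 1 << 2;

// Returns a new flags value with `flag` switched on or off.
function setFlag(flags, flag, on) {
  return on ? flags | flag : flags & ~flag;
}

function hasFlag(flags, flag) {
  return (flags & flag) !== 0;
}

let flags = 0;
flags = setFlag(flags, EXPR_ALLOWED, true);
flags = setFlag(flags, IN_TYPE, true);
flags = setFlag(flags, IN_TYPE, false);

// Copying every flag into a lookahead state is one integer assignment.
const lookaheadFlags = flags;
```

The cost is readability: call sites read `hasFlag(flags, IN_TYPE)` rather than `state.inType`, so this is usually worth it only on paths that a benchmark shows to be hot.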

@KFlash, May 24, 2021:

@JLHwung Funny to see devs playing with perf :) Your suggested changes are only micro-optimizations and not scalable, but yeah, you are on the right track.
The entire tokenizer is a class that extends `ParserErrors`. Run a benchmark, check the cost of `extends`, and you will be shocked by the perf loss.
Each token is its own class. Try flattening it into an infinite state machine instead. I'm sure you would gain 20% perf. As it is right now, it consumes memory.

You really should reduce the amount of object access too. This is expensive:
`this.state.lastTokEnd` vs `this.lastTokEnd`. Why do you need this extra state object inside the class?
A quick Google search will also tell you that this is expensive.

There are many other things you can do to optimize Babel parser. I mentioned only a few.
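
The nested-versus-flat property access claim can be checked rather than assumed. Below is a rough timing sketch, with hypothetical `Nested` and `Flat` classes and the assumption that `performance.now()` is available as a global (recent Node.js and browsers). A loop like this is easily distorted by JIT optimizations, so treat the numbers as indicative only; a harness such as the Benchmark suite used in this PR gives steadier results.

```javascript
// Assumption: performance.now() exists as a global in the runtime.
class Nested {
  constructor() {
    this.state = { lastTokEnd: 0 };
  }
  bump() {
    this.state.lastTokEnd++; // two property loads per update
  }
  read() {
    return this.state.lastTokEnd;
  }
}

class Flat {
  constructor() {
    this.lastTokEnd = 0;
  }
  bump() {
    this.lastTokEnd++; // one property load per update
  }
  read() {
    return this.lastTokEnd;
  }
}

// Runs `iterations` updates and reports elapsed milliseconds plus the final counter.
function time(target, iterations) {
  const started = performance.now();
  for (let i = 0; i < iterations; i++) target.bump();
  return { ms: performance.now() - started, result: target.read() };
}

const iterations = 1e6;
const nested = time(new Nested(), iterations);
const flat = time(new Flat(), iterations);
console.log(`nested: ${nested.ms.toFixed(2)} ms, flat: ${flat.ms.toFixed(2)} ms`);
```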

inType: boolean,
};
@@ -0,0 +1,2 @@
/*1*/ export /*2*/ { /*3*/ A /*4*/, /*5*/ B /*6*/ as /*7*/ C /*8*/ } /*9*/ from /*10*/ "foo";
/*1*/ export /*2*/ * /*3*/ from /*4*/ "foo"