
Handling escape symbols #1349

Open
jonathantneal opened this issue Apr 18, 2020 · 18 comments

@jonathantneal
Member

jonathantneal commented Apr 18, 2020

From https://acss.io/#pseudo-classes:

Each line should parse as one word (i.e. one identifier):

C\(\#0280ae\)
C\(brandColor\)
C\(\#fff\)\:h:hover

This can be verified with https://rawgit.com/tabatkins/parse-css/master/example.html

This issue continues from shellscape/postcss-values-parser#93

With thanks and credit to @nex3 and @ai for identifying this issue.


Update: I have started work on updating the tokenizer, but I may need assistance as I integrate or abandon the current multi-character tokens. I don’t see how those tokens benefit speed, and any risk of inaccuracy seems too steep a price to pay.

@ai, thank you for the wonderful documentation at https://github.com/postcss/postcss/blob/7.0.27/docs/architecture.md#tokenizer--libtokenizees6-

@ai
Member

ai commented Apr 18, 2020

  1. What are CSS escaping rules? Ignore the next symbol after \?
  2. What is the current output?

@jonathantneal
Member Author

jonathantneal commented Apr 18, 2020

  1. What are CSS escaping rules? Ignore the next symbol after \?

The general escape rule is that a backslash swallows the next character (Reference).

Within names (or URLs), an escape cannot consume a newline, but within strings it can.

[CSS Escape Diagram]

In the above diagram, you’ll see exceptions for hex digits. In most cases we could ignore this, since escape codes always begin words (what the spec calls “identifiers”) and hex digits are already covered by identifier rules, but the swallowing of whitespace afterward may be important.
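The rule described above can be sketched as a tiny helper. This is a hypothetical illustration following the CSS Syntax spec’s consume-an-escaped-code-point algorithm, not PostCSS source; `consumeEscape` is a made-up name:

```javascript
// Hypothetical sketch of the escape rule described above (not PostCSS source).
// `pos` points at the character AFTER the backslash; returns the index of the
// first character that is no longer part of the escape.
function consumeEscape (css, pos) {
  if (pos >= css.length) return pos // EOF: the escape ends here
  if (/[0-9a-fA-F]/.test(css[pos])) { // hex escape: up to 6 digits...
    let end = pos + 1
    while (end < css.length && end - pos < 6 && /[0-9a-fA-F]/.test(css[end])) end++
    if (/\s/.test(css[end] || '')) end++ // ...plus one swallowed whitespace
    return end
  }
  if (css[pos] === '\n') return pos // invalid in names: newline not consumed
  return pos + 1 // otherwise swallow one character
}
```

For example, given `\#0280ae` with `pos` on the `0`, the helper consumes all six hex digits plus one trailing whitespace character.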

  1. What is the current output?

Given the CSS f\:1, the token output currently is this:

[
  [ 'word', 'f', 1, 1, 1, 2 ],
  [ 'word', '\\:', 1, 3, 1, 4 ],
  [ 'word', '1', 1, 5, 1, 5 ]
]

And it should be more like this:

[
  [ 'word', 'f\\:1', 1, 1, 1, 5 ]
]
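To sketch why `f\:1` should stay one word: under the identifier rules, a backslash escape is part of the name, so a name-matching pattern must treat `\` plus any non-newline character as a name code point. A minimal and deliberately simplified illustration (not the PostCSS implementation; real identifier rules are broader):

```javascript
// Simplified sketch: match one CSS word where "\" escapes the next character.
// This only demonstrates the escape behavior, not full identifier syntax.
const WORD = /(?:[A-Za-z0-9_-]|\\[^\n])+/y

function firstWord (css) {
  WORD.lastIndex = 0
  const match = WORD.exec(css)
  return match && match[0]
}

firstWord('f\\:1')            // one word: 'f\\:1'
firstWord('C\\(\\#0280ae\\)') // one word, matching the acss.io examples
```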

@jonathantneal
Member Author

jonathantneal commented Apr 20, 2020

I only have a small update. I’m also wrapping up school with the kids and need to start work, so please forgive typos or confusing statements.

I was able to complete my first pass of the tokenizer this weekend. Here is the source:

https://gist.github.com/jonathantneal/dd859523df584b012dfa1cc8795d7ca9

Here are some benchmark results.

PostCSS Tokenizer 7.0.27     x 133 ops/sec ±1.69% (79 runs sampled)
PostCSS Tokenizer 2020-04-20 x 72.12 ops/sec ±3.51% (77 runs sampled)
Fastest test is PostCSS Tokenizer 7.0.27 at 1.85x faster than PostCSS Tokenizer 2020-04-20

PostCSS Tokenizer 7.0.27:      8 ms
PostCSS Tokenizer 2020-04-20: 14 ms (1.8 times slower)
PostCSS Tokenizer 7.0.27     x 113 ops/sec ±5.00% (70 runs sampled)
PostCSS Tokenizer 2020-04-20 x 70.98 ops/sec ±3.91% (77 runs sampled)
Fastest test is PostCSS Tokenizer 7.0.27 at 1.59x faster than PostCSS Tokenizer 2020-04-20

PostCSS Tokenizer 7.0.27:      9 ms
PostCSS Tokenizer 2020-04-20: 14 ms (1.6 times slower)
  • 😢 Yes, it is significantly slower.
  • 🤞 I hope it is slower because I wrote it poorly.
  • 🤔 I understand that a spec-compliant tokenizer should be as performant as the current tokenizer to even consider such a change.

And here is an example structure of a token:

// where css is "-.567800E-0012780em"
const tokenExample1 = [
  /** @type {TokenType}  */ "number",
  /** @type {TokenValue} */ "-.567800E-0012780em",
  /** @type {TokenStart} */ 0,
  /** @type {TokenEnd}   */ 19,
  /** @type {TokenSplit} */ 17
]

// where css is "1138--thx"
const tokenExample2 = [
  /** @type {TokenType}  */ "number",
  /** @type {TokenValue} */ "1138--thx",
  /** @type {TokenStart} */ 0,
  /** @type {TokenEnd}   */ 9,
  /** @type {TokenSplit} */ 4
]

// where css is "foo\:bar"
const tokenExample3 = [
  /** @type {TokenType}  */ "identifier",
  /** @type {TokenValue} */ "foo\\:bar",
  /** @type {TokenStart} */ 0,
  /** @type {TokenEnd}   */ 8,
  /** @type {TokenSplit} */ 0
]

/** @typedef {"comment" | "delimiter" | "identifier" | "number" | "string" | "whitespace"} TokenType - Name of the token. */
/** @typedef {string} TokenValue - Value of the token. */
/** @typedef {number} TokenStart - Character index of the source where the token starts. */
/** @typedef {number} TokenEnd - Character index of the source where the token ends. */
/** @typedef {number} TokenSplit - Character index of the source where the token can be split. Splits occur between numbers and their units, and between identifier symbols (like `@` or `#`) and their identifier names. */
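The `TokenSplit` field can be consumed like this. The destructuring layout is taken from the token examples above and is otherwise an assumption, not a published API:

```javascript
// Split a token's value at its TokenSplit index (assumed field layout:
// [type, value, start, end, split], per the examples above).
function splitToken ([type, value, start, end, split]) {
  return split > 0
    ? [value.slice(0, split), value.slice(split)]
    : [value]
}

splitToken(['number', '-.567800E-0012780em', 0, 19, 17]) // ['-.567800E-0012780', 'em']
splitToken(['number', '1138--thx', 0, 9, 4])             // ['1138', '--thx']
splitToken(['identifier', 'foo\\:bar', 0, 8, 0])         // ['foo\\:bar'] (no split)
```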

@nex3
Contributor

nex3 commented Apr 21, 2020

Thanks a ton for looking into this, @jonathantneal! Maybe @ai can provide some performance tips?

@ai
Member

ai commented Apr 21, 2020

Do we really need to change it in the tokenizer if it affects performance?

We can change the parser instead and avoid any performance investigations.

@jonathantneal
Member Author

jonathantneal commented Apr 24, 2020

Having some time tonight, I’ve pushed the latest version of the tokenizer to https://github.com/csstools/tokenizer

It includes benchmark tests and the ability to create token JSON files, if that’s at all helpful.

These benchmarks were averaged from my local machine:

PostCSS Tokenizer 7.0.27:      49548 tokens in  7 ms
PostCSS Tokenizer Development: 60458 tokens in 14 ms (2.0 times slower)

Do we really need to change it in the tokenizer if it affects performance?

I don’t know, because I don’t yet understand some parts of the current tokenizer. For instance, the current tokenizer produces a single token from CSS like sans-serif, (note the trailing comma), but that should be 2 separate tokens. Is this a shortcut to save parsing time?

I think the answer to your question has a couple dependencies. The first dependency is to find out the performance impact the current PostCSS parser adds to its tokenizer. If the currently combined tokenizer and parser are slower than this new tokenizer I’m working on, then I think it’s worth it to have a "slower" tokenizer but a faster overall product that handles CSS accurately.

We can change the parser instead and avoid any performance investigations.

Would you have time to investigate this? My earlier comment describes how escapes are assembled.


I’ve also included a small demo of its tokenization on CodePen.

@jonathantneal
Member Author

jonathantneal commented Apr 28, 2020

I want to provide more frequent updates here, but my free time is split between this and a significant upgrade to PostCSS Preset Env.

I think the next thing I can do is plug the tokenizer into the PostCSS object model, and see what the final performance impact is on Bootstrap’s CSS.

Let me admit that I have trouble solving the small issue (escapes) without solving a big issue (CSS syntax compliance). I do believe solving the bigger issue will knock out the little one, though. My hope is that I can at least push this far enough along that you and others can bring it across the finish line.

@jonathantneal
Member Author

Please let me know if I should post less frequently. Here are my updates:

In Progress: Combine the Experimental Tokenizer with the PostCSS Object Model

I have not yet integrated the experimental tokenizer into the PostCSS object model. I think I need a highly focused 4 hour window to accomplish this; time I have not yet had. Once this is done, I should be able to verify whether the escape symbols are properly handled with this combination. The weekend is soon.

Now: Compression May Improve Performance

I have added a compression step to the experimental tokenizer using terser, and the compressed tokenizer is regularly scoring better benchmarks than the uncompressed version. A unique element of the compression is that it replaces every character-code variable with the actual character code; e.g. case TAB becomes case 9.

As of April 30, 2020, these benchmarks were averaged from my local machine:

PostCSS Tokenizer 7.0.27:            49548 tokens in 10 ms
PostCSS Tokenizer Development (min): 60458 tokens in 14 ms (1.4 times slower)
PostCSS Tokenizer Development:       60458 tokens in 18 ms (1.8 times slower)

Benchmarks are not an exact science for me. I have not changed anything about the experimental tokenizer, but it is more regularly parsing 1.8x slower, an improvement from 2x slower. Still, the compressed experimental tokenizer is benchmarking only 1.4x slower than the current PostCSS tokenizer, and it is picking up 1.2x the tokens.

Soon: Improving Syntax Tests

I would like to migrate some of the css-syntax tests from the web-platform-tests project into this development work. This is lower priority than integrating the Object Model, and I will fight the temptation to do this first. With new versions of these tests, PostCSS could protect itself against syntactically valid edge cases, and even trail-blaze work toward the Houdini Parser API. The WPT tests themselves are not entirely helpful, because browsers are limited to testing the serialization results of CSS.

@ai
Member

ai commented Apr 30, 2020

@jonathantneal
Member Author

jonathantneal commented May 8, 2020

I’m feeling much better after analyzing those jumps. Let me share what I’ve learned. I’d like to know what you think.

I was able to implement the comment jump. There was no perceivable improvement. This is what it looked like:

			// comment
			case FS:
				if (css.charCodeAt(pos + 1) === STAR) {
					typ = 'comment'
					pos = css.indexOf('*/', pos + 2) + 1
					if (pos === 0) {
						pos = len
					} else {
						++pos
					}
				} else {
					++pos
				}
				break

I was able to implement the string jump. There was no perceivable improvement. This is what it looked like:

			// string
			case DBLQ:
			case SNGQ:
				var quote = cp0 === DBLQ ? '"' : "'"
				var escaped
				do {
					escaped = false
					pos = css.indexOf(quote, pos + 1) + 1
					if (pos === 0) {
						pos = len
						break
					}
					escapePos = pos - 1
					while (css.charCodeAt(escapePos - 1) === BS) {
						escapePos -= 1
						escaped = !escaped
					}
				} while (escaped)
				break

So far, it seems those jumps are not beating symbol checks.
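For clarity, the escaped-quote scan in that string jump can be lifted into a standalone helper; this is just the same indexOf scan and backslash-parity check repackaged for testing, not new behavior:

```javascript
var BS = 92 // character code of "\\"

// Scan from an opening quote to the index just past its closing quote,
// skipping any quote preceded by an odd number of backslashes (escaped).
function endOfString (css, start) {
  var quote = css[start]
  var pos = start
  var escaped
  do {
    escaped = false
    pos = css.indexOf(quote, pos + 1) + 1
    if (pos === 0) return css.length // unterminated string: run to EOF
    var escapePos = pos - 1
    while (css.charCodeAt(escapePos - 1) === BS) {
      escapePos -= 1
      escaped = !escaped
    }
  } while (escaped)
  return pos
}

endOfString('"a\\"b"x', 0) // 6: the inner quote is escaped, so the string ends at index 5
```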

After looking at the next two jumps, I think I see where some of the performance differences are coming from. However, I’m not sure I should implement those jumps.

I did not want to implement the parentheses jumps because they skip tokenizing the contents between the parentheses. This is probably one of the reasons the current tokenizer is picking up 49,548 tokens from bootstrap.css, while this experimental tokenizer is picking up 60,458 tokens.

Put another way; the experimental tokenizer is 1.4 times slower but it is also picking up 1.2 times more tokens, and it is accurately handling the escapes.


Up Next: I think I have time tonight to wire in this new tokenizer to the current object model.

EDIT: I have added a very crude implementation of a PostCSS-like Object Model. The code is pushed to the same repo.

Collecting PostCSS Parser Benchmarks...

PostCSS Parser 7.0.27:       1204 nodes in 14 ms
PostCSS Experimental Parser: 13680 nodes in 20 ms (1.4 times slower)
  • On the down side — the experimental parser is 1.4 times slower than the current parser, same as the tokenizer comparison. 😬
  • On the up side — the current parser generated about 100 tokens/ms, while the experimental parser generated about 684 tokens/ms. 😅

@ai
Member

ai commented May 8, 2020

I did not want to implement the parentheses jumps because they skip tokenizing the contents between the parentheses. This is probably one of the reasons the current tokenizer is picking up 49,548 tokens from bootstrap.css, while this experimental tokenizer is picking up 60,458 tokens.

Yeap, we need to count PostCSS core + PostCSS Value parser time.

If the new tokenizer prepares tokens for value/selector parsing, a small slowdown there will bring a big speedup in value/selector parsing.

the current parser generated about 100 tokens/ms, while the experimental parser generated about 684 tokens/ms.

I like this metric :D

@jonathantneal
Member Author

jonathantneal commented May 13, 2020

I haven’t had much time, but I have added another test that runs PostCSS with Selector and Value parsing. The functionality looks like this, which I hope we’ll find generous:

parsed.root.walk(node => {
  switch (node.type) {
    case 'decl':
      postcssValuesParser(node.value)
      break
    case 'rule':
      postcssSelectorParser.processSync(node.selector)
      break
  }
})

In reality, I think most plugins would run those parsers in separate walks, and most plugin packs would run them multiple times. Still, here are the updated parse times:

Collecting PostCSS Parser Benchmarks...

PostCSS Parser 7.0.29:              1204 nodes in 14 ms
PostCSS Experimental Parser:       13686 nodes in 17 ms (1.2 times slower)
PostCSS + Selector + Value Parser:  1204 nodes in 78 ms (5.5 times slower)

https://github.com/csstools/tokenizer#collecting-postcss-parser-benchmarks

@jonathantneal
Member Author

jonathantneal commented May 15, 2020

Andrey, please let me know if I’m pestering with these updates or if I can make them more helpful.

While analyzing the "slower" parts of the tokenizer, it seems like eagerly checking the character ahead improves overall performance. I have rewritten the tokenizer to take advantage of this.

I have also added 2 fields to a token; they are the opening and closing distances between the meaningful value of a token and its delimiters. Although “delimiter” is a poor term, this refers to the split between things like the @ & media in a @media At-Identifier token, the 2 & em in a 2em Number token, or the " & hello & " in a "hello" String token.
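A sketch of how those two distance fields could be consumed; the parameter names `lead` and `tail` are my invention for illustration, based only on the description above:

```javascript
// Recover the "meaningful" part of a token's value from its delimiter
// distances: `lead` opening characters and `tail` closing characters.
// (Hypothetical parameter names; based on the description in this comment.)
function innerValue (value, lead, tail) {
  return value.slice(lead, value.length - tail)
}

innerValue('@media', 1, 0)  // 'media' (strips the "@")
innerValue('2em', 0, 2)     // '2' (strips the unit)
innerValue('"hello"', 1, 1) // 'hello' (strips both quotes)
```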

Anyway, I think you’ll really like these results!

Compressing PostCSS Tokenizer...

PostCSS Tokenizer Development:       1910 B
PostCSS Tokenizer Development (min):  638 B
PostCSS Tokenizer Development (web):  639 B

Collecting PostCSS Tokenizer Benchmarks...

PostCSS Tokenizer Development:       58721 tokens in 8 ms (1.0 times faster)
PostCSS Tokenizer Development (min): 58721 tokens in 8 ms (1.0 times faster)
PostCSS Tokenizer 7.0.30:            49548 tokens in 8 ms


Compressing PostCSS Parser...

PostCSS Parser Development:       1369 B
PostCSS Parser Development (min):  836 B
PostCSS Parser Development (web):  805 B

Collecting PostCSS Parser Benchmarks...

PostCSS Experimental Parser:       56024 nodes in 10 ms (1.6 times faster)
PostCSS Parser 7.0.30:              6240 nodes in 15 ms
PostCSS + Selector + Value Parser: 28491 nodes in 86 ms (5.5 times slower)

— From https://github.com/csstools/tokenizer#collecting-postcss-parser-benchmarks

@ai
Member

ai commented May 15, 2020

Andrey, please let me know if I’m pestering with these updates or if I can make them more helpful.

A new performance breakthrough is awesome 😍.

Let’s change tokenizer in 8.0.

Is it possible to use it in the safe parser or SCSS parser without changes? I forked the current one and can fork the new one too if it turns out to be impossible to customise.

@alexander-akait

@ai what is the status of this feature for version 8? Postponed?

@ai
Member

ai commented Aug 26, 2020

@evilebottnawi yeap, we decided to do it after 8.0, since the new parser would make the 8.0 release too big and force us to wait too long

@alexander-akait

Sounds good 👍

@romainmenke
Contributor

Took me a while to analyse this issue because initially I interpreted it as “PostCSS has a bug with escaped symbols”.

But this is not a bug in PostCSS. As far as I can tell PostCSS produces the correct AST with or without escaped symbols.

It produces a few words too many at the tokenizer level, but this goes away in the parser.

The original issue was that a downstream package, which uses the same tokenizer, did not produce the correct AST. Changing the tokenizer here would produce different output there and might have fixed some issues.

From a different perspective this sounds very risky to me.
A fix for one issue might be a number of new bugs for other cases.

That the downstream package uses PostCSS internals also caused other issues, and most plugins no longer use that dependency because of it.


I think that this has a lot of overlap with #1145 (comment)
