
Handling escape symbols #1349

Open
jonathantneal opened this issue Apr 18, 2020 · 18 comments

@jonathantneal
Member

jonathantneal commented Apr 18, 2020

From https://acss.io/#pseudo-classes:

Each line should parse as one word (i.e. one identifier):

C\(\#0280ae\)
C\(brandColor\)
C\(\#fff\)\:h:hover

This can be verified with https://rawgit.com/tabatkins/parse-css/master/example.html

This issue continues from shellscape/postcss-values-parser#93

With thanks and credit to @nex3 and @ai for identifying this issue.


Update: I have started work on updating the tokenizer, but I may need assistance as I integrate or abandon the current multi-character tokens. I don’t see how those tokens benefit speed, and any risk of inaccuracy seems too steep a price to pay.

@ai, thank you for the wonderful documentation at https://github.com/postcss/postcss/blob/7.0.27/docs/architecture.md#tokenizer--libtokenizees6-

@ai
Member

ai commented Apr 18, 2020

  1. What are CSS escaping rules? Ignore the next symbol after \?
  2. What is the current output?

@jonathantneal
Member Author

jonathantneal commented Apr 18, 2020

  1. What are CSS escaping rules? Ignore the next symbol after \?

The general escape rule is that a backslash swallows the next character (Reference).

Within names (or URLs), an escape cannot consume a newline, but within strings it can.

[CSS Escape Diagram]

In the above diagram, you’ll see exceptions for hex digits. In most cases we could ignore this, since escape codes always begin words (what the spec calls “identifiers”) and hex digits are already covered by identifier rules, but the swallowing of whitespace afterward may be important.
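The rule described above can be sketched as a tiny helper. This is a hypothetical illustration following the CSS Syntax spec’s consume-an-escaped-code-point algorithm, not PostCSS source; `consumeEscape` is a made-up name:

```javascript
// Hypothetical sketch of the escape rule described above (not PostCSS source).
// `pos` points at the character AFTER the backslash; returns the index of the
// first character that is no longer part of the escape.
function consumeEscape (css, pos) {
  if (pos >= css.length) return pos // EOF: the escape ends here
  if (/[0-9a-fA-F]/.test(css[pos])) { // hex escape: up to 6 digits...
    let end = pos + 1
    while (end < css.length && end - pos < 6 && /[0-9a-fA-F]/.test(css[end])) end++
    if (/\s/.test(css[end] || '')) end++ // ...plus one swallowed whitespace
    return end
  }
  if (css[pos] === '\n') return pos // invalid in names: newline not consumed
  return pos + 1 // otherwise swallow one character
}
```

For example, given `\#0280ae` with `pos` on the `0`, the helper consumes all six hex digits plus one trailing whitespace character.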

  1. What is the current output?

Given the CSS f\:1, the token output currently is this:

[
  [ 'word', 'f', 1, 1, 1, 2 ],
  [ 'word', '\\:', 1, 3, 1, 4 ],
  [ 'word', '1', 1, 5, 1, 5 ]
]

And it should be more like this:

[
  [ 'word', 'f\\:1', 1, 1, 1, 5 ]
]
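To sketch why `f\:1` should stay one word: under the identifier rules, a backslash escape is part of the name, so a name-matching pattern must treat `\` plus any non-newline character as a name code point. A minimal and deliberately simplified illustration (not the PostCSS implementation; real identifier rules are broader):

```javascript
// Simplified sketch: match one CSS word where "\" escapes the next character.
// This only demonstrates the escape behavior, not full identifier syntax.
const WORD = /(?:[A-Za-z0-9_-]|\\[^\n])+/y

function firstWord (css) {
  WORD.lastIndex = 0
  const match = WORD.exec(css)
  return match && match[0]
}

firstWord('f\\:1')            // one word: 'f\\:1'
firstWord('C\\(\\#0280ae\\)') // one word, matching the acss.io examples
```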

@jonathantneal
Member Author

jonathantneal commented Apr 20, 2020

I only have a small update. I’m also wrapping up school with the kids and need to start work, so please forgive typos or confusing statements.

I was able to complete my first pass of the tokenizer this weekend. Here is the source:

https://gist.github.com/jonathantneal/dd859523df584b012dfa1cc8795d7ca9

Here are some benchmark results.

PostCSS Tokenizer 7.0.27     x 133 ops/sec ±1.69% (79 runs sampled)
PostCSS Tokenizer 2020-04-20 x 72.12 ops/sec ±3.51% (77 runs sampled)
Fastest test is PostCSS Tokenizer 7.0.27 at 1.85x faster than PostCSS Tokenizer 2020-04-20

PostCSS Tokenizer 7.0.27:      8 ms
PostCSS Tokenizer 2020-04-20: 14 ms (1.8 times slower)
PostCSS Tokenizer 7.0.27     x 113 ops/sec ±5.00% (70 runs sampled)
PostCSS Tokenizer 2020-04-20 x 70.98 ops/sec ±3.91% (77 runs sampled)
Fastest test is PostCSS Tokenizer 7.0.27 at 1.59x faster than PostCSS Tokenizer 2020-04-20

PostCSS Tokenizer 7.0.27:      9 ms
PostCSS Tokenizer 2020-04-20: 14 ms (1.6 times slower)
  • 😢 Yes, it is significantly slower.
  • 🤞 I hope it is slower because I wrote it poorly.
  • 🤔 I understand that a spec-compliant tokenizer should be as performant as the current tokenizer to even consider such a change.

And here is an example structure of a token:

// where css is "-.567800E-0012780em"
const tokenExample1 = [
  /** @type {TokenType}  */ "number",
  /** @type {TokenValue} */ "-.567800E-0012780em",
  /** @type {TokenStart} */ 0,
  /** @type {TokenEnd}   */ 19,
  /** @type {TokenSplit} */ 17
]

// where css is "1138--thx"
const tokenExample2 = [
  /** @type {TokenType}  */ "number",
  /** @type {TokenValue} */ "1138--thx",
  /** @type {TokenStart} */ 0,
  /** @type {TokenEnd}   */ 9,
  /** @type {TokenSplit} */ 4
]

// where css is "foo\:bar"
const tokenExample3 = [
  /** @type {TokenType}  */ "identifier",
  /** @type {TokenValue} */ "foo\\:bar",
  /** @type {TokenStart} */ 0,
  /** @type {TokenEnd}   */ 8,
  /** @type {TokenSplit} */ 0
]

/** @typedef {"comment" | "delimiter" | "identifier" | "number" | "string" | "whitespace"} TokenType - Name of the token. */
/** @typedef {string} TokenValue - Value of the token. */
/** @typedef {number} TokenStart - Character index of the source where the token starts. */
/** @typedef {number} TokenEnd - Character index of the source where the token ends. */
/** @typedef {number} TokenSplit - Character index of the source where the token can be split. Splits occur between numbers and their units, and between identifier symbols (like `@` or `#`) and their identifier names. */
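The `TokenSplit` field can be consumed like this. The destructuring layout is taken from the token examples above and is otherwise an assumption, not a published API:

```javascript
// Split a token's value at its TokenSplit index (assumed field layout:
// [type, value, start, end, split], per the examples above).
function splitToken ([type, value, start, end, split]) {
  return split > 0
    ? [value.slice(0, split), value.slice(split)]
    : [value]
}

splitToken(['number', '-.567800E-0012780em', 0, 19, 17]) // ['-.567800E-0012780', 'em']
splitToken(['number', '1138--thx', 0, 9, 4])             // ['1138', '--thx']
splitToken(['identifier', 'foo\\:bar', 0, 8, 0])         // ['foo\\:bar'] (no split)
```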

@nex3
Contributor

nex3 commented Apr 21, 2020

Thanks a ton for looking into this, @jonathantneal! Maybe @ai can provide some performance tips?

@ai
Member

ai commented Apr 21, 2020

Do we really need to change it in the tokenizer if it affects performance?

We can change the parser instead and avoid any performance investigations.

@jonathantneal
Member Author

jonathantneal commented Apr 24, 2020

Having some time tonight, I’ve pushed the latest version of the tokenizer to https://github.com/csstools/tokenizer

It includes benchmark tests and the ability to create token JSON files, if that’s at all helpful.

These benchmarks were averaged from my local machine:

PostCSS Tokenizer 7.0.27:      49548 tokens in  7 ms
PostCSS Tokenizer Development: 60458 tokens in 14 ms (2.0 times slower)

Do we really need to change it in the tokenizer if it affects performance?

I don’t know, because I don’t yet understand some parts of the current tokenizer. For instance, the current tokenizer produces a single token from CSS like sans-serif, (note the trailing comma), but that should be 2 separate tokens. Is this a shortcut to save parsing time?

I think the answer to your question has a couple dependencies. The first dependency is to find out the performance impact the current PostCSS parser adds to its tokenizer. If the currently combined tokenizer and parser are slower than this new tokenizer I’m working on, then I think it’s worth it to have a "slower" tokenizer but a faster overall product that handles CSS accurately.

We can change the parser instead and avoid any performance investigations.

Would you have time to investigate this? My earlier comment describes how escapes are assembled.


I’ve also included a small demo of its tokenization on CodePen.

@jonathantneal
Member Author

jonathantneal commented Apr 28, 2020

I want to provide more frequent updates here, but my free time is split between this and a significant upgrade to PostCSS Preset Env.

I think the next thing I can do is plug the tokenizer into the PostCSS object model, and see what the final performance impact is on Bootstrap’s CSS.

Let me admit that I have trouble solving the small issue (escapes) without solving a big issue (CSS syntax compliance). I do believe solving the bigger issue will knock out the little one, though. My hope is that I can at least push this far enough along that you and others can bring it across the finish line.

@jonathantneal
Member Author

Please let me know if I should post less frequently. Here are my updates:

In Progress: Combine the Experimental Tokenizer with the PostCSS Object Model

I have not yet integrated the experimental tokenizer into the PostCSS object model. I think I need a highly focused 4 hour window to accomplish this; time I have not yet had. Once this is done, I should be able to verify whether the escape symbols are properly handled with this combination. The weekend is soon.

Now: Compression May Improve Performance

I have added a compression step to the experimental tokenizer using terser, and the compressed tokenizer is regularly scoring better benchmarks than the uncompressed version. A unique element of the compression is that it replaces every character-code variable with the actual character code; e.g. case TAB becomes case 9.

As of April 30, 2020, these benchmarks were averaged from my local machine:

PostCSS Tokenizer 7.0.27:            49548 tokens in 10 ms
PostCSS Tokenizer Development (min): 60458 tokens in 14 ms (1.4 times slower)
PostCSS Tokenizer Development:       60458 tokens in 18 ms (1.8 times slower)

Benchmarks are not an exact science for me. I have not changed anything about the experimental tokenizer, but it is more regularly parsing 1.8x slower, an improvement from 2x slower. Still, the compressed experimental tokenizer is benchmarking only 1.4x slower than the current PostCSS tokenizer, and it is picking up 1.2x the tokens.

Soon: Improving Syntax Tests

I would like to migrate some of the css-syntax tests from the web-platform-tests project into this development work. This is lower priority than integrating the Object Model, and I will fight the temptation to do this first. With new versions of these tests, PostCSS could protect itself against syntactically valid edge cases, and even trail-blaze work toward the Houdini Parser API. The WPT tests themselves are not entirely helpful, because browsers are limited to testing the serialization results of CSS.

@ai
Member

ai commented Apr 30, 2020

@jonathantneal
Member Author

jonathantneal commented May 8, 2020

I’m feeling much better after analyzing those jumps. Let me share what I’ve learned. I’d like to know what you think.

I was able to implement the comment jump. There was no perceivable improvement. This is what it looked like:

			// comment
			case FS:
				if (css.charCodeAt(pos + 1) === STAR) {
					typ = 'comment'
					pos = css.indexOf('*/', pos + 2) + 1
					if (pos === 0) {
						pos = len
					} else {
						++pos
					}
				} else {
					++pos
				}
				break

I was able to implement the string jump. There was no perceivable improvement. This is what it looked like:

			// string
			case DBLQ:
			case SNGQ:
				var quote = cp0 === DBLQ ? '"' : "'"
				var escaped
				do {
					escaped = false
					pos = css.indexOf(quote, pos + 1) + 1
					if (pos === 0) {
						pos = len
						break
					}
					escapePos = pos - 1
					while (css.charCodeAt(escapePos - 1) === BS) {
						escapePos -= 1
						escaped = !escaped
					}
				} while (escaped)
				break

So far, it seems those jumps are not beating symbol checks.
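For clarity, the escaped-quote scan in that string jump can be lifted into a standalone helper; this is just the same indexOf scan and backslash-parity check repackaged for testing, not new behavior:

```javascript
var BS = 92 // character code of "\\"

// Scan from an opening quote to the index just past its closing quote,
// skipping any quote preceded by an odd number of backslashes (escaped).
function endOfString (css, start) {
  var quote = css[start]
  var pos = start
  var escaped
  do {
    escaped = false
    pos = css.indexOf(quote, pos + 1) + 1
    if (pos === 0) return css.length // unterminated string: run to EOF
    var escapePos = pos - 1
    while (css.charCodeAt(escapePos - 1) === BS) {
      escapePos -= 1
      escaped = !escaped
    }
  } while (escaped)
  return pos
}

endOfString('"a\\"b"x', 0) // 6: the inner quote is escaped, so the string ends at index 5
```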

After looking at the next two jumps, I think I see where some of the performance differences are coming from. However, I’m not sure I should implement those jumps.

I did not want to implement the parentheses jumps because they skip tokenizing the contents between the parentheses. This is probably one of the reasons the current tokenizer is picking up 49,548 tokens from bootstrap.css, while this experimental tokenizer is picking up 60,458 tokens.

Put another way; the experimental tokenizer is 1.4 times slower but it is also picking up 1.2 times more tokens, and it is accurately handling the escapes.


Up Next: I think I have time tonight to wire in this new tokenizer to the current object model.

EDIT: I have added a very crude implementation of a PostCSS-like Object Model. The code is pushed to the same repo.

Collecting PostCSS Parser Benchmarks...

PostCSS Parser 7.0.27:       1204 nodes in 14 ms
PostCSS Experimental Parser: 13680 nodes in 20 ms (1.4 times slower)
  • On the down side — the experimental parser is 1.4 times slower than the current parser, same as the tokenizer comparison. 😬
  • On the up side — the current parser generated about 100 tokens/ms, while the experimental parser generated about 684 tokens/ms. 😅

@ai
Member

ai commented May 8, 2020

I did not want to implement the parentheses jumps because they skip tokenizing the contents between the parentheses. This is probably one of the reasons the current tokenizer is picking up 49,548 tokens from bootstrap.css, while this experimental tokenizer is picking up 60,458 tokens.

Yeap, we need to count PostCSS core + PostCSS Value parser time.

If the new tokenizer prepares tokens for value/selector parsing, a small slowdown there will bring a big speedup in value/selector parsing.

the current parser generated about 100 tokens/ms, while the experimental parser generated about 684 tokens/ms.

I like this metric :D

@jonathantneal
Member Author

jonathantneal commented May 13, 2020

I haven’t had much time, but I have added another test that runs PostCSS with Selector and Value parsing. The functionality looks like this, which I hope we’ll find generous:

parsed.root.walk(node => {
  switch (node.type) {
    case 'decl':
      postcssValuesParser(node.value)
      break
    case 'rule':
      postcssSelectorParser.processSync(node.selector)
      break
  }
})

In reality, I think most plugins would run those parsers in separate walks, and most plugin packs would run them multiple times. Still, here are the updated parse times:

Collecting PostCSS Parser Benchmarks...

PostCSS Parser 7.0.29:              1204 nodes in 14 ms
PostCSS Experimental Parser:       13686 nodes in 17 ms (1.2 times slower)
PostCSS + Selector + Value Parser:  1204 nodes in 78 ms (5.5 times slower)

https://github.com/csstools/tokenizer#collecting-postcss-parser-benchmarks

@jonathantneal
Member Author

jonathantneal commented May 15, 2020

Andrey, please let me know if I’m pestering with these updates or if I can make them more helpful.

While analyzing the "slower" parts of the tokenizer, it seems like eagerly checking the character ahead improves overall performance. I have rewritten the tokenizer to take advantage of this.

I have also added 2 fields to a token; they are the opening and closing distances between the meaningful value of a token and its delimiters. Although “delimiter” is a poor term, this refers to the split between things like the @ & media in a @media At-Identifier token, the 2 & em in a 2em Number token, or the " & hello & " in a "hello" String token.
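A sketch of how those two distance fields could be consumed; the parameter names `lead` and `tail` are my invention for illustration, based only on the description above:

```javascript
// Recover the "meaningful" part of a token's value from its delimiter
// distances: `lead` opening characters and `tail` closing characters.
// (Hypothetical parameter names; based on the description in this comment.)
function innerValue (value, lead, tail) {
  return value.slice(lead, value.length - tail)
}

innerValue('@media', 1, 0)  // 'media' (strips the "@")
innerValue('2em', 0, 2)     // '2' (strips the unit)
innerValue('"hello"', 1, 1) // 'hello' (strips both quotes)
```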

Anyway, I think you’ll really like these results!

Compressing PostCSS Tokenizer...

PostCSS Tokenizer Development:       1910 B
PostCSS Tokenizer Development (min):  638 B
PostCSS Tokenizer Development (web):  639 B

Collecting PostCSS Tokenizer Benchmarks...

PostCSS Tokenizer Development:       58721 tokens in 8 ms (1.0 times faster)
PostCSS Tokenizer Development (min): 58721 tokens in 8 ms (1.0 times faster)
PostCSS Tokenizer 7.0.30:            49548 tokens in 8 ms


Compressing PostCSS Parser...

PostCSS Parser Development:       1369 B
PostCSS Parser Development (min):  836 B
PostCSS Parser Development (web):  805 B

Collecting PostCSS Parser Benchmarks...

PostCSS Experimental Parser:       56024 nodes in 10 ms (1.6 times faster)
PostCSS Parser 7.0.30:              6240 nodes in 15 ms
PostCSS + Selector + Value Parser: 28491 nodes in 86 ms (5.5 times slower)

— From https://github.com/csstools/tokenizer#collecting-postcss-parser-benchmarks

@ai
Member

ai commented May 15, 2020

Andrey, please let me know if I’m pestering with these updates or if I can make them more helpful.

A new performance breakthrough is awesome 😍.

Let’s change tokenizer in 8.0.

Is it possible to use it in the safe parser or SCSS parser without changes? I forked the current one and can fork the new one too if it turns out to be impossible to customise.

@alexander-akait

@ai what is the status of this feature for version 8? Postponed?

@ai
Member

ai commented Aug 26, 2020

@evilebottnawi yeap, we decided to do it after 8.0, since the new parser would make the 8.0 release too big and force us to wait too long

@alexander-akait

Sounds good 👍

@romainmenke
Contributor

Took me a while to analyse this issue because initially I interpreted it as “PostCSS has a bug with escaped symbols”.

But this is not a bug in PostCSS. As far as I can tell PostCSS produces the correct AST with or without escaped symbols.

It produces a few words too many at the tokenizer level, but this goes away in the parser.

The original issue was that a downstream package, which uses the same tokenizer, did not produce the correct AST. Changing the tokenizer here would produce different output there and might have fixed some issues.

From a different perspective this sounds very risky to me.
A fix for one issue might be a number of new bugs for other cases.

That the downstream package uses PostCSS internals also caused other issues, and most plugins no longer use that dependency because of it.


I think that this has a lot of overlap with #1145 (comment)
