
Native selector and value parsing in PostCSS #1145

Open
jonathantneal opened this issue May 1, 2018 · 35 comments

@jonathantneal (Member)

jonathantneal commented May 1, 2018

I would like to make an ambitious proposal. In PostCSS 7, could we parse selectors and values, or provide that parsing functionality out-of-the-box? If so, how could I help?

At the lowest effort, could we integrate postcss-selector-parser and postcss-values-parser?

At a higher effort, could we integrate the tokenizers? And have one tokenization to rule them all?

@ai (Member)

ai commented May 1, 2018

In PostCSS 8 ;).

PostCSS 7 will be a quick release that drops Node.js 4 support.

The only thing blocking PostCSS 7 is the cssnano 4 release by @evilebottnawi

@ai (Member)

ai commented May 1, 2018

Technically we don’t need a major release for it. We can release it in 7.1.

I suggest starting with selectors, and adding a values parser only in 7.2.

@ai (Member)

ai commented May 1, 2018

How you can help:

  • I need a list of AST nodes for selectors.
  • I need a token list to parse those AST nodes.
  • I need ideas for an AST API to change these nodes. You have a lot of experience with plugins. What API would be compact and easy?

@chriseppstein (Contributor)

The postcss-selector-parser api is really weird when it comes to handling raws (escapes, comments, and intervening whitespace). Sometimes comments are their own node, sometimes comments are part of raws, etc. I have a design in my head that is much better and more uniform that I would like to build out that I think should be the foundation of a first-class postcss api for selectors.

There's a lossy parse that trims whitespace but preserves comments, but it can only be done at parse time (I think there should be methods to normalize/minify whitespace, prune comments, etc. instead, or an option during stringification). Nodes didn't maintain internal consistency across mutation, and there's no sourcemap support to speak of.

I'd like to write up a design proposal soon.

@ai (Member)

ai commented May 1, 2018

@chriseppstein thanks, this is very useful feedback.

Sometimes comments are their own node, sometimes comments are part of raws, etc.

Let’s make this a rule for the new selector/value API: comments must be separate nodes.

But whitespace we must put inside nodes, to be compatible with the rest of the PostCSS API.

Nodes didn't maintain internal consistency across mutation

Can you show an example?

There's a lossy parse that trims whitespace but preserves comments but it can only be done at parse time
there's no sourcemap support to speak of.

Yeap, the new parser will continue PostCSS best practices: keep all bytes of the original input, and support source maps.

@chriseppstein (Contributor)

chriseppstein commented May 1, 2018

Comments must be a separated node.

PostCSS also does not follow this rule. I think it's very confusing. There are some comments in "inconvenient locations" that will not be iterated when walkComments() is invoked. Because of the complexity of selectors, the number of places that are "inconvenient" increases.

But whitespaces we must put inside nodes to be compatible with other PostCSS API.

Whitespace is tricky; I think the proper solution is to store whitespace and comments as nodes and provide convenient setters/getters to manipulate them.

Can you show an example?

In 3.0, when I took over, the value stored in raws was not kept in sync if a value was changed, and the raw value would override the new value with the original raw value.

https://github.com/postcss/postcss-selector-parser/blob/v3.0.0/src/selectors/attribute.js#L11-L34

I've fixed this in a lot of places now, but a better design would make it easier to manage.

Speaking of raws: postcss itself does not handle escapes, and this makes working with it unpredictable in edge cases. For instance:

#bg {
    background-color: #efefef;
    back\ground-color: red;
    back\67 round-color: blue;
}

The browser will happily treat all three of these as valid declarations of background-color, but the postcss API user will see three distinct properties. I get that this is not normal code, but this is the kind of abnormality that the parser API should insulate users from.

This is less of an edge case in selectors, though. Depending on the file encoding, the escape sequence for ❤️ and the bare character in UTF-8 should be treated the same when it comes to things like selector matching. In the selector parser I now store every ident that contains an escape in raws and store the unescaped value in the normal value that a user will read. The raws value is used to stringify back to the original input if left untouched.
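To make the unescaping idea concrete, here is a minimal, hypothetical helper (not PostCSS code; `unescapeCss` is an illustrative name) that resolves CSS escape sequences: a backslash followed by up to six hex digits is a code-point escape, one trailing whitespace character is consumed with it, and a backslash before any other character escapes that character literally.

```javascript
// Hypothetical sketch, not a PostCSS API: resolve CSS escape sequences.
// A `\` starts an escape; up to six hex digits follow, optionally
// terminated by a single whitespace character that belongs to the escape.
function unescapeCss(ident) {
  return ident.replace(
    /\\([0-9a-fA-F]{1,6})[ \t\n\f\r]?|\\(.)/g,
    (match, hex, char) =>
      hex ? String.fromCodePoint(parseInt(hex, 16)) : char
  );
}

// All three property names from the example above normalize to one string:
console.log(unescapeCss('background-color'));     // "background-color"
console.log(unescapeCss('back\\ground-color'));   // "background-color"
console.log(unescapeCss('back\\67 round-color')); // "background-color"
```

With a helper like this, the three declarations in the snippet above would all be seen as the same property, which is the behavior browsers implement.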

Keep all bytes of origin input. Support source map. 👍

@ai (Member)

ai commented May 1, 2018

PostCSS also does not follow this rule.

Yeap. It is impossible if you don’t have value and selector parsers. But now we will have them ;)

I think the proper solution is to store whitespace and comments as nodes and provide convenient setters/getters to manipulate them.

Could you write some pseudo-code to show what API you are thinking of?

The browser will happily treat all three of these as valid declarations of the background-color. But the postcss API user will see three distinct properties.

We can fix it in the tokenizer in a patch release. Please open an issue with some details on how it should be parsed per the spec. I will find a person who will fix the tokenizer.

@chriseppstein (Contributor)

chriseppstein commented May 2, 2018

Please open an issue with some details how it should be parsed by the spec

I'm not sure exactly how these bugs manifest, but I just verified that @m\65 dia is not parsed correctly. Chrome will happily parse that as @media and apply the styles accordingly.

This is the best, most accessible writing on the subject: https://mathiasbynens.be/notes/css-escapes

Here's the basic escape sequence parser I wrote for the selector tokenizer: https://github.com/postcss/postcss-selector-parser/blob/master/src/tokenize.js#L67-L94

It doesn't convert the escape sequence to characters, but it prevents word tokens from being prematurely ended (a space can be used to end the escape sequence if it has fewer than 6 hex digits 🙄)

The parser then does extra work later to check whether a token contains an escape and then unescapes it. A better implementation would be for the tokenizer to return both a raw value and a value with all escapes resolved to Unicode, but that change was too complex for the time I had available.
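The "prevent word tokens from being prematurely ended" behavior can be sketched like this (hypothetical code, not the linked tokenize.js; `endOfEscape` is an illustrative name). Given the index of a backslash, it returns the index just past the escape sequence so a word-consuming loop can skip over it:

```javascript
// Sketch: compute where an escape sequence ends, so a tokenizer's
// word-consuming loop does not end the token at the backslash.
const HEX = /[0-9a-fA-F]/;
const WS = /[ \t\n\f\r]/;

function endOfEscape(css, backslashIndex) {
  let i = backslashIndex + 1;
  if (i >= css.length) return i;       // lone trailing backslash
  if (!HEX.test(css[i])) return i + 1; // `\X`: escapes one character
  // `\` plus up to six hex digits...
  let digits = 0;
  while (i < css.length && digits < 6 && HEX.test(css[i])) { i++; digits++; }
  // ...optionally terminated by a single whitespace character,
  // which is part of the escape, not a token boundary.
  if (i < css.length && WS.test(css[i])) i++;
  return i;
}

const selector = '.fo\\6F bar'; // `.foobar` with `o` written as `\6F `
console.log(endOfEscape(selector, 3)); // 7 — the space belongs to the escape
```

The key point is the last step: the whitespace after a short hex escape is consumed with the escape, so the following `bar` still belongs to the same word token.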

Could you write some pseudo-code to show what API your are thinking of?

Roughly, I'd like to parse selectors more correctly into the structures specified by CSS and keep values as nodes
https://gist.github.com/chriseppstein/53298cef0c8cf62d7a275a8bdf7516cb

This keeps values associated as nodes with source locations down to a much more granular level (as is typical with an AST). I think this makes mutating nodes more predictable and keeps a very sane model for how to handle the raw value and ignorable whitespace before and after nodes in a variety of locations within the file.

Whitespace and comment nodes can be interspersed with each other in the before and after of any node, but would have an affinity for being part of a node's before. Any string value would have a raw where the value with escape sequences is stored.

For simplified access to value nodes as strings, we can use a proxy that boxes/unboxes them and manages the escaping, or have properties that correspond to string or node accessors. With nodes for values, a lot of the APIs in the selector parser just go away.
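One way to read the "raw where the value with escape sequences would be stored" idea is a boxed node whose raw spelling is dropped on mutation, which also addresses the value/raws desync described earlier in the thread. A hypothetical sketch (`Ident` is illustrative, not a proposed API):

```javascript
// Hypothetical boxed identifier node: plugins read and write the
// unescaped value, while the raw (escaped) spelling is kept for
// stringification and invalidated when the value is mutated.
class Ident {
  constructor(value, raw) {
    this._value = value; // unescaped, e.g. "background-color"
    this._raw = raw;     // original spelling, e.g. "back\\67 round-color"
  }
  get value() { return this._value; }
  set value(newValue) {
    this._value = newValue;
    this._raw = undefined; // raw no longer matches; drop it
  }
  toString() { return this._raw !== undefined ? this._raw : this._value; }
}

const id = new Ident('background-color', 'back\\67 round-color');
console.log(String(id)); // original bytes preserved while untouched
id.value = 'color';
console.log(String(id)); // "color" — raw was invalidated by the mutation
```

Untouched nodes stringify back to the original input byte-for-byte; mutated nodes fall back to the clean value.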

@ai (Member)

ai commented May 2, 2018

I think we should focus API on how people will use it.

This is why I am not sure that exposing whitespace as tokens is a great idea. Users can put whitespace in any place in CSS. As a result, PostCSS plugins will need more checks.

But maybe I am wrong and we can provide some syntax sugar on top of it.

How do you imagine using this AST?

@chriseppstein (Contributor)

As result PostCSS plugins will need more checks.

The AST I wrote up didn't have any whitespace nodes except in the before and after of a node,
so the API would not expose a whitespace node unless you go looking for them; a walker would not iterate over them by default. I've separated nodes into types; by default, walking would only visit the nodes with a kind of "selector", which is how walking occurs right now. But you could imagine a whitespace walker that normalizes whitespace, or a comment walker that strips comments except those beginning with !. I suspect you are not very familiar with the current selector API; it is very cumbersome. I've tried to make it more usable within the current API surface, but it's not easy to work with. The API for manipulating attribute value quotations is an example of this complexity. It's easier to assign the attribute selector's value an object that represents a string or an identifier.

Also, preserving values as nodes should make the APIs more sourcemap-friendly.

Anyway, that's what I've had in mind. If you have something else, write it up and we can compare some use cases.

@ai (Member)

ai commented May 2, 2018

Are you sure that all cases could be solved by walkers? 😏

For instance, in Autoprefixer I need to find a function and replace it (yeap, that is easy for a walker). But to replace the function I need to change its arguments.

@chriseppstein (Contributor)

@ai I'm not sure I understand what you're suggesting is hard about that use case. Having the arguments as nodes makes this easier for a developer and makes sourcemap tracking more robust.

For the case where the arguments are left unchanged a value node for a function would be trivial:

rule.walkFunctionCalls((fnCall) => {
  if (fnCall.name.value === "oldname") {
    fnCall.name = postcss.ident("newname");
    // or the accessor can be smart in such cases and allow string assignment, creating a new ident.
  }
});

For the case where arguments are manipulated, nodes also seem to make the most sense. Maybe you're expecting that the whitespace and comment nodes would be part of the arguments to the function?

rule.walkFunctionCalls((fnCall) => {
  if (fnCall.name.value === "oldname") {
    fnCall.name = postcss.ident("newname");
    for (let arg of fnCall.arguments) {
       // iterate over only the arguments; whitespace and comment nodes
       // would only be found in those nodes' `before` and `after` properties.
    }
  }
});

Does autoprefixer handle the replacement if there's an escape sequence in the function name? I suspect it doesn't. By making this a node, the value can easily represent the unescaped name but also store the escape sequences in a raw attribute. For values especially, this is important because there is no specific attribute to store raws against when part of a value is manipulated.

Anyway, again, I don't see why you think this makes the code more complicated. Please write some code to show me what you're worried about.

@strarsis (Contributor)

+1!

.test {
  background: /*test*/ red;
}

This is valid CSS. A PostCSS plugin should be able to retrieve the comments in this declaration value for further use, e.g. annotations for enabling or controlling a PostCSS plugin. Currently the Declaration node's raws have to be used.

@alexander-akait

@ai Can we start a discussion and prioritize it? Or does it make no sense, and is it better to wait for another parser? We are having incredible perf problems in cssnano now: in almost every plugin we run postcss-selector-parser and postcss-value-parser (we can of course try to cache, but it's incredibly complicated code and logic),

  • we have the same problems in stylelint,
  • prettier can't format CSS well either (postcss-values-parser violates the spec and outputs a wrong AST in many places; same for postcss-media-query-parser),
  • we have bloated dependencies, because one plugin uses postcss-value-parser, another postcss-values-parser, and others custom parsers,
  • postcss-value-parser/postcss-values-parser are not maintained; as a contributor I don't have enough time, and unfortunately there are no developers who would like to help. If I get sick tomorrow and a security problem appears, we are unlikely to fix it fast enough, which would create huge problems for the entire ecosystem.

To be honest, I can continue this list for a long time. Let's finish this already and make life easier for everyone who uses postcss. Take this as a cry for help from OSS.

@ai (Member)

ai commented May 26, 2021

I am right now working on my new project. I will not start a new feature in PostCSS for at least a few months.

@alexander-akait

I am right now working on my new project. I will not start a new feature in PostCSS for at least a few months.

Like for three or four years (#754). This is not my first attempt, and I deeply do not understand why for so many years we have not been able to integrate the current solutions; we would have reworked and improved them long ago. You always say how important it is that there is no stagnation in the project, but here it is.

I deeply understand what OSS is, and that everyone is free to decide what to do. So let's start searching for and migrating to new parsers. Also, since the creator has no desire or interest to work on this problem, I have no reason to keep working on postcss-selector-parser and postcss-value-parser, so only critical/security fixes, followed by deprecation as unsupported/unmaintained (unless of course someone else finds the time to do it, so contributors welcome).

For other developers: https://github.com/rome/tools/tree/main/internal/css-parser looks very good. I will ask them about experimental export of the parser/transformers so we can try to migrate.

@soluml

soluml commented Dec 7, 2021

@alexander-akait - Looks like the above Rome links are broken due to their shift to Rust. I was looking through their source code earlier and couldn't find their Rust implementation of a CSS parser, but if it exists I'd love to take a look!

On the topic of Rust-compatible (or a language with similar performance characteristics) CSS AST tools: I haven't been able to find one that exists. It seems they're all currently written in JS/TS. With the rise of performance-centric JS tooling like esbuild and SWC, I'd personally love to see a CSS AST tool with similar performance goals, and I would definitely help out as able.

@scripthunter7

I see the issue has been open for 4.5 years. Will this feature ever be implemented natively? :)

@ai (Member)

ai commented Dec 15, 2022

At some point, yes.

I am really afraid of the 8.x refactoring and do not want to add big features for now.

@scripthunter7

And do you already have an idea of what kind of AST/API structure is needed for this? For example, to ensure that CSS does not break during serialization, that transformations on the AST work correctly, etc.

I saw similar questions here in the thread, but the information is several years old, so I asked about it again.

@ai (Member)

ai commented Dec 15, 2022

And do you already have an idea of what kind of AST / API structure is needed for this?

That is the hardest question for me. Once we introduce some AST, we will not be able to change it, since major versions are really hard for PostCSS.

@alexander-akait

Not sure, but if you want to implement a low-level syntax AST, you can use https://www.w3.org/TR/css-syntax-3/#component-value-diagram (canonicalization, i.e. grammar parsing, can be implemented as helper functions, because grammar is the hardest and worst-performing part). Having component values in selectors/at-rules/etc. allows working with such structures without reparsing strings many times (plugins will be faster), and you can manipulate them to solve simple things (again, we avoid a lot of serialization: better perf). The list of tokens is already in the spec: https://www.w3.org/TR/css-syntax-3/#token-diagrams. So I don't think there are many problems with implementing a low-level syntax AST.

@romainmenke (Contributor)

I've been thinking about this issue and I am not convinced that PostCSS should be adding ASTs for selectors, values, and at-rule params.

A typed AST for media query params alone is massive: https://github.com/csstools/postcss-plugins/tree/postcss-preset-env--v8/packages/media-query-list-parser/src/nodes

And this is still only a low level API.
It doesn't have all the convenience of the PostCSS API.

This makes me worried that it will add way too much maintenance overhead to PostCSS.


For plugins under CSSTools we are now using this setup:

  • a shared low level and highly accurate CSS tokenizer
  • a shared toolbox of parsing algorithms
  • dedicated parser packages for specific contexts

The tokenizer is a naive implementation of the CSS specification with only one alteration:
whitespace and comments are not discarded.

It is not fast, but also not so slow that it becomes an issue.
The PostCSS tokenizer is 2 times faster, but ours spits out more tokens.

some tests for the tokenizer as an example

The parsing algorithms do this:

  • split on <whitespace>
  • split on , (comma)
  • correctly group tokens in blocks ((), [], {})
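As an illustration (not CSSTools code), the "split on comma" algorithm over a token stream could look like this, using a simplified `[type, representation]` token shape; `splitOnComma` is a hypothetical name. Commas only split at the top level, because tokens inside `()`, `[]`, `{}` are first grouped by block depth:

```javascript
// Sketch: split a token list on top-level commas, keeping tokens inside
// (), [], {} grouped. A function-token opens a block too, since its
// representation includes the opening paren.
const OPEN = new Set(['(-token', '[-token', '{-token', 'function-token']);
const CLOSE = new Set([')-token', ']-token', '}-token']);

function splitOnComma(tokens) {
  const groups = [[]];
  let depth = 0;
  for (const token of tokens) {
    if (OPEN.has(token[0])) depth++;
    else if (CLOSE.has(token[0])) depth--;
    if (token[0] === 'comma-token' && depth === 0) {
      groups.push([]); // a top-level comma starts a new group
    } else {
      groups[groups.length - 1].push(token);
    }
  }
  return groups;
}

// `rgb(0, 0, 0), red` splits into two groups; the commas inside rgb() do not split.
const tokens = [
  ['function-token', 'rgb('], ['number-token', '0'], ['comma-token', ','],
  ['number-token', '0'], ['comma-token', ','], ['number-token', '0'],
  [')-token', ')'], ['comma-token', ','], ['ident-token', 'red'],
];
console.log(splitOnComma(tokens).length); // 2
```

Splitting on whitespace works the same way, with `whitespace-token` in place of `comma-token`.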

Specific parser packages can be:

  • @media params
  • @scope params
  • @layer params
  • values (generic)
  • color function values (specific)
  • selectors

Having such specialized parser packages means we can have a better AST.
This in turn makes it easier to correctly transform CSS and always produce valid CSS.


Tokenizing everything again and again in each individual plugin is, however, a non-zero overhead.

This would be better:

  • PostCSS has a more granular tokenizer
  • PostCSS exposes the tokens
  • 3rd party parser packages do not need to tokenize

PostCSS would not parse values, selectors, or at-rule params, but would only expose tokens next to string values:

// .foo { color: rgb(0, 0, 0) }

decl.value; // rgb(0, 0, 0)
decl.tokens; // [['token-function', ...],[...], ...]

(This is very similar to what @alexander-akait is proposing, but more defined as to what is done by PostCSS.)

The most complicated part is defining the API design principles that should be followed by those 3rd party parser packages.

I am comfortable with more low level code and manipulating CSS at a token level, but I would prefer it if creating PostCSS plugins is more accessible.

@ai (Member)

ai commented Dec 23, 2022

PostCSS has a more granular tokenizer
PostCSS exposes the tokens
3rd party parser packages do not need to tokenize

I like this plan 😍

There are only two downsides:

  1. More memory consumption, since we do not store all tokens right now (not a huge problem)
  2. Slower PostCSS core parser (but we will look at final numbers)

What tokens do you have in your parsers that we need to add to the tokenizer?

@romainmenke (Contributor)

The token interface in the CSSTools tokenizer:

https://github.com/csstools/postcss-plugins/blob/postcss-preset-env--v8/packages/css-tokenizer/src/interfaces/token.ts#L123-L134

export type Token<T extends TokenType, U> = [
	/** The type of token */
	T,
	/** The token representation */
	string,
	/** Start position of representation */
	number,
	/** End position of representation */
	number,
	/** Extra data */
	U,
]

Extra data examples:

  • { value: number, unit: string, type: NumberType }
  • { value: string }
  • { value: string, type: HashType }

Strings, idents, numbers, ... can contain escaped characters, and we preserve the source in the token representation (token[1]) but expose the unescaped values in the extra data (token[4].value).
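For illustration, here are concrete tokens in that `[type, representation, start, end, extraData]` shape. These are hand-written examples, not output copied from the CSSTools tokenizer:

```javascript
// Illustrative token tuples following the interface above.
const dimension = ['dimension-token', '10px', 0, 3, { value: 10, unit: 'px', type: 'integer' }];
const hash = ['hash-token', '#bg', 0, 2, { value: 'bg', type: 'id' }];
// The representation keeps the escape; the extra data holds the unescaped value:
const escapedIdent = ['ident-token', 'back\\67 round-color', 0, 18, { value: 'background-color' }];

console.log(escapedIdent[1]); // raw spelling, escape included
console.log(escapedIdent[4].value); // "background-color"
```

Untouched tokens stringify from `token[1]`, so the original input survives byte-for-byte, while plugins compare and match against `token[4].value`.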


CSSTools tokenizer:

(-token
)-token
[-token
]-token
{-token
}-token
at-keyword-token
bad-string-token
bad-url-token
CDC-token
CDO-token
colon-token
comma-token
comment
delim-token
dimension-token - integer
dimension-token - number
EOF-token
function-token
hash-token - id
hash-token - unrestricted
ident-token
number-token - integer
number-token - number
percentage-token
semicolon-token
string-token
url-token
whitespace-token

PostCSS tokenizer:

;
:
(
)
[
]
{
}
at-word
brackets
comment
space
string
word

@ai (Member)

ai commented Dec 24, 2022

In the case of decl.value = newValue we will mark decl.valueTokens as dirty and re-tokenize decl.value on the next decl.valueTokens read.
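A minimal sketch of that dirty-flag scheme (hypothetical; `tokenize` here is a whitespace-split placeholder, not the real PostCSS tokenizer):

```javascript
// Sketch: assigning decl.value invalidates the cached tokens;
// reading decl.valueTokens re-tokenizes lazily.
const tokenize = (value) => value.split(/\s+/); // placeholder tokenizer

class Declaration {
  constructor(value) {
    this._value = value;
    this._tokens = null; // null = dirty / not yet tokenized
  }
  get value() { return this._value; }
  set value(newValue) {
    this._value = newValue;
    this._tokens = null; // mark tokens as dirty
  }
  get valueTokens() {
    if (this._tokens === null) this._tokens = tokenize(this._value);
    return this._tokens;
  }
}

const decl = new Declaration('1px solid red');
console.log(decl.valueTokens); // tokenized on first read
decl.value = 'none';
console.log(decl.valueTokens); // re-tokenized after the write
```

Plugins that never read tokens pay no tokenizing cost, and repeated reads between writes reuse the cached array.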

How do those tokens work:

bad-string-token
bad-url-token
function-token
dimension-token

@romainmenke (Contributor)

romainmenke commented Dec 24, 2022

A bad url token:

https://drafts.csswg.org/css-syntax/#consume-url-token
This one is a bit more complicated and has multiple paths that lead to a bad url token.
One of them is a second (.

image: url(();

How it looks in our parser:

{
	const t = tokenizer({
		css: 'url(foo())',
	});

	assert.deepEqual(
		collectTokens(t),
		[
			['bad-url-token', 'url(foo()', 0, 8, undefined],
			[')-token', ')', 9, 9, undefined],
			['EOF-token', '', -1, -1, undefined],
		],
	);
}

I think this is problematic for plugins that want to do custom things inside url()

Update:
It is impossible to allow nested functions inside url() while adhering to the CSS tokenizing specification.

url("foo") is a function token, not a url token.
There is more room for custom things there.


A bad string token is any string token with (unescaped) newlines in it.
https://drafts.csswg.org/css-syntax/#consume-string-token

	content: "foo
bar";

A function token is an ident + ( :
https://drafts.csswg.org/css-syntax/#consume-ident-like-token

(there is no look-ahead to the closing ))

An outtake from one of our tests:

fancy(baz)

['function-token', 'fancy(', 65, 70, { value: 'fancy' }],
['ident-token', 'baz', 71, 73, { value: 'baz' }],
[')-token', ')', 74, 74, undefined],

A dimension token is a number + an ident :

https://drafts.csswg.org/css-syntax/#consume-numeric-token

@alexander-akait

Just for information: https://github.com/swc-project/swc/blob/main/crates/swc_css_parser/src/lexer/mod.rs (Rust) is fully written to the specification, with a CST (normalized and raw values; the latter can be useful for linters and 1:1 serialization). So if you have questions about the implementation, we can check there.

@romainmenke (Contributor)

I've started work on a WPT-like test suite for CSS tokenizers.
https://github.com/romainmenke/css-tokenizer-tests

This should make it easier to compare multiple tokenizers and check them for bugs.
The intention is to make it easier for everyone to verify that their tokenizer is accurate.


@ai What is the best way forward on this issue?

Is it a breaking change for PostCSS to change the tokenizer?
If it is breaking, is there a way to make it not breaking?

Since I can import the PostCSS tokenizer and use it, I assume that making any changes there is breaking.

Maybe we can leave tokenize.js as-is and introduce a separate tokenize-next-gen.js

Anyone who depends on tokenize.js can continue to use it within PostCSS 8.


Are there other ways in which it can be breaking to switch the tokenizer?
(not including bugs)

@ai (Member)

ai commented Jan 2, 2023

Is it a breaking change for PostCSS to change the tokenizer?

No (as long as we don’t expose tokens in any public API).

@romainmenke (Contributor)

TL;DR:

I don't think we should pursue this further. I think this change will hurt PostCSS and its ecosystem.


Having spent the last few weeks working on integrating a new tokenizer into PostCSS, I think I've identified a few areas where this change would be really bad.

1 It is a breaking change

Even if the tokenizer was never documented on the website, it was still exposed via package.json.

Some packages, libraries, and frameworks are using the PostCSS tokenizer directly.

We could introduce another breaking change into the PostCSS ecosystem, but everything is still settling after the last one.

Personally I would prefer to spend my time helping the community do more within the current version than spending time migrating to a PostCSS 9.

2 It will make PostCSS slower

The new tokenizer might have a few areas left that can be optimized to make it faster, but it will never be faster than the current tokenizer.

It is simply doing a lot more work and it needs to do this work.
It is currently about as fast as css-tree, while exposing a lot more useful data and being more accurate. It is 20 times faster than the reference tokenizer by Tab while being equally accurate.

That selector, value, and at-rule parsers would be able to skip tokenizing will also not make PostCSS faster overall.

They will only be able to re-use the tokens from the initial parsing phase.
Any mutation of values will trigger the tokenizer to run again on the changed values.

At best we would skip tokenizing some values 1 time.

3 It will make PostCSS more complicated to maintain and use

As you can see in the pull request, this is not a trivial change: #1812

But it will also make writing syntaxes and plugins more complicated.
Most plugin authors will choose to work with everything as strings because it is simpler. They will all miss out on the benefits of this change.

Syntax authors will need to do similar tokenizer and parser changes for the ecosystem as a whole to still make sense.

Without this change everything is just a string. Strings are messy, and a bad regexp replace can introduce bugs. But at the same time strings are also simple to use and manipulate.

The simplicity of using PostCSS to manipulate CSS today is what makes PostCSS an awesome tool.

4 We do not need to change PostCSS to have a better tokenizer and selector, value parsers

Why disrupt the ecosystem if we can fix this modularly?
We can improve the current selector and value parsers, or create new versions.

The tokenizer we wrote for media query parsing can be used by anyone that wants it : https://github.com/csstools/postcss-plugins/tree/postcss-preset-env--v8/packages/css-tokenizer#readme

We also maintain a collection of parsing algorithms for this tokenizer.

Together they form a solid base on which to build specific parsers.

Closing

If there were even a small chance that this change would roll out smoothly and have an overall positive impact on the PostCSS ecosystem, I would happily spend a large chunk of my time making it happen.

If anyone has insights or further concerns that would be most helpful :)

@ai (Member)

ai commented Jan 20, 2023

Yes, I agree that the changes became too big not to hurt the ecosystem in some way.

@ai (Member)

ai commented Jan 20, 2023

It was an amazing experiment, thanks @romainmenke

@romainmenke (Contributor)

Thank you for the feedback 🙇

@ghost

ghost commented Jul 14, 2023

The values part is possible, but at the cost of thousands of lines of code to handle each kind of value, because there are so many possible values in CSS; having a class named CSSPropertyFlex or CSSPropertyMargin won't be good.
You can use a list to separate simple values.

The selector part is possible: there should be a Selector class which does not extend Node and lives inside rule.js.
It should provide methods to work with selectors and store the selector as an array after parsing it.

EDIT: It could be made more complex, and the Selector's selectorList array would hold objects which define their type.
Example:

interface SelectorType {
  type: "Selector" | "Raws";
  content: string;
}
