Native selector and value parsing in PostCSS #1145
In PostCSS 8 ;). PostCSS 7 will be a quick release that drops Node.js 4. The only thing blocking PostCSS 7 is the cssnano 4 release by @evilebottnawi
Technically we don’t need a major release for it. We can release it in 7.1. I suggest starting with selectors and adding the values parser only in 7.2.
How you can help:
The postcss-selector-parser API is really weird when it comes to handling raws (escapes, comments, and intervening whitespace). Sometimes comments are their own node, sometimes comments are part of raws, etc. I have a design in my head that is much better and more uniform, which I would like to build out, and which I think should be the foundation of a first-class PostCSS API for selectors. There's a lossy parse that trims whitespace but preserves comments, but it can only be done at parse time (I think there should instead be methods to normalize/minify whitespace, prune comments, etc., or an option during stringification). Nodes didn't maintain internal consistency across mutation, and there's no sourcemap support to speak of. I'd like to write up a design proposal soon.
@chriseppstein Thanks, this is very useful feedback.
Let’s set a rule for the new selector/value API: comments must be separate nodes, but whitespace must go inside nodes to stay compatible with the rest of the PostCSS API.
Can you show an example?
Yeap, the new parser will continue PostCSS best practices: keep every byte of the original input, and support source maps.
PostCSS also does not follow this rule, and I think it's very confusing. There are some comments in "inconvenient locations" that will not be iterated when walking.
Whitespace is tricky, I think the proper solution is to store whitespace and comments as nodes and provide convenient setters/getters to manipulate them.
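A minimal sketch of what such convenience accessors might look like. The `Node` class and `spaceBefore` property here are hypothetical, not an existing PostCSS API:

```javascript
// Hypothetical sketch: whitespace is stored in raws, but a getter/setter
// pair lets plugins manipulate it without dealing with raw bookkeeping.
class Node {
  constructor() {
    this.raws = { before: '', after: '' };
  }
  get spaceBefore() {
    return this.raws.before;
  }
  set spaceBefore(ws) {
    this.raws.before = ws;
  }
}

const n = new Node();
n.spaceBefore = '\n  ';
console.log(n.raws.before.length); // 3
```

The point of the design is that stringification only ever reads raws, so a plugin that never touches whitespace cannot accidentally reformat the source.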
In 3.0, when I took over, the value stored in attribute selectors was handled inconsistently (https://github.com/postcss/postcss-selector-parser/blob/v3.0.0/src/selectors/attribute.js#L11-L34). I've fixed this in a lot of places now, but a better design would make it easier to manage.

Speaking of raws: postcss itself does not handle escapes, and this makes working with it unpredictable for edge cases. For instance:

```css
#bg {
  background-color: #efefef;
  back\ground-color: red;
  back\67 round-color: blue;
}
```

The browser will happily treat all three of these as valid declarations of the `background-color` property. This is less of an edge case in selectors, though. Depending on the file encoding, escapes for the value ❤️ and the bare character in utf-8 should be the same when it comes to doing things like selector matching. In the selector parser I now store every ident that has an escape in the node's raws.
Yeap. It is impossible if you don’t have value and selector parsers. But now we will have them ;)
Could you write some pseudo-code to show what API you are thinking of?
We can fix it in the tokenizer in a patch release. Please open an issue with some details on how it should be parsed per the spec. I will find a person to fix the tokenizer.
I'm not sure exactly how these bugs manifest, but I just verified that they do.

This is the best, most accessible writing on the subject: https://mathiasbynens.be/notes/css-escapes

Here's the basic escape sequence parser I wrote for the selector tokenizer: https://github.com/postcss/postcss-selector-parser/blob/master/src/tokenize.js#L67-L94

It doesn't convert the escape sequence to characters, but it prevents word tokens from being prematurely ended (a space can be used to end the escape sequence if it's fewer than 6 characters 🙄). The parser then does extra work later to check whether a token contains an escape and then unescapes it. A better implementation would be for the tokenizer to return both a raw value and a value after resolving all escapes to unicode, but that change was too complex for the time I had available.
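For illustration, a simplified unescaper in the spirit of that parser. This is a hypothetical helper, not the actual tokenizer code; it handles hex escapes of up to 6 digits with an optional whitespace terminator (only space, tab, and newline here) plus single-character escapes:

```javascript
// Resolve CSS escape sequences in an ident to their unicode characters,
// following https://mathiasbynens.be/notes/css-escapes (simplified).
function unescapeCssIdent(ident) {
  return ident.replace(
    /\\([0-9a-fA-F]{1,6})[ \t\n]?|\\(.)/g,
    (match, hex, char) => {
      if (hex !== undefined) {
        // Hex escape: up to 6 digits, optionally terminated by whitespace.
        return String.fromCodePoint(parseInt(hex, 16));
      }
      // Simple escape: drop the backslash, keep the character.
      return char;
    }
  );
}

console.log(unescapeCssIdent('back\\67 round-color')); // background-color
console.log(unescapeCssIdent('back\\ground-color'));   // background-color
```

Both escaped spellings resolve to the same property name, which is exactly why matching on the raw string alone is unreliable.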
Roughly, I'd like to parse selectors more correctly into the structures specified by CSS and keep values as nodes. This keeps values associated with source locations at a much more granular level (as is typical with an AST). I think this makes mutating nodes more predictable and keeps a very sane model for how to handle the raw value and ignorable whitespace before and after nodes in a variety of locations within the file. Whitespace and comment nodes can be interspersed with each other in the `before` and `after` raws. For simplified access to value nodes as strings, we can use a proxy that boxes/unboxes them and manages the escaping, or have properties that correspond to string or node accessors. With nodes for values, a lot of the APIs in the selector parser just go away.
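To make the shape concrete, here is one way such nodes might look as plain data. The field names (`raws.before`, `raws.after`, `arguments`, `source`) are illustrative only, not a committed API:

```javascript
// Hypothetical parse of the value `rgb( 0, 0, 0 /* black */ )`:
// ignorable whitespace and comments live in before/after raws, so walking
// the arguments never yields whitespace or comment nodes.
const value = {
  type: 'function',
  name: { type: 'ident', value: 'rgb' },
  arguments: [
    { type: 'number', value: '0', raws: { before: ' ', after: '' } },
    { type: 'number', value: '0', raws: { before: ' ', after: '' } },
    { type: 'number', value: '0', raws: { before: ' ', after: ' /* black */ ' } },
  ],
  source: { start: { line: 1, column: 8 }, end: { line: 1, column: 34 } },
};

const args = value.arguments.map((a) => a.value);
console.log(args.join(' ')); // 0 0 0
```

Stringifying concatenates each node's `before` raw, value, and `after` raw, so the comment and spacing survive untouched while plugins only ever see the three arguments.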
I think we should focus the API on how people will use it. This is why I am not sure that storing whitespace as tokens is a great idea. Users can put whitespace in any place in CSS; as a result, PostCSS plugins will need more checks. But maybe I am wrong and we can provide some syntax sugar on top of it. How do you imagine using this AST?
The AST I wrote up didn't have any whitespace nodes except in the `before`/`after` raws. Also, by preserving values as nodes, it should make the APIs more sourcemap friendly. Anyway, that's what I've had in mind; if you have something else, write it up and we can compare some use cases.
Are you sure that all cases can be solved by walkers? 😏 For instance, in Autoprefixer I need to find a function and replace it (yeap, that is easy with a walker). But to replace the function I also need to change its arguments.
@ai I'm not sure I understand what you're suggesting is hard about that use case. Having the arguments as nodes makes this easier as a developer and makes sourcemap tracking more robust. For the case where the arguments are left unchanged, a value node for a function would be trivial:

```js
rule.walkFunctionCalls((fnCall) => {
  if (fnCall.name.value === "oldname") {
    fnCall.name = postcss.ident("newname");
    // or the accessor can be smart in such cases and allow string
    // assignment, creating a new ident.
  }
});
```

For the case where arguments are manipulated, nodes also seem to make the most sense. Maybe you're expecting that the whitespace and comment nodes would be part of the arguments to the function?

```js
rule.walkFunctionCalls((fnCall) => {
  if (fnCall.name.value === "oldname") {
    fnCall.name = postcss.ident("newname");
    for (let arg of fnCall.arguments) {
      // iterate over only the arguments; whitespace and comment nodes
      // would only be found in those nodes' `before` and `after` properties.
    }
  }
});
```

Does autoprefixer handle the replacement if there's an escape sequence in the function name? I suspect it doesn't. By making this a node, the value can easily represent the unescaped name but also store the escape sequences in a raw form. Anyway, I don't see why you think this makes the code more complicated. Please write some code to show me what you're worried about.
+1!

```css
.test {
  background: /*test*/ red;
}
```

This is valid CSS. A PostCSS plugin should be able to retrieve the comments in this declaration value for further use, e.g. annotations for enabling or controlling a PostCSS plugin. Currently the Declaration node's raws have to be used.
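Until then, a plugin has to dig the comment out of the raw value string itself. A naive sketch of that (a hypothetical helper, not a PostCSS API; it does not handle `/* */` sequences inside quoted strings, so a real implementation should tokenize instead of using a regex):

```javascript
// Extract the text of every /* ... */ comment from a raw declaration value.
function commentsInRawValue(raw) {
  const comments = [];
  const re = /\/\*([\s\S]*?)\*\//g;
  let m;
  while ((m = re.exec(raw)) !== null) {
    comments.push(m[1].trim());
  }
  return comments;
}

console.log(commentsInRawValue('/*test*/ red').join(',')); // test
```

With first-class value nodes, this helper would be unnecessary: the comment would simply be a node (or a raw on the adjacent node) with its own source position.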
@ai Can we start a discussion and prioritize it? Or does it make no sense and is it better to wait for another parser? We are having incredible perf problems in cssnano now; in almost every plugin we re-run parsing.
To be honest, I could continue this list for a long time. Let's finish this already and make life easier for everyone who uses them.
Right now I am working on my new project and will not start a new feature in PostCSS for at least a few months.
Like for
I deeply understand that this is OSS and everyone is free to decide what to do. So let's start searching for and migrating to new parsers. Also, because the creator has no desire to work on this problem and no interest in it, I have no reason to keep working on it. For other developers: https://github.com/rome/tools/tree/main/internal/css-parser looks very good. I will ask them about an experimental export of the parser/transformers so we can try to migrate.
@alexander-akait - Looks like the above Rome links are broken due to their shift to Rust. I was looking through their source code earlier and couldn't find their Rust implementation of a css-parser, but if it exists I'd love to take a look! On the topic of Rust-compatible (or similarly performant) CSS AST tools: I haven't actually been able to find one that exists. It seems they're all currently written in JS/TS. With the rise of performance-centric JS tooling like ESBuild and SWC, I'd personally love to see a CSS AST tool with similar performance goals and would definitely help out as I'm able.
I see the issue has been open for 4.5 years. Will this feature ever be implemented natively? :) |
At some point, yes. I am still really wary after the 8.x refactoring and do not want to add big features for now.
And do you already have an idea of what kind of AST / API structure is needed for this? For example, to ensure that CSS does not break during serialization, and that transformations with AST work correctly, etc. I saw similar questions here in the thread, but the information is several years old, so I asked about it again. |
This is the hardest question for me. Once we introduce an AST, we will not be able to change it, since major releases for PostCSS are really hard.
Not sure, but if you want to implement a low-level syntax AST, you can use https://www.w3.org/TR/css-syntax-3/#component-value-diagram (canonicalization, i.e. grammar parsing, can be implemented as helper functions, because grammar is the hardest and worst-performing part). Having component values in selectors/at-rules/etc. allows working with such structures without reparsing strings many times (plugins will be faster), and you can manipulate them to solve simple things (again, we avoid a lot of serialization; better perf). The list of tokens is already in the spec: https://www.w3.org/TR/css-syntax-3/#token-diagrams. So I don't think there are many problems with implementing a low-level syntax AST.
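As a rough illustration of the component-value idea, here is a toy version of the spec's "consume a component value" step over a pre-tokenized list. The `[type, value]` token shape is invented for the example and does not match any real tokenizer's output:

```javascript
// Tokens that open a block (function-token, (-token) become one component
// value containing their contents, per css-syntax-3 "consume a component
// value" (heavily simplified: no {} or [] blocks, no error recovery).
function consumeComponentValues(tokens) {
  const out = [];
  while (tokens.length) {
    const [type, value] = tokens.shift();
    if (type === 'function-token' || type === '(-token') {
      out.push({ type: 'block', name: value, values: consumeComponentValues(tokens) });
    } else if (type === ')-token') {
      return out; // close the current block
    } else {
      out.push({ type: 'token', value });
    }
  }
  return out;
}

const tokens = [
  ['function-token', 'rgb('],
  ['number-token', '0'], ['comma-token', ','],
  ['number-token', '0'], ['comma-token', ','],
  ['number-token', '0'],
  [')-token', ')'],
];
const tree = consumeComponentValues(tokens);
console.log(tree[0].name, tree[0].values.length); // rgb( 5
```

The payoff described above: a plugin gets the `rgb(` block with its five inner component values directly, without re-tokenizing the value string.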
I've been thinking about this issue and I am not convinced that PostCSS should be adding ASTs for selectors, values, and at-rule params. A typed AST only for media query params is massive: https://github.com/csstools/postcss-plugins/tree/postcss-preset-env--v8/packages/media-query-list-parser/src/nodes

And this is still only a low-level API. This makes me worried that it will add way too much maintenance overhead to PostCSS. For plugins under CSSTools we are now using this setup:

The tokenizer is a naive implementation of the CSS specification with only one alteration. It is not fast, but also not so slow that it becomes an issue (some tests for the tokenizer serve as an example). The parsing algorithms do this:

Specific parser packages can be:

Having such specialized parser packages means we can have a better AST. Tokenizing everything again and again in each individual plugin is, however, a non-zero overhead. This would be better:
PostCSS would not parse values, selectors, or at-rule params, but only expose tokens next to string values:

```js
// .foo { color: rgb(0, 0, 0) }
decl.value;  // rgb(0, 0, 0)
decl.tokens; // [['token-function', ...], [...], ...]
```

(This is very similar to what @alexander-akait is proposing, but more clearly defines what is done by PostCSS.) The most complicated part is defining the API design principles that should be followed by those 3rd-party parser packages. I am comfortable with more low-level code and manipulating CSS at a token level, but I would prefer it if creating PostCSS plugins remained more accessible.
I like this plan 😍 There are only two downsides:
What tokens do you have in your parsers that we need to add to the tokenizer?
The token interface in the CSSTools tokenizer:

```ts
export type Token<T extends TokenType, U> = [
  /** The type of token */
  T,
  /** The token representation */
  string,
  /** Start position of representation */
  number,
  /** End position of representation */
  number,
  /** Extra data */
  U,
]
```

Extra data examples:
Strings, idents, numbers, ... can contain escaped characters, and we preserve the source in the token representation (the parsed value goes in the extra data).

CSSTools tokenizer:

PostCSS tokenizer:
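For example, a token for an ident containing an escape could carry both the untouched source text and the parsed value. The `{ value: ... }` shape used for the extra-data slot below is an assumption for illustration; the actual CSSTools shape may differ:

```javascript
// A 5-tuple token as plain data: [type, representation, start, end, extra].
// The representation stays byte-for-byte identical to the source; the
// unescaped value lives in the extra-data slot (shape assumed here).
const source = '\\64 iv'; // the ident "div", with "d" written as an escape
const token = ['ident-token', source, 0, 5, { value: 'div' }];

const [type, representation, start, end, data] = token;
console.log(type, data.value);          // ident-token div
console.log(representation === source); // true
```

Because the representation is untouched, stringification is lossless, while plugins that care about meaning compare `data.value` instead of the raw text.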
How do those tokens work:
A bad url token: https://drafts.csswg.org/css-syntax/#consume-url-token

```css
image: url(();
```

How it looks in our parser:

```js
{
  const t = tokenizer({
    css: 'url(foo())',
  });
  assert.deepEqual(
    collectTokens(t),
    [
      ['bad-url-token', 'url(foo()', 0, 8, undefined],
      [')-token', ')', 9, 9, undefined],
      ['EOF-token', '', -1, -1, undefined],
    ],
  );
}
```
Update:

A bad string token is any string token with (unescaped) newlines in it:

```css
content: "foo
bar";
```

A function token is an ident + `(` (there is no look-ahead to the closing `)`).

An outtake from one of our tests:

A dimension token is a number + an ident:
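To illustrate the number/ident split, a simplified sketch (hypothetical helper: it treats a trailing exponent as part of the number, per css-syntax-3, but ignores escapes in the unit):

```javascript
// Split a dimension token's representation into number and unit parts.
// A <dimension-token> is a number immediately followed by an ident.
function splitDimension(repr) {
  const m = /^([+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)(.*)$/.exec(repr);
  if (!m || !m[2]) return null; // no unit: a plain number, not a dimension
  return { number: m[1], unit: m[2] };
}

console.log(splitDimension('10px').unit);     // px
console.log(splitDimension('1.5e2q').number); // 1.5e2
console.log(splitDimension('100'));           // null
```

Note the exponent rule: `1e2` is a single number token, while `2em` is the number `2` with the unit `em`, because the `e` is not followed by a digit.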
Just for information: https://github.com/swc-project/swc/blob/main/crates/swc_css_parser/src/lexer/mod.rs (Rust) is fully written to the specification, with a CST (normalized and raw values; the latter can be useful for linters and 1:1 serialization). So if you have questions about the implementation, we can check it.
I've started work on a WPT-like thing for CSS tokenizers. This should make it easier to compare multiple tokenizers and check them for bugs.

@ai What is the best way forward on this issue? Is it a breaking change for PostCSS to change the tokenizer? Since I can import the PostCSS tokenizer and use it, I assume that making any changes there is breaking. Maybe we can leave the current tokenizer in place; anyone who depends on it can keep using it. Are there other ways in which it can be breaking to switch the tokenizer?
No (as long as we don’t expose tokens in any public API).
TL;DR: I don't think we should pursue this further. I think this change will hurt PostCSS and its ecosystem. Having spent the last few weeks working on integrating a new tokenizer into PostCSS, I've identified a few areas where this change would be really bad.

**1. It is a breaking change**

Even if the tokenizer was never documented on the website, it was still exposed via the package itself. Some packages, libraries, and frameworks are using the PostCSS tokenizer directly. We can introduce another breaking change into the PostCSS ecosystem, but everything is still settling after the last one. Personally I would prefer to spend my time helping the community do more within the current version than spend it migrating to a PostCSS 9.

**2. It will make PostCSS slower**

The new tokenizer might have a few areas left that can be optimized to make it faster, but it will never be faster than the current tokenizer. It is simply doing a lot more work, and it needs to do this work. Plugins will only be able to re-use the tokens from the initial parsing phase. At best we would skip tokenizing some values one time.

**3. It will make PostCSS more complicated to maintain and use**

As you can see in the pull request, this is not a trivial change: #1812

But it will also make writing syntaxes and plugins more complicated. Syntax authors will need to make similar tokenizer and parser changes for the ecosystem as a whole to still make sense. Without this change everything is just a string. Strings are messy, and a bad regexp replace can introduce bugs. But at the same time strings are also simple to use and manipulate. The simplicity of using PostCSS to manipulate CSS today is what makes PostCSS an awesome tool.

**4. We do not need to change PostCSS to have a better tokenizer and selector/value parsers**

Why disrupt the ecosystem if we can fix this modularly?
The tokenizer we wrote for media query parsing can be used by anyone who wants it: https://github.com/csstools/postcss-plugins/tree/postcss-preset-env--v8/packages/css-tokenizer#readme

We also maintain a collection of parsing algorithms for this tokenizer. Together they form a solid base to build specific parsers.

**Closing**

If there was even a small chance that this change would roll out smoothly and have an overall positive impact on the PostCSS ecosystem, I would happily spend a large chunk of my time to make it happen. If anyone has insights or further concerns, that would be most helpful :)
Yes, I agree that the changes became too big not to hurt the ecosystem in some way.
It was an amazing experiment, thanks @romainmenke |
Thank you for the feedback 🙇 |
The values part is possible, but at the cost of thousands of lines of code to handle each kind of value, because there are so many possible values in CSS.

The selector part is possible as well.

EDIT: It could be made more complex, and the Selector's interface:

```ts
interface SelectorType {
  type: "Selector" | "Raws";
  content: string;
}
```
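A sketch of how a list of such chunks could round-trip a selector string. The chunk data and `stringify` helper below are hypothetical, built only on the two-field interface above:

```javascript
// A selector list as alternating "Selector" and "Raws" chunks: whitespace,
// combinators, and commas live in Raws chunks, so stringifying is plain
// concatenation and the source survives 1:1.
const chunks = [
  { type: 'Selector', content: '.a' },
  { type: 'Raws', content: ' > ' },
  { type: 'Selector', content: '.b' },
  { type: 'Raws', content: ', ' },
  { type: 'Selector', content: '.c' },
];

const stringify = (list) => list.map((c) => c.content).join('');
console.log(stringify(chunks)); // .a > .b, .c
```

A plugin that only cares about compound selectors can filter for `type === 'Selector'` and never touch the raws, which keeps the original formatting intact.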
I would like to make an ambitious proposal. In PostCSS 7, could we parse selectors and values, or provide that parsing functionality out-of-the-box? If so, how could I help?
At the lowest effort, could we integrate postcss-selector-parser and postcss-values-parser?
At a higher effort, could we integrate the tokenizers? And have one tokenization to rule them all?