Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support codemods: use a CST #12806

Open
conartist6 opened this issue May 7, 2022 · 24 comments
Open

Support codemods: use a CST #12806

conartist6 opened this issue May 7, 2022 · 24 comments
Labels
type:meta Issues about the status of Prettier, or anything else about Prettier itself

Comments

@conartist6
Copy link

Related issues: #2068, #4675, #5998, and my own earlier #12709.

Currently this is possible:

  • Give recast your favorite parser and have it parse source text into an AST and make a copy of that AST for later comparison.
  • Modify the AST as your desired transform dictates. You cannot manipulate concrete syntax, so I hope you don't want line breaks.
  • Recast prints combining a format-preserving printer with a generic pretty printer.
  • Prettier parses the source text into a new AST
  • Prettier prints its AST into a string using the source text to inform some formatting decisions.

I do not like this workflow:

  • It involves three complete copies of the AST and an extra intermediate representation of the text.
  • Recast has significant bugs in its printing.
  • Recast's pretty printer is largely unneeded and unmaintained since most of the development on it has occurred in a fork known as "prettier".

I have started work on implementing an alternate workflow:

  • Parse source text into an AST with your favorite parser
  • Integrate concrete syntax with that AST by adding node.tokens to every node. Now you have a concrete syntax tree, or CST!
  • Modify this CST according to the logic of your transform. Transforms are free to tweak the concrete syntax.
  • Pass the CST to prettier, which mutates the tokens to achieve pretty-printing.
  • Print the tokens through simple traversal.

In order to make this possible, I am building a library whose working name is cst-tokens. Its functionality centers around building and synchronizing concrete syntax stored in node.tokens arrays.

My research indicates that this structure could be used to satisfy the kinds of queries prettier currently can only make against source text (i.e. originalText), though actually implementing those queries is still likely to be nontrivial.

@alexander-akait
Copy link
Member

alexander-akait commented May 7, 2022

I think it's better to create your own instrument, we don't plan do it, sorry

@conartist6
Copy link
Author

I don't understand. You're saying I should reinvent prettier? What I am inventing is a tool that would make it possible for prettier to support this.

@conartist6
Copy link
Author

conartist6 commented May 7, 2022

In particular what I haven't built yet but could offer is a tokensLookahead(node) iterator and a tokensLookbehind(node) iterator. This is what would make it easy for prettier to answer queries like:

  • Is the next non-whitespace token {?
  • Is the preceding line blank?
  • Is there an inline comment for this parameter?

@fisker
Copy link
Member

fisker commented May 7, 2022

Actually, I think use a CST is a good idea, it solve many problems, eg comments in for (;;);, check the {} in import {} from "foo", comments in array holes.

We already got the tokens, but not used yet.

opts[Symbol.for("tokens")] = ast.tokens || [];

@fisker
Copy link
Member

fisker commented May 7, 2022

One reason we are not using tokens #10492 (comment).

@conartist6
Copy link
Author

Oh yeah. I don't think the memory costs of tokens should be too bad. There's a lot of places in this code that I could cache the token objects. For example there's no reason why every { type: 'Punctuator', value: '{' } token shouldn't be the same object.

@fisker
Copy link
Member

fisker commented May 7, 2022

ESLint's TokenStore seems a good way to consume tokens.

@conartist6
Copy link
Author

It looks like a pretty big/complex API but I could try implementing some parts of it over the CST. I should also definitely change node.tokens to something like node[Symbol.for('cst.tokens')]. Another idea that I kind of like for saving memory: force newlines into their own Whitespace tokens. This way all the newline tokens can be reused and it's also highly likely that you'll be able to reuse the tokens for comment indentations like ' ' (2space), ' ' (4space), etc. The data will then be stored in a non-normal form, but that should be fine since we need to assume that modifications may have made it non-normal anyway.

@alexander-akait
Copy link
Member

alexander-akait commented May 7, 2022

@fisker You forgot about many logic around parent nodes + parens + context printing, using CST is not simple for it (you don't know context of token - for example number token - is it just number, or maybe property, or it is inside expression and more and more), yes it will allow to print 1:1 code, but what is our goal - pretty code? I see how @conartist6 want to hack prettier and insert own logic for spaces and newlines, I think this is just contrary to our philosophy, there are a lot of other tools which use CST.

Also no good CST parser currently for JS (so there are a lot of work here).

I don't mean to say that it's not possible to create good enough logic using CST, but it will take a lot of work. And prettier was not design for it.

@alexander-akait
Copy link
Member

alexander-akait commented May 7, 2022

And yes, to keep memory is low we need a lot of weak caches between AST nodes and CST tokens and rebuild some stuff very ofter, it means we will lose perf.

@conartist6 I want to ask one question - Why you want this behaviour? What is the end goal?

@conartist6
Copy link
Author

@alexander-akait What I'm talking about isn't a pure CST, it's a CST purely inside a normal AST. How you know about your parents and siblings is no different than it would be when traversing any AST. If you're looking at how to print a particular node you have access to its tokens, but you also have access to the traversal path with your parent references, which is what you need to have the ability to "look around".

@fisker
Copy link
Member

fisker commented May 7, 2022

I'm just saying I like the way how ESLint uses AST, tokens, and comments. We definitely don't want add tokens to Node(we've try to remove attaching comments to Node), but we can use tokens instead of originalText in some place.

@conartist6
Copy link
Author

conartist6 commented May 7, 2022

I want the behavior because I believe there should be first-class support for writing codemods. Your code is data, and it should be as easy as possible to write change it by writing a program. In particular you should probably never do a textual "find and replace" in your codebase. You'll end up changing things you didn't mean to. Even the simplest changes to a whole codebase need to be performed as modifications to ASTs, as that is the only way you can be sure what you are modifying.

I think there's an incredible amount of potential lost currently due to the inability to do this well. Imagine this: as a library author I change the name of an exported function in new major version. Some of my users have imported the function in 1200 files. So that they can upgrade, I publish a codemod that:

  • upgrades to my-library@next
  • goes through the code, finds import {oldName} from 'my-library' and changes them to import {newName as oldName} from 'my-library'. It could even check whether it was safe to just use newName in place of oldName everywhere in order to avoid the extra as.

In order to be able to make this kind of functionality valuable, I need to be able to avoid damaging the project's existing formatting. For hand-formatted projects this means respecting hand formatting. For prettier-formatted projects this means running prettier.

Recast is what is supposed to step in and solve this problem for me, but since prettier forked from it it has been suffering from prettier's success. While prettier's printer has evolved to near perfection, the pretty-printer embedded in recast has become obsolete. Nobody wants to do the same work twice. That's why instead of trying to compete with prettier I want to integrate with it.

@conartist6
Copy link
Author

@fisker Why do you definitely not want tokens on node?

@fisker
Copy link
Member

fisker commented May 7, 2022

#8122 (comment)
#9830

@fisker
Copy link
Member

fisker commented May 7, 2022

there should be first-class support for writing codemods.

I don't think any of our team member interest in this.

@conartist6
Copy link
Author

By first-class support I really just mean that there should be a formatAst(ast) or a formatCst(cst) method. It does seem there's interest in that right?

I think it actually makes more sense to offer formatCst(cst) because an AST should be a subtype of CST, and it would allow prettier to respect some aspects of input formatting as it usually does.

My point is this though: what would formatAst be useful for if its only way of building tokens is from the source text? If that is true, then removing nodes will work fine but reordering them will have strange results and creating them would break any lookaround system based on tokens. So: to me it doesn't look like your main problem with being able to create formatAst is really just that you might mutate a few nodes, it's that your method will break when given real inputs.

@conartist6
Copy link
Author

For the record though, my tool just helps you extrapolate tokens from the AST. It can overwrite properties of nodes, but it doesn't need to. It just helps you be able to define an algorithms in terms of tokens by ensuring that it can always generate them if they don't exist, and that it can always use them if they do. Prettier would most likely want to use generateAllTokens not buildAllTokens since as you mention you would like to avoid mutating the input.

@conartist6
Copy link
Author

conartist6 commented May 14, 2022

Just for fun, here's what I imagine being possible. I'm hoping this helps make it clear why I think a mutable tree is actually a really great and powerful thing:

import { readFile, writeFile } from 'fs/promises';
import { rebuildTokens, attachComments, print } from 'cst-tokens';
import prettier from 'prettier/cst';

export async function codemod(path, transform) {
  const cst = someParser.parse(await readFile(path, 'utf-8'));

  rebuildTokens(cst);

  attachComments(cst);

  transform(cst);

  prettier(cst);

  await writeFile(path, print(cst));
}

@conartist6
Copy link
Author

conartist6 commented May 14, 2022

The difference between this kind of mutating the AST and the kind of the mutations you're trying to avoid I think is that these are mutations expressed in terms of public APIs explicitly about AST mutation. They're not leaking bits of internal state out of prettier, they're the actual interface through which prettier does its work. I think such changes would make prettier more interoperable with other tools, not less. In the end, Prettier is changing your CST and your AST, and I don't see that it helps to hide that.

In addition to codemodding this offers pretty good support for IDE integration. You can keep once CST and keep passing it back and forth from prettier to the editor like so since it continues to stay in sync:

  • Generate AST from file
  • Decorate AST with CST nodes
  • Update CST with edits
  • Run prettier on CST
  • Sync CST back to file
  • Update CST with more edits
  • Run prettier on CST

The only problem that would be left unsolved is the precise definition of "Sync CST back to file". Right now you'd have to do what you have to do anyway: print the entire CST out and replace the entire file contents. I don't think this is so bad since, as noted, it's what has to happen already. Rome imagined a scenario where prettier would work with CSTs that have immutable nodes, and thus they could compare two CSTs and avoid reprinting parts of the tree that were unchanged. It's an interesting idea, but Rome is so perf-focused that they left the javascript ecosystem for Rust anyway, and I don't think the changes I'm suggesting actually rule out the possibility of later implementing reference-immutability if there is actually demand for it.

@conartist6
Copy link
Author

My plan is to actually build this integration for a small subset of syntax (imports). That way we can discuss the specific mechanisms by which it works, and I can also show people exactly what workflows this integration would make possible.

@conartist6
Copy link
Author

OK the challenges in implementing this are obviously... large. The best thing I could say about the possibility of actually building something is that the doc data structure will be tremendously useful. Essentially what we want is to have two doc builders: one that builds strings, and one that rebuilds CST tokens.

The question is: how would the builder that rebuilds CST tokens know which node's tokens it should be changing? The doc object is sort of tree-structured, but it isn't doesn't include any references to the actual AST, which I think is very much part of its design goals.

We do have an AST that is known to match the contents of the doc, in particular because prettier actually does implement formatting changes that would change the AST (e.g. (2 + 2) + 2 -> 2 + 2 + 2) as AST mutations, which in this case helps us.

I'm not sure quite how yet, but I'm pretty sure that with the parts I've built it should be possible to traverse the AST nodes guided by the content of the doc's input, while outputting to each node's tokens array. I need to figure out exactly how to allow the grammar to guide that process, and how all the parts would work together. If I can manage to put all the parts together without needing any modifications to the structure of the doc to support my use case, I think I have a decent chance of making this real. If I can't, game over.

@conartist6
Copy link
Author

conartist6 commented Jun 29, 2022

No, ok, I think I can do it all, and I don't need anything major from prettier. Here's exactly how I think the code will look:

const __prettier = require('prettier').__debug;
const cstTokens = require('cst-tokens');

function codemod(sourceText, transform) {
  const ast = __prettier.parse(sourceText);

  cstTokens.updateTokens(ast, { sourceText });

  // Run some arbitrary codemod
  transform(ast);

  cstTokens.updateRanges(ast);

  return __prettier.formatAST(ast, { originalText: cstTokens.print(ast) });
}

@conartist6
Copy link
Author

I am pleased to say that I've finished enough of the cst-tokens library to start experimenting with the prettier integration as shown in the example above.

The next major steps will be to build out the grammar, build a test suite, and implement comment attachment. The cool thing is that in theory this data structure allows me to implement comment attachment logic in such a way that it could be reused by any library, including but not limited to prettier.

@thorn0 thorn0 added the type:meta Issues about the status of Prettier, or anything else about Prettier itself label Sep 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:meta Issues about the status of Prettier, or anything else about Prettier itself
Projects
None yet
Development

No branches or pull requests

4 participants