New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Overhaul comment attachment #13521
Overhaul comment attachment #13521
Conversation
Build successful! You can test your changes in the REPL here: https://babeljs.io/repl/build/47263/ |
This pull request is automatically built and testable in CodeSandbox. To see build info of the built libraries, click here or the icon next to each commit SHA. Latest deployment of this branch, based on commit 4f076f8:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing work! 🎉
@@ -0,0 +1 @@ | |||
foo /* 1 */ (/* 2 */) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment /* 2 */
is now an innerComment
of CallExpression
, which is not bad since 1) it is lost on current main and 2) we don't have AST structures for the parenthesis token.
* @property {Array<Comment>} comments - the containing comments | ||
* @property {Node | null} leadingNode - the immediately preceding AST node of the whitespace token | ||
* @property {Node | null} trailingNode - the immediately following AST node of the whitespace token | ||
* @property {Node | null} containerNode - the outermost AST node containing the whitespace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in the future we can change the behaviour to have innerComments
attached to the innermost AST node. So in the example foo (/* 2 */)
, /* 2 */
will the an innerComments
of call expression. The idea is to provide the generator more information about how to insert innerComments
.
Or should we change in this PR now?
Edit: I realize I have to implement such change as required by the trailing comma comment adjustments. I am surprised that changing outermost to innermost does not break any old tests. I added new tests to capture this behaviour change.
Sorry, didn't realize it wasn't ready for review 😅 |
Converted back to draft as I just find new bugs: e.g. Babel does not attach comments after a DirectiveLabel: "use strict"/* foo */; |
@JLHwung Kataw have a bunch of comment tests you can find here. In fact Kataw is the only one that get all comments 100% correct,, but I havent attached all of them yet. I still think you will have issues with class semicolon and elisons in array literal and pattern. There are also some other edge cases. This is basically because of how the Babel AST is designed, but you can work around this internally if you e.g use a dictionary lookup table and save all loc pos in that one and do a comparison against the "real nodes" when you try to attach. That way you would get the location for cases like The same for this case |
@JLHwung I forgot that you may suffer with trailing comments if Babel in the future should allow optional trailing comments as in Prettier. Typescript have a few edge cases with and without comments where trailing comma is required. Not sure how Babel works around it. Prettier suffer from bugs when it comes to comments and trailing comma and doesn't attach comments in 8 out of 10 cases so it may be a win case for Babel if getting it right. For example |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only read the PR description so far. We should probably link this PR in the code, or add a src/parser/comments.md
file, so that whenever someone will have to modify the algorithm they can read the rationale behind it.
-
(P2) should probably be
w1.start ≤ w2.start ≤ w1.end
-
What does (P3) mean for an input code which is just
foo
(without any spaces or newlines)?
unattachedCommentStack.splice(i, 1); | ||
} else if (node.type !== "Program") { | ||
// we have a node share the same length of containerNode, but its finishNode is invoked later | ||
// than containerNode, so this node is the outer node. E.g. ExpressionStatement contains a VariableDeclaration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: ExpressionStatement
cannot contain a VariableDeclaration
, but maybe an AssignmentExpression
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! You are right. This branch is removed so we are all good here.
case "ObjectPattern": | ||
this.adjustInnerComments(node, node.properties, comments); | ||
break; | ||
case "CallExpression": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need to do this for OptionalCallExpression
, and maybe for function definitions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is addressed in e5f09a5.
Just in case, are you reviewing on old diff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh I think I was accidentally reviewing a specific commit 😅
The main issue is the performance. I see the benchmarks, but in real life. How many source code files per second does this generator perform? And how much memory is consumed? Seems to be too much. Still using a class to handle comments, unnecessary iterations that could have been done in one go etc. |
Playing around with Babel REPL and found lots of comment issues Here is a few switch (c) { /*1*/ } /*2*/
switch (c) { /*1*/ } /*2*/
switch (c)/*1*/ { /*2*/ } /*3*/
x(/*1*/)
(a(/*1*/))
(a())(a(/*1*/))
[,,,/*1*/,,,,]
// should not break
"string"; /*1*/ |
Good question! In that case the list of comment whitespaces is an empty set. P3 is then deduced to nothingness.
I can add it to the
I run the performance test on https://github.com/babel/parser_performance. When parsing
I have added a new identifier benchmark result (without any comments). Because the overhead of comment tracking is reduced, the new algorithm is 15% faster than current one. The identifier benchmark predict the performance improvements on
Thanks! I checked the cases and the comments are now all attached to AST but
Why? I think they are equivalent. Babel generator does not preserve the whitespace. |
@JLHwung This case is a trailing of the empty stmt. After removing the semicolon ";" - it should still be the same, but be a "trailing of string literal". In current REPL a line break is inserted and the comment is no longer a trailing comment. It's a detached comment. // should not break
"string"; /*1*/ Babel REPL "use strict";
"string";
/*1*/ Why using Acorn and Meriyah in the benchmark? They doesn't have any printer / generator support. And this PR is about Babel generator and internal code? Why not add Kataw to the benchmark? The printer is located here. |
@KFlash This PR is focused on parser because before this PR we failed to attach comments in edge cases, so we should prioritize the parser part. The generator support would be addressed later. Babel parses
is an empty statement with trailing comments. |
The Kataw parser can be located here in case you want to add it ;) Regarding the comment. It was printed wrong in the REPL but maybe parsed correctly. Compare against Prettier. They also handle it as an trailing comment. Not a detached one. |
@KFlash I think the comment attachment can only be perfectly handled on a CST where every non-whitespace token has a node representation to which a comment can be attached. The ideal implementation should be language-agnostic as long as space and non-space are well defined. If Babel had sort of |
After my opinion there is a "design flaw" in all AST parser when it comes to loc tracking. It shouldn't count whitespace at all. That way you have more control, and you can use the start / end value of each AST / CST node in the printer to use a separated WS skipping that starts at given position and ends if hit a token that is not whitespace. Lists is defined in the ECMA specs and you would need an extra AST node to get correct loc position after the list token has been consumed. E.g. '(', '{' etc. This is out of scope for Babel, but the only way you can get comments 100% correct. You can even do a "slice" between end loc of previous node and start loc of current node to collect the whitespace with comments if you want a 1:1 printing. This isn't possible with AST parsers. With this kind of algorithm you will not experience any overhead either if you want to "collect" a specific comment. You can do that directly in the lexer. See here how I collect a single line ignore comment. No slice or any string manipulation. |
@JLHwung Here is close to 200 comment tests. This PR should parse them all. @nicolo-ribaudo Did you validate all possible comment attachment combinations before approved this PR? Check the test I linked too. This PR still fails on 2 out of 5 of this tests, so there are still things to fix. |
@KFlash Can you offer a list of failing tests? If the comment is not attached to AST nodes, it should be addressed in this PR. Otherwise it is a generator bug and will be addressed later. |
@JLHwung Just test all the 200 tests I linked too. You can see if the comments are attached or not. As said earlier 2 out of 5 cases the comments are not attached. You can find a sub-folder in the tests I linked too - "Babel issues". I haven't tested if they are fixed with this PR |
b256e51
to
4f076f8
Compare
@@ -0,0 +1,129 @@ | |||
# Comment attachment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
* refactor: inline pushComment * chore: add benchmark cases * perf: overhaul comment attachment * cleanup * update test fixtures They are all bugfixes. * fix: merge HTMLComment parsing to skipSpace * perf: remove unattachedCommentStack baseline 128 nested leading comments: 11_034 ops/sec ±50.64% (0.091ms) baseline 256 nested leading comments: 6_037 ops/sec ±11.46% (0.166ms) baseline 512 nested leading comments: 3_077 ops/sec ±2.31% (0.325ms) baseline 1024 nested leading comments: 1_374 ops/sec ±3.22% (0.728ms) current 128 nested leading comments: 11_027 ops/sec ±37.41% (0.091ms) current 256 nested leading comments: 6_736 ops/sec ±1.39% (0.148ms) current 512 nested leading comments: 3_306 ops/sec ±0.69% (0.302ms) current 1024 nested leading comments: 1_579 ops/sec ±2.09% (0.633ms) baseline 128 nested trailing comments: 10_073 ops/sec ±42.95% (0.099ms) baseline 256 nested trailing comments: 6_294 ops/sec ±2.19% (0.159ms) baseline 512 nested trailing comments: 3_041 ops/sec ±0.8% (0.329ms) baseline 1024 nested trailing comments: 1_530 ops/sec ±1.18% (0.654ms) current 128 nested trailing comments: 11_461 ops/sec ±44.89% (0.087ms) current 256 nested trailing comments: 7_212 ops/sec ±1.6% (0.139ms) current 512 nested trailing comments: 3_403 ops/sec ±1% (0.294ms) current 1024 nested trailing comments: 1_539 ops/sec ±1.49% (0.65ms) * fix: do not expose CommentWhitespace type * add comments on CommentWhitespace * add test case for babel#11576 * fix: mark containerNode be the innermost node containing commentWS * fix: adjust trailing comma comments for Record/Tuple/OptionalCall * fix: drain comment stacks in parseExpression * docs: update comments * add a new benchmark * chore: containerNode => containingNode * add more benchmark cases * fix: avoid finishNodeAt in stmtToDirective * finalize comment right after containerNode is set * add testcase about directive * fix: finish SequenceExpression at current pos and adjust later * chore: rename test cases * add new test case on switch statement * fix: adjust comments after trailing comma of function params * add comment attachment design doc * misc fix * fix: reset previous trailing comments when parsing async method/accessor * chore: add more comment testcases * fix flow errors * fix: handle comments when parsing async arrow * fix: handle comments when "static" is a class modifier * fix flow errors * fix: handle comments when parsing async function/do * refactor: simplify resetPreviousNodeTrailingComments * update test fixtures
}
/+
are mis attached as leading/trailing comments. Fixes #11576, closes #12560Abstract
This PR overhauls current comment attachments. The updated test fixtures are all bugfixes. It is faster than current approach in general cases.
Design docs
Benchmark Results
leading comments + trailing comments (common use case, O(n^2) -> O(n))
leading comments (10%)
nested leading comments (O(n^2) -> O(n))
trailing comments (marginal improvements)
nested trailing comments (25% slower)
I think it is acceptable since trailing-only comments is rare. Most comments are both leading / trailing comments of adjacent AST nodes.
inner comments (10%)
nested inner comments (current main incorrectly produces leading comments)
many identifiers (without comments, 15%)