Understanding AST (Abstract Syntax Tree) #12912
-
This discussion is split off from #12907.

At this point I have gone through an endless number of YouTube videos (best video I found so far: https://www.youtube.com/watch?v=kzDuHh6kolk&t=977s) (another video: https://www.youtube.com/watch?v=jpfaXK4xCYE&t=12s). Sadly, a lot of them are about JavaScript-based ASTs, CSTs, parse trees, etc., but very few or none cover ASTs the way we print and work with them. I did get a lot of ideas from them. I am facing some problems which I wish to discuss below:

(Image: my notes that I made)

The biggest issue I'm having with ASTs is predicting what AST will be printed for a Java source file. One issue is how to predict what the children of a particular node will be (look at the sample Java code I have shared below):

PS C:\Users\shubh\OneDrive\Desktop\checkstyle project> cat test.java
package com.example.project;
public class MyClass {
// class implementation goes here
}
PS C:\Users\shubh\OneDrive\Desktop\checkstyle project> java -jar "checkstyle-10.7.0-all.jar" -t test.java
COMPILATION_UNIT -> COMPILATION_UNIT [1:0]
|--PACKAGE_DEF -> package [1:0]
|   |--ANNOTATIONS -> ANNOTATIONS [1:19]
|   |--DOT -> . [1:19]
|   |   |--DOT -> . [1:11]
|   |   |   |--IDENT -> com [1:8]
|   |   |   `--IDENT -> example [1:12]
|   |   `--IDENT -> project [1:20]
|   `--SEMI -> ; [1:27]
`--CLASS_DEF -> CLASS_DEF [3:0]
    |--MODIFIERS -> MODIFIERS [3:0]
    |   `--LITERAL_PUBLIC -> public [3:0]
    |--LITERAL_CLASS -> class [3:7]
    |--IDENT -> MyClass [3:13]
    `--OBJBLOCK -> OBJBLOCK [3:21]
        |--LCURLY -> { [3:21]
        `--RCURLY -> } [5:0]
PS C:\Users\shubh\OneDrive\Desktop\checkstyle project>

I am rewriting the part where I have a problem:

COMPILATION_UNIT -> COMPILATION_UNIT [1:0]
|--PACKAGE_DEF -> package [1:0]
|   |--ANNOTATIONS -> ANNOTATIONS [1:19]
|   |--DOT -> . [1:19]
|   |   |--DOT -> . [1:11]
|   |   |   |--IDENT -> com [1:8]
|   |   |   `--IDENT -> example [1:12]
|   |   `--IDENT -> project [1:20]
|   `--SEMI -> ; [1:27]

(Table: "The AST is printing" vs. "What's really going on in my head")

See, I have a Java file in which I wish to search for particular tokens. I know I will have to use the generated AST for that, so I generate the AST and study it. But when I study the AST, I cannot find any pattern in how it is generated. I understand about 60 to 70 percent of the AST, as it is just another tree, but working out what the children of a node will be is difficult for me. As in the example discussed above, I find instances like these throughout any AST, and they confuse me.
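For the "search for particular tokens" part, a minimal sketch in plain Java may help. It does not use Checkstyle: the `Ast` class, `findAll`, and `samplePackageDef` are hypothetical names made up for illustration, and the tree is hand-built to mirror the PACKAGE_DEF subtree printed above. The only assumption is that each node knows its first child and its next sibling, so a search is a depth-first walk that compares the token type at every node.

```java
import java.util.ArrayList;
import java.util.List;

public class ToyTokenSearch {

    /** Hypothetical toy node in first-child/next-sibling form (not a Checkstyle class). */
    public static final class Ast {
        public final String type;
        public final String text;
        public Ast firstChild;
        public Ast nextSibling;

        public Ast(String type, String text, Ast... children) {
            this.type = type;
            this.text = text;
            // Link the children into a sibling chain, in source order.
            Ast previous = null;
            for (Ast child : children) {
                if (previous == null) {
                    firstChild = child;
                } else {
                    previous.nextSibling = child;
                }
                previous = child;
            }
        }
    }

    /** Returns every node of the given type, in depth-first (pre-order) order. */
    public static List<Ast> findAll(Ast root, String type) {
        List<Ast> found = new ArrayList<>();
        collect(root, type, found);
        return found;
    }

    private static void collect(Ast node, String type, List<Ast> found) {
        if (node.type.equals(type)) {
            found.add(node);
        }
        for (Ast child = node.firstChild; child != null; child = child.nextSibling) {
            collect(child, type, found);
        }
    }

    /** Hand-built copy of the PACKAGE_DEF subtree for "package com.example.project;". */
    public static Ast samplePackageDef() {
        return new Ast("PACKAGE_DEF", "package",
            new Ast("ANNOTATIONS", "ANNOTATIONS"),
            new Ast("DOT", ".",
                new Ast("DOT", ".",
                    new Ast("IDENT", "com"),
                    new Ast("IDENT", "example")),
                new Ast("IDENT", "project")),
            new Ast("SEMI", ";"));
    }

    public static void main(String[] args) {
        // Prints the text of each IDENT node: com, example, project (one per line).
        for (Ast ident : findAll(samplePackageDef(), "IDENT")) {
            System.out.println(ident.text);
        }
    }
}
```

The point of the sketch is that the search never needs to predict the shape of the tree in advance: it visits every node and filters by type.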
Replies: 2 comments 3 replies
-
Can I use some tool to understand ASTs better? At the moment I am going through the repo @nrmancuso shared in the previous discussion.
-
Abstract syntax trees are a way to represent the structure of source code that is convenient to me. Depending on how I want to use the AST, I might remove elements I don't care about, or add imaginary ones that help me do whatever I want to do with it. An AST is just my interpretation of some source code. Don't get hung up with technical stuff, this is why it is abstract :)

In Checkstyle, this is how we generate the AST:
So, a concrete example is this: source code:
This all happens in the code generated by this file. So now we have a token stream. It looks like this:
Now, this token stream matches exactly to the
So, as we parse the token stream from above, we end up with this parse tree (ANTLR builds this for us):
You can see that the parse tree has a bunch of extra nodes that are of little value to Checkstyle. So now, we traverse the parse tree to build our vision of what we want the AST to look like. This happens in the JavaAstVisitor class. Result:
You can look at the parser grammar, find the rule you are interested in, and see all possible children by following its subrules. But remember, this is only part of the story, because we transform the parse tree into our AST in the visitor.
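To make the nested DOT shape in the question less surprising, here is a minimal sketch in plain Java. It does not use Checkstyle or ANTLR: `Node`, `buildQualifiedName`, and `buildPackageDef` are hypothetical names invented for this illustration, and the real grammar rules and `JavaAstVisitor` logic are more involved. The sketch assumes a qualified name is read left to right, so each `.` becomes the parent of everything read so far plus the next identifier. It also adds an "imaginary" ANNOTATIONS node that has no text of its own in the source, in the spirit of the description above.

```java
import java.util.ArrayList;
import java.util.List;

public class ToyPackageAst {

    /** Hypothetical toy node: a token type, its text, and ordered children. */
    public static final class Node {
        public final String type;
        public final String text;
        public final List<Node> children = new ArrayList<>();

        public Node(String type, String text) {
            this.type = type;
            this.text = text;
        }

        public Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Folds "a.b.c" left to right into ((a . b) . c). */
    public static Node buildQualifiedName(String name) {
        String[] parts = name.split("\\.");
        Node current = new Node("IDENT", parts[0]);
        for (int i = 1; i < parts.length; i++) {
            Node dot = new Node("DOT", ".");
            dot.add(current);                      // everything read so far
            dot.add(new Node("IDENT", parts[i]));  // the next identifier
            current = dot;
        }
        return current;
    }

    /** Builds a toy PACKAGE_DEF subtree for "package <name>;". */
    public static Node buildPackageDef(String name) {
        return new Node("PACKAGE_DEF", "package")
            .add(new Node("ANNOTATIONS", "ANNOTATIONS")) // imaginary node, empty here
            .add(buildQualifiedName(name))
            .add(new Node("SEMI", ";"));
    }

    /** Renders the tree in a layout similar to the printed AST (without positions). */
    public static String print(Node root) {
        StringBuilder sb = new StringBuilder();
        sb.append(root.type).append(" -> ").append(root.text).append('\n');
        printChildren(root, "", sb);
        return sb.toString();
    }

    private static void printChildren(Node node, String prefix, StringBuilder sb) {
        for (int i = 0; i < node.children.size(); i++) {
            Node child = node.children.get(i);
            boolean last = i == node.children.size() - 1;
            sb.append(prefix).append(last ? "`--" : "|--")
              .append(child.type).append(" -> ").append(child.text).append('\n');
            printChildren(child, prefix + (last ? "    " : "|   "), sb);
        }
    }

    public static void main(String[] args) {
        System.out.print(print(buildPackageDef("com.example.project")));
    }
}
```

Running `main` prints a PACKAGE_DEF subtree with the same nesting as in the question: the outer DOT has the inner DOT (`com` and `example`) as its first child and `project` as its second. Under this assumption, a longer package name simply adds one more DOT level at the top for each extra segment.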
@shubh220922
Abstract syntax trees are a way to represent the structure of source code that is convenient to me. Depending on how I want to use the AST, I might remove elements I don't care about, or add imaginary ones that help me do whatever I want to do with it. An AST is just my interpretation of some source code. Don't get hung up with technical stuff, this is why it is abstract :)
In Checkstyle, this is how we generate the AST: