Understanding AST (Abstract Syntax Tree) #12912

ThatSneakyCoder · 2023-03-26T10:21:42Z

This is a break up discussion from #12907.

At this point I have gone through an endless number of YouTube videos (best video I found so far: https://www.youtube.com/watch?v=kzDuHh6kolk&t=977s) (another video: https://www.youtube.com/watch?v=jpfaXK4xCYE&t=12s). Sadly, a lot of them are about JavaScript-based ASTs, CSTs, parse trees, etc., but very few or none cover ASTs the way we print and work with them. I did get a lot of ideas from them. I am facing some problems that I wish to discuss below:

(Image below: my notes that I made)

The biggest issue I'm having with ASTs is predicting what AST will be printed for a Java source file. One issue I have is how to predict what the children of a particular node will be (look at the sample Java code I have shared below):

PS C:\Users\shubh\OneDrive\Desktop\checkstyle project> cat test.java
package com.example.project;

public class MyClass {
    // class implementation goes here
}
PS C:\Users\shubh\OneDrive\Desktop\checkstyle project> java -jar "checkstyle-10.7.0-all.jar" -t test.java
COMPILATION_UNIT -> COMPILATION_UNIT [1:0]
|--PACKAGE_DEF -> package [1:0]
|   |--ANNOTATIONS -> ANNOTATIONS [1:19]
|   |--DOT -> . [1:19]
|   |   |--DOT -> . [1:11]
|   |   |   |--IDENT -> com [1:8]
|   |   |   `--IDENT -> example [1:12]
|   |   `--IDENT -> project [1:20]
|   `--SEMI -> ; [1:27]
`--CLASS_DEF -> CLASS_DEF [3:0]
    |--MODIFIERS -> MODIFIERS [3:0]
    |   `--LITERAL_PUBLIC -> public [3:0]
    |--LITERAL_CLASS -> class [3:7]
    |--IDENT -> MyClass [3:13]
    `--OBJBLOCK -> OBJBLOCK [3:21]
        |--LCURLY -> { [3:21]
        `--RCURLY -> } [5:0]
PS C:\Users\shubh\OneDrive\Desktop\checkstyle project>

I am rewriting the part where I have problem:

COMPILATION_UNIT -> COMPILATION_UNIT [1:0]
|--PACKAGE_DEF -> package [1:0]
|   |--ANNOTATIONS -> ANNOTATIONS [1:19]
|   |--DOT -> . [1:19]
|   |   |--DOT -> . [1:11]
|   |   |   |--IDENT -> com [1:8]
|   |   |   `--IDENT -> example [1:12]
|   |   `--IDENT -> project [1:20]
|   `--SEMI -> ; [1:27]

The AST prints a DOT first, and it is the DOT at 1:19; then it goes in reverse and prints the DOT at 1:11. What I mean to ask is: how is the DOT at 1:19 a child of PACKAGE_DEF while the DOT at 1:11 is not, even though the DOT at 1:11 is closer to the token package (PACKAGE_DEF)?

What's really going on in my head

I have a Java file in which I wish to search for particular tokens. I know I will have to use the generated AST for that, so I generate the AST and study it. But when I study the AST, I cannot find any pattern in how it is generated. I understand about 60 to 70 percent of the AST, since it is just another tree, but it is difficult for me to work out what the children of a node will be. In the example discussed above, I couldn't figure out which DOT would be the child of which token.

I find instances like these throughout any AST, and they confuse me.
(some more conversations:)

ThatSneakyCoder · 2023-03-26T10:22:33Z

ThatSneakyCoder
Mar 26, 2023
Author

Can I use some tool to understand ASTs better? At the moment I am going through the repo @nrmancuso shared in the previous discussion.

3 replies

romani Mar 26, 2023
Maintainer

The biggest issue I'm having with ASTs is predicting what AST will be printed for a java source file

Please never do this; none of us knows it.
Some branches place tokens in the same order as in the file, some do not. We cannot fix all the mistakes in the AST, so it is weird sometimes.
Just code by "test-driven development".
Write some input Java files, put violations on the lines where you think they belong, and write code to match your expectations.
Run regression report generation to see wild code structures.

romani Mar 26, 2023
Maintainer

@nrmancuso, can you share some patterns that you know?

nrmancuso Mar 26, 2023
Maintainer

@romani I planned to give a detailed reply to @shubh220922 the next time I am at my laptop. You know this is my favorite topic :)

nrmancuso · 2023-03-27T05:13:37Z

nrmancuso
Mar 27, 2023
Maintainer

@shubh220922

Abstract syntax trees are a way to represent the structure of source code that is convenient to me. Depending on how I want to use the AST, I might remove elements I don't care about, or add imaginary ones that help me do whatever I want to do with it. An AST is just my interpretation of some source code. Don't get hung up with technical stuff, this is why it is abstract :)

In Checkstyle, this is how we generate the AST:

1. Create a stream of characters from the source code file
2. Recognize certain elements within this stream (tokens), create a token stream
3. Recognize groups of tokens that form larger elements (production rules)
4. Create a parse tree from the production rules
5. Visit parse tree nodes, and build our AST

So, a concrete example is this:

source code:

class C {}

Create character stream: c|l|a|s|s| |C| |{|}|EOF (each | separates a character in the stream)
Recognize that c|l|a|s|s is a LITERAL_CLASS token
Recognize that the space after it is whitespace, and we can ignore it in Java
Recognize that C is an identifier (IDENT)
Recognize that the next space is whitespace, and we can ignore it in Java
Recognize that { is a LCURLY
Recognize that } is a RCURLY
Recognize that we are at the end of the file (EOF) and can stop now (don't take this one for granted, because we can parse infinite streams, too)

This all happens in the code generated by this file.

So now we have a token stream. It looks like this:

LITERAL_CLASS IDENT LCURLY RCURLY EOF
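
To make the tokenizing step concrete, here is a minimal hand-written sketch that turns the example source into the token stream above. It is only an illustration: Checkstyle's real lexer is generated by ANTLR from a grammar, and the class name TinyLexer and its tokenize method are hypothetical. It knows only the handful of tokens used in this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hand-written lexer for illustration; not Checkstyle's generated lexer. */
final class TinyLexer {

    /** Splits source text into token type names, skipping whitespace. */
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace carries no meaning here, so it is dropped
            } else if (c == '{') {
                tokens.add("LCURLY");
                i++;
            } else if (c == '}') {
                tokens.add("RCURLY");
                i++;
            } else if (Character.isJavaIdentifierStart(c)) {
                // Consume the whole word, then decide: keyword or identifier.
                int start = i;
                while (i < source.length()
                        && Character.isJavaIdentifierPart(source.charAt(i))) {
                    i++;
                }
                String word = source.substring(start, i);
                tokens.add(word.equals("class") ? "LITERAL_CLASS" : "IDENT");
            } else {
                throw new IllegalArgumentException(
                        "Unexpected character '" + c + "' at index " + i);
            }
        }
        tokens.add("EOF"); // explicit end-of-input marker
        return tokens;
    }

    public static void main(String[] args) {
        // prints: LITERAL_CLASS IDENT LCURLY RCURLY EOF
        System.out.println(String.join(" ", tokenize("class C {}")));
    }
}
```

Note that the whitespace never reaches the token list, which is why it does not show up anywhere in the printed AST either.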

Now, this token stream exactly matches the classDeclaration production rule in our parser grammar:

checkstyle/src/main/resources/com/puppycrawl/tools/checkstyle/grammar/java/JavaLanguageParser.g4

Line 131 in 258c792

classDeclaration[List<ModifierContext> mods]

So, as we parse the token stream from above, we end up with this parse tree (ANTLR builds this for us):

You can see that the parse tree has a bunch of extra nodes that are of little value to Checkstyle. So now, we traverse the parse tree to build our vision of what we want the AST to look like. This happens in the JavaAstVisitor class.

Result:

COMPILATION_UNIT -> COMPILATION_UNIT [1:0]
`--CLASS_DEF -> CLASS_DEF [1:0]
    |--MODIFIERS -> MODIFIERS [1:0]
    |--LITERAL_CLASS -> class [1:0]
    |--IDENT -> C [1:6]
    `--OBJBLOCK -> OBJBLOCK [1:8]
        |--LCURLY -> { [1:8]
        `--RCURLY -> } [1:9]

The biggest issue I'm having with ASTs is predicting what AST will be printed for a java source file

You can look at the parser grammar, check the rule you are interested in, and see all possible children by going through subrules etc. But remember, this is only part of the story, because we transform the parse tree into our AST in the visitor.
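
On the DOT question from the original post, a rough sketch may help. It assumes the dotted package name is treated as a chain of binary DOT nodes built left to right, which is consistent with the tree printed for com.example.project. The Node class and the parse and render methods are hypothetical names for this illustration, not Checkstyle's DetailAST or visitor code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of how a dotted name nests; not Checkstyle code. */
final class DottedName {

    /** Minimal tree node: a token type, its text, and its children. */
    static final class Node {
        final String type;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String type, String text, Node... kids) {
            this.type = type;
            this.text = text;
            children.addAll(List.of(kids));
        }
    }

    /** Folds "a.b.c" left to right, so each new DOT wraps everything parsed so far. */
    static Node parse(String name) {
        String[] parts = name.split("\\.");
        Node tree = new Node("IDENT", parts[0]);
        for (int i = 1; i < parts.length; i++) {
            tree = new Node("DOT", ".", tree, new Node("IDENT", parts[i]));
        }
        return tree;
    }

    /** Renders one node per line, indented by depth. */
    static String render(Node node, int depth) {
        StringBuilder out = new StringBuilder();
        out.append("    ".repeat(depth))
           .append(node.type).append(" -> ").append(node.text).append('\n');
        for (Node child : node.children) {
            out.append(render(child, depth + 1));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(parse("com.example.project"), 0));
    }
}
```

Under this assumption, the last dot in the source ends up at the top: it joins "com.example" with "project", so it is the node attached directly to PACKAGE_DEF, while the first dot sits one level deeper because it only joins "com" with "example". Closeness to the package keyword in the file does not decide the parent; the grouping of the expression does.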

0 replies
