Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using antlr4 go to parse SQL takes too much time #4545

Open
tclxdean-lu opened this issue Mar 4, 2024 · 21 comments
Open

Using antlr4 go to parse SQL takes too much time #4545

tclxdean-lu opened this issue Mar 4, 2024 · 21 comments

Comments

@tclxdean-lu
Copy link

tclxdean-lu commented Mar 4, 2024

Using antlr4 go to parse SQL takes too much time, eg:

I am using https://github.com/antlr/grammars-v4/blob/master/sql/mysql/Positive-Technologies/MySqlParser.g4

sql := fmt.Sprintf(`select * from xx where id=1`)
	var is *antlr.InputStream
	is = antlr.NewInputStream(sql)
	lexer := parser.NewMySqlLexer(is)
	tokens := antlr.NewCommonTokenStream(lexer, antlr.TokenDefaultChannel)

	p := parser.NewMySqlParser(tokens)

	startTime := time.Now()
	p.Root()
	endTime := time.Now()
	elapsedTime := endTime.Sub(startTime)
	fmt.Printf("Execution time:: %s \n", elapsedTime)

output:

Execution time: 50ms

@jimidle
Copy link
Collaborator

jimidle commented Mar 4, 2024

This grammar is unfortunately garbage. I am an advocate of removing it as it comes up often. After spending sometime with the Go runtime to allow such poor grammars to function, I decided it was not worth it. Someone told me that there was a good grammar somewhere, maybe in the MySql source code.

Don't waste your time pursuing this.

@KvanTTT
Copy link
Member

KvanTTT commented Mar 5, 2024

This grammar is unfortunately garbage. I am an advocate of removing it as it comes up often.

If even it's garbage, it's ANTLR pitfail that it doesn't report grammar problems. ANTLR 4 was created with mind to allow user write simple grammars.

@jimidle
Copy link
Collaborator

jimidle commented Mar 5, 2024 via email

@KvanTTT
Copy link
Member

KvanTTT commented Mar 5, 2024

I don't support the idea with removing unless we have another better MySql grammar. Nobody forces anybody to fix this grammar issues.

@KvanTTT
Copy link
Member

KvanTTT commented Mar 5, 2024

Also, I think there are many poorly-written grammars in this repository. There are many issues with MySql grammar probably because it's more demandable compare to others.

@jimidle
Copy link
Collaborator

jimidle commented Mar 5, 2024

As part of MySQL Workbench, @mike-lischke has/had a v3 grammar that he wrote to use with the C target.

https://dev.mysql.com/blog-archive/parsing-in-mysql-workbench/

Mike's stuff is very good. @mike-lischke (I think I have the right Mike? ;), Can I take that and convert it to v4?

There was also a grammar released by Salesforce at some point many moons ago.

I will spend a little time on this and see where we get to.

@jimidle
Copy link
Collaborator

jimidle commented Mar 5, 2024

Actually, it looks like MySqlWorkbench now uses an official Oracle v4 grammar. I am not sure where that comes from originally, but we have this:

https://github.com/mysql/mysql-workbench/tree/8.0/library/parsers/grammars

@mike-lischke
Copy link
Member

Yes, it's the right Mike, Jim :-D

And yes, I wrote the original MySQL ANTLR grammar (for use in MySQL Workbench), at that time for ANTLR3 and the C target (created by Jim!). When you go back far enough in the git history, you might be able to find it there, still.

The current MySQL grammar in WB is outdated (4 years old) and supports only MySQL 5.6, 5.7 and 8.0. The latest and greatest version of it is now in the MySQL Shell GUI for VS Code project (part of the shell plugins). This one, however, only supports MySQL 8.0+ and requires TypeScript as target.

Since a while I only have had a readme for my MySQL Grammar in the grammar-v4 repository, describing where to get it (it needs a bit support code), but I plan to make this a full grammar entry as soon as I can find the time for that.

@jimidle
Copy link
Collaborator

jimidle commented Mar 5, 2024

Yes, it's the right Mike, Jim :-D

And yes, I wrote the original MySQL ANTLR grammar (for use in MySQL Workbench), at that time for ANTLR3 and the C target (created by Jim!). When you go back far enough in the git history, you might be able to find it there, still.

The current MySQL grammar in WB is outdated (4 years old) and supports only MySQL 5.6, 5.7 and 8.0. The latest and greatest version of it is now in the MySQL Shell GUI for VS Code project (part of the shell plugins). This one, however, only supports MySQL 8.0+ and requires TypeScript as target.

Since a while I only have had a readme for my MySQL Grammar in the grammar-v4 repository, describing where to get it (it needs a bit support code), but I plan to make this a full grammar entry as soon as I can find the time for that.

OK - cool. I was just taking a look now. I will take a look through your Typescript targeted one as maybe it can be made more generic. No problem letting you do it when you have time. Maybe for now, we can change the README for the other one and refer to these other repos as perhaps being a better place to start.

If it seems that making your grammar generic is a reasonable task, then I may just do it. I am going to put some ANTLR time in these next two weeks, such as upgrading the Maven plugin (nearly finished with that). I could do with some entries in my GitHub account as there is a dearth of any work out there right now and I am twiddling my thumbs a bit.

@mike-lischke
Copy link
Member

I highly welcome any help with that! You might be interested in my antlr4wasm repo where I have a set of benchmarks with my MySQL grammar (the one there is even newer than the one in the MSG project, as I have this updated in February for the upcoming release). And in this benchmark folder there are two versions, one for TS and one for C++, including the stripped down support code. I wanted to add a Java version too, but failed so far. That should be a good starting point.

@jimidle
Copy link
Collaborator

jimidle commented Mar 5, 2024

OK - cool. I started on a Go version (but literally just created the project and downloaded your grammars. I was just about to ask if there was any need to support both MRS and non-MRS as separate units. It seemed like I could just use the MRS versions.

As every time I ask, you tell me about a new version, are there any other repos ;) - ha.

@mike-lischke
Copy link
Member

You can ignore MRS (short for MySQL REST Service) completely. That's a non-released language extension to MySQL and does not include standard MySQL. In fact in the build script for MSG I merge both grammars, as I didn't want to have something in the main grammar that doesn't really exist.

I'm really looking forward to get even more targets running the same benchmarks, so we can compare them. I'd love to see Java there too.

@jimidle
Copy link
Collaborator

jimidle commented Mar 5, 2024 via email

@kylege
Copy link

kylege commented Apr 18, 2024

Test with the same mysql grammar, the java sdk is mush more faster than golang version, maybe there is some implementation difference in golang sdk?

@jimidle
Copy link
Collaborator

jimidle commented Apr 18, 2024

Test with the same mysql grammar, the java sdk is mush more faster than golang version, maybe there is some implementation difference in golang sdk?

You are using this grammar, and not the one in the contributed repo?

@kylege
Copy link

kylege commented Apr 22, 2024

Test with the same mysql grammar, the java sdk is mush more faster than golang version, maybe there is some implementation difference in golang sdk?

You are using this grammar, and not the one in the contributed repo?

Yes. Using https://github.com/antlr/grammars-v4/blob/master/sql/mysql/Positive-Technologies/MySqlParser.g4

Java SDK is much faster than golang SDK,You can try it with Intelijj IDEA ANTLR4 Plugin.

@KvanTTT
Copy link
Member

KvanTTT commented Apr 22, 2024

Java SDK is much faster than golang SDK

I'm curious how much faster. Could you perform a benchmark?

@jimidle
Copy link
Collaborator

jimidle commented Apr 22, 2024

Test with the same mysql grammar, the java sdk is mush more faster than golang version, maybe there is some implementation difference in golang sdk?

You are using this grammar, and not the one in the contributed repo?

Yes. Using https://github.com/antlr/grammars-v4/blob/master/sql/mysql/Positive-Technologies/MySqlParser.g4

Java SDK is much faster than golang SDK,You can try it with Intelijj IDEA ANTLR4 Plugin.

That is the poor grammar in the free downloads and not the one mentioned here.

I made some inroads on improving the performance of such broken grammars, but decided enough was enough. It "works" but will be too slow to be practical to use in any target. That demo grammar is not fit for purpose I am afraid and is not the one that this thread changed to discuss. I can see your confusion, as the thread started out with the postive-technologies one.

Basically, do not use that grammar.

@jimidle
Copy link
Collaborator

jimidle commented Apr 22, 2024

Java SDK is much faster than golang SDK

I'm curious how much faster. Could you perform a benchmark?

It's not relevant really. I improved from essentially infinite execution time to impractical. It depends heavily on the input given to the parser. The grammar is just broken in the sense that it is so poorly formed that it pretty much traverses every possible grammar rule for everything. It is a testament to the algorithm that it finished at all to be honest.

@KvanTTT
Copy link
Member

KvanTTT commented Apr 22, 2024

It's not relevant really.

Maybe relevant because it shows that Go runtime works differently. It probably can cause similar problems even with "good" grammars.

NB, the debezium project with almost 10K stars on GitHub relies on this grammar. I'm curious if this grammar causes big performance problems, probably @Naros could answer.

@Naros
Copy link

Naros commented Apr 22, 2024

I'm curious if this grammar causes big performance problems, probably @Naros could answer.

So we use 3 different grammars from this upstream project: MySQL, MariaDB, and Oracle.

Our use of the MySQL and MariaDB grammar is restricted solely to DDL changes and not DML. So we probably don't observe any measurable performance issues since we process far more DML events than DDL.

When we implemented the Oracle connector in 2020, we considered using the ANTLR grammar for parsing both the DDL and DML events; however, we did notice a performance concern, but that concern was not only about the parse speed, but the number of temporary objects constructed. Since we generally work with higher volumes of inserts, updates, and deletes, we needed a robust solution and since our ingestion for insert, update, and deletes via LogMiner have a very restricted set of grammar bits, we simply rolled our own for DML and simply use ANTLR for DDL only.

The grammar is just broken in the sense that it is so poorly formed that it pretty much traverses every possible grammar rule for everything.

It certainly needs work @jimidle, there is no denying that; it's just a time/resource issue. I've wanted to sit down with the grammar we heavily use and start with a fresh slate, but that's just not a reality any time soon, but at the same time, we aren't in a position to use anything else either :(

I think if anything, if someone wanted to collaborate on an improved version of the grammar, I'd be open to that idea even if it takes a bit of time to do it, and do it right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants