Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(chartdata): disable sqlparse calls for chart data requests to improve querying performance #19572

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

dvchristianbors
Copy link

@dvchristianbors dvchristianbors commented Apr 6, 2022

fix(chartdata): improve querying performance for long where ... in (...) statements.

SUMMARY

As described in andialbrecht/sqlparse#621, calling sqlparse.format and sqlparse.parse functions with grouping queries will cause quadratic runtime, ending in very long computation times for queries with large statements, e.g., long IN statements with a large number of keys.

In Chart Data, the sqlparse functions are needlessly called, this has several reasons:

  1. The ChartData component will automatically generate most of the structure of the final query. It is possible, however, that users can add custom SQLs for filters, columns, etc.) - However, validation of the query will also happen in connectors/sqla/models/get_query_str_extended, so if erroneos sql code is produced, we will still receive an error.
  2. The structure of the finished SQL query will be valid in every case, and the query will also never consist of multiple consecutive queries. Hence, it is not necessary to parse the query for a possible sequence of multiple queries. (which happens in models/core.py.
  3. The SQL automatically created by compile_sqla_query(sqlaq.sqla_query) will already result in a readable sql. Reformating the sql using sqlparser.format(sql, reindent=True) to reindent will only have a minor improvement in readability while sacrificing computation time.

Looking at the issue #19567 , querying times vary significantly depending on IN clause key size. Omitting the parse() and format() functions from the code used by ChartData will solve this issue, without having any impediments.

BEFORE/AFTER ANIMATED GIF

Before

superset_querying_performance_before

After

superset_querying_performance_after

TESTING INSTRUCTIONS

Please see issue #19567 for detailed steps to reproduce.
I also added a test case that showcases the computational penalty in this commit

ADDITIONAL INFORMATION

  • Has associated issue: Fixes Significant Increase in Querying Time Due to Sqlparse #19567
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@dvchristianbors dvchristianbors changed the title Disable sqlparse calls for Chart Data requests to Improve Querying Performance fix(chartdata): disable sqlparse calls for chart data requests to improve querying performance Apr 6, 2022
@villebro
Copy link
Member

villebro commented Apr 7, 2022

Thanks for the contribution @dvchristianbors . While I agree this performance hit is very problematic, we currently rely heavily on sqlparse for some very critical validation logic. Therefore, some of the changes in this PR will not be possible in their current form.

I looked into the original issue that you linked here, and noticed there seems to be open PRs on sqlparse that may address some of the observed performance issues. From our perspective, the optimal solution would be to fix the performance issues in the underlying library first, and only then optimize in Superset where needed. Out of curiosity, have you tested running Superset with a fork of sqlparse that has applied some of those open PRs? It would be interesting to see if they have a significant impact on performance..

@dvchristianbors
Copy link
Author

Thanks for the contribution @dvchristianbors . While I agree this performance hit is very problematic, we currently rely heavily on sqlparse for some very critical validation logic. Therefore, some of the changes in this PR will not be possible in their current form.

I looked into the original issue that you linked here, and noticed there seems to be open PRs on sqlparse that may address some of the observed performance issues. From our perspective, the optimal solution would be to fix the performance issues in the underlying library first, and only then optimize in Superset where needed. Out of curiosity, have you tested running Superset with a fork of sqlparse that has applied some of those open PRs? It would be interesting to see if they have a significant impact on performance..

Thanks for your comments. I agree that this would be the best option. I forked sqlparse and merged the PRs which supposedly addressed the issues, however I cannot confirm that they did improve performance (at least for the tested use cases).

I will further investigate the issue in sqlparse itself, and propose new changes as soon as I found a better solution.
@villebro Should I post my updates and revised solution here?

@dvmarkusvogl
Copy link

we currently rely heavily on sqlparse for some very critical validation logic
SQL-Parses self-description:
(pypi): sqlparse is a non-validating SQL parser for Python
(github) A non-validating SQL parser module for Python

They seem very keen on not being used for validation.
IMO the pythonic way would be to properly handle the database-response, and the fix above at least reduces the existing complexity.

Also, the project had 3 days of activity since October 2020, so I wouldn't expect much from them or rely to heavily from a project that hat 200+ open issues.

@rumbin
Copy link
Contributor

rumbin commented Nov 20, 2023

We also ran into this performance issue lately. Having some rather complex queries on a dfashboard drastically increases the dashboard load time, as the inefficiency of sqlparse keeps the gunicorn workers busy. This, in-turn, causes queries of the dashboard to be queued until a worker is free again.
So, all in all, the effect on dashboard load times is immense, unless we throw lots of gunicorn workers and many CPU cores at this problem.

@villebro, given the fact that there is virtually no activity in the sqlparse project and considering @dvmarkusvogl's comment above, I feel that following the approach suggested in this PR is the best option we have.
What do you think?

PS: We'll try patching sqlparse by virtue of andialbrecht/sqlparse#710 and report back...

@rumbin
Copy link
Contributor

rumbin commented Nov 21, 2023

Update on our investigations:

Thank you @dvchristianbors for providing this patch.
Now I only hope that this PR will be accepted in some way...

@dvchristianbors
Copy link
Author

Update on our investigations:

Thank you @dvchristianbors for providing this patch.
Now I only hope that this PR will be accepted in some way...

Thanks for letting us know. Did you test these changes with a recent branch? Our changes were made on a rather old version.

@WojtekWaga
Copy link

Hey @dvchristianbors, I applied this patch manually as there were some incompatible changes in the codebase and it helped. Time from clicking chart refresh to getting the query sent to db dropped significantly. Also show query is much faster now but (as expected) it shows uglier query now.

@rusackas
Copy link
Member

I was tempted to close the related issue as stale, but would love to know if anyone on this thread has interest in rebasing/rekindling this effort?

@dvchristianbors
Copy link
Author

dvchristianbors commented Feb 15, 2024

I was tempted to close the related issue as stale, but would love to know if anyone on this thread has interest in rebasing/rekindling this effort?

Yes, I would still be interested to finish this feature. If a rebase next week is still fine.

@rusackas
Copy link
Member

Perfectly fine. Thanks again!

@dvchristianbors
Copy link
Author

Perfectly fine. Thanks again!

Hello @rusackas, the PR was updated and is now ready for review.

@rusackas
Copy link
Member

Looks like the pre-commit hooks need to be run to solve at least some (if not all) of the CI issues.

@rusackas
Copy link
Member

Approving CI run 🤞

@john-bodley
Copy link
Member

@dvchristianbors, @betodealmeida recently introduced [SIP-117] Improve SQL parsing which proposes that we use sqlglot for all of our SQL parsing. Rather than disabling the sqlparse calls (which likely serve some purpose) to improve query performance, would you mind trying sqlglot instead?

@rafehi rafehi mentioned this pull request Feb 28, 2024
16 tasks
@pull-request-size pull-request-size bot added size/M and removed size/L labels Mar 5, 2024
@dvchristianbors
Copy link
Author

@dvchristianbors, @betodealmeida recently introduced [SIP-117] Improve SQL parsing which proposes that we use sqlglot for all of our SQL parsing. Rather than disabling the sqlparse calls (which likely serve some purpose) to improve query performance, would you mind trying sqlglot instead?

I replaced the usage of sqlparse with sqlglot for splitting up multiple sql statements. However, sqlglot does not include the formatting function that was specifically removed due to performance issues here.

@betodealmeida
Copy link
Member

Note that it is possible to pretty-format a SQL query with sqlglot, but you need to know the dialect:

https://github.com/apache/superset/pull/26767/files#diff-30f4c6ffdcb1f78a9e1ebbb60e1f297b379c181534d5a185a4cd37b1b16ac6f8R409

@dvchristianbors do you mind leaving a TODO comment to tentatively re-add the SQL formatting once SIP-117 is merged? (I saw the PR is approved but has conflicts, I'll work on them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Significant Increase in Querying Time Due to Sqlparse
8 participants