Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for basic incremental index sorting #3656

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

llalexandru00
Copy link

Performance bottleneck

I am using a lot of multi-table simple queries with ORDER BY (and eventually OFFSET/LIMIT), which aren't optimized in any sense. All records are fetched, a final sorting is done and eventually an OFFSET/LIMIT is applied. There is an existing optimization in H2 based on the Select.sortUsingIndex flag, which hints that the records are already fetched in a sorted way (because an index is used), so a final sorting is not required. Also, LIMIT clause has a lot to benefit from this, as only some records are fetched. However, this works only for single-table queries.

Example

The most basic example is something like SELECT * FROM TEST1 CROSS JOIN TEST2 ORDER BY TEST1.F1 TEST2.F2 LIMIT 5. Right now, H2 sequentially scans both tables, sorts the result set according to ORDER BY TEST1.F1 TEST2.F2 and retrieves the first 5 rows. If both tables have ~1000 unique records, this operation takes ~450ms (embedded anonymous in-memory database v2.1.214).

Research

  • PostgreSQL (at least 14) has "INCREMENTAL SORTING" part of the planning phase. This improves the sorting at the end as the result set can be sorted in batches. Intuitively, the LIMIT is also optimized due to this technique.
  • DuckDB (at least 0.5.1) and SQLite (at least 3.39.3) also have differential performance for LIMIT in the context of multi-table ORDER BY (LIMIT 1 is considerably faster than LIMIT 100000 even for multi-table queries). Of course, these are C based back-ends, but the theoretic approach is relevant here.
  • Apache Derby (at least 10.4.2) has a a poor performance on cross joins, but still has a consistent improvement between LIMIT values. This may be misleading, as the final sorting performance difference may spoil the performance check.

Solution

I suggest using the best b-tree index for the topFilterTable - the one that matches the longest prefix of columns from the ORDER BY clause as possible. Using such index will ensure that the result set is partially sorted on the first Select.alreadySorted elements from ORDER BY.

When OFFSET/LIMIT is specified, only LIMIT elements shall be retrieved initially. To ensure completeness, extra records should be fetched until a "bigger" record is encountered based on the first Select.alreadySorted fields. Sorting these and applying LIMIT is a complete and sound solution.

Using something like LIMIT X will make SELECT to retrieve only X records and eventually fetches some extra records to ensure completeness.

  • Best case scenario: ORDER BY has many fields matched by an index or the records have a high variance. Very few extra records are fetched (maybe none). In this case, the number of extra comparisons is low, so the overhead is clearly acceptable comparing to the fetching of all records.
  • Worst case scenario: the best index matched only the first 1/2 fields from ORDER BY and these have a very low variance. Many extra records are fetched (maybe the whole result-set). In this case, the number of extra comparisons is high, but the number of fields for which one comparison is done is very low (1/2). Therefore, the overhead of applying this solution on result-sets with low variance is negligible.

Result

The same SQL I provided earlier now runs in ~6 ms. The change never compromises as this does not interfere with join order or index selection. Only when it is safe to use/upgrade an index, Select.alreadySorted is used.

  • In general, all CROSS JOIN queries with ORDER BY and LIMIT are working considerably faster.
  • Queries that don't use ORDER BY are not affected by the change.
  • Inner joins without index may use a sort index to aid ORDER BY and LIMIT
  • Inner joins with index may upgrade to a more complex index to also cover ORDER BY and LIMIT, but always preserves the invariant proposed by the index in the first place.

@h2database h2database deleted a comment from sonatype-lift bot Oct 15, 2022
Copy link
Contributor

@katzyn katzyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution!

Please, send a license statement as described here
https://h2database.com/html/build.html#providing_patches
to our mailing list:
https://groups.google.com/g/h2-database
(It is partially pre-moderated, some new messages will be visible only after approval.)

h2/src/main/org/h2/command/query/Select.java Outdated Show resolved Hide resolved
h2/src/main/org/h2/command/query/Select.java Outdated Show resolved Hide resolved
h2/src/main/org/h2/command/query/Select.java Show resolved Hide resolved
@katzyn
Copy link
Contributor

katzyn commented Oct 18, 2022

Did you send a license statement to the mailing list? I don't see it (but maybe it wasn't approved yet).

@llalexandru00
Copy link
Author

The licensing perspective of my changes here is yet to be discussed. Sorry for the delay.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants