
Estimating the performance impact of changes to "query table" SQL in TPP


Background

ehrQL allows backends to define a table using the QueryTable object and a blob of SQL text, e.g. TPPBackend.vaccinations:

    vaccinations = QueryTable(
        """
            SELECT
                vax.Patient_ID AS patient_id,
                vax.Vaccination_ID AS vaccination_id,
                CAST(vax.VaccinationDate AS date) AS date,
                vax.VaccinationName AS product_name,
                ref.VaccinationContent AS target_disease
            FROM Vaccination AS vax
            LEFT JOIN VaccinationReference AS ref
            ON vax.VaccinationName_ID = ref.VaccinationName_ID
        """
    )

This works like a traditional SQL "view" and allows us to change the way a given table is presented to the user. However, changes to this SQL can have quite a dramatic impact on the performance of these tables in a way which won't be obvious locally or in test (see e.g. here and here).

Therefore, unless a change is obviously trivial (e.g. renaming a column), it would be prudent to check that it has reasonable performance characteristics.

Prerequisites

You'll need access to the TPP L3 server and credentials for the L2 database. You'll also need to be comfortable coming up with an appropriate test query and making a judgement as to whether the performance impact (which will almost certainly be non-zero) is acceptable or not.

Workflow

1. Clone the repo

You'll need to clone a copy of the ehrQL repo onto the server. You can either do this directly on the Windows desktop or within the Linux VM, whichever seems easier. (I use the VM to minimise the amount of Windows in my life.) Either way, you need to pull via the proxy:

git clone https://github-proxy.opensafely.org/opensafely-core/ehrql

This gives you access to both the original SQL for the table (on main) and the proposed new SQL (on your branch), in a form which is copy-paste-able within the remote desktop environment.

2. Think up a test query

Here you're mostly on your own! I think the basic thing to test is this: if you're adding some kind of transformation to a column, check that queries which filter on that column still perform OK, i.e. that the presence of the transform doesn't cause the query planner to go down some de-optimised route.

For example, if we were adding a transformation to the date column on clinical_events I'd want to know how queries which filter by date perform. So I might try something like:

SELECT COUNT(*) FROM (
  ...
) t
WHERE t.date >= '20220101' AND t.date <= '20221231'

Note the ... here: this isn't actually a query but a query template; we're going to substitute each of the two QueryTable definitions in and compare the performance.
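For example, substituting the vaccinations definition from the Background section into the template (the date range is purely illustrative) gives something like:

SELECT COUNT(*) FROM (
    SELECT
        vax.Patient_ID AS patient_id,
        vax.Vaccination_ID AS vaccination_id,
        CAST(vax.VaccinationDate AS date) AS date,
        vax.VaccinationName AS product_name,
        ref.VaccinationContent AS target_disease
    FROM Vaccination AS vax
    LEFT JOIN VaccinationReference AS ref
    ON vax.VaccinationName_ID = ref.VaccinationName_ID
) t
WHERE t.date >= '20220101' AND t.date <= '20221231'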

3. Assemble the test SQL

I tend to do the SQL editing in Notepad++, which is available on the Windows desktop and marginally more helpful than trying to do it directly in SQL Server Management Studio.

The basic structure we want is (using the example above):

SET STATISTICS TIME ON;
SET STATISTICS IO ON;


PRINT('New query')

SELECT COUNT(*) FROM (
  -- Copy new QueryTable SQL here
) t
WHERE t.date >= '20220101' AND t.date <= '20221231'


PRINT('Original query')

SELECT COUNT(*) FROM (
  -- Copy original QueryTable SQL here
) t
WHERE t.date >= '20220101' AND t.date <= '20221231'

That is, we want two copies of our test query, with a different table definition substituted into each. The PRINT calls will appear in the Messages tab along with the query stats, making it easier to see which set of stats belongs to which query.

4. Run the test SQL and evaluate the results

Open up SQL Server Management Studio, paste in the test SQL and hit Execute. Wait for it to complete. There's a Messages tab next to the Results tab underneath the SQL console; that will contain the query statistics.

The things we're really interested in are the elapsed and CPU execution times:

 SQL Server Execution Times:
   CPU time = 100 ms,  elapsed time = 200 ms.

If the new query is taking much longer, or using much more CPU, then that may be a problem.
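Because the script also sets STATISTICS IO on, the Messages tab will include per-table read counts, which can help explain a time difference. A large jump in logical reads for the new query is another warning sign. The table name and numbers here are purely illustrative:

 Table 'Vaccination'. Scan count 1, logical reads 41560, physical reads 0, ...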

Note that query performance will be affected by the state of various caches, so running the same query twice may give different results. As a balance between tying up too much of our production resources running performance tests and getting misleading results, we run the new query before the old one. This gives any warm-cache advantage to the old query, which guards against us concluding that the new query performs better than it actually does.

In case of uncertainty (and assuming the queries aren't too expensive) you can run them multiple times to get a better comparison.

Similarly, the elapsed time (as opposed to CPU time) will be affected by what else the server happens to be doing at the time. Again, running the test a few times will allow you to see what the variance in elapsed time looks like.
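One convenient way to repeat a run in Management Studio is the GO batch separator, which accepts an optional repeat count. A sketch reusing the template above (the repeat count of 3 is arbitrary):

PRINT('New query')

SELECT COUNT(*) FROM (
  -- Copy new QueryTable SQL here
) t
WHERE t.date >= '20220101' AND t.date <= '20221231'
GO 3  -- runs this batch three times; the SET STATISTICS settings persist across batches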