Reduce monkeypatching of codelist_from_csv() in documentation tests #1694

Open
StevenMaude opened this issue Nov 1, 2023 · 1 comment

StevenMaude commented Nov 1, 2023

#1648 adds tests of complete dataset definitions in the documentation.

We are still not testing the dataset definitions completely end-to-end, due to the use of codelist_from_csv(). If we want to do so, we need to improve the existing test strategy.

Background

codelist_from_csv() is used frequently in dataset definitions. This function requires a named CSV file that can be opened. With dataset definitions inline in the documentation, we currently have no mechanism for including additional data files. Example dataset definitions therefore have no way of specifying a correct CSV relevant to the rest of the definition. Dataset definitions may use the codes to match against, say, coded clinical events, which are typed.

The current test workaround:

  • wraps codelist_from_csv(), intercepting calls to it, validating that each call has the correct signature, and then returning the result of calling the real codelist_from_csv() function with arguments that point to a fixture CSV of fake medical codes included inline in the test
  • patches out the ehrQL code that validates codes, so that the fake medical codes can be used to run dataset definitions

The use of a fake medical code is necessary because there is no medical code that validates in all of the supported coding systems.
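
For illustration, here is a minimal sketch of that wrapping pattern, assuming pytest's monkeypatch and tmp_path fixtures; the helper name, the fake code 0000 and the fixture CSV contents are invented here, not the actual test code:

```python
import ehrql


def patch_codelist_from_csv(monkeypatch, tmp_path):
    # Write a fixture CSV containing a fake medical code inline in the test.
    fixture_csv = tmp_path / "fixture-codes.csv"
    fixture_csv.write_text("code\n0000\n")

    real_codelist_from_csv = ehrql.codelist_from_csv

    def wrapper(filename, **kwargs):
        # Validate that the documentation example used a sensible call signature...
        assert isinstance(filename, str) and filename.endswith(".csv")
        assert "column" in kwargs
        # ...then call the real function, pointed at the fixture CSV instead.
        return real_codelist_from_csv(str(fixture_csv), **kwargs)

    monkeypatch.setattr(ehrql, "codelist_from_csv", wrapper)
    # The tests also patch out ehrQL's code validation so that the fake code
    # is accepted; that part is omitted here.
```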

Possible solutions

I think these would all still need a monkeypatching workaround to point at the correct CSV file. But the rest of ehrQL's medical code handling would then run end-to-end, using specified valid medical codes instead of fakes.

All of these would require some convention to be defined and adhered to.

1. Use a filename convention in example calls to specify which test CSV to use

Establish a convention for the arguments used in codelist_from_csv() calls in the documentation's example dataset definitions.

By doing so, we could intercept calls to codelist_from_csv() and use the filename argument to select which minimal test CSV file, containing valid code(s), to substitute.

For example:

from ehrql import codelist_from_csv

my_codelist = codelist_from_csv(filename="snomedct-codes.csv", column="code")

This convention could match either an entire filename (snomedct-codes.csv) or perhaps the end of a filename (for example, diabetes-snomedct-codes.csv).

This enforces consistency between each example's use of codelist_from_csv(), but still uses the same mock data each time.
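
As a sketch of how the interception might resolve the convention, with hypothetical fixture paths and coding-system suffixes:

```python
# Hypothetical mapping from convention suffixes to per-coding-system fixtures.
FIXTURES = {
    "snomedct-codes.csv": "tests/fixtures/snomedct-codes.csv",
    "dmd-codes.csv": "tests/fixtures/dmd-codes.csv",
}


def resolve_fixture(filename):
    # Match the whole filename, or its ending, so "diabetes-snomedct-codes.csv"
    # also resolves to the SNOMED CT fixture.
    for suffix, fixture_path in FIXTURES.items():
        if filename.endswith(suffix):
            return fixture_path
    raise ValueError(f"{filename!r} does not follow the test CSV naming convention")
```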

2. Inline some CSV data for examples

We could actually inline relevant data, and:

  1. Use code fences' support for specifying arbitrary HTML IDs, classes and attributes to tag data fences and their associated ehrQL examples, and then pair these fences in our tests.
  2. If we do not want to display this data in the documentation, we could hide it by:
    1. either wrapping the code fence in an HTML comment; SuperFences does extract commented-out fences
    2. or using CSS to remove unwanted data fences from display in the documentation

This means we could use "real" codes per-example, not the same valid example every time.

(In general, using additional Markdown metadata in this way could be a solution if we're looking to extend our testing further while still restricting ourselves to inline examples only.)
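
A rough sketch of how the tests might pair tagged fences, assuming a hypothetical attribute convention in which a dataset definition fence carries an identifier and its data fence the same identifier plus a -data suffix:

```python
import re

# Hypothetical convention: an example fence tagged {.python #diabetes} and its
# data fence tagged {.csv #diabetes-data}, via SuperFences attribute support.
FENCE_RE = re.compile(
    r"^```\{\.(?P<lang>\w+) #(?P<ident>[\w-]+)\}\n(?P<body>.*?)\n^```$",
    re.MULTILINE | re.DOTALL,
)


def paired_examples(markdown_text):
    """Yield (dataset definition source, CSV text or None) pairs from a page."""
    fences = {
        match["ident"]: (match["lang"], match["body"])
        for match in FENCE_RE.finditer(markdown_text)
    }
    for ident, (lang, body) in fences.items():
        if lang == "python":
            _, csv_text = fences.get(f"{ident}-data", (None, None))
            yield body, csv_text
```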

3. A combination of both the above approaches

We could allow inlining of data, and fall back to a filename match if no data is inlined.
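
Reusing the hypothetical helpers from the sketches above, the fallback might look something like:

```python
def fixture_csv_for(example_filename, inline_csv_text, tmp_path):
    # Prefer CSV data inlined alongside the example in the documentation...
    if inline_csv_text is not None:
        path = tmp_path / example_filename
        path.write_text(inline_csv_text)
        return str(path)
    # ...otherwise fall back to the filename convention (see resolve_fixture above).
    return resolve_fixture(example_filename)
```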

StevenMaude commented Nov 2, 2023

The sandbox feature has complicated this, so it's no longer easy to monkeypatch codelist_from_csv(). It is still possible, as in #1648.

Another workaround would be to load the dataset definition as a Python AST, and then patch out the codelist_from_csv() call there. That has the downside that you could potentially get errors that don't match up with the unmodified dataset definition.
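
For what it's worth, a rough sketch of that AST approach using ast.NodeTransformer, assuming we only need to rewrite the filename argument; the caveat above about errors not matching the unmodified source still applies:

```python
import ast


class PatchCodelistCalls(ast.NodeTransformer):
    """Point codelist_from_csv(filename=...) calls at a fixture CSV."""

    def __init__(self, fixture_path):
        self.fixture_path = fixture_path

    def visit_Call(self, node):
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id == "codelist_from_csv":
            for keyword in node.keywords:
                if keyword.arg == "filename":
                    keyword.value = ast.Constant(self.fixture_path)
            # A positional filename argument would need similar handling.
        return node


def run_patched(definition_path, fixture_path):
    with open(definition_path) as f:
        tree = ast.parse(f.read(), filename=definition_path)
    tree = PatchCodelistCalls(fixture_path).visit(tree)
    ast.fix_missing_locations(tree)
    # Downside noted above: any errors refer to this modified tree, not the
    # dataset definition exactly as it appears in the documentation.
    exec(compile(tree, definition_path, "exec"), {"__name__": "__main__"})
```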
