Reduce monkeypatching of codelist_from_csv() in documentation tests #1694

Open
StevenMaude opened this issue Nov 1, 2023 · 1 comment

StevenMaude commented Nov 1, 2023

#1648 adds tests of complete dataset definitions in the documentation.

We are still not testing the dataset definitions completely end-to-end, due to the use of codelist_from_csv(). If we want to do so, we need to improve the existing test strategy.

Background

codelist_from_csv() is used frequently in dataset definitions. This function requires a named CSV file that can be opened. With dataset definitions inline in the documentation, we currently have no mechanism for including additional data files. Example dataset definitions therefore have no way of specifying a correct CSV relevant to the rest of the definition. Dataset definitions may use the codes to match against, say, coded clinical events, which are typed.

The current test workaround:

  • wraps codelist_from_csv(), intercepting calls to it, validating that each call has the correct signature, and then returning the result of calling the real codelist_from_csv() function with arguments that point to a fixture CSV of fake medical codes included inline in the test
  • patches out the ehrQL code that validates codes, so that the fake medical codes can be used to run dataset definitions

The use of a fake medical code is necessary because there is no medical code that validates in all of the supported coding systems.
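
For illustration, here is a minimal sketch of that wrapping pattern, assuming pytest's monkeypatch and tmp_path fixtures; the helper name, the fake code 0000 and the fixture CSV contents are invented here, not the actual test code:

```python
import ehrql


def patch_codelist_from_csv(monkeypatch, tmp_path):
    # Write a fixture CSV containing a fake medical code inline in the test.
    fixture_csv = tmp_path / "fixture-codes.csv"
    fixture_csv.write_text("code\n0000\n")

    real_codelist_from_csv = ehrql.codelist_from_csv

    def wrapper(filename, **kwargs):
        # Validate that the documentation example used a sensible call signature...
        assert isinstance(filename, str) and filename.endswith(".csv")
        assert "column" in kwargs
        # ...then call the real function, pointed at the fixture CSV instead.
        return real_codelist_from_csv(str(fixture_csv), **kwargs)

    monkeypatch.setattr(ehrql, "codelist_from_csv", wrapper)
    # The tests also patch out ehrQL's code validation so that the fake code
    # is accepted; that part is omitted here.
```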

Possible solutions

I think these would all still need a monkeypatching workaround to point at the correct CSV file. But the rest of ehrQL's medical code handling would then run end-to-end, using specified valid medical codes instead of fakes.

All of these would require some convention to be defined and adhered to.

1. Use a filename convention in example calls to specify which test CSV to use

Establish a convention for the arguments used in codelist_from_csv() calls in the documentation's example dataset definitions.

By doing so, we could intercept calls to codelist_from_csv() and use the filename argument to select which minimal test CSV file, containing valid code(s), to substitute.

For example:

from ehrql import codelist_from_csv

my_codelist = codelist_from_csv(filename="snomedct-codes.csv", column="code")

This convention could match either an entire filename (snomedct-codes.csv) or perhaps the end of a filename (for example, diabetes-snomedct-codes.csv).

This enforces consistency between each example's use of codelist_from_csv(), but still uses the same mock data each time.
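
As a sketch of how the interception might resolve the convention, with hypothetical fixture paths and coding-system suffixes:

```python
# Hypothetical mapping from convention suffixes to per-coding-system fixtures.
FIXTURES = {
    "snomedct-codes.csv": "tests/fixtures/snomedct-codes.csv",
    "dmd-codes.csv": "tests/fixtures/dmd-codes.csv",
}


def resolve_fixture(filename):
    # Match the whole filename, or its ending, so "diabetes-snomedct-codes.csv"
    # also resolves to the SNOMED CT fixture.
    for suffix, fixture_path in FIXTURES.items():
        if filename.endswith(suffix):
            return fixture_path
    raise ValueError(f"{filename!r} does not follow the test CSV naming convention")
```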

2. Inline some CSV data for examples

We could actually inline relevant data, and:

  1. Use code fences' support for specifying arbitrary HTML IDs, classes and attributes to tag data fences and their associated ehrQL examples, and then pair these fences in our tests.
  2. If we do not want to display this data in the documentation, we could hide it by:
    1. either wrapping the code fence in an HTML comment; SuperFences does extract commented-out fences
    2. or using CSS to remove unwanted data fences from display in the documentation

This means we could use "real" codes per-example, not the same valid example every time.

(In general, using additional Markdown metadata in this way could be a solution if we're looking to extend our testing further while still restricting ourselves to inline examples only.)
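
A rough sketch of how the tests might pair tagged fences, assuming a hypothetical attribute convention in which a dataset definition fence carries an identifier and its data fence the same identifier plus a -data suffix:

```python
import re

# Hypothetical convention: an example fence tagged {.python #diabetes} and its
# data fence tagged {.csv #diabetes-data}, via SuperFences attribute support.
FENCE_RE = re.compile(
    r"^```\{\.(?P<lang>\w+) #(?P<ident>[\w-]+)\}\n(?P<body>.*?)\n^```$",
    re.MULTILINE | re.DOTALL,
)


def paired_examples(markdown_text):
    """Yield (dataset definition source, CSV text or None) pairs from a page."""
    fences = {
        match["ident"]: (match["lang"], match["body"])
        for match in FENCE_RE.finditer(markdown_text)
    }
    for ident, (lang, body) in fences.items():
        if lang == "python":
            _, csv_text = fences.get(f"{ident}-data", (None, None))
            yield body, csv_text
```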

3. A combination of both the above approaches

We could allow inlining of data, and fall back to a filename match if no data is inlined.
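
Reusing the hypothetical helpers from the sketches above, the fallback might look something like:

```python
def fixture_csv_for(example_filename, inline_csv_text, tmp_path):
    # Prefer CSV data inlined alongside the example in the documentation...
    if inline_csv_text is not None:
        path = tmp_path / example_filename
        path.write_text(inline_csv_text)
        return str(path)
    # ...otherwise fall back to the filename convention (see resolve_fixture above).
    return resolve_fixture(example_filename)
```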

StevenMaude commented Nov 2, 2023

The sandbox feature has complicated this, so it's no longer easy to monkeypatch codelist_from_csv(). It is still possible, as in #1648.

Another workaround would be to load the dataset definition as a Python AST, and then patch out the codelist_from_csv() call there. That has the downside that you could potentially get errors that don't match up with the unmodified dataset definition.
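
For what it's worth, a rough sketch of that AST approach using ast.NodeTransformer, assuming we only need to rewrite the filename argument; the caveat above about errors not matching the unmodified source still applies:

```python
import ast


class PatchCodelistCalls(ast.NodeTransformer):
    """Point codelist_from_csv(filename=...) calls at a fixture CSV."""

    def __init__(self, fixture_path):
        self.fixture_path = fixture_path

    def visit_Call(self, node):
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id == "codelist_from_csv":
            for keyword in node.keywords:
                if keyword.arg == "filename":
                    keyword.value = ast.Constant(self.fixture_path)
            # A positional filename argument would need similar handling.
        return node


def run_patched(definition_path, fixture_path):
    with open(definition_path) as f:
        tree = ast.parse(f.read(), filename=definition_path)
    tree = PatchCodelistCalls(fixture_path).visit(tree)
    ast.fix_missing_locations(tree)
    # Downside noted above: any errors refer to this modified tree, not the
    # dataset definition exactly as it appears in the documentation.
    exec(compile(tree, definition_path, "exec"), {"__name__": "__main__"})
```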
