#1648 adds tests of complete dataset definitions in the documentation.
However, we are still not testing those dataset definitions fully end-to-end, because of their use of codelist_from_csv(). If we want to do so, we need to improve the existing test strategy.
Background
codelist_from_csv() is used frequently in dataset definitions. The function requires a named CSV file that it can open. With dataset definitions written inline in the documentation, we currently have no mechanism for including additional data files, so example dataset definitions have no way of specifying a CSV that is correct and relevant to the rest of the definition. A dataset definition may use the codes to match against, say, coded clinical events, which are typed.
The current test workaround:
wraps codelist_from_csv(), intercepting calls to it, validating that each call has the correct call signature, and then returning the result of calling the real codelist_from_csv() function with arguments that point to a fixture CSV of fake medical codes included inline in the test. (The use of a fake medical code is necessary because there is no medical code that validates in all of the supported coding systems.)
patches out the ehrQL code that validates codes, so that the fake medical codes can be used to run dataset definitions
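The wrapping part of this workaround might look roughly like the following sketch. This is illustrative only: `read_codes_from_csv` is a hypothetical stand-in for the real codelist_from_csv(), and the fixture contents and signature checks are assumptions rather than the actual test helper.

```python
import csv
import tempfile


def read_codes_from_csv(filename, column):
    # Hypothetical stand-in for the real codelist_from_csv().
    with open(filename, newline="") as f:
        return [row[column] for row in csv.DictReader(f)]


def make_wrapper(fixture_path):
    def wrapped_codelist_from_csv(filename, *, column):
        # Validate that the example used a plausible call signature...
        assert isinstance(filename, str) and filename.endswith(".csv")
        assert isinstance(column, str)
        # ...then delegate to the "real" function, but pointed at a
        # fixture CSV containing fake medical codes.
        return read_codes_from_csv(fixture_path, column=column)

    return wrapped_codelist_from_csv


# Build an inline fixture CSV of fake codes, as the current tests do.
fixture = tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False)
fixture.write("code\nfake_code_1\nfake_code_2\n")
fixture.close()

codelist_from_csv = make_wrapper(fixture.name)
codes = codelist_from_csv("ethnicity-codes.csv", column="code")
```

Whatever filename the documentation example names, the wrapper returns codes from the fixture, while still checking that the example's call would have been valid.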
Possible solutions
I think all of these would still need a monkeypatching workaround to point to the correct CSV file to use. But the rest of the medical code handling in ehrQL would then run end-to-end, using specified valid medical codes instead of fakes.
All of these would require some convention to be defined and adhered to.
1. Use a filename convention in example calls to specify which test CSV to use
Establish a convention for the arguments of codelist_from_csv() calls to use in dataset definitions in the documentation examples.
By doing so, we could intercept calls to codelist_from_csv() and use the filename argument to select a minimal test CSV file containing valid code(s).
For example, the convention could match either an entire filename, such as snomedct-codes.csv, or the end of a filename, such as diabetes-snomedct-codes.csv.
This enforces consistency in each example's use of codelist_from_csv(), but still uses the same mock data each time.
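The filename matching could be sketched like this. The fixture directory, the suffix table, and its entries are assumptions for illustration, not an existing convention:

```python
from pathlib import Path

# Hypothetical directory of minimal fixture CSVs containing valid codes.
FIXTURES = Path("tests/fixtures/codelists")

# Hypothetical suffix convention: an example filename ending in a key
# maps to the corresponding fixture CSV.
SUFFIXES = {
    "snomedct-codes.csv": "snomedct.csv",
    "dmd-codes.csv": "dmd.csv",
    "icd10-codes.csv": "icd10.csv",
}


def resolve_fixture(filename):
    """Match an example's filename argument against the suffix convention."""
    for suffix, fixture in SUFFIXES.items():
        if filename.endswith(suffix):
            return FIXTURES / fixture
    raise ValueError(f"No test fixture matches {filename!r}")
```

A call such as codelist_from_csv("diabetes-snomedct-codes.csv", ...) in an example would then be redirected to the SNOMED CT fixture, and an example using a filename outside the convention would fail the test loudly.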
2. Inline some CSV data for examples
We could actually inline relevant data, and:
Use code fences' support for specifying arbitrary HTML IDs, classes and attributes to tag data fences and their associated ehrQL examples, and then pair those fences in our tests.
If we do not want to display this data in the documentation, we could hide it by:
either wrapping the code fence in an HTML comment (SuperFences does still extract commented-out fences)
or using CSS to remove unwanted data fences from the documentation's display
This means we could use "real" codes per example, rather than the same valid code every time.
(In general, using additional Markdown metadata in this way could be a solution if we're looking to extend our testing further while still restricting ourselves to inline examples only.)
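Pairing tagged fences in a test could be sketched as below. The `{ .csv #id }` fence syntax follows SuperFences' attribute support, but the exact attribute names (for example `data-for`) are a hypothetical convention, and the real extraction would go through the Markdown pipeline rather than a regex:

````python
import re

# A documentation snippet with a data fence tagged by ID, paired with
# the ehrQL example that declares which data fence it uses. The
# `data-for` attribute name is a hypothetical convention.
MARKDOWN = '''\
```{ .csv #diabetes-codes }
code
111111111
```

```{ .python data-for="diabetes-codes" }
codelist = codelist_from_csv("diabetes-codes.csv", column="code")
```
'''

FENCE = re.compile(r"```\{ (?P<attrs>[^}]+) \}\n(?P<body>.*?)```", re.DOTALL)


def extract_fences(markdown):
    """Return (attrs, body) for each attributed fence in the page."""
    return [(m["attrs"], m["body"]) for m in FENCE.finditer(markdown)]
````

The test would then write each tagged data fence out as a CSV file, and run the paired ehrQL example against it, so that each example carries its own relevant codes.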
3. A combination of both the above approaches
We could allow inlining of data, and fall back to a filename match if no data is inlined.
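The combined lookup could be a simple fallback, sketched below. The suffix-to-fixture table is hypothetical, and `inline_data` stands in for whatever structure the inlined data fences are collected into:

```python
# Hypothetical suffix-to-fixture table for the filename convention.
FIXTURE_SUFFIXES = {
    "snomedct-codes.csv": "tests/fixtures/snomedct.csv",
    "dmd-codes.csv": "tests/fixtures/dmd.csv",
}


def find_codelist_csv(filename, inline_data):
    """Prefer CSV data inlined alongside the example; otherwise fall
    back to matching the filename against the fixture convention."""
    if filename in inline_data:
        return ("inline", inline_data[filename])
    for suffix, fixture in FIXTURE_SUFFIXES.items():
        if filename.endswith(suffix):
            return ("fixture", fixture)
    raise ValueError(f"No inline data or fixture matches {filename!r}")
```

Examples that inline their own data get per-example codes, while simpler examples can lean on the shared fixtures.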
The sandbox feature has complicated this, so it's no longer easy to monkeypatch codelist_from_csv(). Still possible as in #1648.
Another workaround would be to load the dataset definition as a Python AST and patch out the codelist_from_csv() call there. The downside is that you could get errors that don't match up with the actual, unmodified dataset definition.
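That AST rewriting could be sketched with the standard library's ast module, along these lines. The fixture path is hypothetical, and this assumes the filename is passed positionally; rewriting the source this way is also exactly what causes the line-number/error mismatch noted above:

```python
import ast

# An example dataset definition, as it might appear in the docs.
SOURCE = '''\
from ehrql import codelist_from_csv
codes = codelist_from_csv("diabetes-snomedct-codes.csv", column="code")
'''


class PatchCodelistCalls(ast.NodeTransformer):
    """Rewrite codelist_from_csv(...) calls to point at a fixture CSV."""

    def visit_Call(self, node):
        self.generic_visit(node)
        if (
            isinstance(node.func, ast.Name)
            and node.func.id == "codelist_from_csv"
            and node.args
        ):
            # Swap the filename argument for a (hypothetical) fixture path.
            node.args[0] = ast.Constant("tests/fixtures/fake-codes.csv")
        return node


tree = PatchCodelistCalls().visit(ast.parse(SOURCE))
ast.fix_missing_locations(tree)
patched = ast.unparse(tree)
```

Compiling and executing `patched` would run the example against the fixture, but any traceback would refer to the rewritten source rather than the documentation's original.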