Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better validation for "list of codes shoved in a string" columns #2010

Open
evansd opened this issue May 10, 2024 · 0 comments
Open

Better validation for "list of codes shoved in a string" columns #2010

evansd opened this issue May 10, 2024 · 0 comments

Comments

@evansd
Copy link
Contributor

evansd commented May 10, 2024

We have some columns containing codes which we have to declare as type str rather than their coding system (e.g ICD-10) because they contain multiple codes shoved into a single column. For example:
https://docs.opensafely.org/ehrql/reference/schemas/tpp/#apcs.all_diagnoses

We query these columns using the .contains() method, which takes strings. But we still want to ensure that the strings supplied are valid for the coding system in question. The quick hacky way I did this when writing the original proof-of-concept study was to construct a code object and use the internal _to_primitive_type() method to get the string back out of the code object:

    for code in codelist:
        # Pass the string through the ICD10Code to constructor to validate that it has
        # the expected format
        code_string = ICD10Code(code)._to_primitive_type()
        code_strings.add(code_string)

Annoyingly (but also as a positive testament to the code sharing that goes on) this has propagated itself into various research codebases.

At a minimum we should provide a non-private and less ugly way to do this. One simple way would be to define a __str__ method on BaseCode so we could rewrite the above as:

    for code in codelist:
        # Pass the string through the ICD10Code to constructor to validate that it has
        # the expected format
        code_string = str(ICD10Code(code))
        code_strings.add(code_string)

However that still leaves the study author with the responsibility to use the correct coding system when checking the codes, which is exactly what ehrQL is supposed to avoid.

The long-term solution here is proper handling of multi-valued columns – but that's a whole lot of thinking and work and I don't propose to tackle that now.

I wonder if there's a simpler improvement here which is to define an explicit type for "bunch of codes from coding system X shoved in a string". That could catch a couple of things:

  1. Prevent equality comparisons between the string and a single code so you can't accidentally write:
    list_of_codes_field == single_code
  2. Ensure that the arguments to .contains() queries are of a format that could potentially match something in the field.

Note that for 2 we shouldn't require that the argument to .contains() is a complete valid code because it may just be a prefix to a code: for lexically hierarchical coding systems like ICD-10 it's useful to be able to match codes like N171 using just the prefix N17.

Related Slack threads:
https://bennettoxford.slack.com/archives/C01D7H9LYKB/p1715334831886929
https://bennettoxford.slack.com/archives/C069YDR4NCA/p1715344398345399

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant