Expose a "clean" therapeutics table #2023

rebkwok · 2024-05-15T14:10:59Z

The aim is to make the covid therapeutics data consistent with the data cohort-extractor provided, and easier for users to work with.

Current covid_therapeutics_raw table:

remove the onset of symptoms columns
restrict the dataset?

Add a new non-raw table that:

only exposes required fields
removes fully duplicate rows
removes "filler" words from the 3 risk group fields and joins them into a single risk group field (a comma-separated string)
casts datetimes to date (already done in the raw table)
(cohort-extractor applied collation to the intervention and currentstatus columns, but according to the database report they already have that collation applied, so this should be unnecessary)

Refer to cohort-extractor's implementation:
create_therapeutics_table removes the duplicates and builds the comma-separated risk groups (as separate columns). Joining the 3 groups is done here.
(Note that we don't need to worry about duplicate risk groups across the 3 risk group columns because only one of those contains data in any one row)
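
As a rough illustration of the cleaning steps listed above, here is a minimal Python sketch. It is not cohort-extractor's or ehrQL's actual implementation. The `FILLER_PATTERN` regex is an assumption made up for this example (the real list of filler words is not given in this issue), and `combine_risk_groups` and `clean_rows` are hypothetical helper names. The three risk column names are the ones in the raw table.

```python
import re

# ASSUMPTION: illustrative filler phrases only; the real list is not stated in this issue.
FILLER_PATTERN = re.compile(r"patients with (an? )?", re.IGNORECASE)

# The three risk group columns in the raw therapeutics table.
RISK_COLUMNS = (
    "CASIM05_risk_cohort",
    "MOL1_high_risk_cohort",
    "SOT02_risk_cohorts",
)


def combine_risk_groups(row):
    """Strip filler phrases from each risk column and join the non-empty values."""
    parts = []
    for column in RISK_COLUMNS:
        value = row.get(column) or ""
        cleaned = FILLER_PATTERN.sub("", value).strip()
        if cleaned:
            parts.append(cleaned)
    return ",".join(parts)


def clean_rows(rows):
    """Drop fully duplicate rows and replace the 3 risk columns with one risk_group."""
    seen = set()
    result = []
    for row in rows:
        # Two rows are "fully duplicate" only if every column matches.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        out = {k: v for k, v in row.items() if k not in RISK_COLUMNS}
        out["risk_group"] = combine_risk_groups(row)
        result.append(out)
    return result
```

Because only one of the three risk columns holds data in any one row, the join never needs to de-duplicate values across columns.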


rebkwok · 2024-05-15T14:26:07Z

To add the new TPP table in ehrQL:

Add to the Backend as a new QueryTable (here's where the raw one is defined)
Add a table (EventFrame, as the therapeutics table contains multiple rows per patient) with docstring, to tpp/tables.py
Add a backend test similar to the one for the raw table (with relevant input data to check for the collated strings etc)

madwort · 2024-05-15T14:43:59Z

I think you've got a typo for the link to "backend test similar to the one for the raw table" - just checking you meant

ehrql/tests/integration/backends/test_tpp.py

Lines 666 to 711 in b2c6750

    
    @register_test_for(tpp_raw.covid_therapeutics_raw)
    def test_covid_therapeutics_raw(select_all_tpp):
        results = select_all_tpp(
            Therapeutics(
                Patient_ID=1,
                COVID_indication="a",
                Count=3,
                CurrentStatus="b",
                Diagnosis="c",
                FormName="d",
                Intervention="e",
                CASIM05_date_of_symptom_onset="f",
                CASIM05_risk_cohort="g",
                MOL1_onset_of_symptoms="h",
                MOL1_high_risk_cohort="i",
                SOT02_onset_of_symptoms="j",
                SOT02_risk_cohorts="k",
                Received="2023-10-15T12:13:45",
                TreatmentStartDate="2023-11-16T13:45:07",
                AgeAtReceivedDate=60,
                Region="l",
                Der_LoadDate="2023-09-14 12:34:56.78000",
            ),
        )
        assert results == [
            {
                "patient_id": 1,
                "covid_indication": "a",
                "count": 3,
                "current_status": "b",
                "diagnosis": "c",
                "form_name": "d",
                "intervention": "e",
                "CASIM05_date_of_symptom_onset": "f",
                "CASIM05_risk_cohort": "g",
                "MOL1_onset_of_symptoms": "h",
                "MOL1_high_risk_cohort": "i",
                "SOT02_onset_of_symptoms": "j",
                "SOT02_risk_cohorts": "k",
                "received": date(2023, 10, 15),
                "treatment_start_date": date(2023, 11, 16),
                "age_at_received_date": 60,
                "region": "l",
                "load_date": date(2023, 9, 14),
            },
        ]

rebkwok · 2024-05-15T15:28:46Z

Yes, that's the one


rebkwok · 2024-05-20T10:39:09Z

@acagreen17 Some questions:

Which columns should be exposed in the therapeutics table? Cohort-extractor could return values from the following columns:

covid_indication
intervention
current_status
risk_group (a combination of the 3 separate risk group columns as a comma-separated list)
treatment_start_date
region

Are all of these required in ehrQL? Are there any others that should be queryable?

There are some (~35) fully duplicate rows in the database table, which cohort-extractor removed, and we can do the same in ehrQL. However, this may (probably will) leave some duplicates of the selected columns that we expose. Is there a subset of columns where duplicates would definitely constitute actual duplication of data (i.e. someone entered data from the same paper form twice)? If we only remove fully duplicate rows, we should document that the data may contain duplicates and users should take steps to address that.
The risk group field is taken from the 3 separate risk group columns, which each relate to a particular intervention (Sotrovimab, Molnupiravir, Casirivimab & imdevimab). There are now also other interventions (sarilumab, baricitinib, paxlovid and remdesivir), none of which has a corresponding risk group column in the data. Should we just document this in the table docs?
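
To make the duplicates question concrete, here is a minimal Python sketch of how rows that differ only in unexposed columns end up looking identical once the table is restricted to the exposed columns. `duplicate_groups` is a hypothetical helper, and including `patient_id` in the key is an assumption (every row belongs to a patient, but it is not in the list above).

```python
from collections import Counter

# Columns listed above, plus patient_id (an assumption for this example).
EXPOSED_COLUMNS = (
    "patient_id",
    "covid_indication",
    "intervention",
    "current_status",
    "risk_group",
    "treatment_start_date",
    "region",
)


def duplicate_groups(rows, columns=EXPOSED_COLUMNS):
    """Return {values: count} for every combination of `columns` seen more than once."""
    counts = Counter(tuple(row.get(column) for column in columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}
```

Two rows that differ only in, say, form_name are not fully duplicate, so they would survive the de-duplication, but they would be reported here as one group with a count of 2.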

acagreen17 · 2024-05-20T11:08:27Z

I think everything available via cohort-extractor should be available via ehrQL. I don't think we need to make any additional columns available either.
From memory there is a small group of individuals who appear to have been given two drugs around the same time (because they were given the first-line drug but then switched to a different one), and it is difficult to know which one they ended up getting (at least it was difficult when we were first looking at this data). But I think it is safer for the researchers to make this call, and for us just to document it as a potential limitation.
Yes.

@HelenCEBM might have some useful thoughts on this.

rebkwok added the cohortextractor-parity label May 15, 2024

rebkwok assigned rebkwok and madwort May 15, 2024

madwort added a commit that referenced this issue May 17, 2024

Expose a cleaner therapeutics table

9e05b42

* risk cohort values from different sources are aggregated * fixes #2023

madwort added a commit that referenced this issue May 17, 2024

Expose a cleaner therapeutics table

62d9c0c

* risk cohort values from different sources are aggregated * fixes #2023

madwort mentioned this issue May 17, 2024

Add a cleaner therapeutics table #2025

Merged

madwort added a commit that referenced this issue May 17, 2024

Expose a cleaner therapeutics table

6f1a71c

* risk cohort values from different sources are aggregated * fixes #2023

madwort closed this as completed in #2025 May 23, 2024
