Expose a "clean" therapeutics table #2023

Closed
rebkwok opened this issue May 15, 2024 · 5 comments · Fixed by #2025

Comments

@rebkwok
Contributor

rebkwok commented May 15, 2024

See slack thread

The aim is to make the covid therapeutics data consistent with the data cohort-extractor provided, and easier for users to work with.

Current covid_therapeutics_raw table:

  • remove the onset of symptoms columns
  • restrict the dataset?

Add a new non-raw table that:

  • only exposes required fields
  • removes fully duplicate rows
  • removes "filler" words from the 3 risk group fields and joins them into a single risk group field (a comma-separated string)
  • casts datetimes to date (already done in the raw table)
    (cohort-extractor applied a collation to the intervention and currentstatus columns, but according to the database report they already have that collation, so this should be unnecessary.)

Refer to cohort-extractor's implementation:
create_therapeutics_table removes the duplicates and produces the comma-separated risk groups (as separate columns). Joining the 3 groups is done here.
(Note that we don't need to worry about duplicate risk groups across the 3 risk group columns because only one of those contains data in any one row)
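The cleaning steps listed above can be sketched in plain Python. This is only an illustration of the intended behaviour, not the cohort-extractor or ehrQL implementation: the filler phrase list and the function names are hypothetical, and the risk column names are taken from the raw table's test data.

```python
# Hypothetical filler phrases; the real list is defined in cohort-extractor.
FILLER_PHRASES = ("Patients with ",)

# The three source risk group columns in the raw therapeutics table.
RISK_COLUMNS = (
    "CASIM05_risk_cohort",
    "MOL1_high_risk_cohort",
    "SOT02_risk_cohorts",
)


def clean_risk_group(row):
    """Strip filler phrases and join the non-empty risk columns with commas."""
    parts = []
    for column in RISK_COLUMNS:
        value = row.get(column) or ""
        for phrase in FILLER_PHRASES:
            value = value.replace(phrase, "")
        value = value.strip()
        if value:
            parts.append(value)
    return ",".join(parts)


def clean_rows(rows):
    """Drop fully duplicate rows, then expose only a subset of fields."""
    seen = set()
    cleaned = []
    for row in rows:
        # A row is a full duplicate only if every column matches.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(
            {
                "patient_id": row["Patient_ID"],
                "intervention": row["Intervention"],
                "risk_group": clean_risk_group(row),
            }
        )
    return cleaned
```

Because only one of the three risk columns contains data in any one row, the join usually returns that single cleaned value.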

@rebkwok
Contributor Author

rebkwok commented May 15, 2024

To add the new TPP table in ehrQL:

  • Add to the Backend as a new QueryTable (here's where the raw one is defined)
  • Add a table (EventFrame, as the therapeutics table contains multiple rows per patient) with docstring, to tpp/tables.py
  • Add a backend test similar to the one for the raw table (with relevant input data to check for the collated strings etc)

@madwort
Contributor

madwort commented May 15, 2024

I think you've got a typo for the link to "backend test similar to the one for the raw table" - just checking you meant

@register_test_for(tpp_raw.covid_therapeutics_raw)
def test_covid_therapeutics_raw(select_all_tpp):
    results = select_all_tpp(
        Therapeutics(
            Patient_ID=1,
            COVID_indication="a",
            Count=3,
            CurrentStatus="b",
            Diagnosis="c",
            FormName="d",
            Intervention="e",
            CASIM05_date_of_symptom_onset="f",
            CASIM05_risk_cohort="g",
            MOL1_onset_of_symptoms="h",
            MOL1_high_risk_cohort="i",
            SOT02_onset_of_symptoms="j",
            SOT02_risk_cohorts="k",
            Received="2023-10-15T12:13:45",
            TreatmentStartDate="2023-11-16T13:45:07",
            AgeAtReceivedDate=60,
            Region="l",
            Der_LoadDate="2023-09-14 12:34:56.78000",
        ),
    )
    assert results == [
        {
            "patient_id": 1,
            "covid_indication": "a",
            "count": 3,
            "current_status": "b",
            "diagnosis": "c",
            "form_name": "d",
            "intervention": "e",
            "CASIM05_date_of_symptom_onset": "f",
            "CASIM05_risk_cohort": "g",
            "MOL1_onset_of_symptoms": "h",
            "MOL1_high_risk_cohort": "i",
            "SOT02_onset_of_symptoms": "j",
            "SOT02_risk_cohorts": "k",
            "received": date(2023, 10, 15),
            "treatment_start_date": date(2023, 11, 16),
            "age_at_received_date": 60,
            "region": "l",
            "load_date": date(2023, 9, 14),
        },
    ]

@rebkwok
Contributor Author

rebkwok commented May 15, 2024

Yes, that's the one

madwort added a commit that referenced this issue May 17, 2024
* risk cohort values from different sources are aggregated
* fixes #2023
@rebkwok
Contributor Author

rebkwok commented May 20, 2024

@acagreen17 Some questions:

  1. Which columns should be exposed in the therapeutics table? Cohort-extractor could return values from the following columns:
  • covid_indication
  • intervention
  • current_status
  • risk_group (a combination of the 3 separate risk group columns as a comma-separated list)
  • treatment_start_date
  • region

Are all of these required in ehrQL? Are there any others that should be queryable?

  2. There are some (~35) fully duplicate rows in the database table, which cohort-extractor removed, and we can do the same in ehrQL. However, this may (probably will) leave some duplicates of the selected columns that we expose. Is there a subset of columns where duplicates would definitely constitute actual duplication of data (i.e. someone entered data from the same paper form twice)? If we only remove fully duplicate rows, we should document that the data may contain duplicates and that users should take steps to address that.

  3. The risk group field is taken from the 3 separate risk group columns, which each relate to a particular intervention (Sotrovimab, Molnupiravir, Casirivimab & imdevimab). There are now also other interventions (sarilumab, baricitinib, paxlovid and remdesivir), none of which has a corresponding risk group column in the data. Should we just document this in the table docs?
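To illustrate the duplicates question: once only a subset of columns is exposed, rows that differed in the raw table can become identical. A minimal sketch of how a user might spot these is below. The helper name and the choice of key columns are assumptions for illustration only; nothing in this thread settles which subset of columns identifies a true duplicate.

```python
from collections import Counter


def find_partial_duplicates(rows, key_columns):
    """Return each key (tuple of values for key_columns) seen more than once, with its count."""
    counts = Counter(tuple(row[column] for column in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical exposed rows: the two rows for patient 1 differ only by region.
rows = [
    {"patient_id": 1, "intervention": "Molnupiravir", "treatment_start_date": "2023-11-16", "region": "a"},
    {"patient_id": 1, "intervention": "Molnupiravir", "treatment_start_date": "2023-11-16", "region": "b"},
    {"patient_id": 2, "intervention": "Remdesivir", "treatment_start_date": "2023-11-17", "region": "a"},
]

# Assumed key columns, chosen only for this example.
duplicates = find_partial_duplicates(
    rows, ("patient_id", "intervention", "treatment_start_date")
)
```

Whether such rows are real repeat treatments or double entry of the same form is a judgement the data alone cannot make, which is why documenting the limitation matters.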

@acagreen17

  1. I think everything available via cohort-extractor should be available via ehrQL. I don't think we need to make any additional columns available either.

  2. From memory, there is a small group of individuals who appear to have been given two drugs around the same time (because they were given the first-line drug but then switched to a different one), and it is difficult to know which one they ended up getting (at least it was difficult when we were first looking at this data). I think it is safer to let the researchers make this call and just document it as a potential limitation.

  3. Yes.

@HelenCEBM might have some useful thoughts on this.
