pipeline for Quidel flu test #181
base: main
Conversation
After switching to James's new mapping file, 31 zip codes still have no mapping information. Only 7,583 tests out of 7,519,726 are related to those zip codes through 2020-08-03.
@jsharpna helped check those zip codes. They are not valid zip codes according to https://tools.usps.com/zip-code-lookup.htm?citybyzipcode. Will ask Quidel about them.
Will email Quidel with all problems: bad zips, non-unique regions per device. Fixing some of these requires merging with or otherwise depending on #137, but that package doesn't include the home-state mappings for HRRs and MSAs that are used to fill in for insufficient sample size. Hold off on finishing this until we can get the home-state mappings into the geo package.
overall_total.drop(labels="FluA", axis="columns", inplace=True)

# Compute numUniqueDevices
numUniqueDevices = df.groupby(
snake_case var names
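A minimal sketch of the suggested rename, with `num_unique_devices` in snake_case per PEP 8. The device column name (`SofiaSerNum`) and the sample frame are hypothetical stand-ins, since the diff only shows the start of the `groupby`:

```python
import pandas as pd

# Hypothetical per-test records; column names are assumptions
df = pd.DataFrame({
    "timestamp": ["2020-08-01", "2020-08-01", "2020-08-02"],
    "SofiaSerNum": ["A1", "A2", "A1"],
})

# snake_case instead of camelCase for local variables (PEP 8)
num_unique_devices = df.groupby("timestamp")["SofiaSerNum"].nunique()
print(num_unique_devices.tolist())  # [2, 1]
```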
def raw_tests_per_device(devices, tests, min_obs):
    '''
double quotes
@@ -0,0 +1,39 @@
# -*- coding: utf-8 -*-
"""Function to export the dataset in the format expected of the API.
Super nitpicky, but if you want to go one step further, standardizing docstrings and general linting can be nice for organization and readability.
I've mainly used flake8, but it looks like pylint is common in this repo. I imagine they're comparable.
        zipcode = int(float(zipcode))
    zipcode5.append(zipcode)
df['zip'] = zipcode5
# print('Fixing %.2f %% of the data' % (fixnum * 100 / len(zipcode5)))
Is this debugging? Do the fixnum lines still need to exist?
This is used for checking only. I'd like to keep it for now, since Quidel might change their raw data.
👍
zipcode5 = []
fixnum = 0
for zipcode in df['ZipCode'].values:
    if isinstance(zipcode, str) and '-' in zipcode:
Do mixed types get read into the DF, which is why this if/else exists? If so, is it worth reading everything in as str? If not, and the else isn't for NaNs, I'm not sure why the isinstance check exists.
Also, I think there might be a quicker way to do this with zfill, like str(zipcode).split("-")[0].zfill(5), though I'm not sure without knowing exactly what the raw input looks like.
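The suggested one-liner, sketched against a few made-up raw values (what Quidel actually sends is unknown here, so these inputs are assumptions; note a float like 603.0 would still need separate handling):

```python
def normalize_zip(zipcode):
    # Take the 5-digit prefix of ZIP+4 strings and left-pad short zips
    return str(zipcode).split("-")[0].zfill(5)

print(normalize_zip("02138-1234"))  # 02138
print(normalize_zip(603))           # 00603
print(normalize_zip("90210"))       # 90210
```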
Yes. Both ints and "XXXXX-XXXX" strings exist for "ZipCode" in the raw data from Quidel. The reason I don't read it in as str is that we won't report the data at the zip code level; zip codes are only used for geo mapping. It is easier to read them in as int and then merge the data with map_df, which also has zip codes as type int.
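A small sketch of the int-keyed merge being described. The `map_df` contents and `state_id` column are hypothetical; the point is that once both sides store zips as int, no string padding is needed:

```python
import pandas as pd

# Hypothetical raw data: "ZipCode" mixes ints and ZIP+4 strings
df = pd.DataFrame({"ZipCode": [15213, "19104-6243", 2138]})
df["zip"] = [int(str(z).split("-")[0]) for z in df["ZipCode"]]

# map_df (from the static mapping file) also stores zips as ints,
# so the merge works without zero-padding either side
map_df = pd.DataFrame({"zip": [15213, 19104, 2138],
                       "state_id": ["pa", "pa", "ma"]})
merged = df.merge(map_df, on="zip", how="left")
print(merged["state_id"].tolist())  # ['pa', 'pa', 'ma']
```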
👍
else:
    pooled_positives = tpooled_positives
    pooled_tests = tpooled_tests
## STEP 2: CALCULATE AS THOUGH THEY'RE RAW
I assume this is STEP 2 since the geo pooling had a STEP 1 in it, but it's a bit confusing since STEP 1 is then somewhere else.
Co-authored-by: chinandrew <chinandrew96@gmail.com>
        zipcode5.append(int(zipcode.split('-')[0]))
        fixnum += 1
    else:
        zipcode = int(float(zipcode))
Suggested change:
- zipcode = int(float(zipcode))
+ zipcode = int(zipcode)

Pretty sure this works.
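The suggestion is safe as long as the raw values are never strings containing a decimal point; a quick sketch of where `int()` and `int(float())` agree and diverge:

```python
# int(float(x)) vs int(x): identical for numeric inputs...
assert int(float(12345.0)) == int(12345.0) == 12345
assert int(float("12345")) == int("12345") == 12345

# ...but they diverge when the raw value is a string like "12345.0"
try:
    int("12345.0")  # plain int() rejects a decimal-point string
except ValueError:
    print("int('12345.0') raises ValueError")

print(int(float("12345.0")))  # 12345
```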
@amartyabasu, waiting on your review
I'll have it completed today.
EXPORT_DAY_RANGE = 40  # Number of dates to report

GEO_RESOLUTIONS = [
    # "county",
Is the county-based aggregation not done because of small sample sizes?
Yes. There are few counties with sample sizes larger than 50.
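The reporting rule discussed here (only publish a geo_id whose sample size exceeds 50) can be sketched as a simple filter. The column names and counts are made up for illustration:

```python
import pandas as pd

MIN_SAMPLE_SIZE = 50  # reporting threshold mentioned in the thread

# Hypothetical per-county aggregates
counts = pd.DataFrame({
    "fips": ["42003", "42005", "06037"],
    "sample_size": [120, 12, 300],
})

# Keep only counties with enough tests to report
reportable = counts[counts["sample_size"] >= MIN_SAMPLE_SIZE]
print(reportable["fips"].tolist())  # ['42003', '06037']
```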
quidel_flutest/params.json.template
Outdated
"account": "delphi-datadrop@andrew.cmu.edu",
"password": "",
"sender": "",
"mode":"",
- "mode":"" has an extra comma at the end.
- I ran the pipeline with pull_start_date: "2020-07-01" and export_start_date: "2020-06-01". The daily CSVs got generated from 20200711 onwards. Does that mean there was no data from 2020-07-01 to 2020-07-10?
- According to the implementation, would export_start_date always precede pull_start_date to account for the backfills?
- The 'flu_ag_smoothed_tests_per_device' signal does not report standard errors.
- Remember we only report a geo_id with sample_size larger than 50. There will be data from 2020-07-01 to 2020-07-10, but it might not include a single geo_id with sample size larger than 50.
- Yes, export_start_date should always precede pull_start_date.
- Yes. I'm not sure of the definition of se for that signal.
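A hypothetical params.json.template fragment illustrating the date relationship above: export_start_date sits earlier than pull_start_date so backfilled dates can still be exported. The dates are the ones from this thread; all other fields are omitted:

```json
{
    "export_start_date": "2020-06-01",
    "pull_start_date": "2020-07-01"
}
```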
# newdf["ResultID"] = np.nan
# if "FluTestNumber" not in newdf.keys():
#     newdf["FluTestNumber"] = np.nan
# newdf["filename"] = name
Are these commented lines still needed to handle different column names generated in the email files?
No, they are not needed anymore. I will delete them.
def fix_date(df):
    """
    Quidel Covid Test are labeled with Test Date and Storage Date.
How is 'StorageDate' different from 'TestDate' such that this cleanup is needed? Maybe a one-line definition would help.
Thanks. I will add more explanations in the comments.
Done.
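A minimal sketch of the kind of cleanup being discussed. The fallback rule (use StorageDate when TestDate is missing or implausibly later than StorageDate) is an assumption for illustration, not the PR's actual logic:

```python
import pandas as pd

def fix_date(df):
    # ASSUMED rule: a test must be run before it is stored, so when
    # TestDate is missing or after StorageDate, fall back to StorageDate
    df = df.copy()
    bad = df["TestDate"].isna() | (df["TestDate"] > df["StorageDate"])
    df.loc[bad, "TestDate"] = df.loc[bad, "StorageDate"]
    return df

df = pd.DataFrame({
    "TestDate": pd.to_datetime(["2020-07-01", None, "2020-09-01"]),
    "StorageDate": pd.to_datetime(["2020-07-02", "2020-07-05", "2020-07-06"]),
})
print(fix_date(df)["TestDate"].dt.strftime("%Y-%m-%d").tolist())
```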
Linter test:
Pytest:
    res_group = res_group.merge(parent_group, how="left",
                                on="timestamp", suffixes=('', '_parent'))
    res_group = res_group.drop(columns=[res_key, "state_id", "state_id" + '_parent'])
except:
In my opinion a simpler if/else block would work better than the try/except block when parent_group does not exist.
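A sketch of that suggestion, replacing the bare `except:` with an explicit check. The frame shapes and the assumption that a missing `parent_group` is the failure mode are hypothetical; the merge/drop calls follow the diff:

```python
import pandas as pd

def merge_with_parent(res_group, parent_group, res_key):
    # Explicit branch instead of a bare except: only merge when a
    # parent (state-level) frame actually exists
    if parent_group is not None:
        res_group = res_group.merge(parent_group, how="left",
                                    on="timestamp", suffixes=("", "_parent"))
        res_group = res_group.drop(
            columns=[res_key, "state_id", "state_id_parent"])
    return res_group

child = pd.DataFrame({"timestamp": ["2020-08-01"], "val": [1.0],
                      "state_id": ["pa"]})
parent = pd.DataFrame({"timestamp": ["2020-08-01"], "val": [2.0],
                       "state_id": ["pa"]})

print(merge_with_parent(child, None, "val").columns.tolist())
print(merge_with_parent(child, parent, "val").columns.tolist())
```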
…in params.json.template
How did you conduct this linter test where you got that info?
I simply ran pylint over the delphi_quidel_flutest module.
Weird, I didn't see those results. Could you try
I got 10/10 on my computer with:
Some decisions to make:
A mapping problem at the 5-digit zip code level:
This problem is not severe for the COVID test. There are fewer than 10 zip codes not included in 02_20_uszips.csv, and a very small proportion of the data is related to those weird zip codes.
However, in the Flu test there are ~90 such zip codes. It is hard to manually check each one and fill in its mapping and population information. We may need to update our mapping file?
These zip codes are listed here:
{603, 622, 627, 674, 676, 683, 717, 726, 728, 732, 733, 736, 738, 754, 780, 792, 795, 907, 912, 919, 953, 957, 959, 2572, 2781, 15705, 20174, 27412, 27460, 28793, 28823, 29019, 29484, 29486, 29871, 30597, 30997, 32163, 32214, 32306, 32313,
32611, 32761, 33551, 33574, 33652, 35642, 37232, 47782, 48483, 48670, 48824, 48902, 50410, 60944, 68179, 72053,
75033, 75072, 75222, 75322, 75429, 75546, 75606, 76094, 76803, 76909, 76992, 76993, 77370, 77399, 78086, 78776,
79430, 80630, 84129, 85378, 86123, 86746, 89557, 91315, 92094, 92152, 92521, 92697, 93077,
95929, 99094, 99623}
Only 133,000 tests out of 7,519,726 are related to those zip codes through 2020-08-03.
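The check described above amounts to a set difference against the mapping file. The file name comes from the thread; the column names and toy inputs are assumptions:

```python
import pandas as pd

# Hypothetical stand-ins for the real inputs
tests = pd.DataFrame({"zip": [15213, 603, 99623, 15213]})
map_df = pd.DataFrame({"zip": [15213, 19104]})  # e.g. from 02_20_uszips.csv

# Zip codes appearing in the test data but absent from the mapping file
unmapped = set(tests["zip"]) - set(map_df["zip"])
affected = tests["zip"].isin(unmapped).sum()
print(sorted(unmapped))                     # [603, 99623]
print(f"{affected} of {len(tests)} tests")  # 2 of 4 tests
```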
(Remember to remove wip_ and change the pull_start_date to be earlier than 2020-05-08; it will take about half an hour to read all of the historical data.)