
pipeline for Quidel flu test #181

Open · wants to merge 15 commits into base: main
Conversation

@jingjtang (Contributor) commented Aug 5, 2020

Some decisions to make:

  1. When to start reporting this signal? (The data volume each day is still very low: ~1k–1.5k records per day, including backfilled records.)
  2. Export end date: the last day D to be reported today (due to the backfill problem).
  3. Export start date: how many files we want to upload every day (due to the backfill problem).
  4. How to solve the problem of one device showing up in different places at the same time (as mentioned here)?
  5. How to do correlation analysis for the flu test data?

A mapping problem at the 5-digit zip code level:
This problem is not severe for the COVID test: fewer than 10 zip codes are missing from 02_20_uszips.csv, and only a very small proportion of the data is related to those weird zip codes.
However, for the flu test there are ~90 such zip codes. It is hard to manually check each one and fill in its mapping and population information. We may need to update our mapping file?

These zip codes are listed here:
{603, 622, 627, 674, 676, 683, 717, 726, 728, 732, 733, 736, 738, 754, 780, 792, 795, 907, 912, 919, 953, 957, 959, 2572, 2781, 15705, 20174, 27412, 27460, 28793, 28823, 29019, 29484, 29486, 29871, 30597, 30997, 32163, 32214, 32306, 32313,
32611, 32761, 33551, 33574, 33652, 35642, 37232, 47782, 48483, 48670, 48824, 48902, 50410, 60944, 68179, 72053,
75033, 75072, 75222, 75322, 75429, 75546, 75606, 76094, 76803, 76909, 76992, 76993, 77370, 77399, 78086, 78776,
79430, 80630, 84129, 85378, 86123, 86746, 89557, 91315, 92094, 92152, 92521, 92697, 93077,
95929, 99094, 99623}

Only 133,000 tests out of 7,519,726 are related to those zip codes through 2020-08-03.

(Remember to remove wip_ and change pull_start_date to be earlier than 2020-05-08; it will take about half an hour to read all of the historical data.)
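The unmapped-zip check described above (counting tests whose zip code has no row in the mapping file) can be sketched with a pandas left merge. This is a hypothetical illustration, not the pipeline's code; the column names (`zip`, `population`) and the `find_unmapped_zips` helper are assumptions.

```python
import pandas as pd

def find_unmapped_zips(tests: pd.DataFrame, map_df: pd.DataFrame) -> set:
    """Return the set of zip codes in `tests` that are absent from `map_df`."""
    # indicator=True adds a "_merge" column marking rows with no match.
    merged = tests.merge(map_df, on="zip", how="left", indicator=True)
    return set(merged.loc[merged["_merge"] == "left_only", "zip"])

# Tiny illustration with made-up data:
tests = pd.DataFrame({"zip": [15213, 603, 99623]})
map_df = pd.DataFrame({"zip": [15213], "population": [300000]})
print(find_unmapped_zips(tests, map_df))  # {603, 99623}
```

The same merge result can also be used to compute the "only N tests out of M" proportion quoted above, by summing test counts over the `left_only` rows.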

@jingjtang (Contributor, Author) commented Aug 5, 2020

After switching to James's new mapping file, 31 zip codes still have no mapping information:
{2572, 2781, 20174, 27460, 28823, 29019, 29871, 30997, 32761, 33551, 33652, 35642, 47782, 48483, 48902, 50410, 75322, 75429, 75546, 76992, 76993, 77370, 78086, 78776, 80630, 86123, 86746, 91315, 92094, 93077, 99094}

Only 7,583 tests out of 7,519,726 are related to those zip codes through 2020-08-03.

@jingjtang (Contributor, Author):

> After switching to James's new mapping file, 31 zip codes still have no mapping information:
> {2572, 2781, 20174, 27460, 28823, 29019, 29871, 30997, 32761, 33551, 33652, 35642, 47782, 48483, 48902, 50410, 75322, 75429, 75546, 76992, 76993, 77370, 78086, 78776, 80630, 86123, 86746, 91315, 92094, 93077, 99094}
>
> Only 7,583 tests out of 7,519,726 are related to those zip codes through 2020-08-03.

@jsharpna helped check those zip codes. They are not valid zip codes according to https://tools.usps.com/zip-code-lookup.htm?citybyzipcode. Will ask Quidel about them.

@krivard (Contributor) commented Aug 6, 2020

Will email Quidel with all problems: bad zips, non-unique regions per device.

Fixing some of these requires merging or otherwise depending on #137, but that package doesn't include the home-state mappings for HRRs and MSAs that are used to fill in for insufficient sample size.

Hold off on finishing this until we can get the home-state mappings into the geo package.

overall_total.drop(labels="FluA", axis="columns", inplace=True)

# Compute numUniqueDevices
numUniqueDevices = df.groupby(
Contributor: use snake_case variable names.



def raw_tests_per_device(devices, tests, min_obs):
'''
Contributor: use double quotes.

@@ -0,0 +1,39 @@
# -*- coding: utf-8 -*-
"""Function to export the dataset in the format expected of the API.
Contributor: Super nitpick, but standardizing docstrings and general linting, if going one step further, can be nice for organization and readability.

I've mainly used flake8, but it looks like pylint is common in this repo; I imagine they're comparable.

zipcode = int(float(zipcode))
zipcode5.append(zipcode)
df['zip'] = zipcode5
# print('Fixing %.2f %% of the data' % (fixnum * 100 / len(zipcode5)))
Contributor: Is this debugging? Do the fixnum lines still need to exist?

@jingjtang (Author), Aug 16, 2020: This is used for checking only. For now, I still want it to be there, since Quidel might change their raw data.

Contributor: 👍

zipcode5 = []
fixnum = 0
for zipcode in df['ZipCode'].values:
if isinstance(zipcode, str) and '-' in zipcode:
Contributor: Do mixed types get read into the DataFrame, which is why this if/else exists? If so, is it worth reading everything in as str? If not, and the else isn't for NaNs, I'm unsure why the isinstance check exists.

Also, I think there might be a quicker way to do this with zfill, like str(zipcode).split("-")[0].zfill(5), though I'm not sure without knowing exactly what the raw input looks like.
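The zfill approach suggested above could look like the sketch below. It is only an illustration of the reviewer's one-liner, assuming zip codes arrive either as ints or as "XXXXX-XXXX" strings (as described later in this thread); `normalize_zip` is a hypothetical helper name.

```python
def normalize_zip(zipcode) -> str:
    """Return a 5-character zip string, zero-padded on the left.

    Handles both int inputs (e.g. 603) and ZIP+4 strings ("02572-1234").
    Caveat: float inputs like 603.0 would need int() first, since
    str(603.0) == "603.0".
    """
    return str(zipcode).split("-")[0].zfill(5)

print(normalize_zip(603))           # "00603"
print(normalize_zip("02572-1234"))  # "02572"
```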

@jingjtang (Author), Aug 16, 2020: Yes. Both ints and strings in the "XXXXX-XXXX" format exist for "ZipCode" in the raw data from Quidel. The reason I don't read it in as str is that we won't report the data at the zip-code level; zip codes are only used for geo mapping. It is easier to read them in as int and then merge the data with map_df, which also stores zip codes as int.

Contributor: 👍

else:
pooled_positives = tpooled_positives
pooled_tests = tpooled_tests
## STEP 2: CALCULATE AS THOUGH THEY'RE RAW
Contributor: I assume this is STEP 2 since the geo pooling had a STEP 1 in it, but it's a bit confusing since STEP 1 is somewhere else.

Co-authored-by: chinandrew <chinandrew96@gmail.com>
zipcode5.append(int(zipcode.split('-')[0]))
fixnum += 1
else:
zipcode = int(float(zipcode))
Contributor: Suggested change:
- zipcode = int(float(zipcode))
+ zipcode = int(zipcode)

pretty sure this works
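Whether that simplification is safe depends on what the raw values look like: `int()` on a float-formatted string raises `ValueError`, while going through `float()` first succeeds. A minimal demonstration:

```python
# int(float(...)) tolerates float-formatted strings such as "12345.0".
print(int(float("12345.0")))  # 12345

# int(...) alone only accepts plain integer strings.
try:
    int("12345.0")
except ValueError:
    print("int() alone rejects float-formatted strings")

print(int("12345"))  # 12345 — fine when the string is a plain integer
```

So the suggested `int(zipcode)` works only if the raw Quidel export never produces values like "12345.0" for that column.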

@krivard (Contributor) commented Aug 21, 2020

@amartyabasu, waiting on your review

@amartyabasu (Contributor):

> @amartyabasu, waiting on your review

I'll have it completed today.

EXPORT_DAY_RANGE = 40 # Number of dates to report

GEO_RESOLUTIONS = [
# "county",
Contributor: Is the county-based aggregation not done because of the small sample size?

@jingjtang (Author): Yes. There are few counties with sample sizes larger than 50.

"account": "delphi-datadrop@andrew.cmu.edu",
"password": "",
"sender": "",
"mode":"",
Contributor:

  • "mode":"" — extra comma at the end.
  • I ran the pipeline with pull_start_date: "2020-07-01" and export_start_date: "2020-06-01". The daily CSVs were generated from 20200711 onwards. Does that mean there was no data from 2020-07-01 to 2020-07-10?
  • According to the implementation, should export_start_date always precede pull_start_date to account for the backfills?
  • The 'flu_ag_smoothed_tests_per_device' signal does not report standard errors.

@jingjtang (Author):

  • Remember we only report a geo_id with a sample_size larger than 50. There will be data from 2020-07-01 to 2020-07-10, but it might not include a single geo_id with a sample size larger than 50.
  • Yes, export_start_date should always precede pull_start_date.
  • Yes. I'm not sure how se would be defined for that signal.
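The reporting rule described above (only geo_ids with sample_size of at least 50 are exported) can be sketched as a simple filter. This is a minimal illustration under assumed column names (`geo_id`, `sample_size`), not the pipeline's actual export code.

```python
import pandas as pd

MIN_SAMPLE_SIZE = 50  # threshold mentioned in the discussion

def filter_reportable(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows with a large enough sample size to report."""
    return df[df["sample_size"] >= MIN_SAMPLE_SIZE]

# Made-up daily data: "ak" is dropped, so a day can have rows pulled
# yet produce no exported CSV if nothing clears the threshold.
daily = pd.DataFrame({
    "geo_id": ["pa", "ak", "ny"],
    "sample_size": [120, 12, 50],
})
print(filter_reportable(daily)["geo_id"].tolist())  # ['pa', 'ny']
```

This is why CSVs can start later than pull_start_date: early days may have data but no geo_id passing the filter.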

# newdf["ResultID"] = np.nan
# if "FluTestNumber" not in newdf.keys():
# newdf["FluTestNumber"] = np.nan
# newdf["filename"] = name
Contributor: Are these commented lines still needed to handle the different column names generated in the email files?

@jingjtang (Author): No, they are not needed anymore. I will delete them.


def fix_date(df):
"""
Quidel Covid Test are labeled with Test Date and Storage Date.
Contributor: How is 'StorageDate' different from 'TestDate' such that this clean-up is needed? Maybe a one-line definition would help.

@jingjtang (Author): Thanks. I will add more explanation in the comments.

@jingjtang (Author): Done.

@krivard (Contributor) commented Aug 25, 2020

  • TODO: Verify tests pass and linter has no substantive complaints

@amartyabasu (Contributor):

> TODO: Verify tests pass and linter has no substantive complaints

Linter test:

  • Score 8.75/10. No mandatory messages need to be addressed.
  • Most of the messages concern single- and double-letter variable names that don't follow snake case, but these are variables like 'df' and 'se' that are used in other codebases as well.
  • There are 'too many local variables' messages in all files except geo_maps.py.

Pytest:

  • 89% coverage, covering all the important parts of the code.
  • Only the functions for reading historical data and emails are not covered.

res_group = res_group.merge(parent_group, how="left",
on="timestamp", suffixes=('', '_parent'))
res_group = res_group.drop(columns=[res_key, "state_id", "state_id" + '_parent'])
except:
Contributor: In my opinion, a simpler if/else block would work better in place of the try/except block when parent_group does not exist.
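The reviewer's suggestion could look like the sketch below: an explicit check for whether `parent_group` exists, instead of a bare `except`. Variable names (`res_group`, `parent_group`, `res_key`) follow the snippet above, but the surrounding pipeline logic is assumed, not reproduced (the original also drops the `state_id` columns, omitted here for brevity).

```python
import pandas as pd

def merge_with_parent(res_group, parent_group, res_key):
    """Attach parent-geo columns when a parent group exists; else pass through."""
    if parent_group is not None and not parent_group.empty:
        res_group = res_group.merge(parent_group, how="left",
                                    on="timestamp", suffixes=("", "_parent"))
        res_group = res_group.drop(columns=[res_key])
    return res_group

# Made-up data: child counts plus parent (e.g. state-level) counts.
child = pd.DataFrame({"timestamp": [1, 2], "zip": [15213, 15217], "n": [5, 7]})
parent = pd.DataFrame({"timestamp": [1, 2], "n": [50, 70]})
print(merge_with_parent(child, parent, "zip").columns.tolist())
# ['timestamp', 'n', 'n_parent']
```

The if/else makes the "no parent" path explicit and avoids a bare `except` silently swallowing unrelated errors (KeyError from a typo, for instance).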

@jingjtang (Contributor, Author):

> Linter test: Score 8.75/10, no mandatory messages. Pytest: 89% coverage.

How did you conduct this linter test to get that info?

@amartyabasu (Contributor):

> How did you conduct this linter test to get that info?

I simply ran pylint over the delphi_quidel_flutest module.

@jingjtang (Contributor, Author) commented Aug 26, 2020

> I simply ran pylint over the delphi_quidel_flutest module.

Weird, I didn't see those results. Could you try git pull and run it again?

@amartyabasu (Contributor):

> Weird, I didn't see those results. Could you try git pull and run it again?

I ran the project again after pulling the latest changes and got the same set of messages. Attaching a screenshot of my output. I don't think these are of any concern. :-)

(screenshot: pylint output)

@jingjtang (Contributor, Author):

I got 10/10 on my computer with:

pylint 2.6.0
astroid 2.4.2
Python 3.7.4

I don't understand why there is such a big difference in the linter results. Let's leave this problem here temporarily.
