
pipeline for Quidel flu test #181

Open · wants to merge 15 commits into base: main
Conversation

@jingjtang (Contributor) commented Aug 5, 2020

Some decisions to make:

  1. When to start reporting this signal? (The data volume each day is still very low: ~1k–1.5k records per day, including backfilled records.)
  2. Export end date: the last day D to be reported today (due to the backfill problem).
  3. Export start date: how many files we want to upload every day (due to the backfill problem).
  4. How to solve the problem of one device showing up in different places at the same time (as mentioned here)?
  5. How to do correlation analysis for the flu test data?

A mapping problem at the 5-digit zip code level:
This problem is not severe for the COVID test: fewer than 10 zip codes are missing from 02_20_uszips.csv, and only a very small proportion of the data is related to those weird zip codes.
However, for the flu test there are ~90 such zip codes. It is hard to manually check each one and fill in its mapping and population information. We may need to update our mapping file?

These zip codes are listed here:
{603, 622, 627, 674, 676, 683, 717, 726, 728, 732, 733, 736, 738, 754, 780, 792, 795, 907, 912, 919, 953, 957, 959, 2572, 2781, 15705, 20174, 27412, 27460, 28793, 28823, 29019, 29484, 29486, 29871, 30597, 30997, 32163, 32214, 32306, 32313,
32611, 32761, 33551, 33574, 33652, 35642, 37232, 47782, 48483, 48670, 48824, 48902, 50410, 60944, 68179, 72053,
75033, 75072, 75222, 75322, 75429, 75546, 75606, 76094, 76803, 76909, 76992, 76993, 77370, 77399, 78086, 78776,
79430, 80630, 84129, 85378, 86123, 86746, 89557, 91315, 92094, 92152, 92521, 92697, 93077,
95929, 99094, 99623}

Only 133,000 tests out of 7,519,726 are related to those zip codes through 2020-08-03.

(Remember to remove wip_ and change pull_start_date to be earlier than 2020-05-08; it will take about half an hour to read all of the historical data.)
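The unmapped-zip check described above (counting tests whose zip code has no row in the mapping file) can be sketched with a pandas left merge. This is a hypothetical illustration, not the pipeline's code; the column names (`zip`, `population`) and the `find_unmapped_zips` helper are assumptions.

```python
import pandas as pd

def find_unmapped_zips(tests: pd.DataFrame, map_df: pd.DataFrame) -> set:
    """Return the set of zip codes in `tests` that are absent from `map_df`."""
    # indicator=True adds a "_merge" column marking rows with no match.
    merged = tests.merge(map_df, on="zip", how="left", indicator=True)
    return set(merged.loc[merged["_merge"] == "left_only", "zip"])

# Tiny illustration with made-up data:
tests = pd.DataFrame({"zip": [15213, 603, 99623]})
map_df = pd.DataFrame({"zip": [15213], "population": [300000]})
print(find_unmapped_zips(tests, map_df))  # {603, 99623}
```

The same merge result can also be used to compute the "only N tests out of M" proportion quoted above, by summing test counts over the `left_only` rows.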

@jingjtang (Contributor, Author) commented Aug 5, 2020

After switching to James's new mapping file, 31 zip codes still have no mapping information:
{2572, 2781, 20174, 27460, 28823, 29019, 29871, 30997, 32761, 33551, 33652, 35642, 47782, 48483, 48902, 50410, 75322, 75429, 75546, 76992, 76993, 77370, 78086, 78776, 80630, 86123, 86746, 91315, 92094, 93077, 99094}

Only 7,583 tests out of 7,519,726 are related to those zip codes through 2020-08-03.

@jingjtang (Contributor, Author):

> After switching to James's new mapping file, 31 zip codes still have no mapping information:
> {2572, 2781, 20174, 27460, 28823, 29019, 29871, 30997, 32761, 33551, 33652, 35642, 47782, 48483, 48902, 50410, 75322, 75429, 75546, 76992, 76993, 77370, 78086, 78776, 80630, 86123, 86746, 91315, 92094, 93077, 99094}
>
> Only 7,583 tests out of 7,519,726 are related to those zip codes through 2020-08-03.

@jsharpna helped check those zip codes. They are not valid zip codes according to https://tools.usps.com/zip-code-lookup.htm?citybyzipcode. Will ask Quidel about them.

@krivard (Contributor) commented Aug 6, 2020

Will email Quidel with all problems: bad zips, non-unique regions per device.

Fixing some of these requires merging or otherwise depending on #137, but that package doesn't include the home-state mappings for HRRs and MSAs that are used to fill in for insufficient sample size.

Hold off on finishing this until we can get the home-state mappings into the geo package.

overall_total.drop(labels="FluA", axis="columns", inplace=True)

# Compute numUniqueDevices
numUniqueDevices = df.groupby(
Contributor: use snake_case variable names.



def raw_tests_per_device(devices, tests, min_obs):
'''
Contributor: use double quotes.

@@ -0,0 +1,39 @@
# -*- coding: utf-8 -*-
"""Function to export the dataset in the format expected of the API.
Contributor: Super nitpick, but standardizing docstrings and general linting, if going one step further, can be nice for organization and readability.

I've mainly used flake8, but it looks like pylint is common in this repo; I imagine they're comparable.

zipcode = int(float(zipcode))
zipcode5.append(zipcode)
df['zip'] = zipcode5
# print('Fixing %.2f %% of the data' % (fixnum * 100 / len(zipcode5)))
Contributor: Is this debugging? Do the fixnum lines still need to exist?

@jingjtang (Author), Aug 16, 2020: This is used for checking only. For now, I still want it to be there, since Quidel might change their raw data.

Contributor: 👍

zipcode5 = []
fixnum = 0
for zipcode in df['ZipCode'].values:
if isinstance(zipcode, str) and '-' in zipcode:
Contributor: Do mixed types get read into the DataFrame, which is why this if/else exists? If so, is it worth reading everything in as str? If not, and the else isn't for NaNs, I'm unsure why the isinstance check exists.

Also, I think there might be a quicker way to do this with zfill, like str(zipcode).split("-")[0].zfill(5), though I'm not sure without knowing exactly what the raw input looks like.
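The zfill approach suggested above could look like the sketch below. It is only an illustration of the reviewer's one-liner, assuming zip codes arrive either as ints or as "XXXXX-XXXX" strings (as described later in this thread); `normalize_zip` is a hypothetical helper name.

```python
def normalize_zip(zipcode) -> str:
    """Return a 5-character zip string, zero-padded on the left.

    Handles both int inputs (e.g. 603) and ZIP+4 strings ("02572-1234").
    Caveat: float inputs like 603.0 would need int() first, since
    str(603.0) == "603.0".
    """
    return str(zipcode).split("-")[0].zfill(5)

print(normalize_zip(603))           # "00603"
print(normalize_zip("02572-1234"))  # "02572"
```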

@jingjtang (Author), Aug 16, 2020: Yes. Both ints and strings in the "XXXXX-XXXX" format exist for "ZipCode" in the raw data from Quidel. The reason I don't read it in as str is that we won't report the data at the zip-code level; zip codes are only used for geo mapping. It is easier to read them in as int and then merge the data with map_df, which also stores zip codes as int.

Contributor: 👍

else:
pooled_positives = tpooled_positives
pooled_tests = tpooled_tests
## STEP 2: CALCULATE AS THOUGH THEY'RE RAW
Contributor: I assume this is STEP 2 since the geo pooling had a STEP 1 in it, but it's a bit confusing since STEP 1 is somewhere else.

Co-authored-by: chinandrew <chinandrew96@gmail.com>
zipcode5.append(int(zipcode.split('-')[0]))
fixnum += 1
else:
zipcode = int(float(zipcode))
Contributor: Suggested change:
- zipcode = int(float(zipcode))
+ zipcode = int(zipcode)

pretty sure this works
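Whether that simplification is safe depends on what the raw values look like: `int()` on a float-formatted string raises `ValueError`, while going through `float()` first succeeds. A minimal demonstration:

```python
# int(float(...)) tolerates float-formatted strings such as "12345.0".
print(int(float("12345.0")))  # 12345

# int(...) alone only accepts plain integer strings.
try:
    int("12345.0")
except ValueError:
    print("int() alone rejects float-formatted strings")

print(int("12345"))  # 12345 — fine when the string is a plain integer
```

So the suggested `int(zipcode)` works only if the raw Quidel export never produces values like "12345.0" for that column.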

@krivard (Contributor) commented Aug 21, 2020

@amartyabasu, waiting on your review

@amartyabasu (Contributor):

> @amartyabasu, waiting on your review

I'll have it completed today.

EXPORT_DAY_RANGE = 40 # Number of dates to report

GEO_RESOLUTIONS = [
# "county",
Contributor: Is the county-based aggregation not done because of the small sample size?

@jingjtang (Author): Yes. There are few counties with sample sizes larger than 50.

"account": "delphi-datadrop@andrew.cmu.edu",
"password": "",
"sender": "",
"mode":"",
Contributor:

  • "mode":"" — extra comma at the end.
  • I ran the pipeline with pull_start_date: "2020-07-01" and export_start_date: "2020-06-01". The daily CSVs were generated from 20200711 onwards. Does that mean there was no data from 2020-07-01 to 2020-07-10?
  • According to the implementation, should export_start_date always precede pull_start_date to account for the backfills?
  • The 'flu_ag_smoothed_tests_per_device' signal does not report standard errors.

@jingjtang (Author):

  • Remember we only report a geo_id with a sample_size larger than 50. There will be data from 2020-07-01 to 2020-07-10, but it might not include a single geo_id with a sample size larger than 50.
  • Yes, export_start_date should always precede pull_start_date.
  • Yes. I'm not sure how se would be defined for that signal.
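The reporting rule described above (only geo_ids with sample_size of at least 50 are exported) can be sketched as a simple filter. This is a minimal illustration under assumed column names (`geo_id`, `sample_size`), not the pipeline's actual export code.

```python
import pandas as pd

MIN_SAMPLE_SIZE = 50  # threshold mentioned in the discussion

def filter_reportable(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows with a large enough sample size to report."""
    return df[df["sample_size"] >= MIN_SAMPLE_SIZE]

# Made-up daily data: "ak" is dropped, so a day can have rows pulled
# yet produce no exported CSV if nothing clears the threshold.
daily = pd.DataFrame({
    "geo_id": ["pa", "ak", "ny"],
    "sample_size": [120, 12, 50],
})
print(filter_reportable(daily)["geo_id"].tolist())  # ['pa', 'ny']
```

This is why CSVs can start later than pull_start_date: early days may have data but no geo_id passing the filter.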

# newdf["ResultID"] = np.nan
# if "FluTestNumber" not in newdf.keys():
# newdf["FluTestNumber"] = np.nan
# newdf["filename"] = name
Contributor: Are these commented lines still needed to handle the different column names generated in the email files?

@jingjtang (Author): No, they are not needed anymore. I will delete them.


def fix_date(df):
"""
Quidel Covid Test are labeled with Test Date and Storage Date.
Contributor: How is 'StorageDate' different from 'TestDate' such that this clean-up is needed? Maybe a one-line definition would help.

@jingjtang (Author): Thanks. I will add more explanation in the comments.

@jingjtang (Author): Done.

@krivard (Contributor) commented Aug 25, 2020

  • TODO: Verify tests pass and linter has no substantive complaints

@amartyabasu (Contributor):

> TODO: Verify tests pass and linter has no substantive complaints

Linter test:

  • Score 8.75/10. No mandatory messages need to be addressed.
  • Most of the messages concern single- and double-letter variable names that don't follow snake case, but these are variables like 'df' and 'se' that are used in other codebases as well.
  • There are 'too many local variables' messages in all files except geo_maps.py.

Pytest:

  • 89% coverage, covering all the important parts of the code.
  • Only the functions for reading historical data and emails are not covered.

res_group = res_group.merge(parent_group, how="left",
on="timestamp", suffixes=('', '_parent'))
res_group = res_group.drop(columns=[res_key, "state_id", "state_id" + '_parent'])
except:
Contributor: In my opinion, a simpler if/else block would work better in place of the try/except block when parent_group does not exist.
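The reviewer's suggestion could look like the sketch below: an explicit check for whether `parent_group` exists, instead of a bare `except`. Variable names (`res_group`, `parent_group`, `res_key`) follow the snippet above, but the surrounding pipeline logic is assumed, not reproduced (the original also drops the `state_id` columns, omitted here for brevity).

```python
import pandas as pd

def merge_with_parent(res_group, parent_group, res_key):
    """Attach parent-geo columns when a parent group exists; else pass through."""
    if parent_group is not None and not parent_group.empty:
        res_group = res_group.merge(parent_group, how="left",
                                    on="timestamp", suffixes=("", "_parent"))
        res_group = res_group.drop(columns=[res_key])
    return res_group

# Made-up data: child counts plus parent (e.g. state-level) counts.
child = pd.DataFrame({"timestamp": [1, 2], "zip": [15213, 15217], "n": [5, 7]})
parent = pd.DataFrame({"timestamp": [1, 2], "n": [50, 70]})
print(merge_with_parent(child, parent, "zip").columns.tolist())
# ['timestamp', 'n', 'n_parent']
```

The if/else makes the "no parent" path explicit and avoids a bare `except` silently swallowing unrelated errors (KeyError from a typo, for instance).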

@jingjtang (Contributor, Author):

> Linter test: Score 8.75/10, no mandatory messages. Pytest: 89% coverage.

How did you conduct this linter test to get that info?

@amartyabasu (Contributor):

> How did you conduct this linter test to get that info?

I simply ran pylint over the delphi_quidel_flutest module.

@jingjtang (Contributor, Author) commented Aug 26, 2020

> I simply ran pylint over the delphi_quidel_flutest module.

Weird, I didn't see those results. Could you try git pull and run it again?

@amartyabasu (Contributor):

> Weird, I didn't see those results. Could you try git pull and run it again?

I ran the project again after pulling the latest changes and got the same set of messages. Attaching a screenshot of my output. I don't think these are of any concern. :-)

(screenshot: pylint output)

@jingjtang (Contributor, Author):

I got 10/10 on my computer with:

pylint 2.6.0
astroid 2.4.2
Python 3.7.4

I don't understand why there is such a big difference in the linter results. Let's leave this problem here temporarily.
