-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
first draft of splitting NWSS signals #1946
base: main
Are you sure you want to change the base?
Conversation
Minh confirmed that this runs on her machine and that the output looks reasonable. A couple of things to do before we merge this:
|
df5aa19
to
e86e5fa
Compare
495f183
to
07c6c90
Compare
Currently not passing because of the same update that made nssp tests fail. Once that's merged and this is rebased it will pass. |
890e8b8
to
402f2ab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments about style and comments. The actual functionality here looks fine.
This hasn't been released at all yet, right? You previously did statistical review on this; is there anything else we want to do for these new signals?
PROVIDER_NORMS = { | ||
"provider": ["CDC_VERILY", "CDC_VERILY", "NWSS", "NWSS", "WWS"], | ||
"normalization": [ | ||
"flow-population", | ||
"microbial", | ||
"flow-population", | ||
"microbial", | ||
"microbial", | ||
], | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: my preference would be to tie the provider names and normalization types together more closely as pairs, maybe a list of length-2 tuples or a dict like. This can be used with very little modification to the run.run_module
for
loop.
{"CDC_VERILY": ("flow-population", ("microbial"), "NWSS": (...), ...}
SIGNALS = ["pcr_conc_smoothed"] | ||
METRIC_SIGNALS = ["detect_prop_15d", "percentile", "ptc_15d"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: What is the distinction between these two signal sets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: I find these signal names hard to parse. I'd prefer longer, more descriptive names. Our final signal names will include source and normalization method, though, so maybe they'd get too long?
Worth more thought. Have you checked these names with Roni yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the distinction between these two signal sets?
They're from two separate socrata APIs.
I haven't run the names by Roni, that's a good idea. The names are based on mirroring the original dataset's names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might have been left out of the "how to make an indicator" doc, but officially we're supposed to check signal names with Roni.
"""Add identifier columns. | ||
|
||
Add columns to get more detail than key_plot_id gives; | ||
specifically, state, and `provider_normalization`, which gives the signal identifier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: more detail here. I'm just guessing at the format and processing, as an example of what to include, so please check.
"""Add identifier columns. | |
Add columns to get more detail than key_plot_id gives; | |
specifically, state, and `provider_normalization`, which gives the signal identifier | |
"""Parse `key_plot_id` to create several key columns | |
`key_plot_id` is of format "<state>_<provider>_<normalization>". We split by `_` and put each resulting item into its own column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it, leaving this unresolved for feedback
agg_df["geo_id"] = "us" | ||
return agg_df | ||
|
||
|
||
def add_needed_columns(df, col_names=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (optional): to make this more robust, add assert to make sure that our set of missing column names doesn't include important ones (like geo_id and value). Since this is out of scope, worth making an issue for
logger.info("Generating signal and exporting to CSV", metric=full_sensor_name) | ||
if geo == "nation": | ||
df_prov_norm["nation"] = "us" | ||
agg_df = geomapper.aggregate_by_weighted_sum( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: looks like we're aggregating for all geos (state and nation). Is the base geo type not reportable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The base geo is (state, wwtp_id)
(wastewater treatment plant id), which doesn't really match any other source at all. I was planning on waiting to add that, possibly outside of covidcast-indicators. Do you think just adding it to covidcast-indicators is a good idea?
if "archive" in params: | ||
_, common_diffs, new_files = arch_diff.diff_exports() | ||
to_archive = [f for f, diff in common_diffs.items() if diff is not None] | ||
to_archive += new_files | ||
_, fails = arch_diff.archive_exports(to_archive) | ||
succ_common_diffs = {f: diff for f, diff in common_diffs.items() if f not in fails} | ||
arch_diff.filter_exports(succ_common_diffs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: same thing about the runner script
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I don't follow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh this was commented in the other order, I'll just delete this then. There are several endpoints which include this btw
if "archive" in params: | ||
daily_arch_diff = S3ArchiveDiffer( | ||
arch_diff = S3ArchiveDiffer( | ||
params["archive"]["cache_dir"], | ||
export_dir, | ||
params["archive"]["bucket_name"], | ||
"nchs_mortality", | ||
"nwss_wastewater", | ||
params["archive"]["aws_credentials"], | ||
) | ||
daily_arch_diff.update_cache() | ||
arch_diff.update_cache() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: you don't need to run the archive differ, the runner script takes care of that.
It has not. There's the corresponding docs, which I think Will read through at one point. |
Description
This splits the dataset based on the provider and normalization (not every pair is actually present), and adds the metric signals. The resulting signals are called:
Some of these can have negative values; for e.g.
ptc_15d
, the values are small enough that I expect these may actually be exponents. Still looking into why the concentration data has negative values, which are too large to make sense as exponents.Fixes
generate_weights
in the case where some weights are negative, and added a test for it