first draft of splitting NWSS signals #1946

Open: wants to merge 13 commits into base: main

dsweber2
Contributor

Description

This splits the dataset based on the provider and normalization (not every pair is actually present), and adds the metric signals. The resulting signals are called:

  • pcr_conc_smoothed_CDC_VERILY_flow-population
  • detect_prop_15d_CDC_VERILY_flow-population
  • percentile_CDC_VERILY_flow-population
  • ptc_15d_CDC_VERILY_flow-population
  • pcr_conc_smoothed_CDC_VERILY_microbial
  • detect_prop_15d_CDC_VERILY_microbial
  • percentile_CDC_VERILY_microbial
  • ptc_15d_CDC_VERILY_microbial
  • pcr_conc_smoothed_NWSS_flow-population
  • detect_prop_15d_NWSS_flow-population
  • percentile_NWSS_flow-population
  • ptc_15d_NWSS_flow-population
  • pcr_conc_smoothed_NWSS_microbial
  • detect_prop_15d_NWSS_microbial
  • percentile_NWSS_microbial
  • ptc_15d_NWSS_microbial
  • pcr_conc_smoothed_WWS_microbial
  • detect_prop_15d_WWS_microbial
  • percentile_WWS_microbial
  • ptc_15d_WWS_microbial

Some of these can have negative values. For ptc_15d, for example, the values are small enough that I expect they may actually be exponents. I'm still looking into why the concentration data has negative values, which are too large to make sense as exponents.

Fixes

  • Fix generate_weights in the case where some weights are negative, and add a test for it

dsweber2 changed the title from "first draft of splitting signals" to "first draft of splitting NWSS signals" on Feb 23, 2024
Contributor Author

dsweber2 commented Mar 11, 2024

Minh confirmed that this runs on her machine and that the output looks reasonable. A couple of things to do before we merge this:

  • figure out the right cut-off for significant digits so that the noise is removed
  • see what values the formerly negative concentrations became
  • write docs similar to this PR
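
For the first item, one possible way to trim noise is to round to a fixed number of significant digits. This is only a minimal sketch: `round_sig` is a hypothetical helper that is not part of this PR, and the default of 4 digits is an arbitrary placeholder, since the right cut-off is still undecided.

```python
import math


def round_sig(value, sig=4):
    """Round `value` to `sig` significant digits.

    Hypothetical helper for illustration; not taken from the PR.
    Zero, NaN and infinities are returned unchanged.
    """
    if value == 0 or not math.isfinite(value):
        return value
    # Position of the leading digit, e.g. 3 for 1234.5 and -3 for 0.0012.
    magnitude = math.floor(math.log10(abs(value)))
    return round(value, sig - 1 - magnitude)
```

Working in significant digits rather than decimal places keeps the same relative precision for very small and very large values.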

dsweber2 force-pushed the splittingNWSSSignals branch 3 times, most recently from df5aa19 to e86e5fa, on March 29, 2024 18:36
dsweber2 requested a review from dshemetov on April 16, 2024 18:40
Contributor Author

Currently not passing because of the same update that made nssp tests fail. Once that's merged and this is rebased it will pass.

nmdefries requested review from nmdefries and removed request for melange396 and minhkhul on June 6, 2024 16:27
Contributor

nmdefries left a comment

Some comments about style and code comments. The actual functionality here looks fine.

This hasn't been released at all yet, right? You previously did statistical review on this; is there anything else we want to do for these new signals?

Comment on lines +15 to 24
PROVIDER_NORMS = {
    "provider": ["CDC_VERILY", "CDC_VERILY", "NWSS", "NWSS", "WWS"],
    "normalization": [
        "flow-population",
        "microbial",
        "flow-population",
        "microbial",
        "microbial",
    ],
}
Contributor

suggestion: my preference would be to tie the provider names and normalization types together more closely as pairs, maybe as a list of length-2 tuples or a dict like the one below. This can be used with very little modification to the run.run_module for loop.

{"CDC_VERILY": ("flow-population", "microbial"), "NWSS": (...), ...}
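
A minimal sketch of how such a mapping could drive the loop. The dict contents are taken from the provider/normalization pairs listed in the PR description; the helper names `provider_norm_pairs` and `full_signal_names` are hypothetical, and the `<signal>_<provider>_<normalization>` naming is inferred from the signal list above rather than from the module code.

```python
# Hypothetical restructuring: each provider maps to the normalizations it reports.
PROVIDER_NORMS = {
    "CDC_VERILY": ("flow-population", "microbial"),
    "NWSS": ("flow-population", "microbial"),
    "WWS": ("microbial",),
}

SIGNALS = ["pcr_conc_smoothed"]
METRIC_SIGNALS = ["detect_prop_15d", "percentile", "ptc_15d"]


def provider_norm_pairs(provider_norms):
    """Flatten the mapping into (provider, normalization) pairs."""
    return [
        (provider, norm)
        for provider, norms in provider_norms.items()
        for norm in norms
    ]


def full_signal_names(sensors, provider_norms):
    """Build names of the form <signal>_<provider>_<normalization>."""
    return [
        f"{sensor}_{provider}_{norm}"
        for provider, norm in provider_norm_pairs(provider_norms)
        for sensor in sensors
    ]
```

Because only pairs that actually exist appear in the dict, there is no way for the two parallel lists to drift out of step.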

Comment on lines 13 to 14
SIGNALS = ["pcr_conc_smoothed"]
METRIC_SIGNALS = ["detect_prop_15d", "percentile", "ptc_15d"]
Contributor

question: What is the distinction between these two signal sets?

Contributor

suggestion: I find these signal names hard to parse. I'd prefer longer, more descriptive names. Our final signal names will include source and normalization method, though, so maybe they'd get too long?

Worth more thought. Have you checked these names with Roni yet?

Contributor Author

> What is the distinction between these two signal sets?

They're from two separate Socrata APIs.

I haven't run the names by Roni, that's a good idea. The names are based on mirroring the original dataset's names.

Contributor

It might have been left out of the "how to make an indicator" doc, but officially we're supposed to check signal names with Roni.

nwss_wastewater/delphi_nwss/pull.py (resolved)
nwss_wastewater/delphi_nwss/pull.py (resolved)
Comment on lines +67 to +70
"""Add identifier columns.

Add columns to get more detail than key_plot_id gives;
specifically, state, and `provider_normalization`, which gives the signal identifier
Contributor

suggestion: more detail here. I'm just guessing at the format and processing, as an example of what to include, so please check.

Suggested change
- """Add identifier columns.
- Add columns to get more detail than key_plot_id gives;
- specifically, state, and `provider_normalization`, which gives the signal identifier
+ """Parse `key_plot_id` to create several key columns
+ `key_plot_id` is of format "<state>_<provider>_<normalization>". We split by `_` and put each resulting item into its own column.
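
If the guessed format holds, note that a plain split on `_` would also break up the provider name `CDC_VERILY`. A minimal sketch of one way around that, assuming the `<state>_<provider>_<normalization>` layout guessed above (the real `key_plot_id` may well differ) and a hypothetical helper `parse_key_plot_id` that is not in the PR:

```python
def parse_key_plot_id(key_plot_id):
    """Split an id of the assumed form <state>_<provider>_<normalization>.

    Hypothetical helper for illustration. The state is taken from the left
    and the normalization from the right, so underscores inside the provider
    name (e.g. "CDC_VERILY") are preserved.
    """
    state, rest = key_plot_id.split("_", 1)
    provider, normalization = rest.rsplit("_", 1)
    return {
        "state": state,
        "provider": provider,
        "normalization": normalization,
        "provider_normalization": f"{provider}_{normalization}",
    }
```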

Contributor Author

got it, leaving this unresolved for feedback

    agg_df["geo_id"] = "us"
    return agg_df


def add_needed_columns(df, col_names=None):
Contributor

suggestion (optional): to make this more robust, add an assert to make sure that our set of missing column names doesn't include important ones (like geo_id and value). Since this is out of scope, it's worth opening an issue for it.
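
A minimal sketch of the kind of guard meant here. Everything in it is an assumption for illustration: the helper name `check_fill_columns`, the protected set (`geo_id` and `value`, as named above), and the idea that it would be called on `col_names` before `add_needed_columns` fills those columns.

```python
# Hypothetical set of columns that must never be blanked out.
PROTECTED_COLUMNS = frozenset({"geo_id", "value"})


def check_fill_columns(col_names, protected=PROTECTED_COLUMNS):
    """Return `col_names` as a list, refusing any protected column.

    Hypothetical guard, not part of the PR. It raises instead of using a
    bare `assert` so the check still runs when Python is started with -O.
    """
    clash = sorted(set(col_names) & set(protected))
    if clash:
        raise ValueError(f"refusing to overwrite important columns: {clash}")
    return list(col_names)
```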

logger.info("Generating signal and exporting to CSV", metric=full_sensor_name)
if geo == "nation":
df_prov_norm["nation"] = "us"
agg_df = geomapper.aggregate_by_weighted_sum(
Contributor

question: looks like we're aggregating for all geos (state and nation). Is the base geo type not reportable?

Contributor Author

The base geo is (state, wwtp_id), where wwtp_id is the wastewater treatment plant ID; it doesn't really match any other source. I was planning on waiting to add that, possibly outside of covidcast-indicators. Do you think just adding it to covidcast-indicators is a good idea?

Comment on lines +119 to +125
if "archive" in params:
    _, common_diffs, new_files = arch_diff.diff_exports()
    to_archive = [f for f, diff in common_diffs.items() if diff is not None]
    to_archive += new_files
    _, fails = arch_diff.archive_exports(to_archive)
    succ_common_diffs = {f: diff for f, diff in common_diffs.items() if f not in fails}
    arch_diff.filter_exports(succ_common_diffs)
Contributor

issue: same thing about the runner script

Contributor Author

sorry I don't follow?

Contributor Author

oh this was commented in the other order, I'll just delete this then. There are several endpoints which include this btw

nwss_wastewater/delphi_nwss/run.py (resolved)
Comment on lines 80 to +88
 if "archive" in params:
-    daily_arch_diff = S3ArchiveDiffer(
+    arch_diff = S3ArchiveDiffer(
         params["archive"]["cache_dir"],
         export_dir,
         params["archive"]["bucket_name"],
-        "nchs_mortality",
+        "nwss_wastewater",
         params["archive"]["aws_credentials"],
     )
-    daily_arch_diff.update_cache()
+    arch_diff.update_cache()
Contributor

issue: you don't need to run the archive differ; the runner script takes care of that.

Contributor Author

dsweber2 commented Jun 7, 2024

> This hasn't been released at all yet, right? You previously did statistical review on this; is there anything else we want to do for these new signals?

It has not. There are the corresponding docs, which I think Will read through at one point.

3 participants