first draft of splitting NWSS signals #1946

dsweber2 · 2024-02-23T00:52:37Z

Description

This splits the dataset based on the provider and normalization (not every pair is actually present), and adds the metric signals. The resulting signals are called:

pcr_conc_smoothed_CDC_VERILY_flow-population
detect_prop_15d_CDC_VERILY_flow-population
percentile_CDC_VERILY_flow-population
ptc_15d_CDC_VERILY_flow-population
pcr_conc_smoothed_CDC_VERILY_microbial
detect_prop_15d_CDC_VERILY_microbial
percentile_CDC_VERILY_microbial
ptc_15d_CDC_VERILY_microbial
pcr_conc_smoothed_NWSS_flow-population
detect_prop_15d_NWSS_flow-population
percentile_NWSS_flow-population
ptc_15d_NWSS_flow-population
pcr_conc_smoothed_NWSS_microbial
detect_prop_15d_NWSS_microbial
percentile_NWSS_microbial
ptc_15d_NWSS_microbial
pcr_conc_smoothed_WWS_microbial
detect_prop_15d_WWS_microbial
percentile_WWS_microbial
ptc_15d_WWS_microbial
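
For reference, the list above can be reproduced from the constants this PR adds. This is only a sketch: `signal_names` is a hypothetical helper and not something in the indicator code, and the `<signal>_<provider>_<normalization>` pattern is inferred from the names listed here.

```python
# Constants copied from nwss_wastewater/delphi_nwss/constants.py in this PR.
SIGNALS = ["pcr_conc_smoothed"]
METRIC_SIGNALS = ["detect_prop_15d", "percentile", "ptc_15d"]
PROVIDER_NORMS = {
    "provider": ["CDC_VERILY", "CDC_VERILY", "NWSS", "NWSS", "WWS"],
    "normalization": [
        "flow-population",
        "microbial",
        "flow-population",
        "microbial",
        "microbial",
    ],
}


def signal_names():
    """Build every signal name (hypothetical helper, for illustration only)."""
    # Pair up the two parallel lists so only the listed combinations are used.
    pairs = zip(PROVIDER_NORMS["provider"], PROVIDER_NORMS["normalization"])
    return [
        f"{signal}_{provider}_{normalization}"
        for provider, normalization in pairs
        for signal in SIGNALS + METRIC_SIGNALS
    ]


names = signal_names()
```

Five provider/normalization pairs times four signals gives the 20 names listed.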

Some of these can have negative values. For ptc_15d, for example, the values are small enough that I expect they may actually be exponents. I'm still looking into why the concentration data has negative values, which are too large to make sense as exponents.

Fixes

Fixed generate_weights for the case where some weights are negative, and added a test for it.

dsweber2 · 2024-03-11T17:59:03Z

Minh confirmed that this runs on her machine and that the output looks reasonable. A couple of things to do before we merge this:

figure out the right cut-off for sig digits so that the noise is removed
see what values the formerly negative concentrations became
write docs similar to this PR
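
For the sig-digits item, a minimal sketch of rounding to a fixed number of significant digits. `round_sig` is a hypothetical helper, not part of the indicator, and the actual cut-off is still undecided, so the digit counts below are placeholders.

```python
import math


def round_sig(value, digits):
    """Round `value` to `digits` significant digits (hypothetical helper)."""
    # Zero, NaN and infinities have no meaningful leading digit; pass them through.
    if value == 0 or not math.isfinite(value):
        return value
    # Position of the leading digit, e.g. 5 for 123456.789 and -3 for 0.00123.
    exponent = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - exponent)


rounded_large = round_sig(123456.789, 3)
rounded_small = round_sig(0.00123456, 2)
rounded_negative = round_sig(-4.5678, 2)
```

Rounding by significant digits rather than decimal places keeps the same relative precision for large concentrations and small proportions.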

dsweber2 · 2024-05-14T22:01:45Z

Currently not passing because of the same update that made nssp tests fail. Once that's merged and this is rebased it will pass.

nmdefries

Some comments about style and code comments. The actual functionality here looks fine.

This hasn't been released at all yet, right? You previously did statistical review on this; is there anything else we want to do for these new signals?

nmdefries · 2024-06-07T15:26:42Z

nwss_wastewater/delphi_nwss/constants.py

+PROVIDER_NORMS = {
+    "provider": ["CDC_VERILY", "CDC_VERILY", "NWSS", "NWSS", "WWS"],
+    "normalization": [
+        "flow-population",
+        "microbial",
+        "flow-population",
+        "microbial",
+        "microbial",
+    ],
 }


suggestion: my preference would be to tie the provider names and normalization types together more closely as pairs, maybe as a list of length-2 tuples or a dict like the one below. Either can be used with very little modification to the for loop in run.run_module.

{"CDC_VERILY": ("flow-population", "microbial"), "NWSS": (...), ...}
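
To make the dict option concrete, here is a rough sketch that derives it from the current parallel lists and then walks it pair by pair. `group_by_provider` and `PROVIDER_TO_NORMS` are made-up names, and the loop only shows the shape of the iteration; I have not checked it against the actual loop in run.run_module.

```python
from collections import defaultdict

# Current structure from constants.py: two parallel lists.
PROVIDER_NORMS = {
    "provider": ["CDC_VERILY", "CDC_VERILY", "NWSS", "NWSS", "WWS"],
    "normalization": [
        "flow-population",
        "microbial",
        "flow-population",
        "microbial",
        "microbial",
    ],
}


def group_by_provider(provider_norms):
    """Map each provider to a tuple of its normalizations (hypothetical helper)."""
    grouped = defaultdict(list)
    for provider, normalization in zip(
        provider_norms["provider"], provider_norms["normalization"]
    ):
        grouped[provider].append(normalization)
    return {provider: tuple(norms) for provider, norms in grouped.items()}


PROVIDER_TO_NORMS = group_by_provider(PROVIDER_NORMS)

# Iterating the dict visits the same five pairs as the parallel lists.
pairs = [
    (provider, normalization)
    for provider, normalizations in PROVIDER_TO_NORMS.items()
    for normalization in normalizations
]
```

With the dict, a provider cannot be listed with a normalization that belongs to a different position in the other list, which is the main risk of the parallel-list form.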

nmdefries · 2024-06-07T15:30:14Z

nwss_wastewater/delphi_nwss/constants.py

 SIGNALS = ["pcr_conc_smoothed"]
 METRIC_SIGNALS = ["detect_prop_15d", "percentile", "ptc_15d"]


question: What is the distinction between these two signal sets?

suggestion: I find these signal names hard to parse. I'd prefer longer, more descriptive names. Our final signal names will include source and normalization method, though, so maybe they'd get too long?

Worth more thought. Have you checked these names with Roni yet?

What is the distinction between these two signal sets?

They're from two separate Socrata APIs.

I haven't run the names by Roni, that's a good idea. The names are based on mirroring the original dataset's names.

It might have been left out of the "how to make an indicator" doc, but officially we're supposed to check signal names with Roni.

nmdefries · 2024-06-07T16:01:19Z

nwss_wastewater/delphi_nwss/pull.py

+    """Add identifier columns.
+
+    Add columns to get more detail than key_plot_id gives;
+    specifically, state, and `provider_normalization`, which gives the signal identifier


suggestion: more detail here. I'm just guessing at the format and processing, as an example of what to include, so please check.

Suggested change

-    """Add identifier columns.
-
-    Add columns to get more detail than key_plot_id gives;
-    specifically, state, and `provider_normalization`, which gives the signal identifier
+    """Parse `key_plot_id` to create several key columns
+
+    `key_plot_id` is of format "<state>_<provider>_<normalization>". We split by `_` and put each resulting item into its own column.

got it, leaving this unresolved for feedback
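
One thing worth checking while this is open: if the format really is "<state>_<provider>_<normalization>" (that format is the reviewer's guess above and has not been confirmed against the data), a plain split on `_` would cut a provider such as CDC_VERILY in two. A minimal sketch that avoids this by splitting once from each end; `parse_key_plot_id` is a hypothetical helper and the example IDs are invented.

```python
def parse_key_plot_id(key_plot_id):
    """Split an ID of the assumed form "<state>_<provider>_<normalization>".

    Hypothetical helper: the real key_plot_id layout may differ.
    """
    # The state is everything before the first underscore.
    state, _, rest = key_plot_id.partition("_")
    # The normalization is everything after the last underscore, so any
    # underscores left in the middle stay inside the provider name.
    provider, _, normalization = rest.rpartition("_")
    return state, provider, normalization


# Invented example IDs, for illustration only.
parsed_verily = parse_key_plot_id("ca_CDC_VERILY_flow-population")
parsed_nwss = parse_key_plot_id("ny_NWSS_microbial")
```

This only works if neither the state nor the normalization contains an underscore, which holds for the normalization names used in this PR.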

nmdefries · 2024-06-07T16:43:51Z