Consider channel-specific main_v4 tables, filtering probes per channel #1216

Open
jklukas opened this issue Mar 26, 2020 · 4 comments

Comments

@jklukas
Contributor

jklukas commented Mar 26, 2020

I'm under the impression that a large number of potential probes (the majority?) in the main ping are not included in release. Or maybe they're included in release, but we basically never see values there because you have to opt in, whereas those probes are on by default in nightly and beta.

I think it would be worth spending a little time on this to see whether a main_v4_release table could have a significantly smaller schema if we never added opt-in probes to it. The problematic large schema would then be confined to a main_v4_nightly table, which holds a much smaller volume of data.

@jklukas
Contributor Author

jklukas commented Mar 26, 2020

cc @acmiyaguchi Do you have a good sense of what percentage of probes we'd expect to never be populated in main pings?

@acmiyaguchi
Contributor

I'm not familiar with the sparsity of the main ping, but I expect a fairly large fraction of columns are empty. Because of how we initially created the table, we have been using all probes since Firefox 30 when generating the schema.

The probe dictionary stats show the number of reported probes. I also took a look at the mozaggregator dumps to see the distribution of total users per probe. We haven't had release data in there for over a year, so I looked at log(total) for all of the non-keyed probes on 2018-11-01.

[Figure: distribution of log(total users) per non-keyed probe, from the mozaggregator dump for 2018-11-01]

The plot gives some reference for how populated each column in the main ping table should be. For reference, e^16 ≈ 8.9 million, so roughly 1000 probes have counts > e^16. Because of when these aggregates were collected, this doesn't take into account the large spike in the number of reported probes in recent versions.
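As a quick sanity check on that threshold, here is a minimal sketch that converts a log threshold back into a user count and lists the probes above it. The `totals` mapping and the probe names in it are made-up illustration data, not values from the mozaggregator dump.

```python
import math


def probes_above(totals, log_threshold):
    """Return names of probes whose log(total users) exceeds log_threshold."""
    return [
        name
        for name, total in totals.items()
        if total > 0 and math.log(total) > log_threshold
    ]


# Hypothetical per-probe user totals, for illustration only.
totals = {"PROBE_A": 12_000_000, "PROBE_B": 500_000, "PROBE_C": 0}

# User count that corresponds to log(total) = 16.
print(round(math.exp(16)))
# Probes from the toy data that clear the threshold.
print(probes_above(totals, 16))
```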

If we counted the number of null values for each probe in the main ping, I expect a large fraction of the columns would be unpopulated in release. I think opt-in would account for about 1-2k of these columns, and another 1k or so would be sparse due to expired probes. The fraction of rows populated with expired probes would probably be related to the relative size of old versions in the table. We probably don't care about older probes, so it may be good to cut these too if we wanted to make a more compact table.

I can take a look and gather more definitive stats. There is some code for generating the clients_daily_scalar_aggregates and clients_daily_histogram_aggregates in GLAM that could be repurposed for counting the probe columns.
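To make the counting idea concrete, here is a rough sketch of the per-column null count. It is not the GLAM code; it works on in-memory dicts purely to illustrate the logic, and the function name, column names, and sample rows are all hypothetical. In practice this would run as a query over the main ping table.

```python
from collections import Counter


def null_fraction_per_column(rows, columns):
    """Return {column: fraction of rows where the column is missing or None}.

    Returns an empty dict when there are no rows, since the fraction is
    undefined in that case.
    """
    nulls = Counter()
    n_rows = 0
    for row in rows:
        n_rows += 1
        for col in columns:
            if row.get(col) is None:
                nulls[col] += 1
    if n_rows == 0:
        return {}
    return {col: nulls[col] / n_rows for col in columns}


# Hypothetical pings: each dict maps a probe column to its reported value.
sample_rows = [
    {"probe_a": 1, "probe_b": None},
    {"probe_a": None},
    {"probe_a": 3, "probe_b": 2},
]
fractions = null_fraction_per_column(sample_rows, ["probe_a", "probe_b"])
print(fractions)
```

Columns whose null fraction is close to 1.0 on release would be the candidates to drop from a more compact table.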

@fbertsch
Contributor

There is a quick and dirty way to see what the schema-generator will decide is a release probe:

In [30]: len([p for p in requests.get('https://probeinfo.telemetry.mozilla.org/firefox/release/main/all_probes').json().items() if p[1]['type'] != 'event' and p[1]['history']['release'][0]['optout']])
Out[30]: 2898

@fbertsch
Contributor

fbertsch commented Mar 27, 2020

For reference, this is a bit more than half of all non-event probes (2898 of 5092, about 57%):

In [31]: len([p for p in requests.get('https://probeinfo.telemetry.mozilla.org/firefox/release/main/all_probes').json().items() if p[1]['type'] != 'event'])
Out[31]: 5092
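The two one-liners can also be folded into one small function, sketched below. It assumes the same JSON layout the snippets rely on (`type`, and `history` → `release` → first entry → `optout`); the function name and the sample probes are hypothetical, and this is not the schema-generator's actual logic.

```python
def count_release_probes(probes):
    """Return (optout_count, non_event_count) for a probe-info mapping.

    `probes` maps probe id -> probe definition, in the shape returned by the
    all_probes endpoint. Like the one-liners, this raises KeyError if a
    non-event probe has no 'release' history.
    """
    non_event = [p for p in probes.values() if p["type"] != "event"]
    optout = [p for p in non_event if p["history"]["release"][0]["optout"]]
    return len(optout), len(non_event)


# Hypothetical probe definitions, for illustration only.
sample_probes = {
    "histogram/EXAMPLE_ONE": {
        "type": "histogram",
        "history": {"release": [{"optout": True}]},
    },
    "scalar/example.two": {
        "type": "scalar",
        "history": {"release": [{"optout": False}]},
    },
    "event/example.three": {
        "type": "event",
        "history": {"release": [{"optout": True}]},
    },
}

optout_count, total_count = count_release_probes(sample_probes)
print(optout_count, total_count, optout_count / total_count)
```

Passing in the parsed `.json()` response from the request shown earlier should reproduce both counts in a single pass.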
