
Make page level data available in BigQuery #10

Open · derekperkins opened this issue May 9, 2021 · 3 comments

@derekperkins

The origin-level data is already there, and BigQuery is perfectly suited for broader page-level analysis. The API quotas make it hard to analyze large sites.
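For context, a minimal sketch of the origin-level analysis that's already possible against the public `chrome-ux-report.all.YYYYMM` tables, using @google-cloud/bigquery (the 202104 month and the LCP metric are just example choices):

```ts
// Sketch: origin-level CrUX data is already queryable in BigQuery today.
import { BigQuery } from '@google-cloud/bigquery';

const bigquery = new BigQuery();

async function goodLcpShare(origin: string) {
  const query = `
    SELECT
      form_factor.name AS form_factor,
      SUM(lcp.density) AS good_density
    FROM \`chrome-ux-report.all.202104\`,
      UNNEST(largest_contentful_paint.histogram.bin) AS lcp
    WHERE origin = @origin
      AND lcp.start < 2500  -- loads under the 2.5 s "good" LCP threshold
    GROUP BY form_factor
  `;
  const [rows] = await bigquery.query({ query, params: { origin } });
  return rows;
}

goodLcpShare('https://example.com').then(console.log);
```

Nothing comparable exists for individual pages, which is what forces large sites onto the API.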

@rviscomi
Member

Unfortunately we're unable to add page-level data to BigQuery. Could you describe the API limitations you're hitting? Also are you doing any kind of rate limiting or batching?

@derekperkins
Author

The official API docs aren't super clear about quotas, but according to https://github.com/treosh/crux-api#batch-request, each individual request inside a batch counts towards the quota. Before today, everything I read seemed to point at pagespeedonline.googleapis.com/default being the relevant quota, limited to 25k requests / day. Today I found chromeuxreport.googleapis.com/default, which maxes out at 150 requests / min. I'm not sure how that interacts with the batch system, which lets you include up to 1,000 requests in a single call. At 150 / min, that works out to 216k requests / day (150 × 60 × 24), which doesn't allow for much analysis per URL if you want any segmentation by device, country, or connection speed. If I'm misunderstanding and the rate limit applies only once per batch, the ceiling would be 216M requests / day, which would be much better.
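A minimal sketch of staying under 150 requests / min against the raw queryRecord endpoint, assuming each URL costs one quota unit even inside a batch, per the treosh/crux-api reading above (the chunk-and-sleep throttling and PHONE form factor are just one illustrative approach):

```ts
// Sketch: throttling CrUX API calls to the 150 requests / min quota on
// chromeuxreport.googleapis.com/default.
const API = 'https://chromeuxreport.googleapis.com/v1/records:queryRecord';

async function queryRecord(apiKey: string, url: string, formFactor: string) {
  const res = await fetch(`${API}?key=${apiKey}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, formFactor }),
  });
  if (res.status === 404) return null; // URL not in the CrUX dataset
  if (!res.ok) throw new Error(`CrUX API error ${res.status}`);
  return res.json();
}

async function crawl(apiKey: string, urls: string[]) {
  const results: unknown[] = [];
  for (let i = 0; i < urls.length; i += 150) {
    const chunk = urls.slice(i, i + 150);
    results.push(...(await Promise.all(chunk.map((u) => queryRecord(apiKey, u, 'PHONE')))));
    if (i + 150 < urls.length) {
      await new Promise((r) => setTimeout(r, 60_000)); // wait out the minute window
    }
  }
  return results; // 150 URLs / min ≈ 216k / day, per the math above
}
```

At that rate, a 10k-URL site takes over an hour per form factor, before any segmentation by country or connection speed.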

> A way to query all URLs included in the CrUX database for a specific origin. It's a lot of hit and miss if you query specific URLs, and there are occasions where it's hard to understand which group of URLs are the worst offenders when looking at origin data.

https://twitter.com/jlhernando/status/1389648558614368258

As mentioned here, being able to query for coverage instead of individually hitting the API repeatedly would reduce the need for so much quota.

Are you able to share anything about the reasoning for not making page-level data available in BigQuery? Are there privacy concerns?

@rviscomi
Member

The Treo docs are correct that queries within a batched request still count towards the quota.

> As mentioned here, being able to query for coverage instead of individually hitting the API repeatedly would reduce the need for so much quota.

Could you elaborate on what you mean by "query for coverage"? Not sure if it's referring to getting feedback on current quota usage or a feature request for better coverage of URLs.

> Are you able to share anything about the reasoning for not making page-level data available in BigQuery? Are there privacy concerns?

Yeah, we would want to avoid anyone being able to say "show me all pages for a given origin", even if it's not their site. Site owners should know what all of their URLs are and how popular they are, so it should be possible to create an ordered list of URLs and query the most popular ones. Those are the URLs most likely to be included in the dataset, and they have the biggest influence over the site's aggregate CWV performance.
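For illustration, a minimal sketch of that workflow, assuming the site owner can export page-view counts from their own analytics (the `PageStat` shape and the 80% traffic-share cutoff are hypothetical):

```ts
// Sketch: build the popularity-ordered URL list described above from the site
// owner's own analytics data, then query only the head of the list.
type PageStat = { url: string; views: number };

function topUrlsByTraffic(stats: PageStat[], share = 0.8): string[] {
  const total = stats.reduce((sum, s) => sum + s.views, 0);
  const sorted = [...stats].sort((a, b) => b.views - a.views);
  const picked: string[] = [];
  let covered = 0;
  for (const s of sorted) {
    if (total > 0 && covered / total >= share) break; // enough traffic covered
    picked.push(s.url);
    covered += s.views;
  }
  // The head URLs are the ones most likely to be in the CrUX dataset and the
  // ones that dominate the origin's aggregate CWV numbers.
  return picked;
}
```

Feeding that list into rate-limited API queries keeps quota spend proportional to the traffic actually covered.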
