Skip to content

Analysts' Guide

Max Ostapenko edited this page May 17, 2024 · 47 revisions

Analysts are responsible for researching the stats and trends used throughout the Almanac. Analysts work closely with authors and reviewers during the planning phase to give direction on the types of stats that are possible from the dataset, and during the analyzing/writing phases to ensure that the stats are used correctly.

Table of Contents

Commitment summary

  • Content planning: about 4 hours working with the content team to identify metrics from the HTTP Archive that could be used to support the chapter. This time is also spent preparing queries and augmenting the testing infrastructure as needed (custom metrics). This phase typically occurs in June-July.
  • Data validation: about 6 hours finalizing queries, running them against the dataset, and building a spreadsheet of results for the rest of the content team to review. This phase typically occurs in July-September.
  • Content writing: about 2 hours reviewing the drafts to ensure that the results are being interpreted correctly. This phase typically occurs in September-October.

The total commitment is approximately 30 hours of work per chapter, spread over 6 months.

How to join

The 2021 project is underway and we're actively looking for analysts! Browse the list of open chapters and comment in any chapter-specific issues that interest you.

Expensing queries

The Analysts team will be given BigQuery coupon codes to offset the cost of analysis. To get your code, join the #web-almanac-analysts channel on the HTTP Archive Slack and ask @Paul Calvano.

Analysis workflow

There could be dozens of queries per chapter, so to help stay organized here is a recommended workflow for analysts to get the queries written and the results shared.

  1. When the chapter is outlined, work with the content team to identify the quantifiable metrics that would be needed.
  2. Add each metric to the Metrics section using data-oriented descriptions of what it's measuring, for example "% of mobile pages that have images larger than 10 MB".
  3. Triage the feasibility of each metric based on the capabilities of the HTTP Archive dataset. Feel free to ask around in #web-almanac-analysts if you're not sure whether a metric is feasible.
  4. Create a git branch off main named <chapter>-sql-<yyyy> (or similar) and create a subdirectory for your chapter under the sql directory named <yyyy>/<chapter-name>. For example, 2022/css.
  5. Open up a pull request to merge with the main branch, but leave it in draft mode for now. Edit the PR description to contain a checklist of all of the metrics you triaged as feasible in the doc. See #1087, for example.
  6. As you write the queries for each metric, add a comment to the top of the query using the description from step 2, and give the file a short and descriptive name, like pct_large_images.sql. (Note that in 2019 we used a system of metric IDs like 01_22.sql, which made it hard to understand what a query did by its name.)
  7. When all of the queries have been written, request a review from the @HTTPArchive/analysts team.
  8. When the 2020_08_01 HTTP Archive crawl is complete at the end of August, run each query and save its results to your designated spreadsheet linked from the chapter issue. Each metric's results should be in its own tab, which is named using the file name (eg pct large images).
  9. To help others interpret the results, leave a comment on the top row of the results with the correct interpretation. For example: "Read as: 4.28% of mobile pages contain images larger than 10 MB".
  10. Authors and reviewers should check the results to make sure they're within expected limits, otherwise the query may need to be rewritten.
  11. When the authors have drafted the chapter, they should leave a placeholder for the data visualization. Generate a chart using the templates provided, publish it to get its embeddable HTML, and replace the placeholder.

Dates to use

The Web Almanac is an annual publication and we try to enforce a common HTTP Archive run to query for consistency across the Almanac. There has been some suggestions and discussion about making it less annual, or publishing different chapters at different time of years but, for now, it's annual primarily based around the one month's dataset.

We try to use the July dataset for each year, but there have been exceptions, either due to issues with that particular dataset making it less reliable (2020), or when we decided to publish earlier than usual (2022). Even within years, there are some exceptions. Below are the dates for each "Almanac Year" and the exceptions.

Almanac Edition Dataset to use Exceptions
2019 2019_07_01
2020 2020_08_01 HTTP Archive performance stats were incorrect for Aug and 2020-09-01 should be used for this (CrUX data was fine, and should really be used in preference to lab-based results anyway).
2021 2021_07_01 WASM used 2021_09_01
2022 2022_06_01 Parsed CSS data was not available for June and July data should be used. Also 2022_06_09 contains home AND secondary pages for those chapters wanting to use that (the home page match the data fro 2022_06_01 as the official run kicked off later in the month and we just marked them as 2022_06_01).

We have a linter to attempt to block incorrect usage when copying queries from year to year. You can prevent the linter from flagging in these edge cases by adding --noqa: L063 to the end of the SQL line with the "incorrect" date.

Query conventions

#standardSQL
# The number of unique websites in the July 2019 crawl.
SELECT
  _TABLE_SUFFIX AS client,
  COUNT(DISTINCT url) AS num_pages
FROM
  `httparchive.summary_pages.2019_07_01_*`
GROUP BY
  client
  1. The first line should be #standardSQL so the query is interpreted as Standard SQL syntax
  2. The next line(s) should be a comment describing what the query is analyzing and the metric ID(s) if relevant
  3. Use uppercase for all SQL keywords (SELECT, COUNT, AS, FROM, etc)
  4. Fields, and tables, are descriptively named in snake_case (lowercase and underscore-delimited)
  5. Function names are descriptively named in camelCase (lowercase in general with uppercase letter indicating new word) to differentiate them from fields and tables. Do not reuse the exact same function name for different purposes in different queries. See below for more information on user defined functions.
  6. Table names should always specify the project (httparchive) and include both desktop and mobile when possible by using a wildcard (*)
  7. Separate the desktop and mobile results by grouping by client. When using a wildcard, this will be the _TABLE_SUFFIX special value.
  8. Filenames should have a .sql extension for syntax highlighting in GitHub and other editors, and also allow them to be recognised by the linter (see below).
  9. Filenames should not have spaces in them to make them easier to reference from the web.

Mobile or Desktop or Both?

In general we would like both data sets queried. It is tempting for some stats to only run one, whether there should not be any difference between the two (e.g. HTTP version supported, or looking at robots.txt), but that makes the assumption that we tests the same sites on both! There is a huge overlap alright, but not exactly the same because we base our crawl on CrUX's list which is based on usage. So https://m.facebook.com might appear on mobile but not on desktop. And maybe facebook forgot to add a robots.txt (or added a really weird one!) to their mobile site?

Most stats are available to both, the big exception being Lighthouse which currently is only available for the mobile crawl (due primarily to capacity constraints but something we are working on!).

We've also been surprised before when we wouldn't expect there to be much difference, so unless there's a very good reason (e.g. Lighthouse tests are only run on mobile, or it's for the mobile chapter), we recommend to include both - you might be surprised in the differences which makes interesting information to report on! That's not to say that the author needs to include both stats if they are repetitive, but let's at least run the data so they can make that call.

If we really do only need to run (or report on) only one, we generally prefer mobile over desktop because 1) We only have certain stats (Lighthouse) for mobile, 2) we've a bigger dataset for mobile and 3) we live in a mobile first world!

SQL Linting

Most of the above standards will be enforced by the linter (SQLFluff) when you open a Pull Request and the linting will fail if you deviate from these standards.

You can run the linter locally if you have a python environment set up, which can be set up with the following commands if using Virtual Env:

cd src
virtualenv --python python3.8 .venv
source .venv/bin/activate
pip install -r requirements.txt

Or for those on Windows:

cd src
virtualenv --python python3 .venv
.venv\Scripts\activate.bat
pip install -r requirements.txt

The environment will need to be activated each time but the virtual env should only need set up once and the requirements only need to be installed again if you want new versions.

To lint the 2020 resource-hints SQL files, for example, install the python environment as per above, and then issue the following command:

sqlfluff lint ../sql/2020/resource-hints

This will return errors like the following:

% sqlfluff lint ../sql/2020/resource-hints
== [../sql/2020/resource-hints/adoption_service_workers.sql] FAIL
L:  25 | P:  26 | L010 | Inconsistent capitalisation of keywords.
L:  26 | P:  37 | L010 | Inconsistent capitalisation of keywords.
L:  34 | P:  63 | L038 | Trailing comma in select statement forbidden
All Finished 📜 🎉!

This states that:

  • On line 25, in position 26 you are using lowercase for keywords (e.g. as instead of AS) so failed rule L010.
  • Similarly on line 26, position 37.
  • And finally on line 34, position 63 you have an unnecessary comma (e.g. SELECT a,b, FROM table) and so failed rule L038. Remove the extra comma.

The list of rules can be found in the SQLFLuff documentation though we have turned a few of them off and configured others for our style (see the our .sqlfluff file if curious).

If you see any "unparseable" or PRS errors, then this is either an error in your code, or perhaps you've discovered a bug. Reach out to Barry (@tunetheweb) for help if stuck.

To attempt to autofix the errors you can use the fix command, instead of lint:

sqlfluff fix ../sql/2020/resource-hints

Which will produce similar output but with an offer to fix the issues it thinks it can fix:

% sqlfluff fix ../sql/2020/resource-hints
==== finding fixable violations ====
== [../sql/2020/resource-hints/adoption_service_workers.sql] FAIL
L:  25 | P:  26 | L010 | Inconsistent capitalisation of keywords.
L:  26 | P:  37 | L010 | Inconsistent capitalisation of keywords.
L:  34 | P:  63 | L038 | Trailing comma in select statement forbidden
==== fixing violations ====
3 fixable linting violations found
Are you sure you wish to attempt to fix these? [Y/n]

If you lint again you should see most of the errors are fixed. Note that not all errors can be autofixed and some will require manual intervention but autofixing is useful for the simple errors. So while it's generally OK to run the fix command, do run the lint command when all clean to make sure the fix command didn't miss any issues.

Formatting results

When returning a percent value, ensure that the number is between 0 and 1, not 0 and 100 (we used 0 to 100 in 2019 but switched in 2020 to allow easier formatting in Sheets). We also stopped rounding in BigQuery from 2020 onwards and present the full decimal value and round in Sheets.

#standardSQL
# Take the average percent of external stylesheets over all requests.
SELECT
  _TABLE_SUFFIX AS client,
 COUNTIF(type = 'css') / COUNT(0) AS pct_stylesheets
FROM
  `httparchive.summary_requests.2020_08_01_*`
GROUP BY
  client

Usually you would sort by pct in DESCing order. Try to reuse the column names rather than reimplement the percentage formula in the ORDER BY clause too.

 SELECT
  _TABLE_SUFFIX AS client,
  element,
  COUNT(DISTINCT url) AS pages,
  total,
 COUNT(DISTINCT url) / total, 2 AS pct
FROM
  `some_dataset`
GROUP BY
  client,
  total,
  element
ORDER BY
  pct DESC,
  client

A warning on totals

We generally want to present stats as a percentage of pages. The HTTP Archive data set changes month by month (based on the latest available CrUX data) and has been growing over time. Therefore presenting absolute values can make it difficult to understand the scale, and to do year on year comparison. So as much as possible we want data presented as percentages.

To do this we need to take the absolute frequency of stats and divide by the total number. This means you need a total number. There are a number of ways of getting this.

You can use CTE (Common Table Expression):

WITH totals AS (
  SELECT
    _TABLE_SUFFIX AS _TABLE_SUFFIX,
    COUNT(0) AS total
  FROM
    `httparchive.pages.2021_07_01_*`
  GROUP BY
    _TABLE_SUFFIX
)

SELECT
  _TABLE_SUFFIX AS client
  COUNT(0) AS freq,
  total,
  COUNT(0) / total AS pct
FROM
  `httparchive.pages.2021_07_01_*`
JOIN
  totals
USING (_TABLE_SUFFIX)
WHERE
  ...
GROUP BY
  client
ORDER BY
  pct DESC

You can use a subquery:

SELECT
  _TABLE_SUFFIX AS client,
  COUNT(0) AS freq,
  total,
  COUNT(0) / total AS pct
FROM
  `httparchive.pages.2021_07_01_*`
JOIN
  (
    SELECT
      _TABLE_SUFFIX AS _TABLE_SUFFIX,
      COUNT(0) AS total
    FROM
      `httparchive.pages.2021_07_01_*`
    GROUP BY
      _TABLE_SUFFIX
  )
USING (__TABLE_SUFFIX)
WHERE
  ...
GROUP BY
  client
ORDER BY
  pct DESC

Or you can use a windowing clause using OVER:

SELECT
  _TABLE_SUFFIX AS client,
  COUNTIF(...) AS freq,
  SUM(COUNT(DISTINCT url)) OVER (PARTITION BY client) AS total
  COUNT(0) / SUM(COUNT(DISTINCT url)) OVER (PARTITION BY client) AS pct
FROM
  `httparchive.pages.2021_07_01_*`
GROUP BY
  client
ORDER BY
  pct DESC

The last looks the simplest and is a very handy way of not having to do another join. However care must be taken with this, as it's easy to get the wrong total with this. In particular:

  • UNNEST queries change the totals and even using DISTINCT url or similar will not remove this as you are doing a SUM of those DISTINCT url's rather than a DISTINCT of the SUM. This often leads to a total much LARGER than expected.
  • WHERE clauses are applied before the windowing function, so your total may already be being filtered. This often leads to a total SMALLER than expected.
  • HAVING clauses are applied before the windowing function, so your total may already be being filtered. This often leads to a total SMALLER than expected. You can use the QUALIFY option instead which is basically the same as HAVING but is applied after.

So for simple functions with no joins, and no filters (using WHERE or HAVING) by all means use the windowing option, but be aware of its limitations.

It is always best to include the total in your exports so you can validate the numbers - why is this out of 500 million URLs? I thought our crawl was 7 million URLs?

Almanac custom metrics and UDFs

In the pages dataset, there is a payload property of _almanac containing an object with several custom metrics. See https://github.com/HTTPArchive/legacy.httparchive.org/blob/master/custom_metrics/almanac.js for the definitions of each metric.

Sometimes the values in a custom metric are too complex to process using the JSON_EXTRACT functions in BigQuery, so you can do more complex analysis with a user-defined function (UDF). In the example below, the UDF parses both the payload JSON and the JSON-encoded _almanac custom metric:

#standardSQL
# 01_15: Percent of pages that include link[rel=modulepreload]
CREATE TEMP FUNCTION hasModulePreload(payload STRING)
RETURNS BOOLEAN LANGUAGE js AS '''
  try {
    var $ = JSON.parse(payload);
    var almanac = JSON.parse($._almanac);
    return !!almanac['link-nodes'].find(e => e.rel.toLowerCase() == 'modulepreload');
  } catch (e) {
    return false;
  }

''';

SELECT
  _TABLE_SUFFIX AS client,
  COUNTIF(hasModulePreload(payload)) AS num_pages,
  COUNTIF(hasModulePreload(payload)) / COUNT(0) AS pct_modulepreload
FROM
  `httparchive.pages.2019_07_01_*`
GROUP BY
  client

As per the example function names should be written in camelCase (hasModulePreload) rather than snake_case (has_module_preload) to help differentiate them from table names.

Additionally we have had reports that reusing the same function name in different queries being run at the same time (e.g. two tabs in your browser) in the same session, can mix up the results! Sounds like a BigQuery bug if this is the case, but either way it's easily avoided but not reusing the same function name for different functions across queries within the same set of chapter queries.

Almanac dataset

httparchive.almanac is a dataset containing 1k subset tables for all standard datasets (lighthouse, pages, requests, response_bodies, summary_pages, summary_requests, technologies)

There are also several preprocessed tables to make querying more convenient and cost-effective:

  • response_bodies: combination of desktop/mobile response bodies clustered by client, page, and url fields

  • summary_requests: combination of desktop/mobile summary requests clustered by client, page, and url fields. Note that page is the URL of the corresponding pageid in the summary_pages table.

  • summary_response_bodies: combination of desktop/mobile response bodies and summary requests data joined by page and url fields, clustered by client, firstHtml, and type fields. This is the preferred table for querying response bodies of a certain type, for example:

#standardSQL
# Calculates the percent of pages that reference the word "almanac" (case-insensitive).
SELECT
  client,
  COUNTIF(REGEXP_CONTAINS(body, '(?i)almanac')) / COUNT(0) AS pct_almanac
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  firstHtml
GROUP BY
  client

Note that summary_response_bodies is an 18 TB table, but thanks to the clustering, this query is limited to only the response bodies that are marked as firstHtml. So the query actually completes in ~13 seconds and consumes only 850 GB.

image

Rank

Since the June 2021 crawl, rank data is taken from the CrUX report allowing queries on the top 1k origins, top 10k, top 100k, top 1M...etc. The rank field is available in the httparchive.summary_pages.* tables and will contain the values 1000, 10000, 100000, 1000000 and 10000000 representing the rank for each url crawled.

Note that the ranks are exclusive, meaning that a URL may only have one rank. For example, the 100000 rank contains 99,000 URLs because the top 1,000 is its own exclusive segment. To include all URLs you may combine the different ranks as follows:

SELECT 
  client,
  rank_grouping,
  COUNT(0) AS total
FROM (
  SELECT 
    _TABLE_SUFFIX AS client,
    rank,
  FROM
    `httparchive.summary_pages.2021_07_01_*`
  ),
UNNEST([1000, 10000, 100000, 1000000, 10000000]) AS rank_grouping
WHERE
  rank <= rank_grouping
GROUP BY 
  client,
  rank_grouping
ORDER BY
  client,
  rank_grouping

When running the above query, you may notice that the total does not perfectly match the number of expected URLs, i.e. less than 1,000 URLs for the top 1000 rank. Since the CrUX rank for each origin is shared between desktop & mobile, it may be possible that an origin which features on CrUX for one device type, does not feature for the other (e.g. a mobile version of a site like https://m.example.com may not appear in desktop CrUX data if it doesn't get a lot of visits from desktop devices). As a result, for the 1000 rank, the list of URLs to crawl may contain a maximum of 1,000 URLs - if all URLs feature on both desktop and mobile - but likely to contain a subset.

Appendix

Scripts to analyze number of queries and lines of SQL for a given year.

Number of queries:

ls sql/2021/**/*.sql | wc -l

Total lines of SQL:

wc -l sql/2021/**/*.sql | tail -n 1