The "top sites" section is *not* ordered by usage, even though it claims to be #3816

Open
mfreed7 opened this issue Apr 23, 2024 · 4 comments
@mfreed7

mfreed7 commented Apr 23, 2024

Take this page, for example:

https://chromestatus.com/metrics/feature/timeline/popularity/4844

The header says "Sample URLs from latest run (ordered by Rank and URL)", but the list is clearly in alphabetical order (including "http" vs "https"). That makes it not as helpful, especially given that between 100 and 200 sites are listed. It would be very useful to have this list be rank-ordered by usage, so the top sites (by usage) can be easily checked.

@josepharhar @chrishtr

@mfreed7 mfreed7 added the bug label Apr 23, 2024
@josepharhar

The fact that these aren't sorted also makes me worried that the sites listed aren't even "top sites" with regards to anything at all

@chrishtr
Collaborator

@tunetheweb

@tunetheweb
Member

So it's a bit complicated. But it is a sample of the URLs from the top sites using this feature.

First up, the rankings available in the HTTP Archive are based on the CrUX coarse rank magnitude. This means we only get groupings like top 1,000, then top 5,000, then top 10,000, then top 50,000, then top 100,000, etc. So we do not have a precise "ranking" of 1, 2, 3, 4, etc.
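To make the bucketing concrete, here is a minimal Python sketch of how an exact position would collapse into one of these groupings. The `rank_bucket` helper and the `RANK_BUCKETS` tuple are hypothetical names for illustration only; the tuple holds just the boundaries named above, and the real list continues beyond them.

```python
from bisect import bisect_left

# Only the boundaries named in the comment above; the real list of
# magnitudes continues past 100,000, so this tuple is an assumption.
RANK_BUCKETS = (1_000, 5_000, 10_000, 50_000, 100_000)

def rank_bucket(exact_rank):
    """Return the smallest bucket containing exact_rank, or None if it is past the list."""
    i = bisect_left(RANK_BUCKETS, exact_rank)
    return RANK_BUCKETS[i] if i < len(RANK_BUCKETS) else None

# Sites at positions 1 and 999 are indistinguishable once bucketed.
print(rank_bucket(1), rank_bucket(999), rank_bucket(7_500))
```

Everything inside one bucket ties, which is why a secondary sort key (the URL) ends up deciding the order within a bucket.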

We take the top 100 URLs, ordered by rank and then URL, for mobile and for desktop, so we have a max of 200 URLs if they are all distinct (often the same sites appear in both). This limit of 100 for each is mainly there so the lists can be precomputed to keep the dashboard reasonably fast. Importantly, at this point the list is just URLs and no longer contains rank, as it's stored as a simple array of URLs.

Then we combine this list and report it by alphabetical order.

What does all this mean? Let's take the following usage as an example:

rank url
1,000 https://z.com
10,000 https://a.com
10,000 https://b.com
50,000 https://c.com
50,000 https://d.com
50,000 https://e.com
100,000 https://f.com

Then let's say we only took the top 4 sites, instead of 200, for simplicity. It's already ordered by rank and url so we would take the following:

rank url
1,000 https://z.com
10,000 https://a.com
10,000 https://b.com
50,000 https://c.com

(note we include all of the top 1,000, all of the top 10,000 and only a bit of the top 50,000).

And then we present them in the following order (as we no longer have the rank to sort by):

url
https://a.com
https://b.com
https://c.com
https://z.com

This means we ARE broadly giving a sample of the URLs ordered by rank (note that we didn't include f.com, for example, as it was lower ranked), but there's definitely some nuance here, as it's no longer in strict rank order in the end. It is STILL the most popular 200-ish URLs we have for that feature, just not exactly in rank order anymore.
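The steps above can be sketched in Python as follows. This is an illustration of the described behaviour, not the actual dashboard code; `sample_urls` and `dashboard_list` are hypothetical names, and the same example rows are used for both mobile and desktop to keep it short.

```python
def sample_urls(rows, limit):
    """Sort (rank, url) rows by rank then url, keep the first `limit`, and drop the rank."""
    top = sorted(rows)[:limit]  # tuples compare by rank first, then url
    return [url for _rank, url in top]

def dashboard_list(mobile_rows, desktop_rows, limit):
    """Combine both per-device samples and order them alphabetically."""
    combined = set(sample_urls(mobile_rows, limit)) | set(sample_urls(desktop_rows, limit))
    return sorted(combined)  # rank is gone by now, so only the url can be sorted on

usage = [
    (1_000, "https://z.com"),
    (10_000, "https://a.com"),
    (10_000, "https://b.com"),
    (50_000, "https://c.com"),
    (50_000, "https://d.com"),
    (50_000, "https://e.com"),
    (100_000, "https://f.com"),
]

picked = sample_urls(usage, limit=4)
shown = dashboard_list(usage, usage, limit=4)
print(picked)
print(shown)
```

With a limit of 4, `picked` starts with z.com because it has the best rank, while `shown` moves z.com to the end. The set of URLs is the same in both; only the order differs.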

I've updated the text to "Sample URLs of the most popular sites using this feature ordered alphabetically" in some vague way to try to explain this but not sure if it makes it any clearer!

If you want the actual rank order, then you can run the following SQL:

#standardSQL
SELECT DISTINCT yyyymmdd, feature, id, rank, url
FROM `httparchive.blink_features.features`
WHERE (feature = 'SelectParserDroppedTag' OR id = '4844')
AND yyyymmdd = (SELECT MAX(yyyymmdd) FROM `httparchive.blink_features.features`)
ORDER BY yyyymmdd DESC, rank, url
LIMIT 200;

In fact, if you remove the limit, you'll get all 493 URLs, as shown in this sheet, in rank and then url order. Note that even then it is still in the coarse rank order.

yyyymmdd feature id rank url
2024-03-01 SelectParserDroppedTag 4844 50,000 https://www.loewe.com/
2024-03-01 SelectParserDroppedTag 4844 100,000 https://billetterie.rclens.fr/
2024-03-01 SelectParserDroppedTag 4844 100,000 https://onlinesbi.sbi/
2024-03-01 SelectParserDroppedTag 4844 100,000 https://www.buybestgear.com/
2024-03-01 SelectParserDroppedTag 4844 500,000 https://m.maccosmetics.com.mx/
2024-03-01 SelectParserDroppedTag 4844 500,000 https://m.maccosmetics.es/
... ... ... ... ...

Unfortunately this SQL is not possible to run in the Dashboard, as it's very slow to run for all features (which is how Data Studio works). Hence why we go for the rather convoluted route, which still gives broadly the same data, but sometimes not in the exact order expected.

However, you'll also note the top 200 URLs in the sheet are also what was provided by chromestatus. Just not in the same order.

The history here is that this used to be a completely random sampling of URLs, which was not that useful at all. Now it's at least the most popular URLs, but yes, the ordering within that is still a little messy.

With a little more effort we could have a new table with the top 200 URLs and the rank column to make this all more obvious. But it would still be the same URL list, and without the fine-grained ranking that you may be looking for.
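As a rough idea of what keeping the rank column could look like, here is a hedged Python sketch. It is not an existing table or query; `ranked_dashboard_list` is a hypothetical name, and taking the better rank when a URL appears in both the mobile and desktop samples is an assumption.

```python
def ranked_dashboard_list(mobile_rows, desktop_rows, limit):
    """Keep the rank alongside each sampled url and order by rank bucket, then url."""
    best = {}
    for rows in (mobile_rows, desktop_rows):
        for rank, url in sorted(rows)[:limit]:
            # Assumption: if a url appears in both samples, keep its better rank.
            best[url] = min(rank, best.get(url, rank))
    return sorted((rank, url) for url, rank in best.items())

mobile = [(1_000, "https://z.com"), (10_000, "https://a.com"), (50_000, "https://c.com")]
desktop = [(10_000, "https://b.com"), (50_000, "https://c.com"), (100_000, "https://f.com")]

ranked = ranked_dashboard_list(mobile, desktop, limit=2)
for rank, url in ranked:
    print(f"{rank:,} {url}")
```

The URLs selected are the same as with the array-of-URLs approach; the difference is only that the bucket survives, so the top-1,000 entries can be listed first.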

@mfreed7
Author

mfreed7 commented Apr 25, 2024

Thanks for the very detailed explanation of what's going on with this list! I feel a lot better about the quality of the results.

Having said that, I do think it'd be very useful to either a) keep the list ordered by rank "bucket" so that the top-1000 results are at the top, or at least b) add a rank column so we could do that ourselves. While a fine-grained 1,2,3 ranking would be the best, we can still extract a lot of value from top-1000 vs. top-50000.

How much of a project would it be to do one of those things?
