
Are our scores misleading? #339

Open
jakearchibald opened this issue Jun 30, 2020 · 29 comments

Comments

@jakearchibald
Contributor

https://twitter.com/boriscoder/status/1277937351164035078

Do we want people to switch to Rollup because it has a higher score? Well, no, and we explain that in our FAQ, but I'm not sure folks will see that.

I like the scores as 'a bit of fun', and something for tool maintainers to aim for, but should we add a disclaimer or something?

@GijsWeterings

GijsWeterings commented Jun 30, 2020

Random idea: move the total scores to the bottom. That keeps readers from immediately thinking "oh, X is the best, whatever my use case I'll just go for X", but still leaves the challenge for the bundlers to improve their total scores :)

@jakearchibald
Contributor Author

That's certainly a quick fix. @argyleink @una: what do you think?

@una
Contributor

una commented Jun 30, 2020

If we move the Summary section down, I think we'll want to rework the first section to have some more visual information and break up the text blocks. This doesn't seem like a quick fix, as we'll need to rethink the IA:

Kapture 2020-06-30 at 10 08 11

Alternatively as a quick fix, we could add some text in the sidebar:

Screen Shot 2020-06-30 at 10 11 31 AM

[EDIT] Or underneath:

Screen Shot 2020-06-30 at 10 13 50 AM

@argyleink
Collaborator

the data is the data, and people will be people; data above or below, people still the same 🤷

the scores aren't misleading imo.

people will forever read the headline and bail, or search for the headline / tldr so they can bail. what we've done is present the information from the top down, rolled up to unrolled. it's all there, and not in a raw format, but in a multi-level, digestible format. it'd only be misleading if we prevented folks from seeing more information and only gave them the card totals, which we aren't: our data is open, transparent, and accessible.

i don't feel we need to change the design. there's nothing we can do to prevent incorrect extrapolation, and i feel we've done our best here to prevent that already.

@tomayac
Member

tomayac commented Jun 30, 2020

One idea for the future could be to let people change the score weights. If an app doesn't use, say, web workers, its developers might want to opt out of anything that grades those aspects (i.e., set a weight of 0). If they don't mind something overly much but still want it considered somewhat, say, a lack of customization, they might set the related weights to 0.1. It could be interactive with sliders…

@surma
Contributor

surma commented Jun 30, 2020

We already have weights for the tests, so letting those be customized sounds like an interesting idea.
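
To make the idea concrete, here's a minimal sketch of user-adjustable weights in TypeScript. The `TestResult` shape, its field names, and the per-category override factors are assumptions for illustration, not the site's actual data model:

```ts
// Hypothetical shape of a single test result; the real data model may differ.
interface TestResult {
  name: string;
  category: string; // e.g. "code splitting", "non-JavaScript resources"
  weight: number;   // default weight assigned by the site
  passed: boolean;
}

// User-supplied per-category factors: 0 opts a category out entirely,
// 0.1 keeps it as a minor consideration, 1 leaves it at full weight.
type WeightOverrides = Record<string, number>;

function weightedScore(results: TestResult[], overrides: WeightOverrides = {}): number {
  let earned = 0;
  let possible = 0;
  for (const r of results) {
    const factor = overrides[r.category] ?? 1;
    const w = r.weight * factor;
    possible += w;
    if (r.passed) earned += w;
  }
  // Guard against every category being opted out.
  return possible === 0 ? 0 : earned / possible;
}
```

A slider per category would then just write its value into `overrides` and re-render the totals.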

@una
Contributor

una commented Jun 30, 2020

I think that's an interesting idea. It would be a nice future feature to allow for customized scores based on user needs, but it feels like this might be better placed on a "frameworks" overview page or elsewhere, as it would essentially become a recommendation engine or wizard. Currently, the site's intention is to present the research data without making specific recommendations.

@surma
Contributor

surma commented Jun 30, 2020

@jeremy-coleman How so? If you can provide us with a way for browserify to consume es modules, please let us know (but this should be in a separate issue).

@GijsWeterings

GijsWeterings commented Jun 30, 2020

the data is the data, and people will be people; data above or below, people still the same

I respectfully (partly) disagree. To a degree yes, the same info is there and you can pull your own weird conclusions from it no matter what. That said, humans seeing a total score above the fold have drawn conclusions before looking at the rest of the data. Presenting context about the test, the individual test results and then a summary gives users more incentive to see more of the page, catching visuals of certain bundlers scoring better in certain categories, before arriving at a conclusion.

I'd (again) propose moving the conclusions down, with a clear disclaimer on them, and using the first viewport to set expectations for the page.

the scores arent misleading imo

They are not, if set in the appropriate context of the set of tests chosen. That information is extremely important, and my worry, as well as that of the person on Twitter who sparked this issue, is that just seeing the roll-up doesn't make the need for that context explicit enough.

people will forever read the headline and bail or search for the headline / tldr so they can bail.

Again, I hope you agree the mere totals on their own shouldn't be the headline of the page, as they in isolation say nothing of value about any of the bundlers. The idea mentioned here of modifying the weights for different purposes is of course the best way to add value to these totals, but a step further from the current purpose of the site.

In conclusion, my suggestion of moving the totals isn't to prevent any potential biased extrapolation of results, but to help set them in the correct context. The difference is subtle, but in my personal opinion very beneficial to your content strategy.

@emilio-martinez

Why not do something similar to what Lighthouse does? The tests are already grouped into categories, but perhaps distilling those into 4 or 5 key areas of concern would be enough to create a "scorecard" of sorts to present at the top.
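
For illustration, folding the flat list of results into one score per category could look something like the sketch below, reusing the hypothetical `TestResult` shape from the earlier weighting example (the real grouping and category names would come from the site's data):

```ts
// Count passed/total tests per category, Lighthouse-scorecard style.
function categoryScores(results: TestResult[]): Map<string, { passed: number; total: number }> {
  const byCategory = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const entry = byCategory.get(r.category) ?? { passed: 0, total: 0 };
    entry.total += 1;
    if (r.passed) entry.passed += 1;
    byCategory.set(r.category, entry);
  }
  return byCategory;
}
```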

@GijsWeterings

@emilio-martinez also a fine suggestion in my book! Anything that helps set the total number in the correct context is a win here.

@jeremy-coleman

jeremy-coleman commented Jul 1, 2020

What about changing “x out of y tests passed” to something like “x out of y solutions developed”? That way, it reads more like possible/difficult, while also being an implicit call to action for community members to submit or develop missing solutions. That’d also add a degree of separation to the summary stats.

@developit pinned this issue Jul 2, 2020
@tomByrer

tomByrer commented Jul 6, 2020

"Are our scores misleading?" Is a good question to ask.
When I'm in Engineer mode, I'm more concerned with 'Will this tool work best in MY scenario?'

So perhaps a better solution to the question is not about the numbers, but providing a concise text summary for each tool? E.g.:

browserify: 
If it's already included in the project and it works for you, keep using it.

Parcel:
Quick to set up and use for most web-facing projects.

Rollup:
Fairly easy to expand, very aggressive code optimization.

Webpack:
Very powerful and largest ecosystem.

@shellscape
Collaborator

I'm in agreement with @argyleink - let's not let feelings get in the way of good, hard data. Anyone else old enough to remember the Acid Tests for browsers and how important those were in getting browsers on the same page in terms of support? They didn't tiptoe around what failed and what didn't. It was right there in your face, and it was useful.

Removing or obscuring easy-to-digest data will reduce the usability of the tool. Adding a hard-to-ignore "read this first" ahead of the scorecards is reasonable. From there, if people want to skip the hard-to-ignore information, it's on them.

@GoogleChromeLabs deleted a comment from shellscape Jul 8, 2020
@jakearchibald
Contributor Author

I think we can have an adult conversation here that doesn't involve "flat earth".

@jeremy-coleman

Perhaps then, if deemed proper for adults, a bit of googling on the fallacy of suppressed evidence and inductive vs deductive reasoning may be in order.

@jakearchibald
Contributor Author

@jeremy-coleman constructive comments are welcome here. Comparisons to "flat earth" and "urr just Google it" are not constructive. Keep it civil please.

@justinfagnani

Why not do something similar to what Lighthouse does? The tests are already grouped into categories, but perhaps distilling those into 4 or 5 key areas of concern would be enough to create a "scorecard" of sorts to present at the top.

This could be used to handle my concern about appearing to encourage features that aren't compatible with native platform capabilities. One category could be something like "non-standard module types" or "non-standard extensions" and I could ignore it :)

@justinfagnani

@shellscape

the Acid Tests for browsers and how important those were in getting browsers on the same page in terms of support? They didn't tiptoe around what failed and what didn't. It was right there in your face, and it was useful.

The Acid Tests were testing specified behavior. Many bundler features, outside of the desire to actually preserve module semantics, are not specified, and some are really opinionated. So reducing the score to a single number loses a lot more information than it did for Acid.

@jeremy-coleman

jeremy-coleman commented Jul 9, 2020

@jakearchibald let me make this as simple as I possibly can:
can a bird fly?

(fallacy)
an ostrich is a bird and it cannot fly -> used as proof that birds cannot fly.

(scientific method)
an ostrich is a bird and it cannot fly -> some birds can't fly -> must test again
an eagle is a bird and it can fly -> some birds can fly -> valid proof that a bird can fly

I realize the tests here are TDD-style, which is fine for development and a great fit for this project, but you cannot publicly report the absence of a pass as a failure. Instead, present the findings as a solution-only recipe book, not a pass/fail test suite.

@jakearchibald
Contributor Author

jakearchibald commented Jul 10, 2020

Jeremy,

but you cannot publicly report the absence of a pass as a failure

For that test, yes you can. From https://bundlers.tooling.report/about/:

What is tooling.report? It's a quick way to determine the best build tool for your next web project

The "your" is important there. Parcel fails tests where you'd need to write your own plugin, because it doesn't have usable plugin documentation right now. If you're likely to need to write your own plugins, then failure to pass those tests is a strong indicator for your situation. In other situations, maybe it doesn't matter.

As you can see from the OP and other issues, I am concerned that the overall score doesn't effectively communicate that, so your point is already covered by others, and others managed to cover it without invoking flat earth or ostriches.

If I've missed your point, can you make it without obfuscation? If you're struggling to do that effectively, please reach out to me directly (me@jakearchibald.com, or Twitter DM), and I'll help you figure it out.

@jeremy-coleman

jeremy-coleman commented Jul 10, 2020

Jake, I should have said "but you cannot publicly report the absence of a pass as a failure (without being misleading)". You claim yes you can, but I think it's fair to assume a reader will interpret a failure as "this bundler can't do this task", when in reality the data is saying "we haven't figured out how to do this task with this bundler yet" (@argyleink thoughts?). You also said, "...Parcel fails tests where you would need to write your own plugin...". This is the core issue at hand. The "failed" tests could be due to the bundler itself or any number of non-bundler issues, including misconfiguration, lack of user knowledge, lack of existing plugins, lack of knowledge of the existence of plugins, etc. It is impossible to be certain whether the failure is due to the bundler or to user error.

Therefore, it seems the answer to "are our scores misleading?" is unequivocally yes, and the discussion should be about if/how to mitigate it. I'm not saying the project is shit and you should tear it down and set it on fire, just that this specific issue should maybe be addressed (mainly because I am a browserify fanboy). But why even ask the question if you don't want to entertain a logical answer?

@ahmadnassri

ahmadnassri commented Jul 18, 2020

had a brief chat on twitter with @argyleink and @surma about this today, and was linked here, so I wanna add my thoughts and feedback if they are of any use.

what problem does the "summary" score solve?

whether it's at the top or the bottom, or displayed in a different way ... what is it really solving? if I'm missing any positive value it introduces, I'm happy to learn more ... but in my view, it's creating a "gaming dynamic" which signals "winners and losers", "better and worse", etc ... humans are gonna lean towards the lazy path and just pick the "top" item.

feedback: simply remove the summary section, let developers read through the itemized list to realize what matters to them / their project's needs.

on "failing" tests

this also creates a "competitive" signal, where if some tool doesn't support certain functionality, it's deemed as a failure? I don't think all tools need to have parity of features, and certainly developers using those tools in their projects don't necessarily NEED all these features for their projects ...

if a new tool is built tomorrow that only does 10 things really well, its only purpose is to do those 10 things, and it targets a specific type of project ... is it not worthy to be included?

what if I don't need my bundler to handle "image compression"? or my project has no concern for "Custom Type imports", or any "Non-JavaScript Resources" for that matter?

feedback: use terms like "supported" / "unsupported" / "partially supported" to clearly signal and help the reader pick tools with features that fit their project's needs

I guess this also brings up in my mind the target audience: many developers who are not building and shipping "open source" projects don't need full coverage of features and functionality; rather, they are better off with tools that best fit their project needs ... I'm thinking of Enterprise Developers who are often overworked / too busy to deep-dive, and who just pick the highest-signal tooling and get stuck with their choices for years ...

@jakearchibald
Contributor Author

if I'm missing any positive value it introduces, I'm happy to learn more

We're currently seeing build tools competing on this number. Fixing long-standing bugs and improving documentation.

@jakearchibald
Contributor Author

@jeremy-coleman sorry I hadn't seen you'd edited your post:

I think it's fair to assume a reader will interpret a failure as "this bundler can't do this task", when in reality the data is saying "we haven't figured out how to do this task with this bundler yet"

I'm sorry, but that isn't the reality. Failure means it can't 'reasonably' be achieved with the bundler. There is some wiggle room in 'reasonably', but it's roughly:

  • If it's undocumented, it doesn't exist. We believe developers deserve documentation, and shouldn't be left to read the source and guess if something is actually external or pseudo-private.
  • It's reasonable to have to write your own plugin for less common cases, but you can only write your own plugin using documented parts of the tool (see above), and that plugin shouldn't be fragile – it should play well with other plugins.
  • Bugs that change the behaviour of your code are serious bugs.
  • Developers shouldn't have to opt-out of incorrect behaviour when it comes to core features.

The "failed" tests could be due to the bundler itself or any number of non-bundler issues, including misconfiguration, lack of user knowledge, lack of existing plugins, lack of knowledge of the existence of plugins, etc.

We made the site available to the tool authors weeks before we went public, and they reviewed the tests. We're also continuing to work with them to update the site as bugs are fixed and documentation is written (#357). Sure, maybe a test is marked as a failure because we, and the authors of the tool itself, couldn't make it work when in reality it can be made to work according to the criteria above, but surely we've done due diligence. Also, we accept bug reports and PRs on the data we've provided.

It is impossible to be certain if the failure is due to the bundler or user-error.

I don't understand this point. From a news story to a scientific paper, you could say "it's impossible to be certain if the information is accurate or an error", but I don't know what that proves.

In a scientific paper, this is generally mitigated by providing detailed test conditions and raw results, so a third party can assess the conclusion and reproduce the test. I feel like we've done exactly this with tooling.report, no?

@surma
Contributor

surma commented Jul 20, 2020

To bring this back to the original point: I have some sympathy for the concerns around the scores. I do think they are valuable and create a bit of healthy competition, but I also agree that they don’t need to be at the top of the page.

I liked the idea of moving the overall scores to after the grid, maybe with an extra paragraph explaining the nuance of the scores shown.

Optional: I also liked the idea of adding a total score per section, which might actually be valuable information for users.

@GijsWeterings

@surma I like both moving the scores to after the grid and the intermediate totals!

@una
Contributor

una commented Jul 24, 2020

For a customized view, we can do something like this based on the sub-categories, enabling scores to be adjusted for different user needs. It's a start. I'm thinking something like the Material UI Chip would be perfect. This also would work on mobile, as they would just stack.

tooling-report-filtering

The little "+" can rotate into the "x" as the colors change as well. The results would also respond to the filtering.

@septatrix

I for one was very confused that the tools are not sorted by decreasing score from left to right. At first I thought browserify had the highest score, and only realized otherwise after I had already looked at half the comparisons. However, after reading this thread I realized that scores aren't everything, and that sorting by name might be the more sensible choice going forward. Still, I think there are probably people who fall into this trap.

For this reason I would welcome any changes that somehow emphasize that the score isn't everything when it comes to making your selection.
