
Are our scores misleading? #339

Open
jakearchibald opened this issue Jun 30, 2020 · 29 comments

Comments

@jakearchibald
Contributor

https://twitter.com/boriscoder/status/1277937351164035078

Do we want people to switch to Rollup because it has a higher score? Well, no, and we explain that in our FAQ, but I'm not sure folks will see that.

I like the scores as 'a bit of fun', and something for tool maintainers to aim for, but should we add a disclaimer or something?

@GijsWeterings

GijsWeterings commented Jun 30, 2020

Random idea: move the total scores to the bottom. That keeps readers from immediately thinking "oh, X is the best, whatever my use case I'll just go for X", but still leaves the challenge for the bundlers to improve their total scores :)

@jakearchibald
Contributor Author

That's certainly a quick fix. @argyleink @una: what do you think?

@una
Contributor

una commented Jun 30, 2020

If we move the Summary section down, I think we'll want to rework the first section to have some more visual information and break up the text blocks. This doesn't seem like a quick fix, as we'll need to rethink the IA:

Kapture 2020-06-30 at 10 08 11

Alternatively as a quick fix, we could add some text in the sidebar:

Screen Shot 2020-06-30 at 10 11 31 AM

[EDIT] Or underneath:

Screen Shot 2020-06-30 at 10 13 50 AM

@argyleink
Collaborator

the data is the data, and people will be people; data above or below, people still the same 🤷

the scores aren't misleading imo.

people will forever read the headline and bail, or search for the headline / tldr so they can bail. what we've done is present the information from the top down, rolled up to unrolled. it's all there, and not in a raw format, but in a multi-level, digestible format. it'd only be misleading if we prevented folks from seeing more information and only gave them the card totals, which we aren't: our data is open, transparent, and accessible.

i don't feel we need to change the design. there's nothing we can do to prevent incorrect extrapolation, and i feel we've done our best here to prevent that already.

@tomayac
Member

tomayac commented Jun 30, 2020

One idea for the future could be to let people change the score weights. If an app doesn't use, say, web workers, its developers might want to opt out of anything that grades those aspects (i.e., set a weight of 0). If they don't mind something overly much but still want it considered somewhat, say, a lack of customization, they might set the related weights to 0.1. It could be interactive with sliders…

@surma
Contributor

surma commented Jun 30, 2020

We already have weights for the tests, so letting those be customized sounds like an interesting idea.
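
To make the idea concrete, here's a minimal sketch of user-adjustable weights in TypeScript. The `TestResult` shape, its field names, and the per-category override factors are assumptions for illustration, not the site's actual data model:

```ts
// Hypothetical shape of a single test result; the real data model may differ.
interface TestResult {
  name: string;
  category: string; // e.g. "code splitting", "non-JavaScript resources"
  weight: number;   // default weight assigned by the site
  passed: boolean;
}

// User-supplied per-category factors: 0 opts a category out entirely,
// 0.1 keeps it as a minor consideration, 1 leaves it at full weight.
type WeightOverrides = Record<string, number>;

function weightedScore(results: TestResult[], overrides: WeightOverrides = {}): number {
  let earned = 0;
  let possible = 0;
  for (const r of results) {
    const factor = overrides[r.category] ?? 1;
    const w = r.weight * factor;
    possible += w;
    if (r.passed) earned += w;
  }
  // Guard against every category being opted out.
  return possible === 0 ? 0 : earned / possible;
}
```

A slider per category would then just write its value into `overrides` and re-render the totals.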

@una
Contributor

una commented Jun 30, 2020

I think that's an interesting idea. It would be a nice future feature to allow for customized scores based on user needs, but it feels like this might be better placed on a "frameworks" overview page or elsewhere, as it would essentially become a recommendation engine or wizard. Currently, the site's intention is to present the research data without making specific recommendations.

@surma
Contributor

surma commented Jun 30, 2020

@jeremy-coleman How so? If you can provide us with a way for browserify to consume es modules, please let us know (but this should be in a separate issue).

@GijsWeterings

GijsWeterings commented Jun 30, 2020

the data is the data, and people will be people; data above or below, people still the same

I respectfully (partly) disagree. To a degree yes, the same info is there and you can pull your own weird conclusions from it no matter what. That said, humans seeing a total score above the fold have drawn conclusions before looking at the rest of the data. Presenting context about the test, the individual test results and then a summary gives users more incentive to see more of the page, catching visuals of certain bundlers scoring better in certain categories, before arriving at a conclusion.

I'd (again) propose moving the conclusions down, with a clear disclaimer on them, and using the first viewport to set expectations for the page.

the scores arent misleading imo

They are not, if set in the appropriate context of the set of tests chosen. That information is extremely important, and my worry, as well as that of the person on Twitter who sparked this issue, is that just seeing the roll-up doesn't make the need for that context explicit enough.

people will forever read the headline and bail or search for the headline / tldr so they can bail.

Again, I hope you agree the mere totals on their own shouldn't be the headline of the page, as they in isolation say nothing of value about any of the bundlers. The idea mentioned here of modifying the weights for different purposes is of course the best way to add value to these totals, but a step further from the current purpose of the site.

In conclusion, my suggestion of moving the totals isn't to prevent any potential biased extrapolation of results, but to help set them in the correct context. The difference is subtle, but in my personal opinion very beneficial to your content strategy.

@emilio-martinez

Why not do something similar to what Lighthouse does? The tests are already grouped into categories, but perhaps distilling those into 4 or 5 key areas of concern would be enough to create a "scorecard" of sorts to present at the top.
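
For illustration, folding the flat list of results into one score per category could look something like the sketch below, reusing the hypothetical `TestResult` shape from the earlier weighting example (the real grouping and category names would come from the site's data):

```ts
// Count passed/total tests per category, Lighthouse-scorecard style.
function categoryScores(results: TestResult[]): Map<string, { passed: number; total: number }> {
  const byCategory = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const entry = byCategory.get(r.category) ?? { passed: 0, total: 0 };
    entry.total += 1;
    if (r.passed) entry.passed += 1;
    byCategory.set(r.category, entry);
  }
  return byCategory;
}
```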

@GijsWeterings

@emilio-martinez also a fine suggestion in my book! Anything that helps set the total number in the correct context is a win here.

@jeremy-coleman

jeremy-coleman commented Jul 1, 2020

What about changing “x out of y tests passed” to something like “x out of y solutions developed”? That way, it reads more like possible/difficult, while also being an implicit call to action for community members to submit or develop missing solutions. That’d also add a degree of separation to the summary stats.

@developit pinned this issue Jul 2, 2020
@tomByrer

tomByrer commented Jul 6, 2020

"Are our scores misleading?" Is a good question to ask.
When I'm in Engineer mode, I'm more concerned with 'Will this tool work best in MY scenario?'

So perhaps a better solution to the question is not about the numbers, but providing a concise text summary for each tool? E.g.:

browserify: 
If it's already included in the project and it works for you, keep using it.

Parcel:
Quick to set up and use for most web-facing projects.

Rollup:
Fairly easy to expand, very aggressive code optimization.

Webpack:
Very powerful and largest ecosystem.

@shellscape
Collaborator

I'm in agreement with @argyleink - let's not let feelings get in the way of good, hard data. Anyone else old enough to remember the Acid Tests for browsers and how important those were in getting browsers on the same page in terms of support? They didn't tiptoe around what failed and what didn't. It was right there in your face, and it was useful.

Removing or obscuring easy-to-digest data will reduce the usability of the tool. Adding a hard-to-ignore "read this first" ahead of the scorecards is reasonable. From there, if people want to skip the hard-to-ignore information, it's on them.

@GoogleChromeLabs deleted a comment from shellscape Jul 8, 2020
@jakearchibald
Contributor Author

I think we can have an adult conversation here that doesn't involve "flat earth".

@jeremy-coleman

Perhaps then, if deemed proper for adults, a bit of googling on the fallacy of suppressed evidence and inductive vs deductive reasoning may be in order.

@jakearchibald
Contributor Author

@jeremy-coleman constructive comments are welcome here. Comparisons to "flat earth" and "urr just Google it" are not constructive. Keep it civil please.

@justinfagnani

Why not do something similar to what Lighthouse does? The tests are already grouped into categories, but perhaps distilling those into 4 or 5 key areas of concern would be enough to create a "scorecard" of sorts to present at the top.

This could be used to handle my concern about appearing to encourage features that aren't compatible with native platform capabilities. One category could be something like "non-standard module types" or "non-standard extensions" and I could ignore it :)

@justinfagnani

@shellscape

the Acid Tests for browsers and how important those were in getting browsers on the same page in terms of support? They didn't tiptoe around what failed and what didn't. It was right there in your face, and it was useful.

The Acid Tests were testing specified behavior. Many bundler features, outside of the desire to actually preserve module semantics, are not specified, and some are really opinionated. So reducing the score to a single number loses a lot more information than it did for Acid.

@jeremy-coleman

jeremy-coleman commented Jul 9, 2020

@jakearchibald let me make this as simple as I possibly can:
can a bird fly?

(fallacy)
an ostrich is a bird and it cannot fly -> used as proof that birds cannot fly.

(scientific method)
an ostrich is a bird and it cannot fly -> some birds can't fly -> must test again
an eagle is a bird and it can fly -> some birds can fly -> valid proof that a bird can fly

I realize the tests here are TDD-style, which is fine for development and a great fit for this project, but you cannot publicly report the absence of a pass as a failure. Instead, present the findings as a solution-only recipe book, not a pass/fail test suite.

@jakearchibald
Contributor Author

jakearchibald commented Jul 10, 2020

Jeremy,

but you cannot publicly report the absence of a pass as a failure

For that test, yes you can. From https://bundlers.tooling.report/about/:

What is tooling.report? It's a quick way to determine the best build tool for your next web project

The "your" is important there. Parcel fails tests where you'd need to write your own plugin, because it doesn't have usable plugin documentation right now. If you're likely to need to write your own plugins, then failure to pass those tests is a strong indicator for your situation. In other situations, maybe it doesn't matter.

As you can see from the OP and other issues, I am concerned that the overall score doesn't effectively communicate that, so your point is already covered by others, and others managed to cover it without invoking flat earth or ostriches.

If I've missed your point, can you make it without obfuscation? If you're struggling to do that effectively, please reach out to me directly (me@jakearchibald.com, or Twitter DM), and I'll help you figure it out.

@jeremy-coleman

jeremy-coleman commented Jul 10, 2020

Jake, I should have said "but you cannot publicly report the absence of a pass as a failure (without being misleading)". You claim yes you can, but I think it's fair to assume a reader will interpret a failure as "this bundler can't do this task", when in reality the data is saying "we haven't figured out how to do this task with this bundler yet" (@argyleink thoughts?). You also said, "...Parcel fails tests where you would need to write your own plugin...". This is the core issue at hand. The "failed" tests could be due to the bundler itself or any number of non-bundler issues, including misconfiguration, lack of user knowledge, lack of existing plugins, lack of knowledge of the existence of plugins, etc. It is impossible to be certain whether the failure is due to the bundler or to user error.

Therefore, it seems the answer to "are our scores misleading?" is unequivocally yes, and the discussion should be about if/how to mitigate it. I'm not saying the project is shit and you should tear it down and set it on fire, just that this specific issue should maybe be addressed (mainly because I am a browserify fanboy). But why even ask the question if you don't want to entertain a logical answer?

@ahmadnassri

ahmadnassri commented Jul 18, 2020

had a brief chat on twitter with @argyleink and @surma about this today, and was linked here, so I wanna add my thoughts and feedback if they are of any use.

what problem does the "summary" score solve?

whether it's at the top or the bottom, or displayed in a different way ... what is it really solving? if I'm missing any positive value it introduces, I'm happy to learn more ... but in my view, it's creating a "gaming dynamic" which signals "winners and losers", "better and worse", etc ... humans are gonna lean towards the lazy path and just pick the "top" item.

feedback: simply remove the summary section, let developers read through the itemized list to realize what matters to them / their project's needs.

on "failing" tests

this also creates a "competitive" signal, where if some tool doesn't support certain functionality, it's deemed as a failure? I don't think all tools need to have parity of features, and certainly developers using those tools in their projects don't necessarily NEED all these features for their projects ...

if a new tool is built tomorrow that only does 10 things really well, its only purpose is to do those 10 things, and it targets a specific type of project ... is it not worthy to be included?

what if I don't need my bundler to handle "image compression"? or my project has no concern for "Custom Type imports", or any "Non-JavaScript Resources" for that matter?

feedback: use terms like "supported" / "unsupported" / "partially supported" to clearly signal and help the reader pick tools with features that fit their project's needs

I guess this also brings up in my mind the target audience: many developers who are not building and shipping "open source" projects don't need full coverage of features and functionality; rather, they are better off with tools that best fit their project needs ... I'm thinking of Enterprise Developers who are often overworked / too busy to deep-dive, and who just pick the highest-signal tooling and get stuck with their choices for years ...

@jakearchibald
Contributor Author

if I'm missing any positive value it introduces, I'm happy to learn more

We're currently seeing build tools competing on this number. Fixing long-standing bugs and improving documentation.

@jakearchibald
Contributor Author

@jeremy-coleman sorry I hadn't seen you'd edited your post:

I think it's fair to assume a reader will interpret a failure as "this bundler can't do this task", when in reality the data is saying "we haven't figured out how to do this task with this bundler yet"

I'm sorry, but that isn't the reality. Failure means it can't 'reasonably' be achieved with the bundler. There is some wiggle room in 'reasonably', but it's roughly:

  • If it's undocumented, it doesn't exist. We believe developers deserve documentation, and shouldn't be left to read the source and guess if something is actually external or pseudo-private.
  • It's reasonable to have to write your own plugin for less common cases, but you can only write your own plugin using documented parts of the tool (see above), and that plugin shouldn't be fragile – it should play well with other plugins.
  • Bugs that change the behaviour of your code are serious bugs.
  • Developers shouldn't have to opt-out of incorrect behaviour when it comes to core features.

The "failed" tests could be due to the bundler itself or any number of non-bundler issues, including misconfiguration, lack of user knowledge, lack of existing plugins, lack of knowledge of the existence of plugins, etc.

We made the site available to the tool authors weeks before we went public, and they reviewed the tests. We're also continuing to work with them to update the site as bugs are fixed and documentation is written (#357). Sure, maybe a test is marked as a failure because we, and the authors of the tool itself, couldn't make it work when in reality it can be made to work according to the criteria above, but surely we've done due diligence. Also, we accept bug reports and PRs on the data we've provided.

It is impossible to be certain if the failure is due to the bundler or user-error.

I don't understand this point. From a news story to a scientific paper, you could say "it's impossible to be certain if the information is accurate or an error", but I don't know what that proves.

In a scientific paper, this is generally mitigated by providing detailed test conditions and raw results, so a third party can assess the conclusion and reproduce the test. I feel like we've done exactly this with tooling.report, no?

@surma
Contributor

surma commented Jul 20, 2020

To bring this back to the original point: I have some sympathy for the concerns around the scores. I do think they are valuable and create a bit of healthy competition, but I also agree that they don’t need to be at the top of the page.

I liked the idea of moving the overall scores to after the grid, maybe with an extra paragraph explaining the nuance of the scores shown.

Optional: I also liked the idea of adding a total score per section, which might actually be valuable information for users.

@GijsWeterings

@surma I like both moving the scores to after the grid and the intermediate totals!

@una
Contributor

una commented Jul 24, 2020

For a customized view, we can do something like this based on the sub-categories, enabling scores to be adjusted for different user needs. It's a start. I'm thinking something like the Material UI Chip would be perfect. This also would work on mobile, as they would just stack.

tooling-report-filtering

The little "+" can rotate into the "x" as the colors change as well. The results would also respond to the filtering.

@septatrix

I for one was very confused that the tools are not sorted by decreasing score from left to right. At first I thought browserify had the highest score, and only realized otherwise after I had already looked at half the comparisons. However, after reading this thread I realized that scores aren't everything, and that sorting by name might be the more sensible choice going forward. Still, I think there are probably people who fall into this trap.

For this reason I would welcome any changes that somehow emphasize that the score isn't everything when it comes to making your selection.
