Ease bug diagnosis by reporting generalised or multiple failing examples #2192

Closed
Zac-HD opened this issue Nov 8, 2019 · 4 comments
Labels: new-feature (entirely novel capabilities or strategies), opinions-sought (tell us what you think about these ones!)

Zac-HD (Member) commented Nov 8, 2019

How can we make it easier to diagnose failing tests?

Shrinking examples is great: the first failures Hypothesis finds can be so complicated that they're not very helpful. However, shrinking can sometimes go too far and make a failure look less important than it really is - the canonical example is shrinking floating-point examples until they look like rounding errors rather than serious problems (see e.g. #2180 or this essay).
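
As a hedged illustration (not taken from #2180), consider a property whose failures span a huge range of error sizes: Hypothesis reliably finds counterexamples to the test below, but shrinking drives the inputs towards small, simple floats, so the reported failure tends to look like mere rounding noise.

```python
# A hedged illustration, not from the linked issue: shrinking this failing
# property tends to end on small "boring" floats, making a real accuracy
# bug look like an epsilon-sized rounding discrepancy.
from hypothesis import given, strategies as st

finite = st.floats(allow_nan=False, allow_infinity=False)

@given(finite, finite, finite)
def test_addition_is_associative(x, y, z):
    assert (x + y) + z == x + (y + z)
```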

Why do I care? Because as we push the Pareto frontier of easy and powerful testing tools further out, users can write better software and more people will use the tools. IMO this is an especially important topic because it helps both novices with easy problems and experts with hard problems!

We already support multi-bug discovery, but here I'm talking about how we report each bug. Currently we print exactly one (minimal, failing) example to demonstrate each bug, but there might be more informative options.

Generalising counterexamples

In Haskell, Extrapolate and SmartCheck try to generalise examples after minimising them. Translating into Python, it seems worth experimenting with the idea of shrinking to a minimal example and then varying parts of the strategy so that we can tell the user which parts are unimportant. We could also start from non-minimal examples, jump-start the search based on previously tried examples, or overengineer our way out with inductive programming.

We could try to display this by calculating the least-constrained strategy that always fails the test. For example, testing a / b with a=integers(), b=integers() could show something like a=integers(), b=just(0) -> ZeroDivisionError. Abstracting Failure-Inducing Inputs (pdf) and AlHazen (pdf) both take almost exactly this approach, though both are restricted to strings matching a context-free grammar.
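
As a concrete sketch of the "vary parts of the strategy" idea, here's a hand-rolled version applied to the a / b example. The `generalise` and `fails` helpers are illustrative, not an existing Hypothesis API, and a real implementation would re-draw replacement values from the original strategies rather than from `random`.

```python
import random

def fails(test, kwargs):
    """Return True if calling the test with these arguments raises."""
    try:
        test(**kwargs)
    except Exception:
        return True
    return False

def generalise(test, minimal_example, fresh_value, trials=50):
    """Report which arguments of a shrunk failing example look irrelevant:
    the failure persists for every replacement value we try."""
    irrelevant = set()
    for name in minimal_example:
        variants = ({**minimal_example, name: fresh_value(name)} for _ in range(trials))
        if all(fails(test, variant) for variant in variants):
            irrelevant.add(name)
    return irrelevant

# For the a / b example above: varying `a` never makes the failure go away,
# while almost any non-zero `b` does, so only `a` is reported as irrelevant -
# hinting at the generalised report a=integers(), b=just(0).
def test_division(a, b):
    a / b

print(generalise(test_division, {"a": 0, "b": 0},
                 fresh_value=lambda name: random.randint(-100, 100)))
# -> {'a'}
```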

Presenting multiple examples

I'm pretty sure I've also seen some prior art somewhere which presented a set of passing and a set of failing examples to help diagnose test failure. I have enough experience looking at shrunk examples that I can implicitly see what must have passed (because it would be a valid shrink otherwise), but it would be nice to print this explicitly for the benefit of new users and anyone using complex strategies.

We don't even need to execute variations on the minimal example for this - in many cases we could just choose a few of the intermediate examples from our shrinking process. It also seems easier to display examples than an algebraic repr in complicated cases like interactive data(), because we already do this for the final example!
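
As a hedged aside on what exists today: Verbosity.verbose already prints every example as it is tried, including intermediate shrink steps, so the raw material is there - the proposal is essentially to curate and label a few of those examples in the final report.

```python
# A hedged aside: with verbose output, Hypothesis prints each example it
# tries (including shrink steps), so passing and failing variants of the
# minimal example are already observable - just not curated or labelled.
from hypothesis import Verbosity, given, settings, strategies as st

@settings(verbosity=Verbosity.verbose)
@given(st.lists(st.integers()))
def test_sum_is_non_negative(xs):
    assert sum(xs) >= 0
```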

Another valuable source of examples to display comes from targeted PBT: in addition to showing the minimal example, we could show the highest-scoring failing example for each label (subject to a size cap, for both performance and interpretability). This suggests that we might want to keep searching until the score stops growing, much as we shrink to a fixpoint. And why not also show the highest-scoring passing example?
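
For concreteness, here's a sketch of targeted PBT via hypothesis.target(), reusing the associativity property from the first example; the metric and label are illustrative. Under the proposal, the highest-scoring failing example for the "error" label could be reported alongside the fully shrunk one.

```python
# A sketch of targeted PBT: target() steers generation towards examples
# with larger associativity error, which is exactly the "highest-scoring
# failing example" that could be reported next to the minimal one.
from hypothesis import given, target, strategies as st

@given(st.floats(0, 1e100), st.floats(0, 1e100), st.floats(0, 1e100))
def test_associativity_error_is_small(a, b, c):
    error = abs((a + b) + c - (a + (b + c)))
    target(error, label="error")
    assert error < 1.0
```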

What now?

This issue is designed to start a conversation. Personally, I think a setting to report multiple examples would be useful enough that we might want it on by default, or as part of verbosity >= normal; generalising counterexamples seems like a really cool trick, but I'm not convinced it's worth the trouble if we have multi-example reporting.

If or when we have concrete proposals I'll split out focussed issues and close this one.

Zac-HD added the opinions-sought and new-feature labels on Nov 8, 2019
DRMacIver (Member) commented:

> I'm pretty sure I've also seen some prior art somewhere which presented a set of passing and a set of failing examples to help diagnose test failure.

https://agroce.github.io/issta17.pdf is an example of this. We're already effectively doing the normalization bit, but it would definitely be interesting to add in generalisation.

Zac-HD (Member, Author) commented Nov 18, 2019

Of course, reporting more information isn't always better: our current MultipleFailures approach can be so verbose it's tough to read, and it breaks pdb integration. So something more structural than printing each traceback would be nice, ideally something that supports e.g. pytest's various traceback formats.

People are (were?) working on a standard ExceptionGroup class for all the async projects and test runners, but python-trio/trio#611 seems inactive. IMO supporting this is clearly the best way forward, largely by making reporting someone else's problem, but it's unclear when there will actually be an upstream to support.
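
As a hedged sketch of the kind of structural report meant here, using the ExceptionGroup that later became a Python 3.11 builtin via PEP 654 (the message and exceptions below are made up):

```python
# Illustrative only: one exception per distinct bug, grouped so that a test
# runner (or an except* handler) can render or filter them structurally
# instead of flattening every traceback into one wall of text.
failures = [
    ZeroDivisionError("minimal example: a=0, b=0"),
    AssertionError("minimal example: xs=[-1]"),
]
raise ExceptionGroup("Hypothesis found 2 distinct failures", failures)
```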

DRMacIver (Member) commented:

I'd like to make the following tentative proposal:

  1. We stop reporting multiple exceptions altogether.
  2. We still discover and minimize multiple exceptions, and tell the user when we have discovered other errors.
  3. We add some functionality (e.g. a pytest flag) that can be used to select which error you want to show.
  4. We think about the question of how to display a larger and more detailed report (which would include multiple errors) independently of the basic test runner integration.

Zac-HD changed the title from "Better reporting to ease bug diagnosis" to "Ease bug diagnosis by reporting generalised or multiple failing examples" on Dec 13, 2019
This was referenced Mar 15, 2021
Zac-HD (Member, Author) commented Jul 17, 2022

Replaced by the newer and more actionable issue linked just above this comment.

Zac-HD closed this as completed on Jul 17, 2022