Ease bug diagnosis by reporting generalised or multiple failing examples #2192

Closed
Zac-HD opened this issue Nov 8, 2019 · 4 comments
Labels: new-feature (entirely novel capabilities or strategies), opinions-sought (tell us what you think about these ones!)

Zac-HD (Member) commented Nov 8, 2019

How can we make it easier to diagnose failing tests?

Shrinking examples is great: the first failures Hypothesis finds can be so complicated that they're not very helpful. However, shrinking can sometimes go too far and make a failure look less important than it really is - the canonical example is shrinking floating-point examples until they look like rounding errors rather than serious problems (see e.g. #2180 or this essay).
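
As a hedged illustration (not taken from #2180), consider a property whose failures span a huge range of error sizes: Hypothesis reliably finds counterexamples to the test below, but shrinking drives the inputs towards small, simple floats, so the reported failure tends to look like mere rounding noise.

```python
# A hedged illustration, not from the linked issue: shrinking this failing
# property tends to end on small "boring" floats, making a real accuracy
# bug look like an epsilon-sized rounding discrepancy.
from hypothesis import given, strategies as st

finite = st.floats(allow_nan=False, allow_infinity=False)

@given(finite, finite, finite)
def test_addition_is_associative(x, y, z):
    assert (x + y) + z == x + (y + z)
```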

Why do I care? Because as we push the Pareto frontier of easy and powerful testing tools further out, users can write better software and more people will use the tools. IMO this is an especially important topic because it helps both novices with easy problems and experts with hard problems!

We already support multi-bug discovery, but here I'm talking about how we report each bug. Currently we print exactly one (minimal, failing) example to demonstrate each bug, but there might be more informative options.

Generalising counterexamples

In Haskell, Extrapolate and SmartCheck try to generalise examples after minimising them. Translating into Python, it seems worth experimenting with the idea of shrinking to a minimal example and then varying parts of the strategy so that we can tell the user which parts are unimportant. We could also start from non-minimal examples, jump-start the search based on previously tried examples, or overengineer our way out with inductive programming.

We could try to display this by calculating the least-constrained strategy that always fails the test. For example, testing a / b with a=integers(), b=integers() could show something like a=integers(), b=just(0) -> ZeroDivisionError. Abstracting Failure-Inducing Inputs (pdf) and AlHazen (pdf) both take almost exactly this approach, though both are restricted to strings matching a context-free grammar.
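
As a concrete sketch of the "vary parts of the strategy" idea, here's a hand-rolled version applied to the a / b example. The `generalise` and `fails` helpers are illustrative, not an existing Hypothesis API, and a real implementation would re-draw replacement values from the original strategies rather than from `random`.

```python
import random

def fails(test, kwargs):
    """Return True if calling the test with these arguments raises."""
    try:
        test(**kwargs)
    except Exception:
        return True
    return False

def generalise(test, minimal_example, fresh_value, trials=50):
    """Report which arguments of a shrunk failing example look irrelevant:
    the failure persists for every replacement value we try."""
    irrelevant = set()
    for name in minimal_example:
        variants = ({**minimal_example, name: fresh_value(name)} for _ in range(trials))
        if all(fails(test, variant) for variant in variants):
            irrelevant.add(name)
    return irrelevant

# For the a / b example above: varying `a` never makes the failure go away,
# while almost any non-zero `b` does, so only `a` is reported as irrelevant -
# hinting at the generalised report a=integers(), b=just(0).
def test_division(a, b):
    a / b

print(generalise(test_division, {"a": 0, "b": 0},
                 fresh_value=lambda name: random.randint(-100, 100)))
# -> {'a'}
```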

Presenting multiple examples

I'm pretty sure I've also seen some prior art somewhere which presented a set of passing and a set of failing examples to help diagnose test failure. I have enough experience looking at shrunk examples that I can implicitly see what must have passed (because it would be a valid shrink otherwise), but it would be nice to print this explicitly for the benefit of new users and anyone using complex strategies.

We don't even need to execute variations on the minimal example for this - in many cases we could just choose a few of the intermediate examples from our shrinking process. It also seems easier to display examples than an algebraic repr in complicated cases like interactive data(), because we already do this for the final example!
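
As a hedged aside on what exists today: Verbosity.verbose already prints every example as it is tried, including intermediate shrink steps, so the raw material is there - the proposal is essentially to curate and label a few of those examples in the final report.

```python
# A hedged aside: with verbose output, Hypothesis prints each example it
# tries (including shrink steps), so passing and failing variants of the
# minimal example are already observable - just not curated or labelled.
from hypothesis import Verbosity, given, settings, strategies as st

@settings(verbosity=Verbosity.verbose)
@given(st.lists(st.integers()))
def test_sum_is_non_negative(xs):
    assert sum(xs) >= 0
```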

Another valuable source of examples to display comes from targeted PBT: in addition to showing the minimal example, we could show the highest-scoring failing example for each label (subject to a size cap, for both performance and interpretability). This suggests that we might want to keep searching until the score stops growing, much as we shrink to a fixpoint. And why not also show the highest-scoring passing example?
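
For concreteness, here's a sketch of targeted PBT via hypothesis.target(), reusing the associativity property from the first example; the metric and label are illustrative. Under the proposal, the highest-scoring failing example for the "error" label could be reported alongside the fully shrunk one.

```python
# A sketch of targeted PBT: target() steers generation towards examples
# with larger associativity error, which is exactly the "highest-scoring
# failing example" that could be reported next to the minimal one.
from hypothesis import given, target, strategies as st

@given(st.floats(0, 1e100), st.floats(0, 1e100), st.floats(0, 1e100))
def test_associativity_error_is_small(a, b, c):
    error = abs((a + b) + c - (a + (b + c)))
    target(error, label="error")
    assert error < 1.0
```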

What now?

This issue is designed to start a conversation. Personally, I think a setting to report multiple examples would be useful enough that we might want it on by default, or as part of verbosity >= normal; generalising counterexamples seems like a really cool trick, but I'm not convinced it's worth the trouble if we have multi-example reporting.

If or when we have concrete proposals I'll split out focussed issues and close this one.

Zac-HD added the opinions-sought and new-feature labels on Nov 8, 2019
DRMacIver (Member) commented:

> I'm pretty sure I've also seen some prior art somewhere which presented a set of passing and a set of failing examples to help diagnose test failure.

https://agroce.github.io/issta17.pdf is an example of this. We're already effectively doing the normalization bit, but it would definitely be interesting to add in generalisation.

Zac-HD (Member, Author) commented Nov 18, 2019

Of course, reporting more information isn't always better: our current MultipleFailures approach can be so verbose it's tough to read, and it breaks pdb integration. So something more structural than printing each traceback would be nice, ideally something that supports e.g. pytest's various traceback formats.

People are (were?) working on a standard ExceptionGroup class for all the async projects and test runners, but python-trio/trio#611 seems inactive. IMO supporting this is clearly the best way forward, largely by making reporting someone else's problem, but it's unclear when there will actually be an upstream to support.
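
As a hedged sketch of the kind of structural report meant here, using the ExceptionGroup that later became a Python 3.11 builtin via PEP 654 (the message and exceptions below are made up):

```python
# Illustrative only: one exception per distinct bug, grouped so that a test
# runner (or an except* handler) can render or filter them structurally
# instead of flattening every traceback into one wall of text.
failures = [
    ZeroDivisionError("minimal example: a=0, b=0"),
    AssertionError("minimal example: xs=[-1]"),
]
raise ExceptionGroup("Hypothesis found 2 distinct failures", failures)
```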

DRMacIver (Member) commented:

I'd like to make the following tentative proposal:

  1. We stop reporting multiple exceptions altogether.
  2. We still discover and minimize multiple exceptions, and tell the user when we have discovered other errors.
  3. We add some functionality (e.g. a pytest flag) that can be used to select which error you want to show.
  4. We think about the question of how to display a larger and more detailed report (which would include multiple errors) independently of the basic test runner integration.

Zac-HD changed the title from "Better reporting to ease bug diagnosis" to "Ease bug diagnosis by reporting generalised or multiple failing examples" on Dec 13, 2019
This was referenced Mar 15, 2021
Zac-HD (Member, Author) commented Jul 17, 2022

Replaced by the newer and more actionable issue linked just above this comment.

Zac-HD closed this as completed on Jul 17, 2022