A big list of observability ideas #3845

Open
3 of 4 tasks
Zac-HD opened this issue Jan 15, 2024 · 0 comments
Labels: interop (how to play nicely with other packages), legibility (make errors helpful and Hypothesis grokable), new-feature (entirely novel capabilities or strategies)


Zac-HD commented Jan 15, 2024

Since we added our experimental observability features, there has been a steady stream of ideas tossed around. This issue tracks ideas for future work in Hypothesis itself and/or downstream (e.g. in Tyche). Of course, there's no timeline for these ideas, nor any strong reason to think that we'll ever implement them - I just don't want to forget them!

Inside Hypothesis

  • Track the current Phase on ConjectureRunner: currently, the active phase is defined by "what part of ConjectureRunner is currently executing", which is fine - but it means that our how_generated string has to guess the phase. We should track it as an instance variable and then report a precise string in our observability output.
    Avoid pointless discards during the reuse and target phases #3862

  • Warn for assume() calls, stateful preconditions, or filters which never pass: previously requested in "Hypothesis doesn't tell me when a rule never assumes something successfully" #213. The best approach is probably to track the number satisfied/unsatisfied in metadata, then aggregate and have Tyche report iff we're confident there's a problem (a counting sketch follows this list).

  • Improve reported reprs for stateful testing - we have them for terminal reports, so this should "just" take a bit of plumbing to include them in observability reports too.

    • Can we also expose timing information about which rules and invariants were executed, and how long they took to run? It's easy enough to get the sum of call times into the timing key, but the mean call time is probably more informative... maybe put the mean (and max) call time into features and the sum in timing? (A sketch of that aggregation also follows this list.)
  • Give more detail on status_reason to ensure reports are actionable

    • Include "status_reason_category" in features and "status_reason_location" in metadata
    • Report status_reason for overruns during shrinking, to clarify why there are so many
    • (downstream) show the number of each unique status_reason for status: gave_up test cases
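
To make the sat/unsat counting concrete, here's a rough sketch of the bookkeeping - emphatically not Hypothesis' actual internals; the PredicateCounter name, the metadata shape, and the min_trials threshold are all made up for illustration:

```python
# Hypothetical bookkeeping for assume()/precondition/filter outcomes.
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PredicateCounter:
    satisfied: int = 0
    unsatisfied: int = 0

    @property
    def total(self) -> int:
        return self.satisfied + self.unsatisfied


counters: dict[str, PredicateCounter] = defaultdict(PredicateCounter)


def record(label: str, passed: bool) -> None:
    """Record one evaluation of an assume()/precondition/filter predicate."""
    counter = counters[label]
    if passed:
        counter.satisfied += 1
    else:
        counter.unsatisfied += 1


def metadata_summary() -> dict:
    """What we might emit under observability metadata, for Tyche to aggregate."""
    return {
        label: {"satisfied": c.satisfied, "unsatisfied": c.unsatisfied}
        for label, c in counters.items()
    }


def never_satisfied(min_trials: int = 100) -> list[str]:
    """Labels that never passed despite many attempts - only warn when confident."""
    return [
        label
        for label, c in counters.items()
        if c.total >= min_trials and c.satisfied == 0
    ]
```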
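
And a similarly hand-wavy sketch of the rule-timing aggregation, with the sum of call times destined for the timing key and mean/max for features - the key names below are placeholders rather than the real output format:

```python
# Hypothetical aggregation of per-rule/invariant call durations.
from collections import defaultdict
from statistics import mean

durations: dict[str, list[float]] = defaultdict(list)


def record_call(rule_name: str, seconds: float) -> None:
    """Record one rule or invariant execution time."""
    durations[rule_name].append(seconds)


def observability_fragments() -> tuple[dict, dict]:
    """Sum of call times for `timing`, mean and max for `features`."""
    timing = {f"execute:rule:{name}": sum(times) for name, times in durations.items()}
    features = {}
    for name, times in durations.items():
        features[f"mean_calltime:{name}"] = mean(times)
        features[f"max_calltime:{name}"] = max(times)
    return timing, features
```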

Downstream interfaces

  • Discover the time-complexity of your tests: fit various classes (log, linear, n-log-n, quadratic, exponential, ...) to the per-strategy timings and JSON-ified arguments from "Pull out timing observations, more jsonable arguments" #3834. Cool info line, or maybe an alert if it's really slow (a rough fitting sketch follows this list).
    This paper suggests that a two-parameter power-law fit is sufficient, but they're dealing with substantially larger inputs than Hypothesis will generate - in addition to counting basic blocks rather than durations (which we could do with sys.monitoring in Python 3.12+). Conversely, this preprint and R package just fit a few known classes to observed durations.

  • Better user interface to data from the explain phase: Reporting improvements for Scrutineer #3551 should not be the best we can do. Or go further, and use the coverage information to provide in-editor highlighting (example from debuggingbook.com) - but note that further techniques don't seem to help (note to self: finish my essay on that).

  • In-editor interface to apply explicit @example patches: it'd be neat to surface this feature to more users, and 'GUI to pick which chunks to commit' is a common tool for patch management. If needed we could emit these as an info message as well as writing them to disk.

  • Configuration feedback to help tune settings: 'please autotune max_examples' is a fairly common user request. I've declined because run-for-duration risks testing far less than expected (if the test is slower than believed), but providing information for manual tuning would still be very helpful. Following "Estimating Residual Risk in Greybox Fuzzing", we can estimate the number of inputs required to saturate coverage and features, and show the distribution of that over test functions (a Good-Turing sketch follows this list). Although maybe this is better left to HypoFuzz?

  • Explorable inputs with UMAP: dimensionality-reduction tools are a great way to explore data. Embed a coverage vector [1] (or anything else!) for each example, show details on hover, color by status or runtime or arbitrary other classifications... (a UMAP sketch follows the footnotes).
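
For the curve fitting, something like the following least-squares comparison would probably do as a first pass downstream - the candidate classes and the sizes/seconds inputs are stand-ins for whatever we actually extract from the observations:

```python
# Sketch: pick the best-fitting complexity class for one strategy's timings.
import numpy as np

CANDIDATES = {
    "O(1)": lambda n: np.ones_like(n),
    "O(log n)": lambda n: np.log1p(n),
    "O(n)": lambda n: n,
    "O(n log n)": lambda n: n * np.log1p(n),
    "O(n^2)": lambda n: n**2,
}


def best_fit(sizes, seconds) -> str:
    """Return the candidate class with the smallest squared residual."""
    sizes = np.asarray(sizes, dtype=float)
    seconds = np.asarray(seconds, dtype=float)
    best_name, best_residual = None, np.inf
    for name, transform in CANDIDATES.items():
        x = transform(sizes)
        # Fit seconds ~ a * transform(size) + b.
        design = np.column_stack([x, np.ones_like(x)])
        coeffs, *_ = np.linalg.lstsq(design, seconds, rcond=None)
        residual = float(np.sum((design @ coeffs - seconds) ** 2))
        if residual < best_residual:
            best_name, best_residual = name, residual
    return best_name
```

(An exponential class could be handled the same way by fitting on log-durations; I've left it out to keep the sketch numerically boring.)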
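
The core estimator behind the residual-risk approach is simple enough to sketch too: with n inputs observed and f1 features seen in exactly one input, the Good-Turing "missing mass" - the chance that the next input discovers something new - is roughly f1 / n. The data shape below is an assumption about what we'd pull from the observability output per test function:

```python
# Sketch of a Good-Turing missing-mass estimate over observed features.
from collections import Counter


def missing_mass(features_per_input: list[set[str]]) -> float:
    """Estimate the probability that one more input reveals a new feature."""
    n = len(features_per_input)
    if n == 0:
        return 1.0
    inputs_seen_in: Counter[str] = Counter()
    for features in features_per_input:
        for feature in features:
            inputs_seen_in[feature] += 1
    singletons = sum(1 for count in inputs_seen_in.values() if count == 1)
    return singletons / n
```

If that estimate is already tiny at the current max_examples, raising it further probably isn't worth much; the interesting report is the distribution of this value across test functions.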

Footnotes

  1. Find the set of locations which are covered by some but not all examples, and are not redundant with any other such location. Convert each unique coverage observation to a zero-or-one vector recording whether it covered each of those locations, and then run UMAP with some tuned-for-this-purpose hyperparameters.
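
A back-of-the-envelope version of that pipeline, assuming umap-learn and matplotlib are available downstream - the column filtering mirrors the footnote, and the hyperparameters are guesses rather than tuned values:

```python
# Sketch: embed per-example coverage vectors with UMAP and color by status.
import matplotlib.pyplot as plt
import numpy as np
import umap


def embed_coverage(coverage_matrix: np.ndarray, statuses: list[str]) -> None:
    """coverage_matrix has one 0/1 row per example and one column per location."""
    # Keep locations covered by some but not all examples...
    column_sums = coverage_matrix.sum(axis=0)
    informative = (column_sums > 0) & (column_sums < len(coverage_matrix))
    # ...and drop columns that duplicate (are redundant with) another column.
    reduced = np.unique(coverage_matrix[:, informative], axis=1)

    embedding = umap.UMAP(metric="jaccard", n_neighbors=15, min_dist=0.1).fit_transform(reduced)

    status_to_color = {s: i for i, s in enumerate(sorted(set(statuses)))}
    plt.scatter(
        embedding[:, 0],
        embedding[:, 1],
        c=[status_to_color[s] for s in statuses],
        cmap="tab10",
        s=10,
    )
    plt.title("Examples embedded by coverage")
    plt.show()
```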
