
Return efficient strategies for simple filter predicates, instead of rejection sampling #2701

Closed
7 of 8 tasks
Zac-HD opened this issue Dec 16, 2020 · 19 comments
Labels
enhancement it's not broken, but we want it to be better performance go faster! use less memory!

Comments


Zac-HD commented Dec 16, 2020

TODO:

  • Preliminary support for integers()
  • Apply filters to each sub-strategy of a st.one_of(), rather than the top level
  • AST-based analysis of lambdas
  • Support one- or two-bounded integers()
  • Infer min_size=1 from e.g. bool or len or lambda x: x
  • Support floats(...)
  • Handle math.isfinite, math.isinf, and math.isnan for integers() and floats()
    (best handled directly in the respective .filter() methods, rather than the fancy bounds-filtering logic)
  • Support text()/binary() filtered with re.compile(...).find/match/fullmatch
    (includes adding an internal way to exclude characters from the from_regex() strategy; see also Use greenery and regex_transformer to merge pattern and patternProperties keywords python-jsonschema/hypothesis-jsonschema#85)

Rejected ideas:

  • length filters like len(x) > 2 - the name len might be rebound in ways that we can't see from the AST (e.g. monkeypatched from another module). Applies to any predicate with a function call.
  • 'seeing through' builds(foo, x=...).filter(lambda f: f.x > 0) - too many ways to map init-args to attributes, and too many ways to mess with attribute access. Applies to any attribute access.
  • filters comparing elements of tuples() or fixed_dictionaries() - I'd accept a PR but not a feature request for this, it's large and I just don't want to write it.
  • Support operator.contains or x.__contains__ for all core scalar strategies (this just doesn't happen often enough to justify the maintenance work to support it)

st.integers().filter(lambda x: x >= 0) is equivalent to the more efficient st.integers(min_value=0) - see constructive vs predicative data. So why not have the .filter() method return the efficient form when given a simple enough predicate? This is definitely possible, and using the AST module it's not even as bad as the other things we do internally ("terrible, yes, but also great..."). For completeness we should cover the equivalent cases with functools.partial(operator.foo, ...).
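
As a plain-Python illustration of why the constructive form wins (a sketch of the general idea, not Hypothesis internals), compare rejection sampling against drawing from the right range in the first place:

```python
import random

# Illustrative sketch: rejection sampling keeps re-drawing until the
# predicate passes, while the constructive form never wastes a draw.
def rejection_sample(draw, predicate, max_tries=1000):
    for _ in range(max_tries):
        value = draw()
        if predicate(value):
            return value
    raise RuntimeError("filter predicate rejected too many draws")

rng = random.Random(0)

# Predicative form: roughly half of all draws are discarded here.
x = rejection_sample(lambda: rng.randint(-10**6, 10**6), lambda v: v >= 0)

# Constructive form: every draw is already in range.
y = rng.randint(0, 10**6)

assert x >= 0 and y >= 0
```

For a cheap predicate like this the waste is only a constant factor, but for narrow filters (say, v == 0) rejection sampling fails outright where the constructive form cannot.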

There's a prototype here which handles scalar comparisons (<, <=, >=, >) and regex matching, as part of @mristin's work on Parquery/icontract#177. Ideally we'd handle those cases, plus reversed operand order (1 < x), length bounds via e.g. len(x) < 10, non-lambda functions with equivalently simple bodies, and composition of clauses with and, such as x < 10 and x > 0.

For the design, I'd aim to have a function along the lines of hypothesis.internal.reflection.bounds_from_predicate; to start with it can return a dict of kwargs like min_value, exclude_max, max_size, etc. We'd then define a .filter() method on e.g. BoundedIntStrategy which calls that function and falls back to the current approach if the predicate is not understood.
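
A minimal sketch of that design, recognising only the functools.partial(operator.foo, ...) forms mentioned above. All the names here (bounds_from_predicate, IntStrategy, FilteredStrategy) are stand-ins for illustration, not Hypothesis's real internals:

```python
import operator
from functools import partial

def bounds_from_predicate(predicate):
    # Recognise e.g. partial(operator.le, 0), which evaluates as 0 <= x.
    # Returning None means "predicate not understood; fall back to filtering".
    if isinstance(predicate, partial) and len(predicate.args) == 1:
        (bound,) = predicate.args
        return {
            operator.lt: {"min_value": bound + 1},  # bound < x  (integers only)
            operator.le: {"min_value": bound},      # bound <= x
            operator.ge: {"max_value": bound},      # bound >= x
            operator.gt: {"max_value": bound - 1},  # bound > x
        }.get(predicate.func)
    return None

class FilteredStrategy:
    # Stand-in for the current rejection-sampling wrapper.
    def __init__(self, strategy, predicate):
        self.strategy, self.predicate = strategy, predicate

class IntStrategy:
    def __init__(self, min_value=None, max_value=None):
        self.min_value, self.max_value = min_value, max_value

    def filter(self, predicate):
        kwargs = bounds_from_predicate(predicate)
        if kwargs is None:
            return FilteredStrategy(self, predicate)  # current behaviour
        # Intersect the inferred bounds with any existing ones.
        lows = [b for b in (self.min_value, kwargs.get("min_value")) if b is not None]
        highs = [b for b in (self.max_value, kwargs.get("max_value")) if b is not None]
        return IntStrategy(
            min_value=max(lows) if lows else None,
            max_value=min(highs) if highs else None,
        )
```

So IntStrategy().filter(partial(operator.le, 0)) collapses to a bounded strategy, while an opaque lambda still gets wrapped for rejection sampling.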


Is this a good idea? Maybe not, but we already have similar logic for e.g. Django field validators. It would also make life a lot easier for people who infer their tests from types + predicates, which includes basically all design-by-contract libraries (lately both deal and icontract), and of course direct users in a hurry. Pandera's data synthesis (implementation) would also be far more efficient if refactored to take advantage. Symbolic execution tools like Crosshair wouldn't gain much though, since they can discover such constraints dynamically.

Doesn't this break performance intuitions? I don't think it's any worse than e.g. our existing special handling for lists(sampled_from(...), unique=True), which converts quadratic to linear time! It's already difficult to reason about performance in the presence of rejection sampling plus mutation-based example generation, and while this can be a substantial runtime improvement, it's only a constant-factor speedup and falls back to the current behaviour in the worst case. So long as this rewriting works consistently in the covered cases - and it is based on a consistent static analysis - I think it's worth having.

What about ... ? Great question, but let's get the basic cases covered before looking at e.g. integers().filter(lambda x: x % 2).filter(lambda x: x > 9) -> integers(min_value=5).map(lambda x: x * 2 + 1). Other fancy ideas: passing through mapped and one_of strategies, handling tuples and fixed_dictionaries (for *args, **kwargs patterns), predicates involving or, etc. I'm not opposed in principle to fancy stuff, since it's nicely self-contained and doesn't have user-facing API design issues, but simple stuff first!

@Zac-HD Zac-HD added enhancement it's not broken, but we want it to be better performance go faster! use less memory! labels Dec 16, 2020

mristin commented Dec 16, 2020

Hi @Zac-HD ,
One comment from my (icontract) side. What about the preconditions that involve more than one argument?

@icontract.require(lambda x, y: x < y)
def some_func(x: int, y: int) -> None: ...

I suppose this comment belongs to "What about... ?" section :-), but maybe that is something worth considering from the start.


Zac-HD commented Dec 16, 2020

Yeah, for those cases the strategy would be tuples(tuples(<args>), fixed_dictionaries(<kwargs>)).filter(...) - and this is a very common pattern, used in both builds() and @given(), for example. We do have to implement things in order though:

  1. Simple filters on strategies for scalars are a building block for everything else
  2. 'reach through' structure, e.g. tuples(integers()).filter(lambda t: t[0] >= 0) -> tuples(integers(min_value=0))
  3. Multiple-argument predicates over structured inputs are probably the hardest part - unlike the previous steps, there's no single efficient pattern.

I think in general we'll have to turn to @st.composite and draw each variable in turn... and doing that efficiently might involve finding a topological ordering of the variables. That in turn is tricky because e.g. float("nan") breaks certain basic invariants like "if not x < 0, then x >= 0". I'm happy to support simple but frequent cases, but in general I think this is going to have to rely on users writing their own strategies 😕
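
As an illustration of drawing each variable in turn (a plain-Python sketch of the idea, not @st.composite itself), a precondition like x < y can be satisfied by construction rather than by filtering pairs:

```python
import random

rng = random.Random(0)

def draw_ordered_pair(lo=-100, hi=100):
    # Draw x first, leaving room above it, then draw y strictly above x:
    # the precondition x < y now holds by construction, with no rejections.
    x = rng.randint(lo, hi - 1)
    y = rng.randint(x + 1, hi)
    return x, y

assert all(x < y for x, y in (draw_ordered_pair() for _ in range(100)))
```

Note that this only works because < gives a usable ordering over the drawn values; as above, something like float("nan") would break the trick.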


mristin commented Dec 17, 2020

Just some more thoughts:

  • As a user, I would expect the filter to perform the filtering as expected. A bug in your optimization procedure would be nearly impossible to spot. Instead of directly optimizing, why not introduce a linter (as a separate command-line tool) or present the user with possible optimizations once there are too many rejected samples?
  • You probably don't want to optimize filters at instantiation, but want to collect them in some kind of intermediate representation (e.g., a constraint database). They would then be holistically optimized.

I sketched some rough ideas in this issue on icontract-hypothesis which are also related to this issue.


Zac-HD commented Dec 17, 2020

I really appreciate your input on this 😍 - thanks! Now, lots more thoughts from me...

The trouble is that "perform the filtering as expected" really depends on what you expect! Because of Hypothesis' heuristics and mutation-based input generation, it's already almost impossible to describe the distribution of inputs - for example, integers().filter(lambda x: x >= 0) will succeed substantially more than half the time, and the ratio gets more extreme for more difficult filters.

There are two possible kinds of bugs here: either too permissive, or too restrictive. We can eliminate the former by continuing to run the predicate, and raising an internal error if it fails (this adds some passthrough logic in FilteredStrategy.filter, but that's OK). For the latter case, I think we will have to rely mostly on code review, a clear implementation, and lots of comments.

I like the idea of using an aggressive version to provide suggestions if a filter is failing too often - this can be helpful even if it's not guaranteed to be correct in all cases. The challenge is that, as for #434 (comment), it's quite difficult to unpick which sub-strategy was the problem at runtime. See #2705 for my proposal to do so as a linter instead 😁

I also think that if we're doing this we should actually use the transformations though. Often filters are simply easier to write or more natural to read than a constructive generator for the same inputs, and "please write this more complicated code we could do for you" is not a great user experience. I personally prefer to read from_type(Bar).filter(lambda bar: bar.foo.x > 0) because it matches how I think about the instances, whereas builds(Bar, foo=builds(Foo, x=integers(min_value=1))) tells me how to make them rather than what they are. Admittedly personal preference, but I often take a small-to-moderate efficiency penalty for similar readability advantages.

You probably don't want to optimize filters at the instantiation, but want to collect them in some kind of intermediate representation (e.g., a constraint database). They would be then holistically optimized.

We may need to do this eventually, but for the first PR I'd just try to handle each in turn. We already have lazy instantiation of strategies, and this would simplify the implementation considerably.

Note that if we're unable to convert some filter and apply another, we can still try to convert the latter: s.filter(f).filter(g) is equivalent to s.filter(g).filter(f), unless f or g is stateful in some way which would prevent us from converting them.
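
A quick concrete check of that commutativity for pure predicates:

```python
# For pure (side-effect-free) predicates, applying filters in either order
# selects exactly the same values, so it's safe to rewrite whichever one
# we understand, regardless of where it appears in the chain.
values = range(-10, 11)
f = lambda x: x % 2 == 0  # keep evens
g = lambda x: x > 0       # keep positives

assert [x for x in values if f(x) and g(x)] == [x for x in values if g(x) and f(x)]
assert [x for x in values if f(x) and g(x)] == [2, 4, 6, 8, 10]
```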

And a final note, if we can detect "pure predicates" for this we should apply it in sampled_from() as well.


mristin commented Dec 17, 2020

Often filters are simply easier to write or more natural to read than a constructive generator for the same inputs, and "please write this more complicated code we could do for you" is not a great user experience. I personally prefer to read from_type(Bar).filter(lambda bar: bar.foo.x > 0) because it matches how I think about the instances, whereas builds(Bar, foo=builds(Foo, x=integers(min_value=1))) tells me how to make them rather than what they are. Admittedly personal preference, but I often take a small-to-moderate efficiency penalty for similar readability advantages.

I agree, declarative is much clearer and easier than procedural.

What would be the canonical way of defining filtering on the multiple arguments of a function? I see that you can use tuples, but is there a way to have named arguments somehow? IMO, it would be important to document this upfront, even though you might optimize such strategies only in the future.

Here is a trivial example. A precondition such as:

@require(lambda x, y: x < y)

can be written as an unoptimized strategy:

st.tuples(st.integers(), st.integers())
.filter(lambda t: t[0] < t[1])

and passed as args to the function.

Alternatively, you can use dictionaries:

st.fixed_dictionaries({'x': st.integers(), 'y': st.integers()})
.filter(lambda d: d['x'] < d['y'])

and pass them as kwargs to the function.

There must be many nicer ways -- whatever you pick, I'd suggest making sure that it can interoperate with optimizing the strategies later down the line, and recommending it as such to users in the documentation.


mristin commented Dec 17, 2020

Another idea re optimizations: it would be nice if the user could somehow inspect the optimization (by using a setting?) or be informed on too many rejected samples what the final optimized strategy looks like.

The magic would be the default, but it needs ways for "demystification" when it fails.


Zac-HD commented Dec 17, 2020

The canonical approach is *args -> tuples(), **kwargs -> fixed_dictionaries(), and *args, **kwargs -> tuples(tuples(), fixed_dictionaries()). I'd aim to support all these patterns for the 'reach through' and multiple-argument filters above. All the nicer ways to do this amount to "use builds()", and that's just syntactic sugar for this approach 😉
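
A small sketch of that canonical encoding in plain Python (some_func is a hypothetical example): a drawn value of shape (args_tuple, kwargs_dict) can be applied to any signature:

```python
# The canonical drawn shape for a function taking *args and **kwargs is a
# (tuple, dict) pair, as if drawn from tuples(tuples(...), fixed_dictionaries(...)).
def call_with(drawn, func):
    args, kwargs = drawn
    return func(*args, **kwargs)

def some_func(x, y, *, scale=1):  # hypothetical function under test
    return (y - x) * scale

drawn = ((1, 5), {"scale": 2})
assert call_with(drawn, some_func) == 8
```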

For demystification - I call that "legibility", where the system supports user understanding - I think it would be appropriate to debug_report(f"Rewriting {before!r} to {after!r} for efficiency"). Plus some documentation here, of course.


Zac-HD commented Feb 7, 2021

A first-pass prototype which mostly works: master...Zac-HD:efficient-filter

Note that this is solving a complementary problem to the icontract version: that's basically from_type() augmented by inferred bounds, whereas this changes the effect of applying a filter to existing strategies. The remaining challenges are basically to avoid breaking any reprs, and to check that we don't end up polluting the strategy cache.

The next steps after getting this working are, independently, to get this working for a wider variety of scalar types, and to handle a wider variety of predicates. I'm OK with special-casing partial(operator.XX, ...) for now though; it's reliable and still useful to e.g. Pandera.


mristin commented Feb 7, 2021

Hi @Zac-HD !
Thanks for this code!

Could you please explain this part a little bit:

Note that this is solving a complementary problem to the icontract version: that's basically from_type() augumented by inferred bounds, whereas this changes the effect of applying a filter to existing strategies.

When I read the code, it seems to me that both approaches are pretty much the same (icontract-hypothesis works on the AST, this approach on the operator module). Both change the strategies (the unmatched contracts are left as .filter() chains in icontract-hypothesis, and I interpreted the code of your approach to do the same), don't they? What am I missing? (As I am a non-native English speaker, it may well be that I got "this" and "that" in your paragraph wrong?)


mristin commented Feb 7, 2021

P.S.

that's basically from_type() augumented by inferred bounds [...]

Maybe this confused me -- if "that" refers to icontract-hypothesis, then the implementation scope is actually already wider (regular expressions are already matched as well, and the remaining contracts are given as a chain of .filter()'s).


Zac-HD commented Feb 7, 2021

I was referring to icontract-hypothesis, and ignoring regex for now - I do plan to include them soon after the basic numeric filters are done.

The main difference is that your code creates strategies, while my code modifies strategies - and is therefore more flexible for users but less elegant and less reliable.


mristin commented Feb 7, 2021

The main difference is that your code creates strategies, while my code modifies strategies [...]

Thanks, this clarifies it for me!


mristin commented Feb 7, 2021

Btw., is there a reasonable way to benchmark the overhead with this new approach as opposed to putting the bounds directly?

It would make more sense for icontract-hypothesis to translate contracts into filters built from partial functions over operators (as you indicated in the docstrings), and rely on Hypothesis for the optimization, rather than reimplement the very same feature.


Zac-HD commented Feb 7, 2021

The overhead should be fairly small, I think, since we're talking about a one-time calculation to define the strategy rather than something which happens on every draw.

I'd also recommend against switching icontract-hypothesis over at the moment, since I intend to support AST-based analysis too once I've proven the logic works. It's probably easier just to wait for that to be supported 😁


Zac-HD commented Feb 8, 2021

I've just opened #2853, to rewrite simple predicates for bounded integers.

The logical next step is to extend this to integers with fewer than two bounds, and to floats. Fractions and decimals are also on the list, but lower priority than strings. That will probably involve re-architecting the numeric strategies to use the ConjectureData-level API rather than the .filter() method at a low level (e.g. integers(min_value=-1) is implemented as integers().filter(lambda x: x >= -1); they're equivalent, but using the lower-level API gives us a more consistent structure and therefore easier rewriting).

Further specific ideas which could be implemented independently:

  • apply filters to each sub-strategy of a st.one_of(), rather than the top level (requires a change in lists(one_of(...), unique=True)). This is a general nice-to-have for efficiency.
  • support collection length bounds (including .filter(bool), .filter(len), etc.)
  • support string regex patterns - includes adding an internal way to exclude characters from the from_regex() strategy (so that e.g. st.text(alphabet="abc").filter(re.compile("...").fullmatch) works). Fortunately I'll have a student working on regex length-bounds and intersection from March to May this year, which is the tricky bit 😁
  • special case for st.text().filter(str.isidentifier), as per Document how to make a Strategy for enums #2693 (comment)
  • AST-based lambda analysis as demonstrated by icontract-hypothesis (see above)


Zac-HD commented Apr 26, 2021

OK, I sat down for a few hours and got AST analysis working, so that we can rewrite filters based on lambdas and other simple functions: master...Zac-HD:understand-filter-ast. PR later this week.

It only supports combinations of comparisons, joined with and and using literal bounds, in a single-argument lambda (or a function whose body is a single return ...), but that's actually pretty good! And importantly, it falls back safely to performing no rewrites in uncertain cases.
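
To make the supported shape concrete, here's a heavily simplified sketch of that kind of analysis, operating on the lambda's source text (which a real implementation would first have to recover, e.g. via inspect.getsource) and extracting only a >= lower bound; the real branch handles more operators, reversed operands, and and-chains. The function name is hypothetical:

```python
import ast

def lower_bound_from_source(source):
    # Parse e.g. "lambda x: x >= 0" and return 0; return None (fail safe,
    # meaning "perform no rewrite") for anything we don't fully understand.
    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError:
        return None
    lam = tree.body
    if not (isinstance(lam, ast.Lambda) and len(lam.args.args) == 1):
        return None
    arg_name = lam.args.args[0].arg
    body = lam.body
    if (
        isinstance(body, ast.Compare)
        and len(body.ops) == 1
        and isinstance(body.ops[0], ast.GtE)       # only `>=` in this sketch
        and isinstance(body.left, ast.Name)
        and body.left.id == arg_name               # the lambda's own argument
        and isinstance(body.comparators[0], ast.Constant)
    ):
        return body.comparators[0].value  # the bound in `x >= bound`
    return None

assert lower_bound_from_source("lambda x: x >= 0") == 0
assert lower_bound_from_source("lambda x: len(x) >= 3") is None  # function call: no rewrite
```

Failing safe is the key property: chained comparisons, calls like len(x), or unparseable source all return None, so the worst case is just the existing rejection-sampling behaviour.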


honno commented Jun 27, 2022

The remaining challenges are basically to avoid breaking any reprs

I'm assuming this is in reference to filter-rewritten strategies still having a repr which contains the lambda expression, e.g. st.integers().filter(lambda n: n >= 0) reprs as it is defined, instead of st.integers(min_value=0).

Is this just a user-experience thing? I can see that the lambda expression might make more sense to the user than a possibly mystifying kwarg, or that the rewritten form could be problematic when debugging some complex "filtering architecture", e.g. whatever cool thing Pandera is doing. Am I missing another concern?

(also gah dependabot)


Zac-HD commented Jun 27, 2022

The remaining challenges are basically to avoid breaking any reprs

I'm assuming this is in reference to filter-rewritten strategies still having a repr which contains the lambda expression, e.g. st.integers().filter(lambda n: n >= 0) reprs as it is defined, instead of st.integers(min_value=0).

That's a now-stale reference to the unfinished branch I was working on - st.integers().filter(lambda n: n >= 0) is the intended repr. That's because it goes via LazyStrategy, which shows how the strategy was in fact constructed, whereas the unwrapped form tends to show a lot of less-readable internals. See for example the repr of FloatStrategy, which doesn't look like floats()!

And yes, I 😡 the spammy bots, it's pretty rude to improve the experience of your paying customers by making life worse for maintainers 😡


Zac-HD commented Oct 17, 2022

Because regex support will be more complicated, I've moved that to a new dedicated issue and am closing this one 🙂
