
Make Hypothesis strategies more efficient with statistics resolver and reducing use of .filter() #404

Closed
Zac-HD opened this issue Feb 6, 2021 · 7 comments · Fixed by #1503
Labels
enhancement New feature or request

Comments

@Zac-HD
Contributor

Zac-HD commented Feb 6, 2021

Heya! I'm a Hypothesis core dev, and stoked that you've found it useful enough to promote to your users 🥰 I also have some suggestions for how to make things more efficient - the short version is "avoid using .filter() wherever possible"; the long version is... well, longer. And probably a lot of work for all of us 😅

Filtering is very convenient, but it can also cause performance problems, because it works by rejecting and re-drawing whatever data does not pass the predicate. This is OK (ish, usually) for scalars, but it adds up really fast when you're rejecting whole columns or dataframes. So "try to filter elements, not columns" is the first piece of advice, and one that makes sense to implement immediately.

More generally, in many cases it's possible to define a strategy which always passes the check, instead of filtering:

pa.Column(float, checks=[pa.Check.gt(0), pa.Check.le(1)])     # the schema
st.floats().filter(lambda x: x > 0).filter(lambda x: x <= 1)  # the current strategy - it's slow!
st.floats(0, 1, exclude_min=True)                             # the ideal strategy - #nofilter 😉
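To make the cost difference concrete, here is a plain-Python sketch (illustrative only - not Hypothesis internals, and `draw_filtered`/`draw_bounded` are made-up names) of why rejection sampling re-draws so much:

```python
# Why .filter() costs more than encoding the constraint in the strategy:
# rejection sampling must re-draw every value that fails the predicate.
import random

def draw_filtered(rng, predicate, draws_counter):
    # Mimics st.floats().filter(pred): redraw until the predicate passes.
    while True:
        draws_counter[0] += 1
        x = rng.uniform(-10, 10)  # unconstrained draw
        if predicate(x):
            return x

def draw_bounded(rng, lo, hi, draws_counter):
    # Mimics st.floats(lo, hi): every draw is valid by construction.
    draws_counter[0] += 1
    return rng.uniform(lo, hi)

rng = random.Random(0)
filtered_draws, bounded_draws = [0], [0]
for _ in range(1000):
    draw_filtered(rng, lambda x: 0 < x <= 1, filtered_draws)
    draw_bounded(rng, 0.0, 1.0, bounded_draws)

print(bounded_draws[0])   # exactly 1000 underlying draws
print(filtered_draws[0])  # far more - only about 1/20 of [-10, 10] passes
```

And this toy example filters a single scalar; rejecting a whole column or dataframe multiplies the wasted work by the number of elements.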

Now... it's basically fine to say this if you're writing it all by hand, but I agree that it's going to be painful to manage from library code. That's why we have HypothesisWorks/hypothesis#2701, a plan to automatically rewrite strategies when simple filter functions are applied - I'd propose that we work on that together rather than have everyone implement it separately, to split the work and share the benefits.

@Zac-HD
Contributor Author

Zac-HD commented Feb 6, 2021

You also mention that

https://github.com/pandera-dev/pandera/blob/637f2bf1745d9053eede2ac15655a03b29210178/pandera/strategies.py#L307-L308

...that's just because nobody has asked for them yet; and if you're working on categorical or object support we'd love to have that upstream rather than Pandera-specific if that would work for you 😄

@cosmicBboy
Collaborator

Thanks, @Zac-HD!

This is OK (ish, usually) for scalars, but adds up really fast when you're rejecting whole columns or dataframes. So "try to filter elements, not columns" is the first piece of advice, and one that makes sense to implement immediately.

Yes, I noticed this when building out the strategy wrappers: pandera currently filters elements where it can, and only falls back to filtering whole columns/dataframes for user-defined custom checks that haven't been registered with a user-provided strategy via the extensions API.

Building off of your suggestion, I think there are a couple of possible approaches on the pandera strategy-chaining implementation:

  1. Use map for strategy chaining:

pa.Column(float, checks=[pa.Check.gt(0), pa.Check.le(1)])     # the schema

# currently, the first check is the "base strategy", and its statistics are used to build the first strategy
st.floats(min_value=0).filter(lambda x: x <= 1)

# using map instead
st.floats(min_value=0).map(lambda x: x if x <= 1 else 1)

  2. Introduce the concept of "statistics resolvers" in the backend to aggregate multiple checks into a single strategy. This would raise an error if the resolver finds incompatible statistics.

pa.Column(float, checks=[pa.Check.gt(0), pa.Check.le(1)])     # the schema

# statistics resolver would collect all relevant constraints
agg_stats = {"min_value": 0, "max_value": 1, ...}

st.floats(agg_stats["min_value"], agg_stats["max_value"], exclude_min=True)

# this should raise an error in the statistics resolver: no float is both > 1 and <= 0
pa.Column(float, checks=[pa.Check.gt(1), pa.Check.le(0)])

The pro of (1) is ease of implementation: subsequent strategies in a chain don't need to know anything about the constraints of the prior strategies. The con is that the first strategy can oversample values that don't agree with the second strategy's constraint, which mapping then forces onto the boundary (leading to a lot of 1s in the above example).
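The oversampling concern with the .map() approach can be sketched with plain random draws (illustrative numbers, not pandera or Hypothesis code):

```python
import random

rng = random.Random(42)
# Mimic st.floats(min_value=0).map(lambda x: x if x <= 1 else 1) when the
# base strategy draws from a much wider range than the second constraint.
samples = [min(rng.uniform(0, 10), 1.0) for _ in range(1000)]
clamped = sum(1 for s in samples if s == 1.0)
frac = clamped / len(samples)
print(frac)  # roughly 0.9: most draws exceed 1 and get clamped to exactly 1.0
```

With a base range of [0, 10] and a cap at 1, about 90% of generated values collapse onto the single point 1.0, so the test data is badly skewed toward one endpoint.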

On the other hand, the pro of (2) is that it elegantly handles multiple constraints and can catch incompatible sets of checks up front. The con is the potentially more complex logic needed to implement the "statistics resolver".
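A minimal sketch of what such a resolver might look like (hypothetical names and data shapes, not pandera's actual API): it folds a list of comparison-check statistics into a single set of bounds, and raises when they are unsatisfiable.

```python
# Hypothetical "statistics resolver" sketch: aggregate (op, value) pairs
# from comparison checks into kwargs suitable for one st.floats(...) call.

def resolve_statistics(checks):
    stats = {"min_value": None, "max_value": None,
             "exclude_min": False, "exclude_max": False}
    for op, value in checks:
        if op in ("gt", "ge"):
            # keep the tightest lower bound
            if stats["min_value"] is None or value > stats["min_value"]:
                stats["min_value"] = value
                stats["exclude_min"] = (op == "gt")
        elif op in ("lt", "le"):
            # keep the tightest upper bound
            if stats["max_value"] is None or value < stats["max_value"]:
                stats["max_value"] = value
                stats["exclude_max"] = (op == "lt")
    lo, hi = stats["min_value"], stats["max_value"]
    if lo is not None and hi is not None:
        empty = lo > hi or (lo == hi and (stats["exclude_min"] or stats["exclude_max"]))
        if empty:
            raise ValueError(f"incompatible check statistics: {stats}")
    return stats

# pa.Check.gt(0) + pa.Check.le(1)  ->  one st.floats(...) call, no filters
print(resolve_statistics([("gt", 0), ("le", 1)]))
# resolve_statistics([("gt", 1), ("le", 0)]) would raise ValueError
```

The real resolver would need to cover more check types (equality, membership, string constraints), but the aggregate-then-validate shape is the core idea.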

That's why we have HypothesisWorks/hypothesis#2701, a plan to automatically rewrite strategies when simple filter functions are applied - I'd propose that we work on that together rather than have everyone implement it separately, to split the work and share the benefits.

Agreed, this feature would be amazing! Let me know if there's anything we can do on our end to help (or even contribute to the hypothesis codebase 😄)

@Zac-HD
Contributor Author

Zac-HD commented Feb 8, 2021

  1. use .map for strategy chaining:

I wouldn't recommend this, as it skews the distribution very badly - you'll tend to end up testing the same forced-to-an-endpoint value essentially every time.

introduce concept of "statistics resolvers" in the backend to aggregate multiple checks into a single strategy. This would throw an error if the resolver finds incompatible statistics.

This is the "right way to do it", though it can also get complicated - which is why I've suggested that we do it upstream 😉

Agreed, this feature [automatic filter rewriting] would be amazing! Let me know if there's anything we can do on our end to help (or even contribute to the hypothesis codebase 😄)

As it happens, there is! I've just opened HypothesisWorks/hypothesis#2853, and HypothesisWorks/hypothesis#2701 (comment) describes the next steps - if I could "just" refactor the strategies and let you write the filter methods and tests for other integers and floats strategies, that would be so helpful!

After that we can look at string/regex strategies and handling lambdas in addition to functools.partial(operator.foo, ...), or jump directly to ensuring that Pandera can take advantage of the new functionality for numeric filters 😄

@cosmicBboy cosmicBboy added this to Backlog in Release Roadmap Feb 19, 2021
@Zac-HD
Contributor Author

Zac-HD commented Feb 24, 2021

@cosmicBboy - we've just released our first filter rewriting, for st.integers().

If you change the comparison and equality checks to use e.g. functools.partial(operator.ge, min_value) instead of a custom less_than function, it'll just work. The downside is of course no keyword arguments or type annotations 😕
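A small sketch of the functools.partial + operator pattern (the variable names here are made up). One subtlety worth calling out: partial binds the *first* operand, so partial(operator.ge, v)(x) computes v >= x, i.e. "x is at most v":

```python
import functools
import operator

# partial binds the left-hand operand of the comparison:
at_most_5 = functools.partial(operator.ge, 5)   # 5 >= x, i.e. x <= 5
at_least_5 = functools.partial(operator.le, 5)  # 5 <= x, i.e. x >= 5

print(at_most_5(3))   # True:  5 >= 3
print(at_least_5(3))  # False: 5 <= 3 is False
```

Getting the operator direction backwards here silently inverts the check, so it is worth a unit test when swapping the custom comparison functions for this pattern.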

@cosmicBboy
Collaborator

Thanks, @Zac-HD! Will update the minimum hypothesis version and make these improvements.

The downside is of course no keyword arguments or type annotations

That's okay; we're currently using lambda functions for these, so we're not type-annotating anyway.

@antonl
Contributor

antonl commented May 6, 2021

Looks like @Zac-HD followed up with another release! https://github.com/HypothesisWorks/hypothesis/releases/tag/hypothesis-python-6.12.0

@Zac-HD
Contributor Author

Zac-HD commented May 7, 2021

Yep! Support for floats is probably a month or two off, and then text()-filtering-by-regex sometime after that.

Unfortunately we'll never be able to "rewrite" length filters (😢), because it's possible that len would be redefined in the outer scope and then we'd give the wrong answer. There are still some useful ideas to explore here, but they're all quite a lot of work and well in the future.
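The redefinition hazard is easy to demonstrate in plain Python: the name `len` is resolved each time the predicate runs, so rebinding it in the enclosing scope silently changes what the filter means.

```python
# Why a len-based filter can't be safely rewritten: `len` is looked up
# at call time, so code in the enclosing (here: module) scope can rebind it.
pred = lambda xs: len(xs) > 2

before = pred([1, 2, 3])  # True, using the builtin len

len = lambda xs: 0        # shadow the builtin in module scope
after = pred([1, 2, 3])   # False: the lambda now sees the rebound len

print(before, after)
```

A rewriter that assumed `len` always meant the builtin would keep reporting True after the rebinding, i.e. give the wrong answer.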


On another note, you might want to add the Framework :: Hypothesis trove classifier?

@cosmicBboy cosmicBboy changed the title Make Hypothesis strategies more efficient by reducing use of .filter() Make Hypothesis strategies more efficient by with statistics resolver and reducing use of .filter() Jul 15, 2021
@cosmicBboy cosmicBboy changed the title Make Hypothesis strategies more efficient by with statistics resolver and reducing use of .filter() Make Hypothesis strategies more efficient with statistics resolver and reducing use of .filter() Jul 15, 2021