Improve performance of unique lists with elements=sampled_from(...)
#2031
Conversation
Widening the scope would be nice, but this is still clearly a nice upgrade and I'd be happy to merge as-is!
(if you don't extend support though, could you open an enhancement issue? It's good sprint-fodder if nothing else.)
@@ -775,6 +776,26 @@ def unique_by(x):
    for i, f in enumerate(unique_by):
        if not callable(f):
            raise InvalidArgument("unique_by[%i]=%r is not a callable" % (i, f))
    # Note that lazy strategies automatically unwrap when passed to a defines_strategy
    # function.
    if isinstance(elements, SampledFromStrategy):
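For context on why special-casing sampled_from pays off: without it, unique-list generation draws with replacement and rejects duplicates, which degenerates as the requested length approaches the number of distinct elements. A minimal standalone sketch of the two approaches (these helpers are illustrative only, not Hypothesis's actual internals):

```python
import random

def unique_list_by_rejection(population, n, rng=random):
    """Naive approach: draw with replacement and reject duplicates.
    Wasted draws pile up as n approaches len(population)."""
    seen, out = set(), []
    while len(out) < n:
        x = rng.choice(population)
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def unique_list_by_sampling(population, n, rng=random):
    """Special-cased approach: when elements come from a known finite
    collection, sample without replacement directly."""
    return rng.sample(population, n)
```

Both return n distinct elements; the second never wastes a draw, which is the point of detecting SampledFromStrategy up front.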
IIRC we went with the previous design because it also handled mapped and filtered strategies with the efficient path - and unfortunately the obvious way to handle that only works if we know that the function or predicate are pure functions, which they might not be...
It's kinda tedious, but if we're adding this special handling it would be nice if it could apply to sampled_from(...).map(str).filter(bool) too 😕
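The unwrapping suggested here could, in principle, look something like the following toy sketch. The classes and the flatten_to_sampled helper are hypothetical stand-ins for Hypothesis's real strategy types, and the trick is only sound when the mapped functions and filter predicates are pure, exactly as the comment above warns:

```python
class Strategy:
    def map(self, f):
        return Mapped(self, f)

    def filter(self, p):
        return Filtered(self, p)

class SampledFrom(Strategy):
    def __init__(self, elements):
        self.elements = list(elements)

class Mapped(Strategy):
    def __init__(self, base, fn):
        self.base, self.fn = base, fn

class Filtered(Strategy):
    def __init__(self, base, pred):
        self.base, self.pred = base, pred

def flatten_to_sampled(strategy):
    """Peel map/filter layers down to a base SampledFrom, then replay the
    transforms over the element list. Returns None if the base strategy
    isn't a SampledFrom. Only valid for pure functions/predicates."""
    transforms = []
    while True:
        if isinstance(strategy, Mapped):
            transforms.append(("map", strategy.fn))
            strategy = strategy.base
        elif isinstance(strategy, Filtered):
            transforms.append(("filter", strategy.pred))
            strategy = strategy.base
        elif isinstance(strategy, SampledFrom):
            break
        else:
            return None
    transforms.reverse()  # apply in the order the user wrote them
    values = strategy.elements
    for kind, f in transforms:
        if kind == "map":
            values = [f(v) for v in values]
        else:
            values = [v for v in values if f(v)]
    return values
```

With this, sampled_from([0, 1, 2]).filter(lambda x: x > 0).map(str) would collapse to the concrete element list ['1', '2'], which could then take the efficient unique-sampling path.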
I'm not going to widen the scope, as that requires a fair bit more work and I mostly wanted to unblock #2030, and this fixes what I think is probably the most common case. I'll open some further enhancement tickets though.
Failures in #2030 demonstrated that the current approach for this is still filtering out a lot of data. This replaces our special casing with the other approach I suggested, which uses a lazy Fisher-Yates shuffle to get better behaviour for this sampling.
WIP because I expect this to fail the test suite in a bunch of places and leave some code uncovered, and I'm being lazy and letting CI find that out for me.
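A lazy Fisher-Yates shuffle in this setting means performing only the swaps needed for the elements actually drawn, recording them sparsely rather than shuffling the whole population up front. A rough standalone sketch of the idea (not the PR's actual implementation), using a dict as the sparse swap table:

```python
import random

def lazy_shuffle_sample(values, n, rng=random):
    """Draw n distinct elements from values via a partial Fisher-Yates
    shuffle. Only positions that are touched get an entry in `swaps`,
    so the cost is O(n) regardless of len(values)."""
    swaps = {}  # sparse view of the partially-shuffled list: index -> value
    out = []
    upper = len(values)
    for i in range(n):
        j = rng.randrange(i, upper)
        # The value currently at position j (defaulting to the original).
        picked = swaps.get(j, values[j])
        # Move the value at position i into slot j so it remains drawable.
        swaps[j] = swaps.get(i, values[i])
        out.append(picked)
    return out
```

Because each draw is a genuine shuffle step, no candidate is ever rejected, which is what avoids the data-filtering problem described above.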