
Our internal average_size handling is incorrect #3143

Closed
Zac-HD opened this issue Nov 9, 2021 · 3 comments · Fixed by #3160
Labels: internals (Stuff that only Hypothesis devs should ever see), performance (go faster! use less memory!)

Comments

@Zac-HD (Member) commented Nov 9, 2021

We have a nice cu.many() helper for generating collections with a minimum, average, and maximum size. To support local mutation (in generation or shrinking), we arrange this as a sequence of (should_add_element, element) pairs, with a constant probability of adding an element so that the encoding is invariant with respect to the collection index. That in turn gives us a geometric distribution over lengths - before accounting for subtree exhaustion - but anyway, it works nicely. Here's the implementation:

def __init__(self, data, min_size, max_size, average_size):
    assert 0 <= min_size <= average_size <= max_size
    self.min_size = min_size
    self.max_size = max_size
    self.data = data
    self.stopping_value = 1 - 1.0 / (1 + average_size)

The problem is that this formula for stopping_value, based on the sum of a geometric series, is valid if and only if min_size=0 and max_size=infinity! We should instead compute the stopping_value which gives us our desired average_size over the finite sum of a geometric series between maybe-nonzero min_size and finite-and-maybe-small max_size.

(for implementation reasons max_size is bounded at 4K, typically much less, but such late terms make a negligible contribution to the sum and thus probability anyway)
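To make the error concrete, here's a minimal standalone sketch (not the library code; expected_size is a made-up helper) of the expected length of the bounded process, using the min_size=10, average_size=11, max_size=4000 example from later in this thread:

def expected_size(p_continue, min_size, max_size):
    # Element number min_size + k is only drawn if k successive coin flips
    # all say "continue", so E[size] = min_size + sum_{k=1..n} p_continue**k.
    n = max_size - min_size
    return min_size + sum(p_continue**k for k in range(1, n + 1))

min_size, average_size, max_size = 10, 11, 4000
p = 1 - 1.0 / (1 + average_size)  # the current formula, which ignores min_size
print(expected_size(p, min_size, max_size))  # ~21.0, not the requested 11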

@Zac-HD added the performance (go faster! use less memory!) and internals (Stuff that only Hypothesis devs should ever see) labels on Nov 9, 2021
@Zac-HD (Member, Author) commented Nov 9, 2021

And once that's sorted out, we might revisit the case that prompted this: increasing the average density of small arrays, in master...Zac-HD:denser-small-arrays

@jebob (Contributor) commented Nov 22, 2021

I managed to derive

average_size_of_infinite_distribution = 1.0 / (1 - stopping_value) - 1
average_size = min_size + average_size_of_infinite_distribution - average_size_of_infinite_distribution * stopping_value ** (max_size - min_size)
average_size = min_size + (1.0 / (1 - stopping_value) - 1) * (1 - stopping_value ** (max_size - min_size))

which isn't generally solvable for stopping_value in closed form when max_size - min_size is large, since it becomes a high-degree polynomial. Therefore I don't think there's a nice general solution.

You can still get an improvement on the status quo by assuming max_size = inf, though:

average_size = min_size + (1.0 / (1 - stopping_value) - 1) * (1)
1 + average_size - min_size = 1.0 / (1 - stopping_value) 
1 - stopping_value = 1.0 / (1 + average_size - min_size)
stopping_value = 1 - 1.0 / (1 + average_size - min_size)  # Proposal

This significantly reduces the error in, e.g., the case where min_size=10, average_size=11, max_size=4000.
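As a quick numerical check of that example (again a sketch, with a made-up expected_size helper modelling the same process as above):

def expected_size(p, min_size, max_size):
    n = max_size - min_size
    return min_size + sum(p**k for k in range(1, n + 1))

min_size, average_size, max_size = 10, 11, 4000
p_old = 1 - 1.0 / (1 + average_size)             # current formula: 11/12
p_new = 1 - 1.0 / (1 + average_size - min_size)  # proposal: 1/2
print(expected_size(p_old, min_size, max_size))  # ~21.0
print(expected_size(p_new, min_size, max_size))  # ~11.0, as requested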

@Zac-HD (Member, Author) commented Nov 23, 2021

Very nice work! We should definitely take that as an improvement over the status quo 😁

For small max_size (<20?) we would get a noticeable difference, but happily we can just iterate a few times to get a decent approximate solution!
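One way that iteration might look, as a rough sketch (p_continue_approx is a hypothetical name, not the eventual fix in #3160): rearrange the finite-sum formula above into the fixed point p = 1 - 1 / (1 + (average_size - min_size) / (1 - p**n)) with n = max_size - min_size, start from the max_size = inf solution, and apply it a few times:

def p_continue_approx(min_size, average_size, max_size, iterations=4):
    # Assumes min_size <= average_size < max_size, so n >= 1 below.
    n = max_size - min_size
    # Start from the max_size = infinity solution proposed above...
    p = 1 - 1.0 / (1 + average_size - min_size)
    # ...then refine: each pass plugs the current estimate of the truncation
    # term p**n back into the rearranged finite-sum formula.
    for _ in range(iterations):
        p = 1 - 1.0 / (1 + (average_size - min_size) / (1 - p**n))
    return p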
