
Our internal average_size handling is incorrect #3143

Closed
Zac-HD opened this issue Nov 9, 2021 · 3 comments · Fixed by #3160
Labels: internals (Stuff that only Hypothesis devs should ever see), performance (go faster! use less memory!)

Comments

@Zac-HD (Member) commented Nov 9, 2021

We have a nice cu.many() helper for generating collections with a minimum, average, and maximum size. To support local mutation (in generation or shrinking), we arrange this as a sequence of (should_add_element, element) pairs, with a constant probability of adding an element so that the encoding is invariant with respect to the collection index. That in turn gives us a geometric distribution over lengths - before accounting for subtree exhaustion - but anyway, it works nicely. Here's the implementation:

def __init__(self, data, min_size, max_size, average_size):
    assert 0 <= min_size <= average_size <= max_size
    self.min_size = min_size
    self.max_size = max_size
    self.data = data
    self.stopping_value = 1 - 1.0 / (1 + average_size)

The problem is that this formula for stopping_value, based on the sum of a geometric series, is valid if and only if min_size=0 and max_size=infinity! We should instead compute the stopping_value which gives us our desired average_size over the finite sum of a geometric series between maybe-nonzero min_size and finite-and-maybe-small max_size.

(for implementation reasons max_size is bounded at 4K, typically much less, but such late terms make a negligible contribution to the sum and thus probability anyway)
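To make the error concrete, here's a minimal standalone sketch (not the library code; expected_size is a made-up helper) of the expected length of the bounded process, using the min_size=10, average_size=11, max_size=4000 example from later in this thread:

def expected_size(p_continue, min_size, max_size):
    # Element number min_size + k is only drawn if k successive coin flips
    # all say "continue", so E[size] = min_size + sum_{k=1..n} p_continue**k.
    n = max_size - min_size
    return min_size + sum(p_continue**k for k in range(1, n + 1))

min_size, average_size, max_size = 10, 11, 4000
p = 1 - 1.0 / (1 + average_size)  # the current formula, which ignores min_size
print(expected_size(p, min_size, max_size))  # ~21.0, not the requested 11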

@Zac-HD added the performance (go faster! use less memory!) and internals (Stuff that only Hypothesis devs should ever see) labels on Nov 9, 2021
@Zac-HD (Member, Author) commented Nov 9, 2021

And once that's sorted out, we might revisit the case that prompted this: increasing the average density of small arrays, in master...Zac-HD:denser-small-arrays

@jebob (Contributor) commented Nov 22, 2021

I managed to derive

average_size_of_infinite_distribution = 1.0 / (1 - stopping_value) - 1
average_size = min_size + average_size_of_infinite_distribution - average_size_of_infinite_distribution * stopping_value ** (max_size - min_size)
average_size = min_size + (1.0 / (1 - stopping_value) - 1) * (1 - stopping_value ** (max_size - min_size))

which isn't generally solvable for stopping_value in closed form when max_size - min_size is large, since it becomes a high-degree polynomial. Therefore I don't think there's a nice general solution.

You can still get an improvement on the status quo by assuming max_size = inf, though:

average_size = min_size + (1.0 / (1 - stopping_value) - 1) * (1)
1 + average_size - min_size = 1.0 / (1 - stopping_value) 
1 - stopping_value = 1.0 / (1 + average_size - min_size)
stopping_value = 1 - 1.0 / (1 + average_size - min_size)  # Proposal

This significantly reduces the error in, e.g., the case where min_size=10, average_size=11, max_size=4000.
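As a quick numerical check of that example (again a sketch, with a made-up expected_size helper modelling the same process as above):

def expected_size(p, min_size, max_size):
    n = max_size - min_size
    return min_size + sum(p**k for k in range(1, n + 1))

min_size, average_size, max_size = 10, 11, 4000
p_old = 1 - 1.0 / (1 + average_size)             # current formula: 11/12
p_new = 1 - 1.0 / (1 + average_size - min_size)  # proposal: 1/2
print(expected_size(p_old, min_size, max_size))  # ~21.0
print(expected_size(p_new, min_size, max_size))  # ~11.0, as requested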

@Zac-HD (Member, Author) commented Nov 23, 2021

Very nice work! We should definitely take that as an improvement over the status quo 😁

For small max_size (<20?) we would get a noticeable difference, but happily we can just iterate a few times to get a decent approximate solution!
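One way that iteration might look, as a rough sketch (p_continue_approx is a hypothetical name, not the eventual fix in #3160): rearrange the finite-sum formula above into the fixed point p = 1 - 1 / (1 + (average_size - min_size) / (1 - p**n)) with n = max_size - min_size, start from the max_size = inf solution, and apply it a few times:

def p_continue_approx(min_size, average_size, max_size, iterations=4):
    # Assumes min_size <= average_size < max_size, so n >= 1 below.
    n = max_size - min_size
    # Start from the max_size = infinity solution proposed above...
    p = 1 - 1.0 / (1 + average_size - min_size)
    # ...then refine: each pass plugs the current estimate of the truncation
    # term p**n back into the rearranged finite-sum formula.
    for _ in range(iterations):
        p = 1 - 1.0 / (1 + (average_size - min_size) / (1 - p**n))
    return p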
