Fix average_size calculation #3160

Merged: 1 commit merged into HypothesisWorks:master from denser-small-arrays on Nov 28, 2021

Conversation

@Zac-HD (Member) commented Nov 23, 2021

Closes #3143.

@Zac-HD Zac-HD added the internals Stuff that only Hypothesis devs should ever see label Nov 23, 2021
@Zac-HD Zac-HD force-pushed the denser-small-arrays branch 4 times, most recently from 220868a to 0c43f1b Compare November 24, 2021 09:26
@Zac-HD Zac-HD requested a review from Zalathar November 24, 2021 22:32
@jebob (Contributor) commented Nov 26, 2021

I suspect a gradient-descent method might run faster and be more reliable; the current approach can choke on relatively small ints (with target=10000 and max_size=10001 you can get float overflows when computing the average for large increments). I can't currently test this Python locally, but trying nasty edge cases in Excel shows it can handle targets up to 2000 in a few iterations.

# Gradient derivation:
#   average_size = (1.0 / (1 - p_continue) - 1) * (1 - p_continue ** max_size)
#   d(average)/dp = (1 - p_continue ** max_size) * d/dp((1 - p_continue) ** -1 - 1)
#                   + (1.0 / (1 - p_continue) - 1) * d/dp(1 - p_continue ** max_size)
#                 = (1 - p_continue ** max_size) * (1 - p_continue) ** -2
#                   - (1.0 / (1 - p_continue) - 1) * max_size * p_continue ** (max_size - 1)

def _calc_p_continue(desired_avg, max_size):
    # (A @staticmethod in context; written at module level here so the
    # helper functions below resolve.)
    p_continue = 1 - 1.0 / (1 + desired_avg)
    if p_continue == 0 or max_size == float("inf"):
        return p_continue
    # For small max_size, the infinite-series p_continue is a poor approximation,
    # and while we can't solve the polynomial exactly, a few rounds of iteration
    # quickly get us a good approximate solution in almost all cases
    # (sometimes exact!).
    err = desired_avg - _p_continue_to_avg(p_continue, max_size)
    for _ in range(10):
        # Should converge in <5 iterations for nearly all cases.
        if abs(err) < desired_avg * 0.001:
            # Good enough
            break
        gradient = _p_continue_to_gradient(p_continue, max_size)
        p_continue += err / gradient
        err = desired_avg - _p_continue_to_avg(p_continue, max_size)
    else:
        raise AssertionError("failed to converge within ten iterations")
    assert 0 < p_continue < 1, p_continue
    return p_continue


def _p_continue_to_avg(p_continue, max_size):
    """Return the average_size generated by this p_continue and max_size."""
    return (1.0 / (1 - p_continue) - 1) * (1 - p_continue ** max_size)


def _p_continue_to_gradient(p_continue, max_size):
    """Return the gradient (with respect to p_continue) of the average size."""
    # While the true gradient is well behaved around 1, this function is not.
    # Approximating to near-one is good enough in most cases.
    # TODO: if the true value of p_continue is above this cap we won't converge.
    p_continue = min(p_continue, 0.9999)
    return (
        (1 - p_continue ** max_size) / (1 - p_continue) ** 2
        - (1.0 / (1 - p_continue) - 1) * max_size * p_continue ** (max_size - 1)
    )

@Zac-HD Zac-HD force-pushed the denser-small-arrays branch 5 times, most recently from 7db8071 to 8b30b67 Compare November 28, 2021 11:54
Co-Authored-By: Robert Howlett <9222111+jebob@users.noreply.github.com>
@Zac-HD (Member, Author) commented Nov 28, 2021

I've replaced my previous heuristic approach with a binary search, which is simple enough that it's clear on inspection that it converges and maintains the bounds we want at each step. The constant factors might be a bit larger, but it's still fast in practice, and @lru_cache makes it faster again.

@jebob, I tried out your gradient-descent code, but couldn't get it numerically stable enough to respect strict bounds under testing. Regardless, I feel that we've co-written this fix, and have given you commit and coauthor credit accordingly.
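For readers following the thread, here is a minimal sketch of the binary-search approach described above. It is an illustration reconstructed from this discussion, not necessarily the exact merged code; `_p_continue_to_avg` includes an assumed saturation guard (consistent with the saturation test quoted later in the review), and the 0.01 tolerance is an assumed value.

from functools import lru_cache


def _p_continue_to_avg(p_continue, max_size):
    """Return the average_size generated by this p_continue and max_size."""
    if p_continue >= 1:
        return max_size  # saturates: every draw continues until max_size
    return (1.0 / (1 - p_continue) - 1) * (1 - p_continue ** max_size)


@lru_cache()
def _calc_p_continue(desired_avg, max_size):
    """Find p_continue whose average size is close to, and never above, desired_avg."""
    assert desired_avg <= max_size
    if desired_avg == max_size:
        return 1.0
    # Infinite-series estimate: exact for unbounded max_size, approximate otherwise.
    p_continue = 1 - 1.0 / (1 + desired_avg)
    if p_continue == 0 or max_size == float("inf"):
        return p_continue
    # Over the reals this estimate underestimates the average for finite
    # max_size, but float rounding can flip that; nudge it back below the
    # target so it is a valid lower bound for the bisection.
    while _p_continue_to_avg(p_continue, max_size) > desired_avg:
        p_continue -= 0.0001
    # Bisect between the lower bound and 1.0.  The candidate only ever moves
    # to midpoints whose average is <= desired_avg, so the upper bound on the
    # average is maintained at every step.
    hi = 1.0
    while desired_avg - _p_continue_to_avg(p_continue, max_size) > 0.01:
        mid = (p_continue + hi) / 2
        if _p_continue_to_avg(mid, max_size) <= desired_avg:
            p_continue = mid
        else:
            hi = mid
    assert 0 < p_continue < 1
    return p_continue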

@jebob (Contributor) commented Nov 28, 2021

@Zac-HD Thanks for the co-authorship!

Stability is essential, so I agree binary search is the way to go.

@jebob (Contributor) left a review:

LGTM other than the two comments below.

    # gets us a good approximate solution in almost all cases (sometimes exact!).
    while _p_continue_to_avg(p_continue, max_size) > desired_avg:
        # This is impossible over the reals, but *can* happen with floats.
        p_continue -= 0.0001
@jebob (Contributor):

In this case I think we can just return p_continue (or do nothing, thus skipping the zeroth iteration of the while loop)? If 1 - 1.0 / (1 + desired_avg) is close enough to the correct answer that floating-point error causes p_continue to be an overestimate, then we're close enough for practical purposes.
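To make the floating-point concern concrete, here is a small self-contained search for cases where the initial estimate overshoots (illustrative; whether and where it finds violations depends on the platform's rounding). Over the reals, the finite-max_size average at this estimate is strictly below desired_avg, so any hit is pure rounding error:

import random


def _p_continue_to_avg(p_continue, max_size):
    return (1.0 / (1 - p_continue) - 1) * (1 - p_continue ** max_size)


random.seed(0)
for _ in range(100_000):
    desired_avg = random.uniform(0.1, 1000.0)
    max_size = random.randint(1, 10_000)
    p0 = 1 - 1.0 / (1 + desired_avg)  # infinite-series initial estimate
    if _p_continue_to_avg(p0, max_size) > desired_avg:
        print(f"overestimate: desired_avg={desired_avg!r}, max_size={max_size}")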

@Zac-HD (Member, Author):

I agree that the estimate is close enough that we could just use the initial p_continue.

However, it's cheap to make this adjustment, and that allows us to test that our binary search always maintains a strict upper bound.
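A sketch of what such a test could look like (hypothetical test code: it assumes the helpers are importable from hypothesis.internal.conjecture.utils, per the cu alias in the test quoted below; the PR's actual tests may differ):

from hypothesis import given, strategies as st
from hypothesis.internal.conjecture import utils as cu


@given(
    desired_avg=st.floats(min_value=0.1, max_value=100.0),
    max_size=st.integers(min_value=1, max_value=1_000),
)
def test_calc_p_continue_respects_upper_bound(desired_avg, max_size):
    # The pre-adjustment plus the bisection invariant mean the realised
    # average must never exceed the requested average.
    target = min(desired_avg, max_size)
    p_continue = cu._calc_p_continue(target, max_size)
    assert cu._p_continue_to_avg(p_continue, max_size) <= target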



def test_p_continue_to_average_saturates():
    assert cu._p_continue_to_avg(1.1, 100) == 100
@jebob (Contributor):

Can we replace this special case with an @example on test_p_continue_to_average?

@Zac-HD (Member, Author):

We do actually have such an example, but we also need this standalone test to reliably get full coverage from our conjecture-coverage task (the other test is in nocover).
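For readers unfamiliar with the pattern, an @example-decorated property test along these lines folds the saturation case into the main test (illustrative code, not the repo's exact test; the strategy bounds are assumptions):

from hypothesis import example, given, strategies as st
from hypothesis.internal.conjecture import utils as cu


@example(p_continue=1.1, max_size=100)  # pins the saturation branch explicitly
@given(
    p_continue=st.floats(min_value=0.0, max_value=0.99),
    max_size=st.integers(min_value=1, max_value=1_000),
)
def test_p_continue_to_average(p_continue, max_size):
    average = cu._p_continue_to_avg(p_continue, max_size)
    assert 0 <= average <= max_size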

@Zac-HD (Member, Author) commented Nov 28, 2021

Thanks again for the review, and all your help with this!

@Zac-HD Zac-HD merged commit 84afe88 into HypothesisWorks:master Nov 28, 2021
@Zac-HD Zac-HD deleted the denser-small-arrays branch November 28, 2021 14:09
Closed by this pull request: Our internal average_size handling is incorrect (#3143)