Avoid rejection sampling in cu.integer_range
#2029
Closed
Rejection sampling means that there are many ways to represent any given value. Our prefix tree means we avoid generating previously used buffers. Together, they ensure that we generate an increasingly long prefix of rejected examples! Unfortunately that's bad.
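To make the "many representations" point concrete, here is a minimal sketch of rejection sampling over a stream of fixed-width draws. It is not Hypothesis's actual code: `rejection_sample` and its arguments are hypothetical names chosen for illustration, and the buffer is modelled as a plain list of integers.

```python
from typing import Iterable, List, Tuple


def rejection_sample(draws: Iterable[int], upper: int) -> Tuple[int, List[int]]:
    """Consume draws until one is <= upper.

    Returns the accepted value together with everything that was consumed,
    which stands in for the underlying buffer (hypothetical model).
    """
    consumed: List[int] = []
    for d in draws:
        consumed.append(d)
        if d <= upper:
            return d, consumed
    raise ValueError("ran out of draws before finding an in-range value")


# Two distinct buffers that both decode to the value 2.
short_value, short_buffer = rejection_sample([2], upper=2)
long_value, long_buffer = rejection_sample([3, 3, 3, 2], upper=2)
```

Because the two buffers differ, a generator that only avoids previously used buffers is free to keep producing longer rejected prefixes for a value it has already seen.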
Take for example `integers(0, 2)` - we need to generate a two-bit number, but `3` is disallowed. Status quo is that we will generate 0, 1, and 2; then continue to generate longer and longer sequences like e.g. `3, 3, 3, 2` until we hit `settings.max_examples`.
This PR means that we instead draw however many bits we need, then compress the out-of-range numbers down to fit. For example:
`integers(0, 2)` -> `0=0, 1=1, 2=2, 3=2`; and `integers(0, 5)` -> `0=0, 1=1, 2=2, 3=3, 4=4, 5=4, 6=5, 7=5`, and so on. This mapping preserves the identity for all simple inputs, is monotonic, and is chosen to be reasonably smooth without distorting the distribution too much (the most-probable outputs are at most twice as likely as the least-probable).
I admit that this isn't particularly elegant - but I think it's more important that this is literally thousands of times faster on real use-cases. Fixes #1864, fixes #1982, and fixes #2027.
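The description gives two example mappings but not the formula behind them. The sketch below is one mapping that reproduces both examples and has the stated properties (identity on small inputs, monotonic, at most a 2:1 probability ratio). It is an assumption for illustration, not necessarily what this PR implements, and `compress_to_range` is a hypothetical name.

```python
from collections import Counter


def compress_to_range(drawn: int, size: int) -> int:
    """Map a drawn integer in [0, 2**bits) onto [0, size).

    `bits` is the smallest bit width that can represent size - 1.
    Hypothetical helper: low draws map to themselves, and the remaining
    draws are folded in pairs onto the top of the range.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    bits = (size - 1).bit_length()
    total = 1 << bits
    if not 0 <= drawn < total:
        raise ValueError("drawn value does not fit in the bit width")
    # Number of outputs that keep exactly one preimage (identity region).
    single = 2 * size - total
    if drawn < single:
        return drawn
    # Every remaining output receives exactly two consecutive draws.
    return single + (drawn - single) // 2


three = [compress_to_range(v, 3) for v in range(4)]  # integers(0, 2)
six = [compress_to_range(v, 6) for v in range(8)]    # integers(0, 5)
counts = Counter(six)
```

Since every draw now decodes to a valid value, no buffer is ever rejected, at the cost of the top few outputs being up to twice as likely as the rest.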