Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid rejection sampling in cu.integer_range #2029

Closed

Conversation

Zac-HD
Copy link
Member

@Zac-HD Zac-HD commented Jul 2, 2019

Rejection sampling means that there are many ways to represent any given value. Our prefix tree means we avoid generating previously used buffers. Togther, they ensure that we generate an increasingly long prefix of rejected examples! Unfortunately that's bad.

Take for example integers(0, 2) - we need to generate a two-bit number, but 3 is disallowed. Status quo is that we will generate 0, 1, and 2; then continue to generate longer and longer sequences like e.g. 3, 3, 3, 2 until we hit settings.max_examples.

This PR means that we instead draw however many bits we need, then compress the out-of-range numbers down to fit. For example: integers(0, 2) -> 0=0, 1=1, 2=2, 3=2; and integers(0, 5) -> 0=0, 1=1, 2=2, 3=3, 4=4, 5=4, 6=5, 7=5, and so on. This mapping preserves the identity for all simple inputs, is monotonic, and chosen to be reasonably smooth without distorting the distribution too much (the most-probable outputs are at most twice as likely as the least-probable).

I admit that this isn't particularly elegant - but I think it's more important that this is literally thousands of times faster on real use-cases. Fixes #1864, fixes #1982, and fixes #2027.

@Zac-HD Zac-HD added the performance go faster! use less memory! label Jul 2, 2019
@Zac-HD Zac-HD requested a review from DRMacIver as a code owner July 2, 2019 05:50
@DRMacIver
Copy link
Member

No, sorry, I think this is a bad plan. We should fix the performance problem rather than working around it by making our data generation more complicated and worse. I'll try to find some time today to put together a proper fix.

@DRMacIver
Copy link
Member

I'm going to close this in favour of #2030, which I think is a better solution.

@DRMacIver DRMacIver closed this Jul 2, 2019
@Zac-HD Zac-HD deleted the redundant-representation branch December 8, 2019 03:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance go faster! use less memory!
Projects
None yet
2 participants