
Shootout: small, fast PRNGs #52

Open · dhardy opened this issue Nov 15, 2017 · 84 comments

@dhardy (Owner) commented Nov 15, 2017

We've already had a lot of discussion on this. Let's summarise the algorithms here.

This is about small, fast PRNGs. Speed, size and performance in test suites like PractRand and TestU01 are of interest here; cryptographic quality is not.


Edit: see @pitdicker's work in #60.

@vks commented Nov 15, 2017

I have some benchmarks:

test rand_bytes_aes           ... bench:   8,389,735 ns/iter (+/- 273,533) = 124 MB/s              
test rand_bytes_chacha        ... bench:   3,691,665 ns/iter (+/- 101,794) = 284 MB/s              
test rand_bytes_isaac         ... bench:   2,043,659 ns/iter (+/- 93,872) = 513 MB/s               
test rand_bytes_isaac64       ... bench:   1,272,982 ns/iter (+/- 79,964) = 823 MB/s               
test rand_bytes_splitmix      ... bench:     245,919 ns/iter (+/- 12,513) = 4263 MB/s              
test rand_bytes_xoroshiro     ... bench:     334,576 ns/iter (+/- 24,947) = 3134 MB/s              
test rand_bytes_xoroshirostar ... bench:     430,575 ns/iter (+/- 64,488) = 2435 MB/s              
test rand_bytes_xorshift      ... bench:   1,306,342 ns/iter (+/- 95,741) = 802 MB/s               
test rand_bytes_xorshift1024  ... bench:     364,495 ns/iter (+/- 21,676) = 2876 MB/s              
test rand_u64_aes             ... bench:   6,385,340 ns/iter (+/- 289,658) = 125 MB/s              
test rand_u64_chacha          ... bench:   2,224,053 ns/iter (+/- 130,170) = 359 MB/s              
test rand_u64_isaac           ... bench:   1,038,451 ns/iter (+/- 93,778) = 770 MB/s               
test rand_u64_isaac64         ... bench:     370,893 ns/iter (+/- 27,691) = 2156 MB/s              
test rand_u64_splitmix        ... bench:     108,328 ns/iter (+/- 6,140) = 7384 MB/s               
test rand_u64_xoroshiro       ... bench:      85,963 ns/iter (+/- 4,069) = 9306 MB/s               
test rand_u64_xoroshirostar   ... bench:      98,519 ns/iter (+/- 4,330) = 8120 MB/s               
test rand_u64_xorshift        ... bench:     212,814 ns/iter (+/- 6,618) = 3759 MB/s               
test rand_u64_xorshift1024    ... bench:     274,191 ns/iter (+/- 7,599) = 2917 MB/s

However, I don't trust them that much, because I experienced systematic differences just by reordering the benchmarks.

@dhardy (Owner, Author) commented Nov 15, 2017

Added XorShiftRng. PractRand results:

$ target/debug/examples/cat_rng xorshift | practrand stdin
RNG_test using PractRand version 0.93
RNG = RNG_stdin, seed = 0xa4893a88
test set = normal, folding = standard(unknown format)

rng=RNG_stdin, seed=0xa4893a88
length= 64 megabytes (2^26 bytes), time= 3.3 seconds
  Test Name                         Raw       Processed     Evaluation
  BRank(12):256(2)                  R= +3748  p~=  3e-1129    FAIL !!!!!!!!  
  BRank(12):384(1)                  R= +5405  p~=  3e-1628    FAIL !!!!!!!!  
  BRank(12):512(2)                  R=+11541  p~=  2e-3475    FAIL !!!!!!!!  
  BRank(12):768(1)                  R=+13672  p~=  1e-4116    FAIL !!!!!!!!  
  BRank(12):1K(1)                   R=+19183  p~=  1e-5775    FAIL !!!!!!!!  
  [Low1/8]BRank(12):128(2)          R= +1312  p~=  5.4e-396   FAIL !!!!!!!   
  [Low1/8]BRank(12):256(2)          R= +3748  p~=  3e-1129    FAIL !!!!!!!!  
  [Low1/8]BRank(12):384(1)          R= +5405  p~=  3e-1628    FAIL !!!!!!!!  
  [Low1/8]BRank(12):512(1)          R= +8161  p~=  1e-2457    FAIL !!!!!!!!  
  [Low4/32]BRank(12):256(2)         R= +3748  p~=  3e-1129    FAIL !!!!!!!!  
  [Low4/32]BRank(12):384(1)         R= +5405  p~=  3e-1628    FAIL !!!!!!!!  
  [Low4/32]BRank(12):512(1)         R= +8161  p~=  1e-2457    FAIL !!!!!!!!  
  [Low1/32]BRank(12):256(2)         R= +3748  p~=  3e-1129    FAIL !!!!!!!!  
  [Low1/32]BRank(12):384(1)         R= +5405  p~=  3e-1628    FAIL !!!!!!!!  
  ...and 125 test result(s) without anomalies

@dhardy (Owner, Author) commented Nov 15, 2017

@vks quite possibly you should call black_box on your buf after this line.
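For reference, a hypothetical benchmark skeleton showing where such a black_box call would go (the fill closure is only a stand-in for the real RNG call; this is not the actual benchmark code):

#![feature(test)]
extern crate test;
use test::{black_box, Bencher};

#[bench]
fn bench_fill(b: &mut Bencher) {
    let mut buf = [0u8; 1024];
    // Stand-in for `rng.fill_bytes(&mut buf)` in the real benchmark.
    let fill = |buf: &mut [u8]| for byte in buf.iter_mut() { *byte = 0xAA };
    b.iter(|| {
        fill(&mut buf);
        black_box(&buf); // keep the writes from being optimized away
    });
}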

But anyway this is about much more than just speed. Which in particular do you think I should test?

@vks commented Nov 15, 2017

I think the most interesting ones are xoroshiro128+ and xorshift1024*, as suggested by Vigna. I don't have an opinion on cryptographic generators.

It is important to understand what the binary rank test does when judging the failures; see http://xoroshiro.di.unimi.it/:

In all papers describing these generators, the authors (including Marsaglia itself) consider the failure of the binary-rank test irrelevant from a practical viewpoint, as the output of the generator is used as a number, not as a vector on Z/2Z.

(I did not look up Marsaglia's paper to verify this claim.)

@dhardy (Owner, Author) commented Nov 15, 2017

I adapted your crate to use rand_core and ran some tests.

Interestingly SplitMix64 does very well on PractRand. I should look at running TestU01 too.

In light of your arguments about binary rank tests, it might be interesting to try PractRand with those disabled (if it has other useful tests). But I am not convinced we should ignore them entirely.

Test output:

14:40 dhardy@localhost:~/rust/rand$ target/release/examples/cat_rng SplitMix64 | practrand stdin
RNG_test using PractRand version 0.93
RNG = RNG_stdin, seed = 0x3be4c794
test set = normal, folding = standard(unknown format)

rng=RNG_stdin, seed=0x3be4c794
length= 256 megabytes (2^28 bytes), time= 3.4 seconds
  no anomalies in 162 test result(s)

rng=RNG_stdin, seed=0x3be4c794
length= 512 megabytes (2^29 bytes), time= 7.2 seconds
  no anomalies in 171 test result(s)

rng=RNG_stdin, seed=0x3be4c794
length= 1 gigabyte (2^30 bytes), time= 14.5 seconds
  no anomalies in 183 test result(s)

rng=RNG_stdin, seed=0x3be4c794
length= 2 gigabytes (2^31 bytes), time= 28.5 seconds
  no anomalies in 194 test result(s)

rng=RNG_stdin, seed=0x3be4c794
length= 4 gigabytes (2^32 bytes), time= 55.7 seconds
  no anomalies in 203 test result(s)

rng=RNG_stdin, seed=0x3be4c794
length= 8 gigabytes (2^33 bytes), time= 111 seconds
  no anomalies in 215 test result(s)

rng=RNG_stdin, seed=0x3be4c794
length= 16 gigabytes (2^34 bytes), time= 219 seconds
  no anomalies in 226 test result(s)

rng=RNG_stdin, seed=0x3be4c794
length= 32 gigabytes (2^35 bytes), time= 434 seconds
  Test Name                         Raw       Processed     Evaluation
  DC6-9x1Bytes-1                    R=  +4.9  p =  2.8e-3   unusual          
  ...and 234 test result(s) without anomalies

^C
14:50 dhardy@localhost:~/rust/rand$ target/release/examples/cat_rng XoroShiro128 | practrand stdin
RNG_test using PractRand version 0.93
RNG = RNG_stdin, seed = 0xa6ddb56d
test set = normal, folding = standard(unknown format)

rng=RNG_stdin, seed=0xa6ddb56d
length= 256 megabytes (2^28 bytes), time= 3.4 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/32]BRank(12):384(1)         R= +1272  p~=  5.4e-384   FAIL !!!!!!!   
  [Low1/32]BRank(12):512(1)         R= +2650  p~=  9.8e-799   FAIL !!!!!!!   
  ...and 160 test result(s) without anomalies

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { repr: Os { code: 32, message: "Broken pipe" } }', /checkout/src/libcore/result.rs:860:4
note: Run with `RUST_BACKTRACE=1` for a backtrace.
14:50 dhardy@localhost:~/rust/rand$ target/release/examples/cat_rng XorShift1024 | practrand stdin
RNG_test using PractRand version 0.93
RNG = RNG_stdin, seed = 0x1575d306
test set = normal, folding = standard(unknown format)

rng=RNG_stdin, seed=0x1575d306
length= 256 megabytes (2^28 bytes), time= 3.4 seconds
  no anomalies in 162 test result(s)

rng=RNG_stdin, seed=0x1575d306
length= 512 megabytes (2^29 bytes), time= 7.2 seconds
  no anomalies in 171 test result(s)

rng=RNG_stdin, seed=0x1575d306
length= 1 gigabyte (2^30 bytes), time= 14.6 seconds
  no anomalies in 183 test result(s)

rng=RNG_stdin, seed=0x1575d306
length= 2 gigabytes (2^31 bytes), time= 28.9 seconds
  no anomalies in 194 test result(s)

rng=RNG_stdin, seed=0x1575d306
length= 4 gigabytes (2^32 bytes), time= 56.0 seconds
  no anomalies in 203 test result(s)

rng=RNG_stdin, seed=0x1575d306
length= 8 gigabytes (2^33 bytes), time= 112 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low1/8]FPF-14+6/16:cross         R=  +4.7  p =  2.7e-4   unusual          
  ...and 214 test result(s) without anomalies

rng=RNG_stdin, seed=0x1575d306
length= 16 gigabytes (2^34 bytes), time= 221 seconds
  no anomalies in 226 test result(s)

rng=RNG_stdin, seed=0x1575d306
length= 32 gigabytes (2^35 bytes), time= 436 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low4/32]BCFN(2+1,13-0,T)         R=  +7.4  p =  1.7e-3   unusual          
  [Low1/32]BRank(12):3K(1)          R=+10916  p~=  3e-3287    FAIL !!!!!!!!  
  ...and 233 test result(s) without anomalies

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { repr: Os { code: 32, message: "Broken pipe" } }', /checkout/src/libcore/result.rs:860:4
note: Run with `RUST_BACKTRACE=1` for a backtrace.

@vks commented Nov 15, 2017

Splitmix64 is recommended by Vigna for seeding xoroshiro128+ and xorshift1024*. It works with any seed (even zero), but it has a period of 2^64 which might be a bit short.
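For concreteness, here is a sketch of SplitMix64 as Vigna publishes it (constants taken from his reference C implementation); seeding xoroshiro128+ would then just take two successive next_u64 outputs:

pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        // Weyl-sequence increment, then two xorshift-multiply mixing steps.
        self.state = self.state.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}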

@pitdicker commented

I have done a lot of experiments on Xoroshiro128+, SplitMix etc.

Xoroshiro128+ is always the fastest, with a margin of up to 20%. But it has several weaknesses, and should be used with care unless you only use it to generate f32's or f64's.

I personally like truncated Xorshift*, and PCG. I have optimized and well-tested implementations here that are mostly ready. They are flawless under PractRand and TestU01.

Benchmarks from a month ago:

2017-10-15 8:30 x86_64
test gen_u32_chacha            ... bench:      10,821 ns/iter (+/- 807) = 369 MB/s
test gen_u32_isaac             ... bench:       4,192 ns/iter (+/- 4) = 954 MB/s
test gen_u32_isaac64           ... bench:       4,238 ns/iter (+/- 23) = 943 MB/s
test gen_u32_pcg               ... bench:       1,223 ns/iter (+/- 10) = 3270 MB/s
test gen_u32_pcg64             ... bench:       1,940 ns/iter (+/- 7) = 2061 MB/s
test gen_u32_xorshift          ... bench:       1,084 ns/iter (+/- 8) = 3690 MB/s
test gen_u32_xorshiftmult      ... bench:       1,127 ns/iter (+/- 3) = 3549 MB/s
test gen_u32_xorshiftmult32    ... bench:       1,121 ns/iter (+/- 2) = 3568 MB/s
test gen_u32_xorshiftmult64_32 ... bench:       1,686 ns/iter (+/- 18) = 2372 MB/s
test gen_u32_truncated_xorshift64star   ... bench:       1,720 ns/iter (+/- 7) = 2325 MB/s
test gen_u32_xoroshiro128plus           ... bench:       1,083 ns/iter (+/- 5) = 3693 MB/s
test gen_u32_xorshift128                ... bench:       1,000 ns/iter (+/- 4) = 4000 MB/s
test gen_u32_xorshift128plus            ... bench:       1,110 ns/iter (+/- 6) = 3603 MB/s

test gen_u64_chacha            ... bench:      21,445 ns/iter (+/- 95) = 373 MB/s
test gen_u64_isaac             ... bench:       8,123 ns/iter (+/- 56) = 984 MB/s
test gen_u64_isaac64           ... bench:       4,229 ns/iter (+/- 13) = 1891 MB/s
test gen_u64_pcg               ... bench:       3,544 ns/iter (+/- 19) = 2257 MB/s
test gen_u64_pcg64             ... bench:       1,942 ns/iter (+/- 10) = 4119 MB/s
test gen_u64_xorshift          ... bench:       2,639 ns/iter (+/- 5) = 3031 MB/s
test gen_u64_xorshiftmult      ... bench:       1,313 ns/iter (+/- 4) = 6092 MB/s
test gen_u64_xorshiftmult32    ... bench:       3,415 ns/iter (+/- 4) = 2342 MB/s
test gen_u64_xorshiftmult64_32 ... bench:       4,385 ns/iter (+/- 18) = 1824 MB/s
test gen_u64_truncated_xorshift64star   ... bench:       4,522 ns/iter (+/- 18) = 1769 MB/s
test gen_u64_xoroshiro128plus           ... bench:       1,083 ns/iter (+/- 4) = 7386 MB/s
test gen_u64_xorshift128                ... bench:       1,000 ns/iter (+/- 4) = 8000 MB/s
test gen_u64_xorshift128plus            ... bench:       1,111 ns/iter (+/- 3) = 7200 MB/s


2017-10-15 8:30 x86
test gen_u32_chacha            ... bench:      17,217 ns/iter (+/- 59) = 232 MB/s
test gen_u32_isaac             ... bench:       4,238 ns/iter (+/- 26) = 943 MB/s
test gen_u32_isaac64           ... bench:       5,825 ns/iter (+/- 32) = 686 MB/s
test gen_u32_pcg               ... bench:       3,034 ns/iter (+/- 4) = 1318 MB/s
test gen_u32_pcg64             ... bench:      13,564 ns/iter (+/- 162) = 294 MB/s
test gen_u32_xorshift          ... bench:       1,470 ns/iter (+/- 8) = 2721 MB/s
test gen_u32_xorshiftmult      ... bench:       2,923 ns/iter (+/- 9) = 1368 MB/s
test gen_u32_xorshiftmult32    ... bench:       1,729 ns/iter (+/- 2) = 2313 MB/s
test gen_u32_xorshiftmult64_32 ... bench:       2,392 ns/iter (+/- 3) = 1672 MB/s
test gen_u32_truncated_xorshift64star ... bench:       2,715 ns/iter (+/- 11) = 1473 MB/s
test gen_u32_xoroshiro128plus         ... bench:       1,955 ns/iter (+/- 9) = 2046 MB/s
test gen_u32_xorshift128              ... bench:       2,303 ns/iter (+/- 12) = 1736 MB/s
test gen_u32_xorshift128plus          ... bench:       2,434 ns/iter (+/- 8) = 1643 MB/s

test gen_u64_chacha            ... bench:      35,559 ns/iter (+/- 131) = 224 MB/s
test gen_u64_isaac             ... bench:       8,924 ns/iter (+/- 51) = 896 MB/s
test gen_u64_isaac64           ... bench:       5,961 ns/iter (+/- 126) = 1342 MB/s
test gen_u64_pcg               ... bench:       5,926 ns/iter (+/- 504) = 1349 MB/s
test gen_u64_pcg64             ... bench:      14,957 ns/iter (+/- 20,628) = 534 MB/s
test gen_u64_xorshift          ... bench:       4,590 ns/iter (+/- 120) = 1742 MB/s
test gen_u64_xorshiftmult      ... bench:      11,678 ns/iter (+/- 702) = 685 MB/s
test gen_u64_xorshiftmult32    ... bench:       3,983 ns/iter (+/- 49) = 2008 MB/s
test gen_u64_xorshiftmult64_32 ... bench:       5,882 ns/iter (+/- 82) = 1360 MB/s
test gen_u64_truncated_xorshift64star ... bench:       6,250 ns/iter (+/- 53) = 1280 MB/s
test gen_u64_xoroshiro128plus         ... bench:       2,436 ns/iter (+/- 3) = 3284 MB/s
test gen_u64_xorshift128              ... bench:       2,512 ns/iter (+/- 18) = 3184 MB/s
test gen_u64_xorshift128plus          ... bench:       2,932 ns/iter (+/- 11) = 2728 MB/s

@pitdicker commented

The Xorshift variants in table form:

algorithm                   native  result  u32        u64        quality    notes
xorshift128/32              32-bit  u32     3693 MB/s  3034 MB/s  bad        current implementation
xorshift64/32*_truncated    32-bit  u32     3561 MB/s  2339 MB/s  excellent  (-3.6%, -22.9%)
xorshift128/32*_truncated   32-bit  u32     3270 MB/s  2536 MB/s  excellent  (-11.4%, -16.4%)
xorshift128/64*_truncated   64-bit  u64     3125 MB/s  6309 MB/s  excellent  needs 128-bit support
xorshift128/64              64-bit  u64     4000 MB/s  8016 MB/s  bad
xorshift64*                 64-bit  u64     2415 MB/s  4828 MB/s  good
xorshift64*_truncated       64-bit  u32     2324 MB/s  1772 MB/s  excellent
xorshift128+                64-bit  u64     3603 MB/s  7207 MB/s  passable
xoroshiro128+               64-bit  u64     3690 MB/s  7380 MB/s  passable

@pitdicker commented

As @vks says, the period of SplitMix is much too small for it to be used as a general PRNG, unless we also use streams, which come with their own set of problems.

@vks commented Nov 16, 2017

I'm not sure xoroshiro's failures of the binary rank test qualify as a weakness in practice, see this discussion.

@pitdicker commented

I saw it after my post 😄.

This is from memory, but I ran Xoroshiro+ through the tests with the binary rank tests disabled. There were other tests it failed as well. I also tested it with the bits reversed, but don't remember the results.

Simply put, it doesn't mix (avalanche) its bits enough. Xorshift (or plain Xoroshiro, if that were a thing) has patterns. A simple addition reduces those patterns, but is not enough to remove them. Something like a multiply should be used instead (see the sketch below). Using a large state also helps mask those patterns.
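For concreteness, a sketch of the kind of output function meant here: multiply by an odd constant, then keep only the high bits, in the style of a truncated xorshift64* step (the constant is the multiplier Vigna uses for xorshift64*; the state update itself is omitted):

fn truncated_star_output(state: u64) -> u32 {
    // The multiply pushes patterns in the low bits up into higher bits,
    // and the truncation discards the weakest (low) half of the product.
    (state.wrapping_mul(0x2545F4914F6CDD1D) >> 32) as u32
}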

Not to say that I don't think Xoroshiro+ can have its uses. Xorshift has its use too. Especially when converted to floating point, the weaker least significant bits should just about never matter. But any user of it should be careful.

@dhardy already raised good points. Do you know the impact of the weaker bits on: (1) generating booleans, (2) the ziggurat table lookup, (3) sampling from a range, (4) our custom, high precision conversions to double? And those are just a few uses in rand. It may be useful for rand to handle weaker RNGs nicely anyway.

For general use, I think there are better PRNG's we can go with. Then we can have a simple guideline: if you need a good RNG but don't need to worry about predictability (e.g. no adversaries), use Xorshift* or PCG. If you need unpredictability, use a cryptographic RNG.

Of course performance is a thing. But I consider anything that is on average close to / better than our current Xorshift extremely fast and good enough.

@vks commented Nov 16, 2017

Xorshift (or plain Xoroshiro, if that were a thing) have patterns.

What kind of patterns? If I understand correctly, all linear generators have patterns, the question is whether they are observable.

Something like a multiply should be used instead.

Why? (Are you basing this on O'Neill's blog post?) It's not clear that this is better. I recently had an email exchange with Sebastiano Vigna about the merits of xoroshiro* compared to xoroshiro+. He did some experiments showing that xoroshiro+ has some Hamming dependencies after a lot of values; in this regard xoroshiro* is slightly better. But the lowest two bits of xoroshiro* are the same LFSR, while for xoroshiro+ the second lowest bit is a perturbed LFSR, so in that regard xoroshiro+ is slightly better.

Using a large state also helps masking those patterns.

I'm not sure which patterns you mean, but the linear dependencies don't go away by using a larger state. Large states also have trouble to recover from lots of zeros in the initial state and require more careful seeding.

Do you know the impact of the weaker bits

Well, empirically there seems to be no practical impact whatsoever. All popular RNGs (that is, the Mersenne Twister for most programming languages, and xorshift for browsers and the current rand crate) are an LFSR in all bits. AFAIK, the only known impact is on the calculation of the rank of random binary matrices. You are saying users should be careful about the linear dependencies. Why?

if you need a good RNG but don't need to worry about predictability (e.g. no adversaries), use Xorshift* or PCG.

Xorshift* has linear dependencies in the lowest two bits as well. I don't see a reason to use it over xoroshiro+, unless you are talking about xorshift1024* vs. xoroshiro128+.

Of course performance is a thing. But I consider anything that is on average close to / better than our current Xorshift extremely fast and good enough.

On the other hand, performance is the only reason to use such a generator over a cryptographic one.

@pitdicker commented

What kind of patterns? If I understand correctly, all linear generators have patterns, the question is whether they are observable.

Completely agree. And if they are observable depends on how the random numbers are used.

From the first paper by Marsaglia onwards it was noted that it is best to use Xorshift as a base generator and apply some output function, or combine it with a second generator of a different type. All the variants over the years, like XORWOW, XSadd, xorgens, RanQ1, Xorshift*, truncated Xorshift*, Xorshift+ and Xoroshiro+, differ in their choice of output function.

Xoroshiro+ just trades quality for a little bit of performance.

But the lowest two bits of xoroshiro* are the same LSFR, while for xoroshiro+ the second lowest bit is a perturbed LSFR, so in that regard xoroshiro+ is slightly better.

Ah, yes. I was imprecise. It is true that just multiplying affects the last two bits about as little as addition does. I once made a table of the avalanching effect of a single multiply. What makes a good output function is a multiply, followed by truncating the result to only the high bits.

AFAIK, the only known impact is on the calculation of the rank of random binary matrices. You are saying users should be careful about the linear dependencies. Why?

I think we have listed a couple of the problematic cases already. Can you have a look at them?

On the other hand, performance is the only reason to use such a generator over a cryptographic one.

The performance is so good already that the difference between two PRNG algorithms is not much more than the difference caused by using one more register, or having one instruction that cannot run in parallel with another on the processor. There are only 8 or 9 instructions used per new random number (and a couple of mov's), several of which run in parallel.

At that point it seems to me that almost any use of the random number will take quite a bit longer than generating one. Of course there may be uses where that last cycle makes the difference, and quality is less important. Then Xoroshiro128+ is great!

Note: I will be the last to say good performance is not a good quality. Optimising RNG's is fun ;-)

@pitdicker commented

@vks I (we) are guilty of derailing this issue 😄. If you want to reply, could you open a new issue or something and mention my name?

@vks commented Nov 16, 2017

Is this discussion off-topic? I thought the shootout was about performance and quality of the generators.

I think we have listed a couple of the problematic cases already. Can you have a look at them?

I did, and my point was that basically everyone uses RNGs in these cases where all bits have linear dependencies. I'm not aware of any problems with this. You keep saying linear dependencies are problematic, but you never say why.

@dhardy (Owner, Author) commented Nov 16, 2017

No, you're not really off-topic, but I do fear the thread will be long and not so easy to read later!

Great that you already did significant research on this @pitdicker; I didn't realise you'd done so much. Have you come across v3b? I noticed it mentioned in a comment thread earlier; can't find a source for it however.

It sounds like there may be reason to include multiple "small fast PRNGs" in rand, but I don't think anyone wants it to become a huge library.

@pitdicker commented Nov 16, 2017

You keep saying linear dependencies are problematic, but you never say why.

Okay, let me try to make a list of some of the problems in rand.

1) generating booleans

Current code for generating booleans:

impl Distribution<bool> for Uniform {
    #[inline]
    fn sample<R: Rng+?Sized>(&self, rng: &mut R) -> bool {
        rng.next_u32() & 1 == 1
    }
}

This should change to compare against some bit other than the last one, so that the generated bools really have a 50% chance.
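A minimal sketch of the suggested change, using a high bit instead of the lowest one (the specific bit chosen here is illustrative only):

impl Distribution<bool> for Uniform {
    #[inline]
    fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> bool {
        // Use one of the stronger high bits rather than the weak lowest bit.
        (rng.next_u32() & (1 << 31)) != 0
    }
}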

2) the ziggurat table lookup

The 11 least significant bits of the random number are used to pick the 'layer' of the ziggurat, and the 53 most significant bits as fraction for a f64. The f64 is then multiplied by some value that belongs to the layer.

If the few bits used for the layer are not actually random, this has relatively large effects on the shape of the distribution.

This has a mostly easy fix: only 8 bits are needed for the layer, so don't use the 8 least significant bits but, say, bits 3..10 instead.
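For illustration, a hedged sketch of that bit layout (the names are hypothetical, not the actual ziggurat code):

fn split_bits(x: u64) -> (usize, f64) {
    // Layer index from bits 3..10, avoiding the weakest low bits.
    let layer = ((x >> 3) & 0xFF) as usize;
    // Fraction from the 53 most significant bits, mapped into [0, 1).
    let fraction = (x >> 11) as f64 * (1.0 / (1u64 << 53) as f64);
    (layer, fraction)
}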

3) sampling from a range

In distributions::Range we restrict the generated random numbers to a 'zone', and collapse that zone to the actual range using a modulus.

If we assume an RNG with weak least significant bits, those weaker bits will remain weak also in the much smaller range.

A solution would be to use division instead of a modulus. At least I remember reading that that works.

Without changes we can't really keep the promise of a uniformly distributed range.
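A minimal sketch of the zone-plus-modulus idea described above (not the actual distributions::Range code), showing where the weak low bits end up:

fn sample_range(mut next_u32: impl FnMut() -> u32, low: u32, range: u32) -> u32 {
    // Largest multiple of `range` that fits in a u32; values at or above it
    // are rejected so that the modulus below is unbiased.
    let zone = range * (u32::MAX / range);
    loop {
        let x = next_u32();
        if x < zone {
            // The modulus keeps mostly the low-order bits of `x`, so weak low
            // bits in the generator remain weak in the sampled value.
            return low + x % range;
        }
    }
}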

4) high precision conversions to double

It turns out this part is ok. We use the 53 least significant bits for the fraction, and the remaining 11 most significant bits for the exponent. This means only the last bits of the fraction are weak.

If it were reversed, some exponents would occur more often than others, and the effect of that would be huge for floats.

As you see, it takes some nontrivial thinking to see if your algorithm works well with a generator with some weaknesses. Now we could adapt this code to make using it with Xoroshiro+ safe. That seems like a good idea to me anyway.

But what to do when there is some other RNG that happens to have weaker most significant bits? Okay, I have not heard of such. But how much should generic code cater to the weaknesses of one RNG?

@dhardy (Owner, Author) commented Nov 16, 2017

Do linear dependencies in the low bits imply bias (P(1) ≠ 1/2)? I assumed not(?). Do they imply high predictability within these two bits over a very short sequence? As @vks says the weakness may not be so significant.

@pitdicker commented

v3b comes from here: http://cipherdev.org/.
src

I know nothing about it though.

@pitdicker commented

Maybe of interest. An evaluation of Xoroshiro by the author of PractRand, and since I last looked at some replies from Sebastiano Vigna.

@pitdicker commented

For fun I wrote a Xoroshiro64+. It is not meant to be used; I had to calculate new constants for a, b and c. Creating a smaller version is useful because it is easier to analyse with something like PractRand.

// `w` here is `std::num::Wrapping`, used for wrapping arithmetic below.
use std::num::Wrapping as w;

#[derive(Clone, Debug)]
pub struct Xoroshiro64PlusRng {
    s0: u32,
    s1: u32,
}

impl SeedFromRng for Xoroshiro64PlusRng {
    fn from_rng<R: Rng>(mut other: R) -> Result<Self, Error> {
        let mut tuple: (u32, u32);
        loop {
            tuple = (other.next_u32(), other.next_u32());
            if tuple != (0, 0) {
                break;
            }
        }
        let (s0, s1) = tuple;
        Ok(Xoroshiro64PlusRng{ s0: s0, s1: s1 })
    }
}

impl Rng for Xoroshiro64PlusRng {
    #[inline]
    fn next_u32(&mut self) -> u32 {
        let s0 = w(self.s0);
        let mut s1 = w(self.s1);
        let result = (s0 + s1).0;
        
        s1 ^= s0;
        self.s0 = (w(s0.0.rotate_left(19)) ^ s1 ^ (s1 << 13)).0; // a, b
        self.s1 = s1.0.rotate_left(10); // c

        result
    }
    
    #[inline]
    fn next_u64(&mut self) -> u64 {
        ::rand_core::impls::next_u64_via_u32(self)
    }

    #[cfg(feature = "i128_support")]
    fn next_u128(&mut self) -> u128 {
        ::rand_core::impls::next_u128_via_u64(self)
    }

    fn fill_bytes(&mut self, dest: &mut [u8]) {
        ::rand_core::impls::fill_bytes_via_u32(self, dest)
    }

    fn try_fill(&mut self, dest: &mut [u8]) -> Result<(), Error> {
        Ok(self.fill_bytes(dest))
    }
}

Results (only the last results; all the other failures before are similar):

rng=RNG_stdin32, seed=0x6fb2c44d
length= 32 gigabytes (2^35 bytes), time= 281 seconds
  Test Name                         Raw       Processed     Evaluation
  DC6-9x1Bytes-1                    R= +18.5  p =  1.9e-12    FAIL           
  BRank(12):3K(1)                   R=+583.3  p~=  1.2e-176   FAIL !!!!!!    
  BRank(12):4K(2)                   R= +1799  p~=  1.2e-542   FAIL !!!!!!!   
  BRank(12):6K(1)                   R= +2650  p~=  9.8e-799   FAIL !!!!!!!   
  BRank(12):8K(1)                   R= +4028  p~=  1e-1213    FAIL !!!!!!!!  
  [Low8/32]BRank(12):768(1)         R=+583.3  p~=  1.2e-176   FAIL !!!!!!    
  [Low8/32]BRank(12):1K(4)          R= +2544  p~=  4e-1354    FAIL !!!!!!!!  
  [Low8/32]BRank(12):1536(1)        R= +2650  p~=  9.8e-799   FAIL !!!!!!!   
  [Low8/32]BRank(12):2K(2)          R= +5696  p~=  1e-1715    FAIL !!!!!!!!  
  [Low8/32]BRank(12):3K(1)          R= +6783  p~=  5e-2043    FAIL !!!!!!!!  
  [Low8/32]BRank(12):4K(2)          R=+13490  p~=  8e-4062    FAIL !!!!!!!!  
  [Low8/32]BRank(12):6K(1)          R=+15050  p~=  2e-4531    FAIL !!!!!!!!  
  [Low1/32]BRank(12):128(8)         R= +3598  p~=  2e-2310    FAIL !!!!!!!!  
  [Low1/32]BRank(12):256(4)         R= +8055  p~=  4e-4285    FAIL !!!!!!!!  
  [Low1/32]BRank(12):384(1)         R= +6783  p~=  5e-2043    FAIL !!!!!!!!  
  [Low1/32]BRank(12):512(4)         R=+19077  p~= 0           FAIL !!!!!!!!  
  [Low1/32]BRank(12):768(1)         R=+15050  p~=  2e-4531    FAIL !!!!!!!!  
  [Low1/32]BRank(12):1K(2)          R=+29077  p~=  4e-8754    FAIL !!!!!!!!  
  [Low1/32]BRank(12):1536(1)        R=+31582  p~=  2e-9508    FAIL !!!!!!!!  
  [Low1/32]BRank(12):2K(2)          R=+60252  p~= 0           FAIL !!!!!!!!  
  [Low1/32]BRank(12):3K(1)          R=+64648  p~= 0           FAIL !!!!!!!!  
  ...and 159 test result(s) without anomalies

@dhardy (Owner, Author) commented Nov 16, 2017

Another generator and test suite that may be worth looking into: http://gjrand.sourceforge.net/boast.html

@nixpulvis commented Nov 16, 2017

I like the idea of having multiple PRNGs, but personally would have an extension crate for them. I feel like Rust should have a canonical implementation for getting a random number.

P.S. Sorry for continuing to recommend more crates, maybe I just love making crates :P.

@dhardy (Owner, Author) commented Nov 17, 2017

As I said in the first post, that discussion is for another issue!

It may be that rand ends up with multiple similar RNGs anyway simply because ARng is included, then BRng is added because it's better, but ARng is not removed because some users are dependent on it (including on it reproducing results exactly).

@vks commented Nov 17, 2017

@pitdicker I think you are confused about the properties of LFSRs.

This should change to compare against some other bit than the last one to generate bools that really have a 50% chance.

What makes you think this is not the case for LFSRs? If you look at the distribution, it is fine. (This is not a very hard test; a generator outputting true and false alternatingly would pass it.)

If the few bits used for the layer are not actually random, this has relatively large effects on the shape of the distribution.

What do you mean by "not actually random"? Do you mean the bits don't have a uniform distribution? This is not the case for LFSRs. In fact, they are usually designed to be equidistributed. If there were a large effect on the distribution of the sampled values, no one would be using LFSR-based generators like the Mersenne Twister or xorshift.

If we assume an RNG with weak least significant bits, those weaker bits will remain weak also in the much smaller range.

Yes, but I don't think this "weakness" is a problem in practice.

@pitdicker commented

Let me emphasise that I have nothing against Xoroshiro128+. It is one of the fastest small PRNGs currently in use, and it does not perform terribly on most statistical tests.

But there are other RNG's that have up to 80% of the performance of Xoroshiro128+, and that don't have detectable statistical weaknesses. I think the question should be:

  • Do we want to go for the absolute best performance, and maybe be weaker?
  • Do we want to have very good performance and be usable without question for anything non-cryptographic?

@vks You raise a good point by questioning how much statistical weaknesses matter for real applications. That depends on the algorithm it is used in, and requires evaluation for every case. I can't answer it. Do we want every user to ask himself that question?

Again I want to note that a 20~25% performance difference sounds like a lot, but it is a difference of only one or two clock cycles.

Do linear dependencies in the low bits imply bias (P(1) ≠ 1/2)? I assumed not

I think you are confused about the properties of LFSRs

You are right, my bad. It follows patterns, but the chance for one bit to be 1 or 0 should remain about 50%.

@pitdicker commented

@dhardy Do you plan on collecting all small RNG's mentioned here in one repository/crate, for easy comparison? That would be an interesting collection. Or just mention them here?

@dhardy (Owner, Author) commented Nov 17, 2017

I'm considering that, but don't know if I will get around to it or not. It's not a priority anyway.

@vks commented Nov 17, 2017

You raise a good point by questioning how much statistical weaknesses matter for real applications. That depends on the algorithm it is used in, and requires evaluation for every case. I can't answer it. Do we want every user to ask himself that question?

It also depends on the statistical weakness in question. I looked up the paper where the binary rank test was introduced (Marsaglia, Tsay 1985):

Many Monte Carlo studies, particularly in combinatorics and graph theory, call for random incidence matrices, elements 0 or 1 to represent the absence or presence of some property. It is natural to let the rows of such a random matrix be formed by successive computer words, or portions of words, produced by a random number generator.

Shift-register or F(r, s, ⊕) generators are not suitable for such use. In order that the sequence of 1 x n binary vectors β, βT, βT², ... have long period, it is necessary that the first n vectors in the sequence be linearly independent. Thus a binary matrix with rows formed by n or fewer vectors produced by a shift-register generator will have rank m with probability (1 - 2^(-n))(1 - 2^(1-n))...(1 - 2^(m+1-n)), about 0.30 when m=n and n≥10 or so.

[...]

If the rows of the m x n binary matrix are successive computer words, or portions thereof, produced by a random number generator, then the rank of the matrix should have the distribution given by the probabilities above. Shift-register generators will fail such tests, as will F(r, s, ⊕) generators. Congruential and F(r, s, ±) generators usually pass.

So the lowest two bits in xoroshiro might give you some problems when generating random incidence matrices. I don't think this is a common use case, as those linear dependencies are the status quo of the currently used RNGs. It seems the authors of xoroshiro thought it was worth the tradeoff (xoroshiro website):

With more operations it could be possible to completely eliminate the linear artifacts, but I consider this a waste of time: as we've seen, some of the most used generators around have linear dependencies in every bit, and this is not considered a problem.

So the question is whether we want to improve on the status quo at the cost of some performance. In my opinion, it is not necessary, because it does not seem that users expect this property from an RNG.

Should we write something like an RFC? All this binary-rank discussion is a bit scattered at the moment...

@Lokathor commented Dec 1, 2017

Yeah, exposing inc as a publicly editable field is a really bad plan. That's why I don't do it.

Ultimately, we have to consider what the MultiStreamRng trait looks like. Or whatever other dumb name we give the trait for generators where you can pick a stream.

@Ichoran commented Dec 1, 2017

@Lokathor - I just looked at your implementation, and while you don't expose inc directly, you only do inc | 1 in new, and you also expose the existing value of inc via a function. So it doesn't take very much for someone to decide to generate a family of streams and accidentally end up with half of them being duplicates.

This is easy to fix, though, just by doing a little bit-mixing on the input.

@pitdicker commented

@vks

And don't you have the same problem with streams? You can't just pick them randomly because of the birthday paradox.

I thought so too, but it is not as bad as it seems. A quote from myself:

Let's take 2^48 as an upper limit for the expected number of used results. A period of only 2^64 is just about enough to give a chance of 1 in 2000 for one new RNG to end up within the stream of one other RNG. If the RNG has 2^63 possible streams, like PCG, 2^27 initializations are possible before a new stream has a chance of 1 in 1000 of being the same as an existing one.

Combined, this means it takes 2^27 initializations to get a chance of about 1 in a million that part of a window of 2^48 results is reused. Seems good enough to me.

I really tried to get a scheme using Xorshift jumps to work. In our conversations I may sound negative, but I like the Xorshift variants. I have studied Vigna's papers in particular several times, and played with the code supporting them.

I wrote a jumping function, and tried calculating a custom jump polynomial. A fixed jump takes as many rounds of the RNG as it has state bits. Together with the bookkeeping, this makes a jump as slow as requesting new bytes from OsRng.

I also experimented with variable jumps. For now I calculated the jump polynomial by hand in Excel :-(. A variable jump is at least several times slower than a fixed jump.

Another idea was to use variable jumps that are multiples of 2^50. Picking them at random gives a similar chance of duplicate streams as PCG streams. But this improves on the birthday problem only a little, while taking much more time.

In the end, using jumps to make the birthday problem less of an issue when initializing RNGs did not really seem to work out, unless you have a single 'mother' RNG that is jumped every time a new 'child' RNG is split off. And even then it is slow.

@Lokathor commented Dec 1, 2017

@Ichoran I admit that my code is not fully idiot proof because it's mostly intended for only me to ever use it, but I already also provide split and split_many as a way to guide users in the correct direction in terms of stream selection. I also document over and over that the inc value passed must be odd, and if it's not it gets bumped up.

One thing though: MultiStreamRng should have stream count and stream selection specified by u64, not usize. The local machine's word size doesn't have any effect on how many streams your generator will support, so best not to fool people into thinking it's somehow related.

@pitdicker I forget my exact birthday-problem formula, but your math seems wrong because its logic is wrong. The period of the PCG used has absolutely nothing to do with stream selection overlap. It's not the case that inc shifts you "forward" by some amount within a single sequence of numbers that the generator as a whole always uses; each inc value produces a fully distinct sequence of numbers. "State 0 Stream 3" is not the same as "State 0+2steps Stream 1", so you can't compare them for your birthday calculation.

@pitdicker commented Dec 1, 2017

@Lokathor
That is just a very short conclusion I pasted there. But I calculated the chance of ending up in a stream that was already used before. That result gets multiplied by the chance of ending up in the same 'window' of used results from the period of that stream.

@Lokathor commented Dec 1, 2017

Ah, my mistake then.

@vks commented Dec 1, 2017

@Lokathor

If the RNG has 2^63 possible streams, like PCG, 2^27 initializations are possible before a new stream has a chance of 1 in 1000 of being the same as an existing one.

I'm getting this result as well with the approximate solution of the birthday problem:

P(n, d) = 1 - exp(-n*(n-1)/(2*d))
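As a quick sanity check (my own sketch, not code from the thread), plugging n = 2^27 initializations and d = 2^63 streams into that formula gives roughly 1 in 1000:

fn main() {
    // P(n, d) = 1 - exp(-n*(n-1)/(2*d))
    let n = 2f64.powi(27);
    let d = 2f64.powi(63);
    let p = 1.0 - (-n * (n - 1.0) / (2.0 * d)).exp();
    println!("collision probability ~ {:e}", p); // about 9.8e-4, i.e. ~1 in 1000
}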

@pitdicker commented Dec 1, 2017

@Lokathor

Defaulting to the full next_u64 process and then throwing away half the bits is probably a bad choice if we can make next_u32 faster by using an alternate permutation. The 32-bit stream of an Rng already isn't going to match the 64-bit stream, so we might as well make it go as quick as we can.

I just tried this out for the XSL RR 128/64 (MCG) variant. The advantage of a custom permutation is that the truncation to u32 can happen earlier. Because 64-bit operations are about as fast as 32-bit ones on x86_64, it did not change the benchmarks at all... On x86 the speed was already abysmal, and it remained so.

And for the 64/32 variants there is not much we can do creatively with the output functions, right?

@pitdicker commented Dec 2, 2017

Wow, the extension method for PCG is complex! And the C++ template stuff and lack of comments don't help either.

I tried implementing pcg32_k2_fast. This is the PCG XSH RR 64/32 variant, with an MCG as base generator and an extension array of 1 word. From its description: "pcg32_k2_fast occupies the same space as pcg64, and can be called twice to generate 64 bits, but does not require 128-bit math; on 32-bit systems, it's faster than pcg64 as well."

That claim does not seem true though, as it is implemented with an extension array of 32 32-bit words. The file check-pcg32_k2_fast.out says it has a period of 2^2112, and a size of 264 bytes. But that also seems strange, because it assumes 64-bit words in the extension array, while the number of bits should be the same as the number of bits the RNG outputs, 32 in this case.

The extension mechanism comes with two choices: we can pick a size, and whether we want k-dimensional equidistribution (kdd). It is best if the size is a power of two; this makes it easier to recognise the point at which the extension table should be updated.

To generate a new random number, the output of the base generator (PCG XSH RR 64/32 in my case) is xored with a randomly picked value from the extension array.

Which function to use to pick a value from the extension array depends on whether we want kdd. The PCG paper explains:

The selector function can be an arbitrary function, but for simplicity let us assume that k = 2^x and that selection merely drops bits to choose x bits from the state. Two obvious choices for choosing x bits are taking the high-order bits and taking the low-order bits. If we wish to perform party tricks like the ones we discussed in Section 4.3.4, it makes sense to use the low-order bits because they will access all the elements of the extension array before any elements repeat (due to the property of LCGs that the low-order l bits have period 2^l). Conversely, if we use the high-order bits, the extension array will be accessed in a more unpredictable pattern.

Sometimes the values in the extension table need to be updated. PCG chooses the following scheme:

  • A 'tock' happens every time the base RNG completes a full period; then the extension array is advanced. If the base RNG has a state of 64 bits or more, it is safely assumed this will never happen (and it is compiled out).
  • A 'tick' happens if the number of words in the extension array is less than the number of bits in the base RNG state. All values in the extension array get advanced once they are all consumed (in the kdd case; otherwise once they are all consumed on average).
    Edit: All values in the extension array get advanced once every 2^(extension array size) rounds.

Every value in the extension array is its own little PCG-RXS-M-XS RNG. The process to update a value is complex, slow, and in my opinion ugly. First the inverse of the RXS-M-XS output function is applied, which involves using a recursive un-xorshift function twice and multiplying by the modular inverse of the RXS-M-XS multiplier. Then the state recovered that way is advanced as if it were an LCG. Next the RXS-M-XS output function is applied to get the new value for the extension array. Repeat until all values are updated.

In about 50% (?) of the cases a value is also advanced a second time, to break the different array values out of lockstep, at least if the base generator is an MCG.

I did not finish my implementation of PCG with an extension array. It certainly does not fit in the 'small, fast RNG' category. I suppose the PCG EXT variants only look so fast in the benchmarks if the whole table update part does not happen.

@Lokathor commented Dec 2, 2017

Yeah, the extension for k-dimensional equidistribution is tricky. The Xoroshiro-style extension of cycling through array positions doesn't give k-dimensional equidistribution, but it is dead simple. Unfortunately, I'm pretty sure (I think?) that it doesn't grow the period very quickly: 2^n state slots in an array gives (again, I think) +n to your period (e.g. 2^64 becomes 2^(64+n)).

I'm not sure what a good answer is, but I'll try to keep thinking about it in the next few days.

@vks commented Dec 2, 2017

@Lokathor I'm pretty sure the period is as large as it can be, since there is only one cycle for these RNGs.

@Lokathor commented Dec 3, 2017

False. Please read the PCG paper.

@Ichoran commented Dec 4, 2017

It really is worth reading the PCG paper. It's amazingly clear and approachable for what often seems like an arcane and difficult topic.

With regard to 2^n state slots, yes, you end up repeating yourself after 2^(m+n) steps, where 2^m is the period of the underlying generator; the proof is trivial, since a given slot is advanced once every 2^n steps, and so the first slot will be advanced back to its starting point--2^m advances for itself--after 2^(m+n) steps. Since each slot gets the same number of advances, that reasoning applies to every slot.

So you need a more complex scheme, some of which have self-similarity problems. So I'd agree with @pitdicker that it isn't really a small fast RNG any more. There aren't many instructions executed per clock cycle, but the logic for advancing is somewhat complex, and not completely trivial mathematically. (I'd still characterize it as straightforward, but there are plenty of opportunities for implementation errors.)

Anyway, is there a compelling reason not to just pick one of the non-extension schemes and have that be the default? Having fast yet decent-quality random numbers (even if on some architectures it's not as fast as others) seems like a sizable improvement over the status quo; and you can always leave in the existing implementations for people who have reason to prefer the old algorithm.

The nice thing about the PCG family is that not only are the algorithms close to as fast as they can be, there's also a theoretical framework that helps reassure us that it's unlikely that a really problematic non-random structure is lurking in there somewhere that just doesn't happen to be tested with the typical tests. This is a great reassurance for a standard library to have. (Note: we only have that reassurance for a single stream, not for comparison between multiple streams.)

@Lokathor
Copy link

Lokathor commented Dec 4, 2017

Well, the only problem is that the default PCG you want kind of depends on the output you want. If you want mostly u32 values then 64/32 will probably serve you better than 128/64, simply because it takes less space and it's somewhere between slightly faster and much faster depending on the machine (it's not unreasonable to think that Rust will regularly be run on 32-bit devices as well).

We could provide both and then explain why you'd want one or the other. Later on we might even be able to provide a pcg extras crate with macros that build a PCG type for you on the fly, complete with Rng impl and such. Fancy stuff can come later once we pick a good default.

@dhardy (Owner, Author) commented Dec 4, 2017

I think 32-bit x86 is pretty dead by now, but 32-bit ARM is still common, so there is some value. Note that the default generator does not need to be the same on all platforms; however, I don't think we can switch the algorithm depending on whether more u32 or u64 output is requested.

@Lokathor commented Dec 4, 2017

The generator stepping and permuting the LCG output are separate phases of the process. As long as the generator stepping is consistent for both modes, you can use a different permutation for u32 and u64 output.

And yeah, I own a 32-bit ARM device that I use often enough, the Raspberry Pi board series. I'm sure that plenty enough other single-board computers are also 32-bit ARM devices and that people want to be able to use Rust on them.

@pitdicker commented Dec 5, 2017

For a quick summary:
We are looking for an RNG that can generate u64's of good quality reasonably quickly on a 32-bit platform.

An RNG that outputs u32's is not great, because then producing one u64 means combining two outputs. This is more than twice as slow, and also reduces the period. One RNG that can generate u64's directly with good statistical quality is the 128-bit variant of PCG. Another is the 64-bit Xorshift/Xoroshiro with a widening multiply to 128 bits as an output function. Both are great on x86_64, but very slow on x86 because both need 128-bit multiplies which are not available and need to be emulated.
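For illustration, a hedged sketch of that widening-multiply output function, which is exactly where 32-bit x86 suffers (the multiplier parameter is a placeholder, not a recommended constant):

fn high_mul_output(state: u64, multiplier: u64) -> u64 {
    // Widen to 128 bits, multiply, and keep the high 64 bits.
    // On x86_64 this is a single MUL instruction; on 32-bit x86 the
    // 64x64 -> 128 multiply has to be emulated.
    (((state as u128) * (multiplier as u128)) >> 64) as u64
}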

What we are trying here is:

  • PCG with a 64-bit LCG base generator
  • output function XSH RR 64/32 for next_u32
  • output function RXS M XS 64 for next_u64

As the PCG paper notes, the problem with the RXS M XS variants is that every random number appears exactly once over a period of 2^64. It is relatively quick to detect that the results are not truly random, because there are no duplicates. Of course a PRNG is never truly random, but it should appear so.

The question is: does the extension mechanism of PCG not only enlarge the period, but also fix this problem with RXS M XS?
That is only true if the extension array is updated much more frequently than once per period of the base RNG. Otherwise there will still not be any duplicates during the very large period of the RNG. If the extension array is small it works, because the array gets updated frequently. But the PCG extension mechanism is not really made for small extension arrays like EXT2, and is very slow when it has to update frequently (at least from what I understand of it at the moment; note I edited #52 (comment)).

I think the problem is simple: a requirement to get the proper number of doubles according to the generalized birthday problem is that at least 128 bits of state need to get updated frequently.

It seems to me a simple solution could be good enough: xor the output of PCG RXS M XS 64 with a 64-bit counter. And if we use a Weyl sequence instead of a counter, we can maybe even get away with using MCG as a base generator (and if we want we can still have streams). So just about no slowdown 😄. I think this gives the proper distribution, but has the consequence that some results will not appear at all...

Something like this:

fn next_u64(&mut self) -> u64 {
    // MCG step
    self.m = self.m.wrapping_mul(MULTIPLIER);
    // Weyl sequence step
    self.w = self.w.wrapping_add(INCREMENT);
    let state = self.m ^ self.w;
    output_rxs_m_xs(state)
}

It will take a few days before I can test this though. It should also be possible to test the distribution of the results with a 32-bit variant, which would need only 4 GB of memory.

@Ichoran commented Dec 5, 2017

I'm not sure the Weyl sequence adds anything beyond a simple incrementer, given the mixing afterwards (assuming we use the PCG mixers).

MCG is a bit risky given that it's degenerate when self.m is zero.

@Lokathor commented Dec 5, 2017

The question is: does the extension mechanism of PCG not only enlarge the period, but also fix this problem with RXS M XS?

It should. It is my understanding that the 64/64 permutation has problems with each output appearing exactly once precisely because the period is too small. If you used it with a larger period, it would be fine. The reason that the 128/64 scheme works fine with 64-bit output is because the period is 2^128, not because the permutation is magically better on its own.

I think the problem is simple: a requirement to get the proper number of doubles according to the generalized birthday problem is that at least 128 bits of state need to get updated frequently.

This sums it up nicely.

I think this gives the proper distribution, but has the consequence that some results will not appear at all

That sounds like a very improper solution! I'd be upset to use a generator where some results can't possibly happen, even despite the fact that any particular result only has a 1/(2^64) chance to begin with.

Proposed alternate solution: we could wait a release cycle or two for 128-bit integer support in Rust to become stable (assuming that it lands in the next cycle or two?), write the 128/64 PCG (which will have great 64-bit output), and then just accept that it will run very slowly on a 32-bit machine and say so in the docs.

The reason that you'd use 128/64 is because you want to focus on 64-bit output, and if you're doing something that needs 64-bits at a time but running it on a 32-bit machine I hardly know what you're doing to begin with. That's just goofy. People don't normally think about the 32-bit/64-bit jump at all, but PRNGs are one of the things where it was a big deal and it continues to be a big deal and you do have to think about it. That's not something that we can fix ourselves because it's just part of how the math and hardware works out.

@pitdicker commented Dec 6, 2017

Found time to do some testing already. Code:

    fn next_u64(&mut self) -> u64 {
        // MCG
        self.m = self.m.wrapping_mul(6364136223846793005);
        // Weyl sequence
        self.w = self.w.wrapping_add(1442695040888963407);
        let mut state = self.m ^ self.w;

        // output function RXS M XS:
        // random xorshift, mcg multiply, fixed xorshift
        const BITS: u64 = 64;
        const OP_BITS: u64 = 5; // log2(BITS)
        const MASK: u64 = BITS - 1;

        let rshift = (state >> (BITS - OP_BITS)) & MASK;
        state ^= state >> (OP_BITS + rshift);
        state = state.wrapping_mul(6364136223846793005);
        state ^ (state >> ((2 * BITS + 2) / 3))
    }

    fn next_u32(&mut self) -> u32 {
        self.m = self.m.wrapping_mul(6364136223846793005);
        self.w = self.w.wrapping_add(1442695040888963407);
        let state = self.m ^ self.w;

        // output function XSH RR: xorshift high (bits), followed by a random rotate
        const IN_BITS: u32 = 64;
        const OUT_BITS: u32 = 32;
        const OP_BITS: u32 = 5; // log2(OUT_BITS)

        const ROTATE: u32 = IN_BITS - OP_BITS; // 59
        const XSHIFT: u32 = (OUT_BITS + OP_BITS) / 2; // 18
        const SPARE: u32 = IN_BITS - OUT_BITS - OP_BITS; // 27

        let xsh = (((state >> XSHIFT) ^ state) >> SPARE) as u32;
        xsh.rotate_right((state >> ROTATE) as u32)
    }

I was wrong when estimating the period, because the MCG is off by 1. This helps a lot:
Period of the MCG: 2^62 - 1
Period of the Weyl sequence: 2^64
Combined period: least common multiple = 2^64 * (2^62 - 1) ≈ 2^126

Edit: I really have to learn more about MCGs, choosing good multipliers and how they relate to the period. The period of this combination is 2^64.

Performance is not bad, but not as good as I hoped: about 15~25% better than combining two outputs from PCG XSH RR 64/32.
Benchmarks with Xorshift128/32 (current RNG in rand) as a baseline:

x86_64:

test gen_u32_mwp                 ... bench:       1,472 ns/iter (+/- 8) = 2717 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,082 ns/iter (+/- 2) = 3696 MB/s

test gen_u64_mwp                 ... bench:       1,363 ns/iter (+/- 6) = 5869 MB/s
test gen_u64_xorshift_128_32     ... bench:       2,638 ns/iter (+/- 23) = 3032 MB/s

x86:

test gen_u32_mwp                 ... bench:       3,311 ns/iter (+/- 13) = 1208 MB/s
test gen_u32_xorshift_128_32     ... bench:       1,475 ns/iter (+/- 60) = 2711 MB/s

test gen_u64_mwp                 ... bench:       5,101 ns/iter (+/- 27) = 1568 MB/s
test gen_u64_xorshift_128_32     ... bench:       4,591 ns/iter (+/- 59) = 1742 MB/s

PractRand seems pretty happy with it so far (half a terabyte tested).

@Ichoran

MCG is a bit risky given that it's degenerate when self.m is zero.

Good point. We would have to make sure the seed is not 0, just like we have to for Xorshift/Xoroshiro and PCG with MCG as a base generator.

It is my understanding that the 64/64 permutation has problems with each output appearing exactly once precisely because the period is too small. If you used it with a larger period, it would be fine.

Yes. Although it also depends a little on how the period works. For example imagine a scheme where the base generator first gives every number between 0 and 2^64 in one order, and for the next period every number again only once but in some other order. That is why I wanted to know the details of the extension mechanism.

@Lokathor

I think this gives the proper distribution, but has the consequence that some results will not appear at all

That sounds like a very improper solution! I'd be upset to use a generator where some results can't possibly happen, even despite the fact that any particular result only has a 1/(2^64) chance to begin with.

I agree it is not nice. On the other hand I don't think it matters. You only know which values are missing after generating and keeping track of 2^64 numbers; I don't think that is even feasible. And for every seed the numbers that appear two or three times, and the results that are missing, are different. But it doesn't matter: the period is larger than I estimated.

Thanks to you both for thinking along seriously!

I am not going to push this RNG too far, but it seems to work well and is faster than the other alternatives for generating good-quality u64's on x86.

@pitdicker commented Dec 7, 2017

Another possible solution: the XSH output function needs only 6 bits to do its work, and should be able to output up to 58 bits. The mantissa of an f64 can only store 53 bits, so we could make something like PCG XSH RR 64/53 work.

I see a few disadvantages though:

  • generating doubles directly does not fit with the current trait design.
  • we would have to use a more primitive conversion to doubles, instead of one that gets higher precision closer to zero.
  • It is still a little slower than my concoction from the previous comment 😄

@Lokathor commented Dec 8, 2017

Well, converting one or more u64 values into a "good" f64 value (for some range and distribution and such) is something that's rather external to any particular generator. Assuming that you have a uniform production of u64 values, there should be a single formula that takes a generator and makes a f64 for any given distribution you want. Of course, some generators are ever so slightly non-uniform, and so they will make the final f64 ever so slightly non-uniform, but that's something the end user will have to care about (or not) when they're picking a generator. We can only document it and hope they read the documentation.
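As a concrete (if simplistic) example of such a formula, the usual 53-bit conversion to a uniform f64 in [0, 1) looks like this; it is a sketch, not the conversion rand actually uses:

fn u64_to_f64(x: u64) -> f64 {
    // Keep the 53 most significant bits and scale them into [0, 1).
    (x >> 11) as f64 * (1.0 / (1u64 << 53) as f64)
}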
