Avoid surrogates when generating char using Standard distribution #519

Merged
2 commits merged into rust-random:master on Jun 21, 2018

Conversation

Pazzaz (Contributor) commented Jun 20, 2018

This probably isn't performance critical, but it seemed wasteful to have a loop (one that could in theory run forever) just to generate a char. The new version also benchmarked faster for me:

test misc_gen_chars_old       ... bench:       1,031 ns/iter (+/- 177) = 496 MB/s
test misc_gen_chars_new       ... bench:         920 ns/iter (+/- 978) = 556 MB/s
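
For context, a minimal sketch of a loop-free, gap-skipping sampler along these lines, written against the rand 0.5-era Uniform API that this PR targets; the function name and exact structure are illustrative rather than the PR's literal diff:

    use rand::Rng;
    use rand::distributions::{Distribution, Uniform};

    // Valid `char`s are the code points [0, 0xD800) and [0xE000, 0x11_0000);
    // the surrogate block [0xD800, 0xE000) must never be produced.
    const GAP_SIZE: u32 = 0xE000 - 0xD800;

    fn sample_char<R: Rng + ?Sized>(rng: &mut R) -> char {
        // Sample from a range whose size equals the number of valid chars,
        // then shift anything at or below 0xDFFF down by GAP_SIZE so it
        // lands in [0, 0xD800). No rejection loop is needed.
        let range = Uniform::new(GAP_SIZE, 0x11_0000u32);
        let mut n = range.sample(rng);
        if n <= 0xDFFF {
            n -= GAP_SIZE;
        }
        // Every reachable `n` is now a valid Unicode scalar value.
        unsafe { std::char::from_u32_unchecked(n) }
    }

The shift makes the mapping a bijection from the sampled range onto the set of valid scalar values, so the result stays uniformly distributed over all chars.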

dhardy (Member) commented Jun 20, 2018

Since the old version only rejected 0.2% of samples, the removed rejection alone doesn't explain the performance increase. The simpler logic (no branching other than the conditional subtraction, which might not even be compiled to a branch) may explain the improvement.

Looks good anyway! @pitdicker?

@dhardy
Copy link
Member

dhardy commented Jun 20, 2018

It might be faster to sample from 0 to 0x11_0000 - GAP_SIZE instead (depending on the optimiser).
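
A sketch of that variant, under the same assumptions as the sketch above (GAP_SIZE and the function name are again illustrative):

    use rand::Rng;
    use rand::distributions::{Distribution, Uniform};

    const GAP_SIZE: u32 = 0xE000 - 0xD800;

    fn sample_char_alt<R: Rng + ?Sized>(rng: &mut R) -> char {
        // Sample below the gap-reduced upper bound, then shift anything at or
        // above the start of the surrogate block up over the gap.
        let range = Uniform::new(0u32, 0x11_0000 - GAP_SIZE);
        let mut n = range.sample(rng);
        if n >= 0xD800 {
            n += GAP_SIZE;
        }
        unsafe { std::char::from_u32_unchecked(n) }
    }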

Pazzaz (Contributor, author) commented Jun 20, 2018

Seems to give the same performance.

test misc_gen_chars       ... bench:         926 ns/iter (+/- 463) = 552 MB/s

pitdicker (Contributor) commented Jun 20, 2018

Good job!

I think the existing distr_standard_codepoint benchmark is a bit more reliable than the one you added, or at least easier to compare against the benchmarks that measure only the PRNG's performance and against the benchmarks of other distributions. Can you remove the new benchmark?

Benchmark before:

test distr_standard_codepoint    ... bench:       2,137 ns/iter (+/- 5) = 1871 MB/s

With this PR:

test distr_standard_codepoint    ... bench:       1,943 ns/iter (+/- 51) = 2058 MB/s

With Uniform::new(0, 0x11_0000 - GAP_SIZE) (and the corresponding changes):

test distr_standard_codepoint    ... bench:       2,020 ns/iter (+/- 16) = 1980 MB/s

I'm not sure why the different range is a bit slower. It needs one addition less in the range code; on the other hand, the check and compensation for code points in the gap and above (n >= 0xD800) is true much more often.

Can you add a comment noting that this was investigated, and that a range of (GAP_SIZE, 0x11_0000) seems faster?
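
For a rough sense of the branch frequencies being compared here (simple arithmetic over the 0x10F800 valid code points, not taken from the benchmarks):

    fn main() {
        let total = (0x11_0000u32 - 0x800) as f64; // 0x10F800 valid chars
        // Range (GAP_SIZE, 0x11_0000): the shift-down branch (n <= 0xDFFF)
        // covers the 0xD800 in-range values at or below 0xDFFF.
        println!("shift-down taken: {:.1}%", 0xD800 as f64 / total * 100.0); // ~5.0%
        // Range (0, 0x11_0000 - GAP_SIZE): the shift-up branch (n >= 0xD800)
        // covers the remaining 0x10_2000 values.
        println!("shift-up taken:   {:.1}%", 0x10_2000 as f64 / total * 100.0); // ~95.0%
    }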

pitdicker (Contributor) commented:

Somewhat funny: on Reddit there is an unhappy discussion about the amount of unsafe code (and its questionable use) in actix-web, yet here we are adding more unsafe code 😄.

dhardy (Member) commented Jun 20, 2018

Hmm. My take from a quick glance at that discussion is (a) that unsafe is sometimes required and sometimes merely used for performance (as here), and (b) that one must be careful to make unsafe uses easy to review and not mask real issues.

In this case the code is quite easy to understand, so not a big issue I think.

pitdicker (Contributor) commented:

I agree, and am all for it in this case.

Pazzaz (Contributor, author) commented Jun 20, 2018

> Can you remove the new benchmark?

Sorry, I didn't see the existing one. I've removed it.

> Can you add a comment noting that this was investigated?

Done.

pitdicker (Contributor) commented:

Thank you!

sicking (Contributor) commented Jun 20, 2018

One way to reduce the concern about unsafe code would be to add a debug_assert! which checks that char::from_u32(n) would have returned the same result.
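
A minimal sketch of such a check (the helper name is hypothetical; n stands for the sampled code point, as in the PR):

    // Assert in debug builds that the checked conversion would succeed (and
    // therefore agree with the unchecked one) before converting.
    fn debug_checked_from_u32(n: u32) -> char {
        debug_assert!(std::char::from_u32(n).is_some(),
                      "{:#x} is not a valid Unicode scalar value", n);
        unsafe { std::char::from_u32_unchecked(n) }
    }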

Review comments on the diff:

@@ -44,15 +44,21 @@ pub struct Alphanumeric;
 impl Distribution<char> for Standard {
     #[inline]
     fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> char {
-        let range = Uniform::new(0u32, 0x11_0000);

A collaborator commented on this line:

Could we make this new(0u32, char::MAX as u32)? It would be more explicit about what we're doing.

A member replied:

This line is being removed. But I don't think this is a good idea: you're off by 1 (the bound would need to be inclusive), and if char::MAX were ever to change we wouldn't know whether the whole new range (minus the existing gap) should be used. So it's better just to use local constants, as in the current implementation.
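
A small illustration of the off-by-one point, assuming Uniform::new's half-open [low, high) semantics:

    fn main() {
        // char::MAX is the last valid scalar value, so an exclusive upper
        // bound of `char::MAX as u32` would silently exclude char::MAX itself;
        // Uniform::new_inclusive would be needed instead, and the surrogate
        // gap would still have to be handled separately.
        assert_eq!(std::char::MAX as u32, 0x10FFFF);
        assert_eq!(0x11_0000u32 - 1, std::char::MAX as u32);
    }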

dhardy (Member) commented Jun 21, 2018

> One way to reduce the concern about unsafe code would be to add a debug_assert! which checks that char::from_u32(n) would have returned the same result.

I was thinking char::from_u32(n).unwrap_unchecked() where the latter method does a debug assert. But this requires a new Option method!

dhardy merged commit af1303c into rust-random:master on Jun 21, 2018
TheIronBorn (Collaborator) commented Jun 23, 2018

Debug assertions aren't run with cargo test --release; a check guarded by #[cfg(any(test, debug_assertions))] would be better.
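
A sketch of what that guard could look like (an illustrative helper, not the merged code):

    // The check runs in debug builds and under `cfg(test)`, so it is not
    // silently skipped by `cargo test --release`, but it compiles away in
    // ordinary release builds.
    #[cfg(any(test, debug_assertions))]
    fn assert_valid_scalar(n: u32) {
        assert!(std::char::from_u32(n).is_some(),
                "{:#x} is not a valid Unicode scalar value", n);
    }

    #[cfg(not(any(test, debug_assertions)))]
    fn assert_valid_scalar(_n: u32) {}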

sicking (Contributor) commented Jun 23, 2018

Yeah. An unwrap_unchecked() function isn't really implementable in the language as it currently stands, even with unsafe code: there's no way to unwrap an enum without using a full match, like unwrap() does. I guess you could use raw pointers and rely on the binary encoding of Option, but that's shaky enough that I don't think it'll happen.

dhardy (Member) commented Jun 23, 2018

I wonder if it would be possible with some ugly trick like transmuting to enum UncheckedSomeOption<T> { Some(T), None(!) }. But we're way off topic.

I guess adding debug_assert!(char::from_u32(n).is_some()); would be reasonable here.
