Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement SampleUniform for char #1068

Merged
merged 6 commits into from Nov 27, 2020
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
103 changes: 103 additions & 0 deletions src/distributions/uniform.rs
Expand Up @@ -706,6 +706,88 @@ uniform_simd_int_impl! {
u8
}

impl SampleUniform for char {
type Sampler = UniformChar;
}

/// The back-end implementing [`UniformSampler`] for `char`.
///
/// Unless you are implementing [`UniformSampler`] for your own type, this type
/// should not be used directly, use [`Uniform`] instead.
///
/// This differs from integer range sampling since the range `0xD800..=0xDFFF`
/// are used for surrogate pairs in UCS and UTF-16, and consequently are not
/// valid Unicode code points. We must therefore avoid sampling values in this
/// range.
#[derive(Clone, Copy, Debug)]
#[cfg_attr(feature = "serde1", derive(Serialize, Deserialize))]
pub struct UniformChar {
sampler: UniformInt<u32>,
}

/// UTF-16 surrogate range start
const CHAR_SURROGATE_START: u32 = 0xD800
/// UTF-16 surrogate range size
const CHAR_SURROGATE_LEN: u32 = 0xE000 - CHAR_SURROGATE_START;

impl UniformChar {
#[inline]
fn new_(mut low: u32, mut high: u32) -> Self {
if low >= CHAR_SURROGATE_START {
low -= CHAR_SURROGATE_LEN;
dhardy marked this conversation as resolved.
Show resolved Hide resolved
}
Copy link
Member

@newpavlov newpavlov Nov 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth to add the following helper functions:

/// UTF-16 surrogate range start
const RANGE_START: u32 = 0xD800
/// UTF-16 surrogate range size
const RANGE_SIZE: u32 = 0xE000 - RANGE_START;

/// Convert `char` to compressed `u32`
fn to_comp_u32(c: char) -> u32 {
    match c as u32 {
        c if c >= RANGE_START => c - RANGE_SIZE,
        c => c,
    }
}

/// Convert compressed `u32` to `char`
fn from_comp_u32(mut c: u32) -> u32 {
    if c >= RANGE_START {
        c += RANGE_SIZE;
    }
    // SAFETY: checked that `c` is a legal unicode value, since `c` is initially
    // constructed from `char` it can not be larger than 0xFFFE, i.e. safety of
    // this function relies on the constructor methods
    unsafe { core::char::from_u32_unchecked(c) }
}

In addition to make code easier to understand it would allow to remove the new_ method.

IIRC it's allowed to have private functions safety of which relies on code in the current module (e.g. it applies to the Vec code).

if high >= CHAR_SURROGATE_START {
high -= CHAR_SURROGATE_LEN;
}
UniformChar {
sampler: UniformInt::<u32>::new_inclusive(low, high),
}
}
}

impl UniformSampler for UniformChar {
type X = char;

#[inline] // if the range is constant, this helps LLVM to do the
// calculations at compile-time.
fn new<B1, B2>(low_b: B1, high_b: B2) -> Self
where
B1: SampleBorrow<Self::X> + Sized,
B2: SampleBorrow<Self::X> + Sized,
{
let low = *low_b.borrow() as u32;
let high = *high_b.borrow() as u32;
assert!(low < high, "Uniform::new called with `low >= high`");
Self::new_(low, high - 1)
}

#[inline] // if the range is constant, this helps LLVM to do the
// calculations at compile-time.
fn new_inclusive<B1, B2>(low_b: B1, high_b: B2) -> Self
where
B1: SampleBorrow<Self::X> + Sized,
B2: SampleBorrow<Self::X> + Sized,
{
let low = *low_b.borrow() as u32;
let high = *high_b.borrow() as u32;
assert!(
low <= high,
"Uniform::new_inclusive called with `low > high`"
);
Self::new_(low, high)
}

fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> Self::X {
let mut x = self.sampler.sample(rng);
if x >= CHAR_SURROGATE_START {
x += CHAR_SURROGATE_LEN;
dhardy marked this conversation as resolved.
Show resolved Hide resolved
}
// SAFETY: x must not be in surrogate range or greater than char::MAX.
// This relies on range constructors which accept char arguments.
// Validity of input char values is assumed.
unsafe { core::char::from_u32_unchecked(x) }
dhardy marked this conversation as resolved.
Show resolved Hide resolved
}
}

/// The back-end implementing [`UniformSampler`] for floating-point types.
///
Expand Down Expand Up @@ -1207,6 +1289,27 @@ mod tests {
}
}

#[test]
#[cfg_attr(miri, ignore)] // Miri is too slow
fn test_char() {
let mut rng = crate::test::rng(891);
let mut max = core::char::from_u32(0).unwrap();
for _ in 0..100 {
let c = rng.gen_range('A'..='Z');
assert!('A' <= c && c <= 'Z');
max = max.max(c);
}
assert_eq!(max, 'Z');
let d = Uniform::new(
core::char::from_u32(0xD7F0).unwrap(),
core::char::from_u32(0xE010).unwrap(),
);
for _ in 0..100 {
let c = d.sample(&mut rng);
assert!((c as u32) < 0xD800 || (c as u32) > 0xDFFF);
}
}

#[test]
#[cfg_attr(miri, ignore)] // Miri is too slow
fn test_floats() {
Expand Down