Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement SampleUniform for char #1068

Merged
merged 6 commits into from Nov 27, 2020
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
25 changes: 19 additions & 6 deletions src/distributions/uniform.rs
Expand Up @@ -714,20 +714,30 @@ impl SampleUniform for char {
///
/// Unless you are implementing [`UniformSampler`] for your own type, this type
/// should not be used directly, use [`Uniform`] instead.
///
/// This differs from integer range sampling since the range `0xD800..=0xDFFF`
/// are used for surrogate pairs in UCS and UTF-16, and consequently are not
/// valid Unicode code points. We must therefore avoid sampling values in this
/// range.
#[derive(Clone, Copy, Debug)]
#[cfg_attr(feature = "serde1", derive(Serialize, Deserialize))]
pub struct UniformChar {
sampler: UniformInt<u32>,
}

/// UTF-16 surrogate range start
const CHAR_SURROGATE_START: u32 = 0xD800
/// UTF-16 surrogate range size
const CHAR_SURROGATE_LEN: u32 = 0xE000 - CHAR_SURROGATE_START;

impl UniformChar {
#[inline]
fn new_(mut low: u32, mut high: u32) -> Self {
if low >= 0xD800 {
low -= 0xE000 - 0xD800;
if low >= CHAR_SURROGATE_START {
low -= CHAR_SURROGATE_LEN;
dhardy marked this conversation as resolved.
Show resolved Hide resolved
}
Copy link
Member

@newpavlov newpavlov Nov 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth to add the following helper functions:

/// UTF-16 surrogate range start
const RANGE_START: u32 = 0xD800
/// UTF-16 surrogate range size
const RANGE_SIZE: u32 = 0xE000 - RANGE_START;

/// Convert `char` to compressed `u32`
fn to_comp_u32(c: char) -> u32 {
    match c as u32 {
        c if c >= RANGE_START => c - RANGE_SIZE,
        c => c,
    }
}

/// Convert compressed `u32` to `char`
fn from_comp_u32(mut c: u32) -> u32 {
    if c >= RANGE_START {
        c += RANGE_SIZE;
    }
    // SAFETY: checked that `c` is a legal unicode value, since `c` is initially
    // constructed from `char` it can not be larger than 0xFFFE, i.e. safety of
    // this function relies on the constructor methods
    unsafe { core::char::from_u32_unchecked(c) }
}

In addition to make code easier to understand it would allow to remove the new_ method.

IIRC it's allowed to have private functions safety of which relies on code in the current module (e.g. it applies to the Vec code).

if high >= 0xD800 {
high -= 0xE000 - 0xD800;
if high >= CHAR_SURROGATE_START {
high -= CHAR_SURROGATE_LEN;
}
UniformChar {
sampler: UniformInt::<u32>::new_inclusive(low, high),
Expand Down Expand Up @@ -769,9 +779,12 @@ impl UniformSampler for UniformChar {

fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> Self::X {
let mut x = self.sampler.sample(rng);
if x >= 0xD800 {
x += 0xE000 - 0xD800;
if x >= CHAR_SURROGATE_START {
x += CHAR_SURROGATE_LEN;
dhardy marked this conversation as resolved.
Show resolved Hide resolved
}
// SAFETY: x must not be in surrogate range or greater than char::MAX.
// This relies on range constructors which accept char arguments.
// Validity of input char values is assumed.
unsafe { core::char::from_u32_unchecked(x) }
dhardy marked this conversation as resolved.
Show resolved Hide resolved
}
}
Expand Down