Add `DistString` impl to `Uniform` and `Slice` #1315

aobatact · 2023-05-26T07:42:13Z

`Uniform<char>` and `Slice<char>` can be useful with `DistString` to generate a `String` from a range or a set of chars.

Question

Currently impl DistString for Slice<'a, char> checks the max_char_len to minimize the reserve length, but do we need this?

dhardy

Thanks for the PR

dhardy · 2023-05-26T08:20:20Z

src/distributions/slice.rs

+        let max_char_len = self
+            .slice
+            .iter()
+            .try_fold(1, |max_len, char| {
+                // When the current max_len is 4, the result max_char_len will be 4.
+                Some(max_len.max(char.len_utf8())).filter(|len| *len < 4)
+            })
+            .unwrap_or(4);
+
+        string.reserve(max_char_len * len);
+        string.extend(self.sample_iter(rng).take(len))


If the slice is large, this could take significant time (and the only purpose is to potentially reduce the max-char-len used for space reservation below 4). I suggest only iterating when the length is under some bound (1000 maybe?).

Further, a slice could contain one 4-byte character but mostly 1-byte chars so this could massively over-reserve (perhaps significant if len is large). It isn't obvious what the best approach is (or how to test — it could be very context dependent — perhaps try an approach which isn't terrible anywhere rather than to perfectly optimise).

I mean it's arguable whether just reserving len bytes would be better. Or you could sample len/4, check the length and repeat... but probably over-complex.

What is the goal here? How many allocations are acceptable? We could always use a conservative lower limit and let `extend` deal with the additional allocations. If we reserve the minimal len/4, this should result in at most 3 allocations, right? I think this is good enough for a "general-purpose" implementation.

Are there use cases where the slice is large? Typically, I would expect the "alphabet" to be small, unless it is abused for weighted sampling, for which we have better implementations. Another use case could be "valid Unicode except a few characters", but for this, rejection sampling should be used.
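The allocation trade-off discussed above can be observed with a small experiment. This is only a sketch: `fill_counting_regrowths` is a hypothetical helper, not part of rand, and how much a `String` grows on each reallocation is an implementation detail of the standard library, so the count for multi-byte chars may vary between versions.

```rust
/// Hypothetical helper: pushes `len` copies of `c` into a String that has
/// only `len` bytes reserved up front, and counts how often the capacity
/// changes afterwards (each change implies a reallocation).
fn fill_counting_regrowths(c: char, len: usize) -> (String, usize) {
    let mut s = String::new();
    // Conservative reservation: one byte per char, which is exact for ASCII.
    s.reserve(len);
    let mut cap = s.capacity();
    let mut regrowths = 0;
    for _ in 0..len {
        s.push(c);
        if s.capacity() != cap {
            regrowths += 1;
            cap = s.capacity();
        }
    }
    (s, regrowths)
}
```

With an ASCII char the reservation is already sufficient, so no regrowth happens. With a 4-byte char the string ends up four times longer than the reservation and the standard library has to grow it along the way.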

> Typically, I would expect the "alphabet" to be small

As would I, but it could be large and this is easy to test. The simplest solution would be to measure max-char-size only where the alphabet is reasonably small, e.g. under 200 items.
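A minimal sketch of that suggestion, assuming a free function over `&[char]` rather than the actual `Slice` internals. The names `MAX_SCAN_LEN`, `max_char_len` and `reserve_hint` are hypothetical, and the bound of 200 is just the value floated in this thread.

```rust
/// Hypothetical bound: alphabets longer than this are not scanned.
const MAX_SCAN_LEN: usize = 200;

/// Upper bound on the UTF-8 byte length of any char in `alphabet`.
/// Large alphabets are not inspected and get the worst case of 4.
fn max_char_len(alphabet: &[char]) -> usize {
    if alphabet.len() > MAX_SCAN_LEN {
        return 4;
    }
    // An empty alphabet falls back to 1, like the fold seed in the PR.
    alphabet.iter().map(|c| c.len_utf8()).max().unwrap_or(1)
}

/// Number of bytes to reserve for `len` sampled chars.
fn reserve_hint(alphabet: &[char], len: usize) -> usize {
    max_char_len(alphabet) * len
}
```

Unlike the `try_fold` version in the diff, this sketch does not stop early once a 4-byte char is found. The scan is bounded by `MAX_SCAN_LEN` anyway, so the simpler form seemed acceptable here.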

As suggested, I now skip the check for long slices. The limit is currently 200, but I don't know what the best size is.
I also split the sampling when the sampling length is long or the slice contains non-ASCII chars.

dhardy · 2023-05-26T08:31:06Z

src/distributions/uniform.rs

+        // Get the utf8 length of hi to minimize extra space.
+        // SAFETY: hi used to be valid char.
+        // This relies on range constructors which accept char arguments.
+        let max_char_len = unsafe { char::from_u32_unchecked(hi).len_utf8() };


I don't believe char::len_utf8 even cares whether hi is a valid char so this should be safe, but the use of unsafe still feels unnecessary. Suggestion:

let max_char_len = char::from_u32(hi).map(char::len_utf8).unwrap_or(4);

I have applied your suggestion.
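To illustrate how the safe variant behaves, here is a small sketch that wraps the suggested expression in a hypothetical function `max_char_len` taking the upper code point as a `u32`. Values that are not valid chars (surrogates, or anything above `0x10FFFF`) fall back to the worst case of 4 bytes.

```rust
/// Upper bound on the UTF-8 length of a char with code point `hi`.
/// Invalid code points yield the conservative maximum of 4.
fn max_char_len(hi: u32) -> usize {
    char::from_u32(hi).map(char::len_utf8).unwrap_or(4)
}
```

For a surrogate value the fallback over-reserves slightly, since every valid char below the surrogate range needs at most 3 bytes, but it never under-reserves.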

dhardy · 2023-05-26T08:35:21Z

src/distributions/uniform.rs

+        // Getting the hi value to assume the required length to reserve in string.
+        let mut hi = self.0.sampler.low + self.0.sampler.range;
+        if hi >= CHAR_SURROGATE_START {
+            hi += CHAR_SURROGATE_LEN;
+        }


Largest possible result is one less than this I think?

It would be nice to have some tests for such corner cases.

I have fixed this and added some tests.
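The off-by-one can be shown with a standalone sketch. It assumes that the sampler stores `low` as a "compressed" index that skips the surrogate gap and `range` as the number of distinct values, so the largest result is `low + range - 1`. The helpers `compress`, `expand`, `max_len_inclusive` and `max_len_off_by_one` are hypothetical and only mirror the constants seen in the diff.

```rust
const CHAR_SURROGATE_START: u32 = 0xD800;
const CHAR_SURROGATE_LEN: u32 = 0xE000 - CHAR_SURROGATE_START;

/// Maps a char to an index that skips the surrogate gap.
fn compress(c: char) -> u32 {
    let x = c as u32;
    if x >= CHAR_SURROGATE_START { x - CHAR_SURROGATE_LEN } else { x }
}

/// Inverse of `compress`: maps an index back to a code point.
fn expand(x: u32) -> u32 {
    if x >= CHAR_SURROGATE_START { x + CHAR_SURROGATE_LEN } else { x }
}

fn len_or_max(code_point: u32) -> usize {
    char::from_u32(code_point).map(char::len_utf8).unwrap_or(4)
}

/// Uses the largest value that can actually be sampled from low..=high.
fn max_len_inclusive(low: char, high: char) -> usize {
    let lo = compress(low);
    let range = compress(high) - lo + 1; // number of distinct values
    len_or_max(expand(lo + range - 1))
}

/// Uses one past the largest value, as in the first version of the PR.
fn max_len_off_by_one(low: char, high: char) -> usize {
    let lo = compress(low);
    let range = compress(high) - lo + 1;
    len_or_max(expand(lo + range))
}
```

The two versions differ only when `high` sits right at a UTF-8 length boundary, for example `'\u{7f}'` or `'\u{ffff}'`, which is why such corner cases are worth testing.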

aobatact · 2023-05-31T11:47:18Z

Thank you for the review!

dhardy

Thanks!

Add impl for DistString to Uniform and Slice

7b5a417

dhardy reviewed May 26, 2023


Fix DistString impl.

4d4f34d

vks added the D-review Do: needs review label Jul 4, 2023

dhardy approved these changes Jul 14, 2023


dhardy merged commit ee80b41 into rust-random:master Jul 14, 2023
12 checks passed

aobatact deleted the more-dist-string branch July 19, 2023 08:04
