Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

switch to std::simd, expand SIMD & docs #1239

Merged
merged 5 commits into from
Aug 10, 2022

Conversation

TheIronBorn
Copy link
Collaborator

@TheIronBorn TheIronBorn commented Jul 7, 2022

Closes #1232. Should also fix #1162.

The new API is pretty easy and the new trait system made for some convenient generic implementations.

Changes:

  • moved __m128i & __m256i away from simd_support feature
  • added __m512i and some internal AVX512 optimizations
  • dropped u8x2/i8x2 types due to lack of support in std::simd
  • added SIMD types (and NonZero) to the list in Standard type documentation
  • implemented Distribution<maskNxM> for Standard, behaves like rng.gen::<bool>
  • implemented Distribution<mask64xM> for Bernoulli, each lane uses the same prob

The trait system allows impls to adapt to future SIMD types and doesn't clutter the rendered documentation. Using it in more places would mean some code duplication though.

__m512i requires nightly Rust's stdsimd feature at the moment so I put it under the simd_support feature but nightly might make more sense.

@TheIronBorn
Copy link
Collaborator Author

Since std::simd uses the same LLVM API as packed_simd I doubt this will change benchmarks at all but I'll try to run some later.

@TheIronBorn
Copy link
Collaborator Author

Turns out we don't have any SIMD benchmarks

Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good riddance, fragile packed_simd! Lets hope std SIMD is better.

This isn't a full review. Thanks @TheIronBorn.

benches/misc.rs Outdated Show resolved Hide resolved
src/distributions/bernoulli.rs Outdated Show resolved Hide resolved
src/distributions/integer.rs Outdated Show resolved Hide resolved
@TheIronBorn
Copy link
Collaborator Author

I'm agreed that introducing seemingly unpredictable type inference is bad.

An extra trait or something would also make documentation easier, we could use #[cfg_attr(doc_cfg, doc(cfg(feature = "simd_support")))] and other stuff as well.

The Bernoulli stuff should perhaps wait for #1227

@TheIronBorn TheIronBorn force-pushed the std-simd branch 2 times, most recently from acd5020 to d41a948 Compare July 9, 2022 18:02
move __m128i to stable, expand documentation, add SIMD to Bernoulli, add maskNxM, add __m512i

genericize simd uniform int

remove some debug stuff

remove bernoulli

foo

foo
@TheIronBorn
Copy link
Collaborator Author

Removed the Bernoulli stuff and squashed some commits.

Ran some benchmarks for 128/256 bit vectors and they're mostly the same except for small optimizations packed_simd had but std::simd isn't prioritizing. Maybe LLVM might optimize it better in the future.

src/distributions/integer.rs Outdated Show resolved Hide resolved
Comment on lines 176 to 178
fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> Mask<T, LANES> {
rng.gen().lanes_lt(Simd::default())
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this compute to false in all lanes?

And, regarding the proceeding doc, ... doesn't this make the Mark type inappropriate if normally only one bit from each lane is used? Ugh, guess we can't fix that.

BTW one can assume that a random boolean is true with 50% probability, but should one actually assume anything about what a random Mask is?

Maybe instead we should provide a higher level API around select etc... though given that SIMD is a little niche, maybe just a few recipes in the book would be enough.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MaskElement must be a signed integer so we have some form of iNxM.lt(0)

There are other SIMD operations which use all the bits of a mask, like an SSE2 select (x & m) | (y & !m), and std::simd doesn't specify layout or value representation so we have to do this.

Correct, we can't assume anything about Mask value representation but I don't know what users might need that for. Is there a specific usecase example you're thinking of?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No specific use-case; I was just going by what you mentioned (select). Probably this is unnecessarily to get into however.

@TheIronBorn
Copy link
Collaborator Author

Realized that the nightly feature is documented to only provide optimizations, not extra API stuff so putting __m512 under simd_support probably makes the most sense

@dhardy
Copy link
Member

dhardy commented Jul 11, 2022

Realized that the nightly feature is documented to only provide optimizations, not extra API stuff so putting __m512 under simd_support probably makes the most sense

Any flags depending on nightly are unstable so we don't have to worry too much. I would guess if other SIMD parts get stabilized in libstd first, then __m512 would eventually end up under another feature flag.. or simd_support would end up only adding the non-stabilized SIMD features.

@dhardy dhardy mentioned this pull request Aug 4, 2022
@TheIronBorn
Copy link
Collaborator Author

Anything left to review here?

@TheIronBorn
Copy link
Collaborator Author

Oh there's a weird new failure. I don't understand why it seems to be enabling simd_support with just nightly

#[cfg(feature = "simd_support")]
macro_rules! simd_impl {
($(($intrinsic:ident, $vec:ty),)+) => {$(
macro_rules! intrinsic_impl {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is an x86 only impl, maybe the name should reflect that?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

Comment on lines 152 to 181
/// Note that on some hardware like x86/64 mask operations like [`_mm_blendv_epi8`]
/// only care about a single bit. This means that you could use uniform random bits
/// directly:
///
/// ```ignore
/// // this may be faster...
/// let x = unsafe { _mm_blendv_epi8(a.into(), b.into(), rng.gen::<__m128i>()) };
///
/// // ...than this
/// let x = rng.gen::<mask8x16>().select(b, a);
/// ```
///
/// Since most bits are unused you could also generate only as many bits as you need.
///
/// [`_mm_blendv_epi8`]: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_blendv_epi8&ig_expand=514/
/// [`simd_support`]: https://github.com/rust-random/rand#crate-features
#[cfg(feature = "simd_support")]
impl<T, const LANES: usize> Distribution<Mask<T, LANES>> for Standard
where
T: MaskElement + PartialOrd + SimdElement<Mask = T> + Default,
LaneCount<LANES>: SupportedLaneCount,
Standard: Distribution<Simd<T, LANES>>,
{
#[inline]
fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> Mask<T, LANES> {
// `MaskElement` must be a signed integer, so this is equivalent
// to the scalar `i32 < 0` method
rng.gen().lanes_lt(Simd::default())
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand this is correct, but not necessarily the most efficient: only the high bit of each lane generated by rng.gen() is used. I'm not sure if this is actually worth worrying about however.

I'm also just a little surprised it type-checks: it relies on only one type supporting lanes_lt(..) -> Mask<T, LANES>. Better to clarify with rng.gen::<T>().lates_lt(T::default())?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inefficiency can actually be worse in scalar. mask8x16 wastes 7 bits per lane while bool wastes 31, mask64x2 though wastes 63. I'll look into more efficient methods but might just point readers to it in the docs

Copy link
Member

@dhardy dhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the above comments aren't blockers, so I'll also approve this PR.

@TheIronBorn
Copy link
Collaborator Author

Ah, the tests are failing because the stdsimd API changed in a recent nightly. Shame, I hoped stdsimd would be more stable, though I suppose it's a different kind of instability.

We could add some compilation conditions or wait until the changes reach stable.

@dhardy
Copy link
Member

dhardy commented Aug 6, 2022

I'm sure you know more than I do about the state of std::simd so I'll let you decide on that. Our status quo is also broken.

@TheIronBorn
Copy link
Collaborator Author

we could simply merge the packed_simd fix while we wait

@dhardy
Copy link
Member

dhardy commented Aug 6, 2022

Can do. The only thing against is that your migrations here will likely end up stale. Any idea on the time frame for stability or how much churn is likely to std::simd?

@TheIronBorn
Copy link
Collaborator Author

I don't know what I was thinking earlier but since std::simd is only available on nightly we can just fix and merge right now

@TheIronBorn
Copy link
Collaborator Author

Unfortunately there's no roadmap for stdsimd but there is likely to be at least a little churn. Though we aren't using too many features so I think we won't be affected much.
And at least it's more likely to be API instability than forcing us to rollback as often as packed_simd

@TheIronBorn
Copy link
Collaborator Author

The only remaining failure is crossbeam

@TheIronBorn TheIronBorn merged commit 2c16a92 into rust-random:master Aug 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace packed_simd with stdlib simd
2 participants