
Use std::arch for SIMD and target_feature #46

Open
bluss opened this issue Jan 9, 2016 · 9 comments

Comments

@bluss
Member

bluss commented Jan 9, 2016

See rust-lang/rust/issues/29717

Use it to select the implementation for the unrolled dot product and the scalar sum.
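A minimal sketch of what such selection could look like with today's stable `std::arch`: runtime feature detection picks between an AVX-enabled path and a portable fallback for a dot product. All names here are illustrative, not ndarray API, and the AVX path relies on auto-vectorization rather than explicit intrinsics:

```rust
// Portable fallback; always correct on any target.
fn dot_fallback(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// With the "avx" feature enabled for this function, LLVM may
// auto-vectorize the same loop with wider registers. Explicit
// intrinsics would go here in a real implementation.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx")]
unsafe fn dot_avx(a: &[f64], b: &[f64]) -> f64 {
    dot_fallback(a, b)
}

pub fn dot(a: &[f64], b: &[f64]) -> f64 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx") {
            // Safe: we just verified AVX is available at runtime.
            return unsafe { dot_avx(a, b) };
        }
    }
    dot_fallback(a, b)
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [4.0, 3.0, 2.0, 1.0];
    assert_eq!(dot(&a, &b), 20.0);
    println!("dot = {}", dot(&a, &b));
}
```

The `#[target_feature]` function stays `unsafe` because calling it on a CPU without AVX is undefined behavior; the runtime check is what makes the call sound.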

@bluss bluss changed the title Use cfg(target_feature=) when stable Use std::arch for SIMD and target_feature Nov 13, 2018
@bluss
Member Author

bluss commented Nov 13, 2018

Preferred approach would be to move the heavy lifting and inner loops (dot product etc) to a separate crate in the style of https://github.com/bluss/numeric-loops or another existing already simdified crate.

@SparrowLii
Contributor

SparrowLii commented Mar 9, 2021

@bluss I am contributing to std::arch to help it become stable as soon as possible. I would like to undertake the SIMD implementation for ndarray. I think we can create a new branch from master for implementation and discussion.

The following is a very simple example:

```rust
#![feature(stdsimd)]
#![feature(stdsimd_internal)]
use ndarray::*;
use core_arch::simd::*;
use core_arch::simd_llvm::*;
use std::intrinsics::transmute;
use core_arch::arch::x86_64::{__m128bh, m128bhExt};

// Just for demonstration; a much faster construction would be used in practice.
pub fn simd_arr1(xs: &[i32]) -> Array1<i32x4> {
    let len = xs.len();
    assert!(len % 4 == 0);
    let mut i = 0;
    let mut v: Vec<i32x4> = Vec::new();
    while i + 4 <= len {
        v.push(i32x4::new(xs[i], xs[i + 1], xs[i + 2], xs[i + 3]));
        i += 4;
    }
    ArrayBase::from(v)
}

fn main() {
    let a = arr1(&[1, 2, 3, 4, 5, 6, 7, 8]);
    let b = arr1(&[1, 2, 3, 4, 5, 6, 7, 8]);
    let c = Zip::from(&a).and(&b).map_collect(|x, y| x * y);
    println!("{}", c);

    let a_simd = simd_arr1(&[1, 2, 3, 4, 5, 6, 7, 8]);
    let b_simd = simd_arr1(&[1, 2, 3, 4, 5, 6, 7, 8]);
    unsafe {
        let c_simd = Zip::from(&a_simd).and(&b_simd).map_collect(|x, y| {
            simd_mul(
                transmute::<_, __m128bh>(x.clone()),
                transmute::<_, __m128bh>(y.clone()),
            )
            .as_i32x4()
        });
        println!("{:?}", c_simd);
    }
}
```

Output:

```
[1, 4, 9, 16, 25, 36, 49, 64]
[i32x4(1, 4, 9, 16), i32x4(25, 36, 49, 64)], shape=[2], strides=[1], layout=CFcf (0xf), const ndim=1
```

@bluss
Member Author

bluss commented Mar 9, 2021

Hey, it's good that we talk about this before you get started. Note that this issue is not intended to be about arrays holding explicit SIMD element types at all; that would be a different design. Accelerating operations on Array<f64, _> would be a lot more interesting.

IMO the SIMD we are most interested in, for x86 at least, is already stable.

Note also that I have suggested in this issue that any SIMD code like that should live in a new crate that we depend on. That means it would not be part of the ndarray crate itself.
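As a rough illustration of that layering, a hypothetical inner-loop crate could expose plain slice kernels that ndarray-level code dispatches to when the arrays are contiguous. The module below merely stands in for such a "numeric-loops"-style crate; every name is invented:

```rust
mod loops {
    // Stand-in for a separate inner-loop crate. A kernel this simple
    // is readily auto-vectorized by LLVM even without intrinsics.
    pub fn scaled_add(y: &mut [f64], alpha: f64, x: &[f64]) {
        for (yi, xi) in y.iter_mut().zip(x) {
            *yi += alpha * *xi;
        }
    }
}

fn main() {
    // ndarray-level code would obtain these slices from contiguous
    // arrays (e.g. via as_slice) and fall back to generic iteration
    // otherwise.
    let mut y = vec![1.0f64; 8];
    let x = vec![2.0f64; 8];
    loops::scaled_add(&mut y, 0.5, &x);
    assert!(y.iter().all(|&v| v == 2.0));
    println!("{:?}", y);
}
```

The point of the split is that the kernel crate can be tested, benchmarked, and SIMD-tuned independently of ndarray's array machinery.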

@SparrowLii
Contributor

SparrowLii commented Mar 9, 2021

@bluss Then I hope we create such a crate under rust-ndarray (rather than as a personal crate).
So do we need a crate similar to universal intrinsics? We could also refer to usimd in NumPy.
Yes, std::arch for x86 and x86_64 is already stable, so I can start there right away.

@SparrowLii
Contributor

SparrowLii commented Mar 13, 2021

I tried using SIMD in the operator overloading for multiplication, here, and put the usage of avx512f instructions in another crate.
Then I ran a SIMD test on arrays of size 500x500. main.rs:

```rust
use ndarray::Array;
use ndarray_rand::rand::distributions::Uniform;
use ndarray_rand::RandomExt;
use std::time::Instant; // Instant is monotonic, unlike SystemTime, so it is safer for timing

fn main() {
    // f64
    let a = Array::random((500, 500), Uniform::new(0., 2.));
    let b = Array::random((500, 500), Uniform::new(0., 2.));
    let start = Instant::now();
    let c_simd = &a * &b;
    println!("simd f64 {:?}", start.elapsed());

    let start = Instant::now();
    let c = a * b;
    println!("normal f64 {:?}", start.elapsed());
    assert_eq!(c_simd, c);

    // i32
    let a = Array::random((500, 500), Uniform::new(0, 255));
    let b = Array::random((500, 500), Uniform::new(0, 255));
    let start = Instant::now();
    let c_simd = &a * &b;
    println!("simd i32 {:?}", start.elapsed());

    let start = Instant::now();
    let c = a * b;
    println!("normal i32 {:?}", start.elapsed());
    assert_eq!(c_simd, c);
}
```

The result is as follows:

```
simd f64 6.6887ms
normal f64 14.7793ms
simd i32 3.4118ms
normal i32 13.6641ms
```

The f64 operation was accelerated by more than 2x, and the i32 operation by more than 4x.

I'm wondering if I am working in the right direction.

@SparrowLii
Contributor

@bluss Could you help point out which methods in ndarray should use SIMD first?

@SparrowLii
Contributor

SparrowLii commented Apr 29, 2021

Here is my plan:

  1. Build an easier-to-use SIMD crate on top of stdarch and stdsimd that implements automatic detection of hardware features and does not distinguish between vector lengths.
  2. Help the compiler team complete specialization. That way, SIMD acceleration could be achieved with very little change to ndarray, and it could also solve the issue of broadcasting.

This looks crazy, but I will try my best.
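For point 1, one small building block could be caching the detected hardware features once, so that later dispatch is a cheap load rather than a repeated CPUID query. This is only a sketch of that idea; the enum and function names are invented:

```rust
use std::sync::OnceLock;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Isa {
    Avx2,
    Baseline,
}

// Runs the actual runtime detection; cheap, but not free.
fn detect() -> Isa {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return Isa::Avx2;
        }
    }
    Isa::Baseline
}

// Detection happens once; every later call just reads the cached value.
fn isa() -> Isa {
    static ISA: OnceLock<Isa> = OnceLock::new();
    *ISA.get_or_init(detect)
}

fn main() {
    println!("selected ISA: {:?}", isa());
    // The selection is stable across calls.
    assert_eq!(isa(), isa());
}
```

A kernel entry point would match on `isa()` and jump to the corresponding `#[target_feature]`-compiled implementation.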

@dafmdev

dafmdev commented Jul 26, 2023

I think you may be interested in this project; once SIMD is in std, ndarray could possibly support it to further improve its performance.

@skewballfox

skewballfox commented Jan 4, 2024

> Preferred approach would be to move the heavy lifting and inner loops (dot product etc) to a separate crate in the style of https://github.com/bluss/numeric-loops or another existing already simdified crate.

Is anybody working on this, or any reason I shouldn't attempt it?

Just to clarify: I'm assuming this means extracting the internal contents (loops and basic operations) of the existing ndarray functions into a separate crate, ndarray-core, which could then be feature-flagged or swapped for another?
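That kind of swap could be expressed through Cargo features, purely as a hypothetical sketch; none of these crate or feature names exist today:

```toml
# Hypothetical ndarray Cargo.toml fragment: the inner-loop backend is
# selected by feature flag, defaulting to a portable implementation.
[features]
default = ["loops-portable"]
loops-portable = ["dep:ndarray-loops-portable"]
loops-simd = ["dep:ndarray-loops-simd"]

[dependencies]
ndarray-loops-portable = { version = "0.1", optional = true }
ndarray-loops-simd = { version = "0.1", optional = true }
```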
