
slice performance #571

Open
nilgoyette opened this issue Dec 12, 2018 · 13 comments

@nilgoyette
Collaborator

Don't try to understand my first version too much; the second function is much simpler. It simply compares two 3x3x3 patches of data in an 11x11x11 image and returns a score. I refactored this function from

fn test_loop(data: &[f64], m: usize, n: usize, o: usize) -> f64 {
    let pl = 3;
    let br = 4;
    let bl = 11;
    let bl2 = bl * bl;
    let mut sum = 0.0;
    for a in 0..pl {
        let idx1_a = (br + a) * bl2;
        let idx2_a = (m + a) * bl2;
        for b in 0..pl {
            let idx1_b = (br + b) * bl;
            let idx2_b = (n + b) * bl;
            for c in 0..pl {
                let idx1 = idx1_a + idx1_b + br + c;
                let idx2 = idx2_a + idx2_b + o + c;
                let diff = (data[idx1] - data[idx2]).powi(2);
                sum += diff;
            }
        }
    }
    sum
}

to

fn test_slice(data: &Array3<f64>, m: usize, n: usize, o: usize) -> f64 {
    let pl = 3;
    let br = 4;
    let s1 = data.slice(s![br..br+pl, br..br+pl, br..br+pl]);
    let s2 = data.slice(s![m..m+pl, n..n+pl, o..o+pl]);

    let mut sum = 0.0;
    azip!(s1, s2 in { sum += (s1 - s2).powi(2) });
    sum
}

Of course, I'm happy with the code quality now (!), but the clean version is surprisingly slow. I benched those two functions using the same data, with test_loop taking &data.as_slice().unwrap() instead of &data.

test bench_loop              ... bench:          25 ns/iter (+/- 5)
test bench_slice             ... bench:         142 ns/iter (+/- 1)

Are those results surprising to you? Neither version allocates; both compute the indices (whether in my code or inside the library) and the sum, etc. I fail to see why the clean version is almost 6 times slower. Is slice doing something really complex?

@jturner314
Member

jturner314 commented Dec 12, 2018

I did some profiling of the Array-based approach.

Cargo.toml:

[package]
name = "nilgoyette_test"
version = "0.1.0"
edition = "2018"

[dependencies]
ndarray = "0.12"

[profile.release]
debug = true

[patch.crates-io]
ndarray = { git = "https://github.com/jturner314/ndarray.git", branch = "master" }

(Using the current master branch of ndarray, revision 27c84a5. I also tried it with 0.12.1 and got similar results.) I'm building in release mode with debugging symbols.

src/main.rs:

extern crate ndarray;

use ndarray::prelude::*;
use ndarray::{azip, s};

#[inline(never)]
fn test_slice(data: &Array3<f64>, m: usize, n: usize, o: usize) -> f64 {
    let pl = 3;
    let br = 4;
    let s1 = data.slice(s![br..br + pl, br..br + pl, br..br + pl]);
    let s2 = data.slice(s![m..m + pl, n..n + pl, o..o + pl]);

    let mut sum = 0.0;
    azip!(s1, s2 in { sum += (s1 - s2).powi(2) });
    sum
}

fn main() {
    let data = Array::from_shape_fn((11, 11, 11), |(i, j, k)| {
        i as f64 * 121. + j as f64 * 11. + k as f64
    });
    for _ in 0..50_000_000 {
        test_slice(&data, 2, 4, 5);
    }
}

I built with cargo build --release, and then ran the profiler with perf record --call-graph=dwarf -- target/release/nilgoyette_test. I then generated the report with perf report -n and expanded the interesting sections. This is the top part of the results:

Samples: 49K of event 'cycles:uppp', Event count (approx.): 38468252766
  Children      Self       Samples  Command          Shared Object       Symbol
-   94.48%     4.18%          2091  nilgoyette_test  nilgoyette_test     [.] nilgoyette_test::test_slice                                                                                                                                     ◆
   - 90.30% nilgoyette_test::test_slice                                                                                                                                                                                                      ▒
      - 44.96% ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::slice                                                                                                                                                                 ▒
         - 44.16% ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::slice_move (inlined)                                                                                                                                               ▒
            - 36.28% ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::slice_collapse (inlined)                                                                                                                                        ▒
                 core::iter::iterator::Iterator::for_each (inlined)                                                                                                                                                                          ▒
                 <core::iter::Enumerate<I> as core::iter::iterator::Iterator>::fold (inlined)                                                                                                                                                ▒
                 <core::slice::Iter<'a, T> as core::iter::iterator::Iterator>::fold (inlined)                                                                                                                                                ▒
                 _$LT$core..iter..Enumerate$LT$I$GT$$u20$as$u20$core..iter..iterator..Iterator$GT$::fold::_$u7b$$u7b$closure$u7d$$u7d$::h9ab995f5be170e99 (inlined)                                                                          ▒
                 core::iter::iterator::Iterator::for_each::_$u7b$$u7b$closure$u7d$$u7d$::h220ca70508cbe599 (inlined)                                                                                                                         ▒
               - ndarray::impl_methods::_$LT$impl$u20$ndarray..ArrayBase$LT$S$C$$u20$D$GT$$GT$::slice_collapse::_$u7b$$u7b$closure$u7d$$u7d$::h8858fce0a9c26b37 (inlined)                                                                    ▒
                  - ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::slice_axis_inplace (inlined)                                                                                                                                     ▒
                     - 33.80% ndarray::dimension::do_slice                                                                                                                                                                                   ▒
                        - 4.98% ndarray::dimension::to_abs_slice (inlined)                                                                                                                                                                   ▒
                             1.65% ndarray::dimension::abs_index (inlined)                                                                                                                                                                   ▒
                             0.68% <core::option::Option<T>>::unwrap_or (inlined)                                                                                                                                                            ▒
            + 3.93% core::iter::iterator::Iterator::for_each (inlined)                                                                                                                                                                       ▒
            + 0.56% _$LT$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$_$u5d$$GT$$u20$as$u20$ndarray..dimension..dimension_trait..Dimension$GT$::zeros::hc3f7bf56e53b8fe8 (inlined)                                                    ▒
         + 0.67% ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::view (inlined)                                                                                                                                                      ▒
      - 24.52% <ndarray::zip::Zip<(P1, P2), D>>::apply                                                                                                                                                                                       ▒
         - <ndarray::zip::Zip<P, D>>::apply_core (inlined)                                                                                                                                                                                   ▒
            - 19.28% <ndarray::zip::Zip<P, D>>::apply_core_strided (inlined)                                                                                                                                                                 ▒
               - 11.38% _$LT$ndarray..zip..Zip$LT$$LP$P1$C$$u20$P2$RP$$C$$u20$D$GT$$GT$::apply::_$u7b$$u7b$closure$u7d$$u7d$::h487429d98cc31b6b (inlined)                                                                                    ▒
                  - 11.24% nilgoyette_test::test_slice::_$u7b$$u7b$closure$u7d$$u7d$::ha858ec7295617482 (inlined)                                                                                                                            ▒
                       std::f64::<impl f64>::powi (inlined)                                                                                                                                                                                  ▒
               + 2.29% <(A, B) as ndarray::zip::OffsetTuple>::stride_offset (inlined)                                                                                                                                                        ▒
                 1.68% _$LT$ndarray..dimension..dim..Dim$LT$$u5b$usize$u3b$$u20$_$u5d$$GT$$u20$as$u20$ndarray..dimension..dimension_trait..Dimension$GT$::next_for::haac29d1e0bcee731 (inlined)                                              ▒
                 1.57% core::iter::range::<impl core::iter::iterator::Iterator for core::ops::range::Range<A>>::next (inlined)                                                                                                               ▒
               + 0.87% <(A, B) as ndarray::zip::ZippableTuple>::uget_ptr (inlined)                                                                                                                                                           ▒
              5.11% <ndarray::zip::Zip<(P1, P2), D>>::apply                                                                                                                                                                                  ▒
      - 13.83% <ndarray::zip::Zip<(P1,), D>>::and                                                                                                                                                                                            ▒
         - 8.12% <ndarray::ArrayBase<ndarray::ViewRepr<&'a A>, D> as ndarray::zip::NdProducer>::layout (inlined)                                                                                                                             ▒
            - ndarray::zip::<impl ndarray::ArrayBase<S, D>>::layout_impl (inlined)                                                                                                                                                           ▒
               - 7.49% ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::is_standard_layout (inlined)                                                                                                                                  ▒
                  - ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::is_standard_layout::is_standard_layout                                                                                                                           ▒
                     + 3.84% ndarray::dimension::dimension_trait::Dimension::default_strides                                                                                                                                                 ▒
                     + 1.83% core::iter::iterator::Iterator::any (inlined)                                                                                                                                                                   ▒
         + 0.89% <ndarray::zip::Zip<Parts, D>>::check (inlined)                                                                                                                                                                              ▒
      - 6.98% <ndarray::zip::Zip<(P,), D>>::from (inlined)                                                                                                                                                                                   ▒
           <ndarray::ArrayBase<ndarray::ViewRepr<&'a A>, D> as ndarray::zip::NdProducer>::layout (inlined)                                                                                                                                   ▒
           ndarray::zip::<impl ndarray::ArrayBase<S, D>>::layout_impl (inlined)                                                                                                                                                              ▒
           ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::is_standard_layout (inlined)                                                                                                                                              ▒
         - ndarray::impl_methods::<impl ndarray::ArrayBase<S, D>>::is_standard_layout::is_standard_layout                                                                                                                                    ▒
            + 3.54% ndarray::dimension::dimension_trait::Dimension::default_strides                                                                                                                                                          ▒
            + 1.98% core::iter::iterator::Iterator::any (inlined)                                                                                                                                                                            ▒
   + 3.53% _start                                                                                                                                                                                                                            ▒
+   94.05%     0.23%           113  nilgoyette_test  nilgoyette_test     [.] nilgoyette_test::main                                                                                                                                           ▒
+   94.05%     0.00%             0  nilgoyette_test  nilgoyette_test     [.] _start                                                                                                                                                          ▒
+   94.05%     0.00%             0  nilgoyette_test  libc-2.28.so        [.] __libc_start_main                                                                                                                                               ▒
+   94.05%     0.00%             0  nilgoyette_test  nilgoyette_test     [.] main                                                                                                                                                            ▒
+   94.05%     0.00%             0  nilgoyette_test  nilgoyette_test     [.] std::rt::lang_start_internal                                                                                                                                    ▒
+   94.05%     0.00%             0  nilgoyette_test  nilgoyette_test     [.] std::panic::catch_unwind (inlined)                                                                                                                              ▒
+   94.05%     0.00%             0  nilgoyette_test  nilgoyette_test     [.] std::panicking::try (inlined)                                                                                                                                   ▒
+   94.05%     0.00%             0  nilgoyette_test  nilgoyette_test     [.] __rust_maybe_catch_panic                                                                                                                                        ▒
+   94.05%     0.00%             0  nilgoyette_test  nilgoyette_test     [.] std::panicking::try::do_call

The biggest piece of the time is ArrayBase::slice, followed by Zip::apply. Drilling down into ArrayBase::slice, the biggest piece is ndarray::dimension::do_slice, and most of that is outside of the to_abs_slice call. This is the source code for do_slice. I don't see anything especially expensive. Maybe the division/modulo? [Edit: I checked with perf annotate, and yes, the division is the most expensive part by far.]

The Array-based approach is doing more work and cannot be optimized as readily as the for loops approach primarily because everything in the Array-based approach is more general (n dimensions, possibly steps other than 1, arbitrary axis lengths instead of always pl = 3, etc.). If the slices were larger (more elements), I'd expect the performance of the two approaches to be more similar.

However, 142 ns is very little time anyway, so is this really an issue? (142 ns/iter - 25 ns/iter) * 10^8 iterations = 11.7 s, so the difference would only be noticeable in practice with hundreds of millions of iterations.

@nilgoyette
Collaborator Author

Thank you for testing, @jturner314. You're right, 142 ns is nothing! I made my bench as simple as possible to keep this issue simple, but there are more details.

Of course, all parameters can change. This function is called inside 3 nested for-loops (over m, n, o), and that in turn runs for every voxel in a 3D or 4D image whose dimensions are at least 100x100x100. All this makes my running time go from 1m30s to 6m00s on the same 3D dataset.

So you're telling me most of the time difference is in this block?

let d = m / step.abs() as usize;
let r = m % step.abs() as usize;

My goal wasn't to complain; I mostly wanted to know if the situation is normal and if there's an easy fix. I guess it's a 'yes' and a 'no' :)

@jturner314
Member

jturner314 commented Dec 13, 2018

Ah, so you really are calling it hundreds of millions of times or more. :)

My goal wasn't to complain, I mostly wanted to know if the situation is normal and if there's an easy fix.

I understand; I do think it's worth looking into things like this to see if there are any clear areas for improvement. Differences of nanoseconds can make a practical difference when the algorithm is O(n^3) or more, as you demonstrated. :)

So you're telling me most of the time difference is in this block?

Within the slicing stuff, yes, most of the time is spent there. I wonder if it would be worth adding a check for whether the step is 1 before doing the division? I would think a step of 1 is by far the most common case. [Edit: I've created a branch for this. See my comment below.]
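
The rough shape of that idea (just a sketch of a helper, not ndarray's actual do_slice code; sliced_len, m, and step are illustrative names):

fn sliced_len(m: usize, step: isize) -> usize {
    // `m` is the length of the range being sliced along one axis.
    let s = step.abs() as usize;
    if s == 1 {
        // By far the most common case: no division needed at all.
        m
    } else {
        // General case: number of elements a stride of `s` hits in `m` items.
        m / s + if m % s != 0 { 1 } else { 0 }
    }
}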

We might be able to improve the performance of apply_core_strided (called by Zip::apply) by unrolling the inner two loops instead of just the inner one, but I'm not sure if that would hurt performance in other situations.

[Edit: Another thing that might help is adding support for slicing with types like [Range<usize>; 3] instead of just SliceInfo. This should be possible with a CanSlice trait, as in #570.]

I think we should make an effort to improve the performance of ndarray as much as possible, but I don't think ndarray will be able to match the performance of nested for loops in this case. Since the function is being called so often, it may make sense to manually implement it with for loops like that.

@jturner314
Member

jturner314 commented Dec 13, 2018

@nilgoyette Will you please try your benchmark with the specialize-step-1 branch on my repo? I expect it will improve the benchmark by ~25%. To do this, add the following to your Cargo.toml and then re-run the benchmark:

[dependencies]
ndarray = "0.12"

[patch.crates-io]
ndarray = { git = "https://github.com/jturner314/ndarray.git", branch = "specialize-step-1" }

[Edit: This still isn't close to the for loops, but it should be a pretty substantial improvement.]

@jturner314
Member

I wonder if there are any possible algorithmic improvements rather than just improving this small section in isolation. Are you able to provide more of the surrounding code?

@bluss
Member

bluss commented Dec 13, 2018

Thanks for providing a clear benchmark — that's the groundwork for improvement! It might even be worth avoiding the division for more than just the step-1 case; nice idea, jturner. A static division by 2 is strength-reduced to a shift, etc., so basically any static divisor can be inserted conditionally.
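
Something like this, sketched (not actual ndarray code; div_step is an illustrative name):

fn div_step(m: usize, abs_step: usize) -> usize {
    // Constant divisors for the common small steps let the compiler
    // strength-reduce them (/1 is free, /2 becomes a shift, ...); only the
    // rare large steps pay for a real division.
    match abs_step {
        1 => m,
        2 => m / 2,
        3 => m / 3,
        4 => m / 4,
        _ => m / abs_step,
    }
}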

I guess that yes, making a small cutout of a 3D array has quite some overhead in the loop and the slicing. I'd like Zip to handle it, but maybe it needs a special-case function.

@bluss
Member

bluss commented Dec 13, 2018

Looks like is_standard_layout for 2d arrays can be improved.

@fmorency

@jturner314 @bluss Thanks a lot for looking into this. I am @nilgoyette's colleague.

We have a Rust implementation of [1] based on the implementation in [2], and we found this ndarray performance issue while refactoring the Rust code to be more idiomatic (using higher-order functions/iterators).

I tested your patch on a real use case (a 256x256x176 3D image) and we went from 4m56 to 4m40. I'll let @nilgoyette answer with his own benchmark, but it appears we didn't get the expected 25% speedup.

I ran perf on my real use case with your patch, and it seems to show that a lot of time is spent in do_slice, is_standard_layout, and apply.

@bluss
Member

bluss commented Dec 13, 2018

If you want the short answer: it's not surprising; this is not designed with such small slices in mind (the overhead of slicing becomes large when the array slice is very small — like 3x3x3). We have some overhead we need to combat with special cases for lower dimensions (2D and 3D arrays), and some mandatory overhead since we use bounds checking for correctness — and in this case, we are bounds checking in three dimensions.

The example at example/convo.rs from 2016 sort of bears witness to the fact that anything that looks like a convolution is known not to be efficient using slicing. (There is not much thought behind this example, it's not necessarily a good example of how to do it, it's just lying around.)

We do have an NdProducer called .windows() that could be used like this:

Zip::from(&mut sum)
  .and(X.windows((pl, pl, pl)))
  .and(Y.windows((pl, pl, pl)))
  .apply(|sum, w1, w2| { /* compare your windows, write the score into *sum */ });

Does .windows() visit the array the way you need? Even there, I suspect the economics of the situation mean there would still be very noticeable overhead in a similar way. It would be good to find a big-picture solution. Reducing the problem from 3 to 2 dimensions, for example, would be a benefit if it's possible.
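
Filled in for your case, it could look roughly like this (untested sketch; window_scores is a placeholder name, and out has to have the window-count shape, e.g. 9x9x9 for an 11x11x11 input):

use ndarray::prelude::*;
use ndarray::{s, Zip};

fn window_scores(data: &Array3<f64>, out: &mut Array3<f64>) {
    let (pl, br) = (3, 4);
    // The reference patch is built once, outside the per-window work.
    let reference = data.slice(s![br..br + pl, br..br + pl, br..br + pl]);
    Zip::from(out)
        .and(data.windows((pl, pl, pl)))
        .apply(|score, window| {
            let mut sum = 0.0;
            Zip::from(&reference)
                .and(&window)
                .apply(|&a, &b| sum += (a - b).powi(2));
            *score = sum;
        });
}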

@jturner314
Member

One possible higher-level improvement is to change the order of iteration. For the sake of illustration, consider a much simpler but related problem: calculating for each element in a 2D array the sum of squares of differences between that element and nearby elements (within some distance), i.e.

      m+r   n+r
yₘₙ = ∑     ∑     (xᵢⱼ − xₘₙ)²
      i=m−r j=n−r

The most obvious approach is to iterate over windows centered on the elements and calculate the sum of squares of differences for each window, like this:

const RADIUS: usize = 1;
const WINDOW_SIZE: usize = 2 * RADIUS + 1;

fn sum_sq_diff_windows(data: ArrayView2<f64>) -> Array2<f64> {
    let mut out = Array2::zeros((data.rows() - 2 * RADIUS, data.cols() - 2 * RADIUS));
    Zip::from(&mut out)
        .and(data.windows((WINDOW_SIZE, WINDOW_SIZE)))
        .apply(|out, window| {
            let center = window[(RADIUS, RADIUS)];
            *out = window.fold(0., |acc, x| acc + (x - center).powi(2));
        });
    out
}

This seems similar to the approach @nilgoyette / @fmorency are currently trying. Note that calling .slice() to create each window would have much worse performance than using .windows(), but even with .windows(), there's still some overhead. The outer loops (over the windows) are much longer than the inner loops (over the elements in the windows) since the data array is much larger than the window size.

Another approach is to swap the order of iteration so that the outer loops are short and the inner loops are long. We can do this by making the outer loops iterate over the offsets within the window shape and making the inner loops iterate over the window centers. Note the example below may be slightly confusing because I'm using .windows() as a clean way of iterating over views at the possible offsets; the shape of each of the offset_data views is (data.rows() - 2 * RADIUS, data.cols() - 2*RADIUS), not (WINDOW_SIZE, WINDOW_SIZE).

const RADIUS: usize = 1;
const WINDOW_SIZE: usize = 2 * RADIUS + 1;

fn sum_sq_diff_offset(data: ArrayView2<f64>) -> Array2<f64> {
    let mut acc = Array2::zeros((data.rows() - 2 * RADIUS, data.cols() - 2 * RADIUS));
    let radius = RADIUS as isize;
    let centers = data.slice(s![radius..(-radius), radius..(-radius)]);
    for offset_data in data.windows(centers.raw_dim()) {
        Zip::from(&mut acc)
            .and(centers)
            .and(offset_data)
            .apply(|acc, center, x| *acc += (x - center).powi(2));
    }
    acc
}
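
For completeness, here's a small way to exercise both (sizes are arbitrary; the two functions should produce the same values up to floating-point rounding):

use ndarray::prelude::*;

fn main() {
    let data = Array2::from_shape_fn((512, 512), |(i, j)| (i * 512 + j) as f64);
    let by_windows = sum_sq_diff_windows(data.view());
    let by_offsets = sum_sq_diff_offset(data.view());
    // Same sums, accumulated in a different order.
    assert!(by_windows
        .iter()
        .zip(by_offsets.iter())
        .all(|(a, b)| (a - b).abs() < 1e-6));
}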

There is a trade-off here. The first approach has better cache locality, while the second approach has lower overhead and is much friendlier to the branch predictor because the inner loops are a lot longer.

The performance of the second approach is quite a bit better for this problem. The results are shown below, where the horizontal axis is the axis length of the square data array and the vertical axis is the execution time. The first example is the red line, and the second example is the green line.

[Note: I ran this benchmark using a fork of ndarray with performance improvements to fold that haven't been merged into master yet, to show the "best case" scenario. If you run the benchmark with the current master of ndarray, the results for the first example are even worse.]

[plot omitted: lines2.svg — execution time vs. array axis length for the two approaches]

It's worth noting that while the second approach works better for this example problem, that's not necessarily true for @nilgoyette's / @fmorency's problem, because there is a trade-off as I mentioned. It also may be difficult to make this transformation for the real algorithm; I haven't looked at the paper in enough detail to know whether or not that's the case.

@nilgoyette
Collaborator Author

nilgoyette commented Dec 14, 2018

@jturner314 I tested your branch and I do see an interesting speedup on my 9x9x9 (m, n, o) benches.

test bench_before             ... bench:     102,600 ns/iter (+/- 581)
test bench_after              ... bench:      81,254 ns/iter (+/- 1,534)

Was there a drawback to your branch?

@bluss Yes, the windows approach was my faster clean version. With this trick, I'm 2x slower instead of 3x slower.

However, I just found out that s1 can be moved out of the loops, to where my 11x11x11 data is created. This was hard to see with my horrible first version! This makes my current benches less useful. I'll try to optimize my code knowing this new fact.
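
Roughly, the hoist looks like this (loop bounds are just illustrative, with pl and br as before):

let s1 = data.slice(s![br..br + pl, br..br + pl, br..br + pl]); // built once
for m in 0..9 {
    for n in 0..9 {
        for o in 0..9 {
            let s2 = data.slice(s![m..m + pl, n..n + pl, o..o + pl]);
            let mut sum = 0.0;
            azip!(s1, s2 in { sum += (s1 - s2).powi(2) });
            // ... use `sum` as the score for this (m, n, o)
        }
    }
}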

I don't think this makes the current issue useless, though. You seem to have some good optimization ideas. I'll let you close it when you're ready.

@jturner314
Member

Was there a drawback to your branch?

It might make slicing marginally slower for the step.abs() != 1 case, but it shouldn't be noticeable. I've created #575 with this change.

@polarathene

Out of curiosity, as I found this thread interesting/informative: the performance issue was from accessing a small 3D/voxel volume within a large dataset, right? And that was due to the multiple dimensions? (Internal logic issues identified aside.)

It was suggested that reducing dimensionality would have helped? Would something like a Hilbert curve have helped here with data layout and cache locality? It's meant to be good for that AFAIK and should go well with a voxel grid. A simpler-to-implement alternative would be a Z-order curve / Morton code.
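
(For reference, a 3D Morton index is just bit interleaving — a sketch, assuming coordinates that fit in 10 bits each:)

fn morton3(x: u32, y: u32, z: u32) -> u32 {
    // Spread the low 10 bits of v so there are two zero bits between each.
    fn spread(v: u32) -> u32 {
        let mut v = v & 0x3ff;
        v = (v | (v << 16)) & 0x0300_00ff;
        v = (v | (v << 8)) & 0x0300_f00f;
        v = (v | (v << 4)) & 0x030c_30c3;
        v = (v | (v << 2)) & 0x0924_9249;
        v
    }
    // Interleave so that voxels close in 3D tend to be close in the 1D order.
    spread(x) | (spread(y) << 1) | (spread(z) << 2)
}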

I'm not sure how practical that is with ndarray (I don't have any experience with this crate yet). I see Z-order curve support was requested 2 years ago; I take it that response still applies?

So while there may be a more suitable data structure for working with this sort of data/logic, ndarray was a pragmatic choice to go with?
