
Broadcasting limitations / Custom strides? #437

Closed
davenza opened this issue Apr 11, 2018 · 5 comments

Comments

@davenza

davenza commented Apr 11, 2018

Hi, I have a problem that I don't know how to solve with the current broadcast implementation.

I want to multiply two matrices where some axis in both matrices could have length 1. For example:

shape [1,2,2] * shape [2,1,2]

The expected result would have shape [2, 2, 2]. #265 illustrates this behaviour better than I can. This is totally natural in NumPy. Of course, the current implementation of ndarray does not allow this type of operation, and @bluss commented that he was not keen to implement it.


Custom strides?

I have been thinking about how to solve this problem, and the only (efficient) solution I could come up with is using custom strides, where the stride is 0 in each axis with length 1. So, our original matrices:

shape [1,2,2], strides [4,2,1] * shape [2,1,2], strides[4,2,1]

would be converted to:

shape [2,2,2], strides [0,2,1] * shape [2,1,2], strides[4,2,1]

Note that only the left matrix needs to be customized. However, the current ArrayBase implementation does not allow the strides to be changed. The only way to customize strides is from_shape_vec. However, as the documentation states:

"Errors if strides allow multiple indices to point to the same element" — so this type of change is not allowed. How would you solve this problem?
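The zero-stride idea described above can be sketched in plain Rust (no ndarray needed; `broadcast_strides` is a hypothetical helper written for illustration, not part of the crate):

```rust
/// Given an array's shape and (row-major) strides, compute the strides of a
/// view broadcast to the `target` shape: axes of length 1 get stride 0.
/// Returns None if the shapes are incompatible.
fn broadcast_strides(shape: &[usize], strides: &[usize], target: &[usize]) -> Option<Vec<usize>> {
    if shape.len() != target.len() || strides.len() != shape.len() {
        return None;
    }
    shape
        .iter()
        .zip(strides.iter())
        .zip(target.iter())
        .map(|((&len, &st), &t)| {
            if len == t {
                Some(st) // axis already matches: keep the stride
            } else if len == 1 {
                Some(0) // length-1 axis is repeated: stride 0
            } else {
                None // incompatible axis lengths
            }
        })
        .collect()
}

fn main() {
    // shape [1,2,2], strides [4,2,1] broadcast to [2,2,2] -> strides [0,2,1]
    assert_eq!(
        broadcast_strides(&[1, 2, 2], &[4, 2, 1], &[2, 2, 2]),
        Some(vec![0, 2, 1])
    );
    // shape [2,1,2], strides [4,2,1] broadcast to [2,2,2] -> strides [4,0,1]
    assert_eq!(
        broadcast_strides(&[2, 1, 2], &[4, 2, 1], &[2, 2, 2]),
        Some(vec![4, 0, 1])
    );
}
```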

@jturner314
Member

Currently, ndarray supports automatic broadcasting of only the array on the right-hand side of a binary operator. However, it is possible to manually broadcast the left-hand array by using the .broadcast() method, which does what you're describing with "Custom strides?". Here's an example:

#[macro_use]
extern crate ndarray;

fn main() {
    let a = array![[[1, 2], [3, 4]]];
    let b = array![[[5, 6]], [[7, 8]]];
    let c = &a.broadcast((2, 2, 2)).unwrap() * &b;
    assert_eq!(a.shape(), [1, 2, 2]);
    assert_eq!(b.shape(), [2, 1, 2]);
    assert_eq!(c.shape(), [2, 2, 2]);
    println!("{}\n*\n{}\n=\n{}", a, b, c);
}

which prints

[[[1, 2],
  [3, 4]]]
*
[[[5, 6]],
 [[7, 8]]]
=
[[[5, 12],
  [15, 24]],
 [[7, 16],
  [21, 32]]]

which is the same result NumPy gives.

Basically, .broadcast() creates a view of an array with strides of 0 where necessary. It returns an Option because it may not be possible to broadcast an array to the specified shape.
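As a rough illustration of what such a view does (plain Rust, no ndarray; `get` is a hypothetical helper): indexing through a stride of 0 makes every logical position along that axis read the same underlying element, so the data appears repeated without being copied:

```rust
/// Read logical element (i, j, k) of a strided view over flat row-major data.
fn get(data: &[i32], strides: &[usize; 3], i: usize, j: usize, k: usize) -> i32 {
    data[i * strides[0] + j * strides[1] + k * strides[2]]
}

fn main() {
    // data for a [1, 2, 2] array in row-major order
    let data = [1, 2, 3, 4];
    // broadcast view with shape [2, 2, 2]: strides [0, 2, 1], so axis 0 repeats
    let strides = [0, 2, 1];
    // both "copies" along axis 0 alias the same underlying element data[2]
    assert_eq!(get(&data, &strides, 0, 1, 0), 3);
    assert_eq!(get(&data, &strides, 1, 1, 0), 3);
}
```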

Note that Array::from_shape_vec() does not allow overlapping elements in an owned array because that would make it possible to violate Rust's ownership rules. There's a similar method specifically for views (ArrayView::from_shape()), but it turns out that it also disallows overlaps even though that isn't strictly necessary. (I created #438 for this.) So, currently the only way to create overlapping elements in safe code is the .broadcast() method.

Having to manually call .broadcast() is somewhat annoying. I would like to add support for co-broadcasting, and I've worked some on the implementation, but I've held off on working on it anymore until Rust gains support for const generics. The tracking issue for co-broadcasting in ndarray is #11.

@davenza
Author

davenza commented Apr 13, 2018

Thank you @jturner314! That solves my problems with broadcasting.

Now, this could be another issue, but I have compared the performance of ndarray and Numpy and the results are very surprising to me:

i7 6700K, 32 GB @ 1867 MHz

n     Python       Rust
2     0.85 us      0.87 us
4     1.11 us      4.12 us
6     1.83 us      21.22 us
8     6.55 us      176.76 us
10    35.36 us     1.41 ms
12    230.48 us    12.05 ms

The benchmark code is here, if you wish to check it. In this benchmark I multiplied two arrays with 2^{n} elements. I added n/2 axes with length 1 to each array. In one array, the axes were added at the start; in the other, at the end. The expected result will have 2^{n + n/2} elements.

Of course, the problem is exponential with n. Python's solution is way, way more performant. I know that Numpy is really mature and it is a great piece of work (with huge amounts of C code), but the difference seems like black magic coming from the Numpy devs.

Maybe I am doing something wrong? I added BLAS support in the Rust project, but it does not make any difference in the performance results (neither better nor worse). That is weird to me, because BLAS should be pretty well optimized, I would guess.

Thank you again for your help and support!

@jturner314
Member

Well, NumPy is mature and highly optimized, while ndarray is fairly new and needs more profiling and optimization work done. I worked on your benchmarks and was able to achieve substantial improvements for the ndarray benchmarks at the cost of some ergonomics, but the times are still >20x larger than the NumPy times for large-dimension arrays. More investigation is necessary to determine the exact cause. In principle, it should be possible for us to make ndarray match or exceed NumPy's performance in almost all cases (once Rust gets stable SIMD support), but doing so requires profiling and optimization work. In the current state, I would expect some things to be faster with NumPy and others to be faster with ndarray.

Some of the changes I made to your benchmarks were to make the NumPy and ndarray benchmarks more equivalent, which helped somewhat for small n. Other than that, I achieved the biggest improvement (30–50%) in the ndarray times by changing the multiplication to use Zip. When evaluating &a * &b, ndarray clones all the data in a into a new owned array and then performs the multiplication, storing the result in the new owned array. Cloning the data is quite expensive since it requires a clone of 8*2^(n+n/2) bytes of data, which is 2.1MB for the n=12 case. Until Rust has specialization, I don't see a good way to avoid the clones in the implementation of the multiplication operator. However, you can avoid the clones manually by using Zip with an uninitialized (or zero-filled) output array. This still allocates the required array for the result but avoids copying all that data.
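The clone-versus-write-once difference can be sketched without ndarray (the helper names here are hypothetical, written only for illustration): clone-then-mutate copies all of `a` before multiplying, while a write-once approach, which is what Zip with a fresh output buffer achieves, allocates only the result:

```rust
/// Clone-then-mutate, analogous to what `&a * &b` currently does in ndarray:
/// copy all of `a` into a new owned buffer, then multiply in place.
fn mul_clone(a: &[f64], b: &[f64]) -> Vec<f64> {
    let mut c = a.to_vec(); // this clone is the avoidable cost
    for (x, y) in c.iter_mut().zip(b) {
        *x *= *y;
    }
    c
}

/// Write-once: allocate the output and fill each element directly,
/// never copying `a`.
fn mul_write_once(a: &[f64], b: &[f64]) -> Vec<f64> {
    a.iter().zip(b).map(|(x, y)| x * y).collect()
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    // both approaches produce the same result; only the allocations differ
    assert_eq!(mul_clone(&a, &b), mul_write_once(&a, &b));
}
```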

For what it's worth, the benchmark you're using is probably a worst-case for the way arithmetic operators are currently implemented in ndarray, since the arrays have so many dimensions (18 dimensions for the n=12 case). When comparing NumPy and ndarray, it's also worth noting that ndarray is able to do things that NumPy can't because of Rust's borrow checker. For example, multiplying two owned 512 x 512 f64 arrays in ndarray with a * &b is >30% faster than the NumPy equivalent on my machine, partially because ndarray can reuse the left array since it takes ownership, while NumPy has to allocate a new one.

A couple of notes:

@davenza
Author

davenza commented Apr 14, 2018

Thank you again @jturner314. Your suggestions are always good.

I didn't prepare the benchmark to compare ndarray and Numpy in general. The benchmark was created because I need that functionality on one of my projects. I understand that the library and the language itself should be more mature to compete against Numpy.

So thanks again, and keep up the good work!

@jturner314
Member

@davenza You're welcome. I hope ndarray works well for you!

It seems like this issue is resolved, so I'm closing it. If there's anything else, please feel free to reopen this issue or open a new one.
