Zip: Handle preferred memory layout of inhomogenous inputs better #809

bluss · 2020-04-22T20:55:48Z

For example, when we use Zip::from(a).and(b); the Zip will examine the inputs and try to determine if they are all contiguous (and in the same way); it can now also determine what tendency the inputs have, to further guide which axis should be used for the innermost loop, even if not all the inputs are contiguous.

This helps for example with indexed Zip on f-order producers. The index producer has no bias in either direction, so all the other inputs will determine the layout preference.

The improved layout preference also affects parallelism, because in some cases we can better choose which axis to split along to preserve locality better.

The Layout type was improved to make this possible. It now has flags for C/F-contig and for C/F-preference. The new layout bits are visible in the array debug output.

Fixes #749

They were previously pub, just because we didn't have the pub(crate) feature yet.

bluss · 2020-04-22T20:58:01Z

Benchmark after this PR (somewhat noisy results, so I don't have a comparable before-after).

The ones that improved in this PR, were all examples of "ff" that improved to match "cc":

slice_split_zip_ff
slice_zip_ff
zip_indexed_ff

The rest should have been unchanged.

Benchmarks contributed by @nilgoyette and some extra cases added by me.

test slice_split_zip_cc ... bench:   1,422,302 ns/iter (+/- 389,440)
test slice_split_zip_ff ... bench:   1,438,825 ns/iter (+/- 696,813)
test slice_zip_cc       ... bench:   1,464,805 ns/iter (+/- 525,807)
test slice_zip_ff       ... bench:   1,469,397 ns/iter (+/- 318,191)
test zip_cc             ... bench:     760,087 ns/iter (+/- 140,810)
test zip_ff             ... bench:     758,991 ns/iter (+/- 236,212)
test zip_indexed_cc     ... bench:   4,733,670 ns/iter (+/- 959,597)
test zip_indexed_ff     ... bench:   4,151,390 ns/iter (+/- 923,228)
test zip_mut_with_cc    ... bench:     725,328 ns/iter (+/- 120,358)
test zip_mut_with_ff    ... bench:     718,780 ns/iter (+/- 170,944)

The reason the index benchmarks are much slower than reported in #749, is that here the index values are black_box'ed, to make sure they are not optimized out. Both using and not using black box can make for misleading benchmarks, the truth can be somewhere in between (because it can inhibit some loop optimizations), but it tells us more. Ndarray docs state "Indexed zip has overhead."

nilgoyette · 2020-04-23T14:01:36Z

A big thank you for this PR! I tried doing it several months ago and abandoned, so I'm quite happy to have nerd sniped you ;)

I ran the benchmarks and, as you wrote, they are noisy so it's kind of hard to spot the tiny changes. I ran them 5-6 times each and picked the best results. Having said this, it looks like the 'cc' cases are a little slower than they were, almost nothing, and probably in the noise range, but the 'ff' cases are now equal and this is a great news for us because we are stuck with loading and writing images in fortran order.

test zip_mut_with_cc     ... bench:         326 ns/iter (+/- 13)
test zip_mut_with_ff     ... bench:         341 ns/iter (+/- 29
test slice_split_zip_cc  ... bench:     416,585 ns/iter (+/- 22,373)
test slice_split_zip_ff  ... bench:     394,320 ns/iter (+/- 16,938)
test slice_zip_cc        ... bench:     404,885 ns/iter (+/- 19,939)
test slice_zip_ff        ... bench:     398,475 ns/iter (+/- 10,017)
test zip_cc              ... bench:     283,030 ns/iter (+/- 16,391)
test zip_ff              ... bench:     285,967 ns/iter (+/- 21,100)
test zip_indexed_cc      ... bench:   1,318,215 ns/iter (+/- 12,108)
test zip_indexed_ff      ... bench:   1,122,380 ns/iter (+/- 10,241)
test zip_mut_with_cc     ... bench:     276,915 ns/iter (+/- 15,970)
test zip_mut_with_ff     ... bench:     280,205 ns/iter (+/- 31,782)

Just a note, there are now 2 benchmarks named zip_mut_with_cc and zip_mut_with_ff, in iter.rs and zip.rs.

bluss · 2020-04-23T14:37:12Z

Thanks for running benchmarks! Reducing the overhead of Zip would be interesting.

I'll try to deduplicate the benchmarks and maybe remove some of the mixed benchmarks here, I don't want to run them anyway, even if they are useful for comparison and perspective.

Since you ran all benchmarks I'll share my tip of only compiling and running what you need, which we use to cope with building rust: cargo bench --bench zip builds just the zip.rs benches. Also works fine for picking out a particular test file, for quicker dev cycle on that.

My develop machine has changed, and the new one is very flaky at benchmarks. I can maybe see the point of criterion now; I used to have a setup that made for stable and reproducible benchmarks before.

Using split tests performance of the Zip in parallelization, so that we can see if there are benefits to splitting arrays better.

Using the index shows more directly the overhead of indexed zip

Support both unroll over c- and f-layout preferred axis in Zip inner loop (the fallback when inputs are not all contiguous and same layout). Keep a tendency score when building the Zip, so that we know if the inputs are tending to be c- or f- layout. This improves performance on the just added zip_indexed_ff benchmark, so that it seems to match its (already fast) cc counterpart.

bluss · 2020-04-23T20:01:06Z

Clippy is inexplicably emitting a lint that we are allowing, reproduces locally. It doesn't touch the code in the pr, so it's a separate issue.

bluss and others added 2 commits April 22, 2020 22:37

FIX: Layout, make all the constructors pub(crate)

f6239c0

They were previously pub, just because we didn't have the pub(crate) feature yet.

TEST: Zip vs Zip::indexed etc benchmarks

2a44fb6

bluss added 5 commits April 23, 2020 21:08

TEST: Test non-contig arrays in Zip benchmarks, also with splits

55998bb

Using split tests performance of the Zip in parallelization, so that we can see if there are benefits to splitting arrays better.

TEST: In Zip benchmarks, use the index

90ef196

Using the index shows more directly the overhead of indexed zip

TEST: Add tests for Layout

a3d53d2

TEST: update format tests for layout changes

47b3654

bluss force-pushed the zip-strided-for-c-and-f branch from cd21da6 to 47b3654 Compare April 23, 2020 19:10

bluss merged commit 78bd1d8 into master Apr 23, 2020

bluss deleted the zip-strided-for-c-and-f branch April 23, 2020 20:02

bluss mentioned this pull request Oct 9, 2020

Status of project? #843

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Zip: Handle preferred memory layout of inhomogenous inputs better #809

Zip: Handle preferred memory layout of inhomogenous inputs better #809

bluss commented Apr 22, 2020

bluss commented Apr 22, 2020 •

edited

nilgoyette commented Apr 23, 2020

bluss commented Apr 23, 2020

bluss commented Apr 23, 2020

Zip: Handle preferred memory layout of inhomogenous inputs better #809

Zip: Handle preferred memory layout of inhomogenous inputs better #809

Conversation

bluss commented Apr 22, 2020

bluss commented Apr 22, 2020 • edited

nilgoyette commented Apr 23, 2020

bluss commented Apr 23, 2020

bluss commented Apr 23, 2020

bluss commented Apr 22, 2020 •

edited