Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zip: Handle preferred memory layout of inhomogenous inputs better #809

Merged
merged 7 commits into from Apr 23, 2020

Conversation

bluss
Copy link
Member

@bluss bluss commented Apr 22, 2020

For example, when we use Zip::from(a).and(b); the Zip will examine the inputs and try to determine if they are all contiguous (and in the same way); it can now also determine what tendency the inputs have, to further guide which axis should be used for the innermost loop, even if not all the inputs are contiguous.

This helps for example with indexed Zip on f-order producers. The index producer has no bias in either direction, so all the other inputs will determine the layout preference.

The improved layout preference also affects parallelism, because in some cases we can better choose which axis to split along to preserve locality better.

The Layout type was improved to make this possible. It now has flags for C/F-contig and for C/F-preference. The new layout bits are visible in the array debug output.

Fixes #749

bluss and others added 2 commits April 22, 2020 22:37
They were previously pub, just because we didn't have the pub(crate)
feature yet.
@bluss
Copy link
Member Author

bluss commented Apr 22, 2020

Benchmark after this PR (somewhat noisy results, so I don't have a comparable before-after).

The ones that improved in this PR, were all examples of "ff" that improved to match "cc":

  • slice_split_zip_ff
  • slice_zip_ff
  • zip_indexed_ff

The rest should have been unchanged.

Benchmarks contributed by @nilgoyette and some extra cases added by me.

test slice_split_zip_cc ... bench:   1,422,302 ns/iter (+/- 389,440)
test slice_split_zip_ff ... bench:   1,438,825 ns/iter (+/- 696,813)
test slice_zip_cc       ... bench:   1,464,805 ns/iter (+/- 525,807)
test slice_zip_ff       ... bench:   1,469,397 ns/iter (+/- 318,191)
test zip_cc             ... bench:     760,087 ns/iter (+/- 140,810)
test zip_ff             ... bench:     758,991 ns/iter (+/- 236,212)
test zip_indexed_cc     ... bench:   4,733,670 ns/iter (+/- 959,597)
test zip_indexed_ff     ... bench:   4,151,390 ns/iter (+/- 923,228)
test zip_mut_with_cc    ... bench:     725,328 ns/iter (+/- 120,358)
test zip_mut_with_ff    ... bench:     718,780 ns/iter (+/- 170,944)

The reason the index benchmarks are much slower than reported in #749, is that here the index values are black_box'ed, to make sure they are not optimized out. Both using and not using black box can make for misleading benchmarks, the truth can be somewhere in between (because it can inhibit some loop optimizations), but it tells us more. Ndarray docs state "Indexed zip has overhead."

@nilgoyette
Copy link
Collaborator

A big thank you for this PR! I tried doing it several months ago and abandoned, so I'm quite happy to have nerd sniped you ;)

I ran the benchmarks and, as you wrote, they are noisy so it's kind of hard to spot the tiny changes. I ran them 5-6 times each and picked the best results. Having said this, it looks like the 'cc' cases are a little slower than they were, almost nothing, and probably in the noise range, but the 'ff' cases are now equal and this is a great news for us because we are stuck with loading and writing images in fortran order.

test zip_mut_with_cc     ... bench:         326 ns/iter (+/- 13)
test zip_mut_with_ff     ... bench:         341 ns/iter (+/- 29
test slice_split_zip_cc  ... bench:     416,585 ns/iter (+/- 22,373)
test slice_split_zip_ff  ... bench:     394,320 ns/iter (+/- 16,938)
test slice_zip_cc        ... bench:     404,885 ns/iter (+/- 19,939)
test slice_zip_ff        ... bench:     398,475 ns/iter (+/- 10,017)
test zip_cc              ... bench:     283,030 ns/iter (+/- 16,391)
test zip_ff              ... bench:     285,967 ns/iter (+/- 21,100)
test zip_indexed_cc      ... bench:   1,318,215 ns/iter (+/- 12,108)
test zip_indexed_ff      ... bench:   1,122,380 ns/iter (+/- 10,241)
test zip_mut_with_cc     ... bench:     276,915 ns/iter (+/- 15,970)
test zip_mut_with_ff     ... bench:     280,205 ns/iter (+/- 31,782)

Just a note, there are now 2 benchmarks named zip_mut_with_cc and zip_mut_with_ff, in iter.rs and zip.rs.

@bluss
Copy link
Member Author

bluss commented Apr 23, 2020

Thanks for running benchmarks! Reducing the overhead of Zip would be interesting.

I'll try to deduplicate the benchmarks and maybe remove some of the mixed benchmarks here, I don't want to run them anyway, even if they are useful for comparison and perspective.

Since you ran all benchmarks I'll share my tip of only compiling and running what you need, which we use to cope with building rust: cargo bench --bench zip builds just the zip.rs benches. Also works fine for picking out a particular test file, for quicker dev cycle on that.

My develop machine has changed, and the new one is very flaky at benchmarks. I can maybe see the point of criterion now; I used to have a setup that made for stable and reproducible benchmarks before.

Using split tests performance of the Zip in parallelization, so that we
can see if there are benefits to splitting arrays better.
Using the index shows more directly the overhead of indexed zip
Support both unroll over c- and f-layout preferred axis in Zip inner loop
(the fallback when inputs are not all contiguous and same layout).

Keep a tendency score when building the Zip, so that we know if the
inputs are tending to be c- or f- layout.

This improves performance on the just added zip_indexed_ff benchmark, so
that it seems to match its (already fast) cc counterpart.
@bluss
Copy link
Member Author

bluss commented Apr 23, 2020

Clippy is inexplicably emitting a lint that we are allowing, reproduces locally. It doesn't touch the code in the pr, so it's a separate issue.

@bluss bluss merged commit 78bd1d8 into master Apr 23, 2020
@bluss bluss deleted the zip-strided-for-c-and-f branch April 23, 2020 20:02
@bluss bluss mentioned this pull request Oct 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Iteration performance for 'f' same-order arrays
2 participants