`fold` creates `identity` for each sublist #1124
Comments
There are two main factors that I see here:
That is exactly what I want here. In the library I linked above, this is done with a simple atomic counter that worker threads use to pull the next piece of work to process. In that case inputs are read-only, so no mutexes or other heavier infrastructure is necessary, and work is essentially split into sublists of size 1, such that threads are kept occupied in the most efficient way. I will check with #857 to see how it performs.
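The atomic-counter scheme described above can be sketched with std threads alone. This is a minimal illustration, not blst's actual code (which is at the link in the issue body); the function name and the use of addition as the per-item work are assumptions for the example:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Each worker repeatedly pulls the next index from a shared counter, so
// work is handed out with single-item granularity and no mutexes, and
// each thread builds exactly one local accumulator.
fn parallel_sum(items: &[u64], n_threads: usize) -> u64 {
    let counter = AtomicUsize::new(0);
    thread::scope(|s| {
        let handles: Vec<_> = (0..n_threads)
            .map(|_| {
                let counter = &counter;
                s.spawn(move || {
                    let mut local = 0u64; // one accumulator per thread, not per sublist
                    loop {
                        let i = counter.fetch_add(1, Ordering::Relaxed);
                        if i >= items.len() {
                            break;
                        }
                        local += items[i]; // stand-in for the expensive aggregation
                    }
                    local
                })
            })
            .collect();
        // Reduce one partial result per thread.
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let items: Vec<u64> = (1..=100).collect();
    println!("{}", parallel_sum(&items, 4)); // prints 5050
}
```

Note that this scheme only yields `n_threads` partial results to reduce, which is the property the comment is asking for.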
Was not aware of it, will give it a try as well, thanks!
Hm, is that documented somewhere? To me, the fact that things can be processed in the form of sublists of undefined size means that both should be the case, and the user can't really have any guarantees about the order in which the input is processed. Especially if better performance can be unlocked.
Interesting, just did testing of the above example code I linked
I'd imagine a simple atomic counter like blst does for an efficient split of work with single-item granularity, while also creating as many results to be folded as there are threads in the thread pool. #857 is based on a very old version of rayon; can you rebase it maybe? (It is a very small change, so it should be fine to do so.)
So, I did a log of the algorithm (took the code from the benches) and there is virtually no overhead I can see for splitting and merging. However, I then ran the code on a larger machine with more cores. Still no visible overheads for the computations, but there are some overheads inside rayon. My guess is that due to the very low number of tasks available (compared to threads), the atomics which are used for the sleep algorithm get under pressure.
I rewrote the code. I don't think atomics or mutexes make any difference here; the bottleneck in my case is math that is slow when more elements need to be aggregated unnecessarily.
I'm still not too sure about that. Here are the logs.
The visualization looks cool, but I'm not sure how to read it, to be completely honest. My point was that doing 128 folds is unnecessary. On my machine, there is no point in doing 128 folds if 32 (one per logical CPU core) is sufficient. Whether it is slower due to memory pressure or something else is kind of irrelevant if it is possible to not do 128 folds in the first place. And it seems to me that if order doesn't matter at all (which is the case very often), the behavior (and performance) should be similar to
There are a few places in the docs that talk about associativity and parallel non-determinism, but we do generally keep the relative order of items. e.g. a list of 4 items could be fold/reduced like any of these ordered partitions:
We don't promise anything about the order in which any inner part is called on their own -- e.g. the individual items could be produced in any order, and the reduction (a, b) could happen before or after (c, d) in the second one -- but we do keep them in order on the way out.
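For concreteness, here is a minimal sketch of what "ordered partitions" means for four items, using plain addition as a stand-in for the user's associative reduction (the groupings shown are illustrative, not rayon's actual split decisions):

```rust
// Four items reduced with an associative operation: the parallel runtime may
// group them into different ordered partitions, but the left-to-right order
// of the items is preserved, so every grouping yields the same result.
fn main() {
    let (a, b, c, d) = (1i32, 2, 3, 4);
    let op = |x: i32, y: i32| x + y; // any associative operation

    let flat = op(op(op(a, b), c), d);     // ((a∘b)∘c)∘d
    let balanced = op(op(a, b), op(c, d)); // (a∘b)∘(c∘d)
    let right = op(a, op(b, op(c, d)));    // a∘(b∘(c∘d))

    assert_eq!(flat, balanced);
    assert_eq!(balanced, right);
    println!("{}", flat); // prints 10
}
```

What is *not* fixed is the wall-clock order in which the inner calls run: `op(a, b)` may execute before or after `op(c, d)`, but each always receives its operands in the original relative order.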
The devil is in trying to preserve that semblance of order on the way out. Even with ordered
If you set min 1, I don't expect any behavioral change at all! The idea with that is to set a level that amortizes the cost of creating your accumulator, but as you say, that's a balancing act.
Sure, I pushed a rebase. But when we're only talking about 128 items for 32 threads, I fear it will still overestimate even the initial level splits, regardless of this change about re-splitting stolen work.
IMO, it's totally fine to reach for different primitives when the generic ("idiomatic") APIs don't meet your needs! I'm not sure why
The generic API doesn't know whether the order of reduction matters in your code, but we could explore more explicit APIs for you to communicate that, like an
I'm sorry, I did try various (higher) values, posted 1 by accident.
I think there might be an optimization opportunity here due to the usage of zero-sized types. I am returning
The approach with
It propagates panics through the same mechanism as the return value, |
I just noticed that `fold` calls `identity` many more times than the number of threads in the thread pool. In case `identity` is a non-trivial operation, it might be quite expensive to call it unnecessarily. Would be great if its returned value was repurposed for folding of multiple sublists.
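The behavior being described can be modeled without rayon at all. The sketch below is a sequential simulation under assumed numbers (128 items split into sublists of 4); the chunking is illustrative and is not rayon's actual splitting logic:

```rust
use std::cell::Cell;

// Sequential model of a fold over sublists: identity() is invoked once per
// sublist, so the number of accumulators grows with the number of splits,
// not with the number of threads.
fn main() {
    let identity_calls = Cell::new(0u32);
    let identity = || {
        identity_calls.set(identity_calls.get() + 1);
        0u64 // stand-in for an expensive-to-build accumulator
    };

    let items: Vec<u64> = (1..=128).collect();
    let sublist_size = 4; // pretend the input was split into 32 sublists

    // fold: one fresh accumulator per sublist...
    let partials: Vec<u64> = items
        .chunks(sublist_size)
        .map(|chunk| chunk.iter().fold(identity(), |acc, x| acc + x))
        .collect();
    // ...then reduce the partial results.
    let total: u64 = partials.into_iter().sum();

    assert_eq!(total, 128 * 129 / 2);
    println!("{} identity calls", identity_calls.get()); // prints "32 identity calls"
}
```

If `identity` were as cheap as returning `0` this would not matter, but when it allocates or does cryptographic setup, 32 (or 128) calls instead of one per thread is exactly the overhead the issue is about.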
This discovery resulted from rewriting https://github.com/supranational/blst/blob/0d46eefa45fc1e57aceb42bba0e84eab3a7a9725/bindings/rust/src/lib.rs#L1110-L1147 with rayon (code there essentially does work stealing manually) and seeing significantly slower benchmark results.
Here is what I tried to do: https://github.com/nazar-pc/blst/blob/7d1074e7df3f31dc8b8acfde3260ac9ceac8f0d9/bindings/rust/src/lib.rs#L1092-L1128
The reason why I believe it is slower is that more `identity` elements are created, and due to the cryptography involved, reducing more elements takes more time.

Probably a duplicate of #840, which the original author decided to close, but I believe there is value in the ability to do something like this in rayon (or maybe there already is and I just don't see it).