Scheduling fairness between spawn and par_iter #1054

Open
remifontan opened this issue Jun 12, 2023 · 21 comments

@remifontan

Hi,

I have noticed a surprising behaviour. My program spawns multiple tasks, and each of them calls into some par_iter functions.

I would expect a spawned task to complete as soon as its par_iter has completed. However, all tasks seem to get stuck until all par_iter loops have completed.

It is a little tricky to explain, but fortunately I could reproduce this behaviour on the Rust playground.

Permalink to the playground

10 tasks are spawned. Each of them increments a counter in parallel, and finally switches its state to "done" once finished.

enum TaskState {
    Progress(u32),
    Done,
}

The main thread polls those tasks and prints their progress until all tasks are done.
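
For reference, the program has roughly this shape (a sketch reconstructed from the description above, not the exact playground code; the iteration count and sleeps are placeholders):

use std::sync::{Arc, Mutex};
use std::thread::sleep;
use std::time::Duration;

use rayon::prelude::*;

#[derive(Clone, Copy, PartialEq)]
enum TaskState {
    Progress(u32),
    Done,
}

fn main() {
    let states: Vec<Arc<Mutex<TaskState>>> = (0..10)
        .map(|_| Arc::new(Mutex::new(TaskState::Progress(0))))
        .collect();

    for (i, state) in states.iter().cloned().enumerate() {
        println!("submit task {i}");
        rayon::spawn(move || {
            // Inner parallel loop: 100 parallel increments of this task's counter.
            (0..100u32).into_par_iter().for_each(|_| {
                sleep(Duration::from_millis(1)); // stand-in for real work
                if let TaskState::Progress(n) = &mut *state.lock().unwrap() {
                    *n += 1;
                }
            });
            // Expectation: this runs as soon as *this* task's par_iter is done.
            *state.lock().unwrap() = TaskState::Done;
        });
    }

    println!("poll tasks");
    loop {
        let snapshot: Vec<TaskState> = states.iter().map(|s| *s.lock().unwrap()).collect();
        for s in &snapshot {
            match s {
                TaskState::Progress(n) => print!("{n:4} "),
                TaskState::Done => print!("Done "),
            }
        }
        println!();
        if snapshot.iter().all(|s| *s == TaskState::Done) {
            break;
        }
        sleep(Duration::from_millis(50));
    }
}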

I would expect each task to switch to "done" as soon as its inner iteration has completed. However, depending on the number of threads, the number of par_iter iterations, and possibly the architecture, some tasks get stuck until all of them are done.

On my MacBook M1 Max (Ventura 13.4), all tasks are stuck until they are all done.

submit task 0
submit task 1
submit task 2
submit task 3
submit task 4
submit task 5
submit task 6
submit task 7
submit task 8
submit task 9
poll tasks
   6    0    0    0    0    0    0    0    0    0 
  16    0    0    0    0    0    0    0    0    0 
  28    0    0    0    0    0    0    0    0    0 
...
 100  100  100  100  100  100  100  100   66   99 
 100  100  100  100  100  100  100  100   79  100 
 100  100  100  100  100  100  100  100   84  100 
 100  100  100  100  100  100  100  100   99  100 
 100  100  100  100  100  100  100  100  100  100 
Done Done Done Done Done Done Done Done Done Done

On the Rust playground, the behaviour is similar.

I noticed this behaviour in my GUI.
Each of those tasks does some heavy calculations for a widget. I was hoping each widget would be able to update its UI as soon as its computations are done, but unfortunately they all get stuck until all computations are done.

I tried to use spawn_fifo but that did not make a difference in my case. Are there other options to change this behaviour?

Regards.

@jean-pierreBoth

Hello RemiFontan and Rayon developers,

I was about to open a very similar issue.
My situation:

  • Many spawned threads from a pool doing work.

  • One thread (io) feeding work to a dispatcher, which dispatches it to workers, and using a par_iter.

  • Workers are spawned by the dispatcher when work arrives; each worker sends its result to a collector when its work is done.

  • This last collector thread does a par_iter insertion into a database of mine when it receives a worker's result.

I logged the whole thing and found that the collector gets stuck at random times trying to insert,
and I do not get any log after the insertion, so it cannot enter the par_iter.

Temporary workaround:
when the collector receives a message, it does a serial insertion right away (instead of buffering and doing a parallel iteration later).

@cuviper
Member

cuviper commented Jun 12, 2023

Rayon can barely be said to have a scheduler at all -- it just greedily looks for any available work to keep busy as much as possible. There are some heuristics to this: local work defaults to LIFO, to work on stuff that's hopefully still in cache, while cross-thread stealing is FIFO, to take stuff that the "victim" perhaps hasn't touched in a while. And the global queue (injected from outside the pool) is considered last of all, with the idea that we should finish what the pool is working on before starting anything else.

spawn_fifo is the closest thing we have to a scheduling tweak, but that only affects the order among other spawn_fifo calls, and really only matters related to the local LIFO queue.

With your example of spawn calling par_iter, the inner part will split into many jobs with join. When one of the parts gets stolen to another thread, then the thread that called join will look for other work in the meantime. This is nested on the stack, so now its new work has to finish before it can get back to the old one that was blocked. This is generally considered fine and good for processing throughput, but it's definitely not optimized for latency.

The tokio crate cares more about latency, and its async/await nature is an important part of that because all state is captured -- a thread can juggle multiple tasks without having them literally nested on the stack as rayon does.

@ZhennanWu

ZhennanWu commented Jun 12, 2023

Not a rayon developer here, but I was working on a similar scenario. I would guess the reason could be:

  1. The worker thread that spawned the first par_iter task (let's call it Thread A) finishes its slice and finds that some items were stolen by other threads, which are still executing.
  2. Thread A steals a whole unexecuted second par_iter task.
  3. Now the first par_iter task can never complete before the second par_iter task. Similar blocking can happen between other par_iter tasks and form a dependency chain.

I have some workarounds for some limited scenarios but not all:

  1. If the work items inside the par_iter do not return anything, manual ref-counting inside the smallest work item can work. The ref-count can send a Completed signal once it reaches zero (a sketch follows below).
  2. If there is a fixed priority between different tasks, we can execute by priority.

Edit: The above workarounds require an external producer thread in terms of architecture.
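
A rough sketch of the first workaround, under the assumption that the work items return nothing and that a plain mpsc channel is acceptable as the completion signal (submit_task and do_work are hypothetical names):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Mutex};

use rayon::prelude::*;

fn do_work(_item: u64) {
    std::thread::sleep(std::time::Duration::from_millis(1));
}

// Hypothetical helper: run `items` in parallel and signal on `done_tx`
// as soon as the last item finishes, from whichever thread ran it.
fn submit_task(items: Vec<u64>, done_tx: mpsc::Sender<()>) {
    let remaining = AtomicUsize::new(items.len());
    let done_tx = Mutex::new(done_tx); // mpsc::Sender is !Sync, so wrap it

    rayon::spawn(move || {
        items.into_par_iter().for_each(|item| {
            do_work(item);
            // Whichever thread completes the final item sends the signal,
            // without waiting for the enclosing spawn to unwind.
            if remaining.fetch_sub(1, Ordering::AcqRel) == 1 {
                let _ = done_tx.lock().unwrap().send(());
            }
        });
    });
}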

@jean-pierreBoth

Thanks for the hint about the LIFO aspect. It helped me a lot.
I put the parallel iteration at the end of the collector, once I know that the workers have finished, and it runs nicely.
Thanks for your work. Rayon and Crossbeam are very helpful for building threaded pieces of code!

@ZhennanWu

Hi @cuviper, on second thought, for a GUI system where tasks are generated continuously (or basically any system that generates tasks without waiting for them), is it possible that they form a growing chain of blocking that eventually overflows the stack? I feel like some diagnostic APIs may be needed to help avert this problem.

@cuviper
Member

cuviper commented Jun 14, 2023

is it possible that they form a growing chain of blocking that eventually overflows the stack?

Yes, I think that once it's possible for this to happen between two events, it's also possible for that to happen again indefinitely. My intuition is that the repeated probability of that will grow vanishingly small, but I'm not sure.

I feel like some diagnostic APIs may be needed to help avert this problem.

Any idea what that might look like?

@ZhennanWu

I feel like some diagnostic APIs may be needed to help avert this problem.

Since this problem is beyond the scope of rayon, I was thinking of a backpressure mechanism to notify the users to stop congesting the threadpool.

A crude idea is that, if the user is continuously generating tasks, then they should also continuously (though less frequently) check how congested the threadpool is. In this scenario, rayon may only need to provide one or more atomic variables that roughly indicate the stack depth of the most congested worker. This is a very crude idea and I haven't really run into this kind of problem myself, so treat it as a long shot.
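
Purely to illustrate the user-side half of that idea (this is not an existing rayon API; the gauge below only counts the application's own in-flight spawns rather than worker stack depth):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Hypothetical application-side congestion gauge: count tasks that have been
// spawned but not yet finished, and stop submitting above a threshold.
const MAX_IN_FLIGHT: usize = 64;

fn maybe_submit(in_flight: &Arc<AtomicUsize>, work: impl FnOnce() + Send + 'static) -> bool {
    if in_flight.load(Ordering::Relaxed) >= MAX_IN_FLIGHT {
        return false; // back off; the pool is already congested
    }
    in_flight.fetch_add(1, Ordering::Relaxed);
    let in_flight = Arc::clone(in_flight);
    rayon::spawn(move || {
        work();
        in_flight.fetch_sub(1, Ordering::Relaxed);
    });
    true
}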

@remifontan
Author

Thanks for all your replies, this is very interesting.

Out of curiosity, I reproduced that example in C++ with TBB. My assumption was that Rayon and TBB work similarly, or at least that's what I was thinking.

It seems that TBB suffers less from that problem: nested parallel_for calls tend to block parent tasks a little less.

Their documentation explains heuristics used.
https://oneapi-src.github.io/oneTBB/main/tbb_userguide/How_Task_Scheduler_Works.html

I find that snippet interesting:
"Then rule 3 applies. It steals the oldest task spawned by another thread, which causes temporary breadth-first execution that converts potential parallelism into actual parallelism."

Is rayon following the same logic?

Here's my ugly C++ equivalent:
https://github.com/remifontan/tstbb

That gave me, on my Mac:

% make && ./main
[100%] Built target main
   0    0    0    0    0    0    0    0    0    0 
   2    1    6    0    0    0    0    0    0    0 
   4    3   13    0    0    0    0    0    0    0 
   7    4   24    0    0    0    0    0    0    0 
   8    4   29    0    0    0    0    0    0    0 
  12    5   35    0    0    0    0    0    0    0 
  13    6   39    0    0    0    0    0    0    0 
  16    7   45    0    0    0    0    0    0    0 
  21    9   51    0    0    0    0    0    0    0 
  24   10   56    0    0    0    0    0    0    0 
  25   11   58    0    0    0    0    0    0    0 
  29   12   64    0    0    0    0    0    0    0 
  32   12   69    0    0    0    0    0    0    0 
  35   14   74    1    0    0    0    0    0    0 
  38   15   78    2    0    0    0    0    0    0 
  40   16   80    3    0    0    0    0    0    0 
  43   16   84    4    0    0    0    0    0    0 
  48   18   89    6    0    0    0    0    0    0 
  51   19   92    8    0    0    0    0    0    0 
  54   20   94   10    0    0    0    0    0    0 
  54   20   97   12    0    0    0    0    0    0 
  60   22   99   14    1    0    0    0    0    0 
  62   22   99   15    1    0    0    0    0    0 
  66   23  100   17    2    0    0    0    0    0 
  73   25  100   19    4    0    0    0    0    0 
  73   27 done   21    4    0    0    0    0    0 
  77   31 done   27    8    0    0    0    0    0 
  79   33 done   29    9    0    0    0    0    0 
  81   34 done   32   11    0    0    0    0    0 
  82   36 done   36   14    0    0    0    0    0 
  85   39 done   39   16    0    0    0    0    0 
  86   39 done   42   17    0    0    0    0    0 
  89   43 done   45   19    0    0    0    0    0 
  89   44 done   45   20    0    0    0    0    0 
  91   47 done   48   22    0    0    0    0    0 
  93   48 done   51   24    0    0    0    0    0 
  96   51 done   54   27    0    0    0    0    0 
  98   53 done   59   30    0    0    0    0    0 
  98   55 done   61   32    2    1    0    0    0 
 100   57 done   63   34    3    3    0    0    0 
 100   59 done   65   36    4    4    1    0    0 
done   61 done   67   38    5    5    2    0    0 
done   63 done   68   39    6    6    3    0    0 
done   64 done   70   41    7    6    4    0    0 
done   67 done   73   44    8    8    4    0    0 
done   69 done   74   46    9    9    5    0    0 
done   71 done   77   48   10   10    7    0    0 
done   73 done   79   50   11   11    8    0    0 
done   75 done   80   52   12   12    9    0    0 
done   77 done   82   54   14   13   10    0    0 
done   78 done   82   55   14   13   11    0    0 
done   79 done   84   59   17   14   12    0    0 
done   79 done   85   61   19   15   12    0    0 
done   80 done   85   63   20   16   13    0    0 
done   81 done   86   66   22   17   14    0    0 
done   83 done   87   69   24   18   15    0    0 
done   84 done   88   74   27   19   17    0    0 
done   85 done   89   77   29   20   18    0    0 
done   85 done   90   79   30   21   18    0    0 
done   87 done   92   84   33   22   20    0    0 
done   88 done   93   86   35   23   21    0    0 
done   89 done   94   90   38   25   22    0    0 
done   90 done   95   92   40   26   23    0    0 
done   91 done   96   92   41   26   25    0    0 
done   92 done   97   94   44   28   28    0    0 
done   93 done   98   95   46   28   32    0    0 
done   94 done   99   96   48   30   34    0    0 
done   96 done  100   97   50   33   36    0    0 
done   97 done  100   98   52   35   38    0    0 
done   97 done  100   99   54   38   39    0    0 
done   99 done  100  100   56   42   42    0    0 
done  100 done  100 done   58   45   44    1    0 
done  100 done  100 done   60   49   46    2    0 
done  100 done  100 done   62   53   47    3    0 
done done done  100 done   64   57   49    4    0 
done done done  100 done   65   58   50    4    0 
done done done  100 done   68   62   52    5    0 
done done done  100 done   70   67   54    6    0 
done done done  100 done   71   69   56    7    1 
done done done  100 done   76   72   60    9    3 
done done done  100 done   78   73   62   10    4 
done done done  100 done   80   75   63   11    5 
done done done  100 done   85   78   66   12    6 
done done done  100 done   87   79   68   13    7 
done done done  100 done   89   80   69   13    8 
done done done  100 done   93   83   72   15    9 
done done done  100 done   97   86   74   16   10 
done done done  100 done   97   87   76   17   10 
done done done  100 done   99   91   77   18   16 
done done done  100 done  100   92   77   18   18 
done done done  100 done  100   96   79   20   23 
done done done  100 done  100   97   80   22   27 
done done done  100 done done   98   81   25   32 
done done done  100 done done   99   82   27   38 
done done done  100 done done  100   83   29   43 
done done done done done done done   84   31   49 
done done done done done done done   85   32   52 
done done done done done done done   86   35   60 
done done done done done done done   87   35   62 
done done done done done done done   88   38   69 
done done done done done done done   88   39   75 
done done done done done done done   89   41   81 
done done done done done done done   95   45   87 
done done done done done done done   98   47   89 
done done done done done done done  100   49   96 
done done done done done done done  100   54   99 
done done done done done done done done   63  100 
done done done done done done done done   72  100 
done done done done done done done done   81 done 
done done done done done done done done   90 done 
done done done done done done done done   99 done 
done done done done done done done done  100 done 
done done done done done done done done done done

@ZhennanWu

I find that snippet interesting: "Then rule 3 applies. It steals the oldest task spawned by another thread, which causes temporary breadth-first execution that converts potential parallelism into actual parallelism."

Is rayon following the same logic?

I believe so. For more detail you can refer to https://github.com/rayon-rs/rfcs/blob/master/accepted/rfc0001-scope-scheduling.md. From what I have read, TBB uses exactly the same strategy as rayon's default LIFO mode, where the local thread always tries to execute its newest task, but a thief always steals the oldest task.
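
Conceptually (rayon's real deques come from crossbeam-deque; this is only a toy model of the push/pop directions):

use std::collections::VecDeque;

struct Job;

// Toy model of one worker's deque. Newest jobs sit at the back.
struct WorkerQueue {
    deque: VecDeque<Job>,
}

impl WorkerQueue {
    // The owning thread works LIFO: take the job it pushed most recently,
    // which is most likely still hot in cache.
    fn pop_local(&mut self) -> Option<Job> {
        self.deque.pop_back()
    }

    // A thief works FIFO: take the oldest job, the one the owner is least
    // likely to touch soon.
    fn steal(&mut self) -> Option<Job> {
        self.deque.pop_front()
    }
}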

@remifontan
Author

remifontan commented Jun 18, 2023

Thanks for sharing the rayon doc. I can see the parallels between the two documents.

So, if I understand correctly, with rayon, I should expect a thread to steal the oldest task (front of the deque) as soon as its local task list is empty.

My little script (see the playground above) does not seem to show signs of work stealing.

I tried initialising the thread pool with more threads to force stealing. My machine has 10 cores; I tried with 20 threads and 40 threads. The behaviour did not change much: most of the spawned tasks only completed once all nested par_iter loops had completed.
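
For reference, the oversubscription looks like this with rayon's ThreadPoolBuilder (a sketch, not the exact playground setup):

fn main() {
    // Oversubscribe the global pool: 20 worker threads on a 10-core machine.
    rayon::ThreadPoolBuilder::new()
        .num_threads(20)
        .build_global()
        .expect("failed to build global thread pool");

    // ... spawn the tasks and run the par_iter loops as before ...
}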

However, in the provided C++/TBB example, as soon as a parallel_for is done, its parent task resumes and has a chance to complete. Does that mean that in my example rayon does not do any work-stealing, while TBB does?

Is there a way to visualise what rayon does? Tracing of some sort?

Edit:
I have uploaded a run of the TBB and rayon examples, with 60 spawned tasks doing 10 parallel iterations each.

I believe we can see two successful work steals happening in the rayon run, but all other tasks are stuck, while in the TBB example every task gets processed fairly soon after its inner parallel_for completes. Apologies for repeating the same thing again, but I wish I understood what is happening. :-)

@ZhennanWu

ZhennanWu commented Jun 18, 2023

First I want to point out that there is work stealing happening in your rayon results. It just seems that in that scenario, rayon steals the "sub"-items produced by the par_iter much more frequently. And "stealing" does not imply that a thief can complete a task on behalf of the original owner thread. On the contrary, if the original owner is busy stealing from others, then the completion of that task may be postponed indefinitely, as previous comments have pointed out.

I have no experience with TBB, but I would point to some potential causes for the differences:

  1. The TBB documentation does not specify whether there is a global task queue, or how pulling from a global queue is prioritized vs. stealing from other threads. Rayon chooses to prioritize stealing over pulling from the global queue: as long as there is any chance to steal, a thief will not pull from the global queue. From the logs you shared, it seems that TBB probably does not strictly do this.
     fn find_work(&self) -> Option<JobRef> {
         // Try to find some work to do. We give preference first
         // to things in our local deque, then in other workers
         // deques, and finally to injected jobs from the
         // outside. The idea is to finish what we started before
         // we take on something new.
         self.take_local_job()
             .or_else(|| self.steal())
             .or_else(|| self.registry.pop_injected_job(self.index))
     }
  2. I would advise checking how TBB implements its parallel_for; I have no knowledge of it. Rayon's parallel iterator is first recursively bisected in a depth-first manner (while other threads aggressively steal the bisected chunks left behind). Rayon only executes a chunk once it reaches the bisection limit for the current thread (a simplified sketch of this bisection follows the list).
  3. It may purely be a timing/race effect arising from slightly different code paths in the two implementations.
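
A simplified sketch of that join-based bisection (not rayon's actual bridge code; min_len stands in for the splitter's length limit):

// Split the range in half until it is small enough, then run the halves via
// join so that idle threads can steal the not-yet-started half.
fn bisect_and_run(range: std::ops::Range<usize>, min_len: usize, body: &(impl Fn(usize) + Sync)) {
    if range.len() <= min_len {
        for i in range {
            body(i);
        }
    } else {
        let mid = range.start + range.len() / 2;
        let (left, right) = (range.start..mid, mid..range.end);
        // join blocks until BOTH halves are done; if one half is stolen,
        // this thread may go steal unrelated work in the meantime.
        rayon::join(
            || bisect_and_run(left, min_len, body),
            || bisect_and_run(right, min_len, body),
        );
    }
}

Roughly speaking, (0..n).into_par_iter().for_each(f) boils down to this pattern, which is why an idle thread can steal a not-yet-started half while the splitting thread keeps descending.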

@remifontan
Author

Note, I'm not an expert at all in TBB and obviously I don't know much about the internals of rayon.

Looking at TBB's source code, this is my very limited understanding.

parallel_for seems to create and register some sort of "wait_node" object that establishes a parent-child relationship between the original stack and the task: for_task.my_parent = &wn;

run(...) @ include/oneapi/tbb/parallel_for.h

static void run(const Range& range, const Body& body, Partitioner& partitioner, task_group_context& context) {
    if ( !range.empty() ) {
        small_object_allocator alloc{};
        start_for& for_task = *alloc.new_object<start_for>(range, body, partitioner, alloc);

        // defer creation of the wait node until task allocation succeeds
        wait_node wn;
        for_task.my_parent = &wn;
        execute_and_wait(for_task, context, wn.m_wait, context);
    }
}

The start_for eventually gets executed and finalized.
The finalize function is interesting with its call to fold_tree on the parent object: fold_tree<tree_node>(parent, ed);

finalize(...) @ include/oneapi/tbb/parallel_for.h

template<typename Range, typename Body, typename Partitioner>
void start_for<Range, Body, Partitioner>::finalize(const execution_data& ed) {
    // Get the current parent and allocator an object destruction
    node* parent = my_parent;
    auto allocator = my_allocator;
    // Task execution finished - destroy it
    this->~start_for();
    // Unwind the tree decrementing the parent`s reference count
    fold_tree<tree_node>(parent, ed);
    allocator.deallocate(this, ed);
}

Now, I clearly do not understand how it actually works under the hood, but I'm wondering whether this parent waiter object is the reason the parent tasks are resumed as soon as their parallel_for tasks are done... this is all speculation at this point.

@cuviper
Member

cuviper commented Jun 18, 2023

Interesting -- I think that is a plausible trail. Rayon runs parallel iterators with recursive join calls, but that means each branch of that call tree will wait for all of its sub-branches to complete before returning. That's a lot of places where it could be pseudo-preempted into work stealing when blocked. If instead we modeled this as a scope that uses spawn for each subtask, then only the scope itself will block on completion, even if the algorithm was still splitting recursively in halves like the join model would.

So in essence, one waiter has less chance of getting "distracted" than a bunch of recursive waiters. Or that one may come back around to notice completion more timely than a bunch would.

We couldn't do that in general, because anything that has a meaningful reduction of results will need to apply that at each step. But cases where we use the internal NoopReducer (especially for_each) could transform to this detached scope+spawn model. (How to accomplish that distinction is potentially more difficult.)
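
For the for_each case, that detached model would look roughly like this at the user level (only a sketch of the shape, not rayon internals; chunk_len stands in for whatever splitting heuristic would be used):

// Instead of recursive join, where every split waits for both of its
// children, push each leaf chunk as a scope-level spawn so there is a single
// waiter -- the scope itself.
fn for_each_via_scope(items: &[u64], chunk_len: usize, body: impl Fn(u64) + Sync) {
    rayon::scope(|s| {
        for chunk in items.chunks(chunk_len) {
            let body = &body;
            s.spawn(move |_| {
                for &item in chunk {
                    body(item);
                }
            });
        }
    });
    // scope() only returns once every spawned chunk has finished.
}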

@cuviper
Member

cuviper commented Jun 19, 2023

Here's a proof of concept for the scope+spawn idea:
master...cuviper:rayon:scoped-reducer

This passes all tests, and reaches "Done" in your example much sooner. I haven't checked any benchmarks yet to see what other effect it has, but at the very least it does add allocation for the spawns.

@remifontan
Author

remifontan commented Jun 20, 2023

This is great; it behaves exactly as I was hoping. I very much appreciate that you took the time to make a proof of concept.

Also, I pulled your branch and tried it in my GUI; it works great from the point of view of user experience. My widgets get updated as soon as their spawned tasks are done, with no more tasks getting stuck until all tasks are completed.

@cuviper
Member

cuviper commented Jun 20, 2023

Unfortunately, it does appear to have high overhead. From the rayon-demo crate in this repo, the join_microbench benchmarks are written via parallel for_each -- on master I get:

test join_microbench::increment_all                                     ... bench:      32,080 ns/iter (+/- 2,053)
test join_microbench::increment_all_atomized                            ... bench:     380,153 ns/iter (+/- 48,515)
test join_microbench::increment_all_max                                 ... bench:      42,498 ns/iter (+/- 2,927)
test join_microbench::increment_all_min                                 ... bench:      11,196 ns/iter (+/- 1,167)
test join_microbench::increment_all_serialized                          ... bench:      11,836 ns/iter (+/- 695)

But with that scoped_reducer branch:

test join_microbench::increment_all                                     ... bench:     465,641 ns/iter (+/- 37,942)
test join_microbench::increment_all_atomized                            ... bench:   2,997,431 ns/iter (+/- 19,496)
test join_microbench::increment_all_max                                 ... bench:     539,590 ns/iter (+/- 40,075)
test join_microbench::increment_all_min                                 ... bench:       8,138 ns/iter (+/- 844)
test join_microbench::increment_all_serialized                          ... bench:      11,759 ns/iter (+/- 91)

There's one improvement (!), but three much worse regressions. This is even after I tweaked #1057 and #1058 trying to address some of the shortcomings. And while micro-benchmarks are always suspect, I also found that one of my "real world" Project Euler solutions using parallel for_each went from 6.8ms to 14.6ms.

I'm trying to think of ways to make this opt-in -- perhaps a for_each_spawn or something. But there's not a great way to integrate this either -- I don't really like that Reducer::HAS_EFFECT (or maybe WANTS_SPAWN?) in the API. Ideally it would be an internal detail, but I think that needs specialization to be able to do that at the part of code where the bridging join takes place.

Alternatively, you could try to write your GUI code directly using scope+spawn instead of parallel iterators.
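
For example, a widget task written directly with scope + spawn might look like this (a sketch; render_chunk and the done flag are placeholders for whatever the GUI actually uses):

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Placeholder for the heavy per-widget computation.
fn render_chunk(_chunk: usize) {}

// Hypothetical per-widget task: chunk the work yourself with scope + spawn,
// then flip the widget's "done" flag as soon as *its* scope returns.
fn spawn_widget_task(done: Arc<AtomicBool>, num_chunks: usize) {
    rayon::spawn(move || {
        rayon::scope(|s| {
            for i in 0..num_chunks {
                s.spawn(move |_| render_chunk(i));
            }
        });
        done.store(true, Ordering::Release);
    });
}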

@adamreichold
Collaborator

adamreichold commented Jun 21, 2023

There's one improvement (!), but three much worse regressions.

Ouch. Is the overhead in allocating the task objects for spawn? Would that explain why TBB appears to be playing pool allocation tricks for those task management objects via its small_object_allocator?

@cuviper
Member

cuviper commented Jun 21, 2023

Looking at perf for join_microbench::increment_all (which uses default splitting heuristics), the vast majority of that overhead is just the update of the scope counter of outstanding spawns. It's 40% of the profile on lock incq and 30% on lock decq, plus 18% on a mov where ScopeLatch::set has a match on its enum variant.

I think the main lesson is that join is very good at avoiding any shared state. Maybe we can make the scope use a distributed counter? The shared scope state doesn't really care about the raw number per se -- it only wants to know when that reaches 0, i.e. all spawns are complete.

@cuviper
Member

cuviper commented Jun 21, 2023

plus 18% on a mov where ScopeLatch::set has a match on its enum variant.

I've gotten rid of this in #1059, so using scope+spawn on that now looks like:

test join_microbench::increment_all                                     ... bench:     407,860 ns/iter (+/- 28,345)
test join_microbench::increment_all_atomized                            ... bench:   2,671,980 ns/iter (+/- 24,333)
test join_microbench::increment_all_max                                 ... bench:     462,618 ns/iter (+/- 22,606)
test join_microbench::increment_all_min                                 ... bench:       7,960 ns/iter (+/- 815)
test join_microbench::increment_all_serialized                          ... bench:      11,486 ns/iter (+/- 61)

The shared counter still dominates perf -- now 45% lock incq and 40% lock decq.

@remifontan
Author

Out of curiosity, by "distributed counter" did you mean something along these lines?
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p0261r4.html

@cuviper
Member

cuviper commented Jun 22, 2023

Something like that, yeah, but I'm not sure if those designs apply because they're de-prioritizing the reader, and we still need to know when it reaches zero.
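
To sketch the shape of that idea (only an illustration, not a workable latch; the comment on appears_zero is exactly where the difficulty lies):

use std::sync::atomic::{AtomicIsize, Ordering};

// One padded counter shard per worker thread, so increments and decrements
// rarely contend on the same cache line.
#[repr(align(64))]
struct Shard(AtomicIsize);

struct DistributedCounter {
    shards: Box<[Shard]>,
}

impl DistributedCounter {
    fn new(num_threads: usize) -> Self {
        Self {
            shards: (0..num_threads).map(|_| Shard(AtomicIsize::new(0))).collect(),
        }
    }

    fn increment(&self, thread_index: usize) {
        self.shards[thread_index].0.fetch_add(1, Ordering::SeqCst);
    }

    fn decrement(&self, thread_index: usize) {
        self.shards[thread_index].0.fetch_sub(1, Ordering::SeqCst);
    }

    // The hard part: this naive sum can observe a transient zero while counts
    // are "in flight" between shards (incremented on one thread, decremented
    // on another), so a real latch needs a stronger termination protocol.
    fn appears_zero(&self) -> bool {
        self.shards.iter().map(|s| s.0.load(Ordering::SeqCst)).sum::<isize>() == 0
    }
}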
