yield_local can cause stack overflow #1064

Pr0methean · 2023-06-25T21:03:05Z

When a job calls yield_local, Rayon loads another job onto the stack on top of it. If lots of jobs are calling it, this can cause a stack overflow.

To fix this, Rayon should give each thread a flag that tracks whether yield_local is already on the stack. If it is and yield_local is called again, the nested call should return immediately. (There may need to be a third value in the Yield enum for this case.) Alternatively, it could check available stack space.

Example stack trace:
stack_trace.txt
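
For illustration only, the per-thread flag proposed above could look roughly like the sketch below. It is not Rayon's code: `guarded_yield`, `GuardedYield`, and the `run_one_job` closure (standing in for "pop and execute one local job") are hypothetical names, and `Nested` plays the role of the suggested third enum value next to Rayon's existing `Executed` and `Idle`.

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical per-thread flag: true while a guarded yield is on this thread's stack.
    static YIELDING: Cell<bool> = Cell::new(false);
}

/// Hypothetical result type; `Nested` is the proposed third case.
#[derive(Debug, PartialEq)]
enum GuardedYield {
    Executed,
    Idle,
    Nested,
}

/// Clears the flag when the outermost yield returns, even if the job panics.
struct Reset;

impl Drop for Reset {
    fn drop(&mut self) {
        YIELDING.with(|flag| flag.set(false));
    }
}

/// `run_one_job` is an assumption standing in for "execute one local job";
/// it returns true if a job actually ran.
fn guarded_yield(run_one_job: impl FnOnce() -> bool) -> GuardedYield {
    // If the flag was already set, a yield is in progress further down the stack.
    if YIELDING.with(|flag| flag.replace(true)) {
        return GuardedYield::Nested;
    }
    let _reset = Reset;
    if run_one_job() {
        GuardedYield::Executed
    } else {
        GuardedYield::Idle
    }
}
```

With this guard, a job that yields while it is itself running inside a yield gets `Nested` back instead of pushing another job onto the stack, so the nesting depth from yielding is capped at one.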

cuviper · 2023-07-03T22:14:13Z

I don't agree with your fix that yielding should never recurse.

It's in the nature of work-stealing that rayon jobs may be nested on each other, and yield_local is just volunteering the same kind of preemption that join or any other rayon call might incur when they get blocked. Yielding twice is not really any worse than the first yield, but your stack trace shows that you nested 128 times before you overflowed the stack.

I think oxipng is being too aggressive here:
https://github.com/shssoichiro/oxipng/blob/129f1e6f76de1cacd55a4d95f066c94dc0a44dd4/src/evaluate.rs#L91

    /// Wait for all evaluations to finish and return smallest reduction
    /// Or `None` if the queue is empty.
    #[cfg(feature = "parallel")]
    pub fn get_best_candidate(self) -> Option<Candidate> {
        let (eval_send, eval_recv) = self.eval_channel;
        // Disconnect the sender, breaking the loop in the thread
        drop(eval_send);
        // Yield to ensure evaluations are finished - this can prevent deadlocks when run within an existing thread pool
        while let Some(rayon::Yield::Executed) = rayon::yield_local() {}
        eval_recv.into_iter().min_by_key(Candidate::cmp_key)
    }

That's draining all local work before returning, when they only need to wait for their own evaluations. It might be cleaner if they looped on the channel's try_recv until TryRecvError::Disconnected, so they don't run into your OcHd tasks sitting earlier in the local queue.
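
A minimal sketch of that try_recv loop, assuming a std::sync::mpsc channel rather than whatever channel type oxipng actually uses. `drain_until_disconnected` and `smallest_result` are hypothetical names, and the `yield_once` closure is a stand-in: inside a Rayon job it would presumably be something like `|| { rayon::yield_local(); }`, while here it is `std::thread::yield_now` so the example needs no external crate.

```rust
use std::sync::mpsc::{channel, Receiver, TryRecvError};
use std::thread;

/// Collect items from `rx` until every sender has been dropped.
/// `yield_once` is called only while the channel is empty but still connected,
/// so yielding stops as soon as our own producers are done.
fn drain_until_disconnected<T>(rx: &Receiver<T>, mut yield_once: impl FnMut()) -> Vec<T> {
    let mut items = Vec::new();
    loop {
        match rx.try_recv() {
            Ok(item) => items.push(item),
            // Producers are still running: give other work a chance, then retry.
            Err(TryRecvError::Empty) => yield_once(),
            // All senders dropped and the buffer is empty: nothing more will arrive.
            Err(TryRecvError::Disconnected) => return items,
        }
    }
}

/// Toy stand-in for "wait for all evaluations and return the smallest".
fn smallest_result() -> Option<u32> {
    let (tx, rx) = channel();
    let workers: Vec<_> = vec![30u32, 10, 20]
        .into_iter()
        .map(|size| {
            let tx = tx.clone();
            thread::spawn(move || tx.send(size).unwrap())
        })
        .collect();
    // Drop our own sender so the channel disconnects once the workers finish.
    drop(tx);
    let results = drain_until_disconnected(&rx, thread::yield_now);
    for worker in workers {
        worker.join().unwrap();
    }
    results.into_iter().min()
}
```

Each yield may still run an unrelated job, but the loop exits once the caller's own senders are gone instead of continuing until the whole local queue is empty.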

cuviper · 2023-07-03T22:17:49Z

It looks like they're already fixing something like that in shssoichiro/oxipng#527

Pr0methean · 2023-07-04T03:47:42Z

Maybe yield_local can be used safely. But the risk of a stack overflow from overuse needs to be documented, including how to use it in a library crate without breaking consumers who call the library from other Rayon tasks.

Pr0methean added a commit to Pr0methean/rayon that referenced this issue Jun 25, 2023

Fix rayon-rs#1064

63c2780

This was referenced Jun 25, 2023

Parallel mode hangs when invoked from Rayon global thread pool shssoichiro/oxipng#517

Closed

Fix #1064 #1065

Closed

Pr0methean added a commit to Pr0methean/OcHd-RustBuild that referenced this issue Jun 25, 2023

Work around rayon-rs/rayon#1064

1b82d85

cuviper mentioned this issue Jul 3, 2023

Stop yielding once all evaluations are started shssoichiro/oxipng#527

Merged
