Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

yield_local can cause stack overflow #1064

Open
Pr0methean opened this issue Jun 25, 2023 · 3 comments
Open

yield_local can cause stack overflow #1064

Pr0methean opened this issue Jun 25, 2023 · 3 comments

Comments

@Pr0methean
Copy link

Pr0methean commented Jun 25, 2023

When a job calls yield_local, Rayon loads another job onto the stack on top of it. If lots of jobs are calling it, this can cause a stack overflow.

To fix this, Rayon should give each thread a flag that tracks whether yield_local is already on the stack. If it is and is called again, yield_local should return immediately. (There may need to be a third value in the Yield enum for this case.) Alternately, it could check available stack space.

Example stack trace:
stack_trace.txt

Pr0methean added a commit to Pr0methean/rayon that referenced this issue Jun 25, 2023
Pr0methean added a commit to Pr0methean/OcHd-RustBuild that referenced this issue Jun 25, 2023
@cuviper
Copy link
Member

cuviper commented Jul 3, 2023

I don't agree with your fix that yielding should never recurse.

It's in the nature of work-stealing that rayon jobs may be nested on each other, and yield_local is just volunteering the same kind of preemption that join or any other rayon call might incur when they get blocked. Yielding twice is not really any worse than the first yield, but your stack trace shows that you nested 128 times before you overflowed the stack.

I think oxipng is being too aggressive here:
https://github.com/shssoichiro/oxipng/blob/129f1e6f76de1cacd55a4d95f066c94dc0a44dd4/src/evaluate.rs#L91

    /// Wait for all evaluations to finish and return smallest reduction
    /// Or `None` if the queue is empty.
    #[cfg(feature = "parallel")]
    pub fn get_best_candidate(self) -> Option<Candidate> {
        let (eval_send, eval_recv) = self.eval_channel;
        // Disconnect the sender, breaking the loop in the thread
        drop(eval_send);
        // Yield to ensure evaluations are finished - this can prevent deadlocks when run within an existing thread pool
        while let Some(rayon::Yield::Executed) = rayon::yield_local() {}
        eval_recv.into_iter().min_by_key(Candidate::cmp_key)
    }

That's draining all local work before returning, when they only need to wait for their own evaluations. It might be cleaner if they looped on the channel try_recv until TryRecvError::Disconnected, so they don't run into your ohcd tasks sitting earlier in the local queue.

@cuviper
Copy link
Member

cuviper commented Jul 3, 2023

I looks like they're already fixing something like that in shssoichiro/oxipng#527

@Pr0methean
Copy link
Author

Pr0methean commented Jul 4, 2023

Maybe yield_local can be used safely. But the risk of a stack overflow with overuse needs to be documented, including the how to use it in a library crate without breaking consumers who call the library from other Rayon tasks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants