rayon-core panics in 32-bit windows #827

Closed
emilio opened this issue Feb 25, 2021 · 10 comments

@emilio
Contributor

emilio commented Feb 25, 2021

Hi,

We updated rayon to 1.5 in https://bugzilla.mozilla.org/show_bug.cgi?id=1683294, and we're seeing more-frequent-than-usual crashes, particularly in 32-bit windows, deep in the guts of rayon, with the unwrap() here failing:

https://hg.mozilla.org/releases/mozilla-release/annotate/89345511871ef6489580b994be21189e84462393/third_party/rust/rayon-core/src/job.rs#l166

You can see one of the crash reports here or click "More reports" for the others.

Of course, this might be something other than a bug in rayon (for example, some other memory corruption), but it seems to correlate with the rayon update, so I thought it would be useful to get eyes from people more knowledgeable about the rayon internals than me, in case these crashes ring a bell.

@cuviper
Member

cuviper commented Feb 25, 2021

It appears you updated from rayon 1.2 / rayon-core 1.6. The biggest difference in unsafe code would be the new scheduler introduced in rayon-core 1.8, although that didn't touch job.rs at all.

The next thing that maybe deserves scrutiny is crossbeam-deque, in case there's some race in passing jobs around.

@emilio
Contributor Author

emilio commented Mar 11, 2021

@cuviper is there anything 32-bit specific in rayon? The only pointer-width-dependent thing I see is the SeqLock implementation in crossbeam, which doesn't seem to be used by either rayon or crossbeam-deque. But I may be missing something...

@cuviper
Member

cuviper commented Mar 12, 2021

Anything usize-related would also be more constrained on 32-bit -- for example there's the use of AtomicUsize in rayon-core/src/sleep/counters.rs where it represents multiple things at once. The "JEC" part has much less headroom on 32-bit, but it's also supposed to be fine for that to wrap.
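As a rough illustration of why a wrapping counter packed into an `AtomicUsize` has less headroom on 32-bit targets, here is a minimal sketch. The field layout, the 8-bit low field, and the names `PackedCounters`, `bump_event`, and `EVENT_BITS` are assumptions made for this example and are not taken from `rayon-core/src/sleep/counters.rs`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical layout, not rayon-core's actual one: the low 8 bits hold a
// small count and all remaining high bits hold an event counter.
const LOW_BITS: u32 = 8;
const LOW_MASK: usize = (1 << LOW_BITS) - 1;
const ONE_EVENT: usize = 1 << LOW_BITS;

/// Bits left for the event counter: 24 on a 32-bit target, 56 on 64-bit.
pub const EVENT_BITS: u32 = usize::BITS - LOW_BITS;

pub struct PackedCounters {
    word: AtomicUsize,
}

impl PackedCounters {
    pub fn new(low: usize, events: usize) -> Self {
        assert!(low <= LOW_MASK);
        PackedCounters {
            word: AtomicUsize::new((events << LOW_BITS) | low),
        }
    }

    /// Bumps the event counter. `fetch_add` wraps on overflow, and because
    /// the event counter sits in the topmost bits, the carry falls off the
    /// end of the word and leaves the low field untouched.
    pub fn bump_event(&self) {
        self.word.fetch_add(ONE_EVENT, Ordering::SeqCst);
    }

    pub fn low(&self) -> usize {
        self.word.load(Ordering::SeqCst) & LOW_MASK
    }

    pub fn events(&self) -> usize {
        self.word.load(Ordering::SeqCst) >> LOW_BITS
    }
}
```

With this layout, a counter placed in the high bits can wrap without disturbing its neighbour, which is one way such a design can tolerate wrapping even when the word is only 32 bits wide.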

@jrmuizel

We tried switching to rayon 1.4 to narrow down what change may have introduced the problem and we get panics with overflow here:

[task 2021-03-11T23:21:57.481Z] 23:21:57     INFO - GECKO(3156) | Hit MOZ_CRASH(attempt to add with overflow) at /builds/worker/checkouts/gecko/third_party/rust/rayon-core/src/sleep/counters.rs:226

@cuviper
Member

cuviper commented Mar 12, 2021

Ah, that was probably #797, fixed by #800, and published in rayon-core 1.8.1.
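For context on the message in the log above: "attempt to add with overflow" is what Rust reports when a plain `+` on integers overflows while overflow checks are enabled. The sketch below contrasts checked and wrapping addition. It is a generic illustration and not the change made in #800, whose contents are not shown in this thread; the function names are hypothetical.

```rust
/// Plain `a + b` panics with "attempt to add with overflow" when overflow
/// checks are enabled (the default in debug builds). `checked_add` reports
/// the same condition as `None`, regardless of build profile.
pub fn add_checked(counter: usize, step: usize) -> Option<usize> {
    counter.checked_add(step)
}

/// `wrapping_add` never panics: the result wraps around at the boundary
/// of the type, so `usize::MAX + 1` becomes 0.
pub fn add_wrapping(counter: usize, step: usize) -> usize {
    counter.wrapping_add(step)
}
```

When a counter is meant to wrap, spelling that out with `wrapping_add` (or an atomic `fetch_add`, which also wraps) avoids depending on whether overflow checks happen to be enabled in a given build.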

@emilio
Contributor Author

emilio commented Mar 25, 2021

OK, so we have a bit more data. We still see these crashes with rayon-core 1.8.1 / rayon 1.4 (at lower volume so far, but there are also far fewer beta users...).

@cuviper
Member

cuviper commented Mar 25, 2021

Is it possible to reproduce this in a more contained way? Perhaps in a direct stress-test of the crate using rayon?

@cuviper
Member

cuviper commented Aug 5, 2021

On the possibility of this being in crossbeam-deque, there's a CVE for a data race: GHSA-pqqp-xmhj-wgcw. The symptom, "one or more tasks in the worker queue can be popped twice", does fit with the way rayon is failing, having a job's Option::take() return None as if it were already taken.

However, that CVE is not 32-bit specific, and your previous rayon-core 1.6 was using crossbeam-deque 0.7 which is also affected by that bug. So it's not a perfect match, but (wild guess) it's possible that the other changes in rayon-core were previously adding enough memory synchronization to avoid it.
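The failure mode described here, a job's `Option::take()` returning `None` because the job was already run once, can be sketched in isolation. `OnceJob` below is a hypothetical stand-in for a run-once job and is not rayon-core's actual job type; it returns `None` on the second execution where code that calls `unwrap()` on the taken closure would panic.

```rust
/// Hypothetical stand-in for a run-once job; not rayon-core's actual type.
pub struct OnceJob<F: FnOnce() -> i32> {
    func: Option<F>,
}

impl<F: FnOnce() -> i32> OnceJob<F> {
    pub fn new(func: F) -> Self {
        OnceJob { func: Some(func) }
    }

    /// Runs the closure if it is still present. Returns `None` when the
    /// closure was already taken, which is the case where an `unwrap()` on
    /// the result of `take()` would panic.
    pub fn execute(&mut self) -> Option<i32> {
        let func = self.func.take()?;
        Some(func())
    }
}
```

If a queue hands the same job to two workers, the second `execute` finds the slot empty, which matches the symptom of an `unwrap()` failing on an already-taken closure.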

@cuviper
Member

cuviper commented May 11, 2022

@emilio Is this still a problem? That crash report link is not available anymore.

I just opened #934 to fix a use-after-free, which could account for any sort of memory corruption, but it is especially close to the job execute where you reported an unwrap failure. That particular race would only occur when the ThreadPool is exiting, but perhaps those crash reports were while closing the browser or something.

@emilio
Contributor Author

emilio commented May 12, 2022

We're not seeing this anymore. Looking at the crash rates, this went away when we updated crossbeam-deque for GHSA-pqqp-xmhj-wgcw.

emilio closed this as completed May 12, 2022