
Possible flakiness in miri CI job. #82

Closed
Anders429 opened this issue Aug 10, 2022 · 8 comments
Labels
A - Scheduling | Area: Parallel scheduling of systems.
C - Continuous Integration | Category: Ensuring the stability of the library.
C - Unsound | Category: Undefined behavior.
F - rayon | Feature: Parallel processing through the rayon library.
P - High | Priority: Urgent, needing immediate attention.
S - Needs Investigation | Status: Further investigation is needed.

Comments

@Anders429
Owner

Happened on #81. miri failed, but succeeded on a rerun. It's worth investigating, as it means there could be some edge case undefined behavior being hit that miri can't catch all the time. From the logs, it seems to be originating in the Scheduler logic.

@Anders429 Anders429 added the C - Bug Category: Something is not functioning as expected. label Aug 10, 2022
@Anders429
Owner Author

Another occurrence of this here. Seems to be the same problem: deallocation of an item while it is protected.

It's occurring at the load split in the schedule stage logic. When the load splits, we send a pointer to the World to both threads. On each thread, the pointer is then turned into a &mut reference, which I thought was safe because the data accessed by the views doesn't overlap, but I suspect that creating two simultaneous &mut references to the same World is the problem.

I need to revisit the code here to figure out if there's a better solution.
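One possible direction, sketched below: instead of handing each thread a raw pointer that gets reinterpreted as &mut World (two aliasing &mut references to the same World is undefined behavior even when the fields touched don't overlap), split the borrow into disjoint &mut references before spawning, so each thread only ever holds a &mut to its own component. This is a minimal model, not the actual brood code; the World struct and field names here are invented for illustration.

```rust
// Hypothetical two-component "World"; the real World is far more complex.
struct World {
    positions: Vec<f32>,
    velocities: Vec<f32>,
}

// Split one &mut World into two non-overlapping &mut borrows.
// The borrow checker allows this because the fields are disjoint,
// and Miri/stacked borrows is happy because no &mut aliases another.
fn split(world: &mut World) -> (&mut Vec<f32>, &mut Vec<f32>) {
    (&mut world.positions, &mut world.velocities)
}

fn main() {
    let mut world = World {
        positions: vec![0.0],
        velocities: vec![0.0],
    };
    let (positions, velocities) = split(&mut world);
    // Each scoped thread receives exactly one of the disjoint borrows.
    std::thread::scope(|s| {
        s.spawn(move || positions.push(1.0));
        s.spawn(move || velocities.push(2.0));
    });
    assert_eq!(world.positions.len(), 2);
    assert_eq!(world.velocities.len(), 2);
    println!("ok");
}
```

The key design point is that the split happens once, on one thread, before any spawning; after that, no code path ever materializes a second &mut to the whole World.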

@Anders429
Owner Author

Had another occurrence of this in a push on #88 here. But I also found that this isn't just isolated to this library. This rayon issue looks to be about the exact same error message.

@Anders429 Anders429 added S - Needs Investigation Status: Further investigation is needed. C - Continuous Integration Category: Ensuring the stability of the library. C - Unsound Category: Undefined behavior. P - High Priority: Urgent, needing immediate attention. F - rayon Feature: Parallel processing through the rayon library. A - Scheduling Area: Parallel scheduling of systems. and removed C - Bug Category: Something is not functioning as expected. labels Sep 15, 2022
@Anders429
Owner Author

Oddly, I can't seem to reproduce this at all locally. It just seems to occur randomly on CI. I'd hate to have to disable miri for the rayon feature, but I'm not finding any other solutions at the moment.
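One thing that might help with local reproduction: Miri's scheduling is deterministic for a given seed, so a CI machine may simply be landing on seeds that a local run never tries. A sweep over seeds is a sketch worth trying; the `--features rayon` flag is an assumption about this crate's feature names.

```shell
# Try many Miri seeds locally; an intermittent CI-only failure is often
# just a seed-dependent interleaving that one local run never hits.
# Assumes a nightly toolchain with the miri component installed.
for seed in $(seq 0 31); do
  echo "trying seed $seed"
  MIRIFLAGS="-Zmiri-seed=$seed" cargo +nightly miri test --features rayon \
    || { echo "reproduced with seed $seed"; break; }
done
```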

@Anders429
Owner Author

Not sure why, but I haven't seen this occur since #147 was merged. In the work on #147 and a few PRs earlier, this was occurring on every run. Now it occurs on none of them. Perhaps something changed on the nightly channel to fix this issue?

@Anders429
Owner Author

I'll keep an eye out for this occurring again. If I don't see any further failures in a few weeks, I'll close this.

@Anders429
Owner Author

Upon further investigation, it seems the last failure was on the nightly channel from 2022-11-25. The next run, which started the current success streak, was on the nightly channel from 2022-11-26.

I'm not positive, but it seems like this might be related to this change in Miri. The issue being hit deals specifically with protection of stacked borrows, and that PR changes how protectors work, adding "weak protection". I can't pretend to know the details, but it may have resolved the issue. Looking at the changed files, I see that the exact error Miri was emitting (error: Undefined Behavior: deallocating while item [SharedReadOnly for <8872699>] is protected by call 2475770) has actually been changed (see here; the message now distinguishes between strongly and weakly protected items), which at least indicates to me that the change likely affects this case.

I'm not sure how Miri changes line up with nightly releases, but the timing seems to match: the issue stopped occurring 9 days ago, and that PR was merged 14 days ago. I assume there is some turnaround before a change propagates to the rustup component, but I'm not sure how that process actually works. At the very least, this makes me more confident that the issue is resolved. I'll wait until the release of 0.5.0 (probably in about a month, given the current cadence), and if no further issues arise, I'll close this.

@Anders429
Owner Author

Anders429 commented Dec 21, 2022

Seems like I was wrong. The miri failure resurfaced: https://github.com/Anders429/brood/actions/runs/3751952249/jobs/6373550656

For reference, this is the rustc version used in that run:

Run rustc +nightly --version --verbose
  rustc +nightly --version --verbose
  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
  env:
    CARGO_TERM_COLOR: always
    CARGO_INCREMENTAL: 0
rustc 1.68.0-nightly (1d12c3cea 2022-12-20)
binary: rustc
commit-hash: 1d12c3cea30b8ba4a09650a9e9c46fe9fbe25f0b
commit-date: 2022-12-20
host: x86_64-unknown-linux-gnu
release: 1.68.0-nightly
LLVM version: 15.0.6

@Anders429
Owner Author

Looks like this was fixed on rayon's side in rayon-rs/rayon#1011. As long as rayon-core version 1.10.2 or later is used, miri should no longer complain.
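To make sure the fixed rayon-core actually lands, the dependency bound can be raised in Cargo.toml. A minimal sketch, with the rayon version number being an assumption (only the rayon-core 1.10.2 floor comes from the comment above):

```toml
[dependencies]
# rayon-core 1.10.2 contains the fix from rayon-rs/rayon#1011;
# requiring at least that version keeps Miri from hitting the
# protector-related error described in this issue.
rayon = { version = "1.6", optional = true }
rayon-core = "1.10.2"
```

Even without an explicit rayon-core entry, running `cargo update` to pull rayon-core >= 1.10.2 into Cargo.lock would have the same effect for CI.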

This means that miri runs should no longer be flaky, making CI failures much more trustworthy. I'll mark this as resolved, since I don't see any reason this should still be a problem, unless there's something more I'm unaware of.
