lto = "thin" causes SIGSEGVs when ran under rustdoc #911

Closed
trofi opened this issue Jan 10, 2022 · 7 comments

trofi commented Jan 10, 2022

In xiph/rav1e#2851, NixOS discovered a crash of rayon-1.5.1 in rav1e's doctests. Here is an extracted reproducer:

# cat Cargo.toml
[package]
name = "bug"
version = "0.1.0"
edition = "2018"

[dependencies]
rayon = "~1.5.1"

[profile.dev]
#debug = true
#incremental = true
#debug-assertions = true
lto = "thin"
opt-level = 2
overflow-checks = false
codegen-units = 256
// cat src/lib.rs
use rayon::iter::IntoParallelIterator;
use rayon::iter::ParallelIterator;

/// # Examples
///
/// ```
/// use bug::do_bug;
///
/// # fn main() {
/// bug::do_bug()
/// # }
/// ```
pub fn do_bug() {
  (0..1).into_par_iter().for_each(|_| {});
  (0..1).into_par_iter().for_each(|_| {});
  (0..1).into_par_iter().for_each(|_| {});
}

Reproducer run:

$ cargo test --doc
    Updating crates.io index
   Compiling autocfg v1.0.1
   Compiling crossbeam-utils v0.8.6
   Compiling lazy_static v1.4.0
   Compiling cfg-if v1.0.0
   Compiling crossbeam-epoch v0.9.6
   Compiling libc v0.2.112
   Compiling scopeguard v1.1.0
   Compiling rayon-core v1.9.1
   Compiling either v1.6.1
   Compiling memoffset v0.6.5
   Compiling rayon v1.5.1
   Compiling crossbeam-channel v0.5.2
   Compiling crossbeam-deque v0.8.1
   Compiling num_cpus v1.13.1
   Compiling bug v0.1.0 (/tmp/xx)
    Finished test [optimized + debuginfo] target(s) in 5.26s
   Doc-tests bug

running 1 test
test src/lib.rs - do_bug (line 6) ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 2.10s

$ cargo test --doc
    Finished test [optimized + debuginfo] target(s) in 0.01s
   Doc-tests bug

running 1 test
test src/lib.rs - do_bug (line 6) ... FAILED

failures:

---- src/lib.rs - do_bug (line 6) stdout ----
Test executable failed (terminated by signal).


failures:
    src/lib.rs - do_bug (line 6)

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 2.20s

error: test failed, to rerun pass '--doc'

Notes:

  • Sometimes it takes a few runs to crash.
  • Cargo.toml enables lto = "thin" and opt-level = 2, but be careful: rustdoc does not seem to pass opt-level to the test itself, only the lto part. That makes it a bit hard to reason about the optimisation mix.

Attempt at fetching backtrace:

Thread 2 "rust_out" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffb9a360640 (LWP 662570)]
0x00007ffb9a35f8e8 in ?? ()
(gdb) bt
#0  0x00007ffb9a35f8e8 in ?? ()
#1  0x00007ffb9a3605f8 in ?? ()
#2  0x00007ffb9a35fa30 in ?? ()
#3  0x0000556a1c6df25c in rayon_core::registry::WorkerThread::set_current (thread=0x7ffb9a35f500) at /home/slyfox/.cargo/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.1/src/registry.rs:636
#4  rayon_core::registry::main_loop (registry=..., index=0, worker=...) at /home/slyfox/.cargo/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.1/src/registry.rs:807
#5  rayon_core::registry::ThreadBuilder::run (self=...) at /home/slyfox/.cargo/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.1/src/registry.rs:55
#6  0x0000556a1c70206d in rayon_core::registry::{impl#2}::spawn::{closure#0} () at /home/slyfox/.cargo/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.1/src/registry.rs:100
#7  std::sys_common::backtrace::__rust_begin_short_backtrace<rayon_core::registry::{impl#2}::spawn::{closure#0}, ()> (f=...) at /build/rustc-1.57.0-src/library/std/src/sys_common/backtrace.rs:123
#8  0x0000556a1c6e493c in std::thread::{impl#0}::spawn_unchecked::{closure#1}::{closure#0}<rayon_core::registry::{impl#2}::spawn::{closure#0}, ()> () at /build/rustc-1.57.0-src/library/std/src/thread/mod.rs:483
#9  core::panic::unwind_safe::{impl#23}::call_once<(), std::thread::{impl#0}::spawn_unchecked::{closure#1}::{closure#0}> (self=..., _args=<optimized out>)
    at /build/rustc-1.57.0-src/library/core/src/panic/unwind_safe.rs:271
#10 0x0000556a1c6e34cf in std::panicking::try::do_call<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked::{closure#1}::{closure#0}>, ()> (data=<optimized out>)
    at /build/rustc-1.57.0-src/library/std/src/panicking.rs:403
#11 std::panicking::try<(), core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked::{closure#1}::{closure#0}>> (f=...) at /build/rustc-1.57.0-src/library/std/src/panicking.rs:367
#12 0x0000556a1c6fbb30 in std::panic::catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked::{closure#1}::{closure#0}>, ()> (f=...)
    at /build/rustc-1.57.0-src/library/std/src/panic.rs:133
#13 0x0000556a1c6da205 in std::thread::{impl#0}::spawn_unchecked::{closure#1}<rayon_core::registry::{impl#2}::spawn::{closure#0}, ()> () at /build/rustc-1.57.0-src/library/std/src/thread/mod.rs:482
#14 core::ops::function::FnOnce::call_once<std::thread::{impl#0}::spawn_unchecked::{closure#1}, ()> () at /build/rustc-1.57.0-src/library/core/src/ops/function.rs:227
#15 0x0000556a1c761515 in std::sys::unix::thread::Thread::new::thread_start ()
#16 0x00007ffb9a67cd40 in start_thread () from /nix/store/s9qbqh7gzacs7h68b2jfmn9l6q4jwfjz-glibc-2.33-59/lib/libpthread.so.0
#17 0x00007ffb9a46343f in clone () from /nix/store/s9qbqh7gzacs7h68b2jfmn9l6q4jwfjz-glibc-2.33-59/lib/libc.so.6

Echoing comment xiph/rav1e#2851 (comment) here:

"""
Specifically rayon-core-1.9.1/src/registry.rs:

    /// Sets `self` as the worker thread index for the current thread.
    /// This is done during worker thread startup.
    unsafe fn set_current(thread: *const WorkerThread) {
        WORKER_THREAD_STATE.with(|t| {
            assert!(t.get().is_null());
            t.set(thread);
        });
    }
...
unsafe fn main_loop(worker: Worker<JobRef>, registry: Arc<Registry>, index: usize) {
    let worker_thread = &WorkerThread {
        worker,
        fifo: JobFifo::new(),
        index,
        rng: XorShift64Star::new(),
        registry: registry.clone(),
    };
    WorkerThread::set_current(worker_thread);
...

Does Rust guarantee that the raw worker_thread will point to a fully constructed object? Or could the actual initialization move downward a bit?
"""

Thank you!

cuviper (Member) commented Jan 10, 2022

Does Rust guarantee that the raw worker_thread will point to a fully constructed object? Or could the actual initialization move downward a bit?

Yes it does -- and in response to one of the comments, no, unsafe fn doesn't relax that. Unsafety only lets you do more things, like deref pointers, but it doesn't change the semantics of safe code. Now, it's still possible that LLVM might reorder loads and stores, safe or not, but only if that is not otherwise observable. Setting the thread-local pointer makes that initialization visible, so I think it shouldn't be reordering that.

But it crashed within set_current, which doesn't even care about the contents of the pointer. That seems to me like some kind of corruption of the underlying thread-local state.
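
If the thread-local machinery itself is what's being miscompiled, a plain thread_local exercised from a spawned thread, with rayon taken out of the picture entirely, might be a useful probe under the same build settings (just a sketch, hypothetical names):

use std::cell::Cell;

thread_local! {
    // No raw pointers involved: just a plain per-thread counter.
    static COUNTER: Cell<usize> = Cell::new(0);
}

fn main() {
    std::thread::spawn(|| {
        for i in 0..4 {
            COUNTER.with(|c| c.set(c.get() + i));
        }
        // 0 + 1 + 2 + 3
        COUNTER.with(|c| assert_eq!(c.get(), 6));
    })
    .join()
    .unwrap();
}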

Is this reproducible outside of NixOS? e.g. on something more traditional like Fedora or Ubuntu?

cuviper (Member) commented Jan 10, 2022

Well, I reproduced it on Fedora 35, so it's not just a NixOS thing...

trofi (Author) commented Jan 10, 2022

Does Rust guarantee that the raw worker_thread will point to a fully constructed object? Or could the actual initialization move downward a bit?

Yes it does -- and in response to one of the comments, no, unsafe fn doesn't relax that. Unsafety only lets you do more things, like deref pointers, but it doesn't change the semantics of safe code. Now, it's still possible that LLVM might reorder loads and stores, safe or not, but only if that is not otherwise observable. Setting the thread-local pointer makes that initialization visible, so I think it shouldn't be reordering that.

But it crashed within set_current, which doesn't even care about the contents of the pointer. That seems to me like some kind of corruption of the underlying thread-local state.

Aha, thank you! I still have no idea how lifetimes work with raw pointers, hence the wild speculation.

Is this reproducible outside of NixOS? e.g. on something more traditional like Fedora or Ubuntu?

Tested the single-file reproducer from #911 (comment) on a default x86_64 Ubuntu 21.10 with today's Rust nightly. It crashes the same way with cargo test --doc.

cuviper (Member) commented Jan 11, 2022

This looks like a codegen bug -- I get this code for setting WORKER_THREAD_STATE:

000000000004f9a0 <_ZN3std6thread5local17LocalKey$LT$T$GT$4with17h6114d0cd4be760acE>:
   4f9a0:	48 83 ec 18          	sub    $0x18,%rsp
   4f9a4:	48 89 34 24          	mov    %rsi,(%rsp)
   4f9a8:	ff 17                	call   *(%rdi)
   4f9aa:	48 89 44 24 08       	mov    %rax,0x8(%rsp)
   4f9af:	48 83 f8 00          	cmp    $0x0,%rax
   4f9b3:	74 2c                	je     4f9e1 <_ZN3std6thread5local17LocalKey$LT$T$GT$4with17h6114d0cd4be760acE+0x41>
   4f9b5:	48 8b 44 24 08       	mov    0x8(%rsp),%rax
   4f9ba:	48 8b 00             	mov    (%rax),%rax
   4f9bd:	48 83 f8 00          	cmp    $0x0,%rax
            assert!(t.get().is_null());
   4f9c1:	74 48                	je     4fa0b <_ZN3std6thread5local17LocalKey$LT$T$GT$4with17h6114d0cd4be760acE+0x6b>
   4f9c3:	48 8d 3d e6 53 09 00 	lea    0x953e6(%rip),%rdi        # e4db0 <anon.ac26955e2c89c5d8e44f7f86cf1eaf1c.7.llvm.6007769190384261862>
   4f9ca:	48 8d 15 3f f8 0a 00 	lea    0xaf83f(%rip),%rdx        # ff210 <anon.ac26955e2c89c5d8e44f7f86cf1eaf1c.8.llvm.6007769190384261862>
   4f9d1:	48 8d 05 88 c6 fc ff 	lea    -0x33978(%rip),%rax        # 1c060 <_ZN4core9panicking5panic17h0ba7146865b2f9d6E>
   4f9d8:	be 23 00 00 00       	mov    $0x23,%esi
   4f9dd:	ff d0                	call   *%rax
   4f9df:	0f 0b                	ud2    
   4f9e1:	48 8d 3d a4 52 09 00 	lea    0x952a4(%rip),%rdi        # e4c8c <anon.ac26955e2c89c5d8e44f7f86cf1eaf1c.0.llvm.6007769190384261862>
   4f9e8:	48 8d 0d e9 f7 0a 00 	lea    0xaf7e9(%rip),%rcx        # ff1d8 <anon.ac26955e2c89c5d8e44f7f86cf1eaf1c.3.llvm.6007769190384261862>
   4f9ef:	4c 8d 05 ca f7 0a 00 	lea    0xaf7ca(%rip),%r8        # ff1c0 <anon.ac26955e2c89c5d8e44f7f86cf1eaf1c.2.llvm.6007769190384261862>
   4f9f6:	48 8d 05 13 ca fc ff 	lea    -0x335ed(%rip),%rax        # 1c410 <_ZN4core6result13unwrap_failed17h32ef6b3156e8fc57E>
   4f9fd:	be 46 00 00 00       	mov    $0x46,%esi
   4fa02:	48 8d 54 24 10       	lea    0x10(%rsp),%rdx
   4fa07:	ff d0                	call   *%rax
   4fa09:	0f 0b                	ud2    
   4fa0b:	48 8b 04 24          	mov    (%rsp),%rax
   4fa0f:	48 8b 7c 24 08       	mov    0x8(%rsp),%rdi
            t.set(thread);
   4fa14:	48 8b 30             	mov    (%rax),%rsi
   4fa17:	66 0f 1f 84 00 00 00 	nopw   0x0(%rax,%rax,1)
   4fa1e:	00 00 

000000000004fa20 <_ZN4core3ptr52drop_in_place$LT$std..thread..local..AccessError$GT$17hac4d05594a295cb7E.llvm.6007769190384261862>:
   4fa20:	c3                   	ret    
   4fa21:	66 2e 0f 1f 84 00 00 	cs nopw 0x0(%rax,%rax,1)
   4fa28:	00 00 00 
   4fa2b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)

I stepped through, and we do execute the tail end there around t.set(thread), and then execution just falls off into the next function (after a NOP for padding). That next function is an immediate ret, but since we never ran the stack epilogue that should undo the sub $0x18,%rsp prologue (presumably an add $0x18,%rsp before the ret), the stack pointer is not pointing at a return address. So it "returns" to garbage, which is why gdb's backtrace has a few bad frames at the top.

I tried cargo-bisect-rustc and it pointed me to the cargo update in rust-lang/rust#89802. It seems that's when LTO settings started being applied to rustdoc. Among its mentions, the issue rust-lang/rust#91671 seems quite related, also involving rayon. That one is a compiler crash in LLVM rather than the runtime crash we have here, but they might share the same root cause.

trofi (Author) commented Jan 12, 2022

In case it's of any use, here is a smaller reproducer extracted from comment 1; it uses "only" rayon-core:

# Cargo.toml:
[package]
name = "bug"
version = "0.1.0"
edition = "2018"

[dependencies]
rayon-core = ">= 1.9"

[profile.dev]
debug = true
incremental = true
lto = "thin"
opt-level = 2
debug-assertions = false
overflow-checks = false
codegen-units = 256
// src/lib.rs:
use rayon_core::current_num_threads;

#[inline(never)]
pub fn bridge()
{
    let /* triggers bug */ _splits = crate::current_num_threads();
}

/// # Examples
///
/// ```
/// use bug::do_bug;
///
/// # fn main() {
/// bug::do_bug()
/// # }
/// ```
pub fn do_bug() {
  bridge();
  bridge();
  bridge();
  bridge();
}

Reproducer:

$ cargo test --doc
    Updating crates.io index
   Compiling autocfg v1.0.1
   Compiling crossbeam-utils v0.8.6
   Compiling lazy_static v1.4.0
   Compiling cfg-if v1.0.0
   Compiling libc v0.2.112
   Compiling crossbeam-epoch v0.9.6
   Compiling scopeguard v1.1.0
   Compiling rayon-core v1.9.1
   Compiling memoffset v0.6.5
   Compiling crossbeam-channel v0.5.2
   Compiling crossbeam-deque v0.8.1
   Compiling num_cpus v1.13.1
   Compiling bug v0.1.0 (/tmp/y)
    Finished test [optimized + debuginfo] target(s) in 2.49s
   Doc-tests bug

running 1 test
test src/lib.rs - do_bug (line 11) ... FAILED

failures:

---- src/lib.rs - do_bug (line 11) stdout ----
Test executable failed (terminated by signal).


failures:
    src/lib.rs - do_bug (line 11)

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.60s

error: test failed, to rerun pass '--doc'

I'll spend some time extracting the reproducer from rayon-core to get an idea of what it takes to produce the bad code. If I succeed, I'll report it to the existing Rust issue you linked.

trofi (Author) commented Jan 13, 2022

Filed a new issue against Rust: rust-lang/rust#92869

cuviper (Member) commented Jan 13, 2022

Thanks! Let's close this as a compiler issue, not rayon's fault. :)
