Sporadic panic in async-io thread with esp-idf-svc async TLS example #84

jothan opened this issue Mar 21, 2024 · 16 comments

jothan commented Mar 21, 2024

Code: https://github.com/esp-rs/esp-idf-svc/blob/4fff46bba1be66ae3c6945d3b8bda30a589f6f6b/examples/tls_async.rs

Backtrace:
backtrace.txt

Crate versions:
Cargo.lock

I am running the example on an ESPcam board with an ESP32 (revision v3.0) Xtensa module (a Tinker AI ESP32-S, to be exact).

The example usually crashes once or twice after reset before working.

I've only managed to reproduce this with an optimized debug build with codegen-units=1.

[profile.dev]
debug = true
opt-level = "s"
codegen-units = 1
@Vollbrecht

> I've only managed to reproduce this with an optimized debug build with codegen-units=1.

To be more concrete: you never saw it before applying this optimization?

jothan commented Apr 9, 2024

> > I've only managed to reproduce this with an optimized debug build with codegen-units=1.
>
> To be more concrete: you never saw it before applying this optimization?

To be honest, it seems to depend on code layout. It might trigger under other conditions; I have not managed to narrow it down. It smells like a race condition of some sort.

notgull commented Apr 9, 2024

cc @ivmarkov Any idea why this would be happening?

notgull transferred this issue from smol-rs/async-io on Apr 10, 2024
notgull commented Apr 10, 2024

The issue actually appears to be in our locking implementation, specifically OnceCell.

@ivmarkov

> cc @ivmarkov Any idea why this would be happening?

No idea. Will try to reproduce once I'm back.

jothan commented Apr 24, 2024

@ivmarkov have you had any luck reproducing this? Let me know if you need further input and testing on my part.

@ivmarkov

Couldn't get to it yet. Sorry about that! Very likely I will try to do something related to async-io in the next couple of days.

npmenard commented Apr 24, 2024

I am running into the same issue described by @jothan.

I have done some investigation, and I believe the issue boils down to the pinning of event_listener. In this case the address generated is not 32-bit aligned (e.g. 0x3ffe860f; not sure if this matters) but, more importantly, it points back to an area of the stack where the first byte is not 0x00. Using GDB I can see that the value it points to is 0x3f (which is the MSB of an address used during the creation of the thread); since the remaining bytes are 0, maybe 0x3ffe8610 would be the correct pinned address?
Interestingly, if I add a dummy read before pinning, like:

let event_listener = EventListener::new();
let _ = event_listener.is_listening(); // dummy read added before pinning
pin!(event_listener);

the pinned address will be 32-bit aligned, pointing to a stack area where all bytes are 0, and the bug doesn't happen.

For reference, here are the two assembly dumps of the function:
not working
working

Hope that helps @ivmarkov

ivmarkov commented May 1, 2024

@npmenard Thank you for the detailed analysis!

I can reproduce the issue (though I haven't connected with GDB to examine the address), and I confirm that the dummy listener.is_listening(); call fixes the issue for me too - but only in the more trivial case I have. In a more complex setup I get another crash - something related to thread locals - but let's put that aside for a moment.

(Also, not sure if it matters, but I'm consistently reproducing the issue with opt-levels "s" and "z"; I haven't tried other opt levels yet, but will do.)

As for:

> I have done some investigation, and I believe the issue boils down to the pinning of event_listener. In this case the address generated is not 32-bit aligned (e.g. 0x3ffe860f; not sure if this matters)

I assume by address 0x3ffe860f you mean the &mut EventListener which is actually stored in the Pin struct?
Yes, not being 32-bit aligned sounds suspicious, as long as the EventListener contains (recursively) at least one integer type which is 4 bytes wide, as in i32 or u32. If that's the case, the whole structure needs to be aligned to 4 bytes of course, or else the Xtensa CPU would generate a trap.
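
As a quick illustration of the alignment point (a minimal, hypothetical sketch - Listener stands in for whatever EventListener actually contains; it is not the real layout):

use core::mem::align_of;

// Hypothetical stand-in: any struct that (transitively) contains a
// 4-byte integer is itself at least 4-byte aligned.
struct Listener {
    state: u32, // imagine this buried somewhere inside EventListener
}

fn main() {
    assert_eq!(align_of::<Listener>(), 4);
    let l = Listener { state: 0 };
    // Hence a valid &Listener can never end in 0xf, as 0x3ffe860f does:
    assert_eq!(&l as *const Listener as usize % align_of::<Listener>(), 0);
    assert_eq!(l.state, 0);
}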

> but, more importantly, it points back to an area of the stack where the first byte is not 0x00. Using GDB I can see that the value it points to is 0x3f (which is the MSB of an address used during the creation of the thread); since the remaining bytes are 0, maybe 0x3ffe8610 would be the correct pinned address?

Even more suspicious. This sounds a bit like a bug in the LLVM xtensa backend.

@MabezDev - sorry for pulling you here, but do you think you and/or Andrei can help? There are assembly dumps attached to the previous comment.
I can also try to simplify and provide a small(er) test case, though with async-io in the picture, it would be an ESP IDF IO example in the end...

npmenard commented May 3, 2024

> I assume by address 0x3ffe860f you mean the &mut EventListener which is actually stored in the Pin struct?
> Yes, not being 32-bit aligned sounds suspicious, as long as the EventListener contains (recursively) at least one integer type which is 4 bytes wide, as in i32 or u32. If that's the case, the whole structure needs to be aligned to 4 bytes of course, or else the Xtensa CPU would generate a trap.

It's the address that would be printed when formatting event_listener with {:p} after it was pinned, for example:

log::info!("pin_ev_ptr {:p}", event_listener);

In my case the CPU doesn't generate a trap when loading from an unaligned address. To illustrate, see the is_listening assembly dump and the register state when we check self.listener.listener.is_some() (l32i.n a2, a2, 0). Register a2 contains the unaligned address, and once the load executes it contains 0x3f when it should be 0.

I have noticed something odd, though I am not sure if it will shed some light: whenever the bug occurs, the two values printed by

log::info!("pin_ev_ptr {:p} {:x}", event_listener, *(&(&**(&event_listener.pointer))) as *const _ as u32);

are different.

Lastly, I am attaching a backtrace showing the code path that writes to the location on the stack where, ultimately, the Pin will point. It happens when a WindowOverflow8 exception is triggered while setting a thread-local variable.

Hopefully this helps a bit. Let me know if there is anything else I can do to help.

jothan commented May 13, 2024

With the new Rust 1.78 for ESP32 (in pre-release in the rust-build project), debug assertions for std are now enabled, and they seem to catch a similar alignment problem with an assertion in core::ptr::read::precondition_check (a sketch of what this check asserts follows the backtrace below).

Full backtrace: backtrace.txt

Partial backtrace:

Guru Meditation Error: Core  0 panic'ed (Unhandled debug exception).                                                                                          
Debug exception reason: BREAK instr                                                                                                                           
Core  0 register dump:                                                                                                                                        
PC      : 0x400f0d06  PS      : 0x00060c36  A0      : 0x800f0ab0  A1      : 0x3ffcdfc0                                                                        
0x400f0d06 - core::panicking::panic_nounwind_fmt::runtime                                                                                                     
    at /home/user/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/panicking.rs:91                                                                
A2      : 0x3ffce060  A3      : 0x00000040  A4      : 0x000001c0  A5      : 0x3ffce220                                                                        
A6      : 0x00060b23  A7      : 0x00060b20  A8      : 0x800f0d01  A9      : 0x3ffcdfa0                                                                        
A10     : 0x00000000  A11     : 0x00000040  A12     : 0x000000f8  A13     : 0x3ffce4e0                                                                        
A14     : 0x3fff3fc8  A15     : 0x400f0cf4  SAR     : 0x00000018  EXCCAUSE: 0x00000001                                                                        
0x400f0cf4 - core::ptr::read::precondition_check                                                                                                              
    at /home/user/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/intrinsics.rs:2797                                                             
EXCVADDR: 0x00000000  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0x00000000                                                                        
                                                                                                                                                              
                                                                                                                                                              
Backtrace: 0x400f0d03:0x3ffcdfc0 0x400f0aad:0x3ffcdfe0 0x400f0beb:0x3ffce5c0 0x400ee62d:0x3ffce860 0x400f14bb:0x3ffce880 0x400f148d:0x3ffce8b0 0x400d6e7e:0x3ffce8e0 0x400e3afe:0x3ffce930 0x400decc1:0x3ffced20 0x401a1419:0x3ffcf150 0x400d6a52:0x3ffcf170 0x400d479b:0x3ffcf230 0x400d6450:0x3ffcf350 0x400e82f1:0x3ffcf5a0 0x400da0c2:0x3ffcf880 0x400da994:0x3ffcf8a0 0x4011b067:0x3ffcf8d0 0x4011b04e:0x3ffcf8f0 0x40120ff3:0x3ffcf910 0x4012cf40:0x3ffcf930
0x400f0d03 - core::ptr::read::precondition_check                                                                                                              
    at /home/user/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:1282                                                                
0x400f0aad - core::ptr::read                                                                                                                                  
    at /home/user/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:1276                                                                
0x400f0beb - async_lock::once_cell::OnceCell<T>::get_or_try_init_blocking                                                                                     
    at /home/user/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-lock-3.3.0/src/once_cell.rs:441                                                  
0x400ee62d - async_lock::once_cell::OnceCell<T>::get_or_init_blocking         
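
For reference, a rough sketch of what that precondition amounts to (an illustration only, not the actual libcore code): with debug assertions enabled, core::ptr::read asserts that the source pointer is non-null and properly aligned before reading, so a misaligned stack address trips it immediately.

use core::mem::align_of;

// Illustration of the Rust 1.78 debug precondition on core::ptr::read
// (not the real libcore source).
fn assert_read_precondition<T>(p: *const T) {
    assert!(!p.is_null(), "ptr::read requires a non-null pointer");
    assert_eq!(
        p as usize % align_of::<T>(),
        0,
        "ptr::read requires a properly aligned pointer"
    );
}

fn main() {
    let x: u32 = 42;
    assert_read_precondition(&x as *const u32); // an aligned stack slot passes
}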

jothan commented May 13, 2024

Looking at the backtraces, it would seem that OnceCell does something weird that causes EventListener trouble down the line.

@ivmarkov

> Looking at the backtraces, it would seem that OnceCell does something weird that causes EventListener trouble down the line.

Why would OnceCell be doing something weird - and moreover, why would this something only happen with ESP IDF + Xtensa and not on x86 or (likely) ARM? I still find it more likely that the root cause is a miscompilation bug in the LLVM Xtensa backend.

Will open a bug at the xtensa llvm fork shortly btw.

And will test (again) with riscv32imc on the esp idf, though I'm relatively sure the issue did not happen there.

jothan commented May 13, 2024

@ivmarkov "Doing something weird" in this case means triggering the miscompilation. I don't doubt your analysis.

@ivmarkov

Other than dereferencing a raw pointer, I don't see anything else weird, unfortunately...
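
For reference, the raw-pointer pattern in question looks roughly like this (a simplified sketch, not the actual async-lock source; the type and field names are illustrative):

use core::mem::MaybeUninit;
use core::ptr;

// Simplified once-cell-style slot; the real code tracks initialization
// with atomics and synchronizes concurrent initializers.
struct OnceSlot<T> {
    initialized: bool,
    value: MaybeUninit<T>,
}

impl<T> OnceSlot<T> {
    fn new(v: T) -> Self {
        Self {
            initialized: true,
            value: MaybeUninit::new(v),
        }
    }

    fn take(self) -> Option<T> {
        if self.initialized {
            // SAFETY: the slot was initialized in `new`. With debug
            // assertions (Rust 1.78+), ptr::read first checks that the
            // pointer is non-null and aligned - the assertion seen in
            // the backtrace above.
            Some(unsafe { ptr::read(self.value.as_ptr()) })
        } else {
            None
        }
    }
}

fn main() {
    assert_eq!(OnceSlot::new(7u32).take(), Some(7));
}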

@ivmarkov

I know there is little to no progress here, for which I apologize. In the meantime, what I found to work for me is setting the version of async-io back to "2.0.0" (as in async-io = "=2.0.0").
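
That is, in Cargo.toml:

[dependencies]
async-io = "=2.0.0" # "=" pins the exact version, preventing semver upgrades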

We'll resume the work on finding the root cause shortly.
