Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backtraces in Rust on ESP-IDF #76

Open
ivmarkov opened this issue Aug 30, 2021 · 8 comments
Open

Backtraces in Rust on ESP-IDF #76

ivmarkov opened this issue Aug 30, 2021 · 8 comments

Comments

@ivmarkov
Copy link

ivmarkov commented Aug 30, 2021

Background:

Rust has a very useful utility called backtraces. Unlike Rust panics which are an abnormal situation - but still unwind the stack on big iron; for embedded, usually the panic_abort strategy is chosen -
backtraces are invoked programmatically, and allow one to capture programmatically and possibly display the stack of the thread which invoked the backtracing logic.

Usually, this is used in combination with Rust error handling, where some of the error libs, like the popular anyhow crate can be instructed to capture backtraces on first anyhow error construction.
(Backtrace capturing is not (yet?) enabled for standard Rust STD error types though.)

Enhancement:

Ideally, we should be working on filing a PR against the backtrace-rs Rust crate (also used as a GIT submodule in Rust STD) to support IP-based backtraces under ESP-IDF. (While symbolization of backtrace IP pointers to function names and source file directly by the ESP-IDF flashed firmware is not practical due to size constraints, these can be decoded later with UART monitor filter that has access to the ELF file.)
Couple of issues with that:

  • Xtensa
  • Q: While programmatic capturing of backtrace is supported with esp_backtrace_get_start and esp_backtrace_get_next_frame, the logic seems to capture "ghost" Rust IPs, which surely do not belong to the current call stack. Could it be that these are only reliable for C/GCC-generated code?
  • RiscV
    • The situation here seems to be unclear in that the above calls are not supported, and given that GCC (and Rust) compile code for RiscV targets without frame pointer support (FP) by default, reliable capture of backtrace IPs might never be possible (that is, since the eh_frame-based ESP-IDF configuration is advised not to be used in production due to huge code size of the flash binary)
    • Q: Would a programmatic invocation of stack memory capture (as used currently during pancis on RiscV) be a feasible replacement? The stack memory can be later decoded with - say - GDB?
    • Q: Given that RiscV in ESP-IDF master now also has core-dump support, what is the future of the "stack memory dump" approach? Would it become a dump of a minimal elf in coredump format which only contains the stack memory? So that the "gdb server" hack in ESP-IDF Monitor can be retired?
@N3xed
Copy link

N3xed commented Aug 30, 2021

I have some experience with backtraces on xtensa, I even ported the esp-idf's backtrace facility to rust here. And this seemed to work just fine. I'm not sure what you mean with "ghost Rust IPs".

In the no_std case printing such a backtrace can be very easily accomplished by setting your own panic_handler (which I've done successfully), and I think in the std case there is std::panic::set_hook for this purpose.

@ivmarkov
Copy link
Author

I have some experience with backtraces on xtensa, I even ported the esp-idf's backtrace facility to rust here. And this seemed to work just fine. I'm not sure what you mean with "ghost Rust IPs".

As usual, I can't reproduce it when I need it. But I do have some screenshots from when it used to happen.
What I mean by "ghost IPs" is that the ESP backtrace algorithm seems to detect IPs (or PCs) which are in the stack memory - this is my hypothesis - from previous callstacks, from previous functions' invocations. These are not necessarily overwritten with new data simply because of how Rust works - where it often returns big chunks of data on the stack. In these cases - I assume - the caller is just bumping down the SP to make room for the data to be returned, without zeroing out the relevant stack memory.

By the way, another problem that I see way too often (even right now) is "corrupted" backtraces, where the bottom part of the backtrace is not available, because the stack is "corrupted". Well, I'm pretty sure there is nothing corrupted there, because the threads are running with 10K+ stack sizes, which leads me to believe that the Xtensa backtracing algorithm is simply not very reliable, especially with Rust.

In the no_std case printing such a backtrace can be very easily accomplished by setting your own panic_handler (which I've done successfully), and I think in the std case there is std::panic::set_hook for this purpose.

But the problem is not at all how to get a backtrace on panics. At least, not with ESP-IDF that is, as the panics are handled by ESP-IDF automatically, and it dumps either a backtrace (Xtensa) or stack memory (RiscV).

The problem is how to implement - programmatically - a backtracing (or stack dump actually) on demand for RiscV, as that functionality is not available.

For stack dump, might be as simple as:

fn stack_dump(magic: u8) {
    let stack_addr = &magic as *const u8;
    // We have a pointer to the stack in stack_addr now, read from it the upper 1024K bytes and print these
   ... 
}

However, the above ideally should be an ESP-IDF API.

@ivmarkov
Copy link
Author

And the other problem is whether the "stack memory dump" is the way forward for RiscV, and whether it will remain just a binary dump of stack memory, or whether it will be wrapped in an elf coredump executable first. The latter would allow Espressif folks to get rid of the "GDB server" approach - if you look at what they are doing with the stack memory dump in the ESP IDF monitor...

@ivmarkov
Copy link
Author

@igrr ^^^

@igrr
Copy link

igrr commented Aug 31, 2021

The problem is how to implement - programmatically - a backtracing (or stack dump actually) on demand for RiscV, as that functionality is not available.

I think there are basically 3 options for this on RISC-V:

  • Include DWARF sections into the application binary and decode the CIEs and FDEs. There is support for that in ESP-IDF (CONFIG_ESP_SYSTEM_USE_EH_FRAME). The binary becomes 20-100% larger (which may or may not be a big issue depending on how large it was before that).
  • Having a GDB-server-like communication running between the host and the device, where the device can ask the host to do a backtrace on demand. This is probably conditional on having some JTAG connection running.
  • Compile everything with -fno-omit-frame-pointer, then backtracing is possible without DWARF information. Note though that ESP32-C3 ROM code and toolchain libraries (libc, libm, libstdc++) haven't been compiled with this option, so if there is a ROM or libc function in the callstack, the backtrace might get interrupted.

On Xtensa things are a lot easier since the ABI guarantees we can always find the previous frame, provided that no funny things like exceptions interrupt the call stack. And even then, if the exception/interrupt handler is careful enough in creating a frame on the stack, the backtrace can continue.

@ivmarkov
Copy link
Author

  • Include DWARF sections into the application binary and decode the CIEs and FDEs. There is support for that in ESP-IDF (CONFIG_ESP_SYSTEM_USE_EH_FRAME). The binary becomes 20-100% larger (which may or may not be a big issue depending on how large it was before that).

Might be an option indeed, need to try out and see how much the size grows. Was hoping we won't get there.

  • Having a GDB-server-like communication running between the host and the device, where the device can ask the host to do a backtrace on demand. This is probably conditional on having some JTAG connection running.

By "GDB server" I meant this hack actually, where the app outputs either a raw stack memory dump (then the GDB server hack is necessary) or the app outputs an elf-encoded coredump which contains only one elf section - the one that contains the stack memory of the freertos task that panicked/programmatically requested the stack dump.

  • Compile everything with -fno-omit-frame-pointer, then backtracing is possible without DWARF information. Note though that ESP32-C3 ROM code and toolchain libraries (libc, libm, libstdc++) haven't been compiled with this option, so if there is a ROM or libc function in the callstack, the backtrace might get interrupted.

This implies that Rust code also needs to be compiled with -fno-omit-frame-pointer. no big deal as we are using cargo build-std for now, which recompiles everything anyway.

What is more worrying is this GCC bug report which seems to imply that
-fno-omit-frame-pointer has no effect for RiscV.

On Xtensa things are a lot easier since the ABI guarantees we can always find the previous frame, provided that no funny things like exceptions interrupt the call stack. And even then, if the exception/interrupt handler is careful enough in creating a frame on the stack, the backtrace can continue.

In theory, yes. In practice, I often see partial stack frames where the bottom of the stack is lost with the infamous <- CORRUPTED flag. Or even "ghost" frames that are surely not there. Are we sure that the Rust code itself is creating proper frames, always, and that the frame traversal algorithm in ESP-IDF is resilient enough?

@igrr
Copy link

igrr commented Aug 31, 2021

Are we sure that the Rust code itself is creating proper frames, always, and that the frame traversal algorithm in ESP-IDF is resilient enough?

There's not much of an algorithm, really — get the SP from the current frame, use it to read the next frame, repeat until finished. Regarding the Rust compiler creating proper frames — I don't know, I was hoping that would handled by LLVM. If you manage to find a reproducer, please share it and I'll take a look.

@ivmarkov
Copy link
Author

Might be difficult as Rust is involved, but here it goes:

  • Clone rust-esp32-std-hello from Git
  • Clone esp-idf-svc from Git
  • Insert unsafe { esp_idf_sys::abort() }; in esp-idf-svc, here.
  • Wire rust-esp32-std-hello to the patched esp-idf-svc by inserting esp-idf-svc = { path = "../esp-idf-svc" } here.
  • Download the Latest Rust compiler binaries and Espressif clang

Partial backtraces in debug:

  • Build rust-esp32-std-hello in debug with cargo build
  • Flash with espflash /dev/tty.usb... target/xtensa-esp32-espidf/debug/rust-esp32-std-hello (Might need to install espflash first with cargo install espflash)
  • Run the ESP-IDF monitor (I run something slightly different but that doesn't matter)
  • You get a partial backtrace:
abort() was called at PC 0x400da779 on core 0
=> 0x400da779: esp_idf_svc::wifi::EspWifi::do_get_scan_infos at /Users/imarkov/projects/esp32/esp-idf-svc/src/wifi.rs:631

Backtrace:0x4008832e:0x3ffbc340 0x40088b3d:0x3ffbc360 0x40090586:0x3ffbc380 0x400da779:0x3ffbc3f0 0x400db084:0x3ffbc438 |<-CORRUPTED
=> 0x4008832e: panic_abort at /Users/imarkov/.platformio/packages/framework-espidf/components/esp_system/panic.c:356
=> 0x40088b3d: esp_system_abort at /Users/imarkov/.platformio/packages/framework-espidf/components/esp_system/system_api.c:112
=> 0x40090586: abort at /Users/imarkov/.platformio/packages/framework-espidf/components/newlib/abort.c:46
=> 0x400da779: esp_idf_svc::wifi::EspWifi::do_get_scan_infos at /Users/imarkov/projects/esp32/esp-idf-svc/src/wifi.rs:631
=> 0x400db084: <esp_idf_svc::wifi::EspWifi as embedded_svc::wifi::Wifi>::scan at /Users/imarkov/projects/esp32/esp-idf-svc/src/wifi.rs:870

Missing are the stackframes from Rust src/main.rs as well as the stack frames from the ESP-IDF app bootstrapping process (app_main & stuff from esp-idf-svc related to bootstrapping the Rust main function)

Partial backtraces + ghost frames in release:

  • Build rust-esp32-std-hello in release with cargo build --release
  • Flash with espflash /dev/tty.usb... target/xtensa-esp32-espidf/release/rust-esp32-std-hello (Might need to install espflash first with cargo install espflash)
  • Run the ESP-IDF monitor (I run something slightly different but that doesn't matter)
  • You get a backtrace which contains frames that should not be there:
abort() was called at PC 0x400d77f5 on core 0
=> 0x400d77f5: std::thread::spawn at /Users/imarkov/projects/esp32/rust-esp154/rust/build/x86_64-apple-darwin/stage2/lib/rustlib/src/rust/library/std/src/thread/mod.rs:621

Backtrace:0x4008832a:0x3ffbaa00 0x40088b39:0x3ffbaa20 0x40090582:0x3ffbaa40 0x400d77f5:0x3ffbaab0 0x400d806d:0x3ffbaaf8 |<-CORRUPTED
=> 0x4008832a: panic_abort at /Users/imarkov/.platformio/packages/framework-espidf/components/esp_system/panic.c:356
=> 0x40088b39: esp_system_abort at /Users/imarkov/.platformio/packages/framework-espidf/components/esp_system/system_api.c:111
=> 0x40090582: abort at /Users/imarkov/.platformio/packages/framework-espidf/components/newlib/abort.c:42 (discriminator 1)
**??? really???*** => 0x400d77f5: std::thread::spawn at /Users/imarkov/projects/esp32/rust-esp154/rust/build/x86_64-apple-darwin/stage2/lib/rustlib/src/rust/library/std/src/thread/mod.rs:621
**??? really???*** => 0x400d806d: core::char::methods::encode_utf8_raw at /Users/imarkov/projects/esp32/rust-esp154/rust/build/x86_64-apple-darwin/stage2/lib/rustlib/src/rust/library/core/src/char/methods.rs:1627
      (inlined by) core::char::methods::<impl char>::encode_utf8 at /Users/imarkov/projects/esp32/rust-esp154/rust/build/x86_64-apple-darwin/stage2/lib/rustlib/src/rust/library/core/src/char/methods.rs:637
      (inlined by) alloc::string::String::push at /Users/imarkov/projects/esp32/rust-esp154/rust/build/x86_64-apple-darwin/stage2/lib/rustlib/src/rust/library/alloc/src/string.rs:1142
      (inlined by) <alloc::string::String as core::fmt::Write>::write_char at /Users/imarkov/projects/esp32/rust-esp154/rust/build/x86_64-apple-darwin/stage2/lib/rustlib/src/rust/library/alloc/src/string.rs:2717

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
No open projects
Development

No branches or pull requests

4 participants