Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression between 0.33.2 and 0.33.1 #235

Open
crespum opened this issue Aug 24, 2023 · 25 comments
Open

Regression between 0.33.2 and 0.33.1 #235

crespum opened this issue Aug 24, 2023 · 25 comments

Comments

@crespum
Copy link

crespum commented Aug 24, 2023

I'm using sunrise crate to do some computation. Everything works fine with esp-idf-sys="0.33.1" but the program crashes here (it's just a bunch of float math operations) with version 0.33.2.

Here's the coredump of the thread that crashes:

==================== THREAD 1 (TCB: 0x3fc9d5d4, name: 'main') =====================
#0  0x420ea5a2 in fmod (x=<optimized out>, y=<optimized out>) at /builds/idf/crosstool-NG/.build/riscv32-esp-elf/src/newlib/newlib/libm/math/w_fmod.c:70
#1  0x420127a4 in sunrise::anomaly::solar_mean_anomaly (day=<optimized out>) at src/anomaly.rs:28
#2  sunrise::sunrise::sunrise_sunset (latitude=42.232819999999997, longitude=-8.7226400000000002, year=<optimized out>, month=<optimized out>, day=<optimized out>) at src/sunrise.rs:41
#3  0x4200e5bc in dark_sky_meter::night_checker::get_night_checker_data (time=..., latitude=42.232819999999997, longitude=-8.7226400000000002) at src/night_checker.rs:38
#4  0x42007ac4 in dark_sky_meter::main () at src/main.rs:148

I didn't manage to track the bug deeper. Any idea of what could be going on with esp-idf-sys?

@punkto
Copy link

punkto commented Sep 13, 2023

Could this be related to issue "crosstool-NG: math & printf related errors (IDFGH-4643)"?

@ivmarkov
Copy link
Collaborator

I don't think this is in anyway related to esp-idf-sys. If you are still having the issue, can you please open a bug against the esp-rs/rust repo and mention that you have miscompilation issues with xtensa? (You have not mentioned xtensa, but I'm assuming this is the MCU you are using.)

@crespum
Copy link
Author

crespum commented Oct 17, 2023

I'm actually using an ESP32-C3 (RISC-V). Anyway, I have to test the new release of esp-idf-sys and see if that fixes the issue.

@ivmarkov
Copy link
Collaborator

But the argument remains - how can esp-idf-sys possibly interfere with your pure Rust calculations? I think you are looking at the wrong direction.

@crespum
Copy link
Author

crespum commented Oct 17, 2023

I totally agree with you. I just observed it happen with an specific esp-idf-sys version, that's why I decided to open the issue in this repo. Sorry if it's not the right place.

@ivmarkov
Copy link
Collaborator

Yes it is not the right repo so I closed the bug. But since the conversation had started:

==================== THREAD 1 (TCB: 0x3fc9d5d4, name: 'main') =====================
#0  0x420ea5a2 in fmod (x=<optimized out>, y=<optimized out>) at /builds/idf/crosstool-NG/.build/riscv32-esp-elf/src/newlib/newlib/libm/math/w_fmod.c:70
#1  0x420127a4 in sunrise::anomaly::solar_mean_anomaly (day=<optimized out>) at src/anomaly.rs:28
#2  sunrise::sunrise::sunrise_sunset (latitude=42.232819999999997, longitude=-8.7226400000000002, year=<optimized out>, month=<optimized out>, day=<optimized out>) at src/sunrise.rs:41
#3  0x4200e5bc in dark_sky_meter::night_checker::get_night_checker_data (time=..., latitude=42.232819999999997, longitude=-8.7226400000000002) at src/night_checker.rs:38
#4  0x42007ac4 in dark_sky_meter::main () at src/main.rs:148

How did you manage to take this coredump? It does not look like the standard backtrace that ESP-IDF generates?

@crespum
Copy link
Author

crespum commented Oct 17, 2023

@ivmarkov
Copy link
Collaborator

Well, other than increasing the stack of the task/thread doing the computation, I don't have other easy to try solutions. If stack size increase does not help, you need to fork the upstream crate and play with it by commenting bits and pieces of the computation until you figure out why it is crashing.

@crespum
Copy link
Author

crespum commented Jan 10, 2024

I'm trying to debug this issue as we can't update esp-idf-sys anymore. So far I've replicated the issue with a minimal example and using a local copy of the sunrise crate. It looks like it has to do with libm (newlib) and the heap or the stack, I'm not sure. I've increased the stack of the main task but it's still crashing.

Now I was trying to identify which commit between 0.33.1 and 0.33.2 causes the exception. I've downloaded the repo, replaced Cargo.toml with esp-idf-sys = { path = "../esp-idf-sys", features = ["binstart"] }, but I can't compile the project:

error: failed to select a version for `esp-idf-sys`.
    ... required by package `esp-idf-hal v0.41.2`
    ... which satisfies dependency `esp-idf-hal = "^0.41.2"` of package `dark-sky-meter v0.1.0 (/home/dark-sky-meter-fw)`
versions that meet the requirements `^0.33` are: 0.33.7, 0.33.6, 0.33.5, 0.33.4, 0.33.3, 0.33.2, 0.33.1, 0.33.0

the package `esp-idf-sys` links to the native library `esp_idf`, but it conflicts with a previous package which links to `esp_idf` as well:
package `esp-idf-sys v0.33.1 (/home/esp-idf-sys)`
    ... which satisfies path dependency `esp-idf-sys` of package `dark-sky-meter v0.1.0 (/home/dark-sky-meter-fw)`
Only one package in the dependency graph may specify the same links value. This helps ensure that only one copy of a native library is linked in the final binary. Try to adjust your dependencies so that only one package uses the links ='esp-idf-sys' value. For more information, see https://doc.rust-lang.org/cargo/reference/resolver.html#links.

failed to select a version for `esp-idf-sys` which could resolve this conflict

Is there a guide I can follow to use a local copy of esp-idf-sys?

@Vollbrecht
Copy link
Collaborator

if you want to test out a local git branch you need to use the [patch.crates-io] feature and point it to your local / remote git version you wanna use, so all dependancys that uses esp-idf-sys are using the same version.

@ivmarkov
Copy link
Collaborator

ivmarkov commented Jan 10, 2024

You are supposed to use the [patch.crates-io] Cargo section instead to patch-in a crate dependency with your own copy.

What you are doing instead would mean that you'll be ending up with TWO esp-idf-sys crates in your cargo tree - the regular one from crates.io which is pulled from esp-idf-svc/hal, and then your own version. This might or might not run - depends on the concrete case. For esp-idf-sys this is not possible (and anyway not what you are trying to achieve), as esp-idf-sys wants to be a singleton.

@crespum
Copy link
Author

crespum commented Jan 10, 2024

Thank you both! The exception is caused by commit 6102063, which makes changes to the linker. I'm exploring cargo_drivers.rs to understand what's going on.

@ivmarkov
Copy link
Collaborator

ivmarkov commented Jan 10, 2024

I'm not really sure what is the root problem. Out of stack/heap memory? Something else?

The code you are looking at only matters for ESP IDF 4.4.X. If you are on ESP IDF 5.X you can't be affected by it. As for what it does - look at the comments. The issue at hand is thorough-fully documented (for once).

@crespum
Copy link
Author

crespum commented Jan 10, 2024

If I had to guess, it's a heap problem. And yeah, we are using ESP_IDF_VERSION = "v4.4.5".

@ivmarkov
Copy link
Collaborator

Can't you assign more heap to confirm/deny your hypothesis? Or as I said a few months ago - play with the crate until you isolate the problem?

@ivmarkov
Copy link
Collaborator

Also it is a bit strange if you think the culprit is the heap, as this issue started with "it crashes on a bunch of math operations", Can't you take these math operations, and execute them on the MCU without the sunrise crate?

@crespum
Copy link
Author

crespum commented Jan 10, 2024

How can I assign more heap?

I've already played with the crate to isolate the problem but it led me nowhere. Let me explain: I have a main function with only calls sunrise::sunrise_sunset(...):

This works (i.e. it doesn't raise an exception):

pub fn sunrise_sunset(
    latitude: f64,
    longitude: f64,
    year: i32,
    month: u32,
    day: u32,
) -> (i64, i64) {
    let day: f64 = mean_solar_noon(longitude, year, month, day);
    let solar_anomaly: f64 = solar_mean_anomaly(day);
    let equation_of_center: f64 = equation_of_center(solar_anomaly);
    // let ecliptic_longitude: f64 = ecliptic_longitude(solar_anomaly, equation_of_center, day);
    // let solar_transit: f64 = solar_transit(day, solar_anomaly, ecliptic_longitude);
    // let declination: f64 = declination(ecliptic_longitude);
    // let hour_angle: f64 = hour_angle(latitude, declination);
    // let frac: f64 = hour_angle / 360.;
    // (
    //     julian_to_unix(solar_transit - frac),
    //     julian_to_unix(solar_transit + frac),
    // )
    (0, 0)
}

But this crashes:

pub fn sunrise_sunset(
    latitude: f64,
    longitude: f64,
    year: i32,
    month: u32,
    day: u32,
) -> (i64, i64) {
    let day: f64 = mean_solar_noon(longitude, year, month, day);
    let solar_anomaly: f64 = solar_mean_anomaly(day);
    let equation_of_center: f64 = equation_of_center(solar_anomaly);
    let ecliptic_longitude: f64 = ecliptic_longitude(solar_anomaly, equation_of_center, day);
    // let solar_transit: f64 = solar_transit(day, solar_anomaly, ecliptic_longitude);
    // let declination: f64 = declination(ecliptic_longitude);
    // let hour_angle: f64 = hour_angle(latitude, declination);
    // let frac: f64 = hour_angle / 360.;
    // (
    //     julian_to_unix(solar_transit - frac),
    //     julian_to_unix(solar_transit + frac),
    // )
    (0, 0)
}

The crate contains a bunch of mathematical operations, so that's why I think it has to do with memory.

@ivmarkov
Copy link
Collaborator

The crate contains a bunch of mathematical operations, so that's why I think it has to do with memory.

This has nothing to do with the heap. It looks more like an ISA problem.

Can you - as a start - just copy the above code - plus whatever other functions are called - directly into a minimal binary crate?
That puts the sunrise crate out of the equation and creates a very minimal "reproducible" example. Then please publish it somewhere.

Also, what exactly does the "crash" look like? Rust panic? Invalid instruction? Invalid pointer? Can you also remove your coredump-to-uart option so that we can see the standard, default ESP IDF panic handler, and what exactly it displays?

@Vollbrecht
Copy link
Collaborator

Do you use hardware with external PSRAM or only using internal sram? Notice that the c3 has no hardware floating point support, so all calculations are done in pure software, and f64 tends to be expansive here. Just a general remark.

@crespum
Copy link
Author

crespum commented Jan 10, 2024

You both have made some good points. Check out this minimal example. I managed to reproduce it without even using the external crate but just a for loop with some basic float operations.

Full output: error-235.log

@ivmarkov
Copy link
Collaborator

I don't know when I would be able to test this so in the meantime some ideas and suggestions:

The very fact that you have to execute the same statement in a loop to achieve a crash looks strange. Does it also crash if you manually unroll the loop?

My point is, it might be a flaky hardware too, especially if it turns out that executing the same thing a number of times makes it crash. Therefore:

  • Have you tried executing this on another esp32c3 board? Ideally, a different one?
  • Have you tried to mitigate any brownout problems, i.e. not powerful enough power supply, which might cause the hardware to behave strangely? As in, wrong stuff read from flash/IRAM?

@ivmarkov
Copy link
Collaborator

ivmarkov commented Jan 10, 2024

The original ESP IDF issue referred by @punkto at the beginning of the thread is precisely a power issue, not a software one btw.

@crespum
Copy link
Author

crespum commented Jan 11, 2024

It doesn't crash when unrolling the loop. To give more context, when creating the example I was just trying to emulate what the external crate does, which are a bunch of float operations somehow interconnected.

We have a custom board (powered through both battery and USB), but I've replicated the behavior on an ESP32-C3-DevKitM-1U. Besides, the exception arises after a software change (6102063), doesn't that rule out a hardware issue -or at least make it very unlikely-?

@ivmarkov ivmarkov reopened this Jan 12, 2024
@ivmarkov
Copy link
Collaborator

@igrr Any ideas? It seems we are facing runtime crashes when utilizing FP on riscv targets when linked with a GCC toolchain newer than 5.1 (as in, the toolchain which is available from ESP IDF 5.1 onwards).

(As to why we are using a newer GCC toolchain for linking even with ESP IDF 4.4.X - long story - but the crux is due to the zicr changes to the RISCV ISA 2.1 spec, where GCC and LLVM diverged, and GCC is following the new (backward incompatible) 2.1 spec, where zicr and zifencei are not considered part of "riscv32imc", while LLVM continues to follow the older 2.0 version, where these are considered in. Which creates linking errors when using an older GCC toolchain.)

@MabezDev
Copy link
Member

Maybe related to esp-rs/rust#176? What happens if you build esp-idf with clang, instead of GCC?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

No branches or pull requests

5 participants