Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow initial epoch progress causes time to overflow and program panics #447

Closed
DevonMorris opened this issue Jun 28, 2023 · 7 comments
Closed

Comments

@DevonMorris
Copy link

Describe the bug

Hi! I was just checking out your examples and experienced a bug. On slow compute and long first epoch, duration overflows on epoch progress bar causing the program to panic.

To Reproduce

Essentially, I just ran the mnist example on main

cargo run --example mnist --release --features ndarray-blas-openblas

Expected behavior

I think this is obvious, it should not panic.

Screenshots

=============
thread '<unnamed>' panicked at 'overflow when adding durations', library/core/src/time.rs:920:31
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
=== PANIC ===
A fatal error happened, you can check the experiment logs here => '/tmp/burn-example-mnist/experiment.log'
=============
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: PoisonError { .. }', /home/devon/.cargo/registry/src/index.crates.io-6f17d22bba15001f/indicatif-0.17.5/src/draw_target.rs:132:65
stack backtrace:
   0:     0x556bcae4e36a - std::backtrace_rs::backtrace::libunwind::trace::h9a6b80bbf328ba5d
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x556bcae4e36a - std::backtrace_rs::backtrace::trace_unsynchronized::hd162ec543a11886b
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x556bcae4e36a - std::sys_common::backtrace::_print_fmt::h78a5099be12f51a6
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/sys_common/backtrace.rs:65:5
   3:     0x556bcae4e36a - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::ha1c5390454d74f71
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x556bcae74f9f - core::fmt::write::h9ffde816c577717b
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/fmt/mod.rs:1254:17
   5:     0x556bcae4b3e5 - std::io::Write::write_fmt::h88186074961638e4
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/io/mod.rs:1698:15
   6:     0x556bcae4e135 - std::sys_common::backtrace::_print::h184198273ed08d59
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/sys_common/backtrace.rs:47:5
   7:     0x556bcae4e135 - std::sys_common::backtrace::print::h1b4d8e7add699453
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/sys_common/backtrace.rs:34:9
   8:     0x556bcae4fc4e - std::panicking::default_hook::{{closure}}::h393bcea75423915a
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:269:22
   9:     0x556bcae4f9f5 - std::panicking::default_hook::h48c64f31d8b3fd03
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:288:9
  10:     0x556bcaa8cca0 - burn_train::learner::log::update_panic_hook::{{closure}}::h732293ffa8b8bb5f
  11:     0x556bcae50334 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::hb9b860f5a1175bda
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/alloc/src/boxed.rs:1987:9
  12:     0x556bcae50334 - std::panicking::rust_panic_with_hook::hafdc493a79370062
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:695:13
  13:     0x556bcae500a9 - std::panicking::begin_panic_handler::{{closure}}::h0a64bc82e36bedc7
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:582:13
  14:     0x556bcae4e7d6 - std::sys_common::backtrace::__rust_end_short_backtrace::hc203444fb7416a16
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/sys_common/backtrace.rs:150:18
  15:     0x556bcae4fe02 - rust_begin_unwind
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:578:5
  16:     0x556bca8f0ab3 - core::panicking::panic_fmt::h0f6ef0178afce4f2
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:67:14
  17:     0x556bca8f0f63 - core::result::unwrap_failed::h8090202169109f9c
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/result.rs:1687:5
  18:     0x556bcaac21d3 - indicatif::draw_target::ProgressDrawTarget::width::h3e78ea98464f73ac
  19:     0x556bcaac77bc - indicatif::state::BarState::draw::h9abbe3e351e87f07
  20:     0x556bcaac73b2 - indicatif::state::BarState::finish_using_style::h7b94d1440da51327
  21:     0x556bcaac510c - indicatif::progress_bar::ProgressBar::finish::hfa43b91ae9321bbc
  22:     0x556bcaa8e848 - <burn_train::metric::dashboard::cli::CLIDashboardRenderer as core::ops::drop::Drop>::drop::h752503421c544f77
  23:     0x556bca9ee6bd - core::ptr::drop_in_place<burn_train::metric::dashboard::cli::CLIDashboardRenderer>::hf0a3b1156513db1d
  24:     0x556bca9ed509 - core::ptr::drop_in_place<burn_train::metric::dashboard::base::Dashboard<burn_train::learner::classification::ClassificationOutput<burn_autodiff::backend::ADBackendDecorator<burn_ndarray::backend::NdArrayBackend<f32>>>,burn_train::learner::classification::ClassificationOutput<burn_ndarray::backend::NdArrayBackend<f32>>>>::h263d2960503a8317
  25:     0x556bca9aafa1 - core::ptr::drop_in_place<burn_train::callback::async_callback::CallbackThread<burn_train::learner::classification::ClassificationOutput<burn_autodiff::backend::ADBackendDecorator<burn_ndarray::backend::NdArrayBackend<f32>>>,burn_train::learner::classification::ClassificationOutput<burn_ndarray::backend::NdArrayBackend<f32>>>>::hf9ef061b063ed5cb
  26:     0x556bca9a196b - burn_train::callback::async_callback::CallbackThread<T,V>::run::hdc6cda587267de17
  27:     0x556bcaa4bb15 - std::sys_common::backtrace::__rust_begin_short_backtrace::h2979fba468267f1f
  28:     0x556bca9a98e2 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h4dc312110bc02dfd
  29:     0x556bcae54695 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::ha1f2224656a778fb
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/alloc/src/boxed.rs:1973:9
  30:     0x556bcae54695 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::haa29ed9703f354b7
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/alloc/src/boxed.rs:1973:9
  31:     0x556bcae54695 - std::sys::unix::thread::Thread::new::thread_start::h33b6dae3e3692197
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/sys/unix/thread.rs:108:17
  32:     0x7f25b1a94b43 - start_thread
                               at ./nptl/./nptl/pthread_create.c:442:8
  33:     0x7f25b1b26a00 - clone3
                               at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
  34:                0x0 - <unknown>
thread panicked while panicking. aborting.
Aborted (core dumped)

Desktop (please complete the following information):

  • OS: Pop-OS 22.04

Additional context
I'm guessing the problem lies here

But not 100% sure

@nathanielsimard
Copy link
Member

I don't think it's a conversion to u64, since I don't think it can panic. It seems like there is an overflow when adding durations. It's pretty weirds and I'm not able to reproduce. Does it panic everytime you run the example?

@DevonMorris
Copy link
Author

DevonMorris commented Jun 30, 2023

Yes, it panics every time. You'll also see that the epoch epoch and iteration progress bars look quite strange.

[Metrics]
  - Train Loss: epoch 0.76 - batch 0.29
  - Train Accuracy: epoch 80.35 % - batch 92.19 %

[Progress]
  - Iteration 151 Epoch 1/10
  - iteration [#################################>--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------] (12m)
  - epoch     [####################>-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------] (1101y)

Edit: this is also on 77219f since I checked out the latest this morning.

@nathanielsimard
Copy link
Member

@DevonMorris can you verify that the fix works on your system: #450. I think the problem comes from the time estimation when the first epoch completion is almost instant.

@DevonMorris
Copy link
Author

That didn't work either

  - Train Loss: epoch 7.76e-2 - batch 2.68e-2
  - Train Accuracy: epoch 97.91 % - batch 100.00 %
  - Valid Loss: epoch 7.57e-2 - batch 4.63e-2
  - Valid Accuracy: epoch 97.64 % - batch 100.00 %

[Progress]
  - Iteration 245 Epoch 2/10
  - iteration [#############################################################>----------------------------------------------------------------------------------------------------------------------------------------------------------------------------] (11m)
  - epoch     [######################>---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------] (519786157860y)
  === PANIC ===
A fatal error happened, you can check the experiment logs here => '/tmp/burn-example-mnist/experiment.log'
=============
thread '<unnamed>' panicked at 'overflow when adding durations', library/core/src/time.rs:920:31
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
=== PANIC ===
A fatal error happened, you can check the experiment logs here => '/tmp/burn-example-mnist/experiment.log'
=============
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: PoisonError { .. }', /home/devon/.cargo/registry/src/index.crates.io-6f17d22bba15001f/indicatif-0.17.5/src/draw_target.rs:132:65
stack backtrace:
   0:     0x55b90352b37a - std::backtrace_rs::backtrace::libunwind::trace::h9a6b80bbf328ba5d
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x55b90352b37a - std::backtrace_rs::backtrace::trace_unsynchronized::hd162ec543a11886b
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x55b90352b37a - std::sys_common::backtrace::_print_fmt::h78a5099be12f51a6
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/sys_common/backtrace.rs:65:5
   3:     0x55b90352b37a - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::ha1c5390454d74f71
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x55b903551faf - core::fmt::write::h9ffde816c577717b
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/fmt/mod.rs:1254:17
   5:     0x55b9035283f5 - std::io::Write::write_fmt::h88186074961638e4
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/io/mod.rs:1698:15
   6:     0x55b90352b145 - std::sys_common::backtrace::_print::h184198273ed08d59
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/sys_common/backtrace.rs:47:5
   7:     0x55b90352b145 - std::sys_common::backtrace::print::h1b4d8e7add699453
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/sys_common/backtrace.rs:34:9
   8:     0x55b90352cc5e - std::panicking::default_hook::{{closure}}::h393bcea75423915a
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:269:22
   9:     0x55b90352ca05 - std::panicking::default_hook::h48c64f31d8b3fd03
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:288:9
  10:     0x55b903169ca0 - burn_train::learner::log::update_panic_hook::{{closure}}::h732293ffa8b8bb5f
  11:     0x55b90352d344 - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::hb9b860f5a1175bda
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/alloc/src/boxed.rs:1987:9
  12:     0x55b90352d344 - std::panicking::rust_panic_with_hook::hafdc493a79370062
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:695:13
  13:     0x55b90352d0b9 - std::panicking::begin_panic_handler::{{closure}}::h0a64bc82e36bedc7
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:582:13
  14:     0x55b90352b7e6 - std::sys_common::backtrace::__rust_end_short_backtrace::hc203444fb7416a16
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/sys_common/backtrace.rs:150:18
  15:     0x55b90352ce12 - rust_begin_unwind
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/panicking.rs:578:5
  16:     0x55b902fcdab3 - core::panicking::panic_fmt::h0f6ef0178afce4f2
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/panicking.rs:67:14
  17:     0x55b902fcdf63 - core::result::unwrap_failed::h8090202169109f9c
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/result.rs:1687:5
  18:     0x55b90319f1e3 - indicatif::draw_target::ProgressDrawTarget::width::h3e78ea98464f73ac
  19:     0x55b9031a47cc - indicatif::state::BarState::draw::h9abbe3e351e87f07
  20:     0x55b9031a43c2 - indicatif::state::BarState::finish_using_style::h7b94d1440da51327
  21:     0x55b9031a211c - indicatif::progress_bar::ProgressBar::finish::hfa43b91ae9321bbc
  22:     0x55b90316b848 - <burn_train::metric::dashboard::cli::CLIDashboardRenderer as core::ops::drop::Drop>::drop::h752503421c544f77
  23:     0x55b9030cb6bd - core::ptr::drop_in_place<burn_train::metric::dashboard::cli::CLIDashboardRenderer>::hf0a3b1156513db1d
  24:     0x55b9030ca509 - core::ptr::drop_in_place<burn_train::metric::dashboard::base::Dashboard<burn_train::learner::classification::ClassificationOutput<burn_autodiff::backend::ADBackendDecorator<burn_ndarray::backend::NdArrayBackend<f32>>>,burn_train::learner::classification::ClassificationOutput<burn_ndarray::backend::NdArrayBackend<f32>>>>::h263d2960503a8317
  25:     0x55b903087fa1 - core::ptr::drop_in_place<burn_train::callback::async_callback::CallbackThread<burn_train::learner::classification::ClassificationOutput<burn_autodiff::backend::ADBackendDecorator<burn_ndarray::backend::NdArrayBackend<f32>>>,burn_train::learner::classification::ClassificationOutput<burn_ndarray::backend::NdArrayBackend<f32>>>>::hf9ef061b063ed5cb
  26:     0x55b90307e96b - burn_train::callback::async_callback::CallbackThread<T,V>::run::hdc6cda587267de17
  27:     0x55b903128b15 - std::sys_common::backtrace::__rust_begin_short_backtrace::h2979fba468267f1f
  28:     0x55b9030868e2 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h4dc312110bc02dfd
  29:     0x55b9035316a5 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::ha1f2224656a778fb
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/alloc/src/boxed.rs:1973:9
  30:     0x55b9035316a5 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::haa29ed9703f354b7
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/alloc/src/boxed.rs:1973:9
  31:     0x55b9035316a5 - std::sys::unix::thread::Thread::new::thread_start::h33b6dae3e3692197
                               at /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/std/src/sys/unix/thread.rs:108:17
  32:     0x7fc034494b43 - start_thread
                               at ./nptl/./nptl/pthread_create.c:442:8
  33:     0x7fc034526a00 - clone3
                               at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
  34:                0x0 - <unknown>
thread panicked while panicking. aborting.
Aborted (core dumped)

I, unfortunately, don't have time to look deeper right now to see if this is a specific issue with my setup. So if you are unable to reproduce, it may be a issue with my local setup.

@nathanielsimard
Copy link
Member

I think I can reproduce by waiting for the epoch to complete. The problem comes from the ETA calculated from the indicatif lib, the easy solution is just to remove the ETA or to calculate it ourself.

@cswaney
Copy link

cswaney commented Jul 1, 2023

Yep. See console-rs/indicatif#556. Verifying that modifying the progress bar template solves:

// let bar = "[{wide_bar:.cyan/blue}] ({eta})";
let bar = "[{wide_bar:.cyan/blue}]";

@nathanielsimard
Copy link
Member

I'm going to remove the (eta) from the progress bar and create an issue to add it back when the indicatif issue is resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants