
Estimates become extremely large if progress updates are infrequent #556

Open
teor2345 opened this issue Jun 29, 2023 · 7 comments

@teor2345

We're using indicatif via howudoin to display events that update every few minutes. Sometimes there can be delays of up to 10 minutes.

We're seeing extremely large estimates when there aren't any events for a few minutes.

This is the underlying cause of the panics in #554 in our application. There aren't any updates for a few minutes, so the estimate grows to billions of years. Eventually it falls outside the range of Duration, which panics.
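For illustration, here's a minimal sketch of that failure mode (not indicatif's actual code path; the 1.0e20 value is a made-up stand-in for a runaway estimate): converting an out-of-range f64 ETA into a std Duration panics.

```rust
use std::time::Duration;

fn main() {
    // Hypothetical runaway ETA in seconds, e.g. remaining_steps divided by a
    // rate that has decayed to nearly zero. Duration tops out at u64::MAX
    // seconds (roughly 5.8e11 years), so this value is out of range.
    let eta_secs: f64 = 1.0e20;

    // Duration::from_secs_f64 panics on overflow -- the failure mode above:
    // let eta = Duration::from_secs_f64(eta_secs); // panics

    // try_from_secs_f64 surfaces the overflow as an error instead:
    match Duration::try_from_secs_f64(eta_secs) {
        Ok(eta) => println!("ETA: {eta:?}"),
        Err(err) => println!("ETA out of Duration range: {err}"),
    }
}
```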

Is it possible to make EXPONENTIAL_WEIGHTING_SECONDS configurable, or use an algorithm that doesn't have this exponentially increasing behaviour when there aren't any updates?
(I have read the discussion in #394 and related tickets.)

Here's an example of the beginning of an exponential increase:
[Screenshot from 2023-06-29: the displayed ETA beginning to increase exponentially]

@djc
Collaborator

djc commented Jun 29, 2023

@afontenot it would be great if you have any ideas on how to avoid this.

@afontenot
Contributor

Sure, this was something that came up during development of the new algorithm. I had initially planned to make this behavior configurable in two ways (described below), but we ended up leaving that out in favor of having good defaults.

The issue here is that, given the assumptions made by the algorithm, a very large ETA is entirely reasonable if no progress has occurred in e.g. 2 minutes. The weighting of the exponential function is such that the most recent 15 seconds provide most (but not all) of the data in the average. The reason for this is that it's designed to be reactive on time scales that matter to a person continually watching progress, for example a file transfer. It's not tuned for generating good estimates for long, intermittent activities.
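To make that weighting concrete, here's a small sketch assuming a decay of the form `0.1^(age / 15)`; this is an illustration of the description above, not indicatif's exact implementation:

```rust
// Weight retained by data `age_secs` old, assuming 0.1^(age / 15):
// everything at least 15 s old collectively keeps only 10% of the weight.
fn weight(age_secs: f64) -> f64 {
    0.1_f64.powf(age_secs / 15.0)
}

fn main() {
    for age in [0.0, 5.0, 15.0, 60.0, 600.0] {
        println!("age {age:>5.0} s -> weight {:.2e}", weight(age));
    }
    // At age 600 s the weight is 1e-40: after a 10-minute stall the old
    // progress history is effectively erased, so the ETA explodes.
}
```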

On a technical level, this is the result of two decisions:

  1. The specific weighting of the algorithm (the most recent 15 seconds provide 90% of the weight). This was originally going to be configurable.
  2. The "live update" behavior, meaning that the estimate updates whenever a tick occurs regardless of whether any new progress occurred during the tick. This works out okay for progress bar consumers who want a manual tick, and works out great for the "file transfer" type use cases I mentioned, because in those cases you want a revised estimate in the event of a stall. (If the network cable got unplugged, the transfer is never going to complete.) Unfortunately, it's much less helpful in the case of a steady tick combined with intermittent progress. I believe I mentioned during development that some would find this behavior annoying and that there should probably be a setting to disable it. When you have predictable intermittent stalls, it's less annoying to just wait for progress to continue rather than having the progress rate estimate exponentially approach zero.

Of these two, I'd say the first is the most directly implicated here. Even if you implemented the second feature, you'd see annoying jumps in the estimate with a progress stall of 10 minutes. The exponential smoothing that the algorithm is designed to provide would have basically no effect, because the smoothing time scale is much too short relative to the stall.

I think it would not be unreasonable to try to make this configurable. Everything should just work if you set the value to 20 minutes or even higher. (With very high settings, there's not much down-weighting of older data, so you get behavior approximating a linear average since the beginning of progress, which is often appropriate for these "predictable intermittent stall" cases.)
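As a rough worked example under the same assumed `0.1^(age / window)` decay: raising the window from 15 seconds to 20 minutes means a 10-minute-old sample still keeps about a third of its weight, which is why the behavior approaches a plain linear average.

```rust
fn main() {
    // Weight kept by a 10-minute-old sample under the current and a
    // proposed window, assuming 0.1^(age / window) decay.
    let age = 600.0_f64; // seconds
    for window in [15.0_f64, 1200.0] {
        println!("window {:>6.0} s -> weight {:.3}", window, 0.1_f64.powf(age / window));
    }
    // window     15 s -> weight 0.000  (history effectively gone)
    // window   1200 s -> weight 0.316  (old data still counts heavily)
}
```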

@teor2345
Author

teor2345 commented Jul 9, 2023

> I think it would not be unreasonable to try to make this configurable. Everything should just work if you set the value to 20 minutes or even higher.

Thanks, that would be helpful for us.

We expect progress every 75 seconds for one of our progress bars, and every 10 seconds to 3 minutes for the other.

@djc
Collaborator

djc commented Jul 9, 2023

Requiring configuration for this kind of thing seems like an anti-pattern to me: requiring users to give us information that they then have to benchmark and keep up to date, when it feels like there is some algorithm we could use to avoid the current edge case behavior.

Can we, for example, define some boundary where we switch to different tuning parameters?

@teor2345
Author

> Requiring configuration for this kind of thing seems like an anti-pattern to me: requiring users to give us information that they then have to benchmark and keep up to date, when it feels like there is some algorithm we could use to avoid the current edge case behavior.

I agree.

> Can we, for example, define some boundary where we switch to different tuning parameters?

Can we dynamically change the weighting based on the average/median time between the most recent N progress updates? If needed, we could exclude the last 1-2 updates, because they might represent a disconnection or other instability. (A median would do this automatically.)

This would work for us, because each of our progress bars has two different modes:

  • Blocks: initially multiple times per second, then every 75 seconds
  • Checkpoints: initially every 5-30 seconds, then it is finished
  • Chain Forks: initially every 75 seconds, then no updates for 7500 seconds (we disabled the ETA because it was meaningless, and we're unlikely to restore it even if the estimate is fixed)
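A hypothetical sketch of that median-based idea (none of this exists in indicatif today; the 4x multiplier and the 15 s floor are arbitrary assumptions for illustration):

```rust
// Derive the smoothing window from the median gap between the most recent
// progress updates, so regular-but-slow updates keep a stable estimate.
// The median already ignores one or two outlier gaps automatically.
fn dynamic_window_secs(update_times_secs: &[f64]) -> f64 {
    let mut gaps: Vec<f64> = update_times_secs
        .windows(2)
        .map(|pair| pair[1] - pair[0])
        .collect();
    if gaps.is_empty() {
        return 15.0; // fall back to the current default window
    }
    gaps.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let median = gaps[gaps.len() / 2];
    // Span a few typical gaps, never narrower than today's 15 s window,
    // so frequently updating bars keep the current behavior.
    (4.0 * median).max(15.0)
}

fn main() {
    // Updates every ~75 s with one long outage: the median ignores the outage.
    let times = [0.0, 75.0, 150.0, 225.0, 825.0, 900.0];
    println!("window: {:.0} s", dynamic_window_secs(&times));
}
```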

@djc
Collaborator

djc commented Jul 27, 2023

@afontenot would you be able to spend more time on this? If not, that's fine too, I can dig into it more.

@SolidTux

> Requiring configuration for this kind of thing seems like an anti-pattern to me: ...

I would highly appreciate an option to turn off the exponential weighting entirely. I suppose that if the decay rate were configurable, one could set it to a very high value as you mentioned, but I worry I would have to set it so high that feeding that many seconds into an exponential could cause numerical problems.

I have programs that run for up to a few days, with steps sometimes taking hours. The steps are very consistent in length, so the exponential weighting provides no benefit at all. There is also no way to further subdivide the steps, since most of the time is spent in a single call to LAPACK.

Without a steady tick, the elapsed time does not get updated often enough; for example, there is no way to see how long the program has been running before the first step completes.
