
Fix incorrect seconds_per_step calculation #535

Closed
wants to merge 1 commit

Conversation

afontenot
Contributor

@afontenot afontenot commented Apr 24, 2023

seconds_per_step looks at a ring buffer with samples taken each tick. Each sample is

<duration of the tick> / <number of progress steps>

This results in an incorrect seconds_per_step when it is calculated as a raw average of the buffer values. This is because we have no reason to assume any tick is a more representative sample than any other tick, yet the raw average treats ticks with very few steps as more representative than ticks with many steps, because the value of the denominator is lost in the sample.

The result is a very poor ETA / bitrate estimation when step production follows a burst-wait pattern, since ticks with few or no steps are to be expected.
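
To make the bias concrete, here is a small standalone sketch (not the crate's actual code) comparing the raw average of per-tick samples against the total-time-over-total-steps rate for a hypothetical burst-wait workload:

    // Each tick contributes one sample of <tick duration> / <steps in tick>;
    // averaging those samples lets the slow, low-step tick dominate.
    fn main() {
        // Hypothetical burst-wait pattern: (tick duration in seconds, steps in that tick).
        let ticks = [(0.1_f64, 100_u64), (0.1, 100), (1.0, 1)];

        // Old approach: raw average of the per-tick seconds-per-step samples.
        let naive: f64 =
            ticks.iter().map(|(d, s)| *d / *s as f64).sum::<f64>() / ticks.len() as f64;

        // Unbiased rate over the same window: total time divided by total steps.
        let total_time: f64 = ticks.iter().map(|(d, _)| *d).sum();
        let total_steps: u64 = ticks.iter().map(|(_, s)| *s).sum();
        let actual = total_time / total_steps as f64;

        println!("naive seconds_per_step:  {naive:.4}");  // ~0.3340, dominated by the slow tick
        println!("actual seconds_per_step: {actual:.4}"); // ~0.0060
    }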

A better estimate could be achieved by averaging over the step-rate of each tick, but this would still not be ideal since it would treat short ticks as equally representative as long ticks in the average.

Instead, we make two changes:

First, we change the buffer to store the total number of steps rather than the difference between steps. This necessitates changing the new / reset code to set the first buffer item to zero.

Second, we create a second ring buffer to store the timestamp of each tick. As a result, we can calculate the seconds_per_step rate by simply looking at the earliest and latest values in the buffers, and returning (now() - first_time) / (last_steps - first_steps).

By using now() instead of last_time in the above equation, we also fix an additional issue. Since we only add items to the buffer when new steps occur, the progress bar freezes and the ETA does not update when progress stalls. By always using the current time, we get a more accurate seconds_per_step estimate with no additional overhead.
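
For concreteness, a rough sketch of this scheme, using VecDeque as a stand-in ring buffer (the type and field names here are illustrative, not the actual Estimator internals), could look like:

    use std::collections::VecDeque;
    use std::time::Instant;

    struct RateEstimator {
        steps: VecDeque<u64>,             // cumulative step totals recorded at each tick
        step_instants: VecDeque<Instant>, // timestamp of each recorded tick
        capacity: usize,
    }

    impl RateEstimator {
        fn new(now: Instant) -> Self {
            let mut est = RateEstimator {
                steps: VecDeque::new(),
                step_instants: VecDeque::new(),
                capacity: 16,
            };
            // The first buffer item is zero, as described above.
            est.steps.push_back(0);
            est.step_instants.push_back(now);
            est
        }

        fn record(&mut self, total_steps: u64, now: Instant) {
            // Evict the oldest entry once the ring buffers are full.
            if self.steps.len() == self.capacity {
                self.steps.pop_front();
                self.step_instants.pop_front();
            }
            self.steps.push_back(total_steps);
            self.step_instants.push_back(now);
        }

        // (now() - first_time) / (last_steps - first_steps); when no steps have
        // been recorded this is 0.0 / 0.0 == NaN, which callers must handle.
        fn seconds_per_step(&self, now: Instant) -> f64 {
            let steps = self.steps.back().unwrap() - self.steps.front().unwrap();
            let mut duration = 0.0;
            if steps != 0 {
                duration = (now - *self.step_instants.front().unwrap()).as_secs_f64();
            }
            duration / steps as f64
        }
    }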

Fixes #534.

@afontenot
Contributor Author

The following two changes required modifying the tests:

    changing the new / reset code to set the first buffer item to zero.

This means that a newly initialized ring buffer contains 1 element, not 0.

    using now() instead of last_time in the above equation

Depending on the latest time when calculating seconds_per_step means that the test which checks the rate calculation can no longer be exact. Instead, I check that it falls between the rates expected for two very close Instant values, taken immediately before and after the call to seconds_per_step.
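
Roughly, the new assertion follows this pattern (written here against the illustrative RateEstimator sketched in the description rather than the actual Estimator, so the names are assumptions):

    use std::time::{Duration, Instant};

    #[test]
    fn seconds_per_step_is_bracketed() {
        let start = Instant::now();
        let mut est = RateEstimator::new(start);
        // Pretend 100 steps had been recorded one second after creation.
        est.record(100, start + Duration::from_secs(1));

        // Sample the clock just before and just after the call under test.
        let before = Instant::now();
        let rate = est.seconds_per_step(Instant::now());
        let after = Instant::now();

        // seconds_per_step uses the current time internally, so the result
        // must lie between the rates implied by the two bracketing instants.
        let lower = (before - start).as_secs_f64() / 100.0;
        let upper = (after - start).as_secs_f64() / 100.0;
        assert!(lower <= rate && rate <= upper);
    }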

@djc
Collaborator

djc commented Apr 26, 2023

I'm not able to review this in detail for the next week or two, but here is some early feedback:

  • Given the sensitivity of this code, it would be great if you could add a whole bunch more comments to the parts of the Estimator that you changed to elaborate what is going on and why.
  • It looks like you modified existing tests to work but didn't add much test coverage? In particular, it would be nice to explicitly test behavior for the motivating use case of ticks with more steps weighing more than ticks with fewer steps as well as the additional issue you noted (potential freezes).
  • Tests for any overflow/underflow/divide-by-zero scenarios would also be great to have.

Thanks again for working on this, your contribution is much appreciated!

@afontenot
Contributor Author

@djc thanks, I'll add some comments and improve the tests

Quick note on this:

Tests for any overflow/underflow/divide-by-zero scenarios would also be great to have.

Yep, the status quo here is a bit of a mess actually. You can easily end up dividing by zero if, e.g., time passes but no steps occur. The code actually relies on ensuring that the numerator in the rate calculation is always zero whenever the denominator is, so that you get a NaN float, and it then explicitly tests for this elsewhere in the code.

I believe I imitated the existing behavior in this area exactly, which is the reasoning behind the strange code here:

    // If no steps occurred, leave duration at 0.0 so the rate works out
    // to 0.0 / 0.0 == NaN, matching the pre-existing behavior.
    let mut duration: f64 = 0.0;
    if steps != 0 {
        duration = duration_to_secs(now - self.step_instants[first_pos]);
    }

But it would be good to test for this, or maybe change the behavior to not rely on NaN floats. Of course the latter would require having an opinion on the correct value to show for the ETA in the case that time has passed but no progress has been made. As I didn't have an opinion, I fell back on maintaining the status quo.
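
As a tiny standalone illustration of that convention (the real guard lives elsewhere in the crate and may differ in detail):

    fn main() {
        // Time passed, but no steps occurred.
        let steps: u64 = 0;
        // Force the numerator to zero whenever the denominator is zero...
        let duration = if steps != 0 { 5.0 } else { 0.0 };
        // ...so the division produces NaN instead of +infinity.
        let seconds_per_step = duration / steps as f64;
        assert!(seconds_per_step.is_nan());

        // A caller rendering an ETA can then detect the stalled case explicitly;
        // the fallback value shown here is purely hypothetical.
        let remaining_steps = 42.0;
        let eta_secs = if seconds_per_step.is_nan() {
            0.0
        } else {
            seconds_per_step * remaining_steps
        };
        println!("eta: {eta_secs}s");
    }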

@afontenot
Contributor Author

Just a heads up that I have put improvements for this PR on hold until my work on an exponentially weighted average can see some review. If we end up switching the estimator to use that approach, this PR will become irrelevant and need to be closed.

@afontenot
Contributor Author

Going to close this for now as it looks like we will be able to merge some form of an exponentially weighted average, which solves this problem in a different way.

@afontenot afontenot closed this May 21, 2023
Successfully merging this pull request may close the following issue:

With burst-heavy writes, eta and bitrate show incorrect values when using enable_steady_tick (#534)