With burst-heavy writes, eta and bitrate show incorrect values when using enable_steady_tick #534

Closed

afontenot opened this issue Apr 23, 2023 · 7 comments

@afontenot (Contributor)

TL;DR: if you use enable_steady_tick and call set_position on every write, a burst-heavy workload will show incorrect values for ETA and bitrate - not just heavily fluctuating values, but impossibly low ones given the true speed of the connection.


I have the following two assumptions about how the progress bar behaves. These might need to be corrected.

  1. When enable_steady_tick is enabled, indicatif ticks only on the clock you set. Based on the docs.
  2. Rate-dependent metrics like ETA and bitrate are determined by looking at a sliding window of the last 15 ticks. That is to say, every time there's a tick, the progress is stored. The rate is just the amount of progress made in the last 15 ticks, divided by the amount of time it took for 15 ticks to elapse. Based on this contributor comment. (A sketch of this model follows the list.)
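For concreteness, here's a rough sketch of the estimator I have in mind in assumption 2. This is my mental model only, not indicatif's actual code, and the names are made up:

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Mental model only, not indicatif's real code: remember the progress
/// delta observed at each of the last 15 ticks, and derive the rate from
/// the window as a whole.
struct TickWindow {
    deltas: VecDeque<u64>,   // steps recorded per tick, newest at the back
    tick_interval: Duration, // e.g. 250 ms with enable_steady_tick
}

impl TickWindow {
    fn record_tick(&mut self, steps_this_tick: u64) {
        self.deltas.push_back(steps_this_tick);
        if self.deltas.len() > 15 {
            self.deltas.pop_front();
        }
    }

    fn steps_per_sec(&self) -> f64 {
        // Progress over the window divided by the time the window spans.
        let total: u64 = self.deltas.iter().sum();
        let window_secs = self.tick_interval.as_secs_f64() * self.deltas.len() as f64;
        total as f64 / window_secs
    }
}
```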

I was debugging a progress bar issue with magic-wormhole.rs that was hard to reproduce. The issue seems to stem from how burst-heavy writes are on my system: I regularly see the program fill the outgoing TCP buffer and then wait for half a second or more before continuing. The program calls set_position every time it writes into the buffer, and sets indicatif's steady tick to 250 ms. Based on this, I expect the progress bar to fluctuate a decent amount, but on average it should be close to the true bitrate of the file transfer.

In fact, I see extremely strong fluctuations, and the indicatif progress bar is almost always significantly below the real speed of the transfer (based on the file size and transfer time).

Based on the assumptions I gave above, I recreated the expected readout as a moving average (plotted with matplotlib in Python) across writes from a sliding 15 × 250 ms window.

[Figure 1: moving average of the transfer rate over a sliding 15 × 250 ms window]

As you can see, there's some undesirable variability here (it would be nice if we could set indicatif to average over 60 samples or something like that), but there are no extreme dips in the moving average. On the contrary, indicatif frequently shows me bitrates under 1 Mbps for this connection, and appears to tick irregularly as well! Something seems to have gone badly wrong with the estimation.

Reproducing issues like this in a complex program is difficult, because it relies on buffer size, network conditions, etc. For this reason, I've written a simple replay tool for indicatif. This is a single-file script (using rust-script) that allows you to replay logs from indicatif using a trivial log format. This should enable anyone debugging this issue to recreate exactly what I am seeing.

#!/usr/bin/env rust-script
//! ```cargo
//! [dependencies]
//! indicatif = "0.17.3"
//! ```

use std::env;
use std::fs::File;
use std::io::{self, BufRead}; // BufRead provides .lines() trait for BufReader
use std::path::Path;
use std::thread;
use std::time::{Duration, Instant};

use indicatif::{ProgressBar, ProgressStyle};

/// Replay a log file where each line is `<delay in µs> [<bytes written>]`.
/// If the byte count is omitted, the previous line's count is reused.
/// Returns the total size; if a progress bar is given, writes are replayed
/// in real time and the bar's position is updated after each one.
fn read_file<P>(filepath: P, pb: Option<ProgressBar>) -> u64
where
    P: AsRef<Path>,
{
    let mut file_size: u64 = 0;
    let mut last_read_size: u64 = 0;
    let mut time = Instant::now();
    let file = File::open(filepath).unwrap();
    for line in io::BufReader::new(file).lines() {
        let line = line.unwrap();
        let mut line_iter = line.split(' ');
        // First field: microseconds since the previous write.
        let delay: u64 = line_iter.next().unwrap().parse().unwrap();
        time += Duration::from_micros(delay);
        // Optional second field: bytes written; otherwise reuse the last one.
        if let Some(read_size_str) = line_iter.next() {
            last_read_size = read_size_str.parse().unwrap();
        }
        file_size += last_read_size;
        if let Some(ref progress_bar) = pb {
            // Sleep until this write's scheduled time; saturate to zero if
            // the replay has fallen behind schedule (a bare
            // `time - Instant::now()` can panic when the result is negative).
            thread::sleep(time.saturating_duration_since(Instant::now()));
            progress_bar.set_position(file_size);
        }
    }
    file_size
}

fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() < 2 || args.len() > 3 {
        println!("Usage: indicatif-replay <logfile> [file size]");
        return;
    }

    let file_size: u64 = if args.len() == 3 {
        args[2].parse().unwrap()
    } else {
        // Dry run (no progress bar) to determine the total size.
        let size = read_file(&args[1], None);
        println!("Automatically detected file size: {}", size);
        size
    };

    let pb = ProgressBar::new(file_size);
    pb.set_style(
        ProgressStyle::default_bar()
            .template("{percent}% [{wide_bar}] {bytes} ({bytes_per_sec}) eta {eta}")
            .unwrap()
            .progress_chars("#>-"),
    );
    pb.enable_steady_tick(Duration::from_millis(250));

    read_file(&args[1], Some(pb));
}
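For reference, each log line is a delay in microseconds since the previous write, optionally followed by the number of bytes written; when the byte count is omitted, the previous line's count is reused. A made-up example:

```
250000 65536
1000 65536
500000
```

You can run it with something like `rust-script indicatif-replay.rs indicatif.log` (the file size argument is optional; without it the script does a dry pass to detect it).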

Here is my log file:

indicatif.log

@DonSheddow

Nice work with the detailed repro instructions! Using your log file, I see the same behaviour as you, where it dips below 1000KB/s after about 15-20 seconds. For comparison, I modified your script to use my own moving average calculation: https://gist.github.com/DonSheddow/2cbcd2f334ee1244ae974cff5348a068. This doesn't display the same behaviour - the bitrate stays between 2 and 3 MB/s.
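The gist is short, but for reference the core idea is roughly this (a sketch, not the gist verbatim): keep timestamped byte counts and average them over a fixed 4-second window.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Sketch of a time-window moving average (not the gist verbatim):
/// remember how many bytes each write added and when, then average
/// over everything that happened in the last `window`.
struct MovingAvgRate {
    samples: VecDeque<(Instant, u64)>, // (when, bytes written)
    window: Duration,                  // e.g. 4 s = 16 ticks of 250 ms
}

impl MovingAvgRate {
    fn record(&mut self, now: Instant, bytes: u64) {
        self.samples.push_back((now, bytes));
        // Evict samples that have aged out of the window.
        while let Some(&(t, _)) = self.samples.front() {
            if now.duration_since(t) > self.window {
                self.samples.pop_front();
            } else {
                break;
            }
        }
    }

    fn bytes_per_sec(&self) -> f64 {
        let total: u64 = self.samples.iter().map(|&(_, b)| b).sum();
        total as f64 / self.window.as_secs_f64()
    }
}
```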

What I don't understand is that when the tick interval is set to 250ms, a rolling buffer of the last 16 ticks should be the same as a rolling buffer of the last 4 seconds. So in principle my MovingAvgRate should be the same as the builtin bitrate calculation... The only difference I can see is that the builtin estimator calculates the slope of each step, then averages them to find the number of seconds per step, then takes the reciprocal of that to get the bitrate; maybe all of the floating-point operations introduce a non-negligible precision loss?

@afontenot (Contributor Author)

> The only difference I can see is that the builtin estimator calculates the slope of each step, then averages them to find the number of seconds per step, then takes the reciprocal of that to get the bitrate

Haha, thanks for pointing me right at the error. This is a basic math mistake.

$$\sum \limits_{i=1}^{n} \dfrac{1}{a_i} \neq \frac{1}{\sum \limits_{i=1}^{n} a_i}$$

In other words, what we have in the buffer is the amount of time taken per step over each tick. To take some arbitrary sample data, this looks something like:

0.25 secs / 1000 steps, 0.25 secs / 750 steps, 0.25 secs / 800 steps, etc.

When you average these values, you treat the denominators as equivalent. This weights samples with fewer steps more heavily than samples with more steps. So, for example, the data above becomes

0.00025 secs / step, 0.000333 secs / step, 0.0003125 secs / step

Averaging these ticks gives you 0.3 secs / thousand steps, which seems reasonable enough. E.g. if there are 10000 steps to go through, that's 3 seconds.

Now suppose you have a fourth tick, where you pass essentially no data, let's say 0.25 seconds / 1 step. Averaging these ticks gives you 63 seconds per thousand steps, or 630 seconds for the 10000 step job. Obviously this makes no sense.

You and I are doing it correctly (assuming steady ticks) by starting with the reciprocal. This treats "short" ticks as equivalent to any other tick, so it will be somewhat inaccurate if the ticks aren't steady. So with this approach we have:

4000 steps / sec, 3000 steps / sec, 3200 steps / sec, 4 steps / sec

Averaging these, we get 2551 steps / sec, or 3.9 seconds for 10000 steps. This is obviously the better estimate.
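To double-check the arithmetic, here's a tiny standalone snippet (sample numbers from above) that computes both estimates:

```rust
fn main() {
    // Four ticks of 0.25 s each, with these step counts per tick.
    let steps = [1000.0_f64, 750.0, 800.0, 1.0];
    let tick = 0.25_f64;
    let remaining = 10_000.0_f64;

    // Buggy approach: average the seconds-per-step values, then scale up.
    let avg_secs_per_step =
        steps.iter().map(|s| tick / s).sum::<f64>() / steps.len() as f64;
    println!("buggy ETA:  {:.1} s", avg_secs_per_step * remaining); // ~627 s

    // Better approach: average the steps-per-second values, then divide.
    let avg_rate =
        steps.iter().map(|s| s / tick).sum::<f64>() / steps.len() as f64;
    println!("better ETA: {:.1} s", remaining / avg_rate); // ~3.9 s
}
```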

That said, you can do it properly by also tracking the time consumed for each tick in a separate buffer. I'll submit a pull request that does this.

@DonSheddow

> Haha, thanks for pointing me right at the error. This is a basic math mistake.
>
> $$\sum \limits_{i=1}^{n} \dfrac{1}{a_i} \neq \frac{1}{\sum \limits_{i=1}^{n} a_i}$$

Aha, well spotted! So even though the rate is the reciprocal of seconds per step, the average rate is not the reciprocal of the average of seconds per step. Looking further into it, what the builtin estimator is actually calculating is the harmonic mean of the rate. This is correct only when each sample counts the same number of steps (whereas the arithmetic mean is correct when each sample spans the same amount of time). So if you only ever did bar.inc(1), the builtin estimator would work correctly.
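In symbols, with $s_i$ steps taken over $t_i$ seconds in tick $i$ and per-tick rate $r_i = s_i / t_i$, the true average rate over the window is

$$\bar{r} = \frac{\sum \limits_{i=1}^{n} s_i}{\sum \limits_{i=1}^{n} t_i}$$

If every tick counts the same steps ($s_i = s$), this reduces to the harmonic mean $n / \sum_i (1/r_i)$, which is what the builtin estimator computes; if every tick spans the same time ($t_i = t$), it reduces to the arithmetic mean $\frac{1}{n} \sum_i r_i$.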

I think the best way to calculate the slope is just to store positions in the buffer. Subtract the start position from the end position and divide by the elapsed time, and you get the correct rate without having to worry about uneven ticks. Plus you save quite a few arithmetic operations. Example:
[Figure: a long tick followed by a short tick (orange), with the overall slope (blue)]
If you have a long tick, followed by a short tick (in orange), the actual slope (in blue) can be calculated just from the start and end position.
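A sketch of what that could look like (made-up names, not the actual PR code):

```rust
use std::collections::VecDeque;
use std::time::Instant;

/// Sketch of the position-buffer idea (made-up names, not the PR code):
/// store the absolute position seen at each tick; the rate over the window
/// is (newest position - oldest position) / elapsed time between them, so
/// uneven tick lengths drop out of the calculation entirely.
struct PositionWindow {
    positions: VecDeque<(Instant, u64)>,
    capacity: usize, // e.g. 16 ticks
}

impl PositionWindow {
    fn on_tick(&mut self, now: Instant, pos: u64) {
        self.positions.push_back((now, pos));
        if self.positions.len() > self.capacity {
            self.positions.pop_front();
        }
    }

    fn steps_per_sec(&self) -> Option<f64> {
        let &(t0, p0) = self.positions.front()?;
        let &(t1, p1) = self.positions.back()?;
        let dt = t1.duration_since(t0).as_secs_f64();
        (dt > 0.0).then(|| p1.saturating_sub(p0) as f64 / dt)
    }
}
```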

@afontenot (Contributor Author)

@DonSheddow good idea. I initially discounted that approach, because it looked hard to reconcile with the existing state code. But upon a second look, it's really not that bad, mostly just requires changing a test. I've updated the pull request to do this instead. Thanks!

@djc (Collaborator) commented Apr 26, 2023

@afontenot @DonSheddow thank you for working on this issue! I'd be happy if you were able to look at #394, which collects a bunch of other feedback from folks on issues/limitations in the current estimator. I have been thinking that it would be good to adopt some kind of exponential moving average but haven't been able to spend enough time to work out the details. Would love a well-reasoned proposal to improve on the current state!
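To give a flavor of what I mean (very much a sketch, not a worked-out proposal; names like half_life_secs are made up):

```rust
use std::time::Instant;

/// Sketch only: a time-weighted exponential moving average of the rate.
/// `half_life_secs` is a made-up tuning knob; samples older than a few
/// half-lives contribute almost nothing to the estimate.
struct EwmaRate {
    last: Option<(Instant, u64)>,
    rate: f64, // steps per second
    half_life_secs: f64,
}

impl EwmaRate {
    fn new(half_life_secs: f64) -> Self {
        Self { last: None, rate: 0.0, half_life_secs }
    }

    fn record(&mut self, now: Instant, pos: u64) {
        if let Some((then, prev)) = self.last {
            let dt = now.duration_since(then).as_secs_f64();
            if dt > 0.0 {
                let sample = pos.saturating_sub(prev) as f64 / dt;
                // The new sample's weight grows with the time it spans,
                // so bursty, uneven updates are weighted consistently.
                let alpha = 1.0 - 0.5_f64.powf(dt / self.half_life_secs);
                self.rate += alpha * (sample - self.rate);
            }
        }
        self.last = Some((now, pos));
    }
}
```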

@afontenot (Contributor Author) commented Apr 26, 2023

@djc I'll have a look at that. I think handling this properly would rely on having an opinion on, e.g., the correct number of seconds (not ticks) to average over, regardless of whether you used a flat or exponential average. Some sort of default is needed here.

Just for clarity's sake: my PR is not intended to change the current behavior away from "the average rate over the last 15 ticks". Instead it just corrects a math error in how that was done (as discussed in the comments above). So you could (when you have the time to review it) merge the PR fixing the issue, and revisit the approach at any point in the future. Maybe we can improve things right now, but I just wanted to clarify the difference between what I reported here and the issues discussed in #394.

@afontenot (Contributor Author)

This issue is fixed indirectly, because the approach to calculating ETA was changed in #539.
