
fix: prevent overflow on counter adding #1474

Open · wants to merge 2 commits into base: main

Conversation

@xieyuschen (Author) commented Mar 18, 2024

Possible Overflow

Currently, the Add API of Counter and similar implementations have an overflow problem. I wrote a test case, TestCounterAddExcess, to demonstrate it in commit d222f97, and the output verifies that the overflow does happen as expected.

--- FAIL: TestCounterAddExcess (0.00s)
    /home/xieyuschen/client_golang/prometheus/counter_test.go:44: oops! overflow
    /home/xieyuschen/client_golang/prometheus/counter_test.go:62: failed because the internal overflow
    /home/xieyuschen/client_golang/prometheus/counter_test.go:68: verify the overflow case
FAIL
FAIL	github.com/prometheus/client_golang/prometheus	0.007s

Proposal

I think it's worth fixing, as it is a possible case. I haven't checked the historical discussions on this topic much, so I would like to discuss with the maintainers whether you would accept such a change.

We need to check one more case for possible overflow, since the uint() cast already prevents precision loss during the conversion (for example, 2<<64-1):

  • the addition overflows

Going one step further, we can add the valInt value into the float stored in valBits and then reset valInt to zero.
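
A minimal sketch of the first part of that idea, detecting whether an addition would overflow; the helper name addWouldOverflow is purely illustrative and not part of the counter code:

package main

import (
	"fmt"
	"math"
)

// addWouldOverflow reports whether cur+delta would wrap around the uint64
// range; it is an illustrative helper, not the client_golang implementation.
func addWouldOverflow(cur, delta uint64) bool {
	return delta > math.MaxUint64-cur
}

func main() {
	fmt.Println(addWouldOverflow(math.MaxUint64, 1)) // true: the sum wraps to 0
	fmt.Println(addWouldOverflow(8000, 42))          // false: still in range
}

When this check fires, the proposal above would spill the current valInt into valBits as a float64 before resetting valInt to zero.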

Regards.


// if v is an unsigned integer value
// and no precision is lost when casting it to uint64 and back to float64
for v == float64(ival) {
@xieyuschen (Author) commented:

One thing to note here is that this cast behaves differently on different platforms. For example, on Linux, v is not equal to float64(uint64(v)), but on macOS M1 it is. This means the check here is needed and cannot be skipped.

The verification code is:

func TestVerifyRounding(t *testing.T) {
	var f float64 = 1<<64 - 1
	if f == float64(uint64(f)) {
		t.Fatal("values are the same after casting")
	} else {
		t.Fatal("values are different")
	}
}

The Linux output can be found in the pipeline here (screenshot in the original comment), and the macOS output is shown in a second screenshot.

@xieyuschen (Author)

Hi @ArthurSens @beorn7, could I get your opinions on this fix?

@beorn7 (Member) commented Mar 25, 2024

I don't maintain this repository anymore, and I currently lack capacity to go into detail here. This is for the current maintainers @ArthurSens @kakkoyun @bwplotka .

@bwplotka (Member)

Thanks! This is definitely useful to handle, just I don't have time this week to take a detailed look. Maybe @ArthurSens ?

@xieyuschen (Author)

> Thanks! This is definitely useful to handle, just I don't have time this week to take a detailed look. Maybe @ArthurSens ?

If the current behavior is not intended, I think I could start on the code change as well to reduce the communication effort. Thanks :)

@bwplotka (Member)

Yup 💪🏽

@ArthurSens (Member)

Sorry, I would need some time to understand the overflow problem better (times like this make me wish I had a computer science degree 😕).

Do I understand correctly that this happens only with counters (i.e. not with gauges, summaries, and histograms)? I also wonder if this is something scrapers can handle correctly... Do we know if Prometheus can parse it correctly without losing precision as well?

@xieyuschen (Author) commented Mar 27, 2024

> Sorry, I would need some time to understand the overflow problem better (times like this make me wish I had a computer science degree 😕).
>
> Do I understand correctly that this happens only with counters (i.e. not with gauges, summaries, and histograms)? I also wonder if this is something scrapers can handle correctly... Do we know if Prometheus can parse it correctly without losing precision as well?

  1. For unsigned integer overflow, see the integer overflow topic in the Go spec. In short, adding 1 to the maximum uint64 (1<<64-1) overflows and the result is 0; see the output here for an example.

  2. Counter has this issue because it uses an integer to store the value when we pass an integer to Add or call Inc. Gauge doesn't have this problem, I think, as it already stores the value as a float64 in valBits and Adds correctly. Histogram doesn't have this problem because atomicAddFloat handles such cases. Summary doesn't have this issue either.

  3. Regarding the scraper, it cannot help in this case because the server doesn't distinguish between the fundamental metric types; if the value overflows, the server simply treats it as a value change.
    For example, with the following code, the initial metric value is 8000 and after ten seconds it becomes 7999 due to overflow.

package main

import (
	"math"
	"net"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	c := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "verify_scraper_overflow",
	})
	_ = prometheus.DefaultRegisterer.Register(c)
	c.Add(8000)
	go func() {
		// After ten seconds, push the counter past the uint64 limit.
		time.Sleep(10 * time.Second)
		c.Add(math.MaxUint64)
	}()
	l, err := net.Listen("tcp", "localhost:8081")
	if err != nil {
		panic(err)
	}

	http.Serve(l, promhttp.InstrumentMetricHandler(
		prometheus.DefaultRegisterer,
		promhttp.HandlerFor(
			prometheus.Gatherers([]prometheus.Gatherer{prometheus.DefaultGatherer}),
			promhttp.HandlerOpts{},
		),
	))
}

The result shows that the scraper doesn't handle the overflow at all. By the way, because the Prometheus client reports the metrics defined in dto.MetricFamily, the scraper simply respects the value inside it:

$ date +"%T" && curl localhost:8081/metrics
15:03:55
# HELP verify_scraper_overflow 
# TYPE verify_scraper_overflow counter
verify_scraper_overflow 8000
$ date +"%T" && curl localhost:8081/metrics
15:04:21
# HELP verify_scraper_overflow 
# TYPE verify_scraper_overflow counter
verify_scraper_overflow 7999
  4. "Do we know if Prometheus can parse correctly without losing precision as well?"
    For float numbers, losing precision is unavoidable due to IEEE 754. However, since the dto proto API is float, precision loss is not a big concern for the client side during addition, as long as the client provides the correct number (without overflow).
    Regarding precision loss, I ran into it several months ago and wrote a blog post about it; feel free to read it if you want to learn more.

@xieyuschen (Author)

Oh, there is another case, regarding float operations, which covers all the fundamental metric types (counter, gauge, histogram, and summary).
As we know, IEEE 754 represents floats in a different way, so a rounding error can occur once the float is larger than 2<<53. For example, the following code prints 18014398509481984 twice due to the rounding error.

package main

import "fmt"

func main() {
	f := float64(2 << 53)
	fmt.Println(uint64(f))     // 18014398509481984
	fmt.Println(uint64(f + 1)) // also 18014398509481984: f+1 rounds back to f
}

This means that once a metric's value exceeds that threshold, the value won't increase anymore if we keep calling Inc, due to the rounding error. An additional check for this is needed, such as comparing the result with the original number.
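
As a rough illustration of that kind of check, here is a sketch only, assuming the value is kept as float64 bits inside a uint64 the way valBits is; the function name atomicAddFloatChecked is hypothetical and not part of client_golang:

package main

import (
	"errors"
	"fmt"
	"math"
	"sync/atomic"
)

// atomicAddFloatChecked adds delta to the float64 stored as bits in *bits and
// reports when the addition is absorbed by float64 rounding.
func atomicAddFloatChecked(bits *uint64, delta float64) error {
	for {
		oldBits := atomic.LoadUint64(bits)
		oldVal := math.Float64frombits(oldBits)
		newVal := oldVal + delta
		if delta != 0 && newVal == oldVal {
			// The stored value is already so large that adding delta rounds
			// back to the same float64, so the metric would silently stall.
			return errors.New("addition lost to float64 rounding")
		}
		if atomic.CompareAndSwapUint64(bits, oldBits, math.Float64bits(newVal)) {
			return nil
		}
	}
}

func main() {
	bits := math.Float64bits(float64(1 << 60))
	// 1<<60 + 1 rounds back to 1<<60, so the check reports the lost addition.
	fmt.Println(atomicAddFloatChecked(&bits, 1))
}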

@ArthurSens (Member)

Thanks a lot for such a detailed explanation ❤️, it really helped me understand the problem!

In the meantime, I was also thinking about how common this problem is. The project has been around for a few years already and we haven't seen many reports suggesting that overflow caused bugs in production. Is this a problem at your current work, or are you solving a theoretical problem?

I'm afraid we'll make our codebase harder to read/understand while not solving real problems (or solving a very niche one).

@xieyuschen (Author)

> Thanks a lot for such a detailed explanation ❤️, it really helped me understand the problem!
>
> In the meantime, I was also thinking about how common this problem is. The project has been around for a few years already and we haven't seen many reports suggesting that overflow caused bugs in production. Is this a problem at your current work, or are you solving a theoretical problem?
>
> I'm afraid we'll make our codebase harder to read/understand while not solving real problems (or solving a very niche one).

  1. Regarding "how common the question is", I would rather say it's rare and a mild issue. For monitoring, minor precision loss and overflow problems are easily overwhelmed by the numerous metrics in a long time series, let alone metric vectors, so it's hard for users to notice in practice.

  2. "Is this a problem at your current work or just solving a theoretical problem?": I found this issue because I wanted to understand the implementation of the Prometheus Go client, and I noticed code similar to something that caused a bug in my previous programming. So I think it's a theoretical problem rather than a practical one.

  3. "I'm afraid we'll make our codebase harder to read/understand, while we don't solve real problems (or a very niche problem)": I agree with you; this is not really impactful (it fixes rare cases) but makes the code harder to maintain. I'm also fine with the PR being rejected, because I agree this is a tricky problem.

@bwplotka (Member)

Nice work! ...and yes, @ArthurSens' point is solid: do we have an important use case (e.g. somebody who hit this issue and was impacted) that justifies the complexity? Here the complexity is not too bad (the second PR #1478 is much bigger, though), but there is also an efficiency aspect. This is ultimately a hot path, and we (mostly @beorn7) did massive work to ensure it is as fast as possible. This PR changes the hot path from a simple atomic add to something with a few more instructions. Perhaps running a micro-benchmark would help decide?
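
For reference, a micro-benchmark along those lines could be as simple as the sketch below; the benchmark name and test package are illustrative, not taken from the repository:

package prometheus_test

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus"
)

// Exercises the counter hot path so the cost of extra overflow checks can be
// compared before and after the change.
func BenchmarkCounterIncHotPath(b *testing.B) {
	c := prometheus.NewCounter(prometheus.CounterOpts{Name: "bench_counter_inc"})
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		c.Inc()
	}
}

Running it with go test -bench=BenchmarkCounterIncHotPath -benchmem on both branches would show whether the extra instructions are measurable.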

But before we move forward: the consequence of an overflow is that the counter will wrap around and start from zero. Given that going back to zero is literally the safety mechanism of Prometheus counters, i.e. a counter reset, isn't this totally safe behaviour?

There might be nuances with the created timestamp here, but otherwise this should result in totally valid and precise observations on e.g. Prometheus rates, no?
