
fix: prevent overflow on counter adding #1474

Open · wants to merge 2 commits into base: main

Conversation

@xieyuschen (Author) commented Mar 18, 2024

Possible Overflow

Currently, the Add API of Counter and similar implementations have an overflow problem. I wrote a test case, TestCounterAddExcess, to demonstrate it in commit d222f97, and the output verifies that the overflow does happen as expected.

--- FAIL: TestCounterAddExcess (0.00s)
    /home/xieyuschen/client_golang/prometheus/counter_test.go:44: oops! overflow
    /home/xieyuschen/client_golang/prometheus/counter_test.go:62: failed because the internal overflow
    /home/xieyuschen/client_golang/prometheus/counter_test.go:68: verify the overflow case
FAIL
FAIL	github.com/prometheus/client_golang/prometheus	0.007s

Proposal

I think it's worth fixing, as it is a possible case. I haven't checked the historical discussions on this topic much, so I would like to discuss with the maintainers whether you would accept such a change.

We need to check one more case for possible overflow, since the uint() cast already prevents precision loss during the conversion (for example, 2<<64-1):

  • the addition overflows

Going one step further, we can add the valInt value into the float stored in valBits and then reset valInt to zero.
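
A minimal sketch of the first part of that idea, detecting whether an addition would overflow; the helper name addWouldOverflow is purely illustrative and not part of the counter code:

package main

import (
	"fmt"
	"math"
)

// addWouldOverflow reports whether cur+delta would wrap around the uint64
// range; it is an illustrative helper, not the client_golang implementation.
func addWouldOverflow(cur, delta uint64) bool {
	return delta > math.MaxUint64-cur
}

func main() {
	fmt.Println(addWouldOverflow(math.MaxUint64, 1)) // true: the sum wraps to 0
	fmt.Println(addWouldOverflow(8000, 42))          // false: still in range
}

When this check fires, the proposal above would spill the current valInt into valBits as a float64 before resetting valInt to zero.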

Regards.


// if v is an unsigned integer value
// and no precision is lost when casting it to uint64 and back to float64
for v == float64(ival) {
@xieyuschen (Author) commented:

One thing to note here is that this cast behaves differently on different platforms. For example, on Linux, v is not equal to float64(uint64(v)), but on macOS M1 it is. This means the check here is needed and cannot be skipped.

The verification code is:

func TestVerifyRounding(t *testing.T) {
	var f float64 = 1<<64 - 1
	if f == float64(uint64(f)) {
		t.Fatal("values are the same after casting")
	} else {
		t.Fatal("values are different")
	}
}

The Linux output can be found in the pipeline here (screenshot in the original comment), and the macOS output is shown in a second screenshot.

@xieyuschen (Author)

Hi @ArthurSens @beorn7, could I get your opinions on this fix?

@beorn7 (Member) commented Mar 25, 2024

I don't maintain this repository anymore, and I currently lack capacity to go into detail here. This is for the current maintainers @ArthurSens @kakkoyun @bwplotka .

@bwplotka (Member)

Thanks! This is definitely useful to handle, just I don't have time this week to take a detailed look. Maybe @ArthurSens ?

@xieyuschen (Author)

> Thanks! This is definitely useful to handle, just I don't have time this week to take a detailed look. Maybe @ArthurSens ?

If the current behavior is not intended, I think I could start on the code change as well to reduce the communication effort. Thanks :)

@bwplotka (Member)

Yup 💪🏽

@ArthurSens (Member)

Sorry, I would need some time to understand the overflow problem better (times like this make me wish I had a computer science degree 😕).

Do I understand correctly that this happens only with counters (i.e. not with gauges, summaries, and histograms)? I also wonder if this is something scrapers can handle correctly... Do we know if Prometheus can parse it correctly without losing precision as well?

@xieyuschen (Author) commented Mar 27, 2024

> Sorry, I would need some time to understand the overflow problem better (times like this make me wish I had a computer science degree 😕).
>
> Do I understand correctly that this happens only with counters (i.e. not with gauges, summaries, and histograms)? I also wonder if this is something scrapers can handle correctly... Do we know if Prometheus can parse it correctly without losing precision as well?

  1. For unsigned integer overflow, see the integer overflow topic in the Go spec. In short, adding 1 to the maximum uint64 (1<<64-1) overflows and the result is 0; see the output here for an example.

  2. Counter has this issue because it uses an integer to store the value when we pass an integer to Add or call Inc. Gauge doesn't have this problem, I think, as it already stores the value as a float64 in valBits and Adds correctly. Histogram doesn't have this problem because atomicAddFloat handles such cases. Summary doesn't have this issue either.

  3. Regarding the scraper, it cannot help in this case because the server doesn't distinguish between the fundamental metric types; if the value overflows, the server simply treats it as a value change.
    For example, with the following code, the initial metric value is 8000 and after ten seconds it becomes 7999 due to overflow.

package main

import (
	"math"
	"net"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	c := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "verify_scraper_overflow",
	})
	_ = prometheus.DefaultRegisterer.Register(c)
	c.Add(8000)
	go func() {
		// After ten seconds, push the counter past the uint64 limit.
		time.Sleep(10 * time.Second)
		c.Add(math.MaxUint64)
	}()
	l, err := net.Listen("tcp", "localhost:8081")
	if err != nil {
		panic(err)
	}

	http.Serve(l, promhttp.InstrumentMetricHandler(
		prometheus.DefaultRegisterer,
		promhttp.HandlerFor(
			prometheus.Gatherers([]prometheus.Gatherer{prometheus.DefaultGatherer}),
			promhttp.HandlerOpts{},
		),
	))
}

The result shows that the scraper doesn't handle the overflow at all. By the way, because the Prometheus client reports the metrics defined in dto.MetricFamily, the scraper simply respects the value inside it:

$ date +"%T" && curl localhost:8081/metrics
15:03:55
# HELP verify_scraper_overflow 
# TYPE verify_scraper_overflow counter
verify_scraper_overflow 8000
$ date +"%T" && curl localhost:8081/metrics
15:04:21
# HELP verify_scraper_overflow 
# TYPE verify_scraper_overflow counter
verify_scraper_overflow 7999
  4. "Do we know if Prometheus can parse correctly without losing precision as well?"
    For float numbers, losing precision is unavoidable due to IEEE 754. However, since the dto proto API is float, precision loss is not a big concern for the client side during addition, as long as the client provides the correct number (without overflow).
    Regarding precision loss, I ran into it several months ago and wrote a blog post about it; feel free to read it if you want to learn more.

@xieyuschen (Author)

Oh, there is another case, regarding float operations, which covers all the fundamental metric types (counter, gauge, histogram, and summary).
As we know, IEEE 754 represents floats in a different way, so a rounding error can occur once the float is larger than 2<<53. For example, the following code prints 18014398509481984 twice due to the rounding error.

package main

import "fmt"

func main() {
	f := float64(2 << 53)
	fmt.Println(uint64(f))     // 18014398509481984
	fmt.Println(uint64(f + 1)) // also 18014398509481984: f+1 rounds back to f
}

This means that once a metric's value exceeds that threshold, the value won't increase anymore if we keep calling Inc, due to the rounding error. An additional check for this is needed, such as comparing the result with the original number.
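
As a rough illustration of that kind of check, here is a sketch only, assuming the value is kept as float64 bits inside a uint64 the way valBits is; the function name atomicAddFloatChecked is hypothetical and not part of client_golang:

package main

import (
	"errors"
	"fmt"
	"math"
	"sync/atomic"
)

// atomicAddFloatChecked adds delta to the float64 stored as bits in *bits and
// reports when the addition is absorbed by float64 rounding.
func atomicAddFloatChecked(bits *uint64, delta float64) error {
	for {
		oldBits := atomic.LoadUint64(bits)
		oldVal := math.Float64frombits(oldBits)
		newVal := oldVal + delta
		if delta != 0 && newVal == oldVal {
			// The stored value is already so large that adding delta rounds
			// back to the same float64, so the metric would silently stall.
			return errors.New("addition lost to float64 rounding")
		}
		if atomic.CompareAndSwapUint64(bits, oldBits, math.Float64bits(newVal)) {
			return nil
		}
	}
}

func main() {
	bits := math.Float64bits(float64(1 << 60))
	// 1<<60 + 1 rounds back to 1<<60, so the check reports the lost addition.
	fmt.Println(atomicAddFloatChecked(&bits, 1))
}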

@ArthurSens (Member)

Thanks a lot for such a detailed explanation ❤️, it really helped me understand the problem!

In the meantime, I was also thinking about how common this problem is. The project has been around for a few years already and we haven't seen many reports suggesting that overflow caused bugs in production. Is this a problem at your current work, or are you solving a theoretical problem?

I'm afraid we'll make our codebase harder to read/understand while not solving real problems (or solving a very niche one).

@xieyuschen (Author)

> Thanks a lot for such a detailed explanation ❤️, it really helped me understand the problem!
>
> In the meantime, I was also thinking about how common this problem is. The project has been around for a few years already and we haven't seen many reports suggesting that overflow caused bugs in production. Is this a problem at your current work, or are you solving a theoretical problem?
>
> I'm afraid we'll make our codebase harder to read/understand while not solving real problems (or solving a very niche one).

  1. Regarding "how common the question is", I would rather say it's rare and a mild issue. For monitoring, minor precision loss and overflow problems are easily overwhelmed by the numerous metrics in a long time series, let alone metric vectors, so it's hard for users to notice in practice.

  2. "Is this a problem at your current work or just solving a theoretical problem?": I found this issue because I wanted to understand the implementation of the Prometheus Go client, and I noticed code similar to something that caused a bug in my previous programming. So I think it's a theoretical problem rather than a practical one.

  3. "I'm afraid we'll make our codebase harder to read/understand, while we don't solve real problems (or a very niche problem)": I agree with you; this is not really impactful (it fixes rare cases) but makes the code harder to maintain. I'm also fine with the PR being rejected, because I agree this is a tricky problem.

@bwplotka (Member)

Nice work! ...and yes, @ArthurSens' point is solid: do we have an important use case (e.g. somebody who hit this issue and was impacted) that justifies the complexity? Here the complexity is not too bad (the second PR #1478 is much bigger, though), but there is also an efficiency aspect. This is ultimately a hot path, and we (mostly @beorn7) did massive work to ensure it is as fast as possible. This PR changes the hot path from a simple atomic add to something with a few more instructions. Perhaps running a micro-benchmark would help decide?
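
For reference, a micro-benchmark along those lines could be as simple as the sketch below; the benchmark name and test package are illustrative, not taken from the repository:

package prometheus_test

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus"
)

// Exercises the counter hot path so the cost of extra overflow checks can be
// compared before and after the change.
func BenchmarkCounterIncHotPath(b *testing.B) {
	c := prometheus.NewCounter(prometheus.CounterOpts{Name: "bench_counter_inc"})
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		c.Inc()
	}
}

Running it with go test -bench=BenchmarkCounterIncHotPath -benchmem on both branches would show whether the extra instructions are measurable.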

But before we move forward: the consequence of an overflow is that the counter will wrap around and start from zero. Given that going back to zero is literally the safety mechanism of Prometheus counters, i.e. a counter reset, isn't this totally safe behaviour?

There might be nuances with the created timestamp here, but otherwise this should result in totally valid and precise observations on e.g. Prometheus rates, no?
