fix: prevent overflow on counter adding #1474
base: main
Conversation
// if v is an unsigned integer
// and precision isn't lost during the uint cast and the cast back to float
for v == float64(ival) {
One thing to note here is that this cast behaves differently on different platforms. For example, on Linux, v won't be equal to float64(uint64(v)), but on an M1 macOS machine it is. This means the check here is needed and cannot be skipped.
The verification code is:
func TestVerifyRounding(t *testing.T) {
	// 1<<64 - 1 (math.MaxUint64) rounds up to 2^64 when stored in a float64,
	// so uint64(f) is an out-of-range conversion whose result is platform-specific.
	var f float64 = 1<<64 - 1
	// Both branches fail deliberately so the platform-specific result shows up in the CI output.
	if f == float64(uint64(f)) {
		t.Fatal("values are the same after casting")
	} else {
		t.Fatal("values are different")
	}
}
The Linux output can be found in the pipeline here.
Hi @ArthurSens @beorn7, could I get your opinions on this fix?
I don't maintain this repository anymore, and I currently lack the capacity to go into detail here. This is for the current maintainers @ArthurSens @kakkoyun @bwplotka.
Thanks! This is definitely useful to handle; I just don't have time this week to take a detailed look. Maybe @ArthurSens?
If the current behavior is not intended, I could start on the code change as well to reduce the communication effort. Thanks :)
Yup 💪🏽
Sorry, I would need some time to understand the overflow problem better (times like this make me wish I had a computer science degree 😕). Do I understand correctly that this happens only with counters (i.e. not with gauges, summaries, and histograms)? I also wonder if this is something that scrapers can handle correctly... Do we know if Prometheus can parse the value correctly, without losing precision as well?
package main

import (
	"math"
	"net"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	c := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "verify_scraper_overflow",
	})
	_ = prometheus.DefaultRegisterer.Register(c)
	c.Add(8000)
	// After 10 seconds, push the counter past the uint64 range to trigger the overflow.
	go func() {
		time.Sleep(10 * time.Second)
		c.Add(math.MaxUint64)
	}()
	l, err := net.Listen("tcp", "localhost:8081")
	if err != nil {
		panic(err)
	}
	// Expose the default registry so a scraper can observe the value before and after the overflow.
	http.Serve(l, promhttp.InstrumentMetricHandler(
		prometheus.DefaultRegisterer,
		promhttp.HandlerFor(
			prometheus.Gatherers{prometheus.DefaultGatherer},
			promhttp.HandlerOpts{},
		),
	))
}

The result is that the scraper doesn't handle the overflow at all. By the way, because the prometheus client reports the metrics defined in
Oh, there is one more case regarding float operations, and it covers all fundamental metric types (counter, gauge, histogram, and summary):

package main

import "fmt"

func main() {
	f := float64(2 << 53) // 2^54: beyond 2^53, not every integer is exactly representable in a float64
	fmt.Println(uint64(f))
	fmt.Println(uint64(f + 1)) // prints the same value: the +1 is lost to rounding
}

This means that once a metric's value exceeds this threshold, the value no longer changes if we keep calling Add with small increments.
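To make that consequence concrete, here is a minimal standalone snippet (my own illustration, not code from this PR) showing that repeated +1 increments stop having any effect once a float64 value reaches 2^53:

package main

import "fmt"

func main() {
	v := float64(1 << 53) // 9007199254740992: the point where float64 stops representing every integer
	for i := 0; i < 1000; i++ {
		v += 1 // each +1 rounds back down to 2^53 (round-to-even)
	}
	fmt.Println(v == float64(1<<53)) // true: a thousand increments changed nothing
}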
Thanks a lot for such a detailed explanation ❤️, it really helped me understand the problem! In the meantime, I was also thinking about how common this problem is. The project has been around for a few years already and we haven't seen many reports suggesting that overflow caused bugs in production. Is this a problem at your current work, or are we solving a theoretical problem? I'm afraid we'll make our codebase harder to read/understand while not solving a real problem (or only a very niche one).
Nice work! ...and yes, @ArthurSens' point is solid: do we have an important use case (e.g. somebody hitting this issue and being impacted) that justifies the complexity? Here the complexity is not too bad (the second PR #1478 is much bigger, though), but there is also the aspect of efficiency: this is ultimately a hot path, and we (mostly @beorn7) did massive work to ensure it is as fast as possible. This PR changes the hot path from a simple atomic add to something with a few more instructions. Perhaps running a micro-benchmark would help decide? But before we move forward: the consequence of overflow is that the counter wraps around and starts from zero. Given that starting from zero is literally the safety mechanism of Prometheus counters, that just means a counter reset; isn't this totally safe behaviour? There might be nuances with the created timestamp here, but otherwise this will result in totally valid and precise observations on e.g. Prometheus rates, no?
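For the efficiency question, a micro-benchmark along these lines could be run against main and against this branch to quantify the cost of the extra check on the hot path (a hypothetical sketch in a _test.go file; the benchmark name and setup are assumptions, not part of this PR):

package main

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus"
)

// BenchmarkCounterAdd exercises the Counter.Add hot path so the cost of the
// additional overflow handling can be compared before and after the change.
func BenchmarkCounterAdd(b *testing.B) {
	c := prometheus.NewCounter(prometheus.CounterOpts{Name: "bench_counter_add_total"})
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		c.Add(1)
	}
}

Running it with go test -bench CounterAdd -benchmem on both branches would show whether the extra instructions are measurable.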
Possible Overflow
Currently, the Add API of Counter (and the similar implementations) has an overflow problem. I wrote a test case, TestCounterAddExcess, to demonstrate it in commit d222f97, and its output verifies that the overflow does happen as expected.

Proposal
I think it's nice to fix it, as it's a possible case. I haven't checked much of the historical discussion on this topic, so I would like to discuss with the maintainers whether you would like to accept such a change or not.
We need to check one more additional case for the possible overflows, as the uint() cast already prevents precision loss during the conversion (for example, 2<<64-1). One step further, we can add the valInt value into the float valBits and then reset valInt to zero.
Regards.
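To make the proposed behaviour concrete, here is a simplified, self-contained sketch of the idea (my own illustration; toyCounter and its fields are invented names, and this is not the actual client_golang implementation): before a whole-number Add would overflow the uint64 part, the accumulated integer value is folded into the float part and the integer part is reset to zero.

package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

// toyCounter is a hypothetical, simplified stand-in for the counter's split storage:
// whole-number increments go to valInt, everything else goes to valBits (float64 bits).
type toyCounter struct {
	valInt  uint64 // sum of whole-number increments
	valBits uint64 // math.Float64bits of the float part
}

func (c *toyCounter) addFloat(v float64) {
	for {
		old := atomic.LoadUint64(&c.valBits)
		newBits := math.Float64bits(math.Float64frombits(old) + v)
		if atomic.CompareAndSwapUint64(&c.valBits, old, newBits) {
			return
		}
	}
}

// Add assumes v >= 0, as for counters. Whole numbers take the fast integer path;
// when the integer part would overflow, it is spilled into the float part and
// reset to zero, following the proposal above.
func (c *toyCounter) Add(v float64) {
	ival := uint64(v)
	if v == float64(ival) { // v survives the uint64 round trip
		for {
			old := atomic.LoadUint64(&c.valInt)
			if old <= math.MaxUint64-ival {
				if atomic.CompareAndSwapUint64(&c.valInt, old, old+ival) {
					return
				}
				continue // lost the race, retry
			}
			// Adding ival would overflow: fold the accumulated integer part into
			// the float part, reset the integer part, then retry the addition.
			if atomic.CompareAndSwapUint64(&c.valInt, old, 0) {
				c.addFloat(float64(old))
			}
		}
	}
	c.addFloat(v)
}

func (c *toyCounter) value() float64 {
	return math.Float64frombits(atomic.LoadUint64(&c.valBits)) +
		float64(atomic.LoadUint64(&c.valInt))
}

func main() {
	c := &toyCounter{}
	c.Add(float64(uint64(1) << 63)) // fast integer path
	c.Add(float64(uint64(1) << 63)) // would overflow uint64: spills into the float part
	fmt.Println(c.value())          // ~1.84e+19 instead of wrapping around toward zero
}

The real counter would additionally have to keep concurrent readers consistent while the spill is in flight; this sketch glosses over that and only shows the overflow check itself.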