nomad process dies with panic: counter cannot decrease in value #15861

Closed
rbastiaans-tc opened this issue Jan 24, 2023 · 15 comments · Fixed by #18835

Comments

@rbastiaans-tc

rbastiaans-tc commented Jan 24, 2023

Nomad version

Nomad v1.3.1 (2b054e38e91af964d1235faa98c286ca3f527e56)

Operating system and Environment details

Distributor ID:	Debian
Description:	Debian GNU/Linux 10 (buster)
Release:	10
Codename:	buster

Issue

The Nomad agent process died with a message: panic: counter cannot decrease in value

See full panic output below.

Reproduction steps

Unknown

Expected Result

Nomad stays running

Actual Result

Nomad process exits

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

Sun 2023-01-22 16:57:52 UTC nomadclient013 nomad[3403]: ==> Newer Nomad version available: 1.4.3 (currently running: 1.3.1)
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]: panic: counter cannot decrease in value
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]: goroutine 7598137 [running]:
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]: github.com/prometheus/client_golang/prometheus.(*counter).Add(...)
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]:         github.com/prometheus/client_golang@v1.12.0/prometheus/counter.go:109
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]: github.com/prometheus/client_golang/prometheus.(*goCollector).Collect(0xc000280080, 0xc00235bf60)
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]:         github.com/prometheus/client_golang@v1.12.0/prometheus/go_collector_go117.go:147 +0x5ec
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]: github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]:         github.com/prometheus/client_golang@v1.12.0/prometheus/registry.go:446 +0x102
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]: created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
Sun 2023-01-22 16:58:41 UTC nomadclient013 nomad[3403]:         github.com/prometheus/client_golang@v1.12.0/prometheus/registry.go:457 +0x4e8
Sun 2023-01-22 16:58:41 UTC nomadclient013 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Sun 2023-01-22 16:58:41 UTC nomadclient013 systemd[1]: nomad.service: Failed with result 'exit-code'.
Sun 2023-01-22 16:58:43 UTC nomadclient013 systemd[1]: nomad.service: Service RestartSec=2s expired, scheduling restart.
Sun 2023-01-22 16:58:43 UTC nomadclient013 systemd[1]: nomad.service: Scheduled restart job, restart counter is at 1.

Nomad config

data_dir = "/var/lib/nomad"

client {
    enabled = true
    options {
        docker.auth.config = "/etc/docker/dockercfg.json"
    }
    meta {
        az = "az-1"
    }
}

vault {
    enabled = true
    tls_skip_verify = true
    address = "https://xxxx"
    namespace = "xxxx"
}

telemetry {
    publish_allocation_metrics = true
    publish_node_metrics = true
    prometheus_metrics = true
}

consul {
    tags = ["no-http"]
}
@lgfa29
Contributor

lgfa29 commented Jan 24, 2023

Hi @rbastiaans-tc 👋

Thanks for the report. Unfortunately we will need more information in order to understand what's going wrong.

Looking at the code, this panic is raised by the Prometheus client library when a counter is asked to add a negative value.

This library is used by go-metrics in the IncrCounter and IncrCounterWithLabels methods, but all uses of these methods in Nomad pass either 1 or the len(...) of something, which is always positive.
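As a standalone illustration (not Nomad code), the same panic message can be reproduced by handing a negative delta to a Prometheus counter; the metric name below is made up:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Illustrative counter; the metric name is made up.
	c := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_total",
		Help: "Counters may only increase.",
	})

	c.Add(1) // fine: positive increments are allowed

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // counter cannot decrease in value
		}
	}()
	c.Add(-1) // panics inside client_golang with the same message seen in the logs above
}
```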

So in order to understand what's going on with your cluster I built custom binaries that output more information when the panic occurs. You can find them at the bottom of this page: https://github.com/hashicorp/nomad/actions/runs/4000378647

One very important thing: these binaries are for testing only and should not be run in production. So if you could, please copy the data_dir of the agent presenting this problem somewhere else and run the custom binary pointing to the new directory.

The changes included in these binaries can be viewed here:

Another thing that could be relevant, what CPU architecture are you using?

@rbastiaans-tc
Author

> Hi @rbastiaans-tc 👋

@lgfa29 I'm not sure how soon, or even if, we can run that in the short term. As far as I know, we have unfortunately only had this issue occur once so far, in our production cluster.

Upgrading the Nomad version, even in our dev environment, is not that easy, because we are suffering from another Nomad bug where Nomad agent restarts cause jobs using CSI volumes to end up with lost allocations (#13028). Therefore we cannot do in-place upgrades that easily.

Even if we could put your modified version in place easily, it might take a long time for this bug to re-occur.

I was planning to upgrade to 1.3.8 first to hopefully get rid of some of these bugs associated with CSI volumes.

Besides the root cause of the panic, I would imagine the Nomad process should never crash because of an issue with telemetry or monitoring metrics alone. Especially in production environments.

Wouldn't catching that panic be a relatively easy fix? I will see about running that modified version, but I'm not sure if I can do that anytime soon.

> Another thing that could be relevant, what CPU architecture are you using?

We are running x86_64 or "amd64" version of Debian.

$ uname -a
Linux xxx 4.19.0-20-cloud-amd64 #1 SMP Debian 4.19.235-1 (2022-03-17) x86_64 GNU/Linux

$ file nomad
nomad: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=8aa832ffba2495157c039cc24548a9738520f98e, for GNU/Linux 3.2.0, not stripped

@lgfa29
Contributor

lgfa29 commented Jan 26, 2023

Ah, no worries. I thought this was something that was always happening when your agent started, so it could be tested quickly.

> Besides the root cause of the panic, I would imagine the Nomad process should never crash because of an issue with telemetry or monitoring metrics alone. Especially in production environments.
>
> Wouldn't catching that panic be a relatively easy fix?

I definitely agree with this, but it's not so simple to fix this from the Nomad side. The telemetry library we use is called in several places and, as far as I can tell, we always send values that are supposed to be positive.

I will have to discuss with the rest of the team how to best handle this. It would probably require changes to the go-metrics module.

@rbastiaans-tc
Author

> Ah, no worries. I thought this was something that was always happening when your agent started, so it could be tested quickly.

Ah, perhaps that wasn't clear from my report. It happened after Nomad had been running for a long time, and on only one machine in the cluster so far.

Thanks so much for looking into this @lgfa29

@lgfa29
Contributor

lgfa29 commented Jan 27, 2023

Got it, thanks! Yeah, we've been using this library for a while and haven't seen any reports about this before. I also haven't heard from other teams that use it. So it seems like it was an unfortunate, but very rare, situation.

@lgfa29 lgfa29 removed their assignment Feb 1, 2023
@lgfa29 lgfa29 moved this from In Progress to Needs Roadmapping in Nomad - Community Issues Triage Feb 1, 2023
@rbastiaans-tc
Author

This happened again last night @lgfa29

Different machine, same Nomad version.

Mar  2 23:35:58 nomadclient002 nomad[8008]: panic: counter cannot decrease in value
Mar  2 23:35:58 nomadclient002 nomad[8008]: goroutine 8369928 [running]:
Mar  2 23:35:58 nomadclient002 nomad[8008]: github.com/prometheus/client_golang/prometheus.(*counter).Add(...)
Mar  2 23:35:58 nomadclient002 nomad[8008]:         github.com/prometheus/client_golang@v1.12.0/prometheus/counter.go:109
Mar  2 23:35:58 nomadclient002 nomad[8008]: github.com/prometheus/client_golang/prometheus.(*goCollector).Collect(0xc000058180, 0xc001c11f60)
Mar  2 23:35:58 nomadclient002 nomad[8008]:         github.com/prometheus/client_golang@v1.12.0/prometheus/go_collector_go117.go:147 +0x5ec
Mar  2 23:35:58 nomadclient002 nomad[8008]: github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
Mar  2 23:35:58 nomadclient002 nomad[8008]:         github.com/prometheus/client_golang@v1.12.0/prometheus/registry.go:446 +0x102
Mar  2 23:35:58 nomadclient002 nomad[8008]: created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
Mar  2 23:35:58 nomadclient002 nomad[8008]:         github.com/prometheus/client_golang@v1.12.0/prometheus/registry.go:538 +0xb4d
Mar  2 23:35:58 nomadclient002 systemd[1]: nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Mar  2 23:35:58 nomadclient002 systemd[1]: nomad.service: Failed with result 'exit-code'.

@tgross
Member

tgross commented Mar 3, 2023

It occurs to me that any stack trace isn't going to be that useful because it's getting messages over a channel, so this is just the Prometheus goroutine. I wonder if we're starting from the wrong position here. If we know we aren't trying to decrement a counter, maybe something else is? The Prometheus client gets metrics from the Go runtime. I don't see anything obvious in go_collector.go or in the golang issues for MemStats, but I think it'd be worth some investigation here.
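For context, a minimal standalone snippet (not Nomad code) showing the collector named in the stack traces: client_golang's Go runtime collector, whose Collect method runs when the registry is gathered:

```go
package main

import (
	"fmt"
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
)

func main() {
	reg := prometheus.NewRegistry()

	// This is the collector that shows up in the panic stack traces above
	// (goCollector.Collect); its metrics come from the Go runtime, not from
	// anything the application increments directly.
	reg.MustRegister(collectors.NewGoCollector())

	// Registry.Gather invokes each registered collector's Collect method,
	// which is where the "counter cannot decrease in value" panic originated.
	mfs, err := reg.Gather()
	if err != nil {
		log.Fatal(err)
	}
	for _, mf := range mfs {
		fmt.Println(mf.GetName()) // go_goroutines, go_memstats_..., etc.
	}
}
```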

@lgfa29
Contributor

lgfa29 commented Mar 10, 2023

Ah that's right, we emit metrics for the runtime environment, and probably other things I'm not remembering right now 😅

I think the best option we have is to move forward with hashicorp/go-metrics#146, which would prevent crashes and also give us more information about which metric is behaving unexpectedly.
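A rough sketch of the idea behind that change, with hypothetical names (the real implementation lives in hashicorp/go-metrics#146): clamp and log negative deltas instead of forwarding them to the Prometheus sink, which would otherwise panic:

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// safeAdd is a hypothetical guard: instead of letting the Prometheus sink
// panic on a negative delta, drop the value and log it so the offending
// metric can be identified. It is not the actual go-metrics change.
func safeAdd(c prometheus.Counter, name string, delta float64) {
	if delta < 0 {
		log.Printf("metrics: dropping negative increment %f for counter %q", delta, name)
		return
	}
	c.Add(delta)
}

func main() {
	c := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_total",
		Help: "Illustration only.",
	})
	safeAdd(c, "example_total", 1)    // normal increment
	safeAdd(c, "example_total", -0.5) // logged and dropped instead of panicking
}
```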

@rbastiaans-tc
Author

@tgross @lgfa29 Can we please get that metrics PR merged and that into a Nomad release?

This is still happening for me on occasion, and in production environments that's not great, especially in combination with that CSI volume bug. It means that whenever Nomad dies from this panic, all our jobs on that node that use CSI volumes also get killed and rescheduled.

So a fix for at least this panic would already help us, regardless of the CSI volume bug, which seems more difficult to tackle. This metrics PR seems like an easier win.

@tgross
Member

tgross commented Oct 19, 2023

@lhossack has reported seeing the logging we added to detect this case in #18804, so that's a good start on next steps for this issue.

@tgross tgross self-assigned this Oct 23, 2023
@tgross tgross moved this from Needs Roadmapping to Triaging in Nomad - Community Issues Triage Oct 23, 2023
tgross added a commit that referenced this issue Oct 23, 2023
The iowait metric obtained from `/proc/stat` can under some circumstances
decrease. The relevant condition is when an interrupt arrives on a different
core than the one that gets woken up for the IO, and a particular counter in the
kernel for that core gets interrupted. This is documented in the man page for
the `proc(5)` pseudo-filesystem, and considered an unfortunate behavior that
can't be changed for the sake of ABI compatibility.

In Nomad, we get the current "busy" time (everything except for idle) and
compare it to the previous busy time to get the counter increment. If the
iowait counter decreases and the idle counter increases more than the increase
in the total busy time, we can get a negative total. This previously caused a
panic in our metrics collection (see #15861) but that is being prevented by
reporting an error message.

Fix the bug by putting a zero floor on the values we return from the host CPU
stats calculator.

Fixes: #15861
Fixes: #18804
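A simplified sketch of the fix described above, with hypothetical names and structure (not Nomad's actual host stats code): floor the busy-time delta at zero so a backwards-moving iowait counter can never produce a negative increment:

```go
package main

import "fmt"

// cpuSample is a hypothetical stand-in for the values Nomad derives from
// /proc/stat; it is not the actual host stats type.
type cpuSample struct {
	busy float64 // everything except idle
	idle float64
}

// busyDelta returns the increase in busy time between two samples. Because
// the kernel's iowait counter can move backwards, the raw difference can be
// negative; flooring it at zero means downstream metric counters never see
// a negative increment.
func busyDelta(prev, cur cpuSample) float64 {
	if d := cur.busy - prev.busy; d > 0 {
		return d
	}
	return 0
}

func main() {
	prev := cpuSample{busy: 1000.0, idle: 5000.0}
	cur := cpuSample{busy: 999.5, idle: 5002.0} // iowait went backwards
	fmt.Println(busyDelta(prev, cur))           // prints 0, not -0.5
}
```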
@tgross
Member

tgross commented Oct 23, 2023

Fix in #18835. See my comment here #18804 (comment) for a detailed breakdown of the problem.

@tgross tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Oct 23, 2023
@tgross tgross added this to the 1.7.0 milestone Oct 23, 2023
Nomad - Community Issues Triage automation moved this from In Progress to Done Oct 24, 2023
tgross added commits that referenced this issue Oct 24, 2023
nvanthao pushed commits to nvanthao/nomad that referenced this issue Mar 1, 2024
@chenk008

chenk008 commented Apr 2, 2024

I think it is related to prometheus/client_golang#969

@lgfa29
Contributor

lgfa29 commented Apr 4, 2024

Hi @chenk008 👋

Have you encountered this issue after #18835 was released in Nomad 1.6.3?

@chenk008

chenk008 commented Apr 5, 2024

@lgfa29 Sorry to disturb you. I have the same error in my own project, with client_golang 1.12.0. After searching, this issue was the only result I found, so I left a comment. The goroutine stack is in client_golang; it is a bug in client_golang 1.12.0.

@lgfa29
Contributor

lgfa29 commented Apr 5, 2024

No worries @chenk008, I'm mostly trying to understand the exact issue you've experienced.

Would you be able to open a new issue and post the exact error from log output and Nomad version you're using?

Thanks!
