Incorrectly reports smartctl_device_smart_status=0 on drives with passing status #229

Open
antifuchs opened this issue May 11, 2024 · 4 comments

@antifuchs
Contributor

antifuchs commented May 11, 2024

I've just upgraded to 2cc2249 from 0768a40, and something is wrong in the reporting of SMART status for SATA-connected SSDs. It reports smartctl_device_smart_status=0 on these, but I believe the value should be 1, according to what smartctl's JSON output reports for the drives.

Best to show an example:

:;    curl -s http://100.87.138.39:9633/metrics | grep smartctl_device_smart_status
# HELP smartctl_device_smart_status General smart status
# TYPE smartctl_device_smart_status gauge
smartctl_device_smart_status{device="nvme0"} 1
smartctl_device_smart_status{device="sda"} 0
smartctl_device_smart_status{device="sdb"} 1
smartctl_device_smart_status{device="sdc"} 1
smartctl_device_smart_status{device="sdd"} 1
smartctl_device_smart_status{device="sde"} 1
smartctl_device_smart_status{device="sdf"} 1
smartctl_device_smart_status{device="sdg"} 1
smartctl_device_smart_status{device="sdh"} 1
smartctl_device_smart_status{device="sdi"} 1
smartctl_device_smart_status{device="sdj"} 0
smartctl_device_smart_status{device="sdk"} 0
smartctl_device_smart_status{device="sdl"} 1
:;    for d in sda sdj sdk ; do echo -n "$d: " ; sudo smartctl --json -a /dev/$d | jq .smart_status.passed ; done
sda: true
sdj: true
sdk: true
:;    for d in sda sdj sdk ; do echo -n "$d: " ; sudo smartctl --json -a /dev/$d | jq .model_name ; done
sda: "SuperMicro SSD"
sdj: "Samsung SSD 870 EVO 2TB"
sdk: "Samsung SSD 870 EVO 2TB"
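The manual comparison above can be narrowed to just the suspicious drives. The following is a minimal sketch, not part of the exporter: `failing_devices` is a hypothetical helper name, and it assumes each metric line ends with the sample value (no trailing timestamp), as in the output shown above.

```shell
#!/bin/sh
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# print the device label of every smartctl_device_smart_status sample
# whose value is 0, one device per line.
failing_devices() {
    awk '
        /^smartctl_device_smart_status[{]/ && $NF == 0 {
            # Extract the value of the device="..." label.
            if (match($0, /device="[^"]*"/)) {
                print substr($0, RSTART + 8, RLENGTH - 9)
            }
        }
    '
}

# Possible usage, reusing the commands from the transcript above
# (address and use of sudo are taken from this report; adjust as needed):
#
#   curl -s http://100.87.138.39:9633/metrics | failing_devices |
#   while read -r d; do
#       echo -n "$d: "
#       sudo smartctl --json -a "/dev/$d" | jq .smart_status.passed
#   done
```

Any device that this prints while smartctl itself reports `true` is a candidate for the mismatch described in this issue.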

I'm not sure what's going on there, but something is wrong and it's making my disk badness monitoring go off spuriously /:

@k0ste
Contributor

k0ste commented May 11, 2024

@antifuchs, the problem may be less mysterious if you show the debug log

@antifuchs
Contributor Author

Running:

smartctl_exporter \
     --log.level=debug \
     --smartctl.path=/nix/store/whfmc5r1irm9j3n9glzxc77cl50241y2-smartmontools-7.4/bin/smartctl \
     --smartctl.interval=10m \
     --web.listen-address=127.0.0.1:9633 2>&1 | tee ~mess/debug-log

yields this (which doesn't look particularly enlightening tbh):

ts=2024-05-11T19:34:38.019Z caller=main.go:167 level=info msg="Starting smartctl_exporter" version="(version=, branch=, revision=unknown)"
ts=2024-05-11T19:34:38.019Z caller=main.go:168 level=info msg="Build context" build_context="(go=go1.22.2, platform=linux/amd64, user=, date=, tags=unknown)"
ts=2024-05-11T19:34:38.020Z caller=readjson.go:79 level=debug msg="Scanning for devices"
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sda
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdb
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdc
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdd
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sde
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdf
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdg
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdh
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdi
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdj
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdk
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=sdl
ts=2024-05-11T19:34:38.046Z caller=main.go:128 level=info msg="Found device" name=nvme0
ts=2024-05-11T19:34:38.046Z caller=main.go:172 level=info msg="Number of devices found" count=13
ts=2024-05-11T19:34:38.046Z caller=main.go:185 level=info msg="Start background scan process"
ts=2024-05-11T19:34:38.047Z caller=main.go:186 level=info msg="Rescanning for devices every" rescanInterval=10m0s
ts=2024-05-11T19:34:38.069Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sda duration=21.995655ms
ts=2024-05-11T19:34:38.069Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sda family=unknown model=unknown
ts=2024-05-11T19:34:38.094Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdb duration=24.664627ms
ts=2024-05-11T19:34:38.094Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdb family=unknown model=unknown
ts=2024-05-11T19:34:38.129Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdc duration=34.2836ms
ts=2024-05-11T19:34:38.130Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdc family=unknown model=unknown
ts=2024-05-11T19:34:38.157Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdd duration=26.83563ms
ts=2024-05-11T19:34:38.157Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdd family=unknown model=unknown
ts=2024-05-11T19:34:38.183Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sde duration=25.518334ms
ts=2024-05-11T19:34:38.184Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sde family=unknown model=unknown
ts=2024-05-11T19:34:38.212Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdf duration=27.646302ms
ts=2024-05-11T19:34:38.212Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdf family=unknown model=unknown
ts=2024-05-11T19:34:38.247Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdg duration=34.147328ms
ts=2024-05-11T19:34:38.247Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdg family=unknown model=unknown
ts=2024-05-11T19:34:38.275Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdh duration=27.762252ms
ts=2024-05-11T19:34:38.275Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdh family=unknown model=unknown
ts=2024-05-11T19:34:38.309Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdi duration=33.025595ms
ts=2024-05-11T19:34:38.309Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdi family=unknown model=unknown
ts=2024-05-11T19:34:38.333Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdj duration=23.642763ms
ts=2024-05-11T19:34:38.333Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdj family=unknown model=unknown
ts=2024-05-11T19:34:38.354Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdk duration=20.821869ms
ts=2024-05-11T19:34:38.355Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdk family=unknown model=unknown
ts=2024-05-11T19:34:38.388Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=sdl duration=32.682864ms
ts=2024-05-11T19:34:38.388Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdl family=unknown model=unknown
ts=2024-05-11T19:34:38.415Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=nvme0 duration=25.873639ms
ts=2024-05-11T19:34:38.415Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=nvme0 family=unknown model="Samsung SSD 980 PRO 2TB"
ts=2024-05-11T19:34:38.417Z caller=tls_config.go:313 level=info msg="Listening on" address=127.0.0.1:9633
ts=2024-05-11T19:34:38.417Z caller=tls_config.go:316 level=info msg="TLS is disabled." http2=false address=127.0.0.1:9633
ts=2024-05-11T19:34:41.304Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sda family=unknown model=unknown
ts=2024-05-11T19:34:41.304Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdb family=unknown model=unknown
ts=2024-05-11T19:34:41.305Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdc family=unknown model=unknown
ts=2024-05-11T19:34:41.305Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdd family=unknown model=unknown
ts=2024-05-11T19:34:41.305Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sde family=unknown model=unknown
ts=2024-05-11T19:34:41.306Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdf family=unknown model=unknown
ts=2024-05-11T19:34:41.306Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdg family=unknown model=unknown
ts=2024-05-11T19:34:41.307Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdh family=unknown model=unknown
ts=2024-05-11T19:34:41.307Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdi family=unknown model=unknown
ts=2024-05-11T19:34:41.308Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdj family=unknown model=unknown
ts=2024-05-11T19:34:41.308Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdk family=unknown model=unknown
ts=2024-05-11T19:34:41.308Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=sdl family=unknown model=unknown
ts=2024-05-11T19:34:41.308Z caller=smartctl.go:100 level=debug msg="Collecting metrics from" device=nvme0 family=unknown model="Samsung SSD 980 PRO 2TB"
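One pattern that does stand out in the log is that every sdX device is logged with model=unknown, while nvme0 carries its model name. Whether that is related to the wrong status is not established by the log itself. A minimal sketch for pulling those devices out of a saved debug log follows; `unknown_model_devices` is a hypothetical helper name, and it assumes the logfmt layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read the exporter's debug log on stdin and print
# each device that was logged with model=unknown, once per device.
unknown_model_devices() {
    awk '
        /msg="Collecting metrics from"/ && / model=unknown( |$)/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^device=/) {
                    name = substr($i, 8)   # strip the "device=" prefix
                    if (!(name in seen)) {
                        seen[name] = 1
                        print name
                    }
                }
            }
        }
    '
}

# Possible usage with the log file written by the command above:
#
#   unknown_model_devices < ~mess/debug-log
```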

@k0ste
Contributor

k0ste commented May 11, 2024

It seems your system is also affected by #205, because your NVMe device metrics were read correctly.
Are you using packages from a distro? It would be better if the distro used a release tarball instead of the development repo.

@antifuchs
Contributor Author

yeah, I have been building from source - that worked while the repo was semi-maintained (and I had pull reqs outstanding), but doesn't anymore. I will reconsider.
