
node_filesystem_{free,avail}_bytes reporting values larger than node_filesystem_size_bytes #1672

Open
treydock opened this issue Apr 10, 2020 · 4 comments

Comments

@treydock
Contributor

Host operating system: output of uname -a

$ uname -r
3.10.0-957.41.1.el7.x86_64

node_exporter version: output of node_exporter --version

$ node_exporter --version
node_exporter, version 1.0.0-rc.0 (branch: master, revision: a57f2465794ec60c40674706acc6c2ace12c1358)
  build user:       tdockendorf@pitzer-rw02.ten.osc.edu
  build date:       20200327-18:45:58
  go version:       go1.13.8

node_exporter command line flags

These hosts use an NFS root, which produces lots of bind mounts; that is why we have so many filesystem ignores.

ExecStart=/usr/bin/node_exporter \
--collector.filesystem.ignored-fs-types=^(gpfs|nfs|nfs4|rootfs|tmpfs|cvmfs2|iso9660|autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$ \
--collector.textfile.directory=/var/lib/node_exporter/textfile_collector \
--collector.systemd.unit-whitelist=.+\.service \
--collector.systemd.unit-blacklist=(mmsdrserv)\.service \
--collector.netclass.ignored-devices=^(eth1|eth2|eth3|ib0|lo)$ \
--collector.filesystem.ignored-mount-points=^/(var/spool|var/log|var/lib/oprofile|var/account|var/cache/opensm|var/cache/ibutils|var/mmfs|var/adm/ras|var/lib/fail2ban|opt/puppetlabs/puppet/cache/clientbucket|opt/puppetlabs/puppet/cache/state|opt/dell/srvadmin/var|var/lib/identityfinder|var/lib/.identityfinder|var/cache/man|var/gdm|var/lib/xkb|var/lib/dbus|var/lib/nfs|var/lib/postfix|var/lib/gssproxy|var/singularity|var/lib/pcp/tmp|etc/lvm/cache|etc/lvm/archive|etc/lvm/backup|var/cache/foomatic|var/cache/logwatch|var/cache/httpd/ssl|var/cache/httpd/proxy|var/cache/php-pear|var/cache/systemtap|var/db/nscd|var/lib/dav|var/lib/dhcpd|var/lib/dhclient|var/lib/php|var/lib/pulse|var/lib/rsyslog|var/lib/ups|var/tmp|var/db/sudo|var/spool/cron|etc/sysconfig/iptables.d|etc/puppetlabs/mcollective|var/lib/node_exporter/textfile_collector|etc/adjtime|var/lib/arpwatch|var/lib/NetworkManager|var/cache/alchemist|var/lib/gdm|var/lib/iscsi|var/lib/ntp|var/lib/xen|var/empty/sshd/etc/localtime|var/lib/random-seed|var/lib/samba|etc/ofed-mic.map|opt/ipcm|usr/bin/turbostat|var/lib/pcp/pmdas/perfevent|var/lib/pcp/pmdas/infiniband|etc/sysconfig/network-scripts|etc/fstab|etc/pam.d|etc/security/access|etc/security/limits.d|etc/X11/xorg.conf.d|var/lib/sss|var/lib/logrotate||dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/.+|cvmfs/.+|run/.+)($|/) \
--collector.systemd \
--collector.cpu.info \
--collector.mountstats \
--collector.ntp \
--no-collector.hwmon \
--no-collector.mdadm \
--no-collector.nfsd \
--no-collector.softnet \
--no-collector.thermal_zone \
--no-collector.zfs

Are you running node_exporter in Docker?

Not via Docker.

What did you do that produced an error?

Looked at a Grafana graph that uses these metrics. The filesystem avail bytes is an extremely large number, much larger than size bytes.

What did you expect to see?

I would never expect the avail or free bytes for a filesystem to exceed its size.

What did you see instead?

The orange line is the avail bytes; the green line that appears to be near 0 is the size in bytes.

The size in bytes is 879510155264, which is accurate, but the avail bytes is so much larger that the scale makes the size look near zero.

[Screenshot: Grafana graph, 2020-04-10 9:06 AM]

@discordianfish
Member

That is odd... What does `df` say about /tmp on these systems?

@treydock
Contributor Author

[root@o0297 ~]# df /tmp
Filesystem             1K-blocks    Used Available Use% Mounted on
/dev/mapper/vg0-lv_tmp 858896636 1176452 857720184   1% /tmp

The metric for total size remained accurate, while avail/free was the one that exceeded the total size. These are HPC compute nodes, so it's possible this happened when /tmp filled up because a user was doing something they shouldn't, but it's hard to say for sure since the monitoring numbers we rely on were the ones that were incorrect.

@discordianfish
Member

It would be useful to get the raw output of statfs from here: https://github.com/prometheus/node_exporter/blob/master/collector/filesystem_linux.go#L78

Do you see any errors in the node_exporter log? Maybe the mountpoint got stuck, leading to this miscalculation. But the code is pretty straightforward, so I'm not sure what is going on here. Maybe some float overflow (https://github.com/prometheus/node_exporter/blob/master/collector/filesystem_linux.go#L109), but I doubt that.

@treydock
Contributor Author

I've looked at the code and also can't imagine how this would become a problem, as the code essentially takes values returned by the kernel and does simple math to convert blocks to bytes.

There are no relevant errors in the logs. The only logs from node_exporter are about issues generating mountinfo, but that's a procfs issue (prometheus/procfs#282).

Apr  9 03:23:58 o0297 node_exporter: level=error ts=2020-04-09T07:23:58.801Z caller=collector.go:161 msg="collector failed" name=mountstats duration_seconds=0.007737361 err="failed to parse mountinfo: couldn't find enough fields in mount string: 108 53 0:34 / /var/lib/nfs/rpc_pipefs rw,relatime - rpc_pipefs sunrpc rw"

rexagod added a commit to rexagod/node_exporter that referenced this issue Mar 19, 2024
Handle cases where, owing to multiplying two `uint64` integers and
typecasting it to `float64`, the overall precision is lost when the
values concerned exceed the `floatMantissa64` (1 << 53) before or after
the operation (which is well within the acceptable `uint64` range).

Fixes: prometheus#1672

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
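The precision loss the commit message describes can be reproduced directly: float64 has a 53-bit mantissa, so any integer above 2^53 (the commit's `floatMantissa64`) is silently rounded when a `uint64` is converted, even though it is well within `uint64` range. A small standalone demonstration:

```go
package main

import "fmt"

func main() {
	const floatMantissa64 = uint64(1) << 53 // 9007199254740992

	n := floatMantissa64 + 1 // exact as a uint64
	f := float64(n)          // rounds to the nearest representable float64

	fmt.Println(n)         // 9007199254740993
	fmt.Println(uint64(f)) // 9007199254740992: the +1 is lost
	fmt.Println(uint64(f) == n)
}
```

The last line prints `false`. The same effect applies when two large `uint64` values are multiplied and the product is cast to `float64`: the result can differ from the true value by more than the block size, which is why the fix guards values around the mantissa limit.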
rexagod added a commit to rexagod/node_exporter that referenced this issue Mar 30, 2024
Handle cases where, owing to multiplying two `uint64` integers and
typecasting it to `float64`, the overall precision is lost when the
values concerned exceed the `floatMantissa64` (1 << 53) before or after
the operation (which is well within the acceptable `uint64` range).

Fixes: prometheus#1672

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>