Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix broken retrieval of OpenBSD CPU metrics #1241

Closed

Conversation

chrissnell
Copy link

@chrissnell chrissnell commented Feb 4, 2022

CPU metrics are broken on OpenBSD 7.0-CURRENT as described in #1239. I believe that the problem is the multiplication of the CPU incrementer by 2 when the code walks through the available CPUs to gather CPU time metrics. The code makes uses of the KERN_CPTIME2 sysctl to fetch per-CPU time metrics.

Looking at the kernel code behind this particular sysctl, we have this.. As you can see, it's calling a CPU_INFO_FOREACH macro and iterating through the results until it has the info for the requested CPU. This macro is defined per-architecture and in every one that I found, it's just iterating through cpu_info structs defined here.

I can't find anything that leads me to believe that this Go code should be incrementing by anything other than 1.

In the entire OpenBSD codebase, there's only one use of the KERN_CPTIME2 sysctl and that's in snmpd's source code, which iterates through the CPUs and it does this by incrementing the counter by 1.

This particular code ^ hasn't changed since 2016 and I'll bet it still works. :-)

Here's another clue. As I mentioned, the guts of this syscall is just iterating through cpu_info struct. I looked around for other uses of this struct and found code for the new OpenBSD debugger. There are different implementations for different architectures, but for amd64, it also increments through CPUs by 1.

Here is a PR to fix this. I only have one OpenBSD machine to test on, a simple Dell Optiplex workstation.

@chrissnell
Copy link
Author

chrissnell commented Feb 4, 2022

Test usage of the patched library on my machine:

package main

import (
	"fmt"
	"log"

	"github.com/shirou/gopsutil/v3/cpu"
)

func main() {
	perCPUTimes, err := cpu.Times(true)
	if err != nil {
		log.Fatalln("error getting cpu times:", err)
	}

	for k, v := range perCPUTimes {
		fmt.Printf("cpu %v: %+v\n", k, v)
	}

}
% ./cpu
cpu 0: {"cpu":"cpu0","user":103.8,"system":0.0,"idle":0.0,"nice":0.0,"iowait":0.0,"irq":216.2,"softirq":0.0,"steal":0.0,"guest":0.0,"guestNice":0.0}
cpu 1: {"cpu":"cpu1","user":126.5,"system":0.0,"idle":0.0,"nice":0.0,"iowait":0.0,"irq":478.4,"softirq":0.0,"steal":0.0,"guest":0.0,"guestNice":0.0}
cpu 2: {"cpu":"cpu2","user":80.6,"system":0.0,"idle":0.0,"nice":0.0,"iowait":0.0,"irq":359.1,"softirq":0.0,"steal":0.0,"guest":0.0,"guestNice":0.0}
cpu 3: {"cpu":"cpu3","user":174.5,"system":0.0,"idle":0.0,"nice":0.0,"iowait":0.0,"irq":133.0,"softirq":0.0,"steal":0.0,"guest":0.0,"guestNice":0.0}

@shirou
Copy link
Owner

shirou commented Feb 5, 2022

Thank you for your surveying and PR! I have not checked OpenBSD 7 source code, but is it changed from #647 ? What the value is the hw.smt on your machine?

@chrissnell-okta
Copy link

chrissnell-okta commented Feb 5, 2022

I don't think anything in the OpenBSD source has changed. In my case, hw.smt was set to 0 by default on my fresh install of 7.0-CURRENT.

When I set hw.cpu to 1, your code as it is today does indeed work. So, here's what I can conclude:

  1. The smt() function works as intended and accurately detects whether SMT is enabled.
  2. The multiplication of the incrementer by 2 when SMT is not detected is incorrect. This is unncecessary and causes the sysctl to query an invalid CPU.
  3. To fix it, simply don't multiply the incrementer by 2. This eliminates the need for the smt() function.

I have no idea why @omar-polo found this code change necessary. It appears that he may have been testing under QEMU--it's not clear.

@chrissnell-okta
Copy link

chrissnell-okta commented Feb 5, 2022

Some info about my machine. It appears that my old Core i5-3570 does not support hyperthreading and thus, the setting of hw.smt=0 is probably correct.

Is it possible that this conditional is just incorrect? Perhaps it should look like this:

if hasSMT {
   j *= 2
}

In other words, skip every other CPU when SMT is enabled.

cpu0: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz, 3392.86 MHz, 06-3a-09
cpu0: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu0: 256KB 64b/line 8-way L2 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 99MHz
cpu0: mwait min=64, max=64, C-substates=0.2.1.1, IBE
cpu1 at mainbus0: apid 2 (application processor)
cpu1: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz, 3392.32 MHz, 06-3a-09
cpu1: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu1: 256KB 64b/line 8-way L2 cache
cpu1: smt 0, core 1, package 0
cpu2 at mainbus0: apid 4 (application processor)
cpu2: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz, 3392.31 MHz, 06-3a-09
cpu2: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu2: 256KB 64b/line 8-way L2 cache
cpu2: smt 0, core 2, package 0
cpu3 at mainbus0: apid 6 (application processor)
cpu3: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz, 3392.31 MHz, 06-3a-09
cpu3: FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE,SSE3,PCLMUL,DTES64,MWAIT,DS-CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,x2APIC,POPCNT,DEADLINE,AES,XSAVE,AVX,F16C,RDRAND,NXE,RDTSCP,LONG,LAHF,PERF,ITSC,FSGSBASE,SMEP,ERMS,MD_CLEAR,IBRS,IBPB,STIBP,L1DF,SSBD,SENSOR,ARAT,XSAVEOPT,MELTDOWN
cpu3: 256KB 64b/line 8-way L2 cache
cpu3: smt 0, core 3, package 0

@omar-polo
Copy link
Contributor

omar-polo commented Feb 6, 2022

No, I think that check is necessary. For example, on my machine:

% sysctl hw.smt
hw.smt=0
% sysctl hw.ncpu
hw.ncpu=8
% sysctl hw.ncpuonline
hw.ncpuonline=4

then, looking at top(1)

CPU0:  1.0% user,  9.9% nice,  2.6% sys,  3.2% spin,  8.6% intr, 74.8% idle
CPU2:  2.6% user,  9.4% nice,  4.5% sys,  2.0% spin,  0.0% intr, 81.4% idle
CPU4:  1.3% user,  9.5% nice,  3.9% sys,  2.0% spin,  0.0% intr, 83.3% idle
CPU6:  0.9% user, 10.1% nice,  3.2% sys,  1.8% spin,  0.0% intr, 83.9% idle

since we retrieve the number of cpus using hw.ncpuonline (which return 4 in my case) the j *= 2 ensures that we retrieve the info for cpu number 0, 2, 4, and 6 (i.e. the real CPUs that are online)

AFAICS it still works correctly here:

% cat foo.go
package main

import (
        "fmt"
        "log"

        "github.com/shirou/gopsutil/v3/cpu"
)

func stats(percpu bool) {
        times, err := cpu.Times(percpu)
        if err != nil {
                log.Fatal(err)
        }
        for k, v := range times {
                fmt.Printf("cpu %d: %+v\n", k, v)
        }
}

func main() {
        stats(true)
        fmt.Println("----")
        stats(false)
}
% go run foo.go
cpu 0: {"cpu":"cpu0","user":849.4,"system":8691.5,"idle":0.0,"nice":0.0,"iowait":0.0,"irq":2266.5,"softirq":0.0,"steal":0.0,"guest":0.0,"guestNice":0.0}
cpu 1: {"cpu":"cpu2","user":2220.0,"system":8177.6,"idle":0.0,"nice":0.0,"iowait":0.0,"irq":3939.6,"softirq":0.0,"steal":0.0,"guest":0.0,"guestNice":0.0}
cpu 2: {"cpu":"cpu4","user":1102.6,"system":8293.4,"idle":0.0,"nice":0.0,"iowait":0.0,"irq":3374.7,"softirq":0.0,"steal":0.0,"guest":0.0,"guestNice":0.0}
cpu 3: {"cpu":"cpu6","user":802.9,"system":8812.8,"idle":0.0,"nice":0.0,"iowait":0.0,"irq":2815.7,"softirq":0.0,"steal":0.0,"guest":0.0,"guestNice":0.0}
----
cpu 0: {"cpu":"cpu-total","user":1243.7,"system":8493.8,"idle":0.0,"nice":0.0,"iowait":0.0,"irq":3099.1,"softirq":0.0,"steal":0.0,"guest":0.0,"guestNice":0.0}

However, it's been a while since I worked on that code, and at glance there are a couple of things I don't like/shouldn't be needed anymore. I'll check out your pr and do some test soon, thanks :)

@chrissnell
Copy link
Author

chrissnell commented Feb 6, 2022

OK, I think I see what's going on here. @omar-polo both you and I have four physical cores and SMT = 0. The difference is that your hw.ncpu reports 8, while mine reports 4. However, the code isn't checking ncpu, it's checking `ncpuonline:

https://github.com/shirou/gopsutil/blob/master/cpu/cpu_openbsd.go#L164-L168

The problem is that the loop is based off ncpuonline, which is the same for both of us. We should simply divide ncpu by ncpuonline and set the iterator to that value. For you, this means j = j*2 and for me, it's just j.

Sound good? I can make a fix.

@chrissnell
Copy link
Author

chrissnell commented Feb 6, 2022

@omar-polo if you set sysctl hw.smt=1, does that change the value for ncpu/ncpuonline at all?

@omar-polo
Copy link
Contributor

@chrissnell please try #1244, I think I've got it right this time :)

The problem is in the assumption that if SMT is disabled (the default) then only even CPUs are online. I've changed my assumption to mirror what top does: https://github.com/openbsd/src/blob/master/usr.bin/top/machine.c#L258-L269

it uses KERN_CPUSTATS which returns a struct cpustats that is able to tell us if that core is online or not. I've tested it on both my i386 and amd64 and it seems to work.

Anyway, to reply to your question, yes, changing hw.stm alters the output from hw.ncpuonline (which is raised from 4 to 8 when smt is enabled)

@chrissnell-okta
Copy link

Sorry, I'm struggling with this: does anyone know how to replace the shirou/gopsutil dependency with the @omar-polo fork?

I tried a few things with go.mod but it doesn't seem to work. I'm attempting to build telegraf with the forked repo.

@omar-polo
Copy link
Contributor

omar-polo commented Feb 9, 2022

I don't use telegraf so I can't confirm, but appending

replace github.com/shirou/gopsutil/v3 => ./gopsutil

to telegraf' go.mod and cloning my fork in ./gopsutil (in the root directory of telegraf) should work.

EDIT: i'm not sure if /v3 is needed after github.com/shirou/gopsutil... sorry, I don't play with replace very ofter

@chrissnell-okta
Copy link

Sadly, the replace did not work. Tried a lot of things but it continues to use this repo, not your fork.

Can we just merge this as-is?

@shirou
Copy link
Owner

shirou commented Feb 24, 2022

I think #1244 includes this PR. So we can close this PR. Please let me know if I am wrong.

@omar-polo
Copy link
Contributor

Yes, my pr has the proper fix for this issue. Sorry for the delay, I've been a bit busy, I'll look into closing the final points for #1244 right now :)

@Lomanic
Copy link
Collaborator

Lomanic commented Apr 1, 2022

Closing as per discussion and #1244

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants