
ebpf: Update CPU freq calculation to improve performance #427

Merged
Merged 3 commits into main from the cpu-freq branch on Dec 14, 2022

Conversation

@marceloamaral (Collaborator) commented Nov 29, 2022

This PR introduces a ~2.4x performance improvement in Kepler's CPU usage, as shown by CPU profiling.

We have been discussing the performance regression of Kepler in #365, with further analysis in #391.

The performance analysis showed that the garbage collector was the source of the performance degradation.
After some evaluation, we identified that the functions calculating and collecting the CPU frequency were among the most expensive calls.

To improve the performance, this PR first removes the function that calculates the CPU frequency per container and reimplements it using BPF.
Second, the node CPU frequency is also collected via BPF; when BPF is not available, Kepler can still collect it by reading kernel sysfs files, although that is a very expensive operation.
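To illustrate the sysfs fallback path, here is a minimal Go sketch (illustrative only, not the exact code in this PR) that reads scaling_cur_freq, which the kernel reports in kHz; having to open and parse one file per CPU policy on every collection cycle is what makes this path comparatively expensive:

```go
// Minimal sketch of a sysfs-based CPU frequency fallback (illustrative only,
// not the exact implementation in this PR). scaling_cur_freq reports kHz.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readCPUFreqs returns a map of CPU id to current frequency in kHz, read from
// /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq.
func readCPUFreqs() (map[int]uint64, error) {
	freqs := map[int]uint64{}
	policies, err := filepath.Glob("/sys/devices/system/cpu/cpufreq/policy[0-9]*")
	if err != nil {
		return nil, err
	}
	for _, p := range policies {
		cpu, err := strconv.Atoi(strings.TrimPrefix(filepath.Base(p), "policy"))
		if err != nil {
			continue
		}
		data, err := os.ReadFile(filepath.Join(p, "scaling_cur_freq"))
		if err != nil {
			continue // CPU may be offline or the file unavailable
		}
		khz, err := strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
		if err != nil {
			continue
		}
		freqs[cpu] = khz
	}
	return freqs, nil
}

func main() {
	freqs, err := readCPUFreqs()
	if err != nil {
		panic(err)
	}
	fmt.Println(freqs)
}
```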


Before the update, the pprof CPU profiling was showing a total of 4.89s of CPU utilization.

[screenshot: pprof CPU profile before the update]

After the update, the pprof CPU profiling was showing a total of 1.96s of CPU utilization.

[screenshot: pprof CPU profile after the update]

In summary:

  • Before the update, Kepler was using ~12% CPU; after, it dropped to ~1.5%.
  • The pprof CPU profile showed ~2.62s in runtime.cgocall before the update; after, it dropped to ~1.21s.

@marceloamaral (Collaborator, Author)

Example of the CPU frequency in the log:

For the per-process CPU frequency:

I1129 10:11:43.393591       1 container_hc_collector.go:109] system_processes, comm:kworker/44:0, freq:1000000, inst:35614
I1129 10:11:43.393684       1 container_hc_collector.go:109] system_processes, comm:migration/48, freq:2300000, inst:88109
I1129 10:11:43.393734       1 container_hc_collector.go:109] system_processes, comm:migration/4, freq:1000000, inst:218511
I1129 10:11:43.393770       1 container_hc_collector.go:109] system_processes, comm:kworker/88:0, freq:1000000, inst:49713
I1129 10:11:43.393805       1 container_hc_collector.go:109] system_processes, comm:kworker/73:0, freq:1000000, inst:25600
I1129 10:11:43.393841       1 container_hc_collector.go:109] system_processes, comm:kworker/59:2, freq:1000000, inst:55641
I1129 10:11:43.393924       1 container_hc_collector.go:109] system_processes, comm:containerd-shim, freq:0, inst:2639840
I1129 10:11:43.393964       1 container_hc_collector.go:109] system_processes, comm:kworker/112:0, freq:1000000, inst:20193
I1129 10:11:43.394006       1 container_hc_collector.go:109] system_processes, comm:ksoftirqd/0, freq:1100000, inst:1203258
I1129 10:11:43.397728       1 container_hc_collector.go:109] 69ef7e11ba2f2a898bd30e248e8b722b03c6c21668e85bd152380afb4975bce3, comm:snmpd, freq:1000000, inst:6819013
I1129 10:11:43.397848       1 container_hc_collector.go:109] system_processes, comm:kworker/129:0, freq:1000000, inst:186264
I1129 10:11:43.397892       1 container_hc_collector.go:109] system_processes, comm:kworker/5:0, freq:0, inst:259759
I1129 10:11:43.397914       1 container_hc_collector.go:109] system_processes, comm:kworker/46:1, freq:1000000, inst:374557
I1129 10:11:43.397935       1 container_hc_collector.go:109] system_processes, comm:ksoftirqd/19, freq:1500000, inst:130141

For the node CPU frequency:

I1129 11:03:53.467639       1 node_energy_collector.go:83] c.NodeCPUFrequency map[0:1400000 1:3900000 2:1000000 3:1400000 4:1000000 5:1000000 6:1000000 7:1000000 8:1000000 9:1000000 10:1000000 11:1000000 12:1000000 13:1000000 14:1000000 15:1000000 16:1000000 17:1000000 18:1000000 19:1000000 20:3900000 21:1900000 22:1000000 23:1000000 24:1000000 25:2900000 26:1000000 27:1000000 28:1000000 29:1000000 30:3900000 31:1000000 32:3900000 33:1000000 34:1000000 35:1000000 36:1000000 37:1000000 38:1000000 39:1000000 40:2900000 41:1000000 42:1000000 43:1000000 44:1900000 45:1300000 46:2200000 47:2600000 48:1300000 49:1100000 50:1600000 51:2600000 52:1000000 53:1100000 54:1100000 55:1000000 56:1800000 57:1000000 58:1000000 59:3500000 60:1000000 61:1000000 62:1000000 63:1000000 64:1000000 65:1000000 66:1000000 67:1000000 68:1000000 69:1000000 70:1000000 71:1000000 72:1000000 73:1000000 74:1000000 75:1000000 76:1000000 77:1000000 78:1000000 79:1900000 80:1000000 81:1000000 82:1000000 83:1400000 84:1000000 85:1000000 86:1000000 87:1000000 88:1000000 89:1000000 90:1000000 91:1000000 92:1000000 93:1000000 94:1000000 95:1000000 96:1000000 97:1000000 98:3900000 99:1000000 100:1000000 101:1100000 102:1000000 103:1000000 104:1000000 105:1000000 106:1000000 107:1000000 108:1000000 109:1000000 110:3900000 111:1000000 112:1000000 113:1000000 114:1000000 115:1000000 116:1000000 117:1000000 118:1000000 119:1000000 120:2400000 121:1000000 122:3200000 123:1000000 124:1000000 125:1100000 126:1000000 127:1000000 128:1300000 129:1000000 130:3500000 131:1100000 132:3900000 133:3700000 134:3200000 135:1000000 136:2100000 137:1000000 138:1000000 139:1000000 140:1000000 141:1000000 142:2100000 143:1000000 144:1000000 145:2200000 146:1000000 147:1000000 148:1000000 149:1000000 150:1000000 151:1000000 152:1000000 153:1000000 154:1000000 155:1000000 156:1000000 157:1000000 158:1000000 159:1000000]

@rootfs (Contributor) commented Nov 29, 2022

The tracepoint doesn't work on my setup. To narrow down the scenarios, @marceloamaral can you try the following and see if you can capture any CPU frequency data?

# perf list 'power:cpu_frequency'

List of pre-defined events (to be used in -e):

  power:cpu_frequency                                [Tracepoint event]

# perf record -e power:cpu_frequency -a sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.943 MB perf.data ]
# perf script

@rootfs (Contributor) commented Nov 30, 2022

@marceloamaral Since we already read per-core CPU cycles from the PMU, we can convert the cycles into frequency.

@marceloamaral (Collaborator, Author)

@rootfs I also do not see data using perf:

# perf list 'power:cpu_frequency'

List of pre-defined events (to be used in -e):

Metric Groups:

# perf record -e power:cpu_frequency -a sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.538 MB perf.data ]
# perf script

I think I cannot see it because this tracepoint is disabled for perf:

# cat /sys/kernel/debug/tracing/events/power/cpu_frequency/enable
0

But the BPF code can get the CPU frequency.

Can you try running the code from this PR?

@marceloamaral (Collaborator, Author)

> Since we have per core CPU cycles reading from PMU, we can convert the cycles into frequency

Yes, if we can get the reference cycles: freq = cycles/ref_cycles.
Some CPUs have ref_cycles = 100MHz, but it might change and I did not find how to discover it.

But the limitation of using cycles is that an environment without a PMU will not be able to collect the frequency.
The tracepoint is a more generic solution.
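As a rough sketch of the arithmetic behind this (assuming the reference counter ticks at a fixed base/reference frequency; the function and parameter names are illustrative, not Kepler's):

```go
// Rough sketch of the cycles/ref_cycles frequency estimate (illustrative,
// assuming the reference counter ticks at the CPU's base/reference frequency).
package main

import "fmt"

// estimateFreqKHz estimates the average CPU frequency over a sampling window.
// cycles counts actual CPU cycles, refCycles counts reference cycles, and
// baseFreqKHz is the frequency of the reference clock in kHz.
func estimateFreqKHz(cycles, refCycles, baseFreqKHz uint64) uint64 {
	if refCycles == 0 {
		return 0
	}
	return cycles * baseFreqKHz / refCycles
}

func main() {
	// Example: 2.3e9 cycles vs 1.0e9 reference cycles with a 1 GHz reference
	// clock -> the CPU averaged roughly 2.3 GHz during the window.
	fmt.Println(estimateFreqKHz(2_300_000_000, 1_000_000_000, 1_000_000)) // 2300000 kHz
}
```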

@rootfs (Contributor) commented Nov 30, 2022

@marceloamaral the frequency is always 0 with the patch on my physical machine:

[root@rhel kepler]# patch -p1 < 427.diff 
patching file bpfassets/perf_event/perf_event.c
patching file pkg/bpfassets/attacher/bcc_attacher.go
patching file pkg/bpfassets/attacher/bcc_attacher_stub.go
patching file pkg/bpfassets/perf_event_bindata.go
patching file pkg/collector/container_hc_collector.go
patching file pkg/collector/metric/container_metric.go
patching file pkg/collector/metric/utils.go
patching file pkg/collector/metric/utils_test.go
patching file pkg/collector/metric_collector.go
patching file pkg/collector/node_energy_collector.go
patching file pkg/collector/prometheus_collector.go
patching file pkg/power/acpi/acpi.go
[root@rhel kepler]# make _build_local
gofmt -e -d -s -l -w pkg/ cmd/
[root@rhel kepler]# echo 1 > /sys/kernel/debug/tracing/events/power/cpu_frequency/enable
[root@rhel kepler]# _output/bin/_/kepler -v 5 2>&1 |grep avg
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00

@marceloamaral (Collaborator, Author)

@rootfs I found why the CPU frequency is 0.
According to the bcc book:

> When the system is set to the performance governor, this tool shows nothing as there are
> no more frequency changes to instrument: the CPUs are pinned at the highest frequency.

But note that setting the governor policy to performance does not mean that the CPU frequency does not change; it is just not changed via the governor policy, something else changes it.

# for i in $(seq 0 5) ; do  cat /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq; sleep 1; done
1577682
3396041
1030914
1207322
1013511

I will continue investigating how to implement this via BPF.

@marceloamaral (Collaborator, Author)

I figured out how to calculate CPU frequency using hardware counters.
As I mentioned before, we need the reference cycles and base frequency of the CPU, and I found out where this information is.

There are some differences in the CPU frequency measured by the three approaches that we have:

cpu - tracepoint - cycle/ref_cycle - acpi-sysfs
0 - 1000 - 1005 - 1003
1 - 1000 - 999 - 1001
2 - 1000 - 1000 - 1000
3 - 1000 - 1000 - 1000
4 - 1000 - 1003 - 1000
5 - 1000 - 1000 - 1050
6 - 1000 - 1000 - 1005
7 - 1000 - 1000 - 1020
8 - 1050 - 2231 - 1000
9 - 1150 - 999 - 1020
cpu - tracepoint - cycle/ref_cycle - acpi-sysfs
0 - 1000 - 1000 - 1000
1 - 1000 - 1001 - 1000
2 - 1000 - 1000 - 1000
3 - 1000 - 1009 - 1000
4 - 1000 - 1000 - 1026
5 - 2450 - 1873 - 1000
6 - 1000 - 1002 - 1000
7 - 1000 - 1001 - 1000
8 - 1000 - 1002 - 999
9 - 1037 - 998 - 1011
cpu - tracepoint - cycle/ref_cycle - acpi-sysfs
0 - 1000 - 999 - 1001
1 - 1000 - 1030 - 1170
2 - 1000 - 997 - 1000
3 - 1000 - 1004 - 1000
4 - 1000 - 1009 - 1000
5 - 1000 - 1005 - 1000
6 - 1000 - 1000 - 1000
7 - 1000 - 1002 - 1000
8 - 1003 - 1003 - 1000
9 - 1000 - 1000 - 1000
cpu - tracepoint - cycle/ref_cycle - acpi-sysfs
0 - 1000 - 1001 - 1000
1 - 1000 - 1015 - 1000
2 - 1000 - 991 - 1000
3 - 1612 - 2299 - 1000
4 - 1000 - 1000 - 1000
5 - 1000 - 1203 - 1000
6 - 1000 - 1005 - 1000
7 - 1000 - 1006 - 1034
8 - 1000 - 1000 - 1000
9 - 1000 - 1032 - 1112

The tracepoint and acpi-sysfs values come from caches in the kernel, while the cycles/ref-cycles values are calculated from hardware counters. Note that the tracepoint and acpi-sysfs caches may miss some hardware updates (e.g. from the BIOS energy policy).
Additionally, the tracepoint and acpi-sysfs have different caches and are updated from different kernel calls. For example, tracepoint updates come from the CPU frequency governor policy and are never issued if the policy is performance.
Given this, I would say that cycles/ref-cycles reports the most reliable results.

I will update this PR to use cycles/ref-cycles, but using more counters implies that the kernel will multiplex them, which requires applying normalization to the counters (see here).

@marceloamaral (Collaborator, Author)

@rootfs It finally worked.

The main challenge was adding support for hardware counters with multiplexing: the bpf_perf_event_read_value function only works when using kprobes, and we were using tracepoints.

Why do we need to normalize the counters:

> PMUs have a limited number of hardware counters (depending on processor version). If we ask for more
> events than counters, the kernel starts time multiplexing and extrapolating the results from the sampling
> window to the entire lifetime of the process.

Therefore, normalizing here means de-extrapolating the results to expose the actual values.
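For illustration, the conventional multiplexing correction scales the raw count by the ratio of time-enabled to time-running, both of which bpf_perf_event_read_value exposes; the exact normalization applied in this PR may apply the ratio differently, so treat this as a sketch of the idea rather than the implementation:

```go
// Sketch of the conventional multiplexing correction (illustrative; the exact
// normalization in this PR may differ). bpf_perf_event_read_value returns the
// raw counter together with the time the event was enabled and the time it was
// actually counting on the PMU.
package main

import "fmt"

// scaleCounter extrapolates a raw counter value to the full enabled period.
// When there is no multiplexing, enabled == running and the value is unchanged.
func scaleCounter(raw, enabledNs, runningNs uint64) uint64 {
	if runningNs == 0 {
		return 0
	}
	return uint64(float64(raw) * float64(enabledNs) / float64(runningNs))
}

func main() {
	// The event was scheduled on the PMU only half of the time, so the raw
	// count is doubled to estimate the value for the whole window.
	fmt.Println(scaleCounter(1_000_000, 200_000_000, 100_000_000)) // 2000000
}
```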

Also note that the documentation suggests always using bpf_perf_event_read_value:

> bpf_perf_event_read_value() is recommended over bpf_perf_event_read() in general. The latter has some ABI quirks where the error and the counter value are used as a return code (which is wrong to do since ranges may overlap). This problem is fixed with bpf_perf_event_read_value(), which at the same time provides more features over the bpf_perf_event_read() interface.

@marceloamaral (Collaborator, Author)

I have also fixed the CPU time collection.

time1 is the old (wrong) value and time2 is the new one:

pid=0,time1=3293,time2=2
pid=262910,time1=3,time2=3
pid=2625,time1=1,time2=0
pid=318796,time1=1,time2=1

The old code was updating the CPU time with the lookup_or_try_init function, but to update the value in the BPF map we need to use the update function.

@rootfs (Contributor) commented Dec 13, 2022

Tested on my setup; the magnitude of the frequency stats looks a bit off:

kepler_node_cpu_scaling_frequency_hertz{cpu="1",instance="rhel"} 3.970833e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="10",instance="rhel"} 3.916413e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="11",instance="rhel"} 3.615736e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="2",instance="rhel"} 3.967232e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="3",instance="rhel"} 3.980801e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="4",instance="rhel"} 3.976298e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="5",instance="rhel"} 3.765624e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="6",instance="rhel"} 3.246395e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="7",instance="rhel"} 3.978243e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="8",instance="rhel"} 3.953491e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="9",instance="rhel"} 3.953808e+06
[root@rhel kepler]# cpupower frequency-info 
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 800 MHz - 4.10 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 800 MHz and 4.10 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 3.81 GHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: yes

@rootfs (Contributor) commented Dec 14, 2022

The frequency is in kHz, so the metric label needs an update. @marceloamaral would you follow up with another PR? Thanks.
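A hypothetical follow-up sketch (identifiers here are illustrative, not Kepler's actual code): convert the collected kHz value to Hz before exposing it, so the _hertz metric suffix stays accurate:

```go
// Hypothetical kHz -> Hz conversion before exposing the gauge (illustrative;
// the identifiers are not Kepler's actual ones).
package main

import "fmt"

func main() {
	freqKHz := 3970833.0     // value as currently exported, actually in kHz
	freqHz := freqKHz * 1000 // convert so the _hertz metric suffix is accurate
	fmt.Printf("kepler_node_cpu_scaling_frequency_hertz %.0f\n", freqHz) // ~3.97 GHz
}
```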

@rootfs rootfs merged commit b3e4847 into sustainable-computing-io:main Dec 14, 2022
@marceloamaral marceloamaral deleted the cpu-freq branch December 15, 2022 09:25