
ebpf: Update CPU freq calculation to improve performance #427

Merged
Merged 3 commits into main from the cpu-freq branch on Dec 14, 2022

Conversation

@marceloamaral (Collaborator) commented Nov 29, 2022

This PR introduces a ~2.4x performance improvement in Kepler's CPU usage, as shown by CPU profiling.

We have been discussing the performance regression of Kepler in #365, with further analysis in #391.

The performance analysis showed that the garbage collector was the source of the performance degradation.
After some evaluation, we identified that the functions calculating and collecting the CPU frequency were among the most expensive calls.

To improve the performance, this PR first removes the function that calculates the CPU frequency per container and reimplements it using BPF.
Second, the node CPU frequency is also collected via BPF; when BPF is not available, Kepler can still collect it by reading kernel sysfs files, although that is a very expensive operation.
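To illustrate the sysfs fallback path, here is a minimal Go sketch (illustrative only, not the exact code in this PR) that reads scaling_cur_freq, which the kernel reports in kHz; having to open and parse one file per CPU policy on every collection cycle is what makes this path comparatively expensive:

```go
// Minimal sketch of a sysfs-based CPU frequency fallback (illustrative only,
// not the exact implementation in this PR). scaling_cur_freq reports kHz.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readCPUFreqs returns a map of CPU id to current frequency in kHz, read from
// /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq.
func readCPUFreqs() (map[int]uint64, error) {
	freqs := map[int]uint64{}
	policies, err := filepath.Glob("/sys/devices/system/cpu/cpufreq/policy[0-9]*")
	if err != nil {
		return nil, err
	}
	for _, p := range policies {
		cpu, err := strconv.Atoi(strings.TrimPrefix(filepath.Base(p), "policy"))
		if err != nil {
			continue
		}
		data, err := os.ReadFile(filepath.Join(p, "scaling_cur_freq"))
		if err != nil {
			continue // CPU may be offline or the file unavailable
		}
		khz, err := strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
		if err != nil {
			continue
		}
		freqs[cpu] = khz
	}
	return freqs, nil
}

func main() {
	freqs, err := readCPUFreqs()
	if err != nil {
		panic(err)
	}
	fmt.Println(freqs)
}
```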


Before the update, the pprof CPU profiling was showing a total of 4.89s of CPU utilization.

[screenshot: pprof CPU profile before the update]

After the update, the pprof CPU profiling was showing a total of 1.96s of CPU utilization.

[screenshot: pprof CPU profile after the update]

In summary:

  • Before the update, Kepler was using ~12% CPU; after, it dropped to ~1.5%.
  • The pprof CPU profile showed ~2.62s in runtime.cgocall before the update; after, it dropped to ~1.21s.

@marceloamaral (Collaborator, Author)

Example of the CPU frequency in the log:

For the per-process CPU frequency:

I1129 10:11:43.393591       1 container_hc_collector.go:109] system_processes, comm:kworker/44:0, freq:1000000, inst:35614
I1129 10:11:43.393684       1 container_hc_collector.go:109] system_processes, comm:migration/48, freq:2300000, inst:88109
I1129 10:11:43.393734       1 container_hc_collector.go:109] system_processes, comm:migration/4, freq:1000000, inst:218511
I1129 10:11:43.393770       1 container_hc_collector.go:109] system_processes, comm:kworker/88:0, freq:1000000, inst:49713
I1129 10:11:43.393805       1 container_hc_collector.go:109] system_processes, comm:kworker/73:0, freq:1000000, inst:25600
I1129 10:11:43.393841       1 container_hc_collector.go:109] system_processes, comm:kworker/59:2, freq:1000000, inst:55641
I1129 10:11:43.393924       1 container_hc_collector.go:109] system_processes, comm:containerd-shim, freq:0, inst:2639840
I1129 10:11:43.393964       1 container_hc_collector.go:109] system_processes, comm:kworker/112:0, freq:1000000, inst:20193
I1129 10:11:43.394006       1 container_hc_collector.go:109] system_processes, comm:ksoftirqd/0, freq:1100000, inst:1203258
I1129 10:11:43.397728       1 container_hc_collector.go:109] 69ef7e11ba2f2a898bd30e248e8b722b03c6c21668e85bd152380afb4975bce3, comm:snmpd, freq:1000000, inst:6819013
I1129 10:11:43.397848       1 container_hc_collector.go:109] system_processes, comm:kworker/129:0, freq:1000000, inst:186264
I1129 10:11:43.397892       1 container_hc_collector.go:109] system_processes, comm:kworker/5:0, freq:0, inst:259759
I1129 10:11:43.397914       1 container_hc_collector.go:109] system_processes, comm:kworker/46:1, freq:1000000, inst:374557
I1129 10:11:43.397935       1 container_hc_collector.go:109] system_processes, comm:ksoftirqd/19, freq:1500000, inst:130141

For the node CPU frequency:

I1129 11:03:53.467639       1 node_energy_collector.go:83] c.NodeCPUFrequency map[0:1400000 1:3900000 2:1000000 3:1400000 4:1000000 5:1000000 6:1000000 7:1000000 8:1000000 9:1000000 10:1000000 11:1000000 12:1000000 13:1000000 14:1000000 15:1000000 16:1000000 17:1000000 18:1000000 19:1000000 20:3900000 21:1900000 22:1000000 23:1000000 24:1000000 25:2900000 26:1000000 27:1000000 28:1000000 29:1000000 30:3900000 31:1000000 32:3900000 33:1000000 34:1000000 35:1000000 36:1000000 37:1000000 38:1000000 39:1000000 40:2900000 41:1000000 42:1000000 43:1000000 44:1900000 45:1300000 46:2200000 47:2600000 48:1300000 49:1100000 50:1600000 51:2600000 52:1000000 53:1100000 54:1100000 55:1000000 56:1800000 57:1000000 58:1000000 59:3500000 60:1000000 61:1000000 62:1000000 63:1000000 64:1000000 65:1000000 66:1000000 67:1000000 68:1000000 69:1000000 70:1000000 71:1000000 72:1000000 73:1000000 74:1000000 75:1000000 76:1000000 77:1000000 78:1000000 79:1900000 80:1000000 81:1000000 82:1000000 83:1400000 84:1000000 85:1000000 86:1000000 87:1000000 88:1000000 89:1000000 90:1000000 91:1000000 92:1000000 93:1000000 94:1000000 95:1000000 96:1000000 97:1000000 98:3900000 99:1000000 100:1000000 101:1100000 102:1000000 103:1000000 104:1000000 105:1000000 106:1000000 107:1000000 108:1000000 109:1000000 110:3900000 111:1000000 112:1000000 113:1000000 114:1000000 115:1000000 116:1000000 117:1000000 118:1000000 119:1000000 120:2400000 121:1000000 122:3200000 123:1000000 124:1000000 125:1100000 126:1000000 127:1000000 128:1300000 129:1000000 130:3500000 131:1100000 132:3900000 133:3700000 134:3200000 135:1000000 136:2100000 137:1000000 138:1000000 139:1000000 140:1000000 141:1000000 142:2100000 143:1000000 144:1000000 145:2200000 146:1000000 147:1000000 148:1000000 149:1000000 150:1000000 151:1000000 152:1000000 153:1000000 154:1000000 155:1000000 156:1000000 157:1000000 158:1000000 159:1000000]

@rootfs (Contributor) commented Nov 29, 2022

The tracepoint doesn't work on my setup. To narrow down the scenarios, @marceloamaral can you try the following and see if you can capture any CPU frequency data?

# perf list 'power:cpu_frequency'

List of pre-defined events (to be used in -e):

  power:cpu_frequency                                [Tracepoint event]

# perf record -e power:cpu_frequency -a sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.943 MB perf.data ]
# perf script

@rootfs (Contributor) commented Nov 30, 2022

@marceloamaral Since we already read per-core CPU cycles from the PMU, we can convert the cycles into frequency.

@marceloamaral (Collaborator, Author)

@rootfs I also do not see data using perf:

# perf list 'power:cpu_frequency'

List of pre-defined events (to be used in -e):

Metric Groups:

# perf record -e power:cpu_frequency -a sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.538 MB perf.data ]
# perf script

I think I cannot see it because this tracepoint is disabled for perf:

# cat /sys/kernel/debug/tracing/events/power/cpu_frequency/enable
0

But the BPF code can get the CPU frequency.

Can you try running the code from this PR?

@marceloamaral (Collaborator, Author)

> Since we have per core CPU cycles reading from PMU, we can convert the cycles into frequency

Yes, if we can get the reference cycles: freq = cycles/ref_cycles.
Some CPUs have ref_cycles = 100MHz, but it might change and I did not find how to discover it.

But the limitation of using cycles is that an environment without a PMU will not be able to collect the frequency.
The tracepoint is a more generic solution.
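As a rough sketch of the arithmetic behind this (assuming the reference counter ticks at a fixed base/reference frequency; the function and parameter names are illustrative, not Kepler's):

```go
// Rough sketch of the cycles/ref_cycles frequency estimate (illustrative,
// assuming the reference counter ticks at the CPU's base/reference frequency).
package main

import "fmt"

// estimateFreqKHz estimates the average CPU frequency over a sampling window.
// cycles counts actual CPU cycles, refCycles counts reference cycles, and
// baseFreqKHz is the frequency of the reference clock in kHz.
func estimateFreqKHz(cycles, refCycles, baseFreqKHz uint64) uint64 {
	if refCycles == 0 {
		return 0
	}
	return cycles * baseFreqKHz / refCycles
}

func main() {
	// Example: 2.3e9 cycles vs 1.0e9 reference cycles with a 1 GHz reference
	// clock -> the CPU averaged roughly 2.3 GHz during the window.
	fmt.Println(estimateFreqKHz(2_300_000_000, 1_000_000_000, 1_000_000)) // 2300000 kHz
}
```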

@rootfs (Contributor) commented Nov 30, 2022

@marceloamaral the frequency is always 0 with the patch on my physical machine:

[root@rhel kepler]# patch -p1 < 427.diff 
patching file bpfassets/perf_event/perf_event.c
patching file pkg/bpfassets/attacher/bcc_attacher.go
patching file pkg/bpfassets/attacher/bcc_attacher_stub.go
patching file pkg/bpfassets/perf_event_bindata.go
patching file pkg/collector/container_hc_collector.go
patching file pkg/collector/metric/container_metric.go
patching file pkg/collector/metric/utils.go
patching file pkg/collector/metric/utils_test.go
patching file pkg/collector/metric_collector.go
patching file pkg/collector/node_energy_collector.go
patching file pkg/collector/prometheus_collector.go
patching file pkg/power/acpi/acpi.go
[root@rhel kepler]# make _build_local
gofmt -e -d -s -l -w pkg/ cmd/
[root@rhel kepler]# echo 1 > /sys/kernel/debug/tracing/events/power/cpu_frequency/enable
[root@rhel kepler]# _output/bin/_/kepler -v 5 2>&1 |grep avg
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00
        avgFreq: 0.00

@marceloamaral (Collaborator, Author)

@rootfs I found why the CPU frequency is 0.
According to the bcc book:

> When the system is set to the performance governor, this tool shows nothing as there are
> no more frequency changes to instrument: the CPUs are pinned at the highest frequency.

But note that setting the governor policy to performance does not mean that the CPU frequency does not change; it is just not changed via the governor policy, something else changes it.

# for i in $(seq 0 5) ; do  cat /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq; sleep 1; done
1577682
3396041
1030914
1207322
1013511

I will continue investigating how to implement this via BPF.

@marceloamaral (Collaborator, Author)

I figured out how to calculate CPU frequency using hardware counters.
As I mentioned before, we need the reference cycles and base frequency of the CPU, and I found out where this information is.

There are some differences in the CPU frequency measured by the three approaches that we have:

cpu - tracepoint - cycle/ref_cycle - acpi-sysfs
0 - 1000 - 1005 - 1003
1 - 1000 - 999 - 1001
2 - 1000 - 1000 - 1000
3 - 1000 - 1000 - 1000
4 - 1000 - 1003 - 1000
5 - 1000 - 1000 - 1050
6 - 1000 - 1000 - 1005
7 - 1000 - 1000 - 1020
8 - 1050 - 2231 - 1000
9 - 1150 - 999 - 1020
cpu - tracepoint - cycle/ref_cycle - acpi-sysfs
0 - 1000 - 1000 - 1000
1 - 1000 - 1001 - 1000
2 - 1000 - 1000 - 1000
3 - 1000 - 1009 - 1000
4 - 1000 - 1000 - 1026
5 - 2450 - 1873 - 1000
6 - 1000 - 1002 - 1000
7 - 1000 - 1001 - 1000
8 - 1000 - 1002 - 999
9 - 1037 - 998 - 1011
cpu - tracepoint - cycle/ref_cycle - acpi-sysfs
0 - 1000 - 999 - 1001
1 - 1000 - 1030 - 1170
2 - 1000 - 997 - 1000
3 - 1000 - 1004 - 1000
4 - 1000 - 1009 - 1000
5 - 1000 - 1005 - 1000
6 - 1000 - 1000 - 1000
7 - 1000 - 1002 - 1000
8 - 1003 - 1003 - 1000
9 - 1000 - 1000 - 1000
cpu - tracepoint - cycle/ref_cycle - acpi-sysfs
0 - 1000 - 1001 - 1000
1 - 1000 - 1015 - 1000
2 - 1000 - 991 - 1000
3 - 1612 - 2299 - 1000
4 - 1000 - 1000 - 1000
5 - 1000 - 1203 - 1000
6 - 1000 - 1005 - 1000
7 - 1000 - 1006 - 1034
8 - 1000 - 1000 - 1000
9 - 1000 - 1032 - 1112

The tracepoint and acpi-sysfs values come from caches in the kernel, while the cycles/ref-cycles values are calculated from hardware counters. Note that the tracepoint and acpi-sysfs caches may miss some hardware updates (e.g. from the BIOS energy policy).
Additionally, the tracepoint and acpi-sysfs have different caches and are updated from different kernel calls. For example, tracepoint updates come from the CPU frequency governor policy and are never issued if the policy is performance.
Given this, I would say that cycles/ref-cycles reports the most reliable results.

I will update this PR to use cycles/ref-cycles, but using more counters implies that the kernel will multiplex them, which requires applying normalization to the counters (see here).

@marceloamaral (Collaborator, Author)

@rootfs It finally worked.

The main challenge was adding support for hardware counters with multiplexing: the bpf_perf_event_read_value function only works when using kprobes, and we were using tracepoints.

Why do we need to normalize the counters:

> PMUs have a limited number of hardware counters (depending on processor version). If we ask for more
> events than counters, the kernel starts time multiplexing and extrapolating the results from the sampling
> window to the entire lifetime of the process.

Therefore, normalizing here means de-extrapolating the results to expose the actual values.
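For illustration, the conventional multiplexing correction scales the raw count by the ratio of time-enabled to time-running, both of which bpf_perf_event_read_value exposes; the exact normalization applied in this PR may apply the ratio differently, so treat this as a sketch of the idea rather than the implementation:

```go
// Sketch of the conventional multiplexing correction (illustrative; the exact
// normalization in this PR may differ). bpf_perf_event_read_value returns the
// raw counter together with the time the event was enabled and the time it was
// actually counting on the PMU.
package main

import "fmt"

// scaleCounter extrapolates a raw counter value to the full enabled period.
// When there is no multiplexing, enabled == running and the value is unchanged.
func scaleCounter(raw, enabledNs, runningNs uint64) uint64 {
	if runningNs == 0 {
		return 0
	}
	return uint64(float64(raw) * float64(enabledNs) / float64(runningNs))
}

func main() {
	// The event was scheduled on the PMU only half of the time, so the raw
	// count is doubled to estimate the value for the whole window.
	fmt.Println(scaleCounter(1_000_000, 200_000_000, 100_000_000)) // 2000000
}
```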

Also note that the documentation suggests always using bpf_perf_event_read_value:

> bpf_perf_event_read_value() is recommended over bpf_perf_event_read() in general. The latter has some ABI quirks where the error and the counter value are used as a return code (which is wrong to do since ranges may overlap). This problem is fixed with bpf_perf_event_read_value(), which at the same time provides more features over the bpf_perf_event_read() interface.

@marceloamaral (Collaborator, Author)

I have also fixed the CPU time collection.

time1 is the old (wrong) value and time2 is the new one:

pid=0,time1=3293,time2=2
pid=262910,time1=3,time2=3
pid=2625,time1=1,time2=0
pid=318796,time1=1,time2=1

The old code was updating the CPU time with the lookup_or_try_init function, but to update the value in the BPF map we need to use the update function.

@rootfs (Contributor) commented Dec 13, 2022

Tested on my setup; the magnitude of the frequency stats looks a bit off:

kepler_node_cpu_scaling_frequency_hertz{cpu="1",instance="rhel"} 3.970833e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="10",instance="rhel"} 3.916413e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="11",instance="rhel"} 3.615736e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="2",instance="rhel"} 3.967232e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="3",instance="rhel"} 3.980801e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="4",instance="rhel"} 3.976298e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="5",instance="rhel"} 3.765624e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="6",instance="rhel"} 3.246395e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="7",instance="rhel"} 3.978243e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="8",instance="rhel"} 3.953491e+06
kepler_node_cpu_scaling_frequency_hertz{cpu="9",instance="rhel"} 3.953808e+06
[root@rhel kepler]# cpupower frequency-info 
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 800 MHz - 4.10 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 800 MHz and 4.10 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 3.81 GHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: yes

@rootfs (Contributor) commented Dec 14, 2022

The frequency is in kHz, so the metric label needs an update. @marceloamaral would you follow up with another PR? Thanks.
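A hypothetical follow-up sketch (identifiers here are illustrative, not Kepler's actual code): convert the collected kHz value to Hz before exposing it, so the _hertz metric suffix stays accurate:

```go
// Hypothetical kHz -> Hz conversion before exposing the gauge (illustrative;
// the identifiers are not Kepler's actual ones).
package main

import "fmt"

func main() {
	freqKHz := 3970833.0     // value as currently exported, actually in kHz
	freqHz := freqKHz * 1000 // convert so the _hertz metric suffix is accurate
	fmt.Printf("kepler_node_cpu_scaling_frequency_hertz %.0f\n", freqHz) // ~3.97 GHz
}
```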

@rootfs rootfs merged commit b3e4847 into sustainable-computing-io:main Dec 14, 2022
@marceloamaral marceloamaral deleted the cpu-freq branch December 15, 2022 09:25