Kepler general energy metrics vs energy tracing metrics #365

Closed
williamcaban opened this issue Nov 7, 2022 · 23 comments
Labels
high-priority (Issues that are part of the next release but not mandatory), kind/feature (New feature or request), wontfix (This will not be worked on)

Comments

@williamcaban
Collaborator

Is your feature request related to a problem? Please describe.

The recent changes in the kind of metrics exposed increase the visibility into how much energy is consumed by which elements of a process/Pod/container. This kind of metric is valuable for application owners looking to optimize the energy consumption of their application or service stack. At the same time, the granularity of these metrics makes it extremely difficult for functionalities that require a quick understanding of the system (e.g. schedulers, multi-cluster dashboards, etc.) and considerably increases the load on the Prometheus stack.

Describe the solution you'd like

I'd like to propose that Kepler have a flag that can be used to influence the type of metrics it exposes on the metrics endpoint. Basically, one mode enables Kepler to serve general use cases (e.g. report the top-5 energy consumers by namespace, pod, or app; report the software+OS-view energy consumption by node, etc.), and another mode serves developers who want to do low-level tracing of the energy consumption of their apps.

The mode for the general use cases should provide a single, simple metric per container or Pod that can be used for user-facing reporting and executive-level reports.
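
A minimal sketch of what such a switch could look like; the flag name and the two mode values below are hypothetical and do not correspond to an existing Kepler option:

	package main

	import (
		"flag"
		"log"
	)

	// Hypothetical granularity levels for the exposed metrics.
	const (
		MetricsLevelAggregated = "aggregated" // one energy metric per container/Pod
		MetricsLevelTracing    = "tracing"    // per-component metrics for low-level tracing
	)

	func main() {
		// Hypothetical flag; Kepler does not currently document this option.
		metricsLevel := flag.String("metrics-level", MetricsLevelAggregated,
			"granularity of exported metrics: 'aggregated' or 'tracing'")
		flag.Parse()

		switch *metricsLevel {
		case MetricsLevelAggregated, MetricsLevelTracing:
			log.Printf("exporting %s metrics", *metricsLevel)
		default:
			log.Fatalf("unknown metrics level %q", *metricsLevel)
		}
	}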

Describe alternatives you've considered
With the current granular metrics:

  • We have seen a significant impact on Prometheus utilization (far too many metrics in a very short time, even on micro environments)
  • It has become very challenging to use the metrics for the general use cases, as the query formula gets very long in no time, and the system takes several seconds to process it
  • For comparison, when loading the metrics into DataFrames to run analytics in a Jupyter notebook for an 8-node cluster with ~500 Pods, loading a 2 hr dataset with 1 minute resolution: the current granular metrics take 2 min 17 s to load (as reported by %timeit), whereas the previous aggregated metrics for the same cluster setup, dataset, and resolution loaded in less than 1 second.

Additional context
The general use cases should also cover #301

@rootfs
Contributor

rootfs commented Nov 7, 2022

thank you @williamcaban for the investigation and proposal

  • We have seen a significant impact on Prometheus utilization (far too many metrics in a very short time, even on micro environments)
  • For comparison, when loading the metrics into DataFrames to run analytics in a Jupyter notebook for an 8-node cluster with ~500 Pods, loading a 2 hr dataset with 1 minute resolution: the current granular metrics take 2 min 17 s to load (as reported by %timeit), whereas the previous aggregated metrics for the same cluster setup, dataset, and resolution loaded in less than 1 second.

This is surely a concern for us. If Prometheus cannot support a small setup, the metrics scheme needs a scalability study.

It has become very challenging to use the metrics for the general use cases, as the query formula gets very long in no time, and the system takes several seconds to process it

This can be investigated in two ways: exposing the general use cases through dedicated metrics, or supporting them via Prometheus queries. Again, these approaches should be evaluated on usability and scalability.

@rootfs
Contributor

rootfs commented Nov 7, 2022

cc @marceloamaral

@rootfs
Contributor

rootfs commented Nov 7, 2022

Let's create a process for the metrics schema:

  • Support general use cases: current and aggregate power usage by node, pod, and potentially namespace.
  • Focus on scalability when evaluating design choices: what happens to Kepler, Prometheus, or the Grafana dashboard when the number of pods and nodes increases?
  • Add an integration test quantifying metrics query latency.

@sustainable-computing-io/reviewer

@marceloamaral
Collaborator

Right, I do agree that we need a metric that sums the energy consumption of all sources:
kepler_container_joules_total, similar to the old pod_curr_energy_millijoule.

This metric will represent container_energy = package + ucore + dram + gpu + other_components
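
A minimal sketch of the aggregation being proposed; the struct and field names are illustrative placeholders, not Kepler's actual types:

	package main

	import "fmt"

	// ComponentEnergy holds per-component energy readings in millijoules.
	// Field names are illustrative, not Kepler's actual struct.
	type ComponentEnergy struct {
		PkgMilliJ    uint64
		UncoreMilliJ uint64
		DRAMMilliJ   uint64
		GPUMilliJ    uint64
		OtherMilliJ  uint64
	}

	// TotalJoules sums all components into the single value that a
	// kepler_container_joules_total-style metric would expose.
	func (c ComponentEnergy) TotalJoules() float64 {
		totalMilliJ := c.PkgMilliJ + c.UncoreMilliJ + c.DRAMMilliJ + c.GPUMilliJ + c.OtherMilliJ
		return float64(totalMilliJ) / 1000.0
	}

	func main() {
		e := ComponentEnergy{PkgMilliJ: 1200, UncoreMilliJ: 300, DRAMMilliJ: 450, OtherMilliJ: 50}
		fmt.Printf("container total: %.3f J\n", e.TotalJoules())
	}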

@marceloamaral
Collaborator

Regarding the number of metrics: we have some ideas about disabling some metrics via configuration, so we can do that in the future.

However, keep in mind that node-exporter, cAdvisor, and other exporters export a lot more metrics than we do without major problems.
But of course we need to do some analysis on this.

@rootfs
Contributor

rootfs commented Nov 8, 2022

thanks @williamcaban for sharing the setup with me.
This setup is quite beefy:

64 cores Intel(R) Xeon(R) Gold 6338N CPU @ 2.20GHz

It has the latest metrics: kepler_pod_energy.

I ran some quick tests and found quite a few issues:

  • current CPU usage is way too high, it is over 150%
  • latency is also high, 4s vs 1s before refactoring

@marceloamaral
Collaborator

@williamcaban regarding the large set of Prometheus metrics being too expensive to process, I created a new metric in PR #370

@marceloamaral
Collaborator

@rootfs there is definitely something wrong with this deployment.
First, why does it have 3 kepler instances?

In my environment the kepler CPU utilization is not high....

@rootfs
Contributor

rootfs commented Nov 8, 2022

the top output was from a remote tmux console, which had rendering issues. This is top from inside the kepler pod:
(screenshot)

@marceloamaral
Collaborator

The empty cgroup values might be related to this fix (PR#377)

@rootfs
Contributor

rootfs commented Nov 11, 2022

@williamcaban @marceloamaral Here are my controlled test results.
https://gist.github.com/rootfs/2bb2c742c3afd9acca7811ed8eba7dd7

The TL;DR is that there is a scalability regression in the current code: kepler's CPU usage is very sensitive to the number of Pods.

@rootfs rootfs added the urgent-priority Issues that are part of the next release label Nov 11, 2022
@rootfs
Contributor

rootfs commented Nov 11, 2022

The scalability regression was introduced in ad9daed (#287). I tested up to 73a4291 and found no regression there.

@rootfs
Contributor

rootfs commented Nov 11, 2022

Looking at this block, there are quite a few maps created and discarded after use. Creating maps and letting them go to garbage collection is quite expensive.

	containerMetricValuesOnly, containerIDList, containerGPUDelta := c.getContainerMetricsList()

	// use the container's resource usage metrics to update the node metrics
	nodeTotalPower, nodeTotalGPUPower, nodeTotalPowerPerComponents := c.updateNoderMetrics(containerMetricValuesOnly)

	// calculate the container energy consumption using its resource utilization and the node components energy consumption
	// TODO: minimize the number of collection variables, we already have the containerMetrics and NodeMetrics structures
	c.updateContainerEnergy(containerMetricValuesOnly, containerIDList, containerGPUDelta, nodeTotalPower, nodeTotalGPUPower, nodeTotalPowerPerComponents)
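
One way to relieve that pressure, sketched below, would be to allocate the collection buffers once and reuse them across collection cycles instead of returning fresh slices and maps each time; the type and field names are illustrative only, not Kepler's actual code:

	package collector

	// collectBuffers holds scratch space that is reused across collection
	// cycles, so that each cycle does not allocate (and later garbage-collect)
	// fresh slices and maps. All names here are illustrative.
	type collectBuffers struct {
		metricValues [][]float64
		containerIDs []string
		gpuDelta     map[string]uint64
	}

	func newCollectBuffers() *collectBuffers {
		return &collectBuffers{gpuDelta: make(map[string]uint64)}
	}

	// reset empties the buffers while keeping their backing storage allocated,
	// so the next cycle can refill them without new allocations.
	func (b *collectBuffers) reset() {
		b.metricValues = b.metricValues[:0]
		b.containerIDs = b.containerIDs[:0]
		for k := range b.gpuDelta {
			delete(b.gpuDelta, k)
		}
	}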

@rootfs
Contributor

rootfs commented Nov 11, 2022

Based on the profiling result, gc is an issue
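
For reference, a minimal way to gather this kind of profile with Go's standard tooling, assuming the exporter exposes the net/http/pprof handlers (the port below is just an example):

	package main

	import (
		"log"
		"net/http"
		_ "net/http/pprof" // registers the /debug/pprof handlers on the default mux
	)

	func main() {
		// With this endpoint exposed, a 30s CPU profile can be captured with:
		//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
		log.Fatal(http.ListenAndServe("localhost:6060", nil))
	}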

@marceloamaral

@marceloamaral
Collaborator

Thanks @rootfs for the excellent profile analysis.
Yes, the new code is using more CPU. Maybe it's because before we had the metrics aggregated by pod and now we have them by container, so there are more values in the maps....

Looking at this block, there are quite a few maps created and discarded after use. Creating maps and letting them go to garbage collection is quite expensive.

@rootfs I didn't create new maps; I just refactored the ones that already existed to make it transparent what we're using.
But the performance regression may be related to some of these functions.
Before, everything was in just one function, and I split it into several functions to improve code readability....

@rootfs
Contributor

rootfs commented Nov 12, 2022

@marceloamaral containerMetricValuesOnly, containerIDList, containerGPUDelta, nodeTotalPower, nodeTotalGPUPower, and nodeTotalPowerPerComponents are created by functions. They are used once and thrown away. This puts a lot of pressure on the gc.

@jichenjc
Collaborator

The empty cgroup values might be related to this fix (PR#377)

Are you thinking this is a regression?
I tested on my local CentOS 8 env: without the fix, the cgroup value is 0.
So maybe some regression was introduced in #377?

@marceloamaral
Collaborator

containerMetricValuesOnly, containerIDList, containerGPUDelta, nodeTotalPower, nodeTotalGPUPower, and nodeTotalPowerPerComponents are created by functions. They are used once and thrown away. This puts a lot of pressure on the gc.

@rootfs I agree 100%. And that is why I included the TODO in the code to remove this...

These arrays are used in the model package because the model server only works with arrays.
To improve this, we should use protobuf (for example) for the data, and the model server should use it as well.
With protobuf, we won't need to convert structured data into multiple arrays....
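
A rough sketch of the direction being suggested: pass one structured record per container (whether or not it is ultimately encoded with protobuf) instead of several index-aligned arrays. None of the type names below exist in Kepler today:

	package model

	// ContainerUsage bundles per-container resource usage into one structured
	// record, instead of parallel arrays indexed by container position.
	// A schema like this could also be generated from a protobuf message so
	// the collector and the model server exchange the same type directly.
	// All names here are hypothetical.
	type ContainerUsage struct {
		ContainerID string
		CPUTimeUS   uint64
		CacheMisses uint64
		GPUTimeUS   uint64
	}

	// PowerRequest is what the collector would send to the model server in a
	// single call, rather than several slices that must stay index-aligned.
	type PowerRequest struct {
		NodeName   string
		Containers []ContainerUsage
	}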

@wangchen615 wangchen615 added the kind/feature New feature or request label Nov 14, 2022
@marceloamaral
Collaborator

Based on the CPU profile results, I also analyzed the MEM profile at #391.
In an initial analysis, the ListPod function seems to be the bottleneck.
And getCPUCoreFrequency is also a very expensive function.

@wangchen615 wangchen615 added high-priority Issues that are part of the next release but not mandatory and removed urgent-priority Issues that are part of the next release labels Nov 15, 2022
@simonpasquier

👋 Prometheus team member here. Though I'm not an expert in client code instrumentation and I've only skimmed over the issue, I can probably share some information :)
While in general code instrumentation is lightweight in terms of resource usage (e.g. github.com/prometheus/client_golang is highly optimized), the custom collector pattern (using Const metrics) can be problematic, especially when a single exporter exposes tens of thousands of metrics: each scrape from Prometheus allocates lots of objects, putting pressure both on the memory and on the CPU (due to garbage collection).

Projects like kube-state-metrics encountered similar issues in the past (and kubelet/cAdvisor to some extent). In practice, kube-state-metrics alleviated the problem by implementing a custom Prometheus client library (see the metrics_store and metricshandler packages). kubernetes/kube-state-metrics#498 has also lots of details on why building a custom metric client was the right choice for the project.
This isn't me saying that kepler should take the same route :) For instance, there are some experiments in github.com/prometheus/client_golang (prometheus/client_golang#929 and prometheus/client_golang#989) and cAdvisor (kubernetes/kubernetes#104459, kubernetes/kubernetes#108194 and google/cadvisor#2974) to support metrics caching in order to reduce the resource overhead. It would be interesting to see if this is applicable to kepler too.
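
To make the allocation pattern concrete, here is a minimal, hedged sketch of a custom collector that caches its const metrics between scrapes instead of rebuilding them on every Collect call; the metric structure is a placeholder, not Kepler's actual exporter, and a real implementation would need to guard the cache with a mutex:

	package exporter

	import "github.com/prometheus/client_golang/prometheus"

	// Placeholder descriptor; the label set is illustrative only.
	var containerJoulesDesc = prometheus.NewDesc(
		"kepler_container_joules_total",
		"Total energy consumed by the container (illustrative help text).",
		[]string{"container_id"}, nil,
	)

	// cachedCollector keeps the const metrics built between scrapes and only
	// rebuilds them when the underlying readings change, so each Prometheus
	// scrape does not allocate tens of thousands of fresh objects.
	type cachedCollector struct {
		cached []prometheus.Metric
	}

	// refresh rebuilds the cache from the latest readings; it would be called
	// from the collection loop, not on every scrape.
	func (c *cachedCollector) refresh(joulesByContainer map[string]float64) {
		metrics := make([]prometheus.Metric, 0, len(joulesByContainer))
		for id, joules := range joulesByContainer {
			metrics = append(metrics, prometheus.MustNewConstMetric(
				containerJoulesDesc, prometheus.CounterValue, joules, id))
		}
		c.cached = metrics
	}

	func (c *cachedCollector) Describe(ch chan<- *prometheus.Desc) {
		ch <- containerJoulesDesc
	}

	func (c *cachedCollector) Collect(ch chan<- prometheus.Metric) {
		// Scrapes replay the cached metrics instead of rebuilding them.
		for _, m := range c.cached {
			ch <- m
		}
	}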

@marceloamaral
Collaborator

Thanks @simonpasquier for the pointers!

@marceloamaral
Collaborator

Btw, PR #427 introduces a 2.4X performance improvement for Kepler according to the CPU profiling.

@stale

stale bot commented May 17, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label May 17, 2023
@stale stale bot closed this as completed May 24, 2023