Introducing isExclusive field as part of ContainerResources in PodResource API #102989
Conversation
/sig node
func (m *mockProvider) IsExclusive(podUID, containerName string) bool {
	args := m.Called(podUID, containerName)
	return args.Get(0).(bool)
}
This looks like a leftover of a past revision
Good catch! Thanks, fixed it.
Force-pushed from 4837b35 to 3441e7b (Compare)
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
Force-pushed from 7f3b226 to bee1a52 (Compare)
/triage accepted
@ruiwen-zhao can you please take a look?
@@ -79,7 +79,7 @@ type Manager interface {
// GetCPUs implements the podresources.CPUsProvider interface to provide allocated
// cpus for the container
can we also add to the comment what the boolean is?
Sure, done!
cpuCount: 1,
podName: "pod-03",
cntName: "cnt-00",
cpuRequest: "1000m",
Just trying to understand why we are changing to use millis here? The tests don't seem to use fractional CPU requests?
In the commit here: 739f11b I have added a test case with fractional CPU request, which demonstrates the case of guaranteed pods with fractional CPU request thereby obtaining CPUs from the shared pool.
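For illustration, such a table entry might look roughly like the sketch below. The field names follow the test snippet quoted above; the pod name and the exact set of fields per entry are assumptions, not copied from commit 739f11b:

```go
// Hypothetical test-table entry: a guaranteed pod with a fractional CPU
// request. Requests equal limits, but 1500m is not an integral number of
// cpus, so the container is served from the shared pool rather than being
// exclusively allocated cpus.
var fractionalCase = struct {
	podName    string
	cntName    string
	cpuRequest string
}{
	podName:    "pod-04",
	cntName:    "cnt-00",
	cpuRequest: "1500m",
}
```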
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
This patch changes cpuCount to cpuRequest in order to cater to cases where guaranteed pods make non-integral CPU Requests. Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Force-pushed from 739f11b to d047a92 (Compare)
@ruiwen-zhao I have addressed the questions/comments you had. Could you take another look?
/lgtm Thanks!
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: ruiwen-zhao, swatisehgal
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Sorry to be blunt, but to me this API change looks very kludge-ish, without enough long-term perspective 😸 My main concerns are:
- this change addresses one specific use case
- this change too heavily reflects the internal state/implementation of kubelet in the API
My suggestion would be to extend the API with full information about the resource requests and limits. This would allow the consumer to derive isExclusive, and much more.
map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> requests = 5;
map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> limits = 6;
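For context, a rough sketch (not from this PR) of how a consumer could derive exclusivity from such requests/limits maps, assuming the static CPU manager policy where exclusive cpus are granted only when the container's CPU request equals its limit and is an integral number of cpus. The real rule also requires the whole pod to be in the Guaranteed QoS class, which this helper deliberately ignores:

```go
import "k8s.io/apimachinery/pkg/api/resource"

// isExclusiveCPU is an illustration only: it mirrors the static CPU manager
// rule that a container can get exclusive cpus when its cpu request equals
// its limit and is a whole number of cpus. The pod-level Guaranteed QoS
// requirement is left out for brevity.
func isExclusiveCPU(requests, limits map[string]resource.Quantity) bool {
	req, ok := requests["cpu"]
	if !ok {
		return false
	}
	lim, ok := limits["cpu"]
	if !ok || req.Cmp(lim) != 0 {
		return false
	}
	milli := req.MilliValue()
	return milli > 0 && milli%1000 == 0
}
```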
The solution proposed here handles a corner case: with guaranteed pods requesting non-integral CPUs, there is currently no way to determine what comprises the shared pool. This issue was also raised by @derekwaynecarr during the enhancements made to the podresource API here. IMO, the solution proposed here is a non-invasive, backward-compatible change that does not alter the original goal of the podresource API.
Reflecting the internal state of the kubelet in terms of the resources allocated is essentially the goal of the podresource API. It returns information about the kubelet's assignment of concrete resources (devices with their NUMA ID, cpu ids and, recently, memory) allocated to containers. Please refer to kubernetes/enhancements#1884, #93243 and #95734. The podresource API allows monitoring agents to gain visibility into how those resources are allocated, for the use cases that can be seen here. The information remains local to the node: the endpoint is accessed by a monitoring agent running on the node and is not exposed outside of the node, so I am not sure what the cause of concern is.
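As a point of reference for how such an agent reads the endpoint today, here is a minimal consumer sketch against the existing v1 API. The socket path is the conventional default and error handling is stripped down; this is an illustration, not code from this PR:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	// The kubelet serves the PodResources API on a node-local unix socket.
	conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock", grpc.WithInsecure())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesListerClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}
	// Each container reports its assigned devices (with NUMA topology) and,
	// since the recent enhancements, its cpu ids.
	for _, pod := range resp.GetPodResources() {
		for _, cnt := range pod.GetContainers() {
			fmt.Printf("%s/%s cpu_ids=%v devices=%d\n",
				pod.GetName(), cnt.GetName(), cnt.GetCpuIds(), len(cnt.GetDevices()))
		}
	}
}
```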
Having put some thought into this, here are some of my concerns:
I agree that it's non-invasive and backward compatible. But I think it's short-sighted, addressing only the nearest bump in front of our front tire.
I don't believe the API should be too tightly coupled with the (current) internal implementation details of kubelet. Just off the top of my head: what if kubelet starts to support "mixed allocations", with some exclusive and some shared cpus per container?
We have the cpu ids. Second, a bit unrelated, you can't do proper accounting of the cpus just based on the IDs.
Maybe the pod QoS class should be added into the podresources response?
For now, it's only cpus; the future might bring surprises, who knows. OTOH, I can't see what harm including the complete information about resource requests would do.
I wouldn't see it as expanding the scope, rather making it more complete. Maybe it's about perspective 😄 I think overall we should learn something from the way the Linux kernel community designs APIs. One use case is tracking burstable containers (which I already mentioned above).
I found that providing requests and limits, as well as QoS, in the podresources API is controversial.
If we are talking about accounting in the context of Topology Aware Scheduling, considering burstable pods for accounting does not make sense, as burstable pods (just like best-effort pods and guaranteed pods with non-integral CPU requests) obtain CPUs from the shared pool. Because the CPU manager only allocates exclusive CPUs to guaranteed pods with integral CPU requests (and limits), the rest (non-integral guaranteed, best-effort and burstable pods) obtain CPUs from a shared pool. When CPUs are exclusively allocated we know the CPU IDs allocated to the pod and can therefore determine the corresponding NUMA node. But how do you account for CPUs allocated (and therefore available) at a per-NUMA level when the CPUs are in the shared pool? All we know is the CPU IDs corresponding to the shared pool, and even if we knew the amount of CPU requested (as per the approach you are suggesting) it would not help us, because we still can't determine the NUMA distribution of CPUs for a pool that is shared by all burstable, best-effort and non-integral guaranteed pods.
Adding QoS to the podresource API does not solve the problem at hand, as we are trying to differentiate pods which are exclusively allocated CPUs from the ones that belong to the shared pool. A guaranteed pod with non-integral CPUs can belong to the shared pool, and those are the ones we should not account, for the reasons I explained above.
It is indeed about perspective :)
In what way? You cannot derive that information from the other data.
I can see that, and that is a good motivation 😸 But I think it falls short: why be so short-sighted and only address the very problem at our hands just now? 🤔
I think an API shouldn't be designed for one specific consumer/usage scenario alone. There might be other users of the API than just TAS. With all the work going on there is probably interest from other scheduler extensions, too.
True, it possibly wouldn't bring much value. But what about the "mixed allocation" that I mentioned earlier (i.e. if in the future the kubelet were to allocate exclusive plus shared cpus for the same container)? Perhaps having e.g. something like
would be more future-proof?
What is your perspective on "sustainable" API design? 😉
Right, we shouldn't be defining the API based on a very specific use case. Let's put Topology Aware Scheduling aside: you mentioned tracking burstable pods as an example, but I double-checked the List endpoint and noted that burstable pods are already exposed as part of the List response. At the moment it is not clear to me what we are trying to achieve by introducing request and limit information in the podresource API itself; unfortunately, the value proposition is not very clear to me yet.
Can you elaborate on what scheduler extensions you are referring to here? If we have a specific use case there is no problem extending the API, but like I said, from the client's perspective it seems to be adding a lot of complexity to solve a problem which can be solved in a cleaner and simpler manner.
I am okay with the addition of
I am all for a "sustainable" API, but we have to strike a balance here to make sure that we introduce a change that A) aligns with the goals and intents of the API and B) has a strong use case to back the API change. Changing the API for something we "might" need is not ideal.
I will create a patch with the addition of |
From the original description:
This sounds good!
I see the point, but I also see a danger. In this API design there is a built-in assumption about the nature of CPU pools: that CPUs are either from the shared CPU pool or exclusive to a container. In use cases where this assumption is true, I think it is perfectly fine that a monitoring agent makes this assumption, too. But if you make this assumption in the API design, you cannot build sensible monitoring agents for different use cases using that API. Please consider the podpools policy as a real-life example of a different use case. This policy partitions the CPUs of a node into pools whose capacity is configured as the number of pods that can run in the same pool. This is very useful for efficient resource usage in certain cases where you know what kind of workloads will fill your node; for instance, a node's CPUs could be split into several such pools.
Therefore, I would prefer an API design that exposes the resource reservation information in a format that is independent of the use case. In other words, I would prefer a design that makes minimal or no assumptions about what the reservation algorithm might have been thinking when making the reservations that are reported.
The rationale behind this change is explained in detail here: #102190
A pod with the same non-integral CPU request and limit belongs to the Guaranteed
QoS class but obtains CPUs from the shared pool. Currently, in the podresource API
there is no way to distinguish such pods from pods which have been exclusively
allocated CPUs.
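To make the scenario concrete, a sketch (not taken from the PR) of such a pod's resource section using the client-go types; the container name, image and resource values are illustrative:

```go
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// guaranteedFractional: requests equal limits for every resource, so the pod
// is classified as Guaranteed QoS, yet the 1500m CPU request is non-integral,
// so the static CPU manager serves it from the shared pool.
var guaranteedFractional = corev1.Pod{
	Spec: corev1.PodSpec{
		Containers: []corev1.Container{{
			Name:  "app",
			Image: "k8s.gcr.io/pause:3.5",
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceCPU:    resource.MustParse("1500m"),
					corev1.ResourceMemory: resource.MustParse("100Mi"),
				},
				Limits: corev1.ResourceList{
					corev1.ResourceCPU:    resource.MustParse("1500m"),
					corev1.ResourceMemory: resource.MustParse("100Mi"),
				},
			},
		}},
	},
}
```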
One of the primary goals of the recent enhancements to the PodResource API is to
enable node monitoring agents to know the allocatable compute resources on a node,
in order to calculate node compute resource utilization. However, not being
able to determine whether a pod has been exclusively allocated CPUs, or whether its
CPUs belong to the shared pool, means that this goal cannot be realized.
We need to enhance the podresource API so that it can report whether a pod's CPUs
are exclusively allocated or come from the shared pool, allowing node monitoring
agents to do proper accounting.
Signed-off-by: Swati Sehgal swsehgal@redhat.com
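To illustrate the accounting described above, a short consumer-side sketch of how the proposed flag would be used, continuing from the List response obtained in the earlier client sketch. The Go field name IsExclusive on ContainerResources is assumed here; it is not part of the released v1 API:

```go
// Split reported cpu ids into exclusively allocated cpus and cpus reported
// for containers served from the shared pool. "resp" is the
// ListPodResourcesResponse from the earlier client sketch.
exclusiveCPUs := map[int64]struct{}{}
for _, pod := range resp.GetPodResources() {
	for _, cnt := range pod.GetContainers() {
		if !cnt.IsExclusive { // proposed flag; hypothetical generated field name
			// These cpu ids describe the shared pool and must not be
			// subtracted from per-NUMA allocatable cpu accounting.
			continue
		}
		for _, id := range cnt.GetCpuIds() {
			exclusiveCPUs[id] = struct{}{}
		}
	}
}
fmt.Printf("%d cpus exclusively allocated\n", len(exclusiveCPUs))
```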
What type of PR is this?
/kind bug
/kind api-change
What this PR does / why we need it:
This PR introduces a new boolean field called isExclusive as part of ContainerResources in the PodResource API, to distinguish whether the CPUs are exclusively allocated to a container or allocated from the shared pool.
Which issue(s) this PR fixes:
Fixes #102190
Documentation update: kubernetes/website#28488
Does this PR introduce a user-facing change?
Special notes for your reviewer:
This issue was discussed in SIG Node on 1st June 2021 and the SIG Node chair (Dawn Chen) agreed to treat this as a bug to be fixed in the 1.22 release timeframe.