
Introducing isExclusive field as part of ContainerResources in PodResource API #102989

Closed

Conversation

Contributor

@swatisehgal swatisehgal commented Jun 18, 2021

Rationale behind this change is explained in detail here: #102190

A pod with the same non-integral CPU request and limit belongs to the Guaranteed
QoS class but obtains its CPUs from the shared pool. Currently, the podresources API
provides no way to distinguish such pods from pods that have been allocated
exclusive CPUs.
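To make the corner case concrete, here is a minimal, illustrative Go sketch (container name and quantity are hypothetical): the request equals the limit, so the pod is Guaranteed, yet the quantity is fractional, so the static CPU manager policy serves it from the shared pool.

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

func main() {
    // Request == limit => Guaranteed QoS for this container, but 1500m is
    // non-integral, so the static CPU manager will not pin CPUs for it.
    cpu := resource.MustParse("1500m")
    cnt := v1.Container{
        Name: "cnt-fractional",
        Resources: v1.ResourceRequirements{
            Requests: v1.ResourceList{v1.ResourceCPU: cpu},
            Limits:   v1.ResourceList{v1.ResourceCPU: cpu},
        },
    }
    integral := cpu.MilliValue()%1000 == 0
    fmt.Printf("container %q: integral CPU request: %v (exclusive CPUs only when true)\n", cnt.Name, integral)
}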

One of the primary goals of the recent enhancements to the PodResources API is to
enable node monitoring agents to know the allocatable compute resources on a node,
so that they can calculate node compute resource utilization. However, without a
way to determine whether a pod has been allocated exclusive CPUs or its CPUs belong
to the shared pool, that goal cannot be realized.

We therefore need to enhance the podresources API so that node monitoring agents can
determine whether a pod's CPUs come from the shared pool or are exclusively
allocated, and do proper accounting.

Signed-off-by: Swati Sehgal swsehgal@redhat.com

What type of PR is this?

/kind bug
/kind api-change

What this PR does / why we need it:

This PR introduces a new boolean field, isExclusive, to ContainerResources in the PodResources API to indicate whether a container's CPUs are exclusively allocated or come from the shared pool.
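As a rough sketch of how a node agent might consume the new field (the struct below is a hypothetical local mirror of the ContainerResources message, not the generated API):

package main

import "fmt"

type containerResources struct {
    Name        string
    CpuIds      []int64
    IsExclusive bool // true only when the CPUs above are exclusively allocated
}

// accountCPUs tallies pinned CPUs separately from shared-pool containers.
func accountCPUs(containers []containerResources) (exclusiveCPUs, sharedContainers int) {
    for _, c := range containers {
        if c.IsExclusive {
            exclusiveCPUs += len(c.CpuIds) // pinned CPUs, removable from the shared pool
        } else {
            sharedContainers++ // runs on the shared pool; its CpuIds overlap others'
        }
    }
    return exclusiveCPUs, sharedContainers
}

func main() {
    containers := []containerResources{
        {Name: "cnt-00", CpuIds: []int64{2, 3}, IsExclusive: true},
        {Name: "cnt-01", CpuIds: []int64{0, 1, 4, 5}, IsExclusive: false},
    }
    excl, shared := accountCPUs(containers)
    fmt.Printf("exclusive CPUs: %d, shared-pool containers: %d\n", excl, shared)
}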

Which issue(s) this PR fixes:

Fixes #102190

Documentation update: kubernetes/website#28488

Does this PR introduce a user-facing change?

The kubelet podresources API now indicates whether a pod has been allocated exclusive CPUs.

Special notes for your reviewer:

This issue was discussed in SIG Node on 1 June 2021, and the SIG Node chair (Dawn Chen) agreed to treat it as a bug to be fixed in the 1.22 release timeframe.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 18, 2021
@swatisehgal
Contributor Author

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Jun 18, 2021
@k8s-ci-robot k8s-ci-robot added area/kubelet area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 18, 2021
// IsExclusive records the call on the testify mock and returns the stubbed boolean result.
func (m *mockProvider) IsExclusive(podUID, containerName string) bool {
    args := m.Called(podUID, containerName)
    return args.Get(0).(bool)
}
Contributor

This looks like a leftover from a past revision.

Contributor Author

Good catch! Thanks, fixed it.

@fejta-bot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@swatisehgal swatisehgal force-pushed the podres-isExclusive-pr branch 3 times, most recently from 7f3b226 to bee1a52, on June 18, 2021 14:13
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 21, 2021
@ehashman ehashman added this to Triage in SIG Node PR Triage Jun 21, 2021
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jun 22, 2021
@ehashman
Member

/triage accepted
/priority important-soon

@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Archive-it in SIG Node CI/Test Board Jun 30, 2021
@SergeyKanzhelev
Member

@ruiwen-zhao can you please take a look?

@@ -79,7 +79,7 @@ type Manager interface {

// GetCPUs implements the podresources.CPUsProvider interface to provide allocated
// cpus for the container
Contributor

Can we also add to the comment what the boolean means?

Contributor Author

Sure, done!

cpuCount: 1,
podName: "pod-03",
cntName: "cnt-00",
cpuRequest: "1000m",
Contributor

Just trying to understand why we are changing to use millis here? The tests don't seem to use fractional CPU requests.

Contributor Author

In commit 739f11b I have added a test case with a fractional CPU request, which demonstrates guaranteed pods with fractional CPU requests obtaining CPUs from the shared pool.
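For reference, a small illustrative sketch of what switching from a plain CPU count to a milli-quantity string enables in the test: distinguishing integral requests (eligible for exclusive assignment) from fractional ones (shared pool). Values are hypothetical.

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/api/resource"
)

// isIntegralCPU reports whether a cpuRequest string such as "1000m" or "1500m"
// represents a whole number of CPUs.
func isIntegralCPU(cpuRequest string) bool {
    return resource.MustParse(cpuRequest).MilliValue()%1000 == 0
}

func main() {
    for _, req := range []string{"1000m", "1500m"} {
        fmt.Printf("cpuRequest %q integral: %v\n", req, isIntegralCPU(req))
    }
}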

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
This patch changes cpuCount to cpuRequest in order to cater to cases
where guaranteed pods make non-integral CPU requests.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
@ehashman ehashman moved this from Waiting on Author to Needs Reviewer in SIG Node PR Triage Jul 7, 2021
@swatisehgal
Contributor Author

@ruiwen-zhao I have addressed the questions/comments you had. Could you take another look?

@ruiwen-zhao
Contributor

/lgtm
/approve

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 21, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ruiwen-zhao, swatisehgal
To complete the pull request process, please assign mrunalp after the PR has been reviewed.
You can assign the PR to them by writing /assign @mrunalp in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ehashman ehashman moved this from Needs Reviewer to Needs Approver in SIG Node PR Triage Aug 4, 2021
Contributor

@marquiz marquiz left a comment

Sorry to be blunt, but to me this API change looks very kludge-ish, without enough long-term perspective 😸 My main concerns are:

  • this change addresses one specific use case
  • this change too heavily reflects the internal state/implementation of kubelet in the API

My suggestion would be to extend the API with full information of the resource requests and limits. This would allow the consumer to derive isExclusive, and much more.

     map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> requests = 5;
     map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> limits = 6;
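For illustration, the derivation a consumer would then perform might look roughly like the sketch below. It is a simplification: it checks only the CPU resource and skips the full pod-level QoS computation, and the map keys/types are assumed from the fragment above.

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/api/resource"
)

// deriveIsExclusive re-implements, on the client side, the rule that exclusive
// CPUs require the CPU request to equal the limit and to be integral.
func deriveIsExclusive(requests, limits map[string]resource.Quantity) bool {
    req, okR := requests["cpu"]
    lim, okL := limits["cpu"]
    if !okR || !okL {
        return false
    }
    return req.Cmp(lim) == 0 && req.MilliValue()%1000 == 0
}

func main() {
    requests := map[string]resource.Quantity{"cpu": resource.MustParse("2")}
    limits := map[string]resource.Quantity{"cpu": resource.MustParse("2")}
    fmt.Println("exclusive:", deriveIsExclusive(requests, limits)) // true
}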

@swatisehgal
Contributor Author

swatisehgal commented Aug 25, 2021

The solution proposed here is to handle a corner case of guaranteed pods requesting non-integral CPUs and not being able to determine what comprises the shared pool. This issue was also raised by @derekwaynecarr during the enhancements made to the podresource API here.

IMO, the solution proposed here is non-invasive and backward compatible, and it does not alter the original goal of the podresources API.

this change too heavily reflects the internal state/implementation of kubelet in the API

Reflecting the kubelet's internal state in terms of the resources allocated is essentially the goal of the podresources API. It returns information about the kubelet's assignment of concrete resources (devices with their NUMA IDs, CPU IDs and, recently, memory) to containers. Please refer to kubernetes/enhancements#1884, #93243 and #95734. The podresources API allows monitoring agents to gain visibility into how those resources are allocated, for the use cases that can be seen here. The information remains local to the node: the endpoint is accessed by a monitoring agent running on the node and is not exposed outside it, so I am not sure what the cause for concern is.

My suggestion would be to extend the API with full information of the resource requests and limits. This would allow the consumer to derive isExclusive, and much more.

 map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> requests = 5;
 map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> limits = 6;

Having put some thought into this, here are my concerns:

  1. This appears to be a way to determine what the pod spec looked like from a resource point of view (something that can be obtained by querying the API server), and it doesn't align with the original intent of giving node-local clients information about the kubelet's concrete resource assignments.
  2. The kubelet already evaluates a pod's QoS class, and whether it was allocated exclusive CPUs, based on its resource requests and limits. This evaluation is not trivial; expecting clients to perform it, essentially duplicating what the kubelet does, makes the API harder to consume.
  3. The edge case we have encountered applies only to CPUs: guaranteed pods requesting non-integral CPUs can still belong to the shared pool, which is not the case for any other resource. Exposing the requests and limits of every resource a pod asks for in the podresources API therefore seems like overkill to me.
  4. In my experience, expanding the scope of an API without concrete use cases, just to make it future-proof, can be a recipe for disaster. Could you provide use cases that would benefit from this change, so we can understand the motivation better?

@marquiz
Contributor

marquiz commented Aug 26, 2021

IMO, the solution proposed here is a non-invasive backward compatible solution

I agree that it's non-invasive and backward compatible. But I think it's short-sighted, addressing only the nearest bump in front of our front tire.

this change too heavily reflects the internal state/implementation of kubelet in the API

Reflecting the internal state of kubelet in terms of the resources allocated is essentially the goal of pod resource API.

I don't believe the API should be too tightly coupled with the (current) internal implementation details of kubelet. Just off the top of my head: what if the kubelet starts to support "mixed allocations", with some exclusive and some shared CPUs per container?

My suggestion would be to extend the API with full information of the resource requests and limits. This would allow the consumer to derive isExclusive, and much more.

 map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> requests = 5;
 map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> limits = 6;

Having put some thought into this, here are my concerns:

  1. This appears to be a way to determine what the pod spec looked like from resource point of view (something which can be obtained by querying the API server) and doesn’t seem to align with the original intent of providing a way to node local clients to determine info on Kubelet's concrete resource assignment.

We have the CPU IDs.

Second, a bit unrelated: you can't do proper accounting of the CPUs based on the IDs alone. I.e. even with the isExclusive flag added, we still fall short on e.g. burstable containers.

  2. Kubelet already evaluates a pod's QoS and whether it was allocated exclusive CPUs or not based on its resource requests and limits. This evaluation is not trivial and expecting the client to perform this evaluation and essentially duplicating what Kubelet does, makes the consumption harder for the clients.

Maybe the pod QoS class should be added to the PodResources message instead, then? I don't know if this is a good idea either 😉, but at least the pod QoS class is a somewhat de-facto standardized property of the pod. Another way would be to have a library function for evaluating this from requests and limits, so that not all consumers need to re-implement it.

  3. This edge case we have encountered is only in case of CPUs. The only reason we run into this issue is because guaranteed pods requesting CPUs can still belong to shared pool (in case of non-integral CPUs). However, this is not a case with any other resource. So exposing the request and limit of all the possible resources requested by pod as part of podresource API seems like an overkill to me.

For now it's only CPUs; the future might bring surprises, who knows. OTOH, I can't see what harm including the complete resource request information would do.

  4. Based on my experience expanding the scope of the API without concrete use cases just to make it future proof can be a recipe for disaster. Could you please provide use cases that would benefit from this change so we can understand the motivation a bit better?

I wouldn't see it as expanding the scope, rather as making it more complete. Maybe it's about perspective 😄 I think overall we should learn something from the way the Linux kernel community designs APIs.

One use case is tracking burstable containers (which I already mentioned above).

@AlexeyPerevalov
Contributor

I find providing requests and limits, as well as QoS, in the podresources API controversial: it adds redundancy to the API.
Today we can already identify whether a CPU was exclusively allocated, indirectly: get the corresponding pod via client-go (or from a prefetched list, it doesn't matter), which tells us the QoS class and whether the requests and limits are integral,
and read the KubeletConfiguration to make sure the CPUManager uses the static policy.
The original motivation of this commit was to simplify the topology exporter and reduce data flows.
If we want to generalize the podresources interface, that is mostly a task for the next version (v2).
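For illustration, the indirect route described above might look roughly like this client-go sketch (namespace, pod name and setup are hypothetical; verifying the static CPUManager policy from the KubeletConfiguration is only noted in a comment):

package main

import (
    "context"
    "fmt"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// podHasExclusiveCPUs applies the indirect check: Guaranteed QoS plus integral
// CPU requests for every container. It does not verify that the kubelet's
// CPUManager policy is "static"; that has to be read from the
// KubeletConfiguration separately.
func podHasExclusiveCPUs(ctx context.Context, cs kubernetes.Interface, namespace, name string) (bool, error) {
    pod, err := cs.CoreV1().Pods(namespace).Get(ctx, name, metav1.GetOptions{})
    if err != nil {
        return false, err
    }
    if pod.Status.QOSClass != v1.PodQOSGuaranteed {
        return false, nil
    }
    for _, c := range pod.Spec.Containers {
        cpu, ok := c.Resources.Requests[v1.ResourceCPU]
        if !ok || cpu.MilliValue()%1000 != 0 {
            return false, nil // missing or fractional CPU request: shared pool
        }
    }
    return true, nil
}

func main() {
    cfg, err := rest.InClusterConfig() // assumes the agent runs in-cluster
    if err != nil {
        panic(err)
    }
    cs := kubernetes.NewForConfigOrDie(cfg)
    exclusive, err := podHasExclusiveCPUs(context.Background(), cs, "default", "pod-00")
    fmt.Println(exclusive, err)
}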

@swatisehgal
Contributor Author

Second, a bit unrelated, you can't do proper accounting of the cpus just by based on the IDs. I.e. with the isExclusive flag added we still fall short on e.g. burstable containers.

If we are talking about accounting in the context of topology-aware scheduling, considering burstable pods for accounting does not make sense: burstable pods (just like best-effort pods and guaranteed pods with non-integral CPUs) obtain CPUs from the shared pool. Because the CPU manager only allocates exclusive CPUs to guaranteed pods with integral CPU requests (and limits), everything else (non-integral guaranteed, best-effort, burstable pods) gets CPUs from the shared pool. When CPUs are exclusively allocated we know the CPU IDs assigned to the pod and can therefore determine the corresponding NUMA node. But how do you account for CPUs allocated (and therefore available) at a per-NUMA level when the CPUs are in the shared pool? All we know is the CPU IDs corresponding to the shared pool, and even if we knew the amount of CPU requested (as per the approach you are suggesting), it would not help, because we still can't determine the NUMA distribution of CPUs for a pool that is shared by all burstable, best-effort and non-integral guaranteed pods.

Maybe the pod qos class should be added into PodResources message instead, then(?) I don't know if this is a good idea (either ) but at least the pod qos class is somewhat de-facto standardized property of the pod. Another way would be to have library function for evaluating this from requests and limits so that not all consumers need to re-implement it.

Adding the QoS class to PodResources does not solve the problem at hand, as we are trying to differentiate pods that are exclusively allocated CPUs from those that belong to the shared pool. A guaranteed pod with non-integral CPUs can belong to the shared pool, and those are the ones we should not account for, for the reasons I explained above.

I wouldn't see it as expanding the scope. Rather making it more complete. Maybe it's about perspective I think overall we should learn something from the way Linux kernel community designs apis.

It is indeed about perspective :)

@marquiz
Contributor

marquiz commented Sep 1, 2021

I found that providing requests and limits as well as qos for podresources api is controversial.
It adds redundancy into podresources api.

In what way? You cannot derive that information from the other data.

The original motivation of this commit was to simplify topology exporter, to reduce data flows.

I can see that, and that is a good motivation 😸 But I think it falls short; why be so short-sighted and address only the very problem at hand right now? 🤔

@marquiz
Contributor

marquiz commented Sep 1, 2021

If we are talking about accounting in the context of Topology aware scheduling

I think an API shouldn't be designed for one specific consumer/usage scenario alone. There might be other users of the API than just TAS. With all the work going on, there is probably interest from other scheduler extensions, too.

Adding QoS to Podresource does not solve the problem at hand here as we are trying to differentiate a pod which is exclusively allocated CPUs and the ones that belong to the shared pool. A guaranteed pod with non-integral CPUs can belong to the shared pool and those are the ones we should not account for the reasons

True, it possibly wouldn't bring much value.

But what about the "mixed allocation" I mentioned earlier (i.e. if in the future the kubelet were to allocate exclusive plus shared CPUs for the same container)? Perhaps having e.g. something like

repeated int64 exclusive_cpu_ids

would be more future-proof?

It is indeed about perspective :)

What is your perspective on "sustainable" API design? 😉

@swatisehgal
Contributor Author

I think an API shouldn't be designed for one specific consumer/usage scenario alone. There might be other users to the API than just TAS.

Right, we shouldn't be defining the API based on one very specific use case. Putting topology-aware scheduling aside: you mentioned tracking burstable pods as an example, but I double-checked the List endpoint and burstable pods are already exposed in the List response. At the moment it is not clear to me what we are trying to achieve by introducing request and limit information into the podresources API itself; unfortunately, the value proposition is not yet clear to me.

With all the work going on there is probably interest for other scheduler extensions, too.

Can you elaborate on which scheduler extensions you are referring to here? If we have a specific use case there is no problem extending the API, but as I said, from the client's perspective it adds a lot of complexity to solve a problem that can be solved in a cleaner and simpler manner.

But what about "mixed allocation" that I mentioned earlier (i.e. if in the future kubelet would allocate exclusive plus shared cpus for the same container)? Perhaps having e.g. something like

repeated int64 exclusive_cpu_ids

would be more future-proof?

I am okay with adding an exclusive_cpu_ids field instead of isExclusive. From the client's perspective this is simple: all they have to do is compare this field with the cpu_ids field to determine whether all the allocated CPUs are exclusive!
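A sketch of that client-side check, assuming the generated Go bindings would expose both cpu_ids and the proposed exclusive_cpu_ids as []int64 (values illustrative):

package main

import "fmt"

// allExclusive reports whether every allocated CPU also appears in the
// exclusive list, i.e. the allocation is fully exclusive (no shared-pool CPUs).
func allExclusive(cpuIds, exclusiveCpuIds []int64) bool {
    if len(cpuIds) != len(exclusiveCpuIds) {
        return false
    }
    set := make(map[int64]bool, len(exclusiveCpuIds))
    for _, id := range exclusiveCpuIds {
        set[id] = true
    }
    for _, id := range cpuIds {
        if !set[id] {
            return false
        }
    }
    return true
}

func main() {
    fmt.Println(allExclusive([]int64{2, 3}, []int64{2, 3})) // true: fully exclusive
    fmt.Println(allExclusive([]int64{0, 1, 2}, []int64{2})) // false: mixed allocation
}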

What is your perspective on "sustainable" API design?

I am all for a "sustainable" API, but we have to strike a balance and make sure we introduce a change that A) aligns with the goals and intent of the API and B) has a strong use case to back it. Changing the API for something we "might" need is not ideal.

@swatisehgal
Contributor Author

Perhaps having e.g. something like

repeated int64 exclusive_cpu_ids

would be more future-proof?

I am okay with adding an exclusive_cpu_ids field instead of isExclusive. From the client's perspective this is simple: all they have to do is compare this field with the cpu_ids field to determine whether all the allocated CPUs are exclusive!

I will create a patch adding an exclusive_cpu_ids field instead of isExclusive and link it here.

@askervin

askervin commented Sep 7, 2021

From the original description:

One of the primary goals of the recent enhancements to the PodResources API is to enable
node monitoring agents to know the allocatable compute resources on a node,

This sounds good!

However, without a way to determine whether a pod has been allocated exclusive CPUs
or its CPUs belong to the shared pool, that goal cannot be realized.

I see the point, but I also see a danger. This API design has a built-in assumption about the nature of CPU pools: that CPUs are either from the shared CPU pool or exclusive to a container. In use cases where this assumption holds, it is perfectly fine for a monitoring agent to make it, too. But if you bake this assumption into the API design, you cannot build sensible monitoring agents for different use cases on top of that API.

Please consider the podpools policy as a real-life example of a different use case. This policy partitions a node's CPUs into pools whose capacity is configured as the number of pods that run in the same pool. This is very useful for efficient resource usage in cases where you know what kind of workloads will fill your node.

For instance, there can be following pool split on the node:

| CPU |  Pool
+-----+-------------------------
|cpu0 | Reserved
|cpu1 | (kube-system pods)
+-----+-------------------------
|cpu2 | Single-pod pool A
|cpu3 | - runs all containers
|cpu4 |   of a single pod
|cpu5 |
+-----+-------------------------
|cpu6 | Single-pod pool B
|cpu7 | - runs all containers
|cpu8 |   of a single pod
|cpu9 |
+-----+-------------------------
:     :
+-----+-------------------------
|cpuK | Quad-pod pool A
|cpuL | - runs all containers
|cpuM |   of at most 4 pods
|cpuN |
+-----+-------------------------
:     :
+-----+-------------------------
|cpuX | Shared pool
|cpuY | - runs non-pooled
|cpuZ |   workloads
+-----+-------------------------

Therefore, I would prefer an API design that exposes resource reservation information in a format that is independent of the use case; in other words, a design that makes minimal or no assumptions about what the reservation algorithm had in mind when making the reservations being reported.

@swatisehgal
Contributor Author

@marquiz @askervin Thanks for your input and all the discussion on this PR. Given that you both have objections to the API changes proposed here, I am closing this PR. We will look at alternative solutions to this issue.

SIG Node PR Triage automation moved this from Needs Approver to Done Sep 13, 2021