podresources: add Watch endpoint #1926

ffromani · 2020-08-06T15:52:21Z

Extend the protocol with a simple implementation of ListAndWatch
to enable monitoring agents to be notified of resource allocation
changes.

Signed-off-by: Francesco Romani fromani@redhat.com

Extend the protocol with a simple implementation of ListAndWatch to enable monitoring agents to be notified of resource allocation changes. Signed-off-by: Francesco Romani <fromani@redhat.com>

k8s-ci-robot · 2020-08-06T15:52:29Z

Welcome @fromanirh!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2020-08-06T15:52:30Z

Hi @fromanirh. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ffromani · 2020-08-06T15:53:07Z

@dashpole could be interested in reviewing this PR

swatisehgal · 2020-08-06T16:03:04Z

This feature would be useful for the topology exporter agent that we intend to introduce as part of Topology Aware Scheduling work

dashpole · 2020-08-06T23:19:00Z

keps/sig-node/compute-device-assignment.md

+Add a v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns information about the kubelet's assignment of devices to containers. It obtains this information from the internal state of the kubelet's Device Manager.
+The GRPC Service can return:
+- a single PodResourcesResponse, enabling monitor applications to poll for resources allocated to pods and containers on the node.
+- a stream of PodResourcesResponse, enabling monitor applications to be notified of new resource allocation, release or resource allocation updates.


Should the stream be of individual pods, rather than the entire list when it changes? We would need to do something to signal deletion... But I would worry that a high rate of pod churn could make this very expensive.

it's reasonable to introduce threshold on kubelet side, configurable via KubeletConfiguration, to not send notification so often, one notification will contain a bunch of podresources (it already described in this KEP). I think it worth mentioning in this KEP, but for implementation I think it's step 2.

@dashpole the intention is totally to stream only individual pod changes. The API will return:

initially, the state of all the pods (exactly the same output as the current List() API)

then, only the pod whose resource allocation changed for any reason, newly created pod, or deleted pod.
Regarding deletion, I was thinking to just send a message with all the resources (currently devices only) cleared.

So the monitoring app observes

message which contains pod P1, with container C1 (with devices d1, d2) and container C2 (with devices d3)
(some time passes, pod gets deleted)

a message which contains pod P1, with container C1 (no devices) and container C2 (no devices)
So the monitor app can unambiguously learn that "all the resources previously allocated to C1 and C2 in P1 can now be cleared".

Makes sense?

Also, I believe this should be documented, processwise is ok to add this in the KEP, or should be added as comment to the .proto file? or both?

also: yes, we can totally add configurable thresholds, I'll add to the KEP.

Thanks, that addresses my primary concern. I'm not entirely sold on "pod without resources" indicating deletion being the best way to represent it, but as long as we consider some alternatives and still prefer it, it is reasonable.

No problem regarding better ways to represent deletions: I'm very open on alternatives.
To elaborate my rationale, I considered this approach because it required no extra changes to the API - the diff is minimal and the semantic seemed clear enough.

@dashpole the intention is totally to stream only individual pod changes. The API will return:

initially, the state of all the pods (exactly the same output as the current List() API)

I would like to avoid it

then, only the pod whose resource allocation changed for any reason, newly created pod, or deleted pod.
Regarding deletion, I was thinking to just send a message with all the resources (currently devices only) cleared.

Maybe just add action field into ListPodResourcesResponse with following possible values: ADDED, UPDATED, DELETED

So the monitoring app observes

message which contains pod P1, with container C1 (with devices d1, d2) and container C2 (with devices d3)
(some time passes, pod gets deleted)

a message which contains pod P1, with container C1 (no devices) and container C2 (no devices)
So the monitor app can unambiguously learn that "all the resources previously allocated to C1 and C2 in P1 can now be cleared".

Makes sense?

Also, I believe this should be documented, processwise is ok to add this in the KEP, or should be added as comment to the .proto file? or both?

@dashpole the intention is totally to stream only individual pod changes. The API will return:

initially, the state of all the pods (exactly the same output as the current List() API)

I would like to avoid it

Which issues do you see with this approach?

then, only the pod whose resource allocation changed for any reason, newly created pod, or deleted pod.
Regarding deletion, I was thinking to just send a message with all the resources (currently devices only) cleared.

Maybe just add action field into ListPodResourcesResponse with following possible values: ADDED, UPDATED, DELETED

If we send the complete allocation with each message, besides the DELETED case we are still discussing, I don't really see the benefit of a separate action field: could you please elaborate on this?
I see some benefit if each message provides the resource allocation delta (changes from the previous message), but I'd like to avoid sending deltas, it seems more robust (and not much more inefficient) to send total allocation each time.

Re serving initial state as part of "watch": kubernetes/kubernetes#13969
This is causing us a bunch of burden and is misleading to users.

I would really prefer those two to be separate calls for list vs watch (stream). There is a question how you can ensure consistency (i.e. that nothing happened between the list and the watch calls that won't be reflected in watch and also weren't in list yet).

Maybe just add action field into ListPodResourcesResponse with following possible values: ADDED, UPDATED, DELETED

That would be consistent with k8s watch, so I would really support this one.

Re serving initial state as part of "watch": kubernetes/kubernetes#13969
This is causing us a bunch of burden and is misleading to users.

I would really prefer those two to be separate calls for list vs watch (stream). There is a question how you can ensure consistency (i.e. that nothing happened between the list and the watch calls that won't be reflected in watch and also weren't in list yet).

Maybe just add action field into ListPodResourcesResponse with following possible values: ADDED, UPDATED, DELETED

That would be consistent with k8s watch, so I would really support this one.

Ok, all of those are good points, and I was not aware of the issue you mentioned. I will update the KEP.

AlexeyPerevalov · 2020-08-07T08:27:49Z

keps/sig-node/compute-device-assignment.md

+Add a v1alpha1 Kubelet GRPC service, at `/var/lib/kubelet/pod-resources/kubelet.sock`, which returns information about the kubelet's assignment of devices to containers. It obtains this information from the internal state of the kubelet's Device Manager.
+The GRPC Service can return:
+- a single PodResourcesResponse, enabling monitor applications to poll for resources allocated to pods and containers on the node.
+- a stream of PodResourcesResponse, enabling monitor applications to be notified of new resource allocation, release or resource allocation updates.


it's reasonable to introduce threshold on kubelet side, configurable via KubeletConfiguration, to not send notification so often, one notification will contain a bunch of podresources (it already described in this KEP). I think it worth mentioning in this KEP, but for implementation I think it's step 2.

Address reviewers comment: 1. Add explicit Watch endpoint so APIs are composable (not bundled in ListAndWatch) 2. Add explicit action field in the Watch() endpoint response Signed-off-by: Francesco Romani <fromani@redhat.com>

ffromani · 2020-08-11T08:26:30Z

Thanks for all the comments. I think I addressed all of them but the consistency concern between List and Watch, which I'm still thinking about. I will update again once I'm happy with a proposal.
I didn't add the threshold mentioned here #1926 (comment) because I think this pertains more the implementation side; but if reviewers disagree I'll mention in the next update.

Last but not least: I'll squash commits when we reach agreement.

ffromani · 2020-08-11T13:30:52Z

keps/sig-node/compute-device-assignment.md

+
+// WatchPodResourcesResponse is the response returned by Watch function
+message WatchPodResourcesResponse {
+    WatchPodAction action = 1;


I wonder if just exposing the pod resourceVersion here is a good way forward

keps/sig-node/compute-device-assignment.md

* Missed reference to "ListAndWatch", now replaced by "Watch" * renamed UPDATED->MODIFIED To be more compliant with kube naming standards (https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes) Signed-off-by: Francesco Romani <fromani@redhat.com>

ffromani · 2020-09-08T10:37:06Z

PR with the implemenatation: kubernetes/kubernetes#94612

derekwaynecarr

I have a few questions, both related to this KEP, and how it relates to closing out the KubeletPodResources feature that is presently in beta.

Please clarify the following:

The PodResources message does not include the uuid, this means we are not able to differentiate across space and time as two different pods with same name/same namespace appear the same. I think we should emit the uid especially when adding watch.
The grpc interface for PodResources is v1alpha1, but the feature is beta. Should we just move this to v1 now and commit to backward compatibility on this interface. I do not see the need for a new feature gate related to this new operation, but would like to clean up our grpc surfaces to stable fwd compatible boundaries.
How is resource version derived? Is it the value on the pod itself?
Does the watch guarantee ordering? For example, if a pod A is deleted, device D is reclaimed, and pod C is started and assigned device D, am I guaranteed to get those events emitted in order? If so, how and where are you integrating emitting the watch events within the kubelet subsystems?
How do you intend to test the watch semantics in e2e?

derekwaynecarr · 2020-09-29T14:52:41Z

keps/sig-node/compute-device-assignment.md

 // PodResources contains information about the node resources assigned to a pod
 message PodResources {
    string name = 1;
    string namespace = 2;
    repeated ContainerResources containers = 3;
+    int64 resource_version = 4;


is this the pod resource version as stored in etcd or something else?

yes, that's the intention, in order to enable client code to reconcile data from Watch() with the data they get from List().

derekwaynecarr · 2020-09-29T14:55:18Z

keps/sig-node/compute-device-assignment.md

+// WatchPodResourcesRequest is the request made to the Watch PodResourcesLister service
+message WatchPodResourcesRequest {}
+
+enum WatchPodAction {


when is each action emitted? can you clarify when modified would be used in life of pod?

Actions should be emitted:

ADDED: when resources are assigned to the pod (I'm thinking about HintProvider's Allocate())

DELETED: when resources are claimed back (I'm thinking about UpdateAllocatedDevices())
I'll document better in the KEP text.

In Hindsight we most likely don't need MODIFED, will just remove it.

derekwaynecarr · 2020-09-29T14:56:38Z

keps/sig-node/compute-device-assignment.md

+To keep the implementation simple as possible, the kubelet does *not* store any historical list of changes.
+
+In order to make sure not to miss any updates, client application can:
+1. call the `Watch` endpoint to get a stream of changes.


what triggers watch events from getting emitted in kubelet code flows?

After some experiments, I think ADDED should be triggered after succesfull allocation from topology manager (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/topologymanager/topology_manager.go#L232)
while DELETED should be triggered once device are claimed back (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/container_manager_linux.go#L1034)

derekwaynecarr · 2020-09-29T15:11:58Z

/assign @dashpole @derekwaynecarr

derekwaynecarr · 2020-09-29T15:12:22Z

/ok-to-test

derekwaynecarr · 2020-09-29T15:17:06Z

I see this feature as an evolution of the existing beta feature for KubeletPodResources where users just require the new watch versus list.

To track this properly in the project, we should do the following:

create an issue in this repo (unless one exists for this existing feature gate that @dashpole has handy)
update the design document to match the updated kep template that has been rolled out scoped to the prior issue
we can then get the issue and kep in a mergable state before enhancement freeze for 1.20

Do you agree with above @dashpole ?

ffromani · 2020-09-30T17:46:14Z

I have a few questions, both related to this KEP, and how it relates to closing out the KubeletPodResources feature that is presently in beta.

Please clarify the following:
* The `PodResources` message does not include the uuid, this means we are not able to differentiate across space and time as two different pods with same name/same namespace appear the same.  I think we should emit the uid especially when adding watch.

Good point, needs to be added.

* The grpc interface for PodResources is v1alpha1, but the feature is beta.  Should we just move this to v1 now and commit to backward compatibility on this interface.  I do not see the need for a new feature gate related to this new operation, but would like to clean up our grpc surfaces to stable fwd compatible boundaries.

I'm not sure which is the question here: surely I want to meet the beta graduation criterias (good e2e tests for example).
From my perspective having a code which is solid enough and backward compatible to not require a feature gate is a fair requirement implementationwise. But not sure I got your point :)

 * How is resource version derived?  Is it the value on the pod itself?

Yes. This just seemed the simplest and safest approach but I'm open to alternatives here.

* Does the watch guarantee ordering?  For example, if a pod A is deleted, device D is reclaimed, and pod C is started and assigned device D, am I guaranteed to get those events emitted in order?  If so, how and where are you integrating emitting the watch events within the kubelet subsystems?

Initially I thought ordering was not needed because this interface wants to enable resource accounting, but thinking about it some ordering guarantees are indeed needed to avoid some misreporting in corner cases (pod deletion/addition when at full resource capacity comes to mind, but there is likely more). So we need to guarantee this, and I'll document this in the KEP.

About emitting event I think we covered in inline comments.

* How do you intend to test the watch semantics in e2e?

I'll elaborate test plans (does this need to be in the KEP? asking because the original one doesn't mention that) but I think we'll start with

the fundamentals: basic reporting (events emissed when expected)
ensure consistent ordering of event reporting in corner case (no resources, full capacity...)
the same in concurrent scenarios
make sure the clients can safely reconcile List() and Watch() outputs

e2e tests is integral part of prototyping implementation already in this WIP PR: kubernetes/kubernetes@8c0e617
to cover the last point (List() and Watch() reconciliation) I also want to add e2e tests to cover the existing podresources API behaviour (alongside with unit tests obv.)

ffromani · 2020-09-30T17:48:22Z

Next step for me is to reflect the review comments in the KEP PR, then I'll consolidate this PR in the existing #1884 ; last I'll close this one. This is meant to simplify the process and make review easier.

Add new field in the API responses objects to allow client applications to consume both `List` and `Watch` endpoints. The issue here is enabling client applications to not lose any updates when both APIs are used. The straightforward option is to follow the generic k8s approach (see link below) and let kubelet keep a historical window of the last recent changes, so client applications have the chance to issue `List` and shortly after `Watch`, starting from the resourceVersion returned in `List`. The underlying assumption is indeed that `Watch` happens "shortly" after `List`, otherwise the system cannot guarantee the lack of gaps. However implementing this support requires to keep the aforementioned sliding window of changes, which however requires careful implementation to address scalability and safety guarantees. However, the `podresources` API is a specific API, so, while is good to follow as much as possible the generic API concepts, it also allows some possible little differences which can help keep the implementation simple and safe. This patch proposes a simplest possible approach to reconcile the `List` and `Watch` responses, providing the `resource_version` field and suggesting a little change in the client applications programming model. Inspired by the concepts found on https://kubernetes.io/docs/reference/using-api/api-concepts/#efficient-detection-of-changes Signed-off-by: Francesco Romani <fromani@redhat.com>

Signed-off-by: Francesco Romani <fromani@redhat.com>

k8s-ci-robot · 2020-10-01T11:21:07Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: fromanirh
To complete the pull request process, please assign derekwaynecarr after the PR has been reviewed.
You can assign the PR to them by writing /assign @derekwaynecarr in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

keps/sig-node/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2020-10-01T11:22:37Z

@fromanirh: The following test failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
pull-enhancements-verify	`1346c2a`	link	`/test pull-enhancements-verify`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

ffromani · 2020-10-01T14:33:27Z

Next step for me is to reflect the review comments in the KEP PR, then I'll consolidate this PR in the existing #1884 ; last I'll close this one. This is meant to simplify the process and make review easier.

comments addressed and updated content added on top of #1884 - closing this one to reduce the confusion.

k8s-ci-robot · 2020-10-01T14:33:34Z

@fromanirh: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Per comments during the review of the Watch endpoint proposal: 1. kubernetes#1926 (comment) 2. kubernetes#1926 (review) The agreed semantic of a Watch() response message a. refers to a single pod. The intention was always to stream only individual pod changes (kubernetes#1926 (comment)) b. must allow the client to reconcile with the response of the List() endpoint, thus must include a pod resource version. This patch thus adds the missing resource version field and removes the `repeated` attribute to the `PodResources` field. Removing `repeated` is the simplest possible change that aligns the proposal to the intention. Alternatively, it is possible to change the proto so we can allow a `WarchPodResourcesResponse` object to convey information about more pods; however the performance and UX benefits of this more invasive change are unclear, so we avoid it at this moment. This change was missing because it was lost in a rebase Signed-off-by: Francesco Romani <fromani@redhat.com>

podresources: add ListAndWatch function

0c6948a

Extend the protocol with a simple implementation of ListAndWatch to enable monitoring agents to be notified of resource allocation changes. Signed-off-by: Francesco Romani <fromani@redhat.com>

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 6, 2020

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 6, 2020

k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Aug 6, 2020

k8s-ci-robot requested review from dchen1107 and derekwaynecarr August 6, 2020 15:52

k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Aug 6, 2020

dashpole reviewed Aug 6, 2020

View reviewed changes

AlexeyPerevalov suggested changes Aug 7, 2020

View reviewed changes

podresources: watch: split endpoint, add action

86596df

Address reviewers comment: 1. Add explicit Watch endpoint so APIs are composable (not bundled in ListAndWatch) 2. Add explicit action field in the Watch() endpoint response Signed-off-by: Francesco Romani <fromani@redhat.com>

ffromani changed the title ~~podresources: add ListAndWatch function~~ podresources: add Watch endpoint Aug 11, 2020

ffromani commented Aug 11, 2020

View reviewed changes

swatisehgal suggested changes Aug 11, 2020

View reviewed changes

keps/sig-node/compute-device-assignment.md Outdated Show resolved Hide resolved

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 20, 2020

ffromani mentioned this pull request Sep 8, 2020

WIP: Podresources: add Watch() API kubernetes/kubernetes#94612

Closed

derekwaynecarr requested changes Sep 29, 2020

View reviewed changes

k8s-ci-robot assigned dashpole and derekwaynecarr Sep 29, 2020

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 29, 2020

ffromani added 2 commits October 1, 2020 13:20

podresources: watch: address review comments

1346c2a

Signed-off-by: Francesco Romani <fromani@redhat.com>

ffromani force-pushed the kep-list-and-watch branch from aea8a6e to 1346c2a Compare October 1, 2020 11:20

ffromani closed this Oct 1, 2020

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 1, 2020

ffromani mentioned this pull request Oct 2, 2020

PodResources interface enhancements #1884

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

podresources: add Watch endpoint #1926

podresources: add Watch endpoint #1926

ffromani commented Aug 6, 2020

k8s-ci-robot commented Aug 6, 2020

k8s-ci-robot commented Aug 6, 2020

ffromani commented Aug 6, 2020

swatisehgal commented Aug 6, 2020

dashpole Aug 6, 2020

AlexeyPerevalov Aug 7, 2020

ffromani Aug 7, 2020 •

edited

ffromani Aug 7, 2020

dashpole Aug 7, 2020

ffromani Aug 7, 2020 •

edited

AlexeyPerevalov Aug 10, 2020

ffromani Aug 10, 2020

wojtek-t Aug 10, 2020

ffromani Aug 10, 2020

AlexeyPerevalov Aug 7, 2020

ffromani commented Aug 11, 2020

ffromani Aug 11, 2020

ffromani commented Sep 8, 2020

derekwaynecarr left a comment

derekwaynecarr Sep 29, 2020

ffromani Sep 29, 2020

derekwaynecarr Sep 29, 2020

ffromani Sep 30, 2020 •

edited

derekwaynecarr Sep 29, 2020

ffromani Sep 30, 2020

derekwaynecarr commented Sep 29, 2020

derekwaynecarr commented Sep 29, 2020

derekwaynecarr commented Sep 29, 2020

ffromani commented Sep 30, 2020

ffromani commented Sep 30, 2020

k8s-ci-robot commented Oct 1, 2020

k8s-ci-robot commented Oct 1, 2020

ffromani commented Oct 1, 2020 •

edited

k8s-ci-robot commented Oct 1, 2020

podresources: add Watch endpoint #1926

podresources: add Watch endpoint #1926

Conversation

ffromani commented Aug 6, 2020

k8s-ci-robot commented Aug 6, 2020

k8s-ci-robot commented Aug 6, 2020

ffromani commented Aug 6, 2020

swatisehgal commented Aug 6, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ffromani Aug 7, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ffromani Aug 7, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ffromani commented Aug 11, 2020

Choose a reason for hiding this comment

ffromani commented Sep 8, 2020

derekwaynecarr left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ffromani Sep 30, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

derekwaynecarr commented Sep 29, 2020

derekwaynecarr commented Sep 29, 2020

derekwaynecarr commented Sep 29, 2020

ffromani commented Sep 30, 2020

ffromani commented Sep 30, 2020

k8s-ci-robot commented Oct 1, 2020

k8s-ci-robot commented Oct 1, 2020

ffromani commented Oct 1, 2020 • edited

k8s-ci-robot commented Oct 1, 2020

ffromani Aug 7, 2020 •

edited

ffromani Aug 7, 2020 •

edited

ffromani Sep 30, 2020 •

edited

ffromani commented Oct 1, 2020 •

edited