
Storage Capacity Constraints for Pod Scheduling #1353

Conversation

pohly
Contributor

@pohly pohly commented Nov 4, 2019

This KEP explains how CSI drivers can expose how much capacity a storage system has available via the API server. This information can then be used by the Kubernetes Pod scheduler to make more intelligent decisions when placing:

  • pods with ephemeral inline volumes
  • pods with persistent volumes that haven't been created yet (WaitForFirstConsumer binding mode)

Enhancement issue: #1472

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 4, 2019
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Nov 4, 2019
@pohly pohly force-pushed the storage-capacity-constraints-for-pod-scheduling branch from fb6c15e to 89ec735 Compare November 4, 2019 06:55
@pohly
Contributor Author

pohly commented Nov 4, 2019

@cdickmann just created #1347, which is a similar proposal but with a different API and different goals. We learned about each other's work last week and agreed that the best approach for reconciling the two proposals would be to create KEP PRs and then use the KEP process to discuss them.

The goal has to be to define one API extension which works for both purposes (Kubernetes scheduler here, more intelligent operators in #1347).

@pohly pohly force-pushed the storage-capacity-constraints-for-pod-scheduling branch from 89ec735 to 2560793 Compare November 4, 2019 07:00
@cdickmann

If I compare this KEP to ours (#1347, hopefully my CLA issue is resolved today, waiting for tech support) then I see some key differences:

  • We assume that storage can have more diverse topology, e.g. tracking individual disks or disk groups in a host, or tracking multiple independent storage arrays attached to the same node. In all these cases, the capacity is really on a StoragePool and a node has (or doesn't have) access to it. IMHO that is more powerful and a superset of what is proposed here.
  • Beyond capacity, IMHO we may need more fields in the future, e.g. health and other properties of a StoragePool.
  • This KEP proposes K8s scheduler changes to take advantage of the additional available information.

I feel it is useful to separate two topics:

  • Surfacing up information about the storage. @pohly Do you see anything that you would like K8s to surface up which isn't covered by our KEP?
  • Having K8s leverage them for scheduling. We decided to scope this out of our KEP. @pohly Could you imagine building your scheduler logic on top of the StoragePool exposed by our KEP?

@pohly
Contributor Author

pohly commented Nov 4, 2019

IMHO that is more powerful and a superset of what is proposed here.

True. The downside is that it is also more complicated and can't be implemented without extensions to the CSI standard. The proposal here was meant to work with just the existing mechanisms (CSI topology, v1.NodeSelector to represent that in Kubernetes).

Beyond capacity, IMHO we may need more fields in the future, e.g. health and other properties of a StoragePool.

Agreed.

Do you see anything that you would like K8s to surface up which isn't covered by our KEP?
Could you imagine building your scheduler logic on top of the StoragePool exposed by our KEP?

I still need to think about your proposal and how the Kubernetes scheduler itself could make use of it with a CSI driver that hasn't been modified.

@pohly
Contributor Author

pohly commented Nov 4, 2019

IMHO that is more powerful and a superset of what is proposed here.

True.

After going through your proposal once more, I am not so sure about that anymore. It seems to be focused exclusively on a flow where a PVC is somehow tied to a storage pool and then gets provisioned with CreateVolume first, before any pod uses that volume.

The generic late-binding case (Pod uses PVC which refers to a storage class and WaitForFirstConsumer is used) cannot be handled because that generic PVC doesn't reference a storage pool, so Kubernetes doesn't have the information about storage capacity that it needs for choosing a node for the pod. Ephemeral inline volumes also cannot be handled because those don't involve CreateVolume.

@pohly
Contributor Author

pohly commented Nov 5, 2019

@pohly Could you imagine building your scheduler logic on top of the StoragePool exposed by our KEP?

As explained in my previous comment, #1353 (comment), I think the answer is no. Let me turn the question around, can you imagine basing your extension on the revised API in this KEP (see below)? 😄

Obviously the original API was too focused on just capacity tracking. Thanks to your KEP additional use cases became clearer and I now tried to come up with a more general API that can be extended to also cover those - see e0e7c43. At this point, it's a 1:1 renaming of what I had before plus some additional flexibility regarding what information must be provided. In this PR I'd prefer to keep it at that level to ensure that it remains small enough to make progress.

If you think that this goes in the right direction, then I could try to come up with a revision of your KEP that is based on this one.

This is a problem for deployments that do support it, but haven’t been
able to provide the necessary information yet. TODO: extend
CSIDriverInfo with a field that tells the scheduler to retry later
when information is missing? Avoid the situation by (Somehow? How?!)
Contributor Author

I might have found a solution for this:

  • the driver deployment creates the top-level storage info object
  • external-provisioner just updates it while the object exists
  • removing the driver also removes the object

This way the scheduler knows that if there is an object, it has to schedule based on the information contained in it. The information might still be stale or incomplete, though.

Contributor Author

Or even simpler: add a CSIDriver.Status field with all the dynamic information.

Contributor Author

I switched over to that. Conceptually it makes sense to me. But other APIs also would work.

@pohly pohly force-pushed the storage-capacity-constraints-for-pod-scheduling branch from 16a69b7 to fc24335 Compare November 12, 2019 12:25
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 13, 2019
@cdickmann

I think between our 2 KEPs we are converging. Our use cases have some overlap, but are not 100% the same. Yet, at the core of our solutions we have a concept of a StoragePool, which represents a logical grouping of storage which is accessible by a set of nodes, and which has metadata like capacity. Patrick's KEP has a nice idea where capacity is reported per-storage-class because the storage class can actually change how much capacity gets used, and I agree with this idea. We can build our KEP on top of the same K8s API that Patrick is proposing, even though we likely need to extend it, and we also need CSI changes.

Patrick covers the following cases that we do not need for our use case, but that don't conflict:

  • K8s scheduler changes that consume the storage pool capacity
  • Concepts of pseudo-storage-classes

Our KEP covers the following which isn't in Patrick's KEP:

  • Decouple Node and StoragePool topology, e.g. offer the ability to present multiple StoragePools per node, and StoragePools-to-Compute accessibility that doesn't follow the general node topology. This in turn requires our KEP to amend CSI.
  • Motivation and guidance around how K8s operators of Storage applications would leverage StoragePools for their own placement
  • A way of facilitating an application choosing which StoragePool to provision against, as it will consider more than just capacity

So we can easily converge here, and to me the path forward comes down to community guidance on:

  • Should we make a single KEP? Or should we make one core KEP which introduces StoragePool, and 2 extensions for the 2 use cases? Or just go with the 2 KEPs we have right now? I have no strong preference.
  • Should StoragePool be a first class K8s object (like we suggest), or be embedded in a larger CSIDriver K8s object (like Patrick suggests)? I have no strong preference.
  • Should we put the new K8s logic into a new sidecar or into external provisioner? I have no strong preference.

Maybe @saad-ali , @thockin, and others can give their guidance on these questions?

// cluster. That subset can be identified either via NodeTopology or
// NodeList, but not both. If neither is set, the pool is assumed
// to be available in the entire cluster.
type CSIStoragePool struct {
Contributor

For local storage in a very large cluster, is it too long for users to read the information in one API object? It seems we need to have a CSIStoragePool struct for each node.
And also, is it reasonable to organize the storage pool by storage class? I agree it is good for network storage, but it seems not that good for local storage.

Contributor Author

For local storage in a very large cluster, is it too long for users to read the information in one API object?

That may become an issue, yes.

It seems we need to have a CSIStoragePool struct for each node.

But that's then not meant to be readable by users, is it? As discussed in #1347 (comment), the name of those objects most likely will have to be generated such that they don't mean anything.

What about storing the information as proposed in this KEP, but then not show it in kubectl describe? kubectl describe pods does that for spec.affinity. It's still available to users when using kubectl get -o yaml. I'm not sure how that is implemented and whether it is a good idea, but I at least wanted to bring it up... 🤷‍♂️

And also, is it reasonable to organize the storage pool by storage class? I agree it is good for network storage, but it seems not that good for local storage.

I don't follow here. Are you saying that local storage can ignore storage classes? That's not necessarily the case. The example where available usable storage depends on storage class parameters is LVM with striped vs. mirrored mode selected via the storage class. That is local storage.

Contributor

But that's then not meant to be readable by users, is it?

Yes, exactly.

What about storing the information as proposed in this KEP, but then not show it in kubectl describe? kubectl describe pods does that for spec.affinity. It's still available to users when using kubectl get -o yaml

I guess users may want to know the topology info just like capacity and allocatable in node status.

Are you saying that local storage can ignore storage classes? That's not necessarily the case. The example where available usable storage depends on storage class parameters is LVM with striped vs. mirrored mode selected via the storage class. That is local storage.

Yeah, actually, for local storage, we just need one storage class with some necessary parameters in the cluster. If so, as proposed here, each node with local storage is a storage pool, and their storage classes may all be the same. I am not sure if that is reasonable; it just feels weird.
It seems StorageClass and StoragePool do not have a strict mapping relation. Why do we need to add StorageClass to the StoragePool object?

Contributor

Or could we limit the StoragePool more strictly, and make it a 1:1 mapping relationship with StorageClass?

Contributor Author

I guess users may want to know the topology info just like capacity and allocatable in node status.

So you are arguing that we should put storage information into the Node.Status? That only makes sense for one particular use case (local storage). I don't think we should develop a special solution just for that.

Yeah, actually, for local storage, we just need one storage class with some necessary parameters in the cluster.

That may be the case for some drivers, but not all. We need a solution that is flexible enough to cover different scenarios. A driver with multiple different storage classes and capacity that depends on the storage class is one of those scenarios.

If a driver does not need that flexibility, it can produce information with just storageClassName: <fallback>, as in the "local storage" example:

status:
  storage:
  - classes:
    - capacity: 256G
      storageClassName: <fallback>
    name: node-1
    nodeList:
    - node-1

Or could we limit the StoragePool more strictly, and make it a 1:1 mapping relationship with StorageClass?

Isn't that the special case above?

Contributor Author

@pohly pohly Nov 25, 2019

Especially user-friendly when using kubectl describe...

But how does the user find the right object to call kubectl describe for? How user-friendly is that part going to be?

Since our cluster is quite large, we care about the performance a lot.

That part is indeed crucial. But it also depends on the usage pattern of the performance-critical parts. So let's consider that aspect for the worst case, node-local storage. With the current proposal (CSIDriver.Status), there will be one producer per node, sending updates to its part of the status via a patch operation. There's one consumer (the scheduler). It had better not create a watch, because otherwise it'll constantly get updates even when there's nothing to schedule. Instead, it can retrieve the CSIDriver each time it is needed. I'm assuming here that scheduling pods with storage is going to be rare compared to scheduling pods without storage.

With CSIStoragePool, the writers will send a complete Status update for one object. There's still only one consumer (or two in your case?). That scheduler now must retrieve all objects with a certain CSIStoragePool.Spec.DriverName value each time it wants to schedule a pod with a volume using that driver. Because individual objects are smaller, a watch might make more sense here, although the argument about getting updates when they aren't needed still holds.

Does that sound right?

It's not obvious to me which approach will lead to better performance. I was hoping that some apiserver experts have already measured something like this and could share their experience. If that doesn't happen, then I have to proceed with prototyping both APIs and writing some microbenchmark to get some real numbers.

Contributor

But how does the user find the right object to call kubectl describe for? How user-friendly is that part going to be?

For local storage, its (CSIStoragePool) name can be set to the node name

Does that sound right?

Hmm, sorry for the unclear expression. What I want to clarify here is: we modify the K8s scheduler. We create several schedulers and make them work in parallel. Each scheduler is only responsible for one partition (pods with a specific NodeSelector and a subset of nodes with a specific label), and schedulers cannot share resources (nodes).
For predicate functions, we need to make sure Allocatable storage >= Pod(PVC) Request.
If we store all local storage info in one object, all schedulers need to look it up, so we need a lock or something like that. But if we create separate objects, each scheduler only cares about a subset of the CSIStoragePool objects; since schedulers cannot share node resources, they can work in parallel and do not need locks.

I know the solution is not general enough. Schedulers will have to share some other information. But at least for local storage, it is OK. For network storage, we can shard the storage pools manually and make it fit multiple schedulers too.

Contributor Author

For local storage, its (CSIStoragePool) name can be set to the node name

This only works under the assumption that only one CSI driver is providing capacity information for that node. But indeed, if this assumption holds, then individual objects may work better. Thanks for clarifying this.

Member

@msau42 msau42 Nov 27, 2019

I spoke with Jordan and he agrees that if the objects will end up scaling with the number of nodes, then it is preferable to have a separate object. The main issues with a single object:

  • any update to that single object, no matter how small, ends up sending the whole API object to all the watchers. This will not scale in large clusters.
  • when an informer gets an update, it gets the whole object; you can't see only the small delta that has changed, so you end up needing to iterate through the entire object to process a small update.

Separating it out into separate objects means more work on the consumer end, especially to build up caches, but it substantially reduces API server load in large clusters.

Contributor Author

Both of these points apply to a usage pattern that does not necessarily apply for the implementation outlined in this KEP. But as @NickrenREN explained, they might want to use the API in a different way where they do apply, so I'm fine with switching to individual objects. I'll update the KEP accordingly.

@pohly pohly force-pushed the storage-capacity-constraints-for-pod-scheduling branch 3 times, most recently from fb9e14b to 0401cb1 Compare November 22, 2019 14:29
@msau42
Member

msau42 commented Nov 27, 2019

@kubernetes/sig-storage-pr-reviews
/assign @msau42 @jsafrane

Before we start going into API details, I think we should focus on consolidating the two keps. And I think it would first be good to get agreement on the use cases and user journeys before we dive into API.

@pohly pohly force-pushed the storage-capacity-constraints-for-pod-scheduling branch from 07a1619 to 5c43a13 Compare May 13, 2020 20:06
@pohly
Contributor Author

pohly commented May 13, 2020

Can the discussions about ephemeral inline volumes all be moved to this new KEP #1701? In this KEP, just add one line to reference that new KEP? I think that will make this KEP more focused and easier to read.

Agreed, it allowed me to remove quite a few special cases. Now we just need to finish "generic inline volumes"... 😁

- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Contributor

Can you check the ones already done?

Contributor Author

I'm not sure which ones I can consider done. I've marked those where I am relatively sure.


### Non-Goals

* Only CSI drivers will be supported.
Contributor

The way this sentence is written, it is a "Goal", not a "Non-Goal" :).

Contributor Author

Updated. I always struggle with avoiding double-negation in this section.


- Gather feedback from developers and users
- Integration with [Cluster Autoscaler](https://github.com/kubernetes/autoscaler)
- Generic solution for identifying storage pools
Contributor

Can you reference this KEP here (#1347)?

Also add that we need to figure out how to do spreading. Maybe it is covered by storage pools but it should be called out.

Contributor Author

Done.

@xing-yang
Contributor

Here's the recording for the meeting on 5/13: https://www.youtube.com/watch?v=Xj7ebO-Hqrw

capacity information, store it in the API server, and then use that
information in the scheduler. That information then flows
through different components:
1. storage backend
Contributor

nit: s/storage/Storage

Contributor Author

Fixed.


### API

#### CSIStorageCapacity
Contributor

I still think this name is not right and CSIStoragePlacement is a better name, but we can discuss it during the implementation.

Contributor Author

"Placement" of pods or storage? The former is what the API is used for, the later sounds odd. But yes, let's discuss separately.

//
// +optional
Capacity *resource.Quantity
}
Contributor

NodeTopology, StorageClassName, Capacity should still be in the Status as they can change.

Contributor Author

I consider "Capacity" as the only mutable field. If topology or storage classes change, then new objects get created and those that are no longer relevant get removed.

This is part of the move away from CSIStoragePool. That entity had varying topology, but CSIStorageCapacity objects are defined by the input parameters of GetCapacity (driver, storage class, topology) and provide the output value (capacity).
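
A minimal Go sketch of that idea (field names here are illustrative, not the final API): the object carries the GetCapacity inputs plus the resulting capacity, and only the capacity is expected to change in place.

package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// CSIStorageCapacitySketch mirrors the description above: the spec-like
// fields are the inputs of the CSI GetCapacity call, Capacity is its output.
type CSIStorageCapacitySketch struct {
	metav1.ObjectMeta

	// Inputs of GetCapacity; treated as immutable. A change means
	// deleting this object and creating a new one.
	DriverName       string
	StorageClassName string
	NodeTopology     *v1.NodeSelector

	// Output of GetCapacity; the only field that gets updated in place.
	Capacity *resource.Quantity
}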


// The storage class name of the StorageClass which provided the
// additional parameters for the GetCapacity call.
StorageClassName string
Contributor

Is this required or optional?

Contributor Author

Now it is required. I was proposing an optimization earlier ("Storage class parameters that never affect capacity") where it would have been optional, but that was seen as too complicated.

// might get added. If not set, the scheduler will simply ignore the
// CSIStorageCapacity object.
//
// +optional
Contributor

Shouldn't this field be required? Without capacity, how is CSIStorageCapacity useful?

Contributor Author

This is meant to enable future extensions. Once we introduce "total capacity" and "maximum volume size", this API would become:

Capacity *resource.Quantity
TotalCapacity *resource.Quantity
MaximumVolumeSize *resource.Quantity

At least one will be required then.

Let me change the "optional" into "required", because right now it is.

5GiB each, so accepting the node for two such volumes could be a false
positive.

More promising might be to add prioritization of nodes based on how
Contributor

By saying "prioritization of nodes", it seems that you are talking about local storage here?

Contributor Author

Networked storage might also be available only on a subset of the cluster. So prioritization may also be applicable there.


* **Is the rollout accompanied by any deprecations and/or removals of features,
APIs, fields of API types, flags, etc.?**
Even if applying deprecation policies, they may still surprise some users.
Contributor

I think the question is by introducing this CSIStorageCapacity feature, will any existing features get deprecated or removed.

* **How can an operator determine if the feature is in use by workloads?**
Ideally, this should be a metrics. Operations against Kubernetes API (e.g.
checking if there are objects with field X set) may be last resort. Avoid
logs or events for this purpose.
Contributor

I think they can find out this feature is being used by querying CSIStorageCapacity objects?

- Usage description:
- Impact of its outage on the feature:
- Impact of its degraded performance or high error rates on the feature:

Contributor

It seems that you have copied some paragraphs in the example KEP but didn't make any changes to them. I think if they are not relevant, then they should just be removed.

Contributor Author

Yes, I copied everything but left the part that says "must be completed when targeting beta graduation" for later without making changes there. I can take that out.


[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

Contributor

In general, this readiness section needs more work. If the example paragraphs copied from the example KEP are not relevant, please remove them. Either write down your answers related to this feature or leave them blank if not relevant.

Contributor Author

I kept the headlines and added comments saying "Will be added before the transition to beta."

@xing-yang
Contributor

The verify failure is from "./hack/verify-kep-metadata.sh", not the typo check. That means there's some formatting issue with kep.yaml.

type CSIStorageCapacitySpec struct {
// The CSI driver that provides the storage.
// This must be the string returned by the CSI GetPluginName() call.
DriverName string
Contributor Author

Strictly speaking, the driver name is redundant here. It was a left-over from the time when storage class was optional (CSI ephemeral volumes, the optimization for "storage class does not affect capacity").

But after having thought a bit about it, I think it is worthwhile keeping because it enables the listing of all objects for one driver, something that the external-provisioner has to do.
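
As a rough illustration (plain Go with made-up helper types, not the real client-go listers), keeping the driver name in each object lets external-provisioner pick out only the objects that belong to its driver from a full listing:

package sketch

// capacityObject is a stand-in for a CSIStorageCapacity object.
type capacityObject struct {
	Name       string
	DriverName string
}

// ownedBy returns only the objects that belong to the given CSI driver,
// which is the kind of filtering external-provisioner has to do.
func ownedBy(all []capacityObject, driverName string) []capacityObject {
	var owned []capacityObject
	for _, obj := range all {
		if obj.DriverName == driverName {
			owned = append(owned, obj)
		}
	}
	return owned
}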

// Depending on how the driver is implemented, this might be the total
// size of the available storage which is only available when allocating
// multiple smaller volumes ("total available capacity") or the
// actual size that a volume may have ("maximum volume size").
Member

+1

to rely on that information when making scheduling decisions that
involve volumes that need to be created by the driver.

If not set, the scheduler makes such decisions without considering
Member

Or set to false?

Contributor Author

Yes, good catch.

@pohly
Contributor Author

pohly commented May 15, 2020

I've updated the KEP, please have another look.

jurisdication) availability may still be limited to a subset of the
nodes in a cluster.

#### Custom schedulers
Member

IIUC, this use case is about being able to leverage the new API to build a custom scheduler extension if the current implementation doesn't suit a plugin's needs? If that's the case, then you may need to disable the built-in plugin? @ahg-g is that possible through a config, or would the scheduler have to be rebuilt?

Contributor Author

When choosing a custom scheduler for a pod through the PodSpec.SchedulerName, only the custom scheduler is going to dispatch the pod, so there's no need to disable the feature in the default scheduler. It's still going to be enabled in case that some other pod uses the default scheduler and has volumes, though.

matches the storage pool(s) that it has access to, with granularity
that matches the most restrictive pool.

For example, if the driver runs in a node with zone/region/rack
Member

I'm not sure I understand this example. Can we have something more concrete?

Let's start with a simple example. Say we have a storage driver that supports zonal disks, and the cluster has nodes across many zones a, b, and c. Each NodeGetInfo call from the driver daemonset is going to return topology "zone" = "a", "b", or "c", and report it in the CSINode object.

I believe this example is trying to convey what to do if a driver returns different topology keys on different nodes? For example, for nodeA, it returns "zone" = "a", but for nodeB, it returns ("zone" = "A" AND "region" = "X").

Can we just treat nodeB's topology as another unique topology value different from nodeA's topology, and expect the driver to be able to report an appropriate value?

Contributor Author

Suppose the storage is local to a region. There's storage with label region=west and storage with region=east. Now in each region, we have data centers. As long as NodeGetInfo returns region=east or region=west for each node, depending on where it is located, the proposed algorithm works.

Now suppose that in addition to storage in each region, each data center also has storage. Some storage classes select the region storage, some others the faster storage in the datacenter.

The storage in a New York data center then has "region=east,center=NewYork" as topology, and NodeGetInfo for a node in that center has to return that.

The proposed algorithm will then still work: it will use region+center for the topologies and create CSIStorageCapacity objects with that topology also for region storage.

What isn't okay is when NodeGetInfo only reports region, because that is not detailed enough for the storage in the centers.
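
To make that concrete, here is a small, self-contained Go illustration (label keys and values are made up for the example, not prescribed by the KEP) of why the union of reported topology segments determines how many CSIStorageCapacity objects are needed:

package main

import "fmt"

func main() {
	// Topology segments as reported by NodeGetInfo and mirrored in
	// CSINode objects (illustrative keys/values only).
	nodeTopologies := []map[string]string{
		{"region": "east", "center": "NewYork"},
		{"region": "east", "center": "NewYork"},
		{"region": "west", "center": "Portland"},
	}

	// Union of distinct segments; one CSIStorageCapacity object is
	// needed per distinct segment and storage class, even for the
	// slower per-region storage (hence the redundant objects mentioned
	// later in this thread).
	distinct := map[string]bool{}
	for _, t := range nodeTopologies {
		distinct[fmt.Sprintf("region=%s,center=%s", t["region"], t["center"])] = true
	}
	for segment := range distinct {
		fmt.Println("capacity object needed for segment:", segment)
	}
}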

Member

Thanks for clarifying. I think the original example is confusing because it seems like zone encompasses a region, which is the reverse of what most cloud providers do. Can we change the example to use a more clear hierarchy, like region and rack?


#### With central controller, generic solution

For all other cases, a new CSI call to enumerate storage topology
Member

This section is very vague. Is there an example of such a use case that won't work with above two? Let's leave it out of the proposal or put it under open issues if we don't have a concrete plan here.

Contributor Author

I can take it out until we identify a driver where the proposed approach doesn't work.

Member

Yes let's remove it for now

- compute the union of the topology segments from these
`CSINodeDriver` entries.

For each entry in that union, one `CSIStorageCapacity` object is
Member

How is the name of the object generated to ensure uniqueness? Does it need to be deterministic so we can update it later?

Contributor Author

The name can't be deterministic. We have to use a UUID for the name, similar to how PVs are named.

Updating is still possible because external-provisioner has to know about all objects for its driver and then can match by topology and storage class name.
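
A sketch of what such non-deterministic naming could look like (the random suffix and the "csisc-" prefix here are just for illustration; the actual scheme is not fixed by this discussion):

package main

import (
	"crypto/rand"
	"fmt"
)

// newCapacityObjectName returns a unique, meaningless object name,
// similar in spirit to how PV names embed a UUID.
func newCapacityObjectName() (string, error) {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return fmt.Sprintf("csisc-%x", b), nil
}

func main() {
	name, err := newCapacityObjectName()
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
}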

Member

PVs names are deterministically generated so that if we call CreateVolume multiple times for the same PVC, it will get called with the same name.

Searching through every object for every predicate call, or every update in external-provisioner is not ideal, especially for a local volume driver that's going to have an entry per node. Is there a way we cache the mappings (maybe keyed by node) to improve the lookups?

Contributor Author

PVs names are deterministically generated so that if we call CreateVolume multiple times for the same PVC, it will get called with the same name.

That relies on the unique ID of the PVC. We have no such UID here, just a bunch of attributes. We could hash them, but then we'll have to deal with hash collisions.

Internally, both external-provisioner and kube-scheduler will have to use caching and more efficient data structures to find the information that they need.
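
A minimal sketch (an assumption about how such a cache could look, not the actual external-provisioner or scheduler code) of indexing capacity by storage class and topology segment so that lookups don't have to scan every object:

package sketch

import "k8s.io/apimachinery/pkg/api/resource"

// capacityKey identifies one GetCapacity result.
type capacityKey struct {
	storageClassName string
	topologySegment  string // canonical "key=value,..." form, sorted by key
}

// capacityIndex is an in-memory cache, fed for example by an informer.
type capacityIndex struct {
	byKey map[capacityKey]*resource.Quantity
}

func newCapacityIndex() *capacityIndex {
	return &capacityIndex{byKey: map[capacityKey]*resource.Quantity{}}
}

// update records the latest reported capacity for one class/segment pair.
func (c *capacityIndex) update(class, segment string, capacity *resource.Quantity) {
	c.byKey[capacityKey{class, segment}] = capacity
}

// hasEnough answers whether the cached capacity covers one volume request.
func (c *capacityIndex) hasEnough(class, segment string, request resource.Quantity) bool {
	capacity, ok := c.byKey[capacityKey{class, segment}]
	if !ok || capacity == nil {
		return false
	}
	return capacity.Cmp(request) >= 0
}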

name + nodes + storage class) and then update that one
instead. Obsolete objects need to be removed.

To ensure that `CSIStorageCapacity` objects get removed when the driver
Member

What if topologies disappear while the driver is still running? For example, an entire rack gets decommissioned? There probably still has to be some logic that scans the current topologies from CSINode and reconciles with existing StorageCapacity objects.

Contributor Author

Yes, that's one of the situations where external-provisioner has to remove stale objects ("While external-provisioner runs, it needs to update and potentially delete CSIStorageCapacity objects: ... when nodes change (for central provisioning)").

While external-provisioner runs, it needs to update and potentially
delete `CSIStorageCapacity` objects:
- when nodes change (for central provisioning)
- when storage classes change (for persistent volumes)
Member

what does "for persistent volumes" mean in this context?

Contributor Author

Let me remove it. It's a left-over from the time when ephemeral volumes were treated as a special case.

delete `CSIStorageCapacity` objects:
- when nodes change (for central provisioning)
- when storage classes change (for persistent volumes)
- when volumes were created or deleted (for central provisioning)
Member

how come this is not a concern for local volume provisioning?

Contributor Author

It is. I'll remove the condition.

The lookup sequence will be:
- find the `CSIDriver` object for the driver
- check whether it has `CSIDriver.spec.storageCapacity` enabled
- find all `CSIStorageCapacity` objects that have the right spec
Member

Would there be multiple valid objects, or just one?

Contributor Author

There should only be one, otherwise there's conflicting information. We can make the scheduler check for that error, if we want.
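
Put together, the lookup could look roughly like this (all types here are local stand-ins, not the real scheduler or API types, and the conflict check is the optional error mentioned above):

package sketch

import (
	"errors"

	"k8s.io/apimachinery/pkg/api/resource"
)

// driverInfo is a stand-in for the relevant parts of a CSIDriver object.
type driverInfo struct {
	Name            string
	StorageCapacity bool // CSIDriver.spec.storageCapacity
}

// capacityInfo is a stand-in for a CSIStorageCapacity object, with the
// topology check against the candidate node already evaluated.
type capacityInfo struct {
	DriverName       string
	StorageClassName string
	MatchesNode      bool
	Capacity         *resource.Quantity
}

// nodeHasCapacity mirrors the lookup sequence: check the driver's opt-in
// flag, then expect at most one matching CSIStorageCapacity object.
func nodeHasCapacity(driver driverInfo, objects []capacityInfo, className string, request resource.Quantity) (bool, error) {
	if !driver.StorageCapacity {
		// Feature not enabled for this driver: schedule without
		// considering capacity.
		return true, nil
	}
	var match *capacityInfo
	for i := range objects {
		obj := &objects[i]
		if obj.DriverName != driver.Name || obj.StorageClassName != className || !obj.MatchesNode {
			continue
		}
		if match != nil {
			return false, errors.New("conflicting CSIStorageCapacity objects")
		}
		match = obj
	}
	if match == nil || match.Capacity == nil {
		// Unknown capacity: the node is considered unusable for now.
		return false, nil
	}
	return match.Capacity.Cmp(request) >= 0, nil
}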

capacity that is too small is considered unusable at the moment and
ignored.

Each volume gets checked separately, independently of other volumes
Member

As a best effort, can we at least sum up the capacities for multiple volumes in the same Pod? This can at least reduce the chance of running into partial binding failures.

As a local volume example, if the node says it has "50GB" left, and a pod requests two local volumes of "50GB" each, then we should be fairly confident that the pod definitely cannot fit.

Contributor

That requires summing up the capacity of all the requested volumes and making a decision at the same time? But the create volume requests may not all come at exactly the same time.

Contributor Author

As a local volume example, if the node says it has "50GB" left, and a pod requests two local volumes of "50GB" each, then we should be fairly confident that the pod definitely cannot fit.

This makes assumptions about the semantic of "capacity", something that we wanted to avoid. It also assumes that "capacity" == "total capacity". It's erring on the wrong side if the driver reports "maximum volume size".

Let's first build something that doesn't use heuristics and then check how well it works before introducing additional logic, okay?


### CSI drivers without topology support

To simplify the implementation of external-provisioner, [topology
Member

I'm confused. Topology is a capability in CSI so drivers have to opt-in to support this.

Contributor Author

A certain @msau42 said that we can assume that CSI drivers support topology when they want to use storage capacity tracking 😝

I added this text in commit 86cf87d and removed some external-provisioner modes in response to your comment in https://github.com/kubernetes/enhancements/pull/1353/files#r369745671:

I'm not sure we should have so many different modes of operation. This will make maintaining the provisioner very complex.

I think we should try to simplify things and require that the plugin supports Topology in order to use this feature, and only have two modes of operation "local" and
"not local".

Member

Yes that part makes sense. I'm confused by the later sentence that talks about a driver implementing topology when they don't need it. What is the use case for a driver wanting the capacity feature when they don't support topology? The scheduler is not going to make any different decisions if capacity is equal across all nodes.

Contributor Author

Ah, now I see what you mean. Indeed, what the heck was I thinking of here? I don't remember. Let me remove the second sentence and if we find a use-case for it again, we can add back something better.

pohly added a commit to pohly/enhancements that referenced this pull request May 18, 2020
This is an attempt to capture all points that were raised during
review of kubernetes#1353.
Member

@msau42 msau42 left a comment

Has the KEP been updated with all the comments? When I look at the diff I still see the older text.



approach that integration into Kubernetes is managed as part of the
CSI driver deployment, ideally without having to modify the CSI driver
itself. This will work without changes for some storage systems while
others may have to support [a new CSI API
Member

Let's remove this part about a generic solution until we have a more concrete use case.


#### CSIDriver.spec.storageCapacity

A new field `storageCapacity` of type `boolean` with default `false`
Member

Do we need this new field? Or would the presence of the CSIStorageCapacity object be sufficient?

I guess without the field, then rollout would potentially not be as smooth? With the field, you could upgrade the DaemonSet to the new version and only flip the driver bit when there is capacity reported across all nodes.

Contributor Author

Do we need this new field? Or would the presence of the CSIStorageCapacity object be sufficient?

Suppose there is not a single CSIStorageCapacity for the driver: is that because the driver does not support the feature, or because capacity information has not been created yet (driver still starting up)? The scheduler cannot distinguish between the two cases.

If it assumes that "no objects means no capacity", then it'll stop scheduling for drivers which don't support the feature.

If it assumes that "no objects means feature not supported", then it'll schedule instead of waiting for information. That will work eventually (it's just another variant of "acting on incomplete or out-dated information" and the recovery mechanisms need to deal with it), though, so perhaps that flag can indeed be dropped?

I guess without the field, then rollout would potentially not be as smooth? With the field, you could upgrade the DaemonSet to the new version and only flip the driver bit when there is capacity reported across all nodes.

I was thinking of a simpler deployment where the driver just gets installed once through a set of yamls and then Kubernetes itself waits for the driver to provide some capacity information, without an admin or operator implementing such an additional post-install step.

Contributor Author

@pohly pohly Jun 3, 2020

I now think that we don't need the new field, without any loss of functionality. Here's what we can do instead:

  • Support the manual creation of CSIStorageCapacity objects by ensuring that external-provisioner only deletes objects that it also created. This was implied by the design, but not called out explicitly yet. I expect that this will be useful also for the autoscaler because then a cluster admin can use such manually created objects to define (local) storage that currently non-existent nodes will have access to.
  • The scheduler blocks pod scheduling if it finds at least one CSIStorageCapacity object for the CSI driver and none that has enough capacity.
  • CSI driver deployments create one CSIStorageCapacity object, for example with the CSI driver name as name, if and only if the deployment is going to produce storage capacity information. That object can define zero available storage, topology doesn't matter. All that matters is that the object exists and has the CSI driver name.

@msau42: what do you think?

Member

This may make upgrading a local storage driver in a large cluster difficult, say 5000 nodes. While the rolling update is happening, the scheduler will go from 5000 node choices quickly down to a handful. This could cause a lot of scheduling failures while this upgrade is happening. Ideally, we would want some way to switch over the behavior after the upgrade is complete.

Contributor Author

@pohly pohly Jun 5, 2020

@msau42: This is about upgrading from a driver deployment which has no support for storage capacity to one which does, right? In that case, storage capacity constraints will kick in once the first CSIStorageCapacity object appears, and you are worried that it would be better to let the driver continue with creating such objects for a while.

Let me augment my proposal to cover this: to opt into storage capacity scheduling, a driver deployment must create a CSIStorageCapacity with the driver name as object name. Whoever deploys the driver can then decide to create that object immediately or wait a bit.

Contributor Author

We discussed this today and agreed that the CSIDriverSpec.StorageCapacity is the better API, so we'll keep the KEP as originally proposed and merged.
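
For reference, a hedged sketch of the opt-in field as discussed in this thread (the real generated API type has additional fields and may differ slightly):

package sketch

// CSIDriverSpecSketch shows only the field this thread settled on.
type CSIDriverSpecSketch struct {
	// StorageCapacity indicates that the driver deployment will create
	// CSIStorageCapacity objects with capacity information and wants
	// the Kubernetes scheduler to rely on that information when placing
	// pods with volumes provisioned by this driver. Defaults to false.
	// +optional
	StorageCapacity *bool
}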


A new field `storageCapacity` of type `boolean` with default `false`
in
[CSIDriver.spec](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.16/#csidriverspec-v1beta1-storage-k8s-io)
Member

It's v1 as of 1.18.

Contributor Author

Updated the link.

because the reported capacity is smaller than the size of a
volume. This will work better when CSI drivers implement `GetCapacity`
such that they consider constraints like fragmentation and report the
size that the largest volume can have at the moment.
Member

Let's reference the CSI open issue. We need to resolve this ambiguity before beta.


This is a slight deviation from the CSI spec, which defines that
"`accessible_topology` specifies where the *node* is accessible
from". For node-local storage there is no difference, but for
network-attached storage there might be one.
Member

I'm not sure what the problem is for network-attached storage. Could you give an example? This KEP is only exposing capacity information per topology where topology of the node aligns with the storage system's topology.

We're not handling scenarios where the storage system has a topology independent of Kubernetes nodes.

@msau42
Member

msau42 commented May 18, 2020

Nevermind, I think I was looking at an older diff

@pohly
Contributor Author

pohly commented May 19, 2020

Various replies, GitHub no longer lets me reply directly to the comments:

Let's reference the CSI open issue. We need to resolve this ambiguity before beta.

Link added.

I think the original example is confusing because it seems like zone encompasses a region, which is the reverse of what most cloud providers do. Can we change the example to use a
more clear hierarchy, like region and rack?

Okay, done.

I'm not sure what the problem is for network-attached storage. Could you give an example? This KEP is only exposing capacity information per topology where topology of the node aligns with the storage system's topology.

I added a remark for the example that "If it uses region/rack, then the proposed approach below will
still work, but at the cost of creating redundant CSIStorageCapacity objects" and removed the comment about "deviation from the CSI spec". I still feel that we expect more from the driver than required by the spec, but it's not in violation of the spec, so it's fine.

@pohly pohly force-pushed the storage-capacity-constraints-for-pod-scheduling branch from c742793 to 0a3853f Compare May 19, 2020 12:58
A proposal for increasing the chance that the Kubernetes scheduler
picks a suitable node for pods using volumes that have capacity
limits.
@pohly pohly force-pushed the storage-capacity-constraints-for-pod-scheduling branch from 0a3853f to 556497a Compare May 19, 2020 13:03
@pohly
Contributor Author

pohly commented May 19, 2020

I've squashed the commits now

  • because we are (hopefully!) close to merging
  • this PR has reached exactly 50 commits, which is a nice, round number 😁

Today's changes were:

diff --git a/keps/sig-storage/1472-storage-capacity-tracking/README.md b/keps/sig-storage/1472-storage-capacity-tracking/README.md
index c2b874e..5898bb5 100644
--- a/keps/sig-storage/1472-storage-capacity-tracking/README.md
+++ b/keps/sig-storage/1472-storage-capacity-tracking/README.md
@@ -480,5 +480,5 @@ status:
 A new field `storageCapacity` of type `boolean` with default `false`
 in
-[CSIDriver.spec](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.16/#csidriverspec-v1beta1-storage-k8s-io)
+[CSIDriver.spec](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#csidriverspec-v1-storage-k8s-io)
 indicates whether a driver deployment will create `CSIStorageCapacity`
 objects with capacity information and wants the Kubernetes scheduler
@@ -506,5 +506,5 @@ The CSI spec up to and including the current version 1.2 just
 specifies that ["the available
 capacity"](https://github.com/container-storage-interface/spec/blob/314ac542302938640c59b6fb501c635f27015326/lib/go/csi/csi.pb.go#L2548-L2554)
-is to be returned by the driver. It is left open whether that means
+is to be returned by the driver. It is [left open](https://github.com/container-storage-interface/spec/issues/432) whether that means
 that a volume of that size can be created. This KEP uses the reported
 capacity to rule out pools which clearly have insufficient storage
@@ -540,13 +540,11 @@ matches the storage pool(s) that it has access to, with granularity
 that matches the most restrictive pool.
 
-For example, if the driver runs in a node with zone/region/rack
-topology and has access to one per-zone pool and one per-zone/region
-pool, then the driver should report the zone/region of the second pool
-as its topology.
-
-This is a slight deviation from the CSI spec, which defines that
-"`accessible_topology` specifies where the *node* is accessible
-from". For node-local storage there is no difference, but for
-network-attached storage there might be one.
+For example, if the driver runs in a node with region/rack topology
+and has access to per-region storage as well as per-rack storage, then
+the driver should report topology with region/rack as its keys. If it
+only has access to per-region storage, then it should just use region
+as key. If it uses region/rack, then the proposed approach below will
+still work, but at the cost of creating redundant `CSIStorageCapacity`
+objects.
 
 Assuming that a CSI driver meets the above requirement and enables
@@ -969,7 +967,5 @@ scheduler extender varies between clusters.
 To simplify the implementation of external-provisioner, [topology
 support](https://kubernetes-csi.github.io/docs/topology.html) is
-expected from a CSI driver. A driver which does not really need
-topology support can add it simply by always returning the same static
-`NodeGetInfoResponse.AccessibleTopology`.
+expected from a CSI driver.

Full history before squashing is here: https://github.com/pohly/enhancements/commits/storage-capacity-2020-05-19

@msau42
Member

msau42 commented May 19, 2020

/lgtm
/approve
Thanks for working through all the comments and adapting all the feedback!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 19, 2020
@msau42
Member

msau42 commented May 19, 2020

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 19, 2020
@k8s-ci-robot k8s-ci-robot merged commit 4c9fc3c into kubernetes:master May 19, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone May 19, 2020