Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add KEP for default Even Pods Spreading #1260

Merged

Conversation

alculquicondor
Copy link
Member

@alculquicondor alculquicondor commented Sep 26, 2019

Done: Summary, Motivation, Proposal, Design Details (Alpha Graduation Criteria), Alternatives.

Feature: #1258
Main repo issue: kubernetes/kubernetes#80639

Signed-off-by: Aldo Culquicondor <acondor@google.com>
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 26, 2019
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Sep 26, 2019
@alculquicondor
Copy link
Member Author

/cc @Huang-Wei @ahg-g

@alculquicondor alculquicondor changed the title Add KEP for default Even Pods Spreading [WIP] Add KEP for default Even Pods Spreading Sep 26, 2019
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 26, 2019
Copy link
Contributor

@draveness draveness left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be easy to implement this with runtime class scheduling, we could add another field to Scheduling struct like this:

type Scheduling struct {
    // nodeSelector lists labels that must be present on nodes that support this
    // RuntimeClass. Pods using this RuntimeClass can only be scheduled to a
    // node matched by this selector. The RuntimeClass nodeSelector is merged
    // with a pod's existing nodeSelector. Any conflicts will cause the pod to
    // be rejected in admission.
    // +optional
    NodeSelector map[string]string

    // tolerations adds tolerations to pods running with this RuntimeClass.
    // +optional
    Tolerations []corev1.Toleration

    // TopologySpreadConstraints describes how a group of pods are spread
    // If specified, scheduler will enforce the constraints
    // +optional
    TopologySpreadConstraints []TopologySpreadConstraint // new field

}

The pod with specified runtime class would use the topology spread constraints in the Scheduling.

cc/ @tallclair

@alculquicondor
Copy link
Member Author

RuntimeClass solves a totally different problem. And it doesn't solve the portability problem. The workload authors now need to be aware of what the runtime classes are.

Signed-off-by: Aldo Culquicondor <acondor@google.com>
Signed-off-by: Aldo Culquicondor <acondor@google.com>
@alculquicondor alculquicondor changed the title [WIP] Add KEP for default Even Pods Spreading Add KEP for default Even Pods Spreading Sep 27, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 27, 2019
Copy link
Member

@Huang-Wei Huang-Wei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alculquicondor ! LGTM generally, some comments below.

keps/sig-scheduling/20190926-default-even-pod-spreading.md Outdated Show resolved Hide resolved
With `EvenPodsSpreading`, workload authors can define spreading rules for their loads based on the
topology of the clusters.

By introducing configurable default spreading rules, workloads can be spread in the topology of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should reword this section to highlight that EvenPodsSpreading enables users to control pod spreading on a PodSpec basis, while this feature will bring a cluster-level configurable pod spreading policy, and then mention a bit on the differences.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded the 2 paragraphs.

keps/sig-scheduling/20190926-default-even-pod-spreading.md Outdated Show resolved Hide resolved
keps/sig-scheduling/20190926-default-even-pod-spreading.md Outdated Show resolved Hide resolved
#### Story 1

As a cluster operator, I want to set default spreading rules for workloads in the cluster.
Currently, `SelectorSpreadPriority` provides a canned priority that spreads across nodes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"nodes and zones", or just "zones"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both. It gives a weight of 1/3 to nodes and 2/3 to zones.

keps/sig-scheduling/20190926-default-even-pod-spreading.md Outdated Show resolved Hide resolved
keps/sig-scheduling/20190926-default-even-pod-spreading.md Outdated Show resolved Hide resolved
keps/sig-scheduling/20190926-default-even-pod-spreading.md Outdated Show resolved Hide resolved
keps/sig-scheduling/20190926-default-even-pod-spreading.md Outdated Show resolved Hide resolved
keps/sig-scheduling/20190926-default-even-pod-spreading.md Outdated Show resolved Hide resolved
`SelectorSpreadingPriority`.
Operators can disable `SelectorSpreadingPriority`. But once default rules for `EvenPodsSpreading` is GA,
we can consider removing `SelectorSpreadingPriority` and replacing it by an equivalent
k8s default to the default rules for `EvenPodsSpreading`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to present this KEP as the replacement for the SelectorSpreadingPriority, they shouldn't exist together. To make the transition frictionless, we can set a default value for what we are proposing here such that the scheduler's default behavior isn't changed (i.e., we set a default configuration for "default even pods spreading" to spread across zones).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what this paragraph is trying to say.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant is that we shouldn't allow configuring both at the same time at all. The paragraph is still talking about the possibility of that being the case.

Currently, SelectorSpreadingPriority is used as one of the priority functions when no Policy is specified, and it is a value that we accept in the Policy file. Those two aspects are our guarantees to the users. We can solve them by doing the following if this feature is enabled:

  1. SelectorSpreadingPriority is removed from the default set of priorities. The default configuration for cluster-wide EvenPodsSpreading is set such that we maintain the same current scheduler default behavior that SelectorSpreadingPriority offers (or as close as possible).

  2. During the deprecation period of Policy (and by association SelectorSpreadingPriority), we say when set, SelectorSpreadingPriority overrides cluster-wide EvenPodsSpreading.

WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that they are conflicting. See updated section.

// that don't define any.
// If not specified, the scheduler applies the following default.
// +optional
DefaultTopologySpreadConstraints []TopologySpreadConstraint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would drop the prefix Default, I think it is not necessary. It will be confusing when we start talking about a default value for DefaultTopologySpreadConstraints.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's confusing. But we need to convey the idea that it's still a cluster default that is applied if the pods don't specify any rules.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Component config is a cluster-wide configuration by definition, adding a "Default" prefix does not really clarify the semantics of this configuration variable as you state it, at the end of the day it has to be explained in the docs/comments that this value can be overridden at the pod level.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See updated name and comment.

keps/sig-scheduling/20190926-default-even-pod-spreading.md Outdated Show resolved Hide resolved
}
```

Note that `TopologySpreadConstraint` is similar to `k8s.io/api/core/v1.TopologySpreadConstraint`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As current TopologySpreadConstraint.LabelSelector is optional, you can reuse that structure. (validation logic can help detect illegal presence of labelSelector)

@Huang-Wei What happens currently in EvenPodsSpreading when the label selector is not specified?

Signed-off-by: Aldo Culquicondor <acondor@google.com>
keps/sig-scheduling/20190926-default-even-pod-spreading.md Outdated Show resolved Hide resolved
`SelectorSpreadingPriority`.
Operators can disable `SelectorSpreadingPriority`. But once default rules for `EvenPodsSpreading` is GA,
we can consider removing `SelectorSpreadingPriority` and replacing it by an equivalent
k8s default to the default rules for `EvenPodsSpreading`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant is that we shouldn't allow configuring both at the same time at all. The paragraph is still talking about the possibility of that being the case.

Currently, SelectorSpreadingPriority is used as one of the priority functions when no Policy is specified, and it is a value that we accept in the Policy file. Those two aspects are our guarantees to the users. We can solve them by doing the following if this feature is enabled:

  1. SelectorSpreadingPriority is removed from the default set of priorities. The default configuration for cluster-wide EvenPodsSpreading is set such that we maintain the same current scheduler default behavior that SelectorSpreadingPriority offers (or as close as possible).

  2. During the deprecation period of Policy (and by association SelectorSpreadingPriority), we say when set, SelectorSpreadingPriority overrides cluster-wide EvenPodsSpreading.

WDYT?

@tallclair
Copy link
Member

I agree with @alculquicondor, the RuntimeClass scheduling constraints are only intended to address compatibility.

A more relevant proposal would the SchedulingPolicy, which unfortunately hasn't gotten off the ground yet.
/cc @yastij

@alculquicondor
Copy link
Member Author

I updated the PR with the provided feedback.

However, I'm taking a look at the progress of SchedulingPolicy. If there a no major blockers, I might merge the efforts.

@ahg-g
Copy link
Member

ahg-g commented Oct 3, 2019

However, I'm taking a look at the progress of SchedulingPolicy. If there a no major blockers, I might merge the efforts.

+1 to considering SchedulingPolicy again, it could allow us to deprecate most scheduler's "custom predicates/priorities".

And replace the word `rules` with `constraints`, for consistency.

Signed-off-by: Aldo Culquicondor <acondor@google.com>
@draveness
Copy link
Contributor

+1 to considering SchedulingPolicy again, it could allow us to deprecate most scheduler's "custom predicates/priorities".

I talked to Yassine and took the job, and I'll send the KEP soon.

@ahg-g
Copy link
Member

ahg-g commented Oct 4, 2019

+1 to considering SchedulingPolicy again, it could allow us to deprecate most scheduler's "custom predicates/priorities".

I talked to Yassine and took the job, and I'll send the KEP soon.

note that I miss understood the current proposal for SchedulingPolicy, in its current form it will not help with deprecating custom predicates/priorities.

In any case, we discussed this during the sig meeting today and Aldo will be following up with few people who initially proposed it to understand if this is still worth pursuing especially that what SchedulingPolicy is providing can be implemented as a user admission webhook.

Signed-off-by: Aldo Culquicondor <acondor@google.com>
@alculquicondor
Copy link
Member Author

After looking at what the Scheduling Policy proposes and the older discussions, I suggest that we proceed with this KEP as is.

The rational being:

  • Scheduler Policy focuses on admission, not necessarily setting defaults.
  • An API that would give us such flexibility for setting defaults is probably a better candidate for a separate mutating webhook, rather than the core scheduler.

I updated Goals and Alternatives sections to reflect this suggestion.

Comment on lines 130 to 133
- If the cluster operator provides a Policy that includes `SelectorSpreadingPriority` and
`EvenPodsSpreadPriority`, K8s provides an empty Default `topologySpreadConstraints`.
The cluster operator can still specify Default `topologySpreadConstraints`,
in which case both priorities run.
Copy link
Member

@ahg-g ahg-g Oct 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should run both priorities, I think we should clarify that if both are specified, then one overrides the other.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So your take would be that EvenPodsSpreadPriority overrides and disables SelectorSpreadingPriority?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have a single implementation, a plugin. This plugin will either be configured using the priority-to-plugin translation logic we have if SelectorSpreadingPriority is specified, otherwise it is configured using the new ComponentConfig parameters you will be adding.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See updated clarifications

Signed-off-by: Aldo Culquicondor <acondor@google.com>
| SelectorSpreading | EvenPodsSpread | Valid | Pod spread constraints |
| :---------------: | :------------: | :---: | :---------------------------------------: |
| N | Y | Yes | provided or [k8s default](#default-rules) |
| Y | Y | Yes | provided or [k8s default](#default-rules) |
Copy link
Member

@ahg-g ahg-g Oct 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the constraint should be k8s default always, never provided in this case because SelectorSpreading takes precedence.

Copy link
Member

@ahg-g ahg-g Oct 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or you can consider specifying both (topologySpreadConstraint and SelectorSpreading) an invalid configuration, which I think is better since we don't have to make a call which one takes precedence.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with the second option

Signed-off-by: Aldo Culquicondor <acondor@google.com>

### Graduation Criteria

Alpha (v1.17):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should say that this feature is tired to even pods spread, so they will graduate together.

@ahg-g
Copy link
Member

ahg-g commented Oct 7, 2019

/lgtm

This is great, thanks Aldo for the detailed explanation of how this will work. I will approve one @Huang-Wei LGTMs

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 7, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 7, 2019
@ahg-g
Copy link
Member

ahg-g commented Oct 7, 2019

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 7, 2019
@k8s-ci-robot k8s-ci-robot merged commit 872fd49 into kubernetes:master Oct 7, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Oct 7, 2019
@ahg-g
Copy link
Member

ahg-g commented Oct 7, 2019

sorry for the premature merge :(

Aldo, can you please clarify the graduation dependency between this and EvenPodsSpreading in a followup PR?

@alculquicondor
Copy link
Member Author

Sure thing. There is a new PR in place.

- **Unit Tests**: All core changes must be covered by unit tests.
- **Integration Tests**: One integration test for the default rules and one for custom rules.
- **Benchmark Tests**: A benchmark test that compare the default rules against `SelectorSpreadingPriority`.
The performance should be as close as possible.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for jumping late into this discussion - but do we believe we can do that efficiently?
I'm asking in the context of large clusters - i.e. 5k nodes.

I want to ensure that we won't introduce a visible regression, because this may become a blocker.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be able to do this as efficiently as the current logic, I don't see why we can't. I agree that it is a blocker if this will be causing performance regressions though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SelectorSpreadingPriority is currently part of the default algorithms that run. Theoretically, we are talking about the same problem, and then it should be possible to devise an algorithm for EvenPodsSpreadPriority with the same complexity.
We will be backing it with performance measurements, as seen in the Test Plan.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alculquicondor perhaps we should explicitly call that out in the KEP?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Test Plan is part of the graduation criteria.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants