
Added benchmarks for pod affinity NamespaceSelector #101329

Merged: 1 commit into kubernetes:master on Apr 26, 2021

Conversation

@ahg-g (Member) commented on Apr 21, 2021

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds two operations to the scheduler perf benchmarks integration test: 1) create namespaces, 2) create multiple sets of pods.

These were necessary to create pod (anti)affinity benchmarks that use NamespaceSelector.

The benchmark results are in the following file: BenchmarkPerfScheduling.txt

The comparison is against the existing affinity benchmarks. The current affinity benchmarks put all existing pods in one namespace; the new ones split them across 100 namespaces and use a namespace selector. The results show that there is no performance drop.
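For context, an illustrative sketch (not taken from the benchmark configs) of the API surface these benchmarks exercise: the pod affinity NamespaceSelector field selects peer namespaces by label instead of listing them by name. The label keys and values below are made up.

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A required pod affinity term that matches peer pods in every namespace
	// carrying a given label, rather than in an explicit Namespaces list.
	term := v1.PodAffinityTerm{
		LabelSelector: &metav1.LabelSelector{
			MatchLabels: map[string]string{"color": "blue"},
		},
		// NamespaceSelector is the field being graduated to beta; the
		// pre-existing benchmarks effectively keep all pods in one namespace.
		NamespaceSelector: &metav1.LabelSelector{
			MatchLabels: map[string]string{"team": "bench"},
		},
		TopologyKey: "kubernetes.io/hostname",
	}
	fmt.Printf("%+v\n", term)
}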

Which issue(s) this PR fixes:

Part of kubernetes/enhancements#2249 #97203

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 21, 2021
@k8s-ci-robot (Contributor) commented:

@ahg-g: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 21, 2021
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 21, 2021
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 21, 2021
@ahg-g (Member, Author) commented on Apr 21, 2021

/cc @adtac

@k8s-ci-robot k8s-ci-robot requested a review from adtac April 21, 2021 16:44
@ahg-g (Member, Author) commented on Apr 21, 2021

@alculquicondor @Huang-Wei this is needed for beta graduation.

@alculquicondor (Member) left a comment:

/sig scheduling

}
}
if err != nil {
klog.Fatalf("Creating namespace: %v", err)
Member:

better not use klog.Fatal in a test

Member (Author):

we should return here, updated.
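A minimal sketch of that shape, assuming a small helper that creates one namespace (the helper name, signature, and package are illustrative, not the PR's exact code):

package nsbench

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createNamespace returns the error to its caller instead of calling
// klog.Fatalf, so the benchmark harness decides how to fail and clean up.
func createNamespace(ctx context.Context, client kubernetes.Interface, ns *v1.Namespace) error {
	if _, err := client.CoreV1().Namespaces().Create(ctx, ns, metav1.CreateOptions{}); err != nil {
		return fmt.Errorf("creating namespace %q: %w", ns.Name, err)
	}
	return nil
}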

@@ -681,6 +802,7 @@ func createPods(namespace string, cpo *createPodsOp, clientset clientset.Interfa
if err != nil {
return err
}
klog.Infof("Creating %d pods in namespace %q", cpo.Count, namespace)
Member:

b.Info is easier to deal with on debugging tools

Member (Author):

the logs get truncated, not sure if there is an option to prevent that.

b.Fatalf("op %d: %v", opIndex, err)
}
if err := nsPreparer.prepare(); err != nil {
b.Fatalf("op %d: %v", opIndex, err)
Member:

what if some namespaces were successfully created?

Member (Author):

not sure I get what the concern is; this is a fatal error, so the whole test case will fail

Member:

Is there anything else clearing the namespaces? Isn't the etcd db shared for the entire test suite?

Member (Author):

ah, ok, I added a call to cleanup()
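A rough sketch of that flow, extending the snippet quoted above (nsPreparer, prepare, and cleanup come from this thread; everything else is assumed and may differ from the PR):

// If preparing namespaces fails partway through, remove whatever was created
// before failing the op, since the etcd instance is shared by the test suite.
if err := nsPreparer.prepare(); err != nil {
	nsPreparer.cleanup() // best effort; error handling of cleanup elided here
	b.Fatalf("op %d: %v", opIndex, err)
}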

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Apr 21, 2021
@Huang-Wei (Member) left a comment:

Some nits below.

One q: are you going to compose a baseline test to compare the results? For example, create $initNamespaces namespaces, and run workloads specifying spec.affinity...namespaces.

measurePods: 1000


- name: SchedulingPreferredAffinityWithNSSelector
Member:

Duplicated with L553?

Member:

duplicates should be L627 and L553 instead of here.

Member (Author):

yup, removed the duplicate.

}
klog.Infof("Making %d namespaces with prefix %q and template %v", p.count, p.prefix, *base)

retries := 5
Member:

You may wrap this by reusing:

import "k8s.io/client-go/util/retry"

retry.RetryOnConflict(retry.DefaultRetry, fn)

Member (Author):

done.
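For reference, the suggested helper retries the wrapped function with a default backoff whenever it returns a Conflict error. A generic sketch of its canonical use, with made-up names rather than the PR's code:

package nsbench

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// relabelNamespace re-reads and updates a namespace, retrying on conflicts.
func relabelNamespace(ctx context.Context, client kubernetes.Interface, name string, labels map[string]string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		ns, err := client.CoreV1().Namespaces().Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		ns.Labels = labels
		_, err = client.CoreV1().Namespaces().Update(ctx, ns, metav1.UpdateOptions{})
		return err
	})
}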

measurePods: 1000


- name: SchedulingPreferredAffinityWithNSSelector
Member:

duplicates should be L627 and L553 instead of here.

Comment on lines 242 to 245
// Number of namespaces to create. Parameterizable through CountParam.
Count int
// Template parameter for Count.
CountParam string
Member:

Both "Count" and "CountParam" are semantically identical, possible to just use one instead, maybe if the "CountParam" is not set it could be parsed as "1" for the measured namespace?

Member (Author):

This is an established pattern across all operations.

namespaceTemplatePath: config/namespace-with-labels.yaml
- opcode: createNamespaces
prefix: measure-ns
count: 1
Member:

or maybe something like this?

countParam: $measureNamespaces

Member (Author):

the parameter is not reused across workloads; we want to explicitly use a single namespace, hence it's hardcoded.

func (cpso createPodSetsOp) patchParams(w *workload) (realOp, error) {
if cpso.CountParam != "" {
var ok bool
if cpso.Count, ok = w.Params[cpso.CountParam[1:]]; !ok {
Member:

consider the case that both "cpso.CountParam" and "cpso.Count" are set in the template.

Member (Author):

this is an established pattern in the file: CountParam takes precedence. I added a comment.
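Completing the snippet quoted above, the precedence rule looks roughly like this (the error message and the trailing return are a sketch; the real code may validate further):

// CountParam, when set (e.g. "$initPods"), takes precedence over Count: the
// workload parameter named after the leading "$" is looked up and used.
func (cpso createPodSetsOp) patchParams(w *workload) (realOp, error) {
	if cpso.CountParam != "" {
		var ok bool
		if cpso.Count, ok = w.Params[cpso.CountParam[1:]]; !ok {
			return nil, fmt.Errorf("parameter %q is undefined", cpso.CountParam)
		}
	}
	return &cpso, nil // further validation elided in this sketch
}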

test/integration/scheduler_perf/scheduler_perf_test.go (outdated review thread; resolved)
test/integration/scheduler_perf/scheduler_perf_test.go (outdated review thread; resolved)
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 22, 2021
@alculquicondor (Member) commented:

LGTM for me after squash

@ahg-g (Member, Author) commented on Apr 22, 2021

Some nits below.

One q: are you going to compose a baseline test to compare the results? For example, create $initNamespaces namespaces, and run workloads specifying spec.affinity...namespaces.

The comparison is against the existing affinity benchmarks. The current benchmarks put all existing pods in one namespace; the new ones split them across 100 namespaces and use a namespace selector. I am showing that there is no performance drop.

@Huang-Wei (Member) commented:

The comparison is against the existing affinity benchmarks. The current benchmarks put all existing pods in one namespace; the new ones split them across 100 namespaces and use a namespace selector. I am showing that there is no performance drop.

Sounds good.

@ahg-g (Member, Author) commented on Apr 22, 2021

commits squashed and I updated the PR description with the results.

@alculquicondor (Member) commented:

/lgtm

/hold
for others

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 22, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 22, 2021
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 22, 2021
@ahg-g (Member, Author) commented on Apr 22, 2021

/retest

1 similar comment
@ahg-g (Member, Author) commented on Apr 22, 2021

/retest

for i := 0; i < p.count; i++ {
n := base.DeepCopy()
n.Name = fmt.Sprintf("%s-%d", p.prefix, i)
testutils.RetryWithExponentialBackOff(func() (bool, error) {
Member:

It seems the returned error is discarded. IMO we should abort the loop and return the (timeout) error? In the current logic, prepare() always returns nil.

Member (Author):

if the function returns an error, RetryWithExponentialBackOff will directly return and not retry. Ideally there should be a way to check if the error is not retry-able and only in that case return an error.

Member (Author):

all the functions in

func CreatePodWithRetries(c clientset.Interface, namespace string, obj *v1.Pod) error {
are actually not doing any retries on errors. That is why the benchmark was sometimes failing.

Member:

if the function returns an error, RetryWithExponentialBackOff will directly return and not retry

True, but the inner function doesn't return any error, right? So the only non-nil error we may get from testutils.RetryWithExponentialBackOff is a timeout error, and in that case, should we abort the test?

are actually not doing any retries on errors

yes, the names (CreatePodWithRetries and others) are confusing and I proposed #100688.

Member (Author):

True, but the inner function doesn't return any error, right?
You mean line 1038 below? Correct. I am simplifying things here and assuming that all errors are retry-able, because we don't have a method that tells us whether an error is retry-able (in which case we would return nil) or not retry-able (in which case we would return the error).

Member (Author):

updated to capture the error returned by RetryWithExponentialBackOff and return it.
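A sketch of that change, extending the loop quoted above. The body of the retried function and p.client are assumptions; per the discussion, every error is treated as retry-able, so only a backoff timeout surfaces to the caller:

for i := 0; i < p.count; i++ {
	n := base.DeepCopy()
	n.Name = fmt.Sprintf("%s-%d", p.prefix, i)
	// Capture the error from the backoff helper (previously discarded)
	// and propagate it to the caller.
	if err := testutils.RetryWithExponentialBackOff(func() (bool, error) {
		_, err := p.client.CoreV1().Namespaces().Create(context.TODO(), n, metav1.CreateOptions{})
		return err == nil, nil // assume every error is retry-able
	}); err != nil {
		return fmt.Errorf("creating namespace %q: %w", n.Name, err)
	}
}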

test/integration/scheduler_perf/scheduler_perf_test.go (outdated review thread; resolved)
@Huang-Wei (Member) commented:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 23, 2021
@ahg-g (Member, Author) commented on Apr 23, 2021

/retest

@ahg-g (Member, Author) commented on Apr 26, 2021

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 26, 2021
@k8s-ci-robot k8s-ci-robot merged commit 3e71ecc into kubernetes:master Apr 26, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone Apr 26, 2021
@ahg-g ahg-g deleted the ahg-nss-bench branch October 25, 2021 14:39
Labels
approved (Indicates a PR has been approved by an approver from all required OWNERS files.)
area/test
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
kind/feature (Categorizes issue or PR as related to a new feature.)
lgtm ("Looks good to me", indicates that a PR is ready to be merged.)
needs-priority (Indicates a PR lacks a `priority/foo` label and requires one.)
needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.)
release-note-none (Denotes a PR that doesn't merit a release note.)
sig/scheduling (Categorizes an issue or PR as relevant to SIG Scheduling.)
sig/testing (Categorizes an issue or PR as relevant to SIG Testing.)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)
5 participants