perf: Add full program benchmark for kustomize build #5425

Open

wants to merge 1 commit into base: master
Conversation

@chlunde (Contributor) commented Oct 30, 2023

This change introduces a benchmarking test that constructs a complete kustomization tree using various features of Kustomize.

This update aims to address several objectives:

  • Demonstrating current performance challenges in Kustomize in a reproducible manner.
  • Evaluating the effects of performance enhancements.
  • Guarding against potential performance setbacks and inadvertent quadratic behavior in the future.
  • Considering the possibility of incorporating profile-guided optimization (PGO) in future iterations.

Usage:

go test -run=x -bench=BenchmarkBuild ./kustomize/commands/build

# sigs.k8s.io/kustomize/kustomize/v5/commands/build.test
pkg: sigs.k8s.io/kustomize/kustomize/v5/commands/build
BenchmarkBuild-8   	       1	8523677542 ns/op
PASS
ok  	sigs.k8s.io/kustomize/kustomize/v5/commands/build	8.798s
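
The general shape of the benchmark is roughly the following (a simplified sketch using the public krusty API and an in-memory filesystem; it is not the actual test code, which generates a much larger tree from sizing parameters):

    package build_test

    import (
        "testing"

        "sigs.k8s.io/kustomize/api/krusty"
        "sigs.k8s.io/kustomize/kyaml/filesys"
    )

    // BenchmarkBuildSketch builds a tiny kustomization on an in-memory
    // filesystem and runs a full kustomize build per iteration. File names
    // and contents here are illustrative only.
    func BenchmarkBuildSketch(b *testing.B) {
        fSys := filesys.MakeFsInMemory()
        _ = fSys.WriteFile("/app/kustomization.yaml", []byte("resources:\n  - deploy.yaml\n"))
        _ = fSys.WriteFile("/app/deploy.yaml", []byte("apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: demo\n"))
        k := krusty.MakeKustomizer(krusty.MakeDefaultOptions())
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            if _, err := k.Run(fSys, "/app"); err != nil {
                b.Fatal(err)
            }
        }
    }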

Currently, this benchmark requires 3000 seconds to run on my machine. In order to run it on master today, you need to add -timeout=30m to the go test command.

The dataset size was chosen because I believe it represents a real workload, one for which we should be able to get the runtime below 10 seconds.

Updates #5084

Notes on PGO:

Real-life profiles would be better, but creating one based on a benchmark should not hurt:

https://go.dev/doc/pgo#collecting-profiles

Will PGO with an unrepresentative profile make my program slower than no PGO?
It should not. While a profile that is not representative of production behavior will result in optimizations in cold parts of the application, it should not make hot parts of the application slower. If you encounter a program where PGO results in worse performance than disabling PGO, please file an issue at https://go.dev/issue/new.

Collecting a profile:

go test -cpuprofile cpu1.pprof -run=^$ -bench ^BenchmarkBuild$ sigs.k8s.io/kustomize/kustomize/v5/commands/build

go build -pgo=./cpu1.pprof  -o kust-pgo ./kustomize
go build -o kust-nopgo ./kustomize

Compare PGO and non-PGO-builds:

./kust-pgo build -o /dev/null testdata/  21.88s user 2.00s system 176% cpu 13.505 total
./kust-nopgo build -o /dev/null testdata/  22.76s user 1.98s system 174% cpu 14.170 total

k8s-ci-robot added the cncf-cla: yes and needs-ok-to-test labels on Oct 30, 2023
@k8s-ci-robot (Contributor)

Hi @chlunde. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the size/L label on Oct 30, 2023
@shapirus

One important aspect that I'd like to point out here is that the benchmarks should cover the following cases:

  • single invocation of kustomize to process a big and complex resource tree
  • multiple invocations of kustomize to process a small set of manifests on each run (simulates e.g. argocd handling many microservices, or a validation CI job running on the repository hosting a large set of manifests)

These cases differ greatly in which parts of the code dominate the cost, yet each of them is entirely valid as far as practical usage is concerned.

@natasha41575 (Contributor)

/ok-to-test
/triage accepted

k8s-ci-robot added the ok-to-test and triage/accepted labels and removed the needs-ok-to-test label on Oct 30, 2023
@chlunde (Contributor, Author) commented Oct 30, 2023

In hindsight, it makes no sense to merge this with the current performance.
Either we wait for a few of the pending performance patches to be merged, or we adjust some of the sizing parameters to get reasonable numbers (runtime <10m in CI).

I was a bit surprised at how slow the baseline was, since I had quite a few performance patches applied locally at the time I wrote the benchmark.

@shapirus yeah, we should have some benchmarks emulating argo/flux too. Do you think this affects how this PR is structured?
I think we should look at the API surface used by argo/flux so we use the right entry points, and we should probably avoid using the file system, so I think it would be a different benchmark.

chlunde marked this pull request as draft on October 30, 2023 at 21:40
k8s-ci-robot added the do-not-merge/work-in-progress label on Oct 30, 2023
@ncapps (Contributor) commented Oct 30, 2023

In hindsight, it makes no sense to merge this with the current performance.
Either we wait for a few of the pending performance patches to be merged, or we adjust some of the sizing parameters to get reasonable numbers (runtime <10m in CI).

I agree that it would be beneficial to limit the duration and scope of these performance tests. I really like the suggestion from @shapirus. Ideally, we can run meaningful benchmark tests as a PR check. This would help us understand if a specific commit significantly changed the performance.

@shapirus commented Oct 31, 2023

@shapirus yeah, we should have some benchmarks emulating argo/flux too. Do you think this affects how this PR is structured?

@chlunde As far as I can see, it is basically a separate Go program that imports kustomize as a library and then runs the tests in its own context. This only allows benchmarking the "single invocation, big resource tree" scenario, since the startup overhead is incurred only once (see #5422 (comment)).

To emulate argocd/fluxcd we need to, well, do what they do: invoke an actual kustomize binary (which, in the CI flow, should already be built by the time the benchmarks run, if I understand correctly) a certain number of times and measure the total run time.

My set of benchmarks at https://github.com/shapirus/kustomize-benchmark-suite (the Dockerfile serves as a readme there) does exactly that. Some of its core code could probably be reused for the kustomize CI. Yes, it's somewhat crude, but it's also small enough to be understood and reused.
If we're limited to Go (and e.g. can't use shell), that's not a problem either: Go can spawn processes to run commands and measure their execution time just as well, and with cleaner code.
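
For illustration, a rough sketch of that approach in Go (the binary path, app directories, and iteration count are placeholders, not anything taken from this PR):

    package main

    import (
        "fmt"
        "os"
        "os/exec"
        "time"
    )

    // Run a prebuilt kustomize binary once per app directory, many times over,
    // and report the total wall-clock time -- emulating the "many small
    // invocations" pattern of argocd/fluxcd.
    func main() {
        apps := []string{"testdata/app1", "testdata/app2", "testdata/app3"}
        const rounds = 100

        start := time.Now()
        for i := 0; i < rounds; i++ {
            for _, app := range apps {
                cmd := exec.Command("./kustomize", "build", "-o", os.DevNull, app)
                if err := cmd.Run(); err != nil {
                    fmt.Fprintf(os.Stderr, "kustomize build %s: %v\n", app, err)
                    os.Exit(1)
                }
            }
        }
        fmt.Printf("%d invocations took %s\n", rounds*len(apps), time.Since(start))
    }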

Unfortunately I lack the knowledge required to convert them into a proper CI workflow step for kustomize, but it should be easy, if not trivial, for those who have it.

I think we should look at the API surface used by argo/flux so we use the right entry points

There is really no need to do this, I think. I can't speak for fluxcd, but argocd simply runs a standalone kustomize binary to build the resulting manifest for each Application it has been asked to render. This is easily simulated with a simple shell script. Let's try to avoid overengineering :).

and we should probably avoid using the file system, so I think it would be a different benchmark.

In an ideal world, yes, but in practice it doesn't make any discernible difference unless the underlying block device is very slow and/or high-latency. On the other hand, avoiding the file system adds extra complications.

To reduce the effect of disk access overhead sufficiently for our specific practical use case, we can run find ./manifests-tree -type f -exec cat {} > /dev/null \; before running the benchmark to pre-heat the page cache by reading all the files we're going to need.
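
In Go-only terms, the same pre-heat could look roughly like this (the directory name is the placeholder used above; this is a sketch, not part of the PR):

    package main

    import (
        "io/fs"
        "os"
        "path/filepath"
    )

    // Read every file under the manifests tree once so that later timed runs
    // hit the page cache rather than the disk.
    func main() {
        _ = filepath.WalkDir("./manifests-tree", func(path string, d fs.DirEntry, err error) error {
            if err != nil || d.IsDir() {
                return err
            }
            _, readErr := os.ReadFile(path)
            return readErr
        })
    }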

My bigger concern is not the file system, but the shared CPU on the machine where the benchmark runs. If there are other CPU-intensive jobs running on the same machine, they can affect the benchmark execution times quite significantly. There are possible workarounds however:

  • run the benchmark process with nice -n -19 (not always possible, as it requires privileges)
  • use a sufficiently high number of iterations to make the benchmark run at least a few tens of seconds, to reduce the effect of small lags and fluctuations (i.e. you can't reliably benchmark when your test takes, say, a second or less, unless you have hardware dedicated to this test alone)
  • run the same benchmark several times and use the results of the one that completed in the shortest time (aka had most of the CPU available to itself alone) -- a minimal sketch of this follows the list
  • use a sufficiently high threshold for marking the test as failed when a regression against a previous version is detected (say > 40%) -- however, this makes it necessary to test not only against that one version but against several previous versions, to catch the cumulative effect of small incremental regressions, and implementing this logic so that it detects regressions reliably and without false positives may be non-trivial. There are probably existing tools; someone with a QA background might have more ideas.
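
A minimal Go sketch of the best-of-N idea above (the command and repetition count are placeholders):

    package main

    import (
        "fmt"
        "os/exec"
        "time"
    )

    // Run the same benchmark command several times and keep the fastest run,
    // on the assumption that the fastest run was the one least disturbed by
    // other load on the machine.
    func main() {
        const n = 5
        var best time.Duration
        for i := 0; i < n; i++ {
            start := time.Now()
            if err := exec.Command("./kustomize", "build", "-o", "/dev/null", "testdata/").Run(); err != nil {
                panic(err)
            }
            if d := time.Since(start); i == 0 || d < best {
                best = d
            }
        }
        fmt.Printf("best of %d runs: %s\n", n, best)
    }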

@ephesused (Contributor)

While poking around at this benchmark, I noticed some behavior that surprised me. I adjusted the value of genconfig[1].resources, and the run-time jump between 1 and 2 stands out dramatically. @chlunde (or others), do you see the same behavior?

Admittedly, I haven't taken the time to understand the test case well.

$ DEPTH_ONE_RESOURCE_COUNT=1 go test ./kustomize/commands/build -run nope -bench BenchmarkBuild
goos: windows
goarch: amd64                                         
pkg: sigs.k8s.io/kustomize/kustomize/v5/commands/build
cpu: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz          
BenchmarkBuild-8        1000000000               0.4780 ns/op
--- BENCH: BenchmarkBuild-8                                           
    build_benchmark_test.go:162: genconfig[1].resources redefined as 1
PASS                                                                  
ok      sigs.k8s.io/kustomize/kustomize/v5/commands/build       12.829s

$ DEPTH_ONE_RESOURCE_COUNT=2 go test ./kustomize/commands/build -run nope -bench BenchmarkBuild
goos: windows
goarch: amd64                                         
pkg: sigs.k8s.io/kustomize/kustomize/v5/commands/build
cpu: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz          
BenchmarkBuild-8               1        1639273100 ns/op
--- BENCH: BenchmarkBuild-8
    build_benchmark_test.go:162: genconfig[1].resources redefined as 2
PASS
ok      sigs.k8s.io/kustomize/kustomize/v5/commands/build       3.119s

$ DEPTH_ONE_RESOURCE_COUNT=3 go test ./kustomize/commands/build -run nope -bench BenchmarkBuild
goos: windows
goarch: amd64
pkg: sigs.k8s.io/kustomize/kustomize/v5/commands/build
cpu: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
BenchmarkBuild-8               1        2591509400 ns/op
--- BENCH: BenchmarkBuild-8
    build_benchmark_test.go:162: genconfig[1].resources redefined as 3
PASS
ok      sigs.k8s.io/kustomize/kustomize/v5/commands/build       3.841s

$ DEPTH_ONE_RESOURCE_COUNT=4 go test ./kustomize/commands/build -run nope -bench BenchmarkBuild
goos: windows
goarch: amd64
pkg: sigs.k8s.io/kustomize/kustomize/v5/commands/build
cpu: Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
BenchmarkBuild-8               1        3785031200 ns/op
--- BENCH: BenchmarkBuild-8
    build_benchmark_test.go:162: genconfig[1].resources redefined as 4
PASS
ok      sigs.k8s.io/kustomize/kustomize/v5/commands/build       5.099s

$

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Feb 1, 2024
@chlunde (Contributor, Author) commented Feb 1, 2024

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Feb 1, 2024
This change introduces a benchmarking test that constructs a
complete kustomization tree using various features of Kustomize.

This update aims to address several objectives:
* Demonstrating current performance challenges in Kustomize in a reproducible manner.
* Evaluating the effects of performance enhancements.
* Guarding against potential performance setbacks and inadvertent quadratic behavior in the future.
* Considering the possibility of incorporating profile-guided optimization (PGO) in future iterations.

Usage:

    $ make run-benchmarks
    go test -run=x -bench=BenchmarkBuild ./kustomize/commands/build

    pkg: sigs.k8s.io/kustomize/kustomize/v5/commands/build
    BenchmarkBuild-8               1        48035946042 ns/op
    PASS
    ok      sigs.k8s.io/kustomize/kustomize/v5/commands/build       48.357s

*Currently*, this benchmark requires 48 seconds to run on my machine.

Updates kubernetes-sigs#5084

Notes on PGO:

Real-life profiles would be better, but creating one based on a benchmark should not hurt:

https://go.dev/doc/pgo#collecting-profiles

> Will PGO with an unrepresentative profile make my program slower than no PGO?
> It should not. While a profile that is not representative of production behavior will result in optimizations in cold parts of the application, it should not make hot parts of the application slower. If you encounter a program where PGO results in worse performance than disabling PGO, please file an issue at https://go.dev/issue/new.

Collecting a profile:

    go test -cpuprofile cpu1.pprof -run=^$ -bench ^BenchmarkBuild$ sigs.k8s.io/kustomize/kustomize/v5/commands/build

    go build -pgo=./cpu1.pprof  -o kust-pgo ./kustomize
    go build -o kust-nopgo ./kustomize

Compare PGO and non-PGO-builds:

    ./kust-pgo build -o /dev/null testdata/  21.88s user 2.00s system 176% cpu 13.505 total
    ./kust-nopgo build -o /dev/null testdata/  22.76s user 1.98s system 174% cpu 14.170 total
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: chlunde
Once this PR has been reviewed and has the lgtm label, please assign koba1t for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

chlunde marked this pull request as ready for review on April 16, 2024 at 19:37
k8s-ci-robot removed the do-not-merge/work-in-progress label on Apr 16, 2024
@chlunde (Contributor, Author) commented Apr 16, 2024

I've added a few more code comments as requested, rebased on master and updated the code to use labels instead of commonLabels.

I've also reduced the dataset so that it runs in less than one minute on my machine (48s).
On top of PR #5427 it takes around 2.4 seconds. @ncapps, is there anything else you would like?

Comment on lines +104 to +105
fn := fmt.Sprintf("res%d", res)
fmt.Fprintf(&buf, " - %v\n", fn)
Member:

This is meant to be a list of resources to be rendered in the kustomization, is that correct?
Could this be missing a .yaml extension?
I'm also curious if you need the %v formatter here -- since these are all strings I imagine %s is enough.
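
In other words, the suggested change would be roughly the following (a sketch only; the .yaml extension reflects my reading of the reviewer's question, not the PR's code):

    fn := fmt.Sprintf("res%d.yaml", res)
    fmt.Fprintf(&buf, " - %s\n", fn)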

// as an index into the given configs slice.
//
// The function is recursive and will call itself for config as long as resources > 0.
func makeKustomization(configs []GenConfig, fSys filesys.FileSystem, path, id string, depth int) error {
Member:

A couple of questions on this function:

  • Is this being built with string concatenation to avoid the overhead of YAML serialization?
  • Is there a reason why passing the *testing.B parameter here and marking it as a helper with b.Helper() would be undesirable?
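
To make the second question concrete, here is a rough sketch of the helper taking a *testing.B (the GenConfig field and the body are assumptions for illustration, not the PR's actual code):

    package build

    import (
        "testing"

        "sigs.k8s.io/kustomize/kyaml/filesys"
    )

    // GenConfig stands in for the PR's per-depth sizing type; only the field
    // mentioned elsewhere in this thread is shown.
    type GenConfig struct {
        resources int
    }

    // makeKustomization, sketched with a *testing.B parameter: b.Helper() keeps
    // failure locations pointing at the caller, and errors fail the benchmark
    // directly instead of being returned up the call chain.
    func makeKustomization(b *testing.B, configs []GenConfig, fSys filesys.FileSystem, path, id string, depth int) {
        b.Helper()
        if depth >= len(configs) || configs[depth].resources == 0 {
            return
        }
        if err := fSys.WriteFile(path+"/kustomization.yaml", []byte("resources: []\n")); err != nil {
            b.Fatalf("writing kustomization under %s: %v", path, err)
        }
        // ...generate resource files and recurse for each child, as in the original helper...
    }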
