openshift/os: increase requested resources #29031

Closed
miabbott wants to merge 1 commit

Conversation

miabbott
Member

The `ostree` operations that happen as part of the `cosa-build` image
build are incredibly memory hungry during the `ostree commit` and
`ostree container` operations. They are moving upwards of 2G of data
into memory and onto disk and vice-versa.

The original resource requests were insufficient, causing the CI jobs
to be incredibly slow and sometimes even timing out completely. It's
been observed that the `ostree container encapsulate` operation ends
up requesting nearly 6Gi of memory.

This bumps both the memory requests and the CPU requests for the image
builds. It should give the jobs some healthy head room to perform the
operations at a reasonable pace.
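As a rough local check of that claim (not part of the CI change itself), the peak memory of the encapsulate step can be measured with GNU time; the repo path and ref below are placeholders rather than the exact ones the cosa build uses, and flag placement may vary by ostree version:

```
# Illustrative only: tmp/repo and the ref are placeholders.
# GNU time's "Maximum resident set size" is the peak RSS the kubelet
# would account against the container's memory request.
$ /usr/bin/time -v ostree container encapsulate --repo=tmp/repo \
    fedora/x86_64/coreos/testing-devel oci-archive:/tmp/out.ociarchive
...
	Maximum resident set size (kbytes): ...
```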
@miabbott
Member Author

Evidence of the extra memory request -
[image: memory request metrics]

openshift-ci bot requested review from jmarrero and travier on May 31, 2022 16:50
@openshift-ci
Contributor

openshift-ci bot commented May 31, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: miabbott

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on May 31, 2022
@miabbott
Member Author

I could be convinced of bumping the CPU request back down to 2000m, but I think the extra cycles will benefit the speed of the jobs.

@miabbott
Member Author

Memory observed to actually go over 8Gi during the final container commit operation of the image build!

[image: memory usage graph]

@cgwalters
Member

> It's been observed that the `ostree container encapsulate` operation ends up requesting nearly 6Gi of memory.

Hmm...that must be a bug somewhere. I'm not immediately seeing large heap usage here, peaking at just 3.4MB.

@miabbott
Member Author

>> It's been observed that the `ostree container encapsulate` operation ends up requesting nearly 6Gi of memory.
>
> Hmm...that must be a bug somewhere. I'm not immediately seeing large heap usage here, peaking at just 3.4MB.

To be fair, I was loosely correlating that operation with what I was seeing in the metrics dashboard, so there could be a misalignment.

On this topic of resource usage, I've not been able to reproduce the drastic slowness when spinning up a cosa pod on build02 and doing `cosa build`.

I'm beginning to think there are some special conditions applied to the `cosa-build` pod; I think it is being run as an OpenShift Build rather than a normally scheduled pod, and I wonder if there are constraints there.
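If it really is running as an OpenShift Build, a few quick checks could confirm that and show what the pod is actually consuming; the namespace and build name below are placeholders:

```
# Placeholders: substitute the actual CI namespace and build name.
$ oc get builds -n <ci-namespace>                      # is the job a Build object?
$ oc describe build <build-name> -n <ci-namespace> | grep -i -A4 resources
$ oc adm top pods -n <ci-namespace>                    # live CPU/memory per pod
```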

@miabbott
Member Author

Well, giving the cosa-build job more memory didn't seem to improve things:

`INFO[2022-05-31T20:00:54Z] Ran for 3h7m6s`

Open to new suggestions

@miabbott
Member Author

miabbott commented Jun 1, 2022

/retest

cgwalters added a commit to cgwalters/coreos-assembler that referenced this pull request Jun 1, 2022
In openshift/release#29031 we are
debugging very slow build times.  Of the approximately 3h build
time, 30 minutes is spent compressing all the files into the archive
repo in `tmp/repo`.

This is all essentially wasted time, because we now canonically represent
the ostree commit as an ociarchive, which is then re-compressed
differently.

Eventually, we should drop `tmp/repo` and have `cache/repo-build`
be the canonical uncompressed cache.

In the short term though, ostree makes it easy to turn down the
zlib compression level, which can have a dramatic impact here.

Locally on my desktop:

Before:

```
$ time sudo ostree --repo=tmp/repo pull-local cache/repo-build/ 988a1ffb47df4dda08df4d97d8e5f39f34c624d5c54b9c870f696203011758ef
3009 metadata, 19604 content objects imported; 1.3 GB content written

________________________________________________________
Executed in    8.33 secs    fish           external
   usr time   44.23 secs  836.00 micros   44.23 secs
   sys time    3.95 secs  108.00 micros    3.95 secs
```

After:

```
$ time sudo ostree --repo=tmp/repo pull-local cache/repo-build/ 988a1ffb47df4dda08df4d97d8e5f39f34c624d5c54b9c870f696203011758ef
3009 metadata, 19604 content objects imported; 1.3 GB content written

________________________________________________________
Executed in    6.09 secs    fish           external
   usr time   21.94 secs    0.00 micros   21.94 secs
   sys time    4.34 secs  955.00 micros    4.34 secs
```

The wall clock time isn't hugely different, but that's because
my desktop is a hyperthreaded, otherwise idle i9-9900k.  The actual
CPU time spent is notably lower.

In the Prow cluster where we're contending for CPU on slower processors,
and further we are limited by cpu shares, this should help.
cgwalters added a commit to coreos/coreos-assembler that referenced this pull request Jun 1, 2022 (same commit message as above)
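For reference, the compression knob the commit message is talking about appears to be ostree's repo-level archive setting; a minimal sketch of tuning it by hand is below. The `archive.zlib-level` key name is an assumption taken from ostree.repo-config(5) and may not be exactly how coreos-assembler applies it:

```
# Assumed key: archive.zlib-level (check ostree.repo-config(5) for your version).
# Lower levels trade compression ratio for much less CPU time.
$ ostree --repo=tmp/repo config set archive.zlib-level 2
$ ostree --repo=tmp/repo config get archive.zlib-level
2
```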
@miabbott
Member Author

miabbott commented Jun 2, 2022

coreos/coreos-assembler#2888 landed; let's see if that improves things here

/retest

@miabbott
Member Author

miabbott commented Jun 2, 2022

/retest

@openshift-ci
Contributor

openshift-ci bot commented Jun 3, 2022

@miabbott: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/rehearse/openshift/os/master/test-in-cluster | 5c4211e | link | unknown | /test pj-rehearse |
| ci/rehearse/openshift/os/master/test-qemu-kola | 5c4211e | link | unknown | /test pj-rehearse |
| ci/prow/pj-rehearse | 5c4211e | link | false | /test pj-rehearse |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@miabbott
Member Author

This isn't the fix we want; see openshift/os#839 and #29329

/close

@openshift-ci
Contributor

openshift-ci bot commented Jun 10, 2022

@miabbott: Closed this PR.

In response to this:

> This isn't the fix we want; see openshift/os#839 and #29329
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot closed this on Jun 10, 2022