Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactors cosa session initialization #821

Conversation

cheesesashimi
Copy link
Member

@cheesesashimi cheesesashimi commented Jun 6, 2022

Right now, the OpenShift CI jobs attached to this repository have the bottleneck of running $ cosa build within a container image build context. Because of how the OpenShift CI system is designed, it's not likely that we can use KVM within the container image build process to accelerate things. Despite this limitation, I've noticed that while the cosa-build image build itself usually does not time out, the 2+ hours it takes to run leaves us with very little headroom to run our test stages afterward. Despite this, we still need to produce the artifact in cosa-build in the way that we do. Here's how I plan to address this:

  1. Separate the $ cosa init and $ cosa build steps in https://github.com/openshift/os/blob/master/ci/prow-build.sh.
  2. Add another image build in openshift/release that only does the $ cosa init stuff (lets call this image cosa-init for now). This ensures we're using the same COSA session across all of the build and test stages.
  3. Pass the cosa-init image into the current image build flow to do $ cosa build (this becomes the input to the cosa-build image build). This will still be the image that gets promoted if all the required tests pass.
  4. We also pass the cosa-init image into the test-qemu-* test stages. Each test stage will run its own $ cosa build (with KVM acceleration) and execute the kola tests (also with KVM acceleration). These stages will run as soon as the cosa-init image is built and will run concurrently with the cosa-build image build.

This PR implements point 1) above by consolidating the $ cosa init (and the rest of the preparation steps) into the ci/prow-prepare.sh script. This also modifies ci/prow-build-test-qemu.sh and ci/prow-build.sh to make use of the new consolidated initialization flow. Points 2-4 are validated by openshift/release#29187.

Once this PR is merged, we'll need another PR to openshift/release to add the cosa-init stage.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 6, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 6, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot requested review from cgwalters and jmarrero June 6, 2022 19:42
@cheesesashimi cheesesashimi force-pushed the zzlotnik/refactor-build-scripts branch 2 times, most recently from d2b3949 to b2e23f0 Compare June 6, 2022 20:00
@cheesesashimi
Copy link
Member Author

/test all

@cheesesashimi cheesesashimi marked this pull request as ready for review June 6, 2022 20:36
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 6, 2022
@openshift-ci openshift-ci bot requested review from marmijo and prestist June 6, 2022 20:41
@cheesesashimi
Copy link
Member Author

/assign @miabbott

@cgwalters
Copy link
Member

Each test stage will run its own $ cosa build (with KVM acceleration)

Wait is that happening today? Why? I'd expect us to have a single built ostree image and a single qemu disk image that we test.

@travier
Copy link
Member

travier commented Jun 7, 2022

See openshift/release#29187. Builds are too slow in the build phase thus Zack moved them back to the test phase.

@cheesesashimi
Copy link
Member Author

Wait is that happening today? Why? I'd expect us to have a single built ostree image and a single qemu disk image that we test.

What's happening today is that the ostree image is being built within a container image build stage. We're seeing a lot of test timeouts because we can't use KVM in an image build to accelerate the build process (this is a limitation of OpenShift CI). So while the image builds themselves are not timing out, the time they take to build leaves us with very little headroom by the time the test stages run. However, since this is the only reasonable way to get an RHCOS container image into the ImageStream, we still need to do this in an image build.

What this PR sets the stage for is a followup PR to openshift/release. The followup PR will use these scripts to create a COSA session in the image build stage and pass that off to the kola test stages. Each test stage will run $ cosa build and its test suite. These will run concurrently with the cosa-build image build stage. The reasoning behind this is the test stage can perform the build much faster than the image build stage can; however we still need the RHCOS image produced by the cosa-build image build stage.

@cheesesashimi
Copy link
Member Author

/test all

@cheesesashimi cheesesashimi force-pushed the zzlotnik/refactor-build-scripts branch from b2e23f0 to bfe4264 Compare June 7, 2022 14:20
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 7, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cheesesashimi
To complete the pull request process, please ask for approval from miabbott after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

This ensures that regardless of entrypoint, the cosa session is always
initialized the same way. This also separates the cosa session
initialization steps from the cosa build steps to aid in speeding up CI.
@cheesesashimi cheesesashimi force-pushed the zzlotnik/refactor-build-scripts branch from bfe4264 to 02f1768 Compare June 7, 2022 14:33
@LorbusChris
Copy link
Member

LorbusChris commented Jun 7, 2022

@cheesesashimi it should be configureable to expose KVM on CI build farms, but I'm not sure that's already the case on all of them. If the builds are scheduled on clusters where KVM is not exposed, we should be able to configure them to force scheduling on one where KVM is actually exposed.

@cheesesashimi
Copy link
Member Author

@LorbusChris I know that's possible for the test stages and we do make use of that already. However, I don't know if that is possible for the image build stages. If my memory serves correctly, even if the image build pods have KVM exposed, I don't think the image build context can make use of it.

@cgwalters
Copy link
Member

Looks like this needs the export COSA_SKIP_OVERLAY=1 bits

@cheesesashimi
Copy link
Member Author

I'm thinking we may want to combine this with #827.

@cgwalters
Copy link
Member

If my memory serves correctly, even if the image build pods have KVM exposed, I don't think the image build context can make use of it.

Yeah, though this is something we can fix; part of the idea of us "self hosting" CI is our own needs can drive product changes. Not saying anyone should do that right now, but...if it keeps being painful we should consider it.

The other alternative though which is probably simpler is to do the ostree-container (and disk image) builds in a regular pod, then push the container image to some external registry (or even a temporary registry spun up as a pod), and then later tests can pull from that.

Alternatively, try to figure out how support grafting in more complex/arbitrary workflows into ci-operator/prow. Perhaps even something like having a prow job invoke tekton.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 10, 2022

@cheesesashimi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/test-in-cluster 02f1768 link true /test test-in-cluster
ci/prow/test-qemu-firmware-uefi 02f1768 link true /test test-qemu-firmware-uefi
ci/prow/test-qemu-kola 02f1768 link true /test test-qemu-kola
ci/prow/test-qemu-kola-upgrade 02f1768 link true /test test-qemu-kola-upgrade
ci/prow/test-qemu-metal 02f1768 link true /test test-qemu-metal
ci/prow/test-qemu-nvme 02f1768 link true /test test-qemu-nvme
ci/prow/build-test-qemu-kola-all 02f1768 link true /test build-test-qemu-kola-all
ci/prow/build-test-qemu-kola-metal 02f1768 link true /test build-test-qemu-kola-metal
ci/prow/build-test-qemu-kola-basic 02f1768 link true /test build-test-qemu-kola-basic
ci/prow/build-test-qemu-kola-upgrade 02f1768 link true /test build-test-qemu-kola-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 10, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 10, 2022

@cheesesashimi: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@travier
Copy link
Member

travier commented Jun 10, 2022

Replaced by #839
/close

@openshift-ci openshift-ci bot closed this Jun 10, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 10, 2022

@travier: Closed this PR.

In response to this:

Replaced by #839
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants