Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: initial body of self-hosted runners TF modules #343

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

Gregory-Pereira
Copy link
Collaborator

Addresses: #264
Still WIP but cc @lmilbaum @cevich

Other things that need to happen in tandem:

  1. Github app needs a private key generated for it and that needs to be uploaded as a repo secret.
  2. Other variables need to be added such as VPC and subnets
  3. Actually running the recipe workloads after machiens are provisioned, and filtering of jobs per ARCH + OS

@Gregory-Pereira Gregory-Pereira marked this pull request as draft April 25, 2024 18:26
@Gregory-Pereira Gregory-Pereira force-pushed the 264-tf-github-self-hosted-runners branch 2 times, most recently from b9da9a7 to 19c2a39 Compare April 26, 2024 04:07
@cevich
Copy link
Member

cevich commented Apr 26, 2024

Thanks. I'll have time to take a peek early next week.

Signed-off-by: greg pereira <grpereir@redhat.com>
@Gregory-Pereira Gregory-Pereira force-pushed the 264-tf-github-self-hosted-runners branch from 19c2a39 to 76d46b8 Compare April 28, 2024 02:13
Signed-off-by: greg pereira <grpereir@redhat.com>
@cevich
Copy link
Member

cevich commented Apr 29, 2024

@Gregory-Pereira Review Q: Is there a good reason for burying the whole tf tree under the (hidden) .github directory?

IMHO, this directory's purpose is already heavily abused. I'm not going to block over it, but would you consider moving tf under a more meaningful cicd, or contrib/tf (or similar) directory?

@Gregory-Pereira
Copy link
Collaborator Author

Gregory-Pereira commented Apr 29, 2024

Definitely not married to sticking it in the .github directory, Ill include this in my next update to this PR

Copy link
Member

@cevich cevich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot of content, so I only made a quick, incomplete pass with some suggestions. Hopefully they're helpful.

I'd highly recommend adding some PR-level CI here or as a followup. Even just a simple run of terraform fmt, tf validate, and/or tflint would be really helpful in catching minor booboo's before wasting time committing broken changes. Same goes for the Ansible content. This will also be really important for vetting any Renovate proposed update PRs.

I appreciate all the README's, though I'm concerned if maintaining all the version-references (in base and lambdas-download) is possible and practical? Many of them are going to end up encoded in the backend provider config's anyway, no? ISTM these READMEs could be more helpful if they simply contained a short synopsis of the directory's purpose, and intended use (by what/whom).

Related, please consider adding a top-level tf/README.md explaining (even briefly) what all the content is and how it should be used (i.e. ref. to the workflow file). At a glance, the arm64, default and windows READMEs are all very similar. Perhaps they could be condensed/simplified into a single top-level tf/README.md instead?

@@ -9,3 +9,4 @@ models/*
convert_models/converted_models
recipes/common/bin/*
*/.venv/
.github/tf/*/.terraform.lock.hcl
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be wrong, but I thought it was standard practice to check these into VCS and allow terraform to manage them. No?

Comment on lines +22 to +24
# AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
# AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
# AWS_DEFAULT_REGION: "eu-west-2"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommend either adding a comment explaining why these lines are commented out, or simply removing them if they're not used.

integration-tests:
if: github.repository == 'containers/ai-lab-recipes'
strategy:
fail-fast: false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make some guesses why you set this false. But it would be helpful to future maintainers if you added a short comment explaining why this non-default is necessary.

with:
repository: containers/ai-lab-recipes
path: ai-lab-recipes/.github/tf/multi-runner
ref: 'main'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC main is the default value, no?

TF_VAR_aws_ami_name: '["Fedora-Cloud-Base-*"]'
TF_VAR_aws_volume_size: 100
TF_VAR_aws_access_key: ${{ secrets.AWS_ACCESS_KEY_ID }}
TF_VAR_aws_secret_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend setting TF_INPUT=0


- name: Download Lambdas
id: download-lambdas
run: terraform apply -auto-approve -lock=false -var=module_version=${{ env.tf_version }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-var=module_version=${{ env.tf_version }} is confusing for me. If this is intended, I'd recommend adding a comment so future maintainers understand why the values are congruent.

Same suggestion for all the places below where module_version is set this way.

terraform_wrapper: false

- name: Init Lambdas
run: terraform init
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider if adding --upgrade is useful here, and all the other similar calls below.

working-directory: ${{ matrix.tf_dir }}

- name: Bootstrap
id: up
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I missed it, but if this id isn't needed/used anywhere I'd suggest removing it.

- name: Destroy Test Environment
id: down
if: always()
run: terraform destroy -auto-approve -lock=false -var=module_version=${{ env.tf_version }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing -lock=false makes my ears stand up. Would you consider adding a comment explaining why this is both safe and necessary?

with:
payload: |
{
"text": "${{ github.workflow }} workflow status: ${{ job.status }}\n${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be asking too much, if so please ignore the suggestion. Since this is running every hour (including over weekends), a matrix is in use, and there are a lot of complex steps involved, it would be super-helpful to operators to have more context clues. For example, providing the matrix tf_dir value, and if Ansible failed, provide the final-status table or last few lines of stderr.

This does a few things:

  1. It lets operators easily see (without clicking) if multiple failures might be the same or related.
  2. When glancing at possibly hundreds of historical slack messages, quickly pick out any that are DIFFERENT.
  3. If other monitoring tools are processing the status messages, the additional details will help them better categorize and/or take more relevant actions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants