New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: initial body of self-hosted runners TF modules #343
base: main
Are you sure you want to change the base?
WIP: initial body of self-hosted runners TF modules #343
Conversation
b9da9a7
to
19c2a39
Compare
Thanks. I'll have time to take a peek early next week. |
Signed-off-by: greg pereira <grpereir@redhat.com>
19c2a39
to
76d46b8
Compare
Signed-off-by: greg pereira <grpereir@redhat.com>
@Gregory-Pereira Review Q: Is there a good reason for burying the whole IMHO, this directory's purpose is already heavily abused. I'm not going to block over it, but would you consider moving |
Definitely not married to sticking it in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lot of content, so I only made a quick, incomplete pass with some suggestions. Hopefully they're helpful.
I'd highly recommend adding some PR-level CI here or as a followup. Even just a simple run of terraform fmt
, tf validate
, and/or tflint
would be really helpful in catching minor booboo's before wasting time committing broken changes. Same goes for the Ansible content. This will also be really important for vetting any Renovate proposed update PRs.
I appreciate all the README's, though I'm concerned if maintaining all the version-references (in base
and lambdas-download
) is possible and practical? Many of them are going to end up encoded in the backend provider config's anyway, no? ISTM these READMEs could be more helpful if they simply contained a short synopsis of the directory's purpose, and intended use (by what/whom).
Related, please consider adding a top-level tf/README.md
explaining (even briefly) what all the content is and how it should be used (i.e. ref. to the workflow file). At a glance, the arm64
, default
and windows
READMEs are all very similar. Perhaps they could be condensed/simplified into a single top-level tf/README.md
instead?
@@ -9,3 +9,4 @@ models/* | |||
convert_models/converted_models | |||
recipes/common/bin/* | |||
*/.venv/ | |||
.github/tf/*/.terraform.lock.hcl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could be wrong, but I thought it was standard practice to check these into VCS and allow terraform to manage them. No?
# AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} | ||
# AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} | ||
# AWS_DEFAULT_REGION: "eu-west-2" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommend either adding a comment explaining why these lines are commented out, or simply removing them if they're not used.
integration-tests: | ||
if: github.repository == 'containers/ai-lab-recipes' | ||
strategy: | ||
fail-fast: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make some guesses why you set this false
. But it would be helpful to future maintainers if you added a short comment explaining why this non-default is necessary.
with: | ||
repository: containers/ai-lab-recipes | ||
path: ai-lab-recipes/.github/tf/multi-runner | ||
ref: 'main' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC main
is the default value, no?
TF_VAR_aws_ami_name: '["Fedora-Cloud-Base-*"]' | ||
TF_VAR_aws_volume_size: 100 | ||
TF_VAR_aws_access_key: ${{ secrets.AWS_ACCESS_KEY_ID }} | ||
TF_VAR_aws_secret_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend setting TF_INPUT=0
|
||
- name: Download Lambdas | ||
id: download-lambdas | ||
run: terraform apply -auto-approve -lock=false -var=module_version=${{ env.tf_version }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-var=module_version=${{ env.tf_version }}
is confusing for me. If this is intended, I'd recommend adding a comment so future maintainers understand why the values are congruent.
Same suggestion for all the places below where module_version
is set this way.
terraform_wrapper: false | ||
|
||
- name: Init Lambdas | ||
run: terraform init |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider if adding --upgrade
is useful here, and all the other similar calls below.
working-directory: ${{ matrix.tf_dir }} | ||
|
||
- name: Bootstrap | ||
id: up |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I missed it, but if this id
isn't needed/used anywhere I'd suggest removing it.
- name: Destroy Test Environment | ||
id: down | ||
if: always() | ||
run: terraform destroy -auto-approve -lock=false -var=module_version=${{ env.tf_version }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing -lock=false
makes my ears stand up. Would you consider adding a comment explaining why this is both safe and necessary?
with: | ||
payload: | | ||
{ | ||
"text": "${{ github.workflow }} workflow status: ${{ job.status }}\n${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be asking too much, if so please ignore the suggestion. Since this is running every hour (including over weekends), a matrix is in use, and there are a lot of complex steps involved, it would be super-helpful to operators to have more context clues. For example, providing the matrix tf_dir
value, and if Ansible failed, provide the final-status table or last few lines of stderr.
This does a few things:
- It lets operators easily see (without clicking) if multiple failures might be the same or related.
- When glancing at possibly hundreds of historical slack messages, quickly pick out any that are DIFFERENT.
- If other monitoring tools are processing the status messages, the additional details will help them better categorize and/or take more relevant actions.
Addresses: #264
Still WIP but cc @lmilbaum @cevich
Other things that need to happen in tandem: