
ci(ssh): revert using ssh-compute action & increase sshd connection limit #5367

Merged
merged 7 commits on Oct 11, 2022

Conversation

@gustavovalverde gustavovalverde (Member) commented Oct 10, 2022

Motivation

We've been seeing more SSH errors than before since implementing the ssh-compute action

Solution

  • Revert the ssh-compute implementation
  • Stop using IAP for TCP forwarding, as it has known limitations that might be counterproductive for our use cases
  • Keep using sudo in some docker commands, as we might be connecting to the VMs as a non-root user
  • Set MaxStartups in sshd to 500, so that a dropped SSH connection doesn't cause the GitHub Actions job to fail (see the sketch after this list)
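
A minimal sketch of that sshd change, assuming it is applied from gcp-vm-startup-script.sh (these exact commands are an assumption, not the PR's verbatim contents):

```bash
# Hypothetical sketch: raise sshd's limit on concurrent unauthenticated
# connections so parallel CI jobs aren't dropped before authenticating.
sudo sed -i 's/^#\?MaxStartups.*/MaxStartups 500/' /etc/ssh/sshd_config
# Append the setting if the config had no MaxStartups line at all.
grep -q '^MaxStartups' /etc/ssh/sshd_config ||
  echo 'MaxStartups 500' | sudo tee -a /etc/ssh/sshd_config >/dev/null
sudo systemctl restart sshd
```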

Note: the tj-actions/changed-files file comparison was failing in this PR, and it has also been failing on the main branch, so we're adding a fix here too, as the failure would block this PR from merging. The actual explanation of the fix is in tj-actions/changed-files#639 (comment)

Closes #5358
Closes #5365
Fixes #5362
Fixes #5361

Review

If CI passes, anyone can review this PR

Reviewer Checklist

  • Will the PR name make sense to users?
    • Does it need extra CHANGELOG info? (new features, breaking changes, large changes)
  • Are the PR labels correct?
  • Does the code do what the ticket and PR says?
  • How do you know it works? Does it have tests?
    • Tested: I've verified this works by manually SSHing into the VM and confirming the sshd configuration was changed to match the startup script gcp-vm-startup-script.sh

Follow Up Work

We might also want to wait up to 90 seconds after a VM has been created, just so we're sure all configuration is complete (one possible approach is sketched below).
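
A hypothetical shape for that wait; the variable names and the gcloud-based SSH check are illustrative assumptions, not the PR's code:

```bash
# Hypothetical sketch: poll the new VM for up to 90 seconds (18 x 5s)
# until it accepts an SSH command, then proceed.
for _ in $(seq 1 18); do
  if gcloud compute ssh "$VM_NAME" --zone "$ZONE" --command 'true' 2>/dev/null; then
    break
  fi
  sleep 5
done
```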

@gustavovalverde gustavovalverde added C-bug Category: This is a bug A-infrastructure Area: Infrastructure changes A-devops Area: Pipelines, CI/CD and Dockerfiles P-Critical 🚑 I-integration-fail Continuous integration fails, including build and test failures labels Oct 10, 2022
@gustavovalverde gustavovalverde self-assigned this Oct 10, 2022
@gustavovalverde gustavovalverde requested a review from a team as a code owner October 10, 2022 14:18
@gustavovalverde gustavovalverde requested review from teor2345 and removed request for a team October 10, 2022 14:18
@github-actions github-actions bot added the C-trivial Category: A trivial change that is not worth mentioning in the CHANGELOG label Oct 10, 2022
@gustavovalverde gustavovalverde changed the title Revert "ci(ssh): connect using ssh-compute action by Google (#5330)" ci(ssh): revert connect using ssh-compute action & increase sshd connection limit Oct 10, 2022
@gustavovalverde gustavovalverde changed the title ci(ssh): revert connect using ssh-compute action & increase sshd connection limit ci(ssh): revert using ssh-compute action & increase sshd connection limit Oct 10, 2022
gustavovalverde added a commit that referenced this pull request Oct 10, 2022
Motivation:
We've been trying multiple solutions to our SSH connection issues; our last
attempt at solving them was PR https://github.com/ZcashFoundation/zebra/pull/5367/files

Depends-On: #5367

Expected behavior:
An SSH connection should not be terminated by the server; the connection
must be kept alive indefinitely, until it's killed by GitHub Actions

Solution:
Disable TCP keepalive messages from the server and set `ClientAliveCountMax`
to 0, which disables connection termination
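
A minimal sketch of those sshd settings as they might be applied in the startup script (the exact commands are an assumption):

```bash
# Hypothetical sketch: apply the keepalive settings described above.
# TCPKeepAlive no        -> the server stops sending TCP keepalive probes,
#                           so a lost probe can't tear down the session
# ClientAliveCountMax 0  -> per sshd_config(5), zero disables connection
#                           termination by the client-alive mechanism
sudo tee -a /etc/ssh/sshd_config >/dev/null <<'EOF'
TCPKeepAlive no
ClientAliveCountMax 0
EOF
sudo systemctl restart sshd
```
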
@teor2345 teor2345 (Contributor) left a comment

Looks great!

I'd like to check that the sshd connection limit adjustment actually worked before we merge this, since we're calling the script a different way now.

We might also want to use the bullseye image for the instances.

.github/workflows/deploy-gcp-tests.yml (5 resolved review threads)
@gustavovalverde gustavovalverde (Member Author) commented Oct 10, 2022

Sorry for adding chore: fix tj-actions/changed-files file comparison to this PR, but it's annoying having an ❌ here, and since this branch will be pulled into main, fixing it here is minimal.

Edit: I had to fix it here anyway, as the PR wouldn't merge otherwise

@gustavovalverde (Member Author)

CI is failing with a new (unrelated?) error 🥲

e2fsck: Cannot continue, aborting.
/dev/sdb is in use.

@teor2345 (Contributor)

CI is failing with a new (unrelated?) error 🥲

e2fsck: Cannot continue, aborting.
/dev/sdb is in use.

We've changed two things related to the disk in this PR:

Here are some things we could try:

  • check the server logs
  • find out what is using the disk, by running lsof when e2fsck or resize2fs fails (see the sketch after this list)
  • move e2fsck and resize2fs to the startup script
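
A hypothetical diagnostic along the lines of the second bullet:

```bash
# Hypothetical sketch: if checking or resizing fails, log which processes
# are holding /dev/sdb open before the job exits.
sudo e2fsck -pf /dev/sdb || sudo lsof /dev/sdb
sudo resize2fs /dev/sdb || sudo lsof /dev/sdb
```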

Is there anything else we could try?

@teor2345 (Contributor)

Is there anything else we could try?

That would only work if the instance startup script is causing the disk to be used.

@teor2345 (Contributor)

@gustavovalverde I just realised that these disk resize commands are only needed when we change the disk size.

Specifically, they are only needed between:

  • when we merge the change to main, and
  • when we stop using cached states with the smaller size (after the new larger full sync image is generated, or we've created new updated cached states with resized disks).

So let's ignore the failure for now, and fix it if it becomes a problem after #5085 ?

@gustavovalverde (Member Author)

So let's ignore the failure for now, and fix it if it becomes a problem after #5085 ?

If a re-run fixes the issue, we can ignore it for now. In any case, I'm testing a PR to handle this repartitioning: #5371

@teor2345 teor2345 (Contributor) left a comment

Thanks, looks good, let's get it fixed!
