
ci(ssh): revert using ssh-compute action & increase sshd connection limit #5367

Merged
merged 7 commits on Oct 11, 2022

Conversation

@gustavovalverde gustavovalverde (Member) commented Oct 10, 2022

Motivation

We've been seeing more SSH errors than before since implementing the ssh-compute action

Solution

  • Revert the ssh-compute implementation
  • Stop using IAP for TCP forwarding, as it has known limitations that might be counterproductive for our use cases
  • Keep using sudo in some docker commands, as we might be connecting to the VMs as a non-root user
  • Set MaxStartups in sshd to 500, so that a dropped SSH connection doesn't cause the GitHub Actions job to fail (see the sketch after this list)
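
A minimal sketch of that sshd change, assuming it is applied from gcp-vm-startup-script.sh (these exact commands are an assumption, not the PR's verbatim contents):

```bash
# Hypothetical sketch: raise sshd's limit on concurrent unauthenticated
# connections so parallel CI jobs aren't dropped before authenticating.
sudo sed -i 's/^#\?MaxStartups.*/MaxStartups 500/' /etc/ssh/sshd_config
# Append the setting if the config had no MaxStartups line at all.
grep -q '^MaxStartups' /etc/ssh/sshd_config ||
  echo 'MaxStartups 500' | sudo tee -a /etc/ssh/sshd_config >/dev/null
sudo systemctl restart sshd
```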

Note: the tj-actions/changed-files file comparison was failing in this PR, and it has also been failing on the main branch, so we're adding a fix here too, as the failure would block this PR from merging. The actual explanation of the fix is in tj-actions/changed-files#639 (comment)

Closes #5358
Closes #5365
Fixes #5362
Fixes #5361

Review

If CI passes, anyone can review this PR

Reviewer Checklist

  • Will the PR name make sense to users?
    • Does it need extra CHANGELOG info? (new features, breaking changes, large changes)
  • Are the PR labels correct?
  • Does the code do what the ticket and PR says?
  • How do you know it works? Does it have tests?
    • Tested: I've verified this works by manually SSHing into the VM and confirming the sshd configuration was changed to match the startup script gcp-vm-startup-script.sh

Follow Up Work

We might also want to wait up to 90 seconds after a VM has been created, just so we're sure all configuration is complete (one possible approach is sketched below).
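
A hypothetical shape for that wait; the variable names and the gcloud-based SSH check are illustrative assumptions, not the PR's code:

```bash
# Hypothetical sketch: poll the new VM for up to 90 seconds (18 x 5s)
# until it accepts an SSH command, then proceed.
for _ in $(seq 1 18); do
  if gcloud compute ssh "$VM_NAME" --zone "$ZONE" --command 'true' 2>/dev/null; then
    break
  fi
  sleep 5
done
```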

@gustavovalverde gustavovalverde added C-bug Category: This is a bug A-infrastructure Area: Infrastructure changes A-devops Area: Pipelines, CI/CD and Dockerfiles P-Critical 🚑 I-integration-fail Continuous integration fails, including build and test failures labels Oct 10, 2022
@gustavovalverde gustavovalverde self-assigned this Oct 10, 2022
@gustavovalverde gustavovalverde requested a review from a team as a code owner October 10, 2022 14:18
@gustavovalverde gustavovalverde requested review from teor2345 and removed request for a team October 10, 2022 14:18
@github-actions github-actions bot added the C-trivial Category: A trivial change that is not worth mentioning in the CHANGELOG label Oct 10, 2022
@gustavovalverde gustavovalverde changed the title Revert "ci(ssh): connect using ssh-compute action by Google (#5330)" ci(ssh): revert connect using ssh-compute action & increase sshd connection limit Oct 10, 2022
@gustavovalverde gustavovalverde changed the title ci(ssh): revert connect using ssh-compute action & increase sshd connection limit ci(ssh): revert using ssh-compute action & increase sshd connection limit Oct 10, 2022
gustavovalverde added a commit that referenced this pull request Oct 10, 2022
Motivation:
We've been trying multiple solutions to our SSH connection issues; our last
attempt at solving them was PR https://github.com/ZcashFoundation/zebra/pull/5367/files

Depends-On: #5367

Expected behavior:
An SSH connection should not be terminated by the server; the connection
must be kept alive indefinitely, until it's killed by GitHub Actions

Solution:
Disable TCP keepalive messages from the server and set `ClientAliveCountMax`
to 0, which disables connection termination
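
A minimal sketch of those sshd settings as they might be applied in the startup script (the exact commands are an assumption):

```bash
# Hypothetical sketch: apply the keepalive settings described above.
# TCPKeepAlive no        -> the server stops sending TCP keepalive probes,
#                           so a lost probe can't tear down the session
# ClientAliveCountMax 0  -> per sshd_config(5), zero disables connection
#                           termination by the client-alive mechanism
sudo tee -a /etc/ssh/sshd_config >/dev/null <<'EOF'
TCPKeepAlive no
ClientAliveCountMax 0
EOF
sudo systemctl restart sshd
```
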
@teor2345 teor2345 (Contributor) left a comment

Looks great!

I'd like to check that the sshd connection limit adjustment actually worked before we merge this, since we're calling the script a different way now.

We might also want to use the bullseye image for the instances.

.github/workflows/deploy-gcp-tests.yml (5 resolved review threads)
@gustavovalverde gustavovalverde (Member Author) commented Oct 10, 2022

Sorry for adding chore: fix tj-actions/changed-files file comparison to this PR, but it's annoying having an ❌ here, and since this branch will be pulled into main, fixing it here is minimal.

Edit: I had to fix it here anyway, as the PR wouldn't merge otherwise

@gustavovalverde (Member Author)

CI is failing with a new (unrelated?) error 🥲

e2fsck: Cannot continue, aborting.
/dev/sdb is in use.

@teor2345 (Contributor)

CI is failing with a new (unrelated?) error 🥲

e2fsck: Cannot continue, aborting.
/dev/sdb is in use.

We've changed two things related to the disk in this PR:

Here are some things we could try:

  • check the server logs
  • find out what is using the disk, by running lsof when e2fsck or resize2fs fails (see the sketch after this list)
  • move e2fsck and resize2fs to the startup script
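
A hypothetical diagnostic along the lines of the second bullet:

```bash
# Hypothetical sketch: if checking or resizing fails, log which processes
# are holding /dev/sdb open before the job exits.
sudo e2fsck -pf /dev/sdb || sudo lsof /dev/sdb
sudo resize2fs /dev/sdb || sudo lsof /dev/sdb
```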

Is there anything else we could try?

@teor2345 (Contributor)

Is there anything else we could try?

That would only work if the instance startup script is causing the disk to be used.

@teor2345 (Contributor)

@gustavovalverde I just realised that these disk resize commands are only needed when we change the disk size.

Specifically, they are only needed between:

  • when we merge the change to main, and
  • when we stop using cached states with the smaller size (after the new larger full sync image is generated, or we've created new updated cached states with resized disks).

So let's ignore the failure for now, and fix it if it becomes a problem after #5085 ?

@gustavovalverde (Member Author)

So let's ignore the failure for now, and fix it if it becomes a problem after #5085 ?

If a re-run fixes the issue, we can ignore it for now. In any case, I'm testing a PR to handle this repartitioning: #5371

@teor2345 teor2345 (Contributor) left a comment

Thanks, looks good, let's get it fixed!
