Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci(ssh): avoid ssh timeouts from the server #5370

Closed
wants to merge 6 commits into from

Conversation

gustavovalverde
Copy link
Member

Motivation

We've been trying multiple solutions to our SSH connection issues, our last try solving this issues was PR https://github.com/ZcashFoundation/zebra/pull/5367/files

Depends-On: #5367

This is an optional PR if #5367 fails to solve the issue

Expected behavior

An SSH connection should not be terminated by the server, the connection must be kept alive indefinitely until it's killed by GitHub Actions

Solution

Disable TCP keepalive messages from the server and set ClientAliveCountMax to 0, which disables connection termination

Review

Anyone can review this

Reviewer Checklist

  • Will the PR name make sense to users?
    • Does it need extra CHANGELOG info? (new features, breaking changes, large changes)
  • Are the PR labels correct?
  • Does the code do what the ticket and PR says?
  • How do you know it works? Does it have tests?

Motivation:
We've been trying multiple solutions to our SSH connection issues, our last
try solving this issues was PR https://github.com/ZcashFoundation/zebra/pull/5367/files

Depends-On: #5367

Expected behavior:
An SSH connection should not be terminated by the server, the connection
must be kept alive indefinitely until it's killed by GitHub Actions

Solution:
Disable TCP keepalive messages from the server and set `ClientAliveCountMax`
to 0, which disables connection termination
@gustavovalverde gustavovalverde added C-bug Category: This is a bug A-infrastructure Area: Infrastructure changes A-devops Area: Pipelines, CI/CD and Dockerfiles I-integration-fail Continuous integration fails, including build and test failures P-Optional ✨ labels Oct 10, 2022
@gustavovalverde gustavovalverde self-assigned this Oct 10, 2022
@github-actions github-actions bot added the C-trivial Category: A trivial change that is not worth mentioning in the CHANGELOG label Oct 10, 2022
@@ -11,6 +14,8 @@ sudo cat /etc/ssh/sshd_config
echo
echo 'Modifying config:'
echo 'MaxStartups 500' | sudo tee --append /etc/ssh/sshd_config \
echo 'TCPKeepAlive no' | sudo tee --append /etc/ssh/sshd_config \
Copy link
Contributor

@teor2345 teor2345 Oct 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disabling TCP keepalive packets will make some firewalls terminate the connection as inactive.
Is there some documentation supporting this change?

@@ -11,6 +14,8 @@ sudo cat /etc/ssh/sshd_config
echo
echo 'Modifying config:'
echo 'MaxStartups 500' | sudo tee --append /etc/ssh/sshd_config \
echo 'TCPKeepAlive no' | sudo tee --append /etc/ssh/sshd_config \
echo 'ClientAliveCountMax 0' | sudo tee --append /etc/ssh/sshd_config \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to set this to a high number instead (6 hours), to avoid stuck connections exhausting the connection limit.

Or we can also increase the authenticated connection limit so it's higher than the number of connections we expect from the longest workflow. (About 20.)

Copy link
Contributor

@teor2345 teor2345 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering if there is a specific log message or failure you're trying to fix here?

Edit: yes, this is a known issue.

@teor2345
Copy link
Contributor

I'm also going to ask the new CI change questions from the retro:

  • how are we sure that this change is the best solution to the problem?
  • how are we going to test this?
  • what manual testing has been done?

Base automatically changed from revert-ssh-fix to main October 11, 2022 00:11
@teor2345
Copy link
Contributor

Just wondering if there is a specific log message or failure you're trying to fix here?

This is actually happening in CI:

Timeout, server 34.123.232.71 not responding.

#5391 (comment)

@gustavovalverde
Copy link
Member Author

We're no longer needing this. We might consider to open this once again if we see SSH issues reappear

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-devops Area: Pipelines, CI/CD and Dockerfiles A-infrastructure Area: Infrastructure changes C-bug Category: This is a bug C-trivial Category: A trivial change that is not worth mentioning in the CHANGELOG I-integration-fail Continuous integration fails, including build and test failures
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants