Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-72163] Add tests for agent endpoint resolution retries #608

Merged
merged 2 commits into from
Oct 12, 2023

Conversation

Vlatombe
Copy link
Member

@Vlatombe Vlatombe commented Dec 6, 2022

Follow-up #449

When using hudson.remoting.Engine.nonFatalJnlpAgentEndpointResolutionExceptions, the failure to resolve JNLP endpoint is considered non-fatal, so there should be retries handled by remoting.

This can help with cases where a flaky network causes initial JNLP endpoint lookup to fail.

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

Follow-up jenkinsci#449

When using `hudson.remoting.Engine.nonFatalJnlpAgentEndpointResolutionExceptions`,
the failure to resolve JNLP endpoint is considered non-fatal, so there should be retries handled by remoting.

This can help with cases where a flaky network causes initial JNLP
endpoint lookup to fail.
@sboardwell
Copy link

Hi @Vlatombe, I have two additions if possible.

  • add an option to avoid an infinite loop by setting the value of, e.g:

    • nonFatalJnlpAgentEndpointResolutionExceptionsMaxRetries max number of retries before failing
    • or nonFatalJnlpAgentEndpointResolutionExceptionsMaxSecondsElapsed for the max time elapsed before failing.
    • it might also be useful to add an 'x' second wait in between retries (possibly making it configurable?).
  • secondly, would it be possible to make these properties configurable via environment variables?

    • currently you need to set _JAVA_OPTIONS in a pod template
    • being able to set environment variables seems a cleaner solution

Copy link
Member

@basil basil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The primary intended consumer of this existing property is Swarm, which wraps Remoting around a layer that does its own retries. So this is unnecessary for Swarm. Was there another consumer you had in mind that would benefit from this?

@Vlatombe
Copy link
Member Author

@sboardwell has a customer case where it would be beneficial for the initial jnlp endpoint lookup to apply retries.

#541 also seems to stem from a similar use case.

In any case, I think this needs further formalization of the expected behaviour as the various states (lookup jnlp endpoint, connecting to tcp port, connecting through websockets) all have various different retry behaviours.

@basil
Copy link
Member

basil commented Dec 13, 2022

Yeah, I added this property to Remoting more or less as a hack to allow Swarm (with Swarm's higher-layer retrying) to work with Remoting. But I think ideally Remoting would natively support retries and exponential backoff which would allow Swarm to remove its own retrying functionality and delegating to Remoting. Happy to work toward that goal one step at a time.

@sboardwell
Copy link

If we are going to implement this, would it also be possible to give a more informative error message if it does fail?

For example, the following seems to suggest the (HTTP) endpoint is timing when, in actual fact, it was the connection to the underlying TCP port 5000x

Nov 29, 2022 1:49:37 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [http://ci-server.ci-namespace.svc.cluster.local/ci-server/]
Nov 29, 2022 1:50:07 PM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: Failed to connect to http://ci-server.ci-namespace.svc.cluster.local/ci-server/tcpSlaveAgentListener/: connect timed out
java.io.IOException: Failed to connect to http://ci-server.ci-namespace.svc.cluster.local/ci-server/tcpSlaveAgentListener/: connect timed out
at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:211)
at hudson.remoting.Engine.innerRun(Engine.java:744)
at hudson.remoting.Engine.run(Engine.java:543)
Caused by: java.net.SocketTimeoutException: connect timed out
at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)

@sboardwell
Copy link

@Vlatombe - can I have a crack at an initial non-breaking stop-gap solution to this.

We still have this problem and it looks to be a bigger discussion to solve "properly". My solution was mentioned in #608 (comment) (the first one). This would not change current behaviour but allow people to set the retries/interval value themselves if they so wish.

@basil basil changed the title Expand non-fatal JNLP agent endpoint resolution to do retries [JENKINS-72163] Add tests for agent endpoint resolution retries Oct 12, 2023
@basil basil added the test label Oct 12, 2023
@basil basil marked this pull request as ready for review October 12, 2023 20:11
Copy link
Member

@basil basil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New tests fail without #675 and pass with it.

@basil basil merged commit 737fe1e into jenkinsci:master Oct 12, 2023
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants