[JENKINS-72163] Add tests for agent endpoint resolution retries #608

Vlatombe · 2022-12-06T16:52:41Z

Follow-up #449

When using hudson.remoting.Engine.nonFatalJnlpAgentEndpointResolutionExceptions, the failure to resolve JNLP endpoint is considered non-fatal, so there should be retries handled by remoting.

This can help with cases where a flaky network causes initial JNLP endpoint lookup to fail.

Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
Ensure that the pull request title represents the desired changelog entry
Please describe what you did
Link to relevant issues in GitHub or Jira
Link to relevant pull requests, esp. upstream and downstream changes
Ensure you have provided tests - that demonstrates feature works or fixes the issue

Follow-up jenkinsci#449 When using `hudson.remoting.Engine.nonFatalJnlpAgentEndpointResolutionExceptions`, the failure to resolve JNLP endpoint is considered non-fatal, so there should be retries handled by remoting. This can help with cases where a flaky network causes initial JNLP endpoint lookup to fail.

sboardwell · 2022-12-12T12:46:23Z

Hi @Vlatombe, I have two additions if possible.

add an option to avoid an infinite loop by setting the value of, e.g:
- nonFatalJnlpAgentEndpointResolutionExceptionsMaxRetries max number of retries before failing
- or nonFatalJnlpAgentEndpointResolutionExceptionsMaxSecondsElapsed for the max time elapsed before failing.
- it might also be useful to add an 'x' second wait in between retries (possibly making it configurable?).
secondly, would it be possible to make these properties configurable via environment variables?
- currently you need to set _JAVA_OPTIONS in a pod template
- being able to set environment variables seems a cleaner solution

basil

The primary intended consumer of this existing property is Swarm, which wraps Remoting around a layer that does its own retries. So this is unnecessary for Swarm. Was there another consumer you had in mind that would benefit from this?

Vlatombe · 2022-12-13T08:49:02Z

@sboardwell has a customer case where it would be beneficial for the initial jnlp endpoint lookup to apply retries.

#541 also seems to stem from a similar use case.

In any case, I think this needs further formalization of the expected behaviour as the various states (lookup jnlp endpoint, connecting to tcp port, connecting through websockets) all have various different retry behaviours.

basil · 2022-12-13T16:16:18Z

Yeah, I added this property to Remoting more or less as a hack to allow Swarm (with Swarm's higher-layer retrying) to work with Remoting. But I think ideally Remoting would natively support retries and exponential backoff which would allow Swarm to remove its own retrying functionality and delegating to Remoting. Happy to work toward that goal one step at a time.

sboardwell · 2022-12-21T13:02:54Z

If we are going to implement this, would it also be possible to give a more informative error message if it does fail?

For example, the following seems to suggest the (HTTP) endpoint is timing when, in actual fact, it was the connection to the underlying TCP port 5000x

Nov 29, 2022 1:49:37 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [http://ci-server.ci-namespace.svc.cluster.local/ci-server/]
Nov 29, 2022 1:50:07 PM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: Failed to connect to http://ci-server.ci-namespace.svc.cluster.local/ci-server/tcpSlaveAgentListener/: connect timed out
java.io.IOException: Failed to connect to http://ci-server.ci-namespace.svc.cluster.local/ci-server/tcpSlaveAgentListener/: connect timed out
at org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver.resolve(JnlpAgentEndpointResolver.java:211)
at hudson.remoting.Engine.innerRun(Engine.java:744)
at hudson.remoting.Engine.run(Engine.java:543)
Caused by: java.net.SocketTimeoutException: connect timed out
at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)

sboardwell · 2023-02-24T10:03:25Z

@Vlatombe - can I have a crack at an initial non-breaking stop-gap solution to this.

We still have this problem and it looks to be a bigger discussion to solve "properly". My solution was mentioned in #608 (comment) (the first one). This would not change current behaviour but allow people to set the retries/interval value themselves if they so wish.

…ect-on-jnlp-lookup-failure

basil

New tests fail without #675 and pass with it.

basil reviewed Dec 12, 2022

View reviewed changes

sboardwell mentioned this pull request Feb 24, 2023

Expand non-fatal JNLP agent endpoint resolution to do retries #628

Closed

6 tasks

Merge remote-tracking branch 'refs/remotes/origin/master' into reconn…

e20ee34

…ect-on-jnlp-lookup-failure

basil changed the title ~~Expand non-fatal JNLP agent endpoint resolution to do retries~~ [JENKINS-72163] Add tests for agent endpoint resolution retries Oct 12, 2023

basil added the test label Oct 12, 2023

basil marked this pull request as ready for review October 12, 2023 20:11

basil approved these changes Oct 12, 2023

View reviewed changes

basil merged commit 737fe1e into jenkinsci:master Oct 12, 2023
13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[JENKINS-72163] Add tests for agent endpoint resolution retries #608

[JENKINS-72163] Add tests for agent endpoint resolution retries #608

Vlatombe commented Dec 6, 2022

sboardwell commented Dec 12, 2022

basil left a comment

Vlatombe commented Dec 13, 2022

basil commented Dec 13, 2022

sboardwell commented Dec 21, 2022

sboardwell commented Feb 24, 2023

basil left a comment

[JENKINS-72163] Add tests for agent endpoint resolution retries #608

[JENKINS-72163] Add tests for agent endpoint resolution retries #608

Conversation

Vlatombe commented Dec 6, 2022

sboardwell commented Dec 12, 2022

basil left a comment

Choose a reason for hiding this comment

Vlatombe commented Dec 13, 2022

basil commented Dec 13, 2022

sboardwell commented Dec 21, 2022

sboardwell commented Feb 24, 2023

basil left a comment

Choose a reason for hiding this comment