[JENKINS-72163] Add tests for agent endpoint resolution retries #608
Follow-up to jenkinsci#449. When `hudson.remoting.Engine.nonFatalJnlpAgentEndpointResolutionExceptions` is set, a failure to resolve the JNLP endpoint is considered non-fatal, so Remoting should handle the retries. This can help when a flaky network causes the initial JNLP endpoint lookup to fail.
Hi @Vlatombe, I have two additions if possible.
The primary intended consumer of this existing property is Swarm, which wraps Remoting in a layer that does its own retries, so this is unnecessary for Swarm. Was there another consumer you had in mind that would benefit from this?
@sboardwell has a customer case where it would be beneficial for the initial JNLP endpoint lookup to apply retries. #541 also seems to stem from a similar use case. In any case, I think the expected behaviour needs further formalization, as the various states (looking up the JNLP endpoint, connecting to the TCP port, connecting through WebSockets) each have different retry behaviours.
Yeah, I added this property to Remoting more or less as a hack to allow Swarm (with Swarm's higher-layer retrying) to work with Remoting. But I think ideally Remoting would natively support retries and exponential backoff, which would allow Swarm to remove its own retrying functionality and delegate to Remoting. Happy to work toward that goal one step at a time.
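For illustration only, the "retries and exponential backoff" idea mentioned above could look roughly like the sketch below. It is not Remoting's or Swarm's implementation: the class `BackoffRetry` and its methods `delayFor` and `call` are hypothetical names, and the doubling-with-cap policy is an assumption about what "exponential backoff" would mean here.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration; not part of Remoting's API. */
final class BackoffRetry {
    private BackoffRetry() {}

    /** Delay before retry number {@code attempt} (0-based): initial * 2^attempt, capped at max. */
    static Duration delayFor(int attempt, Duration initial, Duration max) {
        long maxMs = max.toMillis();
        long delayMs = initial.toMillis();
        // Double once per previous attempt, stopping as soon as the cap is reached.
        for (int i = 0; i < attempt && delayMs < maxMs; i++) {
            delayMs *= 2;
        }
        return Duration.ofMillis(Math.min(delayMs, maxMs));
    }

    /** Runs the action up to maxAttempts times, sleeping with backoff between failures. */
    static <T> T call(Callable<T> action, int maxAttempts, Duration initial, Duration max)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread was asked to stop.
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(delayFor(attempt, initial, max).toMillis());
                }
            }
        }
        // All attempts failed: surface the last failure to the caller.
        throw last;
    }
}
```

A caller would wrap the endpoint lookup in a `Callable` and pass it to `BackoffRetry.call`, so the same policy could in principle be reused for each connection state discussed above.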
If we are going to implement this, would it also be possible to give a more informative error message if it does fail? For example, the following seems to suggest the (HTTP) endpoint is timing out when, in actual fact, it was the connection to the underlying TCP port 5000x.
@Vlatombe - can I have a crack at an initial non-breaking stop-gap solution to this? We still have this problem, and solving it "properly" looks to be a bigger discussion. My solution was mentioned in #608 (comment) (the first one). This would not change current behaviour but would allow people to set the retries/interval values themselves if they so wish.
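As a rough sketch of what such an opt-in setting could look like, the snippet below reads a retry count and an interval from Java system properties and falls back to defaults when they are unset. Everything here is an assumption made for illustration: the class `RetrySettings`, the two property names, and the default values are hypothetical and are not taken from Remoting or from the proposal referenced above.

```java
/** Hypothetical settings holder for illustration; names and defaults are assumptions. */
final class RetrySettings {
    // Placeholder property names, not real Remoting properties.
    static final String RETRIES_PROP = "example.endpointResolution.retries";
    static final String INTERVAL_PROP = "example.endpointResolution.intervalSeconds";

    // Placeholder defaults; a real change would use whatever matches current behaviour.
    static final int DEFAULT_RETRIES = 0;
    static final long DEFAULT_INTERVAL_SECONDS = 10L;

    final int retries;
    final long intervalSeconds;

    RetrySettings(int retries, long intervalSeconds) {
        this.retries = retries;
        this.intervalSeconds = intervalSeconds;
    }

    /** Reads both values from system properties; unset or unparsable values use the defaults. */
    static RetrySettings fromSystemProperties() {
        int retries = Integer.getInteger(RETRIES_PROP, DEFAULT_RETRIES);
        long interval = Long.getLong(INTERVAL_PROP, DEFAULT_INTERVAL_SECONDS);
        // Negative values make no sense for either setting, so clamp them to zero.
        return new RetrySettings(Math.max(0, retries), Math.max(0L, interval));
    }
}
```

With this shape, leaving both properties unset keeps the defaults, which is what makes the change non-breaking; users who want retries would pass the properties with `-D` on the agent's JVM command line.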
…ect-on-jnlp-lookup-failure
New tests fail without #675 and pass with it.