
(JENKINS-68371) improve asynchrony of StandardPlannedNodeBuilder #1171

Open
wants to merge 1 commit into master
Conversation

@jonathannewman commented May 3, 2022

When the NodeProvisioner is building agents in the provision step (https://github.com/jenkinsci/kubernetes-plugin/blob/307d9791dcf7dfc3bbbcbdf1a7eab44ed752a4c8/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesCloud.java#L536):

plannedNodes.add(PlannedNodeBuilderFactory.createInstance().cloud(this).template(podTemplate).label(label).numExecutors(1).build());

it loops through the number of nodes to provision. Each iteration goes through the StandardPlannedNodeBuilder and runs as a blocking operation, even though a Future interface is used to satisfy the consumers. In our testing, this operation could take upwards of 100 seconds under load, effectively stopping provisioning for periods of time.

This change introduces a configurable thread pool so that the agent creation step runs on a separate thread. In our testing, this resolves the serial bottleneck and allows provisioning to continue while the blocking operations run off the main loop.
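For illustration, here is a minimal sketch of that thread-pool idea. The class name, pool size, and the createAgent() helper are placeholders of my own, not the actual code in this PR:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import hudson.model.Node;
import hudson.slaves.NodeProvisioner;

/**
 * Sketch only: hand NodeProvisioner a Future that is completed on a
 * dedicated pool, so the provisioning loop is not blocked while the agent
 * object is assembled. Names here are illustrative, not the PR's code.
 */
public class AsyncPlannedNodeBuilder {

    // The real change makes the pool size configurable; 4 is just a placeholder.
    private static final ExecutorService AGENT_BUILD_POOL = Executors.newFixedThreadPool(4);

    public NodeProvisioner.PlannedNode build(String displayName, int numExecutors) {
        // supplyAsync moves createAgent() off the NodeProvisioner thread.
        CompletableFuture<Node> future =
                CompletableFuture.supplyAsync(this::createAgent, AGENT_BUILD_POOL);
        return new NodeProvisioner.PlannedNode(displayName, future, numExecutors);
    }

    private Node createAgent() {
        // Hypothetical stand-in for the KubernetesSlave construction that
        // StandardPlannedNodeBuilder performs (template inheritance etc.).
        throw new UnsupportedOperationException("stand-in for agent construction");
    }
}
```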

https://issues.jenkins.io/browse/JENKINS-68371

I'm not sure exactly how to test this effectively.


@jonathannewman (Author)

Don't quite understand the test failure. Seems like some sort of race between the pod failing and it getting killed?

@Vlatombe (Member) commented May 4, 2022

> In our testing, it was seen that this operation could take upwards of 100 seconds when under load, causing provisioning to be effectively stopped for periods of time.

This doesn't seem right. I would expect computation for a single agent to take up to 1 second, not 100 seconds, even on a loaded system.

@jonathannewman (Author)

There are some logs in the related ticket that demonstrate the delays we have seen. Without this fix, when hundreds of agents are needed, provisioning can take up to an hour before those resources are available.

@Vlatombe (Member)

This step is completely local and I don't think there is justification for running it asynchronously.
However, the current pod template inheritance can certainly be expensive to compute (due to YAML marshalling/unmarshalling). I think it could be improved with some refactoring; profiling would be needed to identify the critical paths to improve.
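A throwaway timing wrapper (illustrative only, not plugin code) is often enough to confirm where the time goes before reaching for a full profiler; something along these lines:

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;
import java.util.logging.Logger;

class BuildTimer {
    private static final Logger LOG = Logger.getLogger(BuildTimer.class.getName());

    // Wraps a step with a timer so slow paths (e.g. the template inheritance /
    // YAML handling mentioned above) show up in the log.
    static <T> T timed(String what, Supplier<T> step) {
        long start = System.nanoTime();
        try {
            return step.get();
        } finally {
            long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            LOG.info(() -> what + " took " + ms + " ms");
        }
    }
}
```

Usage would be something like BuildTimer.timed("plannedNode.build", builder::build), where builder stands in for the StandardPlannedNodeBuilder instance.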

@jonathannewman (Author)

What we have seen in our logs is that there is some blocking operation that prevents completion of the step (when whatever resource is freed, the blocked threads all return at the same time). When the thread pool is leveraged, this blocking operation does not prevent the KS from being created, allowing things to move smoothly. Given that the expected result is already a Future, and that we have a complex system with various moving parts that involve locking, I don't understand the objection to leveraging a thread pool to prevent future performance issues. We could certainly make the thread pool smaller to avoid memory overhead -- it will just queue the work to be done. For us, profiling this situation would be very difficult, as it only seems to manifest under production-level loads on our production instances. This fix demonstrably addresses the underlying issue for us.
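On the pool-sizing point, a tiny stand-alone sketch (nothing here is the plugin's actual code) of how a small fixed pool simply queues the extra work:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Illustration of the queueing behaviour mentioned above: a fixed pool of 2
 * threads backs onto an unbounded work queue, so the 10 submitted "build"
 * tasks never use more than 2 worker threads -- the rest simply wait.
 */
public class QueueingDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int i = 0; i < 10; i++) {
            final int id = i;
            pool.submit(() -> System.out.println(
                    Thread.currentThread().getName() + " building planned node " + id));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```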

@Vlatombe added the enhancement label on May 30, 2022
@jonathannewman (Author)

Would additional tests be beneficial? I'm not sure exactly what I would test.

@Vlatombe (Member) commented Jun 1, 2022

Would you mind trying out #1178 ? This should already speed up the main loop and I still don't think asynchrony is necessary here even if the framework allows it.

@jonathannewman (Author)

> Would you mind trying out #1178 ? This should already speed up the main loop and I still don't think asynchrony is necessary here even if the framework allows it.

We have tried it locally and still have issues. We would like to show you some logs, but don't really want to share them in public. Any suggestions about how to do that? cc @sbeaulie
