
(JENKINS-68371) improve asynchrony of StandardPlannedNodeBuilder #1171

Open
wants to merge 1 commit into master
Conversation

@jonathannewman commented May 3, 2022

When the NodeProvisioner is building agents in the provision step (https://github.com/jenkinsci/kubernetes-plugin/blob/307d9791dcf7dfc3bbbcbdf1a7eab44ed752a4c8/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesCloud.java#L536):

plannedNodes.add(PlannedNodeBuilderFactory.createInstance().cloud(this).template(podTemplate).label(label).numExecutors(1).build());

it loops through the number of nodes to provision. Each iteration goes through the StandardPlannedNodeBuilder and runs as a blocking operation, even though a Future interface is used to satisfy the consumers. In our testing, this operation could take upwards of 100 seconds under load, effectively stopping provisioning for periods of time.

This change introduces a configurable thread pool so that the agent creation step runs on a separate thread. In our testing, this resolves the serial bottleneck and allows provisioning to continue while the blocking operations run off the main loop.
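For illustration, here is a minimal sketch of that thread-pool idea. The class name, pool size, and the createAgent() helper are placeholders of my own, not the actual code in this PR:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import hudson.model.Node;
import hudson.slaves.NodeProvisioner;

/**
 * Sketch only: hand NodeProvisioner a Future that is completed on a
 * dedicated pool, so the provisioning loop is not blocked while the agent
 * object is assembled. Names here are illustrative, not the PR's code.
 */
public class AsyncPlannedNodeBuilder {

    // The real change makes the pool size configurable; 4 is just a placeholder.
    private static final ExecutorService AGENT_BUILD_POOL = Executors.newFixedThreadPool(4);

    public NodeProvisioner.PlannedNode build(String displayName, int numExecutors) {
        // supplyAsync moves createAgent() off the NodeProvisioner thread.
        CompletableFuture<Node> future =
                CompletableFuture.supplyAsync(this::createAgent, AGENT_BUILD_POOL);
        return new NodeProvisioner.PlannedNode(displayName, future, numExecutors);
    }

    private Node createAgent() {
        // Hypothetical stand-in for the KubernetesSlave construction that
        // StandardPlannedNodeBuilder performs (template inheritance etc.).
        throw new UnsupportedOperationException("stand-in for agent construction");
    }
}
```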

https://issues.jenkins.io/browse/JENKINS-68371

I'm not sure exactly how to test this effectively.


@jonathannewman (Author)

Don't quite understand the test failure. Seems like some sort of race between the pod failing and it getting killed?

@Vlatombe (Member) commented May 4, 2022

> In our testing, it was seen that this operation could take upwards of 100 seconds when under load, causing provisioning to be effectively stopped for periods of time.

This doesn't seem right. I would expect computation for a single agent to take up to 1 second, not 100 seconds, even on a loaded system.

@jonathannewman (Author)

There are some logs in the related ticket that demonstrate the delays we have seen. Without this fix, when hundreds of agents are needed, provisioning can take up to an hour before those resources are available.

@Vlatombe (Member)

This step is completely local and I don't think there is justification for running it asynchronously.
However, the current pod template inheritance can certainly be expensive to compute (due to YAML marshalling/unmarshalling). I think it could be improved with some refactoring; profiling would be needed to identify the critical paths to improve.
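A throwaway timing wrapper (illustrative only, not plugin code) is often enough to confirm where the time goes before reaching for a full profiler; something along these lines:

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;
import java.util.logging.Logger;

class BuildTimer {
    private static final Logger LOG = Logger.getLogger(BuildTimer.class.getName());

    // Wraps a step with a timer so slow paths (e.g. the template inheritance /
    // YAML handling mentioned above) show up in the log.
    static <T> T timed(String what, Supplier<T> step) {
        long start = System.nanoTime();
        try {
            return step.get();
        } finally {
            long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            LOG.info(() -> what + " took " + ms + " ms");
        }
    }
}
```

Usage would be something like BuildTimer.timed("plannedNode.build", builder::build), where builder stands in for the StandardPlannedNodeBuilder instance.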

@jonathannewman (Author)

What we have seen in our logs is that there is some blocking operation that prevents completion of the step (when whatever resource is freed, the blocked threads all return at the same time). When the thread pool is leveraged, this blocking operation does not prevent the KS from being created, allowing things to move smoothly. Given that the expected result is already a Future, and that we have a complex system with various moving parts that involve locking, I don't understand the objection to leveraging a thread pool to prevent future performance issues. We could certainly make the thread pool smaller to avoid memory overhead -- it will just queue the work to be done. For us, profiling this situation would be very difficult, as it only seems to manifest under production-level loads on our production instances. This fix demonstrably addresses the underlying issue for us.
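On the pool-sizing point, a tiny stand-alone sketch (nothing here is the plugin's actual code) of how a small fixed pool simply queues the extra work:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Illustration of the queueing behaviour mentioned above: a fixed pool of 2
 * threads backs onto an unbounded work queue, so the 10 submitted "build"
 * tasks never use more than 2 worker threads -- the rest simply wait.
 */
public class QueueingDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int i = 0; i < 10; i++) {
            final int id = i;
            pool.submit(() -> System.out.println(
                    Thread.currentThread().getName() + " building planned node " + id));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```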

@Vlatombe added the enhancement label on May 30, 2022
@jonathannewman (Author)

Would additional tests be beneficial? I'm not sure exactly what I would test.

@Vlatombe (Member) commented Jun 1, 2022

Would you mind trying out #1178 ? This should already speed up the main loop and I still don't think asynchrony is necessary here even if the framework allows it.

@jonathannewman (Author)

> Would you mind trying out #1178 ? This should already speed up the main loop and I still don't think asynchrony is necessary here even if the framework allows it.

We have tried it locally and still have issues. We would like to show you some logs, but don't really want to share them in public. Any suggestions about how to do that? cc @sbeaulie
