Consume control capacity #11665
Conversation
As I approve, let me acknowledge that the managers.py code is getting ugly. There are two spinoff issues from this, and that messy code will become a candidate for outright deletion as those are worked on.
The test failure looks like it only needs a rebase.
Example of a job that waits on capacity for control and then ends up running (taken from a "hybrid-standalone" type deploy).
We can see the hybrid node is quite busy, both running and controlling jobs.
Also, from this same system running a number of jobs/project updates/workflows/inventory updates etc., I have seen no tracebacks from any related changes in this PR (nothing from the scheduler, etc.).
@AlanCoding I'm not seeing a fix for this in devel:
That is what is causing my test failure. I'm not having luck re-creating that failure locally. As far as I can tell, the import above this must be failing, so it's falling back to this old way to import it (pip < 10).
Figured out the bug: pytest-dev/pytest#9609. I imagine we'll pin in devel and I'll rebase.
Aside from rebasing, I think this is ready to merge.
Consume capacity on control nodes for controlling tasks, and consider remaining capacity on control nodes before selecting them. This depends on the requirement that all control and hybrid nodes be in the instance group named 'controlplane'. Many tests do not satisfy that requirement; I'll update the tests in another commit.
We don't start any tasks if we don't have a controlplane instance group. Due to updates to fixtures, update tests to set node type and capacity explicitly so they get the expected result.
The update method is used to account for currently consumed capacity for instance groups in the in-memory capacity tracking data structure we initialize in after_lock_init and then update via calculate_capacity_consumed (both in task_manager.py). Also update fit_task_to_instance to consider control impact on instances. Trust that these functions do the right thing when looking for a node with capacity, and cut out the redundant check of the whole group's capacity, per Alan's recommendation.
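To make the bookkeeping concrete, here is a minimal sketch of in-memory capacity tracking of the kind described above; the dict shape and function bodies are hypothetical illustrations, not the actual task_manager.py code:

```python
# Hypothetical sketch of per-run capacity tracking keyed by hostname.
# Shape and names are illustrative, not the actual AWX task_manager.py code.

def init_capacity_tracking(instances):
    """Seed each instance's remaining capacity at the start of a run
    (the role after_lock_init plays in the description above)."""
    return {
        inst["hostname"]: inst["capacity"] - inst["consumed_capacity"]
        for inst in instances
    }

def consume_capacity(tracking, task):
    """Deduct execution and control impact as a task is started
    (the role calculate_capacity_consumed plays above). On a hybrid
    node, execution_node and controller_node are the same instance."""
    tracking[task["execution_node"]] -= task["task_impact"]
    tracking[task["controller_node"]] -= task["control_impact"]
```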
Deal with control type tasks before we loop over the preferred instance groups, which cuts out the need for some redundant logic. Also, fix a bug where I missed assigning the execution node in one case!
Move the job explanation for jobs that need capacity to a function so we can re-use it in the three places we need it.
Instance group ordering makes no sense on project updates because they always need to run on the control plane. Also, since hybrid nodes should always run the control processes for the jobs running on them as execution nodes, account for this when looking for an execution node.
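A rough sketch of the hybrid-node accounting this describes (hedged; the function and field names here are illustrative, not the actual fit_task_to_instance code):

```python
CONTROL_IMPACT = 1  # stand-in for settings.AWX_CONTROL_NODE_TASK_IMPACT

def fits_on_instance(instance, task_impact):
    """A hybrid node also runs the control process for jobs it executes,
    so it must absorb the control impact on top of the task's own impact."""
    required = task_impact
    if instance["node_type"] == "hybrid":
        required += CONTROL_IMPACT
    return instance["remaining_capacity"] >= required
```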
The variables and wording were both misleading; fix them to more accurately describe the two different cases where this log may be emitted.
Use settings.DEFAULT_CONTROL_PLANE_QUEUE_NAME instead of a hardcoded name. Cache the controlplane_ig object during after_lock_init to avoid an unnecessary query. Eliminate the mistakenly duplicated AWX_CONTROL_PLANE_TASK_IMPACT and use only AWX_CONTROL_NODE_TASK_IMPACT.
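For illustration, caching the control plane instance group during after_lock_init might look something like this (a hedged sketch; only settings.DEFAULT_CONTROL_PLANE_QUEUE_NAME and the controlplane_ig attribute name come from the commit message):

```python
from django.conf import settings
from awx.main.models import InstanceGroup

def after_lock_init(self):
    # Look up the control plane instance group once per task manager run
    # instead of re-querying it for every task.
    self.controlplane_ig = InstanceGroup.objects.filter(
        name=settings.DEFAULT_CONTROL_PLANE_QUEUE_NAME
    ).first()
```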
Add a test to verify that when there are two jobs and only capacity for one, one will move into waiting and the other stays in pending.
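As a toy illustration of that scenario (pure Python, mirroring the hypothetical sketches above rather than the actual AWX test):

```python
# One instance with room for exactly one job: the first moves to waiting,
# the second stays pending. Names are illustrative, not AWX test code.
instance = {"remaining_capacity": 6}
jobs = [{"name": "job1", "impact": 6}, {"name": "job2", "impact": 6}]

statuses = {}
for job in jobs:
    if instance["remaining_capacity"] >= job["impact"]:
        instance["remaining_capacity"] -= job["impact"]
        statuses[job["name"]] = "waiting"
    else:
        statuses[job["name"]] = "pending"

assert statuses == {"job1": "waiting", "job2": "pending"}
```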
Assert that the hybrid node is used for both control and execution, and that capacity is deducted correctly.
Test that control type tasks have the right capacity consumed and get assigned to the right instance group. Also fix lint in the tests.
We can either NOT use "idle instances" for control nodes, or we need to update the jobs_running property on the Instance model to count jobs where the node is the controller_node. I didn't do that because it may be an expensive query, and it would be hard to make it match jobs_running on the InstanceGroup, which filters on tasks assigned to the instance group. This change chooses to stop considering "idle" control nodes an option, since we can't accurately identify them. Without any change, we continue to over-consume capacity on control nodes, because this method sees all control nodes as "idle" at the beginning of the task manager run and then only counts jobs started in that run in the in-memory tracking. So jobs which last over a number of task manager runs build up, consuming capacity that is accurately reported via Instance.consumed_capacity.
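For contrast, the rejected alternative described above, counting controlled jobs in Instance.jobs_running, might look roughly like this (a hedged sketch; the status filter and query shape are assumptions, and as noted it could be expensive):

```python
from django.db.models import Q
from awx.main.models import UnifiedJob

def jobs_running(self):
    # Count jobs this node is executing OR controlling; the OR across two
    # different columns is part of why this query may be expensive.
    return UnifiedJob.objects.filter(
        Q(execution_node=self.hostname) | Q(controller_node=self.hostname),
        status__in=("running", "waiting"),
    ).count()
```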
This is something we can experiment with as far as what users want at install time, but start with just 1 for now.
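For reference, the default being described would amount to a settings entry along these lines (a hedged sketch; the actual defaults file in AWX may differ):

```python
# Constant "task_impact" charged to the controller node for each job it
# controls; starting value per the comment above.
AWX_CONTROL_NODE_TASK_IMPACT = 1
```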
Describe usage of the new setting and the concept of control impact.
Merging!
Approving even though I worked on this a bit; it looks great.
SUMMARY
Addresses #10694
Replaces #11651
ISSUE TYPE
COMPONENT NAME
AWX VERSION
ADDITIONAL INFORMATION
This PR focuses exclusively on implementing #10694
This implementation enforces the requirement that all control and hybrid nodes be members of the 'controlplane' instance group. The tests did not previously comply with this requirement, so I'm having to update a number of them. I have all but about 5 passing now.
This PR:
- Adds a new setting, AWX_CONTROL_NODE_TASK_IMPACT, that is a constant integer (I use 5) amount of "task_impact" applied when a node is the controller_node of a job
- Makes assignment of task.execution_node and task.controller_node, and consumption of capacity in the in-memory capacity tracking, happen before we go into start_task
- Changes preferred_instance_groups for project updates and system jobs

This PR does not: