Consume control capacity #11665

Merged
merged 14 commits into devel from consume_control_capacity_take3 on Feb 14, 2022

Conversation

@kdelee (Member) commented Feb 2, 2022

SUMMARY

Addresses #10694
Replaces #11651

ISSUE TYPE
  • Feature Pull Request
  • Bugfix Pull Request
COMPONENT NAME
  • API
AWX VERSION
awx: 19.5.2.dev67+gbcba14e53e
ADDITIONAL INFORMATION

This PR focuses exclusively on implementing #10694

This implementation enforces the requirement that all control and hybrid nodes be members of the 'controlplane' instance group.

The tests did not previously comply with this requirement, so I'm having to update a number of them. All but about 5 are passing now.

This PR:

  • selects a candidate controller node before we loop over instance groups
  • for non-container-group jobs, if we select a hybrid node with enough capacity to do both the job and the control task, prefers that node for both
  • introduces the setting AWX_CONTROL_NODE_TASK_IMPACT, a constant integer amount of "task_impact" (I use 5) applied when a node is the controller_node of a job (see the sketch after these lists)
  • selects task.execution_node and task.controller_node and consumes capacity in the in-memory capacity tracking before we go into start_task
  • skips looping over preferred_instance_groups for project updates and system jobs.

This PR does not:

  • do any refactoring of the in-memory capacity tracking mechanism.
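
For illustration only (this is not code from the PR): a minimal sketch of how a constant per-job control impact could be charged against a controller node's capacity. The helper name is hypothetical, and the value 5 is the one mentioned above (a later commit defaults the setting to 1).

# Hypothetical sketch, not the actual AWX task manager code.
AWX_CONTROL_NODE_TASK_IMPACT = 5  # assumed value, taken from the description above

def consumed_capacity(execution_impacts, controlled_job_count):
    """Total capacity consumed on a node.

    execution_impacts: task_impact values of jobs executing on the node.
    controlled_job_count: number of jobs for which the node is the controller_node.
    """
    return sum(execution_impacts) + AWX_CONTROL_NODE_TASK_IMPACT * controlled_job_count

# A hybrid node executing a job with task_impact 11 while also controlling it
# consumes 11 + 5 = 16 capacity units under this model.
print(consumed_capacity([11], 1))  # -> 16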

@kdelee changed the title from "Consume control capacity take3" to "[draft] Consume control capacity (take3)" on Feb 2, 2022
awx/main/managers.py (review thread: outdated, resolved)
@kdelee force-pushed the consume_control_capacity_take3 branch 3 times, most recently from ea945b8 to 771e07a on February 2, 2022 17:01
@kdelee force-pushed the consume_control_capacity_take3 branch from e353da5 to 510eeef on February 2, 2022 17:39
@kdelee force-pushed the consume_control_capacity_take3 branch from 510eeef to 187b3e2 on February 2, 2022 19:22
@kdelee force-pushed the consume_control_capacity_take3 branch from b2afc9d to 2dec0e6 on February 2, 2022 20:31
@kdelee changed the title from "[draft] Consume control capacity (take3)" to "Consume control capacity" on Feb 2, 2022
@AlanCoding (Member) left a comment

As I approve, let me acknowledge that the managers.py code is getting ugly. There are 2 spinoff issues from this, and that messy code will become a candidate for outright deletion as those are worked on.

@AlanCoding (Member) commented

the test failure looks like it only needs a rebase

@kdelee force-pushed the consume_control_capacity_take3 branch from e56ef31 to 90d8068 on February 4, 2022 18:03
@kdelee (Member, Author) commented Feb 4, 2022

Example of a job that waits on control capacity and then ends up running (taken from a "hybrid-standalone" type deploy):

{"type": "job", "task_id": 1261, "state": "created", "work_unit_id": null, "template_name": "JobTemplate - SurveyParking\ubbea", "time": "2022-02-04T18:01:07.120251+00:00"}
{"type": "job", "task_id": 1261, "state": "acknowledged", "work_unit_id": null, "template_name": "JobTemplate - SurveyParking\ubbea", "time": "2022-02-04T18:01:07.974560+00:00"}
{"type": "job", "task_id": 1261, "state": "needs_capacity", "work_unit_id": null, "template_name": "JobTemplate - SurveyParking\ubbea", "time": "2022-02-04T18:01:08.146437+00:00"}
{"type": "job", "task_id": 1261, "state": "needs_capacity", "work_unit_id": null, "template_name": "JobTemplate - SurveyParking\ubbea", "time": "2022-02-04T18:01:08.529294+00:00"}
{"type": "job", "task_id": 1261, "state": "needs_capacity", "work_unit_id": null, "template_name": "JobTemplate - SurveyParking\ubbea", "time": "2022-02-04T18:01:10.112022+00:00"}
{"type": "job", "task_id": 1261, "state": "needs_capacity", "work_unit_id": null, "template_name": "JobTemplate - SurveyParking\ubbea", "time": "2022-02-04T18:01:17.020540+00:00"}
{"type": "job", "task_id": 1261, "state": "needs_capacity", "work_unit_id": null, "template_name": "JobTemplate - SurveyParking\ubbea", "time": "2022-02-04T18:01:19.669164+00:00"}
{"type": "job", "task_id": 1261, "state": "controller_node_chosen", "work_unit_id": null, "template_name": "JobTemplate - SurveyParking\ubbea", "controller_node": "my-hybrid-awx-instance.com", "time": "2022-02-04T18:01:21.850781+00:00"}
{"type": "job", "task_id": 1261, "state": "execution_node_chosen", "work_unit_id": null, "template_name": "JobTemplate - SurveyParking\ubbea", "execution_node": "my-hybrid-awx-instance.com", "time": "2022-02-04T18:01:21.850976+00:00"}
{"type": "job", "task_id": 1261, "state": "waiting", "work_unit_id": null, "template_name": "JobTemplate - SurveyParking\ubbea", "time": "2022-02-04T18:01:21.883302+00:00"}
{"type": "job", "task_id": 1261, "state": "pre_run", "work_unit_id": null, "template_name": "JobTemplate - SurveyParking\ubbea", "time": "2022-02-04T18:01:22.250789+00:00"}
{"type": "job", "task_id": 1261, "state": "preparing_playbook", "work_unit_id": null, "template_name": "JobTemplate - SurveyParking\ubbea", "time": "2022-02-04T18:01:22.380333+00:00"}
{"type": "job", "task_id": 1261, "state": "running_playbook", "work_unit_id": null, "template_name": "JobTemplate - SurveyParking\ubbea", "time": "2022-02-04T18:01:22.470447+00:00"}
{"type": "job", "task_id": 1261, "state": "work_unit_id_received", "work_unit_id": "vD1UnMq9", "template_name": "JobTemplate - SurveyParking\ubbea", "time": "2022-02-04T18:01:22.540987+00:00"}
{"type": "job", "task_id": 1261, "state": "work_unit_id_assigned", "work_unit_id": "vD1UnMq9", "template_name": "JobTemplate - SurveyParking\ubbea", "time": "2022-02-04T18:01:22.591757+00:00"}
{"type": "job", "task_id": 1261, "state": "post_run", "work_unit_id": "vD1UnMq9", "template_name": "JobTemplate - SurveyParking\ubbea", "time": "2022-02-04T18:01:25.701226+00:00"}
{"type": "job", "task_id": 1261, "state": "finalize_run", "work_unit_id": "vD1UnMq9", "template_name": "JobTemplate - SurveyParking\ubbea", "time": "2022-02-04T18:01:25.781308+00:00"}

We can see that the hybrid node is quite busy both running and controlling jobs:

{
    "id": 1,
    "type": "instance",
    "url": "/api/v2/instances/1/",
    "related": {
        "named_url": "/api/v2/instances/my-hybrid-awx-instance.com/",
        "jobs": "/api/v2/instances/1/jobs/",
        "instance_groups": "/api/v2/instances/1/instance_groups/",
        "health_check": "/api/v2/instances/1/health_check/"
    },
    "uuid": "5b81710c-7518-42f4-b63d-7b13c0e90ab1",
    "hostname": "my-hybrid-awx-instance.com",
    "created": "2022-02-04T16:00:46.603459Z",
    "modified": "2022-02-04T16:00:46.603543Z",
    "last_seen": "2022-02-04T18:07:29.323864Z",
    "last_health_check": "2022-02-04T18:07:29.323864Z",
    "errors": "",
    "capacity_adjustment": "1.00",
    "version": "4.2.0",
    "capacity": 54,
    "consumed_capacity": 50,
    "percent_capacity_remaining": 7.41,
    "jobs_running": 6,
    "jobs_total": 356,
    "cpu": 2,
    "memory": 7835545600,
    "cpu_capacity": 8,
    "mem_capacity": 54,
    "enabled": true,
    "managed_by_policy": true,
    "node_type": "hybrid"
}

Also, with this same system running a number of jobs/project updates/workflows/inventory updates etc., I have seen no tracebacks from any of the changes in this PR (nothing from the scheduler etc.).
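
As a quick sanity check (not AWX code), the capacity numbers in the instance payload above line up:

# Reading the fields from the instance JSON shown above.
capacity = 54
consumed_capacity = 50

remaining = capacity - consumed_capacity                  # 4 units of capacity left
percent_remaining = round(100 * remaining / capacity, 2)  # 7.41, matching percent_capacity_remaining
print(remaining, percent_remaining)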

@kdelee (Member, Author) commented Feb 4, 2022

@AlanCoding I'm not seeing a fix for this in devel:

During handling of the above exception, another exception occurred:
/usr/lib64/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
awx/main/tests/functional/test_licenses.py:10: in <module>
    from pip.req import parse_requirements
E   ModuleNotFoundError: No module named 'pip.req'

That is what is causing my test failure.

I'm not having luck re-creating that failure locally. As far as I can tell, the import above this must be failing, so it's falling back to this old way of importing it (pip < 10):
https://github.com/pypa/pip/blob/ec8edbf5df977bb88e1c777dd44e26664d81e216/src/pip/_internal/commands/uninstall.py#L11
Looks like pip on main does the import the same way we do 😕

@kdelee (Member, Author) commented Feb 4, 2022

Figured out the bug: pytest-dev/pytest#9609

I imagine we'll pin it in devel and then I'll rebase.

@AlanCoding (Member) commented

Aside from rebasing, I think this is ready to merge.

@kdelee force-pushed the consume_control_capacity_take3 branch from 90d8068 to 29b5e4c on February 7, 2022 16:42
kdelee and others added 13 commits February 9, 2022 14:00
Consume capacity on control nodes for controlling tasks and consider
remaining capacity on control nodes before selecting them.

This depends on the requirement that control and hybrid nodes should all
be in the instance group named 'controlplane'. Many tests do not satisfy that
requirement. I'll update the tests in another commit.
We don't start any tasks if we don't have a controlplane instance group

Due to updates to fixtures, update tests to set node type and capacity
explicitly so they get the expected result.
Update method is used to account for currently consumed capacity for
instance groups in the in-memory capacity tracking data structure we initialize in
after_lock_init and then update via calculate_capacity_consumed (both in
task_manager.py).

Also update fit_task_to_instance to consider control impact on instances.

Trust that these functions do the right thing looking for a
node with capacity, and cut out the redundant check for the whole group's
capacity per Alan's recommendation.
Deal with control type tasks before we loop over the preferred instance
groups, which cuts out the need for some redundant logic.

Also, fix a bug where I was not assigning the execution node in one case!
Move the job explanation for jobs that need capacity to a function
so we can re-use it in the three places we need it.
Instance group ordering makes no sense on project updates because they
always need to run on the control plane.

Also, since hybrid nodes should always run the control processes for the
jobs running on them as execution nodes, account for this when looking for an
execution node.
The variables and wording were both misleading; fix them to give a more accurate
description in the two different cases where this log may be emitted.
Use settings.DEFAULT_CONTROL_PLANE_QUEUE_NAME instead of a hardcoded
name.
Cache the controlplane_ig object during after_lock_init to avoid
an unnecessary query.
Eliminate the mistakenly duplicated AWX_CONTROL_PLANE_TASK_IMPACT and use
only AWX_CONTROL_NODE_TASK_IMPACT.
Add a test to verify that when there are 2 jobs and capacity for only one,
one moves into waiting and the other stays in pending.
Assert that the hybrid node is used for both control and execution and
that capacity is deducted correctly.
Test that control type tasks have the right capacity consumed and
get assigned to the right instance group

Also fix lint in the tests
We can either NOT use "idle instances" for control nodes, or we need
to update the jobs_running property on the Instance model to count
jobs where the node is the controller_node.

I didn't do that because it may be an expensive query, and it would be
hard to make it match with jobs_running on the InstanceGroup which
filters on tasks assigned to the instance group.

This change chooses to stop considering "idle" control nodes an option,
since we can't accurately identify them.

Without any change, we continue to over-consume capacity on control nodes,
because this method sees all control nodes as "idle" at the beginning
of the task manager run and then only counts jobs started in that run
in the in-memory tracking. So jobs which last over a number of task
manager runs build up consumed capacity, which is accurately reported
via Instance.consumed_capacity.
This is something we can experiment with as far as what users
want at install time, but start with just 1 for now.
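
A rough sketch of the idea described in the commit messages above; this is illustrative only, not the real task_manager.py code, and the class and function names here are made up for the example. The default of 1 for the control impact follows the "start with just 1 for now" note above.

# Hypothetical illustration of in-memory capacity tracking with control impact.
from dataclasses import dataclass

AWX_CONTROL_NODE_TASK_IMPACT = 1  # assumed default, per the commit above

@dataclass
class TrackedInstance:
    hostname: str
    node_type: str            # "control", "execution", or "hybrid"
    remaining_capacity: int

def fit_task(task_impact, instances):
    """Pick an execution node with room for the job plus, on hybrid nodes,
    the control impact that node would also carry. Returns the chosen
    instance (with its in-memory remaining capacity reduced) or None."""
    for inst in instances:
        needed = task_impact
        if inst.node_type == "hybrid":
            needed += AWX_CONTROL_NODE_TASK_IMPACT  # hybrid node also controls the job
        if inst.remaining_capacity >= needed:
            inst.remaining_capacity -= needed
            return inst
    return None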
@kdelee force-pushed the consume_control_capacity_take3 branch from 6b57511 to 65f6845 on February 9, 2022 19:02
Describe usage of the new setting and the concept of control impact.
@kdelee (Member, Author) commented Feb 14, 2022

Merging!

@rebeccahhh (Member) left a comment

Approving even though I worked on this a bit; it looks great.
