
[Flaky Test] [Old] BMHs get stuck while deprovisioning #1123

Open
huutomerkki opened this issue Aug 10, 2023 · 11 comments

Labels: lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.), triage/accepted (Indicates an issue is ready to be actively worked on.)

Comments

@huutomerkki (Member)

Moved here from the old flake tracker.

Sometimes the BMHs get stuck while deprovisioning.

RETRYING: Wait until "2" bmhs become available again. 147/150
RETRYING: Wait until "2" bmhs become available again. 148/150
RETRYING: Wait until "2" bmhs become available again. 149/150
RETRYING: Wait until "2" bmhs become available again. 150/150
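When the wait times out like this, a first step is to check which provisioning state each BareMetalHost is stuck in. A minimal sketch with the Python kubernetes client; the namespace and CRD version are assumptions based on a typical metal3 dev-env:

```python
# List BareMetalHosts and their provisioning states to see which hosts are
# stuck. Namespace "metal3" and version "v1alpha1" are assumptions; adjust
# them to match your deployment.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

bmhs = api.list_namespaced_custom_object(
    group="metal3.io", version="v1alpha1",
    namespace="metal3", plural="baremetalhosts")

for bmh in bmhs["items"]:
    name = bmh["metadata"]["name"]
    state = bmh.get("status", {}).get("provisioning", {}).get("state", "?")
    print(f"{name}: {state}")
```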

When this happens, the following can be seen in the BMO logs:

{"level":"info","ts":1664171732.2929034,"logger":"controllers.BareMetalHost","msg":"start","baremetalhost":"metal3/eselda13u31s05"}
{"level":"info","ts":1664171732.3624265,"logger":"controllers.BareMetalHost","msg":"Retrying registration","baremetalhost":"metal3/eselda13u31s05","provisioningState":"provisioned"}
{"level":"info","ts":1664171732.3624785,"logger":"controllers.BareMetalHost","msg":"registering and validating access to management controller","baremetalhost":"metal3/eselda13u31s05","provisioningState":"provisioned","credentials":{"credentials":{"name":"bml-ilo-login-secret-05","namespace":"metal3"},"credentialsVersion":"9674"}}
{"level":"info","ts":1664171732.4302292,"logger":"provisioner.ironic","msg":"updating node settings in ironic","host":"metal3~eselda13u31s05"}
{"level":"info","ts":1664171732.5427444,"logger":"provisioner.ironic","msg":"could not update node settings in ironic, busy","host":"metal3~eselda13u31s05"}
{"level":"info","ts":1664171732.5427907,"logger":"controllers.BareMetalHost","msg":"host not ready","baremetalhost":"metal3/eselda13u31s05","provisioningState":"provisioned","wait":10}
{"level":"info","ts":1664171732.542827,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"metal3/eselda13u31s05","provisioningState":"provisioned","requeue":true,"after":10}

And in Ironic we see this:

2022-09-26 05:55:32.499 1 DEBUG ironic.conductor.manager [None req-97e7cdfd-1309-4a01-a42d-a503e221e900 - - - - - -] RPC update_node called for node 67b0163f-faf6-48e3-b5bf-946f37cb48d8. update_node /usr/lib/python3.9/site-packages/ironic/conductor/manager.py:193
2022-09-26 05:55:32.507 1 DEBUG ironic.conductor.task_manager [None req-97e7cdfd-1309-4a01-a42d-a503e221e900 - - - - - -] Attempting to get exclusive lock on node 67b0163f-faf6-48e3-b5bf-946f37cb48d8 (for node update) __init__ /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:235
2022-09-26 05:55:32.516 1 DEBUG ironic.conductor.task_manager [None req-97e7cdfd-1309-4a01-a42d-a503e221e900 - - - - - -] Node 67b0163f-faf6-48e3-b5bf-946f37cb48d8 successfully reserved for node update (took 0.01 seconds) reserve_node /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:352
2022-09-26 05:55:32.531 1 DEBUG ironic.conductor.task_manager [None req-97e7cdfd-1309-4a01-a42d-a503e221e900 - - - - - -] Successfully released exclusive lock for node update on node 67b0163f-faf6-48e3-b5bf-946f37cb48d8 (lock was held 0.01 sec) release_resources /usr/lib/python3.9/site-packages/ironic/conductor/task_manager.py:448
2022-09-26 05:55:32.534 1 DEBUG ironic.api.method [None req-97e7cdfd-1309-4a01-a42d-a503e221e900 - - - - - -] Client-side error: Node 67b0163f-faf6-48e3-b5bf-946f37cb48d8 is associated with instance 66b3a20d-9c99-4d0b-8f91-5d8ce7bad6f5. format_exception /usr/lib/python3.9/site-packages/ironic/api/method.py:124
2022-09-26 05:55:32.536 1 INFO eventlet.wsgi.server [None req-97e7cdfd-1309-4a01-a42d-a503e221e900 - - - - - -] 172.17.0.11,127.0.0.1 "PATCH /v1/nodes/67b0163f-faf6-48e3-b5bf-946f37cb48d8 HTTP/1.1" status: 409  len: 531 time: 0.0919740
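The rejected update can also be replayed by hand to check whether the node is still wedged. A rough probe with Python requests; the endpoint, API version, and patched field are placeholders rather than the exact request BMO sends:

```python
# Replay a node update against Ironic and watch for the 409 seen above.
# Endpoint and node UUID are placeholders; add auth if your Ironic
# deployment requires it.
import requests

IRONIC = "https://172.22.0.2:6385"  # placeholder endpoint
NODE = "67b0163f-faf6-48e3-b5bf-946f37cb48d8"

resp = requests.patch(
    f"{IRONIC}/v1/nodes/{NODE}",
    json=[{"op": "remove", "path": "/instance_uuid"}],
    headers={"X-OpenStack-Ironic-API-Version": "1.72"},
    verify=False,  # dev-env self-signed certificate
)
# While the node is wedged this should return 409:
# "Node ... is associated with instance ...".
print(resp.status_code, resp.text)
```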

The relevant code in Ironic can be found here.
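In essence it is a guard of the following shape; this is an illustration of the behaviour, not Ironic's actual code:

```python
# Simplified illustration of the update_node guard behind the 409 above
# (compare ironic/conductor/manager.py); not Ironic's exact code.
from dataclasses import dataclass
from typing import Optional

# States in which instance_uuid may still be changed (illustrative list).
ALLOWED_UPDATE_STATES = ("enroll", "inspecting", "inspect wait", "manageable")

@dataclass
class Node:
    uuid: str
    instance_uuid: Optional[str]
    provision_state: str

class NodeAssociated(Exception):
    def __init__(self, node, instance):
        super().__init__(f"Node {node} is associated with instance {instance}.")

def update_node(node, delta):
    # Changing instance_uuid on an associated node outside the allowed
    # states is rejected; the API layer maps this to HTTP 409.
    if ("instance_uuid" in delta and node.instance_uuid
            and node.provision_state not in ALLOWED_UPDATE_STATES):
        raise NodeAssociated(node.uuid, node.instance_uuid)
    for field, value in delta.items():
        setattr(node, field, value)

node = Node("67b0163f-...", "66b3a20d-...", "active")
update_node(node, {"instance_uuid": None})  # raises NodeAssociated
```

While the node stays associated, every such update keeps failing the same way, which matches BMO requeueing indefinitely in the logs above.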

Recreating (deleting) the Ironic Pod seems to help.
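For reference, the workaround can be scripted with the Python kubernetes client; the namespace and label selector are assumptions about a typical metal3 dev-env and should be adjusted:

```python
# Delete the Ironic pod so its controller recreates it. Namespace and label
# selector are assumptions; verify with
# `kubectl get pods -A --show-labels | grep ironic`.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pods = core.list_namespaced_pod(
    namespace="baremetal-operator-system",  # assumed namespace
    label_selector="name=ironic",           # assumed label
)
for pod in pods.items:
    core.delete_namespaced_pod(
        name=pod.metadata.name, namespace=pod.metadata.namespace)
    print(f"deleted {pod.metadata.name}")
```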

Occurrence and logs

@metal3-io-bot added the needs-triage label (Indicates an issue lacks a `triage/foo` label and requires one.) on Aug 10, 2023
@lentzi90 (Member)

/triage accepted

@metal3-io-bot added the triage/accepted label and removed the needs-triage label on Aug 16, 2023
@Sunnatillo (Member) commented Aug 17, 2023

@metal3-io-bot (Contributor)

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot added the lifecycle/stale label on Dec 26, 2023
@Rozzii (Member)

Rozzii commented Jan 17, 2024

Let's give this some more time: the CI was broken for a long time, so this flake might still be hiding among the jobs.
/remove-lifecycle stale

@metal3-io-bot removed the lifecycle/stale label on Jan 17, 2024
@mboukhalfa (Member)

mboukhalfa commented Mar 7, 2024

https://jenkins.nordix.org/job/metal3_project_infra_main_integration_test_ubuntu/313/consoleFull

@metal3-io-bot (Contributor)

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot added the lifecycle/stale label on Jun 5, 2024