Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure to report reinstall by metal-hammer results in fresh install with wiped disks #207

Open
mreiger opened this issue Aug 4, 2021 · 2 comments

Comments

@mreiger
Copy link
Contributor

mreiger commented Aug 4, 2021

Apparently it can sometimes happen that an attempt from metal-hammer to report a successful reinstall of a machine fails with an error code 500; in one such instance the machine rebooted and when the metal-hammer next contacted the api was told to do a fresh install, which wiped all local disks.

This is obviously undesired behaviour; how can this issue be fixed?

Error message from metal-hammer ipmi console log when failing to report:

POST /machine/report/eff49e00-6ff4-11e9-8000-efbeaddeefbe HTTP/1.1
Host: 10.255.255.5:4242
User-Agent: Go-http-client/1.1
Content-Length: 416
Accept: application/json
Content-Type: application/json
Accept-Encoding: gzip

{"bootloaderid":"metal-debian","cmdline":"console=ttyS1,115200n8 root=UUID=3cb7a6b6-65e5-4ff7-8dfd-e00bb5abcf1b init=/bin/systemd net.ifnames=0 biosdevname=0 nvme_core.io_timeout=4294967295 systemd.unified_cgroup_hierarchy=0","console_password":"hbPMo74i1dSQXv/F","initrd":"/tmp/initrd.img-5.10.0-6-amd64","kernel":"/tmp/vmlinuz-5.10.0-6-amd64","message":null,"os_partition":null,"primary_disk":null,"success":true}

DBUG[08-04|12:48:46] lldp                                     detectedNeighbor="Name:nbg-w8101-r01leaf01 Desc:Cumulus Linux version 3.7.13 running on Accton AS7712-32X Chassis:Mac:80:a2:35:e1:39:9c Port:Mac:80:a2:35:e1:39:a0" caller=lldpclient.go:71
DBUG[08-04|12:48:46] lldp                                     detectedNeighbor="Name:nbg-w8101-r01leaf02 Desc:Cumulus Linux version 3.7.13 running on Accton AS7712-32X Chassis:Mac:80:a2:35:32:9d:bb Port:Mac:80:a2:35:32:9d:bf" caller=lldpclient.go:71
HTTP/1.1 500 Internal Server Error
Content-Length: 0
Date: Wed, 04 Aug 2021 12:48:48 GMT


EROR[08-04|12:48:48] report                                   error="[POST /machine/report/{id}][500] reportInternalServerError " caller=report.go:42
EROR[08-04|12:48:48] reinstall failed                         error="unable to report image installation: [POST /machine/report/{id}][500] reportInternalServerError " caller=root.go:140
EROR[08-04|12:48:48] reinstall cancelled => boot into existing OS... reason="unable to report image installation: [POST /machine/report/{id}][500] reportInternalServerError " caller=reinstall.go:25
POST /machine/abort-reinstall/eff49e00-6ff4-11e9-8000-efbeaddeefbe HTTP/1.1
Host: 10.255.255.5:4242
User-Agent: Go-http-client/1.1
Content-Length: 28
Accept: application/json
Content-Type: application/json
Accept-Encoding: gzip

{"primary_disk_wiped":true}

HTTP/1.1 500 Internal Server Error
Content-Length: 27
Content-Type: application/json

Complete log files of failed reinstall followed by fresh install, and for comparison a successful reinstall, are attached.

successful-reinstall.txt
failed-reinstall-and-subsequent-install-with-wipe.txt

@Gerrit91
Copy link
Contributor

Gerrit91 commented Aug 5, 2021

Here are related metal-api error logs:

{"level":"error","time":"2021-08-04T12:48:48.296Z","caller":"rest/middleware.go:88","message":"Rest Call","rqid":"5019de598aab8c4f96ea782160836959","remoteaddr":"10.20.4.3","method":"GET","uri":"/metal/v1/machine/abort-reinstall","route":"/metal/v1/machine/{id}","rqid":"5019de598aab8c4f96ea782160836959","remoteaddr":"10.20.4.3","method":"GET","uri":"/metal/v1/machine/abort-reinstall","route":"/metal/v1/machine/{id}","status":403,"content-length":55,"duration":0.000260831}
{"level":"error","time":"2021-08-04T12:48:48.296Z","caller":"rest/middleware.go:100","message":"cannot get user from request","rqid":"5019de598aab8c4f96ea782160836959","remoteaddr":"10.20.4.3","method":"GET","uri":"/metal/v1/machine/abort-reinstall","route":"/metal/v1/machine/{id}","error":"Wrong HMAC found","got":"667c82a5692d7e1013a88b022af0f367e3d349860818ff8714b685bed3ae91e3","want":"4a2e464619017e291fd0113c5a884b89d825a5e8ea8fbd6b1a83eb5f1eab86c8"}
{"level":"error","time":"2021-08-04T12:48:48.274Z","caller":"rest/middleware.go:88","message":"Rest Call","rqid":"bc0c64ff4f0ae73854a35da29d30d735","remoteaddr":"10.20.4.3","method":"POST","uri":"/metal/v1/machine/eff49e00-6ff4-11e9-8000-efbeaddeefbe/finalize-allocation","route":"/metal/v1/machine/{id}/finalize-allocation","useremail":"metal-edit@metal-stack.io","rqid":"bc0c64ff4f0ae73854a35da29d30d735","remoteaddr":"10.20.4.3","method":"POST","uri":"/metal/v1/machine/eff49e00-6ff4-11e9-8000-efbeaddeefbe/finalize-allocation","route":"/metal/v1/machine/{id}/finalize-allocation","status":422,"content-length":137,"duration":0.087437639}
{"level":"error","time":"2021-08-04T12:48:48.274Z","caller":"service/service.go:96","message":"service error","rqid":"bc0c64ff4f0ae73854a35da29d30d735","remoteaddr":"10.20.4.3","method":"POST","uri":"/metal/v1/machine/eff49e00-6ff4-11e9-8000-efbeaddeefbe/finalize-allocation","route":"/metal/v1/machine/{id}/finalize-allocation","useremail":"metal-edit@metal-stack.io","operation":"finalizeAllocation","status":422,"error":"the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731","service-caller":"machine-service.go:1635","resp":"the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731 (422)"}

And these are from metal-core:

2021-08-04T12:48:48.301245+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.298Z#011error#011endpoint/abortReinstall.go:38#011Failed to abort reinstall#011{"app": "metal-core", "statusCode": 500, "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "primary disk already wiped": true, "boot information": null}
2021-08-04T12:48:48.299985+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.298Z#011error#011api/abortReinstall.go:22#011Failed to abort reinstall#011{"app": "metal-core", "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "primary disk already wiped": true, "error": "[POST /v1/machine/{id}/abort-reinstall][403] abortReinstallMachine default  Wrong HMAC found (403)"}
2021-08-04T12:48:48.279567+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.276Z#011error#011endpoint/report.go:57#011Unable to report machine back to api.#011{"app": "metal-core", "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "error": "[POST /v1/machine/{id}/finalize-allocation][422] finalizeAllocation default  the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731 (422)"}
2021-08-04T12:48:48.278363+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.275Z#011error#011api/finalizeAllocation.go:27#011Finalize failed#011{"app": "metal-core", "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "error": "[POST /v1/machine/{id}/finalize-allocation][422] finalizeAllocation default  the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731 (422)"}

@Gerrit91
Copy link
Contributor

Gerrit91 commented Aug 5, 2021

I tried to reconstruct what happened, here is what found:

  1. Everything was according to plan, wiped primary disk and reinstalled OS
  2. Then metal-hammer calls metal-core /report, returning an error
    1. The metal-core sets the boot order to HD (but it did not work?)
    2. metal-core calls the metal-api on /finalize-allocation
    • The metal-api sets the reinstall field in the machine to false
    • This function then fails when updating the switch entity (probably "was changed by another", original error message was swallowed)
  3. The metal-hammer calls /abort-reinstall, returning an error as well
    • This time the request was declided immediately from the metal-api because HMAC was incorrect (can be caused by system time lags?)
  4. When this fails, the metal-hammer runs kexec.Reboot, restarting the machine
  5. The metal-hammer reboots, now finds the reinstall flag on false, thinking it's a usual released machine...

This certainly is a cluster of very unhappy events.

My question would be: Why does the metal-hammer wipe disks when there is an allocation on the machine? I think, this should be the first thing to prevent before starting to wipe disks. Maybe this condition is to specific: https://github.com/metal-stack/metal-hammer/blob/v0.9.1/cmd/root.go#L129.

I'll investigate a little more to see what we can do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants