Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

generic worker didn't handle a failure to register well #6963

Open
bhearsum opened this issue Apr 9, 2024 · 3 comments
Open

generic worker didn't handle a failure to register well #6963

bhearsum opened this issue Apr 9, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@bhearsum
Copy link
Contributor

bhearsum commented Apr 9, 2024

I came across an instance today that's been up for 6 days doing no work. Digging into the logs I found:

15:05:14.224
Apr  3 14:04:44 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw kernel: [    4.418976] systemd[1]: unit_file_build_name_map: normal unit file: /etc/systemd/system/generic-worker.service
2024-04-03 15:05:14.225
Apr  3 14:04:44 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw kernel: [   19.662733] systemd[1]: generic-worker.service: Failed to set 'io.bfq.weight' attribute on '/system.slice/generic-worker.service' to '100': No such file or directory
2024-04-03 15:05:14.225
Apr  3 14:04:44 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw kernel: [   19.694735] systemd[1]: generic-worker.service: Passing 0 fds to service
2024-04-03 15:05:14.225
Apr  3 14:04:44 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw kernel: [   19.726710] systemd[1]: generic-worker.service: Forked /usr/local/bin/start-worker as 574
2024-04-03 15:05:14.225
Apr  3 14:04:44 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw kernel: [   19.754761] systemd[1]: generic-worker.service: Changed dead -> running
2024-04-03 15:05:14.225
Apr  3 14:04:44 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw kernel: [   19.770652] systemd[1]: generic-worker.service: Job 111 generic-worker.service/start finished, result=done
2024-04-03 15:05:14.225
Apr  3 14:04:48 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: 2024/04/03 14:04:48 Error disabling OOM killer for the start-worker process: write /proc/574/oom_adj: permission denied
2024-04-03 15:05:14.225
Apr  3 14:04:48 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: 2024/04/03 14:04:48 Loading worker-runner configuration from /etc/start-worker.yml
2024-04-03 15:05:14.225
Apr  3 14:04:49 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: 2024/04/03 14:04:48 Configuring with provider google
2024-04-03 15:05:14.373
startup-script:     configPath: /home/ubuntu/generic_worker/generic-worker.config
2024-04-03 15:05:17.700
Apr  3 14:04:51 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd-udevd[209]: Worker [517] exited
2024-04-03 15:05:17.700
Apr  3 14:04:51 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd-udevd[209]: Worker [518] exited
2024-04-03 15:05:17.700
Apr  3 14:04:51 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd-udevd[209]: Worker [584] exited
2024-04-03 15:05:17.700
Apr  3 14:04:51 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd-udevd[209]: Worker [507] exited
2024-04-03 15:05:17.700
Apr  3 14:04:51 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd-udevd[209]: Worker [645] exited
2024-04-03 15:05:17.700
Apr  3 14:04:51 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd-udevd[209]: Worker [648] exited
2024-04-03 15:05:17.700
Apr  3 14:04:51 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd-udevd[209]: Worker [651] exited
2024-04-03 15:05:17.700
Apr  3 14:04:51 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd-udevd[209]: Worker [650] exited
2024-04-03 15:05:17.700
Apr  3 14:04:51 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd-udevd[209]: Worker [643] exited
2024-04-03 15:05:17.700
Apr  3 14:04:51 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd-udevd[209]: Worker [647] exited
2024-04-03 15:05:17.700
Apr  3 14:04:53 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: 2024/04/03 14:04:53 could not register worker:
2024-04-03 15:05:17.700
Apr  3 14:04:53 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: CALL SUMMARY
2024-04-03 15:05:17.700
Apr  3 14:04:53 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: ============
2024-04-03 15:05:17.700
Apr  3 14:04:53 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Method: POST
2024-04-03 15:05:17.700
Apr  3 14:04:53 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Service: firefox-ci-tc.services.mozilla.com:
2024-04-03 15:05:17.700
Apr  3 14:04:53 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Response Body:
2024-04-03 15:05:17.700
Apr  3 14:04:53 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: {
2024-04-03 15:05:17.700
Apr  3 14:04:53 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]:   "code": "ResourceNotFound",
2024-04-03 15:05:17.700
Apr  3 14:04:53 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]:   "message": "Worker us-central1-a/4591743526296936345 in worker pool translations-1/b-linux-v100-gpu-4 does not exist\n\n---\n\n* method:     registerWorker\n* errorCode:  ResourceNotFound\n* statusCode: 404\n* time:       2024-04-03T14:04:53.838Z",
2024-04-03 15:05:17.700
Apr  3 14:04:53 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]:   "requestInfo": {
2024-04-03 15:05:17.700
Apr  3 14:04:53 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]:     "method": "registerWorker",
2024-04-03 15:05:17.700
Apr  3 14:04:53 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]:     "params": {},
2024-04-03 15:05:17.700
Apr  3 14:04:54 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]:     "payload": "(OMITTED)",
2024-04-03 15:05:17.700
Apr  3 14:04:55 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]:     "time": "2024-04-03T14:04:53.838Z"
2024-04-03 15:05:17.700
Apr  3 14:04:56 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]:   }
2024-04-03 15:05:17.700
Apr  3 14:04:56 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: }
2024-04-03 15:05:21.053
Apr  3 14:04:57 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Attempts: 1
2024-04-03 15:05:21.053
Apr  3 14:04:57 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: (Permanent) HTTP response code 404
2024-04-03 15:05:21.053
Apr  3 14:04:57 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: HTTP/2.0 404 Not Found
2024-04-03 15:05:21.053
Apr  3 14:04:57 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Content-Length: 424
2024-04-03 15:05:21.053
Apr  3 14:04:57 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Access-Control-Allow-Headers: X-Requested-With,Content-Type,Authorization,Accept,Origin,Cache-Control
2024-04-03 15:05:21.053
Apr  3 14:04:57 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Access-Control-Allow-Methods: OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT
2024-04-03 15:05:21.053
Apr  3 14:04:57 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Access-Control-Allow-Origin: *
2024-04-03 15:05:21.053
Apr  3 14:04:58 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Access-Control-Max-Age: 900
2024-04-03 15:05:21.053
Apr  3 14:04:58 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Access-Control-Request-Method: *
2024-04-03 15:05:21.054
Apr  3 14:04:59 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Alt-Svc: clear
2024-04-03 15:05:21.054
Apr  3 14:05:00 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Cache-Control: no-store no-cache must-revalidate
2024-04-03 15:05:21.054
Apr  3 14:05:01 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Content-Security-Policy: report-uri /__cspreport__;default-src 'none';frame-ancestors 'none';
2024-04-03 15:05:21.054
Apr  3 14:05:01 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Content-Type: application/json; charset=utf-8
2024-04-03 15:05:21.054
Apr  3 14:05:01 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Date: Wed, 03 Apr 2024 14:04:53 GMT
2024-04-03 15:05:21.054
Apr  3 14:05:01 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Etag: W/"1a8-YYMDDhS0cxJP+mPbqj9T3151bkU"
2024-04-03 15:05:25.730
Apr  3 14:05:01 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Server: openresty
2024-04-03 15:05:25.730
Apr  3 14:05:01 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Strict-Transport-Security: max-age=31536000
2024-04-03 15:05:25.730
Apr  3 14:05:01 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: Via: 1.1 google
2024-04-03 15:05:25.730
Apr  3 14:05:02 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: X-Content-Type-Options: nosniff
2024-04-03 15:05:25.730
Apr  3 14:05:02 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: X-For-Request-Id: 1885f8f7-fecf-45be-bd15-4b0f2a62a457
2024-04-03 15:05:25.731
Apr  3 14:05:03 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: X-For-Trace-Id: 5ca91d2155cbd77ba1e5c7e90dde122d
2024-04-03 15:05:25.731
Apr  3 14:05:04 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: #015
2024-04-03 15:05:25.731
Apr  3 14:05:04 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: {
2024-04-03 15:05:25.731
Apr  3 14:05:04 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]:   "code": "ResourceNotFound",
2024-04-03 15:05:25.731
Apr  3 14:05:04 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]:   "message": "Worker us-central1-a/4591743526296936345 in worker pool translations-1/b-linux-v100-gpu-4 does not exist\n\n---\n\n* method:     registerWorker\n* errorCode:  ResourceNotFound\n* statusCode: 404\n* time:       2024-04-03T14:04:53.838Z",
2024-04-03 15:05:25.731
Apr  3 14:05:04 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]:   "requestInfo": {
2024-04-03 15:05:25.731
Apr  3 14:05:04 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]:     "method": "registerWorker",
2024-04-03 15:05:25.731
Apr  3 14:05:04 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]:     "params": {},
2024-04-03 15:05:25.731
Apr  3 14:05:04 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]:     "payload": "(OMITTED)",
2024-04-03 15:05:25.731
Apr  3 14:05:04 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]:     "time": "2024-04-03T14:04:53.838Z"
2024-04-03 15:05:25.731
Apr  3 14:05:05 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]:   }
2024-04-03 15:05:25.731
Apr  3 14:05:06 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw start-worker[574]: }
2024-04-03 15:05:25.731
Apr  3 14:05:07 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd[1]: Received SIGCHLD from PID 574 (start-worker).
2024-04-03 15:05:25.731
Apr  3 14:05:07 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd[1]: Child 574 (start-worker) died (code=exited, status=1/FAILURE)
2024-04-03 15:05:25.731
Apr  3 14:05:07 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd[1]: generic-worker.service: Child 574 belongs to generic-worker.service.
2024-04-03 15:05:25.731
Apr  3 14:05:07 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd[1]: generic-worker.service: Main process exited, code=exited, status=1/FAILURE
2024-04-03 15:05:25.731
Apr  3 14:05:07 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd[1]: generic-worker.service: Failed with result 'exit-code'.
2024-04-03 15:05:25.731
Apr  3 14:05:07 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd[1]: generic-worker.service: Service will not restart (restart setting)
2024-04-03 15:05:25.731
Apr  3 14:05:07 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd[1]: generic-worker.service: Changed running -> failed
2024-04-03 15:05:25.731
Apr  3 14:05:07 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd[1]: generic-worker.service: Unit entered failed state.
2024-04-03 15:05:25.731
Apr  3 14:05:07 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd[1]: generic-worker.service: Consumed 121ms CPU time.
2024-04-03 15:05:25.731
Apr  3 14:05:07 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd[1]: generic-worker.service: Control group is empty.
2024-04-03 15:05:30.081
Apr  3 14:05:07 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd-udevd[209]: loop4: Worker [819] is forked for processing SEQNUM=2868.
2024-04-03 15:05:37.772
Apr  3 14:05:07 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd-udevd[209]: Worker [819] exited
2024-04-03 15:06:19.719
Apr  3 14:05:15 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw google_metadata_script_runner[1231]: startup-script:     configPath: /home/ubuntu/generic_worker/generic-worker.config
2024-04-03 15:06:46.606
Apr  3 14:05:17 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd[1]: unit_file_build_name_map: normal unit file: /etc/systemd/system/generic-worker.service
2024-04-03 15:06:54.676
Apr  3 14:05:17 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd[1]: /etc/systemd/system/generic-worker.service:9: Standard output type syslog+console is obsolete, automatically updating to journal+console. Please update your unit file, and consider removing the setting altogether.
2024-04-03 15:06:54.676
Apr  3 14:05:17 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd[1]: /etc/systemd/system/generic-worker.service:10: Standard output type syslog+console is obsolete, automatically updating to journal+console. Please update your unit file, and consider removing the setting altogether.
2024-04-03 15:06:58.360
Apr  3 14:05:17 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd[1]: generic-worker.service: Changed dead -> failed
2024-04-03 15:08:03.921
Apr  3 14:05:29 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd[1]: unit_file_build_name_map: normal unit file: /etc/systemd/system/generic-worker.service
2024-04-03 15:08:12.019
Apr  3 14:05:29 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd[1]: /etc/systemd/system/generic-worker.service:9: Standard output type syslog+console is obsolete, automatically updating to journal+console. Please update your unit file, and consider removing the setting altogether.
2024-04-03 15:08:12.019
Apr  3 14:05:29 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd[1]: /etc/systemd/system/generic-worker.service:10: Standard output type syslog+console is obsolete, automatically updating to journal+console. Please update your unit file, and consider removing the setting altogether.
2024-04-03 15:08:15.710
Apr  3 14:05:29 translations-1-b-linux-v100-gpu-4-kk-08mf5tiuuck-wu2ldaw systemd[1]: generic-worker.service: Changed dead -> failed

...and the dead -> failed messages continue perpetually.

I would've expected that either that worker eventually successfully comes up, or shuts itself down after some amount of time.

@bhearsum bhearsum added the bug Something isn't working label Apr 9, 2024
@lotas
Copy link
Contributor

lotas commented Apr 9, 2024

Thanks for adding this, this looks interesting!

Worker us-central1-a/4591743526296936345 in worker pool translations-1/b-linux-v100-gpu-4 does not exist

I wonder why it wasn't considered a zombie (#6333) and was killed before

No traces of such instance in logs.

One likely explanation is that by the time that worker tried to register it was already removed from the database after registrationTimeout (maybe this number needs to be larger, or maybe it's just happened that expiry job removed it right before it tried to register itself)

So when this happens on the services side, worker should be shutting itself down instead of trying to re-register
@matt-boris @petemoore do you remember if there was a limit on the number of unsuccessful registerWorker calls?

@matt-boris
Copy link
Contributor

It doesn't appear as though there's some limit on the number of unsuccessful registerWorker calls

@lotas
Copy link
Contributor

lotas commented Apr 9, 2024

Worker seems to have been created around this incident which saw a high number of "rate limit" exceptions.
It is possible that worker-manager wasn't able to query state of the worker

image

What needs to be checked:

  • is it possible to expire and remove a worker from a db that was provisioned but resources were still in use (check state of the worker)
  • generic-worker/worker-runner should shutdown after several unsuccessful attempts to register

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants