Worker us-central1-a/4591743526296936345 in worker pool translations-1/b-linux-v100-gpu-4 does not exist
I wonder why it wasn't considered a zombie (#6333) and killed earlier.
No traces of such an instance in the logs.
One likely explanation is that by the time the worker tried to register, it had already been removed from the database after registrationTimeout (maybe this value needs to be larger, or maybe the expiry job just happened to remove it right before it tried to register itself).
So when this happens on the services side, the worker should shut itself down instead of trying to re-register. @matt-boris @petemoore do you remember if there was a limit on the number of unsuccessful registerWorker calls?
The worker seems to have been created around the time of this incident, which saw a high number of "rate limit" exceptions.
It is possible that worker-manager wasn't able to query the state of the worker.
What needs to be checked:
- Is it possible to expire and remove a worker from the db when it was provisioned but its resources are still in use (check the state of the worker)?
- generic-worker/worker-runner should shut down after several unsuccessful attempts to register.
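
To make the second point concrete, here is a rough sketch of what a bounded registration loop could look like. This is not the actual generic-worker or worker-runner code (those are written in Go); it is a Python illustration, and `register_or_shutdown`, `RegistrationError`, `max_attempts`, and `base_delay` are all hypothetical names chosen for this example. The `register` callable stands in for whatever wraps the registerWorker call, and `shutdown` for whatever halts the instance.

```python
import time
from typing import Callable


class RegistrationError(Exception):
    """Hypothetical error raised when the service rejects the registration."""


def register_or_shutdown(
    register: Callable[[], None],
    shutdown: Callable[[], None],
    max_attempts: int = 5,          # assumed limit, not a real config value
    base_delay: float = 1.0,        # assumed backoff base, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to register up to max_attempts times, then shut down.

    Returns True if registration succeeded, False if shutdown was called.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            register()
            return True
        except RegistrationError:
            if attempt < max_attempts:
                # Exponential backoff between attempts: base, 2*base, 4*base, ...
                sleep(base_delay * 2 ** (attempt - 1))
    # Every attempt failed: release the instance instead of retrying forever.
    shutdown()
    return False
```

With a limit like this, a worker whose record was already expired from the database would stop after a handful of "does not exist" responses instead of staying up for days doing no work. Which errors should count as retryable, and what the limit should be, are open questions for the real implementation.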
I came across an instance today that's been up for 6 days doing no work. Digging into the logs I found:
...and the `dead -> failed` messages continue perpetually. I would've expected that the worker either eventually comes up successfully or shuts itself down after some amount of time.