
[K8S] livenessProbe and readinessProbe for celery beat and workers #4079

Open
JorisAndrade opened this issue Jun 8, 2017 · 59 comments

@JorisAndrade

JorisAndrade commented Jun 8, 2017

Hi,

I'm using Kubernetes to deploy my Python application. Kubernetes provides a livenessProbe and a readinessProbe (see here).

How can I check whether my celery beat or celery worker is alive and in a correct state?
Checking the PID is not a solution, because it cannot catch a deadlock, for example.

Thanks in advance for your help,

Best regards,

@thedrow
Member

thedrow commented Aug 18, 2017

Celery has a monitoring API you can use.
A pod should be considered live if the Celery worker sends a heartbeat.
A pod should be considered ready if the worker has sent the worker-online event.

If you have specific problems or feature requests, please open a separate issue.

@ScottEAdams

Would this work?

readinessProbe:
  exec:
    command:
    - "/bin/sh"
    - "-c"
    - "celery -A path.to.app status | grep -o ': OK'"
  initialDelaySeconds: 30
  periodSeconds: 10

@thedrow
Member

thedrow commented Dec 5, 2017

@7wonders You'd need to extract the celery node name first. This readinessProbe will fail if any celery instance fails, which is not what you want.

@ScottEAdams

@thedrow Hmm, I think it's actually that it will succeed even if the actual node has failed but another one is OK, which is also not a great outcome.

@redbaron

redbaron commented Jul 4, 2018

Looks like

/bin/sh -c 'exec celery -A path.to.app inspect ping -d celery@$HOSTNAME' is good enough for a readiness check and verifies just one node.
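
For example, wired into a container spec this might look roughly like the following (a sketch, not a tested manifest; the timing values are assumptions and path.to.app is a placeholder for your app module):

readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - exec celery -A path.to.app inspect ping -d celery@$HOSTNAME
  # timing values below are illustrative assumptions, not recommendations
  initialDelaySeconds: 30
  periodSeconds: 60
  timeoutSeconds: 10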

@desaintmartin

Beware that in some apps, running this command can take a few seconds at full CPU, AND the Kubernetes default is to run it every 10 seconds.

It is thus much safer to have a high periodSeconds (ours is set to 300).

@rana-ahmed

@redbaron did that command work for you? If so, what settings did you use for the liveness and readiness probes?

@msta

msta commented Nov 6, 2018

For some reason, this readiness probe is nowhere near satisfactory for us. inspect ping responds non-deterministically even with no load on our cluster. We run it in this form:

celery inspect ping -b "redis://archii-redis-master:6379" -d celery@archii-task-crawl-integration-7d96d86b9d-jwtq7

And with a normal probe period (10 seconds), our cluster is completely killed by the CPU Celery requires.

@joekohlsdorf

joekohlsdorf commented Nov 9, 2018

I use this for liveness with a 30s interval: sh -c 'celery -A path.to.app status | grep "${HOSTNAME}:.*OK"'
An alternative is: sh -c 'celery -A path.to.app inspect ping --destination celery@${HOSTNAME}'
Neither seems to cause any extra load; I run a fleet of well over 100 workers.

Readiness probes aren't necessary, since Celery is never used in a Service. I just set minReadySeconds: 10, which is good enough for delaying worker startup in rolling Deployments, but it obviously depends on the startup time of Celery for your project, so examine your logs and set it accordingly.
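
In manifest form, that liveness check might look roughly like this (a sketch; initialDelaySeconds and timeoutSeconds are assumptions, while the command and 30s period mirror the comment above):

livenessProbe:
  exec:
    command:
      - sh
      - -c
      # HOSTNAME is set in the container environment and expanded by sh
      - celery -A path.to.app inspect ping --destination celery@${HOSTNAME}
  initialDelaySeconds: 30  # assumed startup allowance
  periodSeconds: 30
  timeoutSeconds: 10  # assumed

Note that some commenters in this thread report inspect ping not returning a failing exit code on their Celery version, so verify the exit status on yours before relying on it.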

@WillPlatnick

Readiness probes are still useful even if they're not used in a Service. Specifically, when you do a deployment of workers and want to make sure the deployment was successful, you usually use kubectl rollout status deployment. Without readiness probes, we've deployed bad code that didn't start Celery and didn't know it.

@yardensachs

My solution was:

readinessProbe:
  exec:
    command:
      [
        "/usr/local/bin/python",
        "-c",
        "import os; from celery.task.control import inspect; from <APP> import celery_app; exit(0 if os.environ['HOSTNAME'] in ','.join(inspect(app=celery_app).stats().keys()) else 1)"
      ]

Others seem to not work 🤷‍♂️

@Strangerxxx

Thanks @yardensachs!
I spent a lot of time debugging what was wrong with the other solutions, with no luck.
It seems the celery inspect ping command does not return exit(0), or something along those lines.

@mariusburfey

celery inspect ping does work, but you need bash to expand the environment variable, like this:

        livenessProbe:
          exec:
            # bash is needed to expand the environment variable
            command: [
              "bash",
              "-c",
              "celery inspect ping -A apps -d celery@$HOSTNAME"
            ]
          initialDelaySeconds: 30  # startup takes some time
          periodSeconds: 60  # the default is quite frequent, and celery uses a lot of CPU/RAM on each check
          timeoutSeconds: 10  # default is too low

@bwanglzu

bwanglzu commented Mar 8, 2019

good to know

@WillPlatnick

We ended up ripping celery inspect ping out of our liveness probes because we found that under heavier load, the ping would just hang for minutes at a time even though the jobs were processing fine and there was no backlog. I have a feeling it had something to do with using eventlet, but we're continuing to look into it.

@thedrow
Member

thedrow commented Mar 8, 2019

@WillPlatnick That won't happen with 5.0 because Celery will be async so there will be reserved capacity for control coroutines.

@shshe

shshe commented Jul 5, 2019

I'm having trouble with inspect ping spawning defunct / zombie processes:

root      2296  0.0  0.0      0     0 ?        Z    16:04   0:00 [python] <defunct>
root      2323  0.0  0.0      0     0 ?        Z    16:05   0:00 [python] <defunct>
...

Is anyone else encountering this? There isn't a --pool argument to force single-process execution.

@mcyprian

Can I ask what you are using instead of celery inspect ping, @WillPlatnick? We've encountered a similar issue with the probe failing under heavy load.

@WillPlatnick

WillPlatnick commented Nov 2, 2019

@mcyprian We got rid of the liveness probe. My gut is telling me it has something to do with eventlet, but we haven't made it a priority to figure it out.

@HeathLee

HeathLee commented Nov 7, 2019

We hit the same CPU problem with the Redis broker.

@auvipy auvipy added this to the 4.5 milestone Nov 7, 2019
@zffocussss

Can Flower do this monitoring indirectly?

@taleinat

    command:
    - test
    - -e /tmp/worker_ready

Note that the above Kubernetes configuration for the readiness probe from @beje2k15's comment has a bug which causes the readiness probe to never fail: the entire string "-e /tmp/worker_ready" is passed to test as a single argument, and test with a single non-empty argument always succeeds. Here's one way to fix it:

    command:
    - sh
    - -c
    - test -e /tmp/worker_ready

@joekohlsdorf

joekohlsdorf commented May 17, 2022

The approach is good but it requires you to spam your broker with heartbeat garbage.
Here is an improved version which doesn't require heartbeats:

from pathlib import Path

from celery import Celery, bootsteps
from celery.signals import worker_ready, worker_shutdown

HEARTBEAT_FILE = Path("/tmp/worker_heartbeat")
READINESS_FILE = Path("/tmp/worker_ready")

class LivenessProbe(bootsteps.StartStopStep):
    requires = {'celery.worker.components:Timer'}

    def __init__(self, worker, **kwargs):
        self.requests = []
        self.tref = None

    def start(self, worker):
        # touch the heartbeat file once a second for as long as the worker runs
        self.tref = worker.timer.call_repeatedly(
            1.0, self.update_heartbeat_file, (worker,), priority=10,
        )

    def stop(self, worker):
        HEARTBEAT_FILE.unlink(missing_ok=True)

    def update_heartbeat_file(self, worker):
        HEARTBEAT_FILE.touch()


@worker_ready.connect
def worker_ready(**_):
    READINESS_FILE.touch()


@worker_shutdown.connect
def worker_shutdown(**_):
    READINESS_FILE.unlink(missing_ok=True)


app = Celery("appname")
app.steps["worker"].add(LivenessProbe)

The liveness file will be created once the connection to the broker is successfully established so you could also use that for readiness.
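
Paired with that bootstep, the Kubernetes probes can stay as simple file checks, along these lines (a sketch; the periods and thresholds are assumptions, the paths match the code above):

livenessProbe:
  exec:
    command:
      - sh
      - -c
      - find /tmp/worker_heartbeat -mmin -1 | grep .
  initialDelaySeconds: 10  # assumed
  periodSeconds: 30        # assumed
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - test -e /tmp/worker_ready
  periodSeconds: 30        # assumed

Since the bootstep touches the heartbeat file every second, requiring an mtime newer than one minute leaves plenty of slack without touching the broker at all.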

@chicocvenancio

Since we are already checking the connection to the celery broker, does it still make sense to check the status of the worker?

In my experience, yes. A broker connection only guarantees the pod can connect to the broker, but it's not uncommon for the celery worker to be in a broken state even if that passes (a stopped queue with rabbitmq is a scenario that comes to mind). We use a file so that we only run these checks every 5 minutes though, as they are a bit expensive.

@ishallbethat

How about beat? The health check settings above don't seem to work with beat.

@ahopkins
Member

You can do something similar where beat runs a "health" task and writes the timestamp to a file and you just check that.
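
On the Kubernetes side that can again be a file-age check, for example (a sketch; /tmp/beat_healthy is a hypothetical path that your beat-side handler would touch, and the 5-minute window is an assumption that should be larger than the health task's schedule interval):

livenessProbe:
  exec:
    command:
      - sh
      - -c
      - find /tmp/beat_healthy -mmin -5 | grep .
  initialDelaySeconds: 60  # assumed
  periodSeconds: 60        # assumed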

@lachtanek

lachtanek commented Jul 27, 2022

As others have suggested, we've started using celery -b $BROKER_URL inspect ping -d celery@$HOSTNAME for our health checks with a 60-second period, and it doesn't seem to raise the overall CPU usage much (make sure not to use -A though, as that would load the whole app). I'd recommend using that.

@suciua

suciua commented Oct 6, 2022

The approach is good but it requires you to spam your broker with heartbeat garbage. Here is an improved version which doesn't require heartbeats:

from pathlib import Path

from celery import Celery, bootsteps
from celery.signals import worker_ready, worker_shutdown

HEARTBEAT_FILE = Path("/dev/shm/worker_heartbeat")
READINESS_FILE = Path("/dev/shm/worker_ready")

class LivenessProbe(bootsteps.StartStopStep):
    requires = {'celery.worker.components:Timer'}

    def __init__(self, worker, **kwargs):
        self.requests = []
        self.tref = None

    def start(self, worker):
        self.tref = worker.timer.call_repeatedly(
            1.0, self.update_heartbeat_file, (worker,), priority=10,
        )

    def stop(self, worker):
        HEARTBEAT_FILE.unlink(missing_ok=True)

    def update_heartbeat_file(self, worker):
        HEARTBEAT_FILE.touch()


@worker_ready.connect
def worker_ready(**_):
    READINESS_FILE.touch()


@worker_shutdown.connect
def worker_shutdown(**_):
    READINESS_FILE.unlink(missing_ok=True)


app = Celery("appname")
app.steps["worker"].add(LivenessProbe)

The liveness file will be created once the connection to the broker is successfully established so you could also use that for readiness.

@joekohlsdorf Thanks a lot for your solution for implementing the probes! It might be worth mentioning that /dev/shm doesn't necessarily get cleared upon pod restarts, which has significant implications for these probes (see kubernetes/kubernetes#81001). It would therefore be better to use the path suggested by @limedaniel and store the files in /tmp, which would lead to:

HEARTBEAT_FILE = Path("/tmp/worker_heartbeat")
READINESS_FILE = Path("/tmp/worker_ready")

@nickjj

nickjj commented Oct 8, 2022

I really like the signal approach plus file checks, and it's what I was planning to implement as well; then I Googled around and was pleasantly surprised to find these solutions posted here.

Celery also has the option to write out a pid file when it starts. Is the worker_ready signal better because the pid file gets created before the worker is ready to do work? I casually skimmed the code base and didn't see much around the life cycle of when the pid file gets created vs some of the signals.

@GitRon

GitRon commented Oct 18, 2022

Great approach! Finally something that works and doesn't look hacky.

Any thoughts on how to do the same for celery beat? I found a signal for start-up:

@beat_init.connect
def beat_ready(**_):
    READINESS_FILE.touch()

But I just can't find a solution for the liveness check...

@auvipy auvipy modified the milestones: 5.3, 5.3.x Oct 19, 2022
@nickjj

nickjj commented Oct 19, 2022

Now that this is tagged for a milestone which hints it'll be in Celery at some point, how necessary is that timer liveness probe?

If the worker can write that file every second, it's healthy, no doubt about it. But if a container is running one process and that process crashes, then the container dies with it. In a Kubernetes context the pod will be removed and restarted.

Where I'm going with this is: can we remove that entire LivenessProbe class and hook up a startup, liveness, and readiness probe that only do this, while keeping the worker_ready and worker_shutdown signals as they are defined above?

    command:
    - sh
    - -c
    - test -e /tmp/worker_ready

My thinking is that the file either always exists (the worker is working) or the process is dead, in which case the file won't be there because the container isn't up.
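
Spelled out, that idea would look something like this (a sketch; the timing values are assumptions):

startupProbe:
  exec:
    command:
      - sh
      - -c
      - test -e /tmp/worker_ready
  periodSeconds: 5        # assumed
  failureThreshold: 30    # assumed: allows up to ~150s for startup
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - test -e /tmp/worker_ready
  periodSeconds: 30       # assumed
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - test -e /tmp/worker_ready
  periodSeconds: 30       # assumed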

@ibachar-es

It seems the file-based heartbeat approach does not work with the gevent pool when the worker is blocking. Does anyone have a solution that works for gevent as well?

@auvipy
Member

auvipy commented Dec 4, 2022

Has anyone tried KEDA?

@sondrelg
Contributor

sondrelg commented Dec 4, 2022

Has anyone tried KEDA?

We're about to, for scaling workers by queue size. We just need to switch brokers to rabbitmq first.

How does KEDA relate to readiness/liveness probes?

@benedikt-bartscher

@some1ataplace please do us all a favour and don't just post random chatgpt crap on github

@SnoozeFreddo

I also can't configure beat. There is an init signal but no cleanup counterpart (on_shutdown) and no liveness check. Has anyone figured this out?

@grapo

grapo commented Sep 15, 2023

We tried a lot of the ideas proposed here, like @beje2k15's solution with heartbeat signals. Sadly, due to bugs like #7276, a worker can respond to heartbeats (so the heartbeat_sent signal is caught) but still not receive real tasks from the broker.

The only fully reliable solution for us is to send our own heartbeat task with celery beat, handle it with the celery worker, and use the task_success signal to touch a liveness file. This solution is only good if you have a single worker! If there are more workers, some of them may not receive the task from the broker and will fail the liveness probe.

from pathlib import Path

from celery.signals import (
    after_task_publish,
    beat_init,
    task_success,
    worker_ready,
    worker_shutdown,
)

from .tasks import celery_heartbeat

HEARTBEAT_FILE = Path("/tmp/celery_live")
READINESS_FILE = Path("/tmp/celery_ready")

######
# celery worker readiness and liveness checks

@worker_ready.connect
def worker_ready(**_):
    READINESS_FILE.touch()


@worker_shutdown.connect
def worker_shutdown(**_):
    for f in (HEARTBEAT_FILE, READINESS_FILE):
        f.unlink(missing_ok=True)


@task_success.connect(sender=celery_heartbeat)
def heartbeat(**_):
    HEARTBEAT_FILE.touch()

######
# celery beat readiness and liveness checks

@beat_init.connect
def beat_ready(**_):
    READINESS_FILE.touch()


@after_task_publish.connect(sender="healthcheck.tasks.celery_heartbeat")
def task_published(**_):
    HEARTBEAT_FILE.touch()

This is our configuration for the liveness probe. We check whether /tmp/celery_live was touched in the last minute.

          livenessProbe:
            initialDelaySeconds: 90
            periodSeconds: 30
            failureThreshold: 3
            timeoutSeconds: 3
            exec:
              command:
                - /bin/sh
                - -c
                - find /tmp/celery_live -mmin -1 | grep .

@GitRon

GitRon commented Sep 15, 2023

@grapo Thanks for the detailed answer, but I think your case of only having one worker won't work for most of the crowd. You use celery to be able to scale, right? Nevertheless, thanks for sharing!

@sbasu-mdsol

What @GitRon said. I am working on a scenario where there will be at least 4-6 workers and I need to have a readiness probe that indicates that all of them are ready to process tasks.

@joekohlsdorf

Testing if the broker is alive from the worker makes no sense to me. It will lead to all workers restarting unnecessarily if the broker has a temporary issue which is unrelated to Celery workers. Workers can recover from broker failure without restart.
The Celery worker liveness check should only monitor whether the worker is ready to accept tasks, which is achieved by the code I shared in #4079 (comment).

If you really want full roundtrip monitoring you can use the celery inspect ping command I shared previously but it uses more CPU. This sends a message to the individual pidbox of the worker which goes through the broker. You can probably implement what the ping command does with a timer in the worker itself to optimize this.

All of these checks still don't mean that you are able to process real tasks, but there is really no good way to check this on the worker. You can count processed tasks, but that doesn't mean anything because your system might only process one task per month. I recommend also monitoring queue size and ack rate on your broker.

Timers can be used to do all kinds of cool things; I have code which emits currently active tasks and general worker statistics as statsd metrics. This is way better than Flower or inspect.

@rbehal

rbehal commented Apr 26, 2024

This worked best for me

          livenessProbe:
            exec:
              command:
                - sh
                - -c
                - celery inspect ping -d celery@$(hostname) | grep -q OK
            initialDelaySeconds: 30
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            exec:
              command:
                - sh
                - -c
                - celery inspect ping -d celery@$(hostname) | grep -q OK
            initialDelaySeconds: 60
            periodSeconds: 15
            failureThreshold: 3

It relies on your base image being able to run $(hostname), though.
