
[K8S] livenessProbe and readinessProbe for celery beat and workers #4079

Open
JorisAndrade opened this issue Jun 8, 2017 · 59 comments

@JorisAndrade

JorisAndrade commented Jun 8, 2017

Hi,

I'm using Kubernetes to deploy my Python application. Kubernetes provides a livenessProbe and a readinessProbe (see here).

How can I check whether my celery beat or celery worker is alive and in a correct state?
Checking the PID is not a solution, because it cannot catch a deadlock, for example.

Thanks in advance for your help,

Best regards,

@thedrow
Member

thedrow commented Aug 18, 2017

Celery has a monitoring API you can use.
A pod should be considered live if the Celery worker sends a heartbeat.
A pod should be considered ready if the worker has sent the worker-online event.

If you have specific problems or feature requests, please open a separate issue.

@ScottEAdams

Would this work?

readinessProbe:
  exec:
    command:
    - "/bin/sh"
    - "-c"
    - "celery -A path.to.app status | grep -o ': OK'"
  initialDelaySeconds: 30
  periodSeconds: 10

@thedrow
Member

thedrow commented Dec 5, 2017

@7wonders You'd need to extract the celery node name first. This readinessProbe will fail if any celery instance fails, which is not what you want.

@ScottEAdams

@thedrow Hmm, I think it's actually that it will succeed even if the actual node has failed but another one is OK, which is also not a great outcome.

@redbaron

redbaron commented Jul 4, 2018

Looks like

/bin/sh -c 'exec celery -A path.to.app inspect ping -d celery@$HOSTNAME' is good enough for a readiness check and verifies just one node.
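
For example, wired into a container spec this might look roughly like the following (a sketch, not a tested manifest; the timing values are assumptions and path.to.app is a placeholder for your app module):

readinessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - exec celery -A path.to.app inspect ping -d celery@$HOSTNAME
  # timing values below are illustrative assumptions, not recommendations
  initialDelaySeconds: 30
  periodSeconds: 60
  timeoutSeconds: 10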

@desaintmartin

Beware that in some apps, running this command can take a few seconds at full CPU, AND the Kubernetes default is to run it every 10 seconds.

It is thus much safer to have a high periodSeconds (ours is set to 300).

@rana-ahmed

@redbaron did that command work for you? If so, what settings did you use for the liveness and readiness probes?

@msta

msta commented Nov 6, 2018

For some reason, this readiness probe is nowhere near satisfactory for us. inspect ping responds non-deterministically even with no load on our cluster. We run it in this form:

celery inspect ping -b "redis://archii-redis-master:6379" -d celery@archii-task-crawl-integration-7d96d86b9d-jwtq7

And with a normal probe period (10 seconds), our cluster is completely killed by the CPU Celery requires.

@joekohlsdorf

joekohlsdorf commented Nov 9, 2018

I use this for liveness with a 30s interval: sh -c 'celery -A path.to.app status | grep "${HOSTNAME}:.*OK"'
An alternative is: sh -c 'celery -A path.to.app inspect ping --destination celery@${HOSTNAME}'
Neither seems to cause any extra load; I run a fleet of well over 100 workers.

Readiness probes aren't necessary, since Celery is never used in a Service. I just set minReadySeconds: 10, which is good enough for delaying worker startup in rolling Deployments, but it obviously depends on the startup time of Celery for your project, so examine your logs and set it accordingly.
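
In manifest form, that liveness check might look roughly like this (a sketch; initialDelaySeconds and timeoutSeconds are assumptions, while the command and 30s period mirror the comment above):

livenessProbe:
  exec:
    command:
      - sh
      - -c
      # HOSTNAME is set in the container environment and expanded by sh
      - celery -A path.to.app inspect ping --destination celery@${HOSTNAME}
  initialDelaySeconds: 30  # assumed startup allowance
  periodSeconds: 30
  timeoutSeconds: 10  # assumed

Note that some commenters in this thread report inspect ping not returning a failing exit code on their Celery version, so verify the exit status on yours before relying on it.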

@WillPlatnick

Readiness probes are still useful even if they're not used in a Service. Specifically, when you do a deployment of workers and want to make sure the deployment was successful, you usually use kubectl rollout status deployment. Without readiness probes, we've deployed bad code that didn't start Celery and didn't know it.

@yardensachs

My solution was:

readinessProbe:
  exec:
    command:
      [
        "/usr/local/bin/python",
        "-c",
        "import os; from celery.task.control import inspect; from <APP> import celery_app; exit(0 if os.environ['HOSTNAME'] in ','.join(inspect(app=celery_app).stats().keys()) else 1)"
      ]

Others seem to not work 🤷‍♂️

@Strangerxxx

Thanks @yardensachs!
I spent a lot of time debugging what was wrong with the other solutions, with no luck.
It seems the celery inspect ping command does not return exit(0), or something along those lines.

@mariusburfey

celery inspect ping does work, but you need bash to expand the environment variable, like this:

        livenessProbe:
          exec:
            # bash is needed to expand the environment variable
            command: [
              "bash",
              "-c",
              "celery inspect ping -A apps -d celery@$HOSTNAME"
            ]
          initialDelaySeconds: 30  # startup takes some time
          periodSeconds: 60  # the default is quite frequent, and celery uses a lot of CPU/RAM on each check
          timeoutSeconds: 10  # default is too low

@bwanglzu

bwanglzu commented Mar 8, 2019

good to know

@WillPlatnick

We ended up ripping celery inspect ping out of our liveness probes because we found that under heavier load, the ping would just hang for minutes at a time even though the jobs were processing fine and there was no backlog. I have a feeling it had something to do with using eventlet, but we're continuing to look into it.

@thedrow
Member

thedrow commented Mar 8, 2019

@WillPlatnick That won't happen with 5.0 because Celery will be async so there will be reserved capacity for control coroutines.

@shshe

shshe commented Jul 5, 2019

I'm having trouble with inspect ping spawning defunct / zombie processes:

root      2296  0.0  0.0      0     0 ?        Z    16:04   0:00 [python] <defunct>
root      2323  0.0  0.0      0     0 ?        Z    16:05   0:00 [python] <defunct>
...

Is anyone else encountering this? There isn't a --pool argument to force single-process execution.

@mcyprian

Can I ask what you are using instead of celery inspect ping, @WillPlatnick? We've encountered a similar issue with the probe failing under heavy load.

@WillPlatnick

WillPlatnick commented Nov 2, 2019

@mcyprian We got rid of the liveness probe. My gut is telling me it has something to do with eventlet, but we haven't made it a priority to figure it out.

@HeathLee

HeathLee commented Nov 7, 2019

We hit the same CPU problem with the Redis broker.

@auvipy auvipy added this to the 4.5 milestone Nov 7, 2019
@zffocussss

Can Flower do this monitoring indirectly?

@taleinat

    command:
    - test
    - -e /tmp/worker_ready

Note that the above Kubernetes configuration for the readiness probe from @beje2k15's comment has a bug which causes the readiness probe to never fail: the entire string "-e /tmp/worker_ready" is passed to test as a single argument, and test with a single non-empty argument always succeeds. Here's one way to fix it:

    command:
    - sh
    - -c
    - test -e /tmp/worker_ready

@joekohlsdorf

joekohlsdorf commented May 17, 2022

The approach is good but it requires you to spam your broker with heartbeat garbage.
Here is an improved version which doesn't require heartbeats:

from pathlib import Path

from celery import Celery, bootsteps
from celery.signals import worker_ready, worker_shutdown

HEARTBEAT_FILE = Path("/tmp/worker_heartbeat")
READINESS_FILE = Path("/tmp/worker_ready")

class LivenessProbe(bootsteps.StartStopStep):
    requires = {'celery.worker.components:Timer'}

    def __init__(self, worker, **kwargs):
        self.requests = []
        self.tref = None

    def start(self, worker):
        # touch the heartbeat file once a second for as long as the worker runs
        self.tref = worker.timer.call_repeatedly(
            1.0, self.update_heartbeat_file, (worker,), priority=10,
        )

    def stop(self, worker):
        HEARTBEAT_FILE.unlink(missing_ok=True)

    def update_heartbeat_file(self, worker):
        HEARTBEAT_FILE.touch()


@worker_ready.connect
def worker_ready(**_):
    READINESS_FILE.touch()


@worker_shutdown.connect
def worker_shutdown(**_):
    READINESS_FILE.unlink(missing_ok=True)


app = Celery("appname")
app.steps["worker"].add(LivenessProbe)

The liveness file will be created once the connection to the broker is successfully established so you could also use that for readiness.
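
Paired with that bootstep, the Kubernetes probes can stay as simple file checks, along these lines (a sketch; the periods and thresholds are assumptions, the paths match the code above):

livenessProbe:
  exec:
    command:
      - sh
      - -c
      - find /tmp/worker_heartbeat -mmin -1 | grep .
  initialDelaySeconds: 10  # assumed
  periodSeconds: 30        # assumed
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - test -e /tmp/worker_ready
  periodSeconds: 30        # assumed

Since the bootstep touches the heartbeat file every second, requiring an mtime newer than one minute leaves plenty of slack without touching the broker at all.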

@chicocvenancio

Since we are already checking the connection to the celery broker, does it still make sense to check the status of the worker?

In my experience, yes. A broker connection only guarantees the pod can connect to the broker, but it's not uncommon for the celery worker to be in a broken state even if that passes (a stopped queue with rabbitmq is a scenario that comes to mind). We use a file so that we only run these checks every 5 minutes though, as they are a bit expensive.

@ishallbethat

How about beat? The health check settings above don't seem to work with beat.

@ahopkins
Member

You can do something similar where beat runs a "health" task and writes the timestamp to a file and you just check that.
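
On the Kubernetes side that can again be a file-age check, for example (a sketch; /tmp/beat_healthy is a hypothetical path that your beat-side handler would touch, and the 5-minute window is an assumption that should be larger than the health task's schedule interval):

livenessProbe:
  exec:
    command:
      - sh
      - -c
      - find /tmp/beat_healthy -mmin -5 | grep .
  initialDelaySeconds: 60  # assumed
  periodSeconds: 60        # assumed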

@lachtanek

lachtanek commented Jul 27, 2022

As others have suggested, we've started using celery -b $BROKER_URL inspect ping -d celery@$HOSTNAME for our health checks with a 60-second period, and it doesn't seem to raise the overall CPU usage much (make sure not to use -A though, as that would load the whole app). I'd recommend using that.

@suciua

suciua commented Oct 6, 2022

The approach is good but it requires you to spam your broker with heartbeat garbage. Here is an improved version which doesn't require heartbeats:

from pathlib import Path

from celery import Celery, bootsteps
from celery.signals import worker_ready, worker_shutdown

HEARTBEAT_FILE = Path("/dev/shm/worker_heartbeat")
READINESS_FILE = Path("/dev/shm/worker_ready")

class LivenessProbe(bootsteps.StartStopStep):
    requires = {'celery.worker.components:Timer'}

    def __init__(self, worker, **kwargs):
        self.requests = []
        self.tref = None

    def start(self, worker):
        self.tref = worker.timer.call_repeatedly(
            1.0, self.update_heartbeat_file, (worker,), priority=10,
        )

    def stop(self, worker):
        HEARTBEAT_FILE.unlink(missing_ok=True)

    def update_heartbeat_file(self, worker):
        HEARTBEAT_FILE.touch()


@worker_ready.connect
def worker_ready(**_):
    READINESS_FILE.touch()


@worker_shutdown.connect
def worker_shutdown(**_):
    READINESS_FILE.unlink(missing_ok=True)


app = Celery("appname")
app.steps["worker"].add(LivenessProbe)

The liveness file will be created once the connection to the broker is successfully established so you could also use that for readiness.

@joekohlsdorf Thanks a lot for your solution for implementing the probes! It might be worth mentioning that /dev/shm doesn't necessarily get cleared upon pod restarts, which has significant implications for these probes (see kubernetes/kubernetes#81001). It would therefore be better to use the path suggested by @limedaniel and store the files in /tmp, which would lead to:

HEARTBEAT_FILE = Path("/tmp/worker_heartbeat")
READINESS_FILE = Path("/tmp/worker_ready")

@nickjj

nickjj commented Oct 8, 2022

I really like the signal approach plus file checks, and it's what I was planning to implement as well; then I Googled around and was pleasantly surprised to find these solutions posted here.

Celery also has the option to write out a pid file when it starts. Is the worker_ready signal better because the pid file gets created before the worker is ready to do work? I casually skimmed the code base and didn't see much around the life cycle of when the pid file gets created vs some of the signals.

@GitRon

GitRon commented Oct 18, 2022

Great approach! Finally something that works and doesn't look hacky.

Any thoughts on how to do the same for celery beat? I found a signal for start-up:

@beat_init.connect
def beat_ready(**_):
    READINESS_FILE.touch()

But I just can't find a solution for the liveness check...

@auvipy auvipy modified the milestones: 5.3, 5.3.x Oct 19, 2022
@nickjj

nickjj commented Oct 19, 2022

Now that this is tagged for a milestone which hints it'll be in Celery at some point, how necessary is that timer liveness probe?

If the worker can write that file every second, it's healthy, no doubt about it. But if a container is running one process and that process crashes, then the container dies with it. In a Kubernetes context the pod will be removed and restarted.

Where I'm going with this is: can we remove that entire LivenessProbe class and hook up a startup, liveness, and readiness probe that only do this, while keeping the worker_ready and worker_shutdown signals as they are defined above?

    command:
    - sh
    - -c
    - test -e /tmp/worker_ready

My thinking is that the file either always exists (the worker is working) or the process is dead, in which case the file won't be there because the container isn't up.
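
Spelled out, that idea would look something like this (a sketch; the timing values are assumptions):

startupProbe:
  exec:
    command:
      - sh
      - -c
      - test -e /tmp/worker_ready
  periodSeconds: 5        # assumed
  failureThreshold: 30    # assumed: allows up to ~150s for startup
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - test -e /tmp/worker_ready
  periodSeconds: 30       # assumed
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - test -e /tmp/worker_ready
  periodSeconds: 30       # assumed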

@ibachar-es

It seems the file-based heartbeat approach does not work with the gevent pool when the worker is blocking. Does anyone have a solution that works for gevent as well?

@auvipy
Member

auvipy commented Dec 4, 2022

Has anyone tried KEDA?

@sondrelg
Contributor

sondrelg commented Dec 4, 2022

Has anyone tried KEDA?

We're about to, for scaling workers by queue size. We just need to switch brokers to rabbitmq first.

How does KEDA relate to readiness/liveness probes?

@benedikt-bartscher

@some1ataplace please do us all a favour and don't just post random chatgpt crap on github

@SnoozeFreddo

I also can't configure beat. There is an init signal but no cleanup counterpart (on_shutdown) and no liveness check. Has anyone figured this out?

@grapo

grapo commented Sep 15, 2023

We tried a lot of the ideas proposed here, like @beje2k15's solution with heartbeat signals. Sadly, due to bugs like #7276, a worker can respond to heartbeats (so the heartbeat_sent signal is caught) but still not receive real tasks from the broker.

The only fully reliable solution for us is to send our own heartbeat task with celery beat, handle it with the celery worker, and use the task_success signal to touch a liveness file. This solution is only good if you have a single worker! If there are more workers, some of them may not receive the task from the broker and will fail the liveness probe.

from pathlib import Path

from celery.signals import (
    after_task_publish,
    beat_init,
    task_success,
    worker_ready,
    worker_shutdown,
)

from .tasks import celery_heartbeat

HEARTBEAT_FILE = Path("/tmp/celery_live")
READINESS_FILE = Path("/tmp/celery_ready")

######
# celery worker readiness and liveness checks

@worker_ready.connect
def worker_ready(**_):
    READINESS_FILE.touch()


@worker_shutdown.connect
def worker_shutdown(**_):
    for f in (HEARTBEAT_FILE, READINESS_FILE):
        f.unlink(missing_ok=True)


@task_success.connect(sender=celery_heartbeat)
def heartbeat(**_):
    HEARTBEAT_FILE.touch()

######
# celery beat readiness and liveness checks

@beat_init.connect
def beat_ready(**_):
    READINESS_FILE.touch()


@after_task_publish.connect(sender="healthcheck.tasks.celery_heartbeat")
def task_published(**_):
    HEARTBEAT_FILE.touch()

This is our configuration for the liveness probe. We check whether /tmp/celery_live was touched in the last minute.

          livenessProbe:
            initialDelaySeconds: 90
            periodSeconds: 30
            failureThreshold: 3
            timeoutSeconds: 3
            exec:
              command:
                - /bin/sh
                - -c
                - find /tmp/celery_live -mmin -1 | grep .

@GitRon

GitRon commented Sep 15, 2023

@grapo Thanks for the detailed answer, but I think your case of only having one worker won't work for most of the crowd. You use celery to be able to scale, right? Nevertheless, thanks for sharing!

@sbasu-mdsol

What @GitRon said. I am working on a scenario where there will be at least 4-6 workers and I need to have a readiness probe that indicates that all of them are ready to process tasks.

@joekohlsdorf

Testing if the broker is alive from the worker makes no sense to me. It will lead to all workers restarting unnecessarily if the broker has a temporary issue which is unrelated to Celery workers. Workers can recover from broker failure without restart.
The Celery worker liveness check should only monitor whether the worker is ready to accept tasks, which is achieved by the code I shared in #4079 (comment).

If you really want full roundtrip monitoring you can use the celery inspect ping command I shared previously but it uses more CPU. This sends a message to the individual pidbox of the worker which goes through the broker. You can probably implement what the ping command does with a timer in the worker itself to optimize this.

All of these checks still don't mean that you are able to process real tasks, but there is really no good way to check this on the worker. You can count processed tasks, but that doesn't mean anything because your system might only process one task per month. I recommend also monitoring queue size and ack rate on your broker.

Timers can be used to do all kinds of cool things; I have code which emits currently active tasks and general worker statistics as statsd metrics. This is way better than Flower or inspect.

@rbehal

rbehal commented Apr 26, 2024

This worked best for me

          livenessProbe:
            exec:
              command:
                - sh
                - -c
                - celery inspect ping -d celery@$(hostname) | grep -q OK
            initialDelaySeconds: 30
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            exec:
              command:
                - sh
                - -c
                - celery inspect ping -d celery@$(hostname) | grep -q OK
            initialDelaySeconds: 60
            periodSeconds: 15
            failureThreshold: 3

It relies on your base image being able to run $(hostname), though.
