Celery workers not sending heartbeat #4980

Open
arvindnrbt opened this issue Aug 15, 2018 · 11 comments

@arvindnrbt

arvindnrbt commented Aug 15, 2018

Similar to #3649, but I'm not using celery beat, just a celery worker instance.

I'm still facing this issue on Celery 4.2.1.
RabbitMQ log:

rabbitmq_1  | =ERROR REPORT==== 15-Aug-2018::10:05:42 ===
rabbitmq_1  | closing AMQP connection <0.28344.0> (172.18.0.5:46918 -> 172.18.0.3:5672):
rabbitmq_1  | missed heartbeats from client, timeout: 60s

celery -A proj inspect active

15-08-2018:10:08:41,476 DEBUG    [connection.py:360] Start from server, version: 0.9, properties: {'capabilities': {'publisher_confirms': True, 'exchange_exchange_bindings': True, 'basic.nack': True, 'consumer_cancel_notify': True, 'connection.blocked': True, 'consumer_priorities': True, 'authentication_failure_close': True, 'per_consumer_qos': True, 'direct_reply_to': True}, 'cluster_name': 'rabbit@09ef45b313ed', 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'information': 'Licensed under the MPL.  See http://www.rabbitmq.com/', 'platform': 'Erlang/OTP', 'product': 'RabbitMQ', 'version': '3.6.1'}, mechanisms: [b'AMQPLAIN', b'PLAIN'], locales: ['en_US']
15-08-2018:10:08:41,490 DEBUG    [channel.py:106] using channel_id: 1
15-08-2018:10:08:41,493 DEBUG    [channel.py:443] Channel open
15-08-2018:10:08:41,516 DEBUG    [connection.py:360] Start from server, version: 0.9, properties: {'capabilities': {'publisher_confirms': True, 'exchange_exchange_bindings': True, 'basic.nack': True, 'consumer_cancel_notify': True, 'connection.blocked': True, 'consumer_priorities': True, 'authentication_failure_close': True, 'per_consumer_qos': True, 'direct_reply_to': True}, 'cluster_name': 'rabbit@09ef45b313ed', 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'information': 'Licensed under the MPL.  See http://www.rabbitmq.com/', 'platform': 'Erlang/OTP', 'product': 'RabbitMQ', 'version': '3.6.1'}, mechanisms: [b'AMQPLAIN', b'PLAIN'], locales: ['en_US']
15-08-2018:10:08:41,521 DEBUG    [channel.py:106] using channel_id: 1
15-08-2018:10:08:41,524 DEBUG    [channel.py:443] Channel open
Error: No nodes replied within time constraint.

After 5 minutes, the RabbitMQ log shows:

rabbitmq_1  | =WARNING REPORT==== 15-Aug-2018::10:10:30 ===
rabbitmq_1  | closing AMQP connection <0.1904.1> (172.18.0.5:47004 -> 172.18.0.3:5672):
rabbitmq_1  | client unexpectedly closed TCP connection

celery -A proj inspect active

15-08-2018:10:12:03,837 DEBUG    [connection.py:360] Start from server, version: 0.9, properties: {'capabilities': {'publisher_confirms': True, 'exchange_exchange_bindings': True, 'basic.nack': True, 'consumer_cancel_notify': True, 'connection.blocked': True, 'consumer_priorities': True, 'authentication_failure_close': True, 'per_consumer_qos': True, 'direct_reply_to': True}, 'cluster_name': 'rabbit@09ef45b313ed', 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'information': 'Licensed under the MPL.  See http://www.rabbitmq.com/', 'platform': 'Erlang/OTP', 'product': 'RabbitMQ', 'version': '3.6.1'}, mechanisms: [b'AMQPLAIN', b'PLAIN'], locales: ['en_US']
15-08-2018:10:12:03,844 DEBUG    [channel.py:106] using channel_id: 1
15-08-2018:10:12:03,848 DEBUG    [channel.py:443] Channel open
15-08-2018:10:12:03,869 DEBUG    [connection.py:360] Start from server, version: 0.9, properties: {'capabilities': {'publisher_confirms': True, 'exchange_exchange_bindings': True, 'basic.nack': True, 'consumer_cancel_notify': True, 'connection.blocked': True, 'consumer_priorities': True, 'authentication_failure_close': True, 'per_consumer_qos': True, 'direct_reply_to': True}, 'cluster_name': 'rabbit@09ef45b313ed', 'copyright': 'Copyright (C) 2007-2016 Pivotal Software, Inc.', 'information': 'Licensed under the MPL.  See http://www.rabbitmq.com/', 'platform': 'Erlang/OTP', 'product': 'RabbitMQ', 'version': '3.6.1'}, mechanisms: [b'AMQPLAIN', b'PLAIN'], locales: ['en_US']
15-08-2018:10:12:03,876 DEBUG    [channel.py:106] using channel_id: 1
15-08-2018:10:12:03,879 DEBUG    [channel.py:443] Channel open
-> celery@ocrworker: OK
    - empty -

In celery log:

[2018-08-15 11:25:24,298: ERROR/MainProcess] Process 'ForkPoolWorker-1' pid:9 exited with 'signal 15 (SIGTERM)'
[2018-08-15 11:25:27,154: ERROR/MainProcess] Process 'ForkPoolWorker-2' pid:10 exited with 'signal 15 (SIGTERM)'
[2018-08-15 11:32:44,723: ERROR/MainProcess] Control command error: ConnectionResetError(104, 'Connection reset by peer')
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/celery/worker/pidbox.py", line 46, in on_message
    self.node.handle_message(body, message)
  File "/usr/lib/python3.6/site-packages/kombu/pidbox.py", line 129, in handle_message
    return self.dispatch(**body)
  File "/usr/lib/python3.6/site-packages/kombu/pidbox.py", line 112, in dispatch
    ticket=ticket)
  File "/usr/lib/python3.6/site-packages/kombu/pidbox.py", line 135, in reply
    serializer=self.mailbox.serializer)
  File "/usr/lib/python3.6/site-packages/kombu/pidbox.py", line 265, in _publish_reply
    **opts
  File "/usr/lib/python3.6/site-packages/kombu/messaging.py", line 181, in publish
    exchange_name, declare,
  File "/usr/lib/python3.6/site-packages/kombu/messaging.py", line 203, in _publish
    mandatory=mandatory, immediate=immediate,
  File "/usr/lib/python3.6/site-packages/amqp/channel.py", line 1732, in _basic_publish
    (0, exchange, routing_key, mandatory, immediate), msg
  File "/usr/lib/python3.6/site-packages/amqp/abstract_channel.py", line 50, in send_method
    conn.frame_writer(1, self.channel_id, sig, args, content)
  File "/usr/lib/python3.6/site-packages/amqp/method_framing.py", line 166, in write_frame
    write(view[:offset])
  File "/usr/lib/python3.6/site-packages/amqp/transport.py", line 275, in write
    self._write(s)
ConnectionResetError: [Errno 104] Connection reset by peer

I'm running Celery inside a Docker container, and I'm worried about running it with no downtime, because we often hit the Connection reset by peer error in our tests. Could you shed some light on this?

@olii
Contributor

olii commented Aug 16, 2018

Related to #4817. The heartbeat is not being sent on some connections in the connection pool.

@arvindnrbt
Author

Is there any way to fix this now? From #4817 I can see that the only workaround is to set heartbeat to 0 and get it over with. But what about heartbeat monitoring in production applications?
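
For readers who want to try the workaround mentioned above, a minimal sketch of a Celery configuration module follows. The module name `celeryconfig.py` is an assumption for illustration; `broker_heartbeat` is the Celery setting discussed in this thread and in #4817. This only disables heartbeats; it does not fix the underlying bug.

```python
# celeryconfig.py -- hypothetical module name, chosen for illustration only.

# 0 turns off AMQP heartbeats on the Celery side (the workaround from #4817).
# With heartbeats off, a dead broker connection is only noticed through TCP
# itself, so the socket-level keepalive settings start to matter.
broker_heartbeat = 0
```

Assuming an app named `proj` as in the commands above, the module could be loaded with `app.config_from_object("celeryconfig")`, or the same value could be set directly with `app.conf.broker_heartbeat = 0`.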

@olii
Contributor

olii commented Aug 16, 2018

You don't need a heartbeat when the TCP connection has keepalive enabled with a reasonably low interval. But anyway, this is a bug and it needs a proper fix.
Read a few hints here: https://www.cloudamqp.com/docs/celery.html

rheise added a commit to rheise/celery that referenced this issue Sep 20, 2018
@sposs

sposs commented Oct 23, 2018

@olii what do you recommend as 'reasonably low'? I have the same sort of issue and am trying to apply the 'broker_heartbeat=0' fix...

@olii
Contributor

olii commented Oct 23, 2018

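@sposs it is completely up to you. You can experiment with any value you like for your use case. Note that the default Linux TCP keepalive idle time is 2 hours (7200 seconds), so an idle dead connection can go unnoticed for a long time unless you lower it. See http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html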
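
To make the keepalive knobs concrete, here is a small standard-library sketch of how they are set on a plain socket. The helper name `enable_keepalive` and the values 60/10/6 are assumptions for illustration, not a recommendation from this thread, and the sketch does not show how Celery or py-amqp configure their own sockets.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=6):
    """Turn on TCP keepalive for `sock` and tune its timers where available."""
    # SO_KEEPALIVE is portable; it only switches keepalive probing on.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The per-socket timers use Linux option names; some platforms lack them,
    # so each one is applied only if the constant exists in this build.
    applied = {}
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied[name] = value
    return applied


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print(enable_keepalive(s))
```

With these example values a silent peer would be detected after roughly 60 + 10 × 6 = 120 seconds of idleness, compared with the two-hour system default mentioned above.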

@kleysonr

Facing the same error, using prefork.

python3
celery==4.2.1
amqp==2.3.2

With RabbitMQ running inside a docker container.

@YuanBLQ

YuanBLQ commented Nov 27, 2018

Facing the same dilemma as @kleysonr

python3
amqp==2.2.2
celery==4.2.1

RabbitMQ running inside a docker container.

@CrowbarKZ

I've had this issue with all pool types, using:
py3.6
celery==4.2.1 + rabbitmq (both running in docker containers)

Downgrading to celery==4.1.1 seems to solve the issue for me.

@auvipy auvipy added this to the 4.3.x Maintenance milestone Mar 12, 2019
@auvipy auvipy modified the milestones: 4.4.0, 4.5 May 7, 2019
pauleggleton-intel pushed a commit to intel/clear-linux-dissector-web that referenced this issue May 14, 2019
I have been seeing repeated emailed errors from Django reporting
"ConnectionResetError: [Errno 104] Connection reset by peer" in the call
to get task status i.e:

File "/opt/layerindex/layerindex/views.py" in task_log_view
  1572.         if result.ready():

Digging around this seems to be some sort of known bug:

celery/celery#4817
celery/celery#4980

The workaround suggested is to disable the broker heartbeat, so try
that in order to avoid the errors.

Signed-off-by: Paul Eggleton <paul.eggleton@linux.intel.com>
@avinashbakshi01

Facing a similar issue with RabbitMQ and Celery 4.1.1 in a docker container.

@auvipy
Member

auvipy commented Jan 12, 2020

celery==4.4.0 is the latest release

anelliot pushed a commit to anelliot/layerindex-web that referenced this issue Apr 7, 2020
@auvipy auvipy modified the milestones: 4.5, 5.1.0 Jan 29, 2021
@thedrow thedrow added this to To do in Celery 5.1.0 Feb 24, 2021
@thedrow thedrow moved this from To do to Backlog in Celery 5.1.0 Mar 23, 2021
@auvipy auvipy modified the milestones: 5.1.0, 5.2 Mar 28, 2021
@auvipy auvipy added the Worker label Mar 28, 2021
@auvipy
Member

auvipy commented Dec 12, 2021

Can any of you check celery/py-amqp#374?
