Celery retries tasks forever #8947

Open
RamachandranSitaraman-IB opened this issue Apr 5, 2024 Discussed in #8912 · 0 comments

Comments

@RamachandranSitaraman-IB

Discussed in #8912

Originally posted by RamachandranSitaraman-IB March 14, 2024
I have a Celery (5.3.5) based deployment with RabbitMQ. Whenever I have a long-running task, that Celery task sometimes gets executed repeatedly, forever. I am seeing this error at the worker:
Notice that the same task that already succeeded is received again and starts executing again, and this goes on forever. How can I avoid this? Any help is appreciated!
[2024-03-14 20:27:43,624: INFO/MainProcess] Task process_bias_profiles_task.celery_task_app.fairness_tasks.BiasTask[99d8b9aa-fe09-4c8e-ad79-e6fe2302e057] succeeded in 1983.877235870983s: '{
"id": "-1",
"task_name": "dummy_task",
"task_instance_id": "dummy_instance_id",
"scan_report": [
{
"id": "unique_id_for_document",
"report_title": "AI Model Vulnerability Scan Report",
"created_at": "2024-03-14T19:57:59.615942Z",
"date": "2024-03-14T19:57:59.615948Z",
"model_info": {
"bucket": "vardaan-2023213",
"key": "fairness/model-1/",
"model_filename": "model.tar.gz",
"optimizer": "Adam",
"loss_criterion": "BinaryCrossEntropy",
"input_shape": "(6)",
"num_classes": "2",
"framework": "Pytorch",
"data_source": "S3",
"data_properties": {
"dataSource": "S3",
"dataFormat": "Values",
"dataLocation": {
"bucket": "sitaraman1",
"key": "fairness/data/",
...'
[2024-03-14 20:27:43,624: WARNING/MainProcess] Substantial drift from worker8@aitrism-deployment-844bd6cb9d-2bpjd may mean clocks are out of sync. Current drift is 1926 seconds. [orig: 2024-03-14 20:27:43.624642 recv: 2024-03-14 19:55:37.357487]
[2024-03-14 20:27:43,624: DEBUG/MainProcess] worker8@aitrism-deployment-844bd6cb9d-2bpjd joined the party
[2024-03-14 20:27:43,624: CRITICAL/MainProcess] Couldn't ack 1, reason:ConnectionResetError(104, 'Connection reset by peer')
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/kombu/message.py", line 131, in ack_log_error
self.ack(multiple=multiple)
File "/usr/local/lib/python3.10/site-packages/kombu/message.py", line 126, in ack
self.channel.basic_ack(self.delivery_tag, multiple=multiple)
File "/usr/local/lib/python3.10/site-packages/amqp/channel.py", line 1407, in basic_ack
return self.send_method(
File "/usr/local/lib/python3.10/site-packages/amqp/abstract_channel.py", line 70, in send_method
conn.frame_writer(1, self.channel_id, sig, args, content)
File "/usr/local/lib/python3.10/site-packages/amqp/method_framing.py", line 186, in write_frame
write(buffer_store.view[:offset])
File "/usr/local/lib/python3.10/site-packages/amqp/transport.py", line 347, in write
self._write(s)
ConnectionResetError: [Errno 104] Connection reset by peer
[2024-03-14 20:27:43,626: INFO/MainProcess] missed heartbeat from worker14@aitrism-deployment-844bd6cb9d-tg48j
[2024-03-14 20:27:43,626: ERROR/MainProcess] Error cleaning up after event loop: RecoverableConnectionError(None, 'Socket was disconnected', None, '')
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/celery/worker/loops.py", line 97, in asynloop
next(loop)
File "/usr/local/lib/python3.10/site-packages/kombu/asynchronous/hub.py", line 373, in create_loop
cb(*cbargs)
File "/usr/local/lib/python3.10/site-packages/kombu/transport/base.py", line 248, in on_readable
reader(loop)
File "/usr/local/lib/python3.10/site-packages/kombu/transport/base.py", line 228, in _read
raise RecoverableConnectionError('Socket was disconnected')
amqp.exceptions.RecoverableConnectionError: Socket was disconnected

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/celery/worker/loops.py", line 102, in asynloop
hub.reset()
File "/usr/local/lib/python3.10/site-packages/kombu/asynchronous/hub.py", line 116, in reset
self.close()
File "/usr/local/lib/python3.10/site-packages/kombu/asynchronous/hub.py", line 275, in close
item()
File "/usr/local/lib/python3.10/site-packages/vine/promises.py", line 161, in call
return self.throw()
File "/usr/local/lib/python3.10/site-packages/vine/promises.py", line 158, in call
retval = fun(*final_args, **final_kwargs)
File "/usr/local/lib/python3.10/site-packages/kombu/transport/base.py", line 228, in _read
raise RecoverableConnectionError('Socket was disconnected')
amqp.exceptions.RecoverableConnectionError: Socket was disconnected
[2024-03-14 20:27:43,627: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/celery/worker/consumer/consumer.py", line 340, in start
blueprint.start(self)
File "/usr/local/lib/python3.10/site-packages/celery/bootsteps.py", line 116, in start
step.start(parent)
File "/usr/local/lib/python3.10/site-packages/celery/worker/consumer/consumer.py", line 742, in start
c.loop(*c.loop_args())
File "/usr/local/lib/python3.10/site-packages/celery/worker/loops.py", line 97, in asynloop
next(loop)
File "/usr/local/lib/python3.10/site-packages/kombu/asynchronous/hub.py", line 373, in create_loop
cb(*cbargs)
File "/usr/local/lib/python3.10/site-packages/kombu/transport/base.py", line 248, in on_readable
reader(loop)
File "/usr/local/lib/python3.10/site-packages/kombu/transport/base.py", line 228, in _read
raise RecoverableConnectionError('Socket was disconnected')
amqp.exceptions.RecoverableConnectionError: Socket was disconnected
[2024-03-14 20:27:43,627: DEBUG/MainProcess] Closed channel #1
[2024-03-14 20:27:43,627: DEBUG/MainProcess] Closed channel #2
[2024-03-14 20:27:43,627: DEBUG/MainProcess] Closed channel #3
[2024-03-14 20:27:43,627: DEBUG/MainProcess] | Consumer: Restarting event loop...
[2024-03-14 20:27:43,627: DEBUG/MainProcess] | Consumer: Restarting Heart...
[2024-03-14 20:27:43,628: DEBUG/MainProcess] | Consumer: Restarting Control...
[2024-03-14 20:27:43,628: DEBUG/MainProcess] | Consumer: Restarting Tasks...
[2024-03-14 20:27:43,628: DEBUG/MainProcess] Canceling task consumer...
[2024-03-14 20:27:43,628: DEBUG/MainProcess] | Consumer: Restarting Gossip...
[2024-03-14 20:27:43,628: DEBUG/MainProcess] | Consumer: Restarting Events...
[2024-03-14 20:27:43,628: DEBUG/MainProcess] | Consumer: Restarting Connection...
[2024-03-14 20:27:43,628: DEBUG/MainProcess] | Consumer: Starting Connection
[2024-03-14 20:27:43,630: WARNING/MainProcess] /usr/local/lib/python3.10/site-packages/celery/worker/consumer/consumer.py:507: CPendingDeprecationWarning: The broker_connection_retry configuration setting will no longer determine
whether broker connection retries are made during startup in Celery 6.0 and above.
If you wish to retain the existing behavior for retrying connections on startup,
you should set broker_connection_retry_on_startup to True.
warnings.warn(

[2024-03-14 20:27:43,635: DEBUG/MainProcess] Start from server, version: 0.9, properties: {'capabilities': {'publisher_confirms': True, 'exchange_exchange_bindings': True, 'basic.nack': True, 'consumer_cancel_notify': True, 'connection.blocked': True, 'consumer_priorities': True, 'authentication_failure_close': True, 'per_consumer_qos': True, 'direct_reply_to': True}, 'cluster_name': 'rabbit@aitrism-deployment-647fc487f8-7pb9f', 'copyright': 'Copyright (c) 2007-2024 Broadcom Inc and/or its subsidiaries', 'information': 'Licensed under the MPL 2.0. Website: https://rabbitmq.com', 'platform': 'Erlang/OTP 26.2.2', 'product': 'RabbitMQ', 'version': '3.13.0'}, mechanisms: [b'PLAIN', b'AMQPLAIN'], locales: ['en_US']
[2024-03-14 20:27:43,639: INFO/MainProcess] Connected to amqp://aitrism:**@34.192.217.116:5672//
[2024-03-14 20:27:43,639: DEBUG/MainProcess] ^-- substep ok
[2024-03-14 20:27:43,639: DEBUG/MainProcess] | Consumer: Starting Events
[2024-03-14 20:27:43,640: DEBUG/MainProcess] Closed channel #1
[2024-03-14 20:27:43,643: WARNING/MainProcess] /usr/local/lib/python3.10/site-packages/celery/worker/consumer/consumer.py:507: CPendingDeprecationWarning: The broker_connection_retry configuration setting will no longer determine
whether broker connection retries are made during startup in Celery 6.0 and above.
If you wish to retain the existing behavior for retrying connections on startup,
you should set broker_connection_retry_on_startup to True.
warnings.warn(

[2024-03-14 20:27:43,648: DEBUG/MainProcess] Start from server, version: 0.9, properties: {'capabilities': {'publisher_confirms': True, 'exchange_exchange_bindings': True, 'basic.nack': True, 'consumer_cancel_notify': True, 'connection.blocked': True, 'consumer_priorities': True, 'authentication_failure_close': True, 'per_consumer_qos': True, 'direct_reply_to': True}, 'cluster_name': 'rabbit@aitrism-deployment-647fc487f8-7pb9f', 'copyright': 'Copyright (c) 2007-2024 Broadcom Inc and/or its subsidiaries', 'information': 'Licensed under the MPL 2.0. Website: https://rabbitmq.com', 'platform': 'Erlang/OTP 26.2.2', 'product': 'RabbitMQ', 'version': '3.13.0'}, mechanisms: [b'PLAIN', b'AMQPLAIN'], locales: ['en_US']
[2024-03-14 20:27:43,651: DEBUG/MainProcess] ^-- substep ok
[2024-03-14 20:27:43,651: DEBUG/MainProcess] | Consumer: Starting Gossip
[2024-03-14 20:27:43,651: DEBUG/MainProcess] using channel_id: 1
[2024-03-14 20:27:43,653: DEBUG/MainProcess] Channel open
[2024-03-14 20:27:43,663: DEBUG/MainProcess] ^-- substep ok
[2024-03-14 20:27:43,663: DEBUG/MainProcess] | Consumer: Starting Tasks
[2024-03-14 20:27:43,665: DEBUG/MainProcess] using channel_id: 2
[2024-03-14 20:27:43,667: DEBUG/MainProcess] Channel open
[2024-03-14 20:27:43,674: DEBUG/MainProcess] ^-- substep ok
[2024-03-14 20:27:43,674: DEBUG/MainProcess] | Consumer: Starting Control
[2024-03-14 20:27:43,674: DEBUG/MainProcess] using channel_id: 3
[2024-03-14 20:27:43,676: DEBUG/MainProcess] Channel open
[2024-03-14 20:27:43,686: DEBUG/MainProcess] ^-- substep ok
[2024-03-14 20:27:43,686: DEBUG/MainProcess] | Consumer: Starting Heart
[2024-03-14 20:27:43,686: DEBUG/MainProcess] using channel_id: 1
[2024-03-14 20:27:43,689: DEBUG/MainProcess] Channel open
[2024-03-14 20:27:43,690: DEBUG/MainProcess] ^-- substep ok
[2024-03-14 20:27:43,691: DEBUG/MainProcess] | Consumer: Starting event loop
[2024-03-14 20:27:43,691: DEBUG/MainProcess] | Worker: Hub.register Pool...
[2024-03-14 20:27:43,693: DEBUG/MainProcess] basic.qos: prefetch_count->4
[2024-03-14 20:27:48,654: INFO/MainProcess] missed heartbeat from worker8@aitrism-deployment-844bd6cb9d-2bpjd
[2024-03-14 20:32:58,977: INFO/MainProcess] Task process_bias_profiles_task.celery_task_app.fairness_tasks.BiasTask[99d8b9aa-fe09-4c8e-ad79-e6fe2302e057] received

This is my RabbitMQ advanced.config:

  [
    {rabbit, [
      {consumer_timeout, 360000000}
    ]}
  ].

My celeryconfig.py is:

    broker_transport_options = {'max_retries': 0, 'interval_start': 0, 'interval_step': 10, 'interval_max': 30}
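
For context, here is a sketch of how this celeryconfig.py could be extended with settings that commonly matter when long-running tasks are redelivered after a broker connection loss. The values below are illustrative assumptions, not the actual deployment configuration; broker_connection_retry_on_startup is the setting the deprecation warning in the log above refers to.

    # celeryconfig.py (illustrative sketch; values are assumptions, not a confirmed fix)

    # Existing transport options from this deployment.
    broker_transport_options = {'max_retries': 0, 'interval_start': 0,
                                'interval_step': 10, 'interval_max': 30}

    # Keep retrying the broker connection at worker startup; the deprecation
    # warning in the log says this must be set explicitly from Celery 6.0 on.
    broker_connection_retry_on_startup = True

    # With late acknowledgements a message is only acked once the task
    # finishes, so a connection reset during a long task (as in the log:
    # "Couldn't ack 1 ... Connection reset by peer") makes RabbitMQ redeliver
    # the same task. Whether acks_late is enabled in this deployment is an assumption.
    task_acks_late = True

    # Lower prefetch so a worker running a long task holds fewer unacked
    # messages that would all be redelivered if the connection drops.
    worker_prefetch_multiplier = 1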
