Add a configuration option to disable prefetch completely #7106

george-miller · 2020-11-24T21:15:08Z

george-miller
Nov 24, 2020

Checklist

I have verified that the issue exists against the master branch of Celery.
This has already been asked to the discussion group first.
I have read the relevant section in the
contribution guide
on reporting bugs.
I have checked the issues list
for similar or identical bug reports.
I have checked the pull requests list
for existing proposed fixes.
I have checked the commit log
to find out if the bug was already fixed in the master branch.
I have included all related issues and possible duplicate issues
in this issue (If there are none, check this box anyway).

Mandatory Debugging Information

I have included the output of celery -A proj report in the issue.
(if you are not able to do this, then at least specify the Celery
version affected).
I have verified that the issue exists against the master branch of Celery.
I have included the contents of pip freeze in the issue.
I have included all the versions of all the external dependencies required
to reproduce this bug.

Optional Debugging Information

I have tried reproducing the issue on more than one Python version
and/or implementation.
I have tried reproducing the issue on more than one message broker and/or
result backend.
I have tried reproducing the issue on more than one version of the message
broker and/or result backend.
I have tried reproducing the issue on more than one operating system.
I have tried reproducing the issue on more than one workers pool.
I have tried reproducing the issue with autoscaling, retries,
ETA/Countdown & rate limits disabled.
I have tried reproducing the issue after downgrading
and/or upgrading Celery and its dependencies.

Related Issues and Possible Duplicates

Related Issues

None

Possible Duplicates

None

Environment & Settings

Celery version: 4.3.0 (rhubarb)

celery report Output:

software -> celery:4.3.0 (rhubarb) kombu:4.6.11 py:3.6.12
            billiard:3.6.3.0 redis:3.3.11
platform -> system:Linux arch:64bit, ELF
            kernel version:4.14.106-97.85.amzn2.x86_64 imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:redis results:redis://query-runner-redis-master:6379/0

BROKER_URL: 'redis://query-runner-redis-master:6379/0'
CELERY_RESULT_BACKEND: 'redis://query-runner-redis-master:6379/0'
CELERY_INCLUDE: ['tasks.query']
CELERY_TIMEZONE: 'UTC'
CELERY_TASK_SERIALIZER: 'json'
CELERY_IGNORE_RESULT: True
CELERY_TASK_ACKS_LATE: True
CELERYD_PREFETCH_MULTIPLIER: 1
CELERY_ROUTES: {
    'tasks.query.kill_query': {'queue': 'meta'},
    'tasks.query.run_query': {'queue': 'default'}}
CELERY_ALWAYS_EAGER: False
CELERYBEAT_SCHEDULE: {
 }
CELERY_TASK_CREATE_MISSING_QUEUES: True

Steps to Reproduce

Required Dependencies

Minimal Python Version: N/A or Unknown
Minimal Celery Version: N/A or Unknown
Minimal Kombu Version: N/A or Unknown
Minimal Broker Version: N/A or Unknown
Minimal Result Backend Version: N/A or Unknown
Minimal OS and/or Kernel Version: N/A or Unknown
Minimal Broker Client Version: N/A or Unknown
Minimal Result Backend Client Version: N/A or Unknown

Python Packages

pip freeze Output:

amqp==2.6.1
billiard==3.6.3.0
blinker==1.4
boto==2.49.0
boto3==1.16.21
botocore==1.19.21
celery==4.3.0
celery-prometheus-exporter==1.7.0
certifi==2020.11.8
chardet==3.0.4
cheroot==8.4.7
CherryPy==18.6.0
click==7.1.2
datadog==0.39.0
decorator==4.4.2
env-excavator==1.5.0
Flask==1.1.2
Flask-Cors==3.0.9
flask-swagger==0.2.14
flask-swagger-ui==3.36.0
flower==0.9.5
future==0.18.2
gunicorn==20.0.4
humanize==3.1.0
idna==2.10
importlib-metadata==2.0.0
importlib-resources==3.3.0
inflection==0.5.1
influxdb==5.3.1
itsdangerous==1.1.0
jaraco.classes==3.1.0
jaraco.collections==3.0.0
jaraco.functools==3.0.1
jaraco.text==3.2.0
JayDeBeApi==1.1.1
Jinja2==2.11.2
jmespath==0.10.0
JPype1==0.6.3
kombu==4.6.11
MarkupSafe==1.1.1
more-itertools==8.6.0
msgpack==1.0.0
numpy==1.17.0
portend==2.7.0
prometheus-client==0.9.0
prometheus-flask-exporter==0.18.1
psycopg2==2.8.6
pycrypto==2.6.1
PyHive==0.6.3
PyMySQL==0.10.1
python-dateutil==2.8.1
pytz==2020.4
PyYAML==5.3.1
qds-sdk==1.16.1
raven==6.10.0
redis==3.3.11
redlock==1.2.0
requests==2.25.0
s3transfer==0.3.3
six==1.15.0
tempora==4.0.1
thrift==0.13.0
tornado==6.1
urllib3==1.26.2
vine==1.3.0
Werkzeug==1.0.1
zc.lockfile==2.0
zipp==3.4.0

Other Dependencies

N/A

Minimally Reproducible Test Case

Expected Behavior

I was hoping by specifying -Ofair that tasks would not be reserved by workers as stated here https://docs.celeryproject.org/en/v4.3.0/userguide/optimizing.html?highlight=optimization#prefork-pool-prefetch-settings

Actual Behavior

Workers have reserved tasks while other workers are free to do work.

I am going to paste the state we are seeing in prod when a lot of tasks come in. Notice how query-runner-celery-datalake-interval-refresh-784bb858c8-qskvj has a reserved task while celery@query-runner-celery-datalake-interval-refresh-784bb858c8-dgmf9 is empty and could run that task. Please note the name of the worker is the name of the queue they are reading from.

celery inspect active and celery inspect reserved Output:


root@query-runner-celery-datalake-687847cbb6-mxrnc:/opt/workdir# celery -A tasks inspect active
2020-11-24 21:03:59,312 [WARNING] datadog.api: No agent or invalid configuration file found
2020-11-24 21:03:59,316 [INFO] tasks.datadog_helper: initialized with
2020-11-24 21:03:59,316 [INFO] tasks.datadog_helper: {'statsd_host': '172.29.156.91', 'statsd_port': 8125, 'statsd_namespace': 'query_runner_celery_tasks'}
-> celery@query-runner-celery-default-6c9bd65fd7-4cdz6: OK
    - empty -
-> celery@query-runner-celery-default-6c9bd65fd7-v8w22: OK
    - empty -
-> celery@query-runner-celery-datalake-687847cbb6-68xqn: OK
    - empty -
-> celery@query-runner-celery-ga-55cdb77d4-lsk2z: OK
    - empty -
-> celery@query-runner-celery-datalake-687847cbb6-mxrnc: OK
    - empty -
-> celery@query-runner-celery-default-6c9bd65fd7-9f8r2: OK
    - empty -
-> celery@query-runner-celery-datalake-687847cbb6-hnrhw: OK
    - empty -
-> celery@query-runner-celery-ga-55cdb77d4-9jp2n: OK
    - empty -
-> celery@query-runner-celery-datalake-687847cbb6-4dx2b: OK
    - empty -
-> celery@query-runner-celery-datalake-687847cbb6-q82hx: OK
    - empty -
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-nqhtk: OK
    * {'id': '93a32a04-7a6d-4a32-a2ee-8b88ca4f9049', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_aa42f1c1970bc1ac2778a65c60b2e08b', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 3107', 'caller_json': {'queryId': 3107, 'interval': 'HOURLY', 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-nqhtk', 'time_start': 1606251603.8548772, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 15}
    * {'id': '747a1dd7-209a-4870-a636-c18176587723', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_373fafa1388d047c83cae9d40f249655', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 3149 on dash 616', 'caller_json': {'queryId': 3149, 'interval': 'HOURLY', 'dashboard_id': 616, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-nqhtk', 'time_start': 1606251609.6809852, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 13}
    * {'id': '28ec904f-197f-4aef-99e2-0887f6f85ec7', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_829e5787764013fbf5eb0958eec86a8f', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 2400 on dash 609', 'caller_json': {'queryId': 2400, 'interval': 'HOURLY', 'dashboard_id': 609, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-nqhtk', 'time_start': 1606251604.9346218, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 16}
    * {'id': '705d4f46-662d-4930-9011-265e1065346d', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_04cd73d13695b2a7d9503148018c2650', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 3179 on dash 616', 'caller_json': {'queryId': 3179, 'interval': 'HOURLY', 'dashboard_id': 616, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-nqhtk', 'time_start': 1606240820.285725, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 14}
-> celery@query-runner-celery-default-6c9bd65fd7-4s4lz: OK
    - empty -
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-92hxc: OK
    * {'id': '7afd7a99-368e-4995-8e5a-a1802dff7476', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_9fe008b693e6dc12999bf58212430afa', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 2403 on dash 609', 'caller_json': {'queryId': 2403, 'interval': 'HOURLY', 'dashboard_id': 609, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-92hxc', 'time_start': 1606251605.1041203, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 15}
    * {'id': '4d59fc3f-40b9-4d51-ad41-e241518c629f', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_9a6f5ab9f25ab21bddcc128aee8677d6', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 2396 on dash 609', 'caller_json': {'queryId': 2396, 'interval': 'HOURLY', 'dashboard_id': 609, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-92hxc', 'time_start': 1606251604.2693179, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 16}
    * {'id': '05bc6789-2183-47d2-abe0-7093bad200a5', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_a937a6def4c1010c15dd3c5e1b488d2d', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 3116 on dash 616', 'caller_json': {'queryId': 3116, 'interval': 'HOURLY', 'dashboard_id': 616, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-92hxc', 'time_start': 1606251609.2829497, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 14}
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-269rc: OK
    * {'id': 'd6ddf309-e58a-4f2d-be71-1824e618716e', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_b7253439941188da3776ffdc8bab2b07', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 2401 on dash 609', 'caller_json': {'queryId': 2401, 'interval': 'HOURLY', 'dashboard_id': 609, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-269rc', 'time_start': 1606251605.0180476, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 16}
    * {'id': '62bb08c5-d6a6-4966-8de1-0d872faba09f', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_b75d10ba197e1bb1ab787217ca5daa96', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 2395 on dash 609', 'caller_json': {'queryId': 2395, 'interval': 'HOURLY', 'dashboard_id': 609, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-269rc', 'time_start': 1606251604.1448753, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 15}
    * {'id': '82240c16-ec84-4279-9cd4-694a73a145eb', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_6435b6004863fd5a10de1abc8bf294cd', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 3188 on dash 624', 'caller_json': {'queryId': 3188, 'interval': 'HOURLY', 'dashboard_id': 624, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-269rc', 'time_start': 1606251609.9815636, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 14}
    * {'id': '9c062ae6-bc8d-44d1-b861-068cd1032c38', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_6912fef8aee81d8aa0a68f1326e5bbf6', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 3117 on dash 616', 'caller_json': {'queryId': 3117, 'interval': 'HOURLY', 'dashboard_id': 616, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-269rc', 'time_start': 1606251609.2058136, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 13}
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-94qs4: OK
    * {'id': '87228aca-e02f-48a3-9096-5540c7b4fb22', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_e97a5e28903b3652a4a2c386fc287cc6', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 3115 on dash 616', 'caller_json': {'queryId': 3115, 'interval': 'HOURLY', 'dashboard_id': 616, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-94qs4', 'time_start': 1606251609.4260392, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 13}
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-6xldj: OK
    * {'id': '1a8982d5-32cd-4fcf-a134-72a79d7e5da7', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_02758ef185f77a7397ead9e8e90e9743', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 2397 on dash 609', 'caller_json': {'queryId': 2397, 'interval': 'HOURLY', 'dashboard_id': 609, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-6xldj', 'time_start': 1606251604.4064894, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 16}
    * {'id': 'b2804139-6d51-4ffe-817b-8b9efc7a9cd4', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_1920e1f67f84ac24fb3d2b8e8bb4bd22', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 3114 on dash 616', 'caller_json': {'queryId': 3114, 'interval': 'HOURLY', 'dashboard_id': 616, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-6xldj', 'time_start': 1606251609.3467886, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 15}
-> celery@query-runner-celery-ga-55cdb77d4-j6994: OK
    - empty -
-> celery@query-runner-celery-datalake-687847cbb6-m8wfp: OK
    - empty -
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-dgmf9: OK
    - empty -
-> celery@query-runner-celery-datalake-687847cbb6-drdl2: OK
    - empty -
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-lgstj: OK
    * {'id': '741f2751-ca24-4bcd-b78c-1d92e679952d', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_e526e7f0e7a7c28271dc8f08ee19504b', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 3021 on dash 595', 'caller_json': {'queryId': 3021, 'interval': 'HOURLY', 'dashboard_id': 595, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-lgstj', 'time_start': 1606251616.69369, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 16}
-> celery@query-runner-celery-ga-55cdb77d4-6jxzz: OK
    - empty -
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-g9swh: OK
    * {'id': '650009cc-4d6b-42c4-96b1-9a3112a1e5c8', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_711f6cdfe13c22710887f00f5678baeb', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 2399 on dash 609', 'caller_json': {'queryId': 2399, 'interval': 'HOURLY', 'dashboard_id': 609, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-g9swh', 'time_start': 1606251604.6513581, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 15}
    * {'id': 'd74e3272-45f0-44ee-bcab-7b67de60b407', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_5ca350ca8b4b185ad42f94c6dd3b6fc9', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 3189 on dash 624', 'caller_json': {'queryId': 3189, 'interval': 'HOURLY', 'dashboard_id': 624, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-g9swh', 'time_start': 1606251611.4904687, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 14}
    * {'id': '6f61b237-fa9e-4b30-8131-8d8ee75306c5', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_9bc92aabee38e8d6d607bbb24785d814', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 2137 on dash 557', 'caller_json': {'queryId': 2137, 'interval': 'HOURLY', 'dashboard_id': 557, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-g9swh', 'time_start': 1606248030.1126661, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 13}
-> celery@query-runner-celery-datalake-687847cbb6-dvdpl: OK
    - empty -
-> celery@query-runner-celery-datalake-687847cbb6-pdxsk: OK
    - empty -
-> celery@query-runner-celery-ga-55cdb77d4-vl2t6: OK
    - empty -
-> celery@query-runner-celery-default-6c9bd65fd7-g55gl: OK
    - empty -
-> celery@query-runner-celery-datalake-687847cbb6-7spd4: OK
    - empty -
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-qskvj: OK
    * {'id': 'ab45784f-8d85-452c-b690-43ead1a010c6', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_bb8a680bab37f6bc86eb9f5c7435d656', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 2816 on dash 609', 'caller_json': {'queryId': 2816, 'interval': 'HOURLY', 'dashboard_id': 609, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-qskvj', 'time_start': 1606251605.576281, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 14}
    * {'id': '4e64157e-9356-4362-88af-318013c0c4de', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_4859064fc4b96df99eec68e94f733157', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 3143 on dash 616', 'caller_json': {'queryId': 3143, 'interval': 'HOURLY', 'dashboard_id': 616, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-qskvj', 'time_start': 1606251609.5426545, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 13}
    * {'id': 'ac0c403b-ade6-4377-9a41-6d305268204b', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_b04a82ce9376c9c6618c3a300abe4182', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 3109', 'caller_json': {'queryId': 3109, 'interval': 'HOURLY', 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-qskvj', 'time_start': 1606240815.8285584, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 16}
    * {'id': 'e9c368de-d47f-4a84-b05d-81cbda00a394', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_4d03e965dbd57a05d6071c0b5cbf699d', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 2398 on dash 609', 'caller_json': {'queryId': 2398, 'interval': 'HOURLY', 'dashboard_id': 609, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-qskvj', 'time_start': 1606251604.517987, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 15}
-> celery@query-runner-celery-default-6c9bd65fd7-8zdvm: OK
    - empty -
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-2zxhc: OK
    * {'id': '03e786df-44b1-412a-8354-3acbc762528f', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_0687aa16794011f4d1a0ee1023173bf8', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 3105 on dash 557', 'caller_json': {'queryId': 3105, 'interval': 'HOURLY', 'dashboard_id': 557, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-2zxhc', 'time_start': 1606251618.9102547, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 16}
    * {'id': '646617ce-9835-4b60-b5fd-c4a4a5631749', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_ed09cf8c1166115a4ed1a5aa33b032c7', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 3144 on dash 616', 'caller_json': {'queryId': 3144, 'interval': 'HOURLY', 'dashboard_id': 616, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-2zxhc', 'time_start': 1606251609.6082683, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 15}
root@query-runner-celery-datalake-687847cbb6-mxrnc:/opt/workdir# celery -A tasks inspect reserved
2020-11-24 21:04:03,851 [WARNING] datadog.api: No agent or invalid configuration file found
2020-11-24 21:04:03,854 [INFO] tasks.datadog_helper: initialized with
2020-11-24 21:04:03,855 [INFO] tasks.datadog_helper: {'statsd_host': '172.29.156.91', 'statsd_port': 8125, 'statsd_namespace': 'query_runner_celery_tasks'}
-> celery@query-runner-celery-default-6c9bd65fd7-4cdz6: OK
    - empty -
-> celery@query-runner-celery-default-6c9bd65fd7-v8w22: OK
    - empty -
-> celery@query-runner-celery-datalake-687847cbb6-mxrnc: OK
    - empty -
-> celery@query-runner-celery-datalake-687847cbb6-68xqn: OK
    - empty -
-> celery@query-runner-celery-datalake-687847cbb6-q82hx: OK
    - empty -
-> celery@query-runner-celery-ga-55cdb77d4-lsk2z: OK
    - empty -
-> celery@query-runner-celery-datalake-687847cbb6-4dx2b: OK
    - empty -
-> celery@query-runner-celery-datalake-687847cbb6-hnrhw: OK
    - empty -
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-lgstj: OK
    - empty -
-> celery@query-runner-celery-default-6c9bd65fd7-9f8r2: OK
    - empty -
-> celery@query-runner-celery-ga-55cdb77d4-j6994: OK
    - empty -
-> celery@query-runner-celery-ga-55cdb77d4-9jp2n: OK
    - empty -
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-92hxc: OK
    - empty -
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-269rc: OK
    - empty -
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-nqhtk: OK
    - empty -
-> celery@query-runner-celery-default-6c9bd65fd7-4s4lz: OK
    - empty -
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-94qs4: OK
    - empty -
-> celery@query-runner-celery-datalake-687847cbb6-m8wfp: OK
    - empty -
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-6xldj: OK
    - empty -
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-dgmf9: OK
    - empty -
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-g9swh: OK
    - empty -
-> celery@query-runner-celery-ga-55cdb77d4-6jxzz: OK
    - empty -
-> celery@query-runner-celery-datalake-687847cbb6-dvdpl: OK
    - empty -
-> celery@query-runner-celery-default-6c9bd65fd7-g55gl: OK
    - empty -
-> celery@query-runner-celery-datalake-687847cbb6-drdl2: OK
    - empty -
-> celery@query-runner-celery-ga-55cdb77d4-vl2t6: OK
    - empty -
-> celery@query-runner-celery-default-6c9bd65fd7-8zdvm: OK
    - empty -
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-qskvj: OK
    * {'id': '3f501bbe-6c56-48e2-ba64-9b1adac30695', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_6eb04958c42a71ebd014f9be9e116f73', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 2823 on dash 557', 'caller_json': {'queryId': 2823, 'interval': 'HOURLY', 'dashboard_id': 557, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-qskvj', 'time_start': None, 'acknowledged': False, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': None}
-> celery@query-runner-celery-datalake-687847cbb6-7spd4: OK
    - empty -
-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-2zxhc: OK
    - empty -
-> celery@query-runner-celery-datalake-687847cbb6-pdxsk: OK
    - empty -

Also, here is the output of celery inspect stats where you can see fair is getting in properly

celery inspect stats for worker query-runner-celery-datalake-interval-refresh-784bb858c8-dgmf9 Output:

``` -> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-dgmf9: OK { "broker": { "alternates": [], "connect_timeout": 4, "failover_strategy": "round-robin", "heartbeat": 120.0, "hostname": "query-runner-redis-master", "insist": false, "login_method": null, "port": 6379, "ssl": false, "transport": "redis", "transport_options": {}, "uri_prefix": null, "userid": null, "virtual_host": "0" }, "clock": "8600271", "pid": 8, "pool": { "max-concurrency": 4, "max-tasks-per-child": "N/A", "processes": [ 13, 14, 15, 16 ], "put-guarded-by-semaphore": false, "timeouts": [ 0, 0 ], "writes": { "all": "42.86%, 42.86%, 14.29%", "avg": "33.33%", "inqueues": { "active": 0, "total": 4 }, "raw": "3, 3, 1", "strategy": "fair", "total": 7 } }, "prefetch_count": 4, "rusage": { "idrss": 0, "inblock": 72400, "isrss": 0, "ixrss": 0, "majflt": 158, "maxrss": 68800, "minflt": 69557, "msgrcv": 0, "msgsnd": 0, "nivcsw": 2942, "nsignals": 0, "nswap": 0, "nvcsw": 334278, "oublock": 1584, "stime": 7.775837, "utime": 152.889586 }, "total": { "tasks.query.run_query": 7 } } ```

Generally, we have a lot of tasks long running tasks. This issue is making our tasks take double the amount of time because they are unnecessarily waiting behind other tasks without utilizing the other workers.

xirdneh · 2020-11-26T17:30:19Z

xirdneh
Nov 26, 2020
Collaborator

This is a recurrent misconception. There are two prefetch at play here. One is for the Main worker process and the second one is for the worker threads (or spawned processes)

We can take a step back to remember that a Celery worker has a main process which is the one that takes messages from the broker. In this case your output is showing the tasks that each worker is pulling from the broker.
Then, each worker main process spawns multiple processes and each task that's pulled from the broker is then sent to these spawned processes.

Using -O fair makes the main worker send tasks only to spawned processes that are free and not send as many tasks as the spawned processes can take (effectively making the spawned processes do another layer of prefetch). When I say "send tasks to the spawned processes" I really mean that the main process writes these messages to the pipe assigned to the spawned process. So, without -O fair the main process writes as much data to the pipe as possible

All of this means that if you want to make sure workers are not too greedy you have to play with the prefetch multiplier
So, if you want to make sure every worker only prefetches as many tasks as the worker's spawned processes can take care of then use -O fair and set worker_prefetch_multiplier to 1

0 replies

george-miller · 2020-11-30T16:19:16Z

george-miller
Nov 30, 2020
Author

Thanks so much for the info!

Sadly, we've already set -O fair and worker_prefetch_multiplier to 1. What I am saying is this Using -O fair makes the main worker send tasks only to spawned processes that are free is not the case. Our workers are pulling tasks from the broker and reserving them even when all the worker's processes are full.

In my original message under the celery inspect active and celery inspect reserved we can look at this worker celery@query-runner-celery-datalake-interval-refresh-784bb858c8-qskvj

Active

-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-qskvj: OK
    * {'id': 'ab45784f-8d85-452c-b690-43ead1a010c6', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_bb8a680bab37f6bc86eb9f5c7435d656', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 2816 on dash 609', 'caller_json': {'queryId': 2816, 'interval': 'HOURLY', 'dashboard_id': 609, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-qskvj', 'time_start': 1606251605.576281, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 14}
    * {'id': '4e64157e-9356-4362-88af-318013c0c4de', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_4859064fc4b96df99eec68e94f733157', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 3143 on dash 616', 'caller_json': {'queryId': 3143, 'interval': 'HOURLY', 'dashboard_id': 616, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-qskvj', 'time_start': 1606251609.5426545, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 13}
    * {'id': 'ac0c403b-ade6-4377-9a41-6d305268204b', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_b04a82ce9376c9c6618c3a300abe4182', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 3109', 'caller_json': {'queryId': 3109, 'interval': 'HOURLY', 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-qskvj', 'time_start': 1606240815.8285584, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 16}
    * {'id': 'e9c368de-d47f-4a84-b05d-81cbda00a394', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_4d03e965dbd57a05d6071c0b5cbf699d', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 2398 on dash 609', 'caller_json': {'queryId': 2398, 'interval': 'HOURLY', 'dashboard_id': 609, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-qskvj', 'time_start': 1606251604.517987, 'acknowledged': True, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': 15}

Reserved

-> celery@query-runner-celery-datalake-interval-refresh-784bb858c8-qskvj: OK
    * {'id': '3f501bbe-6c56-48e2-ba64-9b1adac30695', 'name': 'tasks.query.run_query', 'args': '()', 'kwargs': "{'query_id': 'query_6eb04958c42a71ebd014f9be9e116f73', 'queue': 'datalake-interval-refresh', 'caller_info': {'caller': 'Interval refresh HOURLY for Query id 2823 on dash 557', 'caller_json': {'queryId': 2823, 'interval': 'HOURLY', 'dashboard_id': 557, 'action': 'refresh_data'}, 'user_id': '0', 'username': 'intervalRefreshUser'}}", 'type': 'tasks.query.run_query', 'hostname': 'celery@query-runner-celery-datalake-interval-refresh-784bb858c8-qskvj', 'time_start': None, 'acknowledged': False, 'delivery_info': {'exchange': '', 'routing_key': 'datalake-interval-refresh', 'priority': 0, 'redelivered': None}, 'worker_pid': None}

This worker pulled a task from the broker even when it already had 4 processes running (I have 4 worker processes per worker node). This is exactly what we don't want to happen, it makes that reserved task wait a long time before it starts executing when there are plenty of other empty workers waiting for tasks.

What I would like to happen is for the worker to wait until one of it's processes is finished, then go look at the broker and grab a task. I want to completely disable prefetching.

0 replies

george-miller · 2020-11-30T16:24:57Z

george-miller
Nov 30, 2020
Author

Oh I think I get what you're saying. -O fair only affect the scheduling of worker main processes to worker threads, while prefetch applies to scheduling from broker to worker main process.

In that case, we really need a way to disable prefetching altogether. Afaik, this is not possible right? Why not? Is there any way I could monkeypatch or config to make prefetching only happen when the worker has a free thread?

0 replies

thedrow · 2021-02-24T14:17:07Z

thedrow
Feb 24, 2021
Maintainer

It should be possible but I don't think anyone has requested us to implement it.
What we need is a boolean configuration value that will completely disable worker prefetch.

I'm currently scheduling this for the future milestone. If you want to dive in and write a patch, feel free to submit a PR.

0 replies

samdoolin · 2021-04-09T15:05:25Z

samdoolin
Apr 9, 2021

We have also been struggling with this (workers prefetching tasks from the queue even though all pool processes / subprocesses are currently busy).

We note from the documentation that it is recommended to set task_acks_late = True, however we are not able to do this.
So we have been trying to expose the mechanism by which a worker decides whether it can accept more messages.

The code below is a loader (invoked via celery -A APP --loader path.to.SingleTaskLoader worker) which patches kombu.transport.virtual.base.QoS.can_consume to call a delegate function, which can be supplied later.
This example fetches a single message from the queue, executes it, and only then fetches the next message.
Additional logic could be packed in the delegate can_consume() towards the bottom.

It seems to work with celery version 4.4.7, prefork pool. But use at your own risk. Hope it helps somebody. Advice appreciated.

from celery.loaders.app import AppLoader


class SingleTaskLoader(AppLoader):

    def on_worker_init(self):
        # called when the worker starts, before logging setup
        super().on_worker_init()

        """
        Celery depends on kombu for messaging.

        The criteria for whether a celery worker can accept more messages
        from a queue is @ kombu.transport.virtual.base.QoS.can_consume()

        The standard implementation depends on `<QoS>.prefetch_count`,
        which is a function of `--concurrency` * `--prefetch-multiplier`.

        If `<QoS>.prefetch_count == 0` prefetching of
        messages is unrestricted.

        The standard implementation also depends on a count of received
        messages `<QoS>._delivered` vs ack-ed messages `<QoS>._dirty`,
        and hence the prefetch behaviour is different depending on the
        `ACKS_LATE` setting.

        https://docs.celeryproject.org/en/stable/userguide/optimizing.html#reserve-one-task-at-a-time
        https://github.com/celery/celery/issues/6500
        https://github.com/celery/celery/issues/2788
        https://stackoverflow.com/questions/16040039/understanding-celery-task-prefetching

        here we override kombu.transport.virtual.base.QoS.can_consume()
        to run a delegate function, instead of the builtin implementation.
        """

        def can_consume(_self):
            """
            override for kombu.transport.virtual.base.QoS.can_consume

            run a delegate function, instead of the builtin implementation
            """
            return getattr(_self, 'delegate_can_consume', lambda: False)()

        import kombu.transport.virtual.base
        kombu.transport.virtual.base.QoS.can_consume = can_consume

        """
        Celery instances are built from "blueprints"
        https://docs.celeryproject.org/en/latest/userguide/extending.html#blueprints

        > The Worker is the first blueprint to start...
        > When the worker is fully started it continues with the Consumer
        > blueprint, that sets up how tasks are executed,
        > connects to the broker and starts the message consumers.

        In the Consumer blueprint, celery.worker.consumer.tasks.Tasks is
        responsible for setting
        `<Consumer>.task_consumer = kombu.messaging.Consumer(...)`

        hence `<Consumer>.task_consumer.channel` is type
        kombu.transport.virtual.base.Channel

        hence `<Consumer>.task_consumer.channel.qos` is type
        kombu.transport.virtual.base.QoS

        here we add a new bootstep to the celery Consumer,
        to set `<QoS>.delegate_can_consume`
        """

        from celery import bootsteps
        from celery.worker import state as worker_state

        class Set_QoS_Delegate(bootsteps.StartStopStep):

            requires = {'celery.worker.consumer.tasks:Tasks'}

            def start(self, c):

                def can_consume():
                    """
                    delegate for QoS.can_consume

                    only fetch a message from the queue if the worker has
                    no other messages
                    """
                    # note: reserved_requests includes active_requests
                    return len(worker_state.reserved_requests) == 0

                c.task_consumer.channel.qos.delegate_can_consume = can_consume

        # add bootstep
        self.app.steps['consumer'].add(Set_QoS_Delegate)

0 replies

thedrow · 2021-04-11T11:52:45Z

thedrow
Apr 11, 2021
Maintainer

That is an interesting monkeypatch you got there.
The proper way to do this right now is to inherit from the Channel and replace the QoS implementation.

If you can come up with a patch that does this properly, we'll include this as an option.

EDIT:
Or maybe this is a suitable implementation. @celery/core-developers Thoughts?

0 replies

galCohen88 · 2021-04-11T12:27:28Z

galCohen88
Apr 11, 2021
Collaborator

@samdoolin @george-miller If I'm getting you correctly, you like the worker to consumes messages only when it's free for processing.

I tried reproducing this with a single worker with acks_late=true & worker_prefetch_multiplier=1 and it seems to be fine (consuming only when it can)

What am I missing?

0 replies

samdoolin · 2021-04-11T14:42:28Z

samdoolin
Apr 11, 2021

@galCohen88 thank you for your reply. Our problem is that we cannot set acks_late=True (as I said above).

The implementation of kombu.transport.virtual.base.QoS.can_consume is such that the command line argument --prefetch-multiplier limits the number of tasks that a worker will reserve in addition to tasks that have been ack-ed.

Hence with acks_late=True everything looks good, but with acks_late=False it is not possible to prevent a worker from reserving additional tasks when the worker does not have any available pool processes / subprocesses to execute them.

0 replies

samdoolin · 2021-04-11T14:57:04Z

samdoolin
Apr 11, 2021

@thedrow thank you for your reply. I agree that it's a bit of a grungy monkey patch.

I could inherit from the concrete kombu channel of the transport type that I'm going to use (e.g. kombu.transport.redis.Channel), and then override the concrete QoS implementation of that channel. But I couldn't find a legitimate way to patch into the base QoS class (kombu.transport.virtual.base.QoS), which would then work irrespective of transport.

0 replies

galCohen88 · 2021-04-11T16:48:36Z

galCohen88
Apr 11, 2021
Collaborator

@samdoolin Can you give me some context? Why acks_late can not be set to true? Is it mission critical / specific for your usecase?

0 replies

samdoolin · 2021-04-11T18:19:49Z

samdoolin
Apr 11, 2021

Can you give me some context? Why acks_late can not be set to true? Is it mission critical / specific for your usecase?

Unfortunately our tasks are not idempotent, and so per the documentation we should not use acks_late=True.

0 replies

samdoolin · 2021-04-12T13:17:50Z

samdoolin
Apr 12, 2021

The proper way to do this right now is to inherit from the Channel and replace the QoS implementation.

The concrete Transport class is resolved by the method kombu.connection.Connection.get_transport_cls(), and then the class hierarchy is Transport.Channel.QoS. So if I want this to work for all transports (perhaps unreasonably), I end up writing something very fragile like...

class QoS_Mixin:
    def can_consume(self):  # override
        ...


class Connection(kombu.connection.Connection):
    def get_transport_cls(self):
        transport_cls = super().get_transport_cls()
        channel_cls = transport_cls.Channel
        qos_cls = channel_cls.QoS
        QoS = type('QoS', (QoS_Mixin, qos_cls), {})
        Channel = type('Channel', (channel_cls,), {'QoS': QoS})
        Transport = type('Transport', (transport_cls,), {'Channel': Channel})
        return Transport

Perhaps I should instead override the setting kombu.transport.__init__.TRANSPORT_ALIASES for the one transport that I care about.

0 replies

thedrow · 2021-04-12T13:42:25Z

thedrow
Apr 12, 2021
Maintainer

Maybe that's the right solution.

0 replies

Thijsvandepoll · 2021-11-16T12:12:39Z

Thijsvandepoll
Nov 16, 2021

Hi @samdoolin, I am facing a similar issue with tasks which cannot be used with acks_late. Could you maybe elaborate on how to use your solution on a Redis broker. I have tried copying the SingleTaskLoader class and adding --loader path.to.SingleTaskLoader as an argument in the celery worker (with and without --prefetch-multiplier), without any result. Still tasks are in a RECEIVED state, causing it to take a while before being picked up.

More information on my issue can be found in this StackOverflow issue: https://stackoverflow.com/questions/69987419/celery-prefetched-tasks-stuck-behind-other-tasks-on-ecs-cluster

0 replies

samdoolin · 2021-11-16T14:06:35Z

samdoolin
Nov 16, 2021

Hi @Thijsvandepoll

We have been running with a Redis broker, with code for the loader similar to my post back in April (updated below).

We start up the worker with celery worker ... --prefetch-multiplier -1 --loader path.to.SingleTaskLoader. Without the loader, I think that --prefetch-multiplier -1 should force the worker to never consume tasks from the queue. With the loader, the worker is permitted to consume tasks from the queue only by the delegate can_consume(), which is configured by the Set_QoS_Delegate bootstep.

For other backends I think that kombu.transport.virtual.base.QoS.can_consume_max_estimate() might be an additional complication.

Still feels like an unpalatable hack, but it has been working for us.

class SingleTaskLoader(AppLoader):

    def on_worker_init(self):
        # called when the worker starts, before logging setup
        super().on_worker_init()

        """
        STEP 1:
        monkey patch kombu.transport.virtual.base.QoS.can_consume()
        to prefer to run a delegate function,
        instead of the builtin implementation.
        """

        import kombu.transport.virtual

        builtin_can_consume = kombu.transport.virtual.QoS.can_consume

        def can_consume(self):
            """
            monkey patch for kombu.transport.virtual.QoS.can_consume

            if self.delegate_can_consume exists, run it instead
            """
            if delegate := getattr(self, 'delegate_can_consume', False):
                return delegate()
            else:
                return builtin_can_consume(self)

        kombu.transport.virtual.QoS.can_consume = can_consume

        """
        STEP 2:
        add a bootstep to the celery Consumer blueprint
        to supply the delegate function above.
        """

        from celery import bootsteps
        from celery.worker import state as worker_state

        class Set_QoS_Delegate(bootsteps.StartStopStep):

            requires = {'celery.worker.consumer.tasks:Tasks'}

            def start(self, c):

                def can_consume():
                    """
                    delegate for QoS.can_consume

                    only fetch a message from the queue if the worker has
                    no other messages
                    """
                    # note: reserved_requests includes active_requests
                    return len(worker_state.reserved_requests) == 0

                # types...
                # c: celery.worker.consumer.consumer.Consumer
                # c.task_consumer: kombu.messaging.Consumer
                # c.task_consumer.channel: kombu.transport.virtual.Channel
                # c.task_consumer.channel.qos: kombu.transport.virtual.QoS
                c.task_consumer.channel.qos.delegate_can_consume = can_consume

        # add bootstep to Consumer blueprint
        self.app.steps['consumer'].add(Set_QoS_Delegate)

0 replies

Thijsvandepoll · 2021-11-16T14:57:05Z

Thijsvandepoll
Nov 16, 2021

Which version of celery are you using? And do you use any concurrency? I am currently running 5.1.1 together with Flask. For some reason the --loader is not picked up. I tried pass it directly to the Celery class:

class FlaskCelery(Celery):

    def __init__(self, *args, **kwargs):

        super(FlaskCelery, self).__init__(*args, **kwargs)
        self.patch_task()

        if 'app' in kwargs:
            self.init_app(kwargs['app'])

    def patch_task(self):
        TaskBase = self.Task
        _celery = self

        class ContextTask(TaskBase):
            abstract = True

            def __call__(self, *args, **kwargs):
                if flask.has_app_context():
                    return TaskBase.__call__(self, *args, **kwargs)
                else:
                    with _celery.app.app_context():
                        return TaskBase.__call__(self, *args, **kwargs)

            def after_return(self, status, retval, task_id, args, kwargs, einfo):
                super().after_return(status, retval, task_id, args, kwargs, einfo)
                db.session.remove()

        self.Task = ContextTask

    def init_app(self, app):
        self.app = app
        self.config_from_object(app.config)

celery = FlaskCelery(
    __name__,
    broker=config.PublishedConfig.broker_url,
    backend=config.PublishedConfig.result_backend,
    broker_use_ssl={"ssl_cert_reqs": ssl.CERT_REQUIRED},
    redis_backend_use_ssl={"ssl_cert_reqs": ssl.CERT_REQUIRED},
    include=["tasks"],
    loader=SingleTaskLoader
)

Then it indeed starts reaching the can_consume monkeypatch, as expected. However, interestingly it does consume a single task at a time. Even though the concurrency is set to 4 (-c 4). I guess you also use concurrent jobs right? It seems to be only applied on a single ForkPoolWorker for some reason. Expected is that it should work on any of them right? Any ideas?

0 replies

samdoolin · 2021-11-16T15:54:16Z

samdoolin
Nov 16, 2021

interestingly it does consume a single task at a time. Even though the concurrency is set to 4 (-c 4)

The delegate can_consume() in my example above is intended to do that.
You should be able to inject more logic, so I guess something like this might work...

def can_consume():
    return len(worker_state.reserved_requests) < c.controller.concurrency

0 replies

Thijsvandepoll · 2021-11-16T16:07:26Z

Thijsvandepoll
Nov 16, 2021

Ah I was under the impression that you also intended to have more than a single concurrent task at the same time! Anyways this indeed works as expected! The --loader argument does not work as expected, but passing it to the constructor does the job. Thanks a lot! I will test if it indeed works together with autoscaling in ECS, but so far it works great.

0 replies

Diggsey · 2021-11-17T12:23:32Z

Diggsey
Nov 17, 2021

This is a problem for me too :( We have our own more reliable retry mechanism (based on DB outbox pattern) and so really don't want the retry behaviour associated with acks_late=True.

@samdoolin

For other backends I think that kombu.transport.virtual.base.QoS.can_consume_max_estimate() might be an additional complication.

We use rabbitmq - are you saying we'd need to monkey patch can_consume_max_estimate too?

0 replies

samdoolin · 2021-11-17T13:08:13Z

samdoolin
Nov 17, 2021

@Diggsey the docstring for can_consume_max_estimate() only mentions SQS, and kombu/transport/SQS.py seems to be the only usage, so I guess not.

0 replies

auvipy · 2021-11-17T15:21:42Z

auvipy
Nov 17, 2021
Maintainer

Was thinking transfaring this issue to discussion before any concrete consensus is reached

2 replies

samdoolin Nov 18, 2021

I hope recent interest shows a desire for a mechanism to disable task prefetch onto the worker in the context of acks_late=False

auvipy Nov 19, 2021
Maintainer

lets reach to the consensus, we will certainly consider anything useful

emarbo · 2021-11-18T10:22:36Z

emarbo
Nov 18, 2021

Hello there, I have found that the prefetch related limit takes into account the reserved tasks, but not the scheduled.

Let's suppose we have a main worker with concurrency 2, prefetch limit 1, and acks_late. The Main Process gets 1 task from RabbitMQ but it's scheduled in the future, this task goes to the scheduled queue (celery inspect scheduled). The next 2 tasks are ready to run and delivered to the worker processes (active). In this scenario, the first task is blocked until the others finish, no matter if there are other main workers ready to consume jobs. If the active tasks run for days, the scheduled tasks will be blocked for days.

@samdoolin I think the worker_state.reserved_requests does not include these scheduled tasks. Have you encountered/tested this scenario using your patch?

Edit: The scheduled tasks are moved to the reserved queue once the ETA is met. If the main process needs to consume 100 scheduled tasks until it encounters 2 tasks ready to run, there will be 100 scheduled tasks blocked in this process. After the ETA time is met, you can see the 100 tasks in the celery inspect reserved (even having the prefetch multiplier 1 and using acks_late)

1 reply

samdoolin Nov 18, 2021

I think the worker_state.reserved_requests does not include these scheduled tasks

I believe you are correct. It does seem a little counterintuitive that tasks with ETA are pulled off the broker onto a single worker, where they then wait until they are due to run. As you point out, the worker might be busy when the ETA is due, whilst there could be another worker sitting idle.

Diggsey · 2021-11-18T18:30:36Z

Diggsey
Nov 18, 2021

@samdoolin I tried your solution with RabbitMQ, but I'm unable to specify --prefetch-multiplier -1, as it complains that -1 is outside the allowed range.

Is the change to can_consume alone sufficient?

2 replies

samdoolin Nov 19, 2021

interesting. We haven't yet encountered validation on --prefetch-multiplier. Without that, I guess you have to put more faith in the monkey patch, and accept that during boot-up of the worker, there is a brief moment before the Set_QoS_Delegate bootstep runs. Sorry that's not much help.

Diggsey Nov 19, 2021

Eh, I think this solution is not going to work for me at all 😢 - it looks like it uses a completely different Channel implementation for rabbitmq, that doesn't have a qos attribute at all.

emarbo · 2021-11-30T21:18:59Z

emarbo
Nov 30, 2021

After a bit of research, I have discovered that the Consumer Prefetch is a RabbitMQ feature. Both librabbitmq and py-amqp (Kombu transports) set the prefetch limit through the RabbitMQ API, never calling a can_consume method. Only the non-AMQP brokers use the virtual QoS.

In my opinion, the virtual implementation must preserve the original behaviour, and therefore we can't develop a "disable prefetch" feature as discussed before. However, from an AMQP perspective, the right approach is to acknowledge a task only after it has been processed (i.e. use acks_late) so the native RabbitMQ Consumer Prefetch works as expected.

ETA/Countdown tasks

The ETA/Countdown tasks implementation complicates everything a bit. When the MainProcess reserves a task with an ETA in the future, it stores the task in memory and fetches the next one. It repeats this process until it finds a task without an ETA or one of the ETAs is reached. This means that a MainProcess can reserve an infinite number of tasks if all the tasks have an ETA in the future. In case that the prefetch limit is set, the MainProcess increases its value by one every time it needs to reserve an additional task bypassing the original value of the --prefetch-multiplier flag. With a small experiment, you can see how this value changes in the RabbitMQ web interface (channels tab).

This is a serious issue no matter if you're using the prefetch limit or not. Here is a case example:

A worker reserves thousands of tasks scheduled in the next minute (because none of them met the ETA yet). When the minute has passed, it starts processing them, but it turns out that every task will take 1 hour (they're long tasks). At this point, the system looks running slow, so we manually start new workers, but the new workers do not process any task because the first worker has all the tasks reserved.

Exploring ETA solutions

IMHO, the MainProcess must not reserve tasks that can't process. Moreover, it must never bypass the prefetch limits to avoid the issues described above. So here are my two first thoughts.

A native RabbitMQ solution is the RabbitMQ Delayed Message Plugin (github, blog 1, blog 2). It has some limitations (see Github) but delegates the ETA responsibility to the RabbitMQ bypassing the current Celery implementation. I think it could be an optional feature compatible with the current Celery API (although it would require more configuration)

A general solution for all brokers would require changing the current algorithm. Instead of storing the ETA tasks in memory, they could be requeued using the Celery 'retry' feature when the ETA isn't yet met. However, they can't be requeued using the native RabbitMQ reject/nack because the tasks will be fetched and requeued in an infinite loop (see docs). With this algorithm, a task scheduled first may be processed later, but the current algorithm has the same problem.

6 replies

emarbo Feb 8, 2022

Hi @thedrow, @xirdneh, @galCohen88 (sorry for mentioning everyone), does this make sense for you?

auvipy Feb 8, 2022
Maintainer

any proof of concept would be great to decide IMHO

emarbo Feb 14, 2022

@auvipy, I have patched the code to illustrate the concept and tested that it addresses the ETA issues described above. How should I proceed? Should I create a PR to review it? It's just the POC and likely far away to be anything complete.

auvipy Feb 15, 2022
Maintainer

you can start with draft PR :)

emarbo Feb 17, 2022

Here is the PR! #7305

Diggsey · 2021-11-30T22:46:26Z

Diggsey
Nov 30, 2021

Yeah that's in line with what I found whilst investigating the issue.

It looks like the AMQP protocol (and therefore RabbitMQ) is simply unsuitable for "at most once delivery" if you also need low latency. Which is insane given that this seems to be one of the primary usecases of RabbitMQ.

It seems to stem from the fact that AMQP is a "push" based protocol rather than a "pull" based one, but doesn't implement any kind of back-pressure system.

And if you need "at least once delivery", it's a pain to configure correctly and the tooling for managing persistent data within RabbitMQ is just inadequate.

So, it seems the ideal use-case for RabbitMQ is "less or more than once delivery"... Can't say I'll be using it again any time soon.

6 replies

Diggsey Dec 1, 2021

I don't see why rabbit is suitable for "at most once delivery". Please, correct me if I'm missing something.

I assume you meant unsuitable :) The problem with sending the ACK early is that it causes messages to be uncontrollably prefetched (ie. the broker will push you a new message as soon as you ACK the previous one). This is incompatible with low-latency, because it means messages are waiting to be processed (stuck behind another message) when workers are sitting idle elsewhere.

This is not a problem for pull-based protocols, since the caller can ack the message, and just not request a new one until it is ready. Alternatively, there could be a separate "ACK" from "READY" message (ie. backpressure). In AMQP this is not possible, therefore AMQP (and so RabbitMQ) is unsuitable for this task.

emarbo Dec 1, 2021

Ops, hehe. Yes, I have edited my original comment to avoid further confusion.

Thanks for the explanation 🙌 I see what you meant now. As far as I understand, the issue you describe can't be avoided in any push-based protocol. I suppose that AMQP performs well for short-running tasks but has some problems with long-running tasks and acking before processing.

Diggsey Dec 1, 2021

I believe Apache Pulsar uses a push based protocol, but manages to avoid the issue with flow-control built in. (The client says, "I can accept up to N messages", and then the server pushes them as needed, and ACKs are independent)

Diggsey Dec 1, 2021

So you can set N = number of free worker threads and avoid any latency problems.

emarbo Dec 1, 2021

This is definitely a more flexible algorithm for a push-based protocol. There are many ways to do the "same" thing - more than I had initially imagined.

grunzwei · 2023-07-16T07:35:30Z

grunzwei
Jul 16, 2023

We use the solution proposed by @samdoolin, but not limited to 1 concurrent task, but several concurrent tasks. still, no prefetch.
It almost works great!
but
there is prefetch like behavior when recovering from a redis broker disconnect.

in some scenarios a celery worker can spawn several processes, all of which will run for longer than the serverside redis timeout configuration.

after that happens, next time the celery worker tries to fetch tasks, it will encounter "Connection to broker lost".
it will then try to recover.
and when it does, somehow, it prefetches.

we tried configuration heartbeat to redis broker, but might have done it incorrectly as it didn't resolve the issue (on kubernetes, with memorystore redis in GCP).
we ended up increasing the redis server timeout, and that solved our problem.

FYI to anyone having the same issue

0 replies

Add a configuration option to disable prefetch completely #7106

Checklist

Mandatory Debugging Information

Optional Debugging Information

Related Issues and Possible Duplicates

Related Issues

Possible Duplicates

Environment & Settings

Steps to Reproduce

Required Dependencies

Python Packages

Other Dependencies

Minimally Reproducible Test Case

Expected Behavior

Actual Behavior

Replies: 26 comments · 17 replies

xirdneh Nov 26, 2020 Collaborator

george-miller Nov 30, 2020 Author

george-miller Nov 30, 2020 Author

thedrow Feb 24, 2021 Maintainer

thedrow Apr 11, 2021 Maintainer

galCohen88 Apr 11, 2021 Collaborator

galCohen88 Apr 11, 2021 Collaborator

thedrow Apr 12, 2021 Maintainer

auvipy Nov 17, 2021 Maintainer

auvipy Nov 19, 2021 Maintainer

auvipy Feb 8, 2022 Maintainer

auvipy Feb 15, 2022 Maintainer

Replies: 26 comments 17 replies

xirdneh
Nov 26, 2020
Collaborator

george-miller
Nov 30, 2020
Author

george-miller
Nov 30, 2020
Author

thedrow
Feb 24, 2021
Maintainer

thedrow
Apr 11, 2021
Maintainer

galCohen88
Apr 11, 2021
Collaborator

galCohen88
Apr 11, 2021
Collaborator

thedrow
Apr 12, 2021
Maintainer

auvipy
Nov 17, 2021
Maintainer

auvipy Nov 19, 2021
Maintainer

auvipy Feb 8, 2022
Maintainer

auvipy Feb 15, 2022
Maintainer