FluentD log buffer not being processed properly #271

Closed
linwalth opened this issue Jul 26, 2022 · 8 comments

linwalth commented Jul 26, 2022

Rolling out Fluentd via the osism-kolla common role results in Fluentd building up log buffers without properly draining them by sending them to Elasticsearch. Instead, buffer files keep accumulating unchecked in the fluentd-data volume.

Restarting Fluentd helps for a minute, raising the transmission rate to Elasticsearch, but then it gets stuck again.

Inside the container, the Fluentd process uses 100% of a CPU. We experimented with raising the number of threads for the process to 8, which lowers CPU usage but does not meaningfully change the transmission rate. Instead, the buffer keeps growing and creating more files.
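
For reference, a minimal sketch of where such a thread-count change would live, assuming the threads mentioned above refer to the Elasticsearch output's flush_thread_count (host, port, and values here are illustrative placeholders, not taken from the actual kolla template):

<match **>
   @type elasticsearch
   # illustrative endpoint; the real values come from the kolla/OSISM template
   host elasticsearch.example.internal
   port 9200
   <buffer>
     @type file
     path /var/lib/fluentd/data/elasticsearch.buffer/openstack.*
     # the experiment described above: raise the flush threads from 1 to 8
     flush_thread_count 8
     flush_interval 15s
   </buffer>
</match>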

A potential cause could be fluent/fluentd#3817 or uken/fluent-plugin-elasticsearch#909, but then again I would expect more people than just us to run into this problem with kolla.

@berendt added the bug label on Jul 26, 2022
linwalth (Author) commented:

Maybe related:
uken/fluent-plugin-elasticsearch#909

@berendt added the SCS Sovereign Cloud Stack label on Jul 27, 2022

matfechner commented Aug 1, 2022

@linwalth As a first step it helps to reduce debug logging (Keystone); for the long term we have to keep observing it. It is possible that the fluentd container needs to be restarted regularly.
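
For context, one way to turn off Keystone debug logging in a kolla-ansible based deployment is a service config override; this is a sketch assuming kolla-ansible's standard node_custom_config mechanism (the exact path may differ in an OSISM configuration repository), and globals.yml's openstack_logging_debug variable can achieve the same for all services:

# /etc/kolla/config/keystone.conf (merged into Keystone's config by kolla-ansible)
[DEFAULT]
debug = False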

linwalth (Author) commented:

With some help from the Monitoring SIG (specifically https://github.com/nerdicbynature) I was able to figure out a config that has now run for 3 weeks without trouble. I am posting it here in case someone runs into similar problems. This issue can be closed.

<match **>
    @type copy
    <store>
       @type elasticsearch
       host {{ elasticsearch_address }}
       port {{ elasticsearch_port }}
       scheme {{ fluentd_elasticsearch_scheme }}
{% if fluentd_elasticsearch_path != '' %}
       path {{ fluentd_elasticsearch_path }}
{% endif %}
       bulk_message_request_threshold 20M
{% if fluentd_elasticsearch_scheme == 'https' %}
       ssl_version {{ fluentd_elasticsearch_ssl_version }}
       ssl_verify {{ fluentd_elasticsearch_ssl_verify }}
{% if fluentd_elasticsearch_cacert | length > 0 %}
       ca_file {{ fluentd_elasticsearch_cacert }}
{% endif %}
{% endif %}
{% if fluentd_elasticsearch_user != '' and fluentd_elasticsearch_password != ''%}
       user {{ fluentd_elasticsearch_user }}
       password {{ fluentd_elasticsearch_password }}
{% endif %}
       logstash_format true
       logstash_prefix {{ kibana_log_prefix }}
       reconnect_on_error true
       request_timeout 15s
       suppress_type_name true
       reload_connections true
       reload_after 1000
       <buffer>
         @type file
         path /var/lib/fluentd/data/elasticsearch.buffer/openstack.*
         flush_thread_count 1
         flush_interval 15s
         retry_max_interval 2h
         retry_forever true
       </buffer>
    </store>
</match>
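
For illustration, rendering the template above with a plain-HTTP endpoint and no authentication (host, port, and prefix below are placeholders, not values from this deployment) would produce something like:

<match **>
    @type copy
    <store>
       @type elasticsearch
       host elasticsearch.example.internal
       port 9200
       scheme http
       bulk_message_request_threshold 20M
       logstash_format true
       logstash_prefix flog
       reconnect_on_error true
       request_timeout 15s
       suppress_type_name true
       reload_connections true
       reload_after 1000
       <buffer>
         @type file
         path /var/lib/fluentd/data/elasticsearch.buffer/openstack.*
         flush_thread_count 1
         flush_interval 15s
         retry_max_interval 2h
         retry_forever true
       </buffer>
    </store>
</match>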

berendt (Member) commented Sep 7, 2022

Let's try to change the upstream configuration with https://review.opendev.org/c/openstack/kolla-ansible/+/856241.

@berendt added the upstream label on Sep 7, 2022
berendt (Member) commented Sep 7, 2022

@linwalth @nerdicbynature

Could you please provide details?

Also, please document in the release notes which options are introduced and what they are meant to do.

nerdicbynature commented:

Hi,

the modification addresses multiple issues:

1. request_timeout needs to match bulk_message_request_threshold. The HTTP POST takes longer for a bigger bulk_message_request_threshold, so the timeout should be significantly higher than the usual upload time to ES. In our case 15 MB usually takes about 5 seconds, but sometimes needs 15s.

2. retry_max_interval: Fluentd uses exponential backoff. If the target ES has been configured to enforce an incoming rate limit, a series of failed HTTP uploads (maybe due to (1)) may lead to a buffer size that always exceeds the rate limit, so Fluentd never recovers from it.

3. retry_forever/reload_connections/reload_after: Fluentd sometimes silently drops the connection and gets stuck without any obvious reason. These params may help to reduce that, but they do not actually prevent Fluentd from getting stuck. Maybe it's a false assumption. Reloading connections may be a good idea, though (see the excerpt below).
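
To connect these three points back to the configuration posted earlier in this thread, here is a minimal excerpt of the relevant parameters (values as posted above; the comments are editorial annotations, not part of the original config):

<match **>
   @type elasticsearch
   # (1) a 20M bulk payload needs a request timeout well above the typical upload time
   bulk_message_request_threshold 20M
   request_timeout 15s
   # (3) re-establish and periodically reload connections to Elasticsearch
   reconnect_on_error true
   reload_connections true
   reload_after 1000
   <buffer>
     @type file
     path /var/lib/fluentd/data/elasticsearch.buffer/openstack.*
     # (2) cap the exponential backoff so retries keep coming at least every 2h
     retry_max_interval 2h
     # (3) keep retrying buffered chunks instead of discarding them
     retry_forever true
   </buffer>
</match>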

Kind regards,
André.

linwalth (Author) commented:

Is this still being pursued?

berendt (Member) commented Mar 4, 2024

We now have a new Fluentd version. I'm closing this because I think it's no longer relevant.

@berendt closed this as completed on Mar 4, 2024