FluentD log buffer not being processed properly #271

Closed
linwalth opened this issue Jul 26, 2022 · 8 comments

linwalth commented Jul 26, 2022

Rolling out Fluentd via the osism-kolla common role results in Fluentd building up log buffers without properly draining them by sending them to Elasticsearch. Instead, buffer files keep accumulating unchecked in the fluentd-data volume.

Restarting Fluentd helps for a minute, raising the transmission rate to Elasticsearch, but then it gets stuck again.

Inside the container, the Fluentd process uses 100% of a CPU. We experimented with raising the number of threads for the process to 8, which lowers CPU usage but does not meaningfully change the transmission rate. Instead, the buffer keeps growing and creating more files.
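
For reference, a minimal sketch of where such a thread-count change would live, assuming the threads mentioned above refer to the Elasticsearch output's flush_thread_count (host, port, and values here are illustrative placeholders, not taken from the actual kolla template):

<match **>
   @type elasticsearch
   # illustrative endpoint; the real values come from the kolla/OSISM template
   host elasticsearch.example.internal
   port 9200
   <buffer>
     @type file
     path /var/lib/fluentd/data/elasticsearch.buffer/openstack.*
     # the experiment described above: raise the flush threads from 1 to 8
     flush_thread_count 8
     flush_interval 15s
   </buffer>
</match>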

A potential cause could be fluent/fluentd#3817 or uken/fluent-plugin-elasticsearch#909, but then again I would expect more people than just us to run into this problem with kolla.

@berendt added the bug label on Jul 26, 2022
linwalth (Author) commented:

Maybe related:
uken/fluent-plugin-elasticsearch#909

@berendt added the SCS Sovereign Cloud Stack label on Jul 27, 2022

matfechner commented Aug 1, 2022

@linwalth As a first step it helps to reduce debug logging (Keystone); for the long term we have to keep observing it. It is possible that the fluentd container needs to be restarted regularly.
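
For context, one way to turn off Keystone debug logging in a kolla-ansible based deployment is a service config override; this is a sketch assuming kolla-ansible's standard node_custom_config mechanism (the exact path may differ in an OSISM configuration repository), and globals.yml's openstack_logging_debug variable can achieve the same for all services:

# /etc/kolla/config/keystone.conf (merged into Keystone's config by kolla-ansible)
[DEFAULT]
debug = False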

linwalth (Author) commented:

With some help from the Monitoring SIG (specifically https://github.com/nerdicbynature) I was able to figure out a config that has now run for 3 weeks without trouble. I am posting it here in case someone runs into similar problems. This issue can be closed.

<match **>
    @type copy
    <store>
       @type elasticsearch
       host {{ elasticsearch_address }}
       port {{ elasticsearch_port }}
       scheme {{ fluentd_elasticsearch_scheme }}
{% if fluentd_elasticsearch_path != '' %}
       path {{ fluentd_elasticsearch_path }}
{% endif %}
       bulk_message_request_threshold 20M
{% if fluentd_elasticsearch_scheme == 'https' %}
       ssl_version {{ fluentd_elasticsearch_ssl_version }}
       ssl_verify {{ fluentd_elasticsearch_ssl_verify }}
{% if fluentd_elasticsearch_cacert | length > 0 %}
       ca_file {{ fluentd_elasticsearch_cacert }}
{% endif %}
{% endif %}
{% if fluentd_elasticsearch_user != '' and fluentd_elasticsearch_password != ''%}
       user {{ fluentd_elasticsearch_user }}
       password {{ fluentd_elasticsearch_password }}
{% endif %}
       logstash_format true
       logstash_prefix {{ kibana_log_prefix }}
       reconnect_on_error true
       request_timeout 15s
       suppress_type_name true
       reload_connections true
       reload_after 1000
       <buffer>
         @type file
         path /var/lib/fluentd/data/elasticsearch.buffer/openstack.*
         flush_thread_count 1
         flush_interval 15s
         retry_max_interval 2h
         retry_forever true
       </buffer>
    </store>
</match>
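
For illustration, rendering the template above with a plain-HTTP endpoint and no authentication (host, port, and prefix below are placeholders, not values from this deployment) would produce something like:

<match **>
    @type copy
    <store>
       @type elasticsearch
       host elasticsearch.example.internal
       port 9200
       scheme http
       bulk_message_request_threshold 20M
       logstash_format true
       logstash_prefix flog
       reconnect_on_error true
       request_timeout 15s
       suppress_type_name true
       reload_connections true
       reload_after 1000
       <buffer>
         @type file
         path /var/lib/fluentd/data/elasticsearch.buffer/openstack.*
         flush_thread_count 1
         flush_interval 15s
         retry_max_interval 2h
         retry_forever true
       </buffer>
    </store>
</match>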

berendt (Member) commented Sep 7, 2022

Let's try to change the upstream configuration with https://review.opendev.org/c/openstack/kolla-ansible/+/856241.

@berendt added the upstream label on Sep 7, 2022
berendt (Member) commented Sep 7, 2022

@linwalth @nerdicbynature

Could you please provide details?

Also, please document in the release notes which options are introduced and what they are meant to do.

nerdicbynature commented:

Hi,

the modification addresses multiple issues:

1. request_timeout needs to match bulk_message_request_threshold. The HTTP POST takes longer for a bigger bulk_message_request_threshold, so the timeout should be significantly higher than the usual upload time to ES. In our case 15 MB usually takes about 5 seconds, but sometimes needs 15s.

2. retry_max_interval: Fluentd uses exponential backoff. If the target ES has been configured to enforce an incoming rate limit, a series of failed HTTP uploads (maybe due to (1)) may lead to a buffer size that always exceeds the rate limit, so Fluentd never recovers from it.

3. retry_forever/reload_connections/reload_after: Fluentd sometimes silently drops the connection and gets stuck without any obvious reason. These params may help to reduce that, but they do not actually prevent Fluentd from getting stuck. Maybe it's a false assumption. Reloading connections may be a good idea, though (see the excerpt below).
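
To connect these three points back to the configuration posted earlier in this thread, here is a minimal excerpt of the relevant parameters (values as posted above; the comments are editorial annotations, not part of the original config):

<match **>
   @type elasticsearch
   # (1) a 20M bulk payload needs a request timeout well above the typical upload time
   bulk_message_request_threshold 20M
   request_timeout 15s
   # (3) re-establish and periodically reload connections to Elasticsearch
   reconnect_on_error true
   reload_connections true
   reload_after 1000
   <buffer>
     @type file
     path /var/lib/fluentd/data/elasticsearch.buffer/openstack.*
     # (2) cap the exponential backoff so retries keep coming at least every 2h
     retry_max_interval 2h
     # (3) keep retrying buffered chunks instead of discarding them
     retry_forever true
   </buffer>
</match>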

Kind regards,
André.

linwalth (Author) commented:

Is this still being pursued?

berendt (Member) commented Mar 4, 2024

We now have a new Fluentd version. I'm closing this because I think it's no longer relevant.

@berendt closed this as completed on Mar 4, 2024