
Fluentd stops sending logs, overflowing the buffer or hitting queued_chunks_limit_size #909

Open
adrdimitrov opened this issue Jul 29, 2021 · 4 comments · Fixed by uselagoon/lagoon#2821

@adrdimitrov

adrdimitrov commented Jul 29, 2021

Problem

Fluentd works fine for hours and then hits one of two conditions: either the buffer total_limit_size is reached and Fluentd stops working (even though I have set overflow_action to drop_oldest_chunk), or queued_chunks_limit_size is reached and again Fluentd stops sending. I have tried many different configurations, including the defaults and a memory buffer, and in every case I run into one of these two issues. With the configuration shown below (my latest test) I reached queued_chunks_limit_size (over 8,200 files including the .meta files). The only errors I see in the logs are occasional slow_flush_threshold warnings. During the failure I do not observe excessive memory or CPU usage on either side, Fluentd or Elasticsearch. It seems like the connection is simply lost and never regained.
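For visibility into whether the queue is actually draining, one option is Fluentd's built-in monitor_agent source (a minimal sketch; the bind address and port shown are just the plugin defaults):

    <source>
      @type monitor_agent
      bind 0.0.0.0
      port 24220
    </source>

Polling http://localhost:24220/api/plugins.json then reports buffer_queue_length, buffer_total_queued_size and retry_count per output plugin, which makes it easier to tell whether chunks keep piling up or retries have stalled.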

Steps to replicate

Input config on the forwarders:

    <source>
      @type tail
      path /var/log/containers/*.log
      # exclude Fluentd logs
      exclude_path ["/var/log/containers/*fluentd*.log", "/var/log/containers/*kube-dash*.log"]
      pos_file /opt/bitnami/fluentd/logs/buffers/fluentd-docker.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        keep_time_key true
      </parse>
    </source>

Output config on the aggregator:

    <match **>
      @type copy
      <store>
        @type elasticsearch
        hosts hostname
        request_timeout 30s
        resurrect_after 5
        # avoid https://discuss.elastic.co/t/elasitcsearch-ruby-raises-cannot-get
        # -new-connection-from-pool-error/36252/6
        reload_connections false
        reconnect_on_error true
        reload_on_failure true
        logstash_format true
        logstash_prefix logs-eks-s-test-1
        logstash_dateformat %Y.%m.%d
        # @timestamp: use event time, not time of indexing
        time_key time
        include_tag_key true
        include_timestamp true
        <buffer>
          @type file
          path /opt/bitnami/fluentd/logs/buffers
          flush_interval 1s
          flush_thread_count 20
          chunk_limit_size 16m
          total_limit_size 2048m
          queued_chunks_limit_size 4096
          overflow_action drop_oldest_chunk
          retry_forever true
        </buffer>
      </store>
    </match>
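Given the symptom, note that retry_forever true keeps a failing chunk in the retry loop indefinitely; a bounded-retry buffer is one alternative to compare against (a sketch only, not a confirmed fix, and the retry values are illustrative):

    <buffer>
      @type file
      path /opt/bitnami/fluentd/logs/buffers
      flush_interval 1s
      flush_thread_count 20
      chunk_limit_size 16m
      total_limit_size 2048m
      overflow_action drop_oldest_chunk
      # bounded retries instead of retry_forever: once retry_timeout is
      # exceeded the failing chunk is discarded (or routed to <secondary>)
      # rather than being retried forever
      retry_type exponential_backoff
      retry_max_interval 30
      retry_timeout 1h
    </buffer>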

Expected Behavior or What you need to ask

Fluentd runs continuously and sends logs to Elasticsearch without interruption.

Using Fluentd and ES plugin versions

  • OS version:
[ec2-user@ip~]$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
  • Bare Metal or within Docker or Kubernetes or others?:

Kubernetes

  • Fluentd v0.12 or v0.14/v1.0:
    fluentd-0:/opt/bitnami/fluentd$ fluentd --version
    fluentd 1.13.0

  • ES plugin 3.x.y/2.x.y or 1.x.y: fluent-plugin-elasticsearch 5.0.3 (see the startup log below)

fluentd 09:25:12.45 Welcome to the Bitnami fluentd container
fluentd 09:25:12.45 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-fluentd
fluentd 09:25:12.45 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-fluentd/issues
fluentd 09:25:12.45 
fluentd 09:25:12.45 INFO  ==> ** Starting Fluentd setup **
fluentd 09:25:12.47 INFO  ==> ** Fluentd setup finished! **
fluentd 09:25:12.48 INFO  ==> ** Starting Fluentd **
2021-07-29 09:25:13 +0000 [info]: parsing config file is succeeded path="/opt/bitnami/fluentd/conf/fluentd.conf"
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-concat' version '2.5.0'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-detect-exceptions' version '0.0.13'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '5.0.3'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-grafana-loki' version '1.2.16'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-kafka' version '0.16.3'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-kubernetes_metadata_filter' version '2.7.0'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-multi-format-parser' version '1.0.0'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-prometheus' version '2.0.1'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.4.0'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-s3' version '1.6.0'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-systemd' version '1.0.5'
2021-07-29 09:25:13 +0000 [info]: gem 'fluentd' version '1.13.0'
2021-07-29 09:25:13 +0000 [info]: gem 'fluentd' version '1.12.4'
2021-07-29 09:25:14 +0000 [warn]: define <match fluent.**> to capture fluentd logs in top level is deprecated. Use <label @FLUENT_LOG> instead
2021-07-29 09:25:14 +0000 [info]: using configuration file: <ROOT>
  <match fluent.**>
    @type stdout
    @log_level "debug"
  </match>
  <source>
    @type forward
    @log_level "debug"
    bind "0.0.0.0"
    port 24224
  </source>
  <source>
    @type http
    bind "0.0.0.0"
    port 9880
  </source>
  <match fluentd.healthcheck>
    @type stdout
  </match>
  <match **>
    @type copy
    <store>
      @type "elasticsearch"
      hosts "https://search.*.es.amazonaws.com"
      request_timeout 30s
      resurrect_after 5
      reload_connections false
      reconnect_on_error true
      reload_on_failure true
      logstash_format true
      logstash_prefix "logs-eks-s-test-1"
      logstash_dateformat "%Y.%m.%d"
      time_key "time"
      include_tag_key true
      include_timestamp true
      <buffer>
        @type "file"
        path "/opt/bitnami/fluentd/logs/buffers"
        flush_interval 1s
        flush_thread_count 20
        chunk_limit_size 16m
        total_limit_size 2048m
        queued_chunks_limit_size 4096
        overflow_action drop_oldest_chunk
        retry_forever true
      </buffer>
    </store>
  </match>
</ROOT>
2021-07-29 09:25:14 +0000 [info]: starting fluentd-1.13.0 pid=1 ruby="2.6.7"
2021-07-29 09:25:14 +0000 [info]: spawn command to main:  cmdline=["/opt/bitnami/ruby/bin/ruby", "-Eascii-8bit:ascii-8bit", "/opt/bitnami/fluentd/bin/fluentd", "--config", "/opt/bitnami/fluentd/conf/fluentd.conf", "--plugin", "/opt/bitnami/fluentd/plugins", "--under-supervisor"]
2021-07-29 09:25:15 +0000 [info]: adding match pattern="fluent.**" type="stdout"
2021-07-29 09:25:15 +0000 [info]: adding match pattern="fluentd.healthcheck" type="stdout"
2021-07-29 09:25:15 +0000 [info]: adding match pattern="**" type="copy"
2021-07-29 09:25:16 +0000 [info]: #0 Detected ES 6.x: ES 7.x will only accept `_doc` in type_name.
2021-07-29 09:25:16 +0000 [info]: adding source type="forward"
2021-07-29 09:25:16 +0000 [info]: adding source type="http"
2021-07-29 09:25:16 +0000 [warn]: #0 define <match fluent.**> to capture fluentd logs in top level is deprecated. Use <label @FLUENT_LOG> instead
2021-07-29 09:25:16 +0000 [info]: #0 starting fluentd worker pid=29 ppid=1 worker=0
2021-07-29 09:25:16 +0000 [info]: #0 listening port port=24224 bind="0.0.0.0"
2021-07-29 09:25:16 +0000 [info]: #0 fluentd worker is now running worker=0
2021-07-29 09:25:16.547156525 +0000 fluent.info: {"pid":29,"ppid":1,"worker":0,"message":"starting fluentd worker pid=29 ppid=1 worker=0"}
2021-07-29 09:25:16.558448292 +0000 fluent.info: {"port":24224,"bind":"0.0.0.0","message":"listening port port=24224 bind=\"0.0.0.0\""}
2021-07-29 09:25:16.559035109 +0000 fluent.info: {"worker":0,"message":"fluentd worker is now running worker=0"}
@ashie
Contributor

ashie commented Jul 29, 2021

Is it the same cause as #885?

ruby="2.6.7"

That is a version affected by #885.

@adrdimitrov
Author

Hmm seems similar:

(two screenshots attached in the original comment)

But my pod doesn't get restarted or anything like that. Anyway, how do I update Ruby if Fluentd is deployed with the Helm chart, or am I getting something wrong?
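A minimal way to check what the pod is actually running, and to move to a different image if the chart allows it (a sketch; the bitnami/fluentd chart reference, the release name placeholder and the image.tag value are assumptions, so check the chart's values.yaml):

    # check the Ruby bundled in the running pod (path taken from the startup log above)
    kubectl exec -it <fluentd-pod> -- /opt/bitnami/ruby/bin/ruby --version

    # if the chart exposes the image tag (typical for Bitnami charts),
    # point it at a newer fluentd image that ships a fixed Ruby
    helm upgrade <release-name> bitnami/fluentd --reuse-values --set image.tag=<newer-tag>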

@matcasx

matcasx commented Mar 2, 2022

I'm encountering a similar problem on a Kubernetes AKS installation using the Fluentd daemonset.
Configuration:

  • Kubernetes AKS v1.20.7
  • Fluentd Daemonset image fluent/fluentd-kubernetes-daemonset:v1.14-debian-elasticsearch7-1
  • ES plugin v5.1.5
  • Elastic cloud managed v7.15.1

Fluentd plugin versions

2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-concat' version '2.5.0'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-dedot_filter' version '1.0.0'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-detect-exceptions' version '0.0.14'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '5.1.5'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-grok-parser' version '2.6.2'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-json-in-json-2' version '1.0.2'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-kubernetes_metadata_filter' version '2.9.4'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-multi-format-parser' version '1.0.0'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-parser-cri' version '0.1.1'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-prometheus' version '2.0.2'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-record-modifier' version '2.1.0'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.4.0'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-systemd' version '1.0.5'
2022-03-02 09:03:34 +0000 [info]: gem 'fluentd' version '1.14.5'

Output Configuration

    <match **>
      @type elasticsearch_data_stream
      @id out_es
      @log_level "info"
      include_tag_key true
      request_timeout 15s
      data_stream_name "logs-cluster-test"
      host "my-elastic-cloud.com"
      port 9243
      path ""
      scheme https
      ssl_verify true
      ssl_version TLSv1_2
      user "elastic"
      password xxxxxx
      <buffer>
        flush_thread_count 8
        flush_interval 5s
        chunk_limit_size 2M
        queue_limit_length 32
        retry_max_interval 30
        retry_forever true
      </buffer>
    </match>

After some hours this error occurs:

2022-03-02 02:18:42 +0000 [warn]: #0 [out_es] failed to flush the buffer. retry_times=5 next_retry_time=2022-03-02 02:18:58 +0000 chunk="5d932e3443b0395a4f70f649ed37c03e" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster (logs-cluster-test): connect_write timeout reached"
  2022-03-02 02:18:42 +0000 [warn]: #0 suppressed same stacktrace
2022-03-02 02:18:42 +0000 [warn]: #0 [out_es] failed to flush the buffer. retry_times=5 next_retry_time=2022-03-02 02:18:58 +0000 chunk="5d932e52a843cbf6d6be5aa934c9f0be" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster (logs-cluster-test): connect_write timeout reached"
  2022-03-02 02:18:42 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluent-plugin-elasticsearch-5.1.5/lib/fluent/plugin/out_elasticsearch_data_stream.rb:249:in 'rescue in write'
  2022-03-02 02:18:42 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluent-plugin-elasticsearch-5.1.5/lib/fluent/plugin/out_elasticsearch_data_stream.rb:243:in 'write'
  2022-03-02 02:18:42 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluentd-1.14.5/lib/fluent/plugin/output.rb:1179:in 'try_flush'
  2022-03-02 02:18:42 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluentd-1.14.5/lib/fluent/plugin/output.rb:1495:in 'flush_thread_run'
  2022-03-02 02:18:42 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluentd-1.14.5/lib/fluent/plugin/output.rb:499:in 'block (2 levels) in start'
  2022-03-02 02:18:42 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluentd-1.14.5/lib/fluent/plugin_helper/thread.rb:78:in 'block in thread_create

After this error occurs, Fluentd keeps trying to connect to ES, but no more logs are sent:

 2022-03-02 02:21:32 +0000 [warn]: #0 [out_es] failed to flush the buffer. retry_times=10 next_retry_time=2022-03-02 02:22:02 +0000 chunk="5d932e5a693fa1544cc32d138971f5f8" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster (logs-cluster-test): connect_write timeout reached"
 2022-03-02 02:21:32 +0000 [warn]: #0 suppressed same stacktrace

Eventually the buffer reaches its limit and Fluentd starts logging BufferOverflowError. Metrics don't show anything noticeable; there are no spikes in communication between Fluentd and ES before the error. After the error there is no output traffic and the buffer starts to fill up. There is no abnormal CPU usage by the pods, either before or after the error occurs.
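For what it's worth, one way to see whether the ES client is still issuing requests after the connect_write timeouts is to turn up logging on the output (a sketch against the config above; with_transporter_log is a fluent-plugin-elasticsearch debugging option and is verbose, so treat it as temporary):

    <match **>
      @type elasticsearch_data_stream
      @id out_es
      @log_level debug           # more detail from the output plugin itself
      with_transporter_log true  # log each request/response of the ES client
      # ... rest of the configuration unchanged ...
    </match>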

@AlKapkone

Hello, we have the same issue.
Fluentd version: v1.14.6-alpine-4.
The buffer just gets stuck a few minutes after the Fluentd pod starts.
