
Fluentd stops sending logs, overflowing the buffer or hitting queued_chunks_limit_size #909

Open
adrdimitrov opened this issue Jul 29, 2021 · 4 comments · Fixed by uselagoon/lagoon#2821

@adrdimitrov

adrdimitrov commented Jul 29, 2021

Problem

Fluentd works fine for hours and then hits one of two conditions: either the buffer total_limit_size is reached and Fluentd stops working (even though I have set overflow_action to drop_oldest_chunk), or queued_chunks_limit_size is reached and again Fluentd stops sending. I have tried many different configurations, including the defaults and a memory buffer, and in every case I run into one of these two issues. With the configuration shown below (my latest test) I reached queued_chunks_limit_size (over 8,200 files including the .meta files). The only errors I see in the logs are occasional slow_flush_threshold warnings. During the failure I do not observe excessive memory or CPU usage on either side, Fluentd or Elasticsearch. It seems like the connection is simply lost and never regained.
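For visibility into whether the queue is actually draining, one option is Fluentd's built-in monitor_agent source (a minimal sketch; the bind address and port shown are just the plugin defaults):

    <source>
      @type monitor_agent
      bind 0.0.0.0
      port 24220
    </source>

Polling http://localhost:24220/api/plugins.json then reports buffer_queue_length, buffer_total_queued_size and retry_count per output plugin, which makes it easier to tell whether chunks keep piling up or retries have stalled.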

Steps to replicate

Input config on the forwarders:

    <source>
      @type tail
      path /var/log/containers/*.log
      # exclude Fluentd logs
      exclude_path ["/var/log/containers/*fluentd*.log", "/var/log/containers/*kube-dash*.log"]
      pos_file /opt/bitnami/fluentd/logs/buffers/fluentd-docker.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        keep_time_key true
      </parse>
    </source>

Output config on the aggregator:

    <match **>
      @type copy
      <store>
        @type elasticsearch
        hosts hostname
        request_timeout 30s
        resurrect_after 5
        # avoid https://discuss.elastic.co/t/elasitcsearch-ruby-raises-cannot-get
        # -new-connection-from-pool-error/36252/6
        reload_connections false
        reconnect_on_error true
        reload_on_failure true
        logstash_format true
        logstash_prefix logs-eks-s-test-1
        logstash_dateformat %Y.%m.%d
        # @timestamp: use event time, not time of indexing
        time_key time
        include_tag_key true
        include_timestamp true
        <buffer>
          @type file
          path /opt/bitnami/fluentd/logs/buffers
          flush_interval 1s
          flush_thread_count 20
          chunk_limit_size 16m
          total_limit_size 2048m
          queued_chunks_limit_size 4096
          overflow_action drop_oldest_chunk
          retry_forever true
        </buffer>
      </store>
    </match>
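Given the symptom, note that retry_forever true keeps a failing chunk in the retry loop indefinitely; a bounded-retry buffer is one alternative to compare against (a sketch only, not a confirmed fix, and the retry values are illustrative):

    <buffer>
      @type file
      path /opt/bitnami/fluentd/logs/buffers
      flush_interval 1s
      flush_thread_count 20
      chunk_limit_size 16m
      total_limit_size 2048m
      overflow_action drop_oldest_chunk
      # bounded retries instead of retry_forever: once retry_timeout is
      # exceeded the failing chunk is discarded (or routed to <secondary>)
      # rather than being retried forever
      retry_type exponential_backoff
      retry_max_interval 30
      retry_timeout 1h
    </buffer>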

Expected Behavior or What you need to ask

Fluentd runs continuously and sends logs to Elasticsearch without interruption.

Using Fluentd and ES plugin versions

  • OS version:
[ec2-user@ip~]$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
  • Bare Metal or within Docker or Kubernetes or others?:

Kubernetes

  • Fluentd v0.12 or v0.14/v1.0:
    fluentd-0:/opt/bitnami/fluentd$ fluentd --version
    fluentd 1.13.0

  • ES plugin 3.x.y/2.x.y or 1.x.y: fluent-plugin-elasticsearch 5.0.3 (see the startup log below)

fluentd 09:25:12.45 Welcome to the Bitnami fluentd container
fluentd 09:25:12.45 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-fluentd
fluentd 09:25:12.45 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-fluentd/issues
fluentd 09:25:12.45 
fluentd 09:25:12.45 INFO  ==> ** Starting Fluentd setup **
fluentd 09:25:12.47 INFO  ==> ** Fluentd setup finished! **
fluentd 09:25:12.48 INFO  ==> ** Starting Fluentd **
2021-07-29 09:25:13 +0000 [info]: parsing config file is succeeded path="/opt/bitnami/fluentd/conf/fluentd.conf"
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-concat' version '2.5.0'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-detect-exceptions' version '0.0.13'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '5.0.3'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-grafana-loki' version '1.2.16'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-kafka' version '0.16.3'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-kubernetes_metadata_filter' version '2.7.0'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-multi-format-parser' version '1.0.0'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-prometheus' version '2.0.1'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.4.0'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-s3' version '1.6.0'
2021-07-29 09:25:13 +0000 [info]: gem 'fluent-plugin-systemd' version '1.0.5'
2021-07-29 09:25:13 +0000 [info]: gem 'fluentd' version '1.13.0'
2021-07-29 09:25:13 +0000 [info]: gem 'fluentd' version '1.12.4'
2021-07-29 09:25:14 +0000 [warn]: define <match fluent.**> to capture fluentd logs in top level is deprecated. Use <label @FLUENT_LOG> instead
2021-07-29 09:25:14 +0000 [info]: using configuration file: <ROOT>
  <match fluent.**>
    @type stdout
    @log_level "debug"
  </match>
  <source>
    @type forward
    @log_level "debug"
    bind "0.0.0.0"
    port 24224
  </source>
  <source>
    @type http
    bind "0.0.0.0"
    port 9880
  </source>
  <match fluentd.healthcheck>
    @type stdout
  </match>
  <match **>
    @type copy
    <store>
      @type "elasticsearch"
      hosts "https://search.*.es.amazonaws.com"
      request_timeout 30s
      resurrect_after 5
      reload_connections false
      reconnect_on_error true
      reload_on_failure true
      logstash_format true
      logstash_prefix "logs-eks-s-test-1"
      logstash_dateformat "%Y.%m.%d"
      time_key "time"
      include_tag_key true
      include_timestamp true
      <buffer>
        @type "file"
        path "/opt/bitnami/fluentd/logs/buffers"
        flush_interval 1s
        flush_thread_count 20
        chunk_limit_size 16m
        total_limit_size 2048m
        queued_chunks_limit_size 4096
        overflow_action drop_oldest_chunk
        retry_forever true
      </buffer>
    </store>
  </match>
</ROOT>
2021-07-29 09:25:14 +0000 [info]: starting fluentd-1.13.0 pid=1 ruby="2.6.7"
2021-07-29 09:25:14 +0000 [info]: spawn command to main:  cmdline=["/opt/bitnami/ruby/bin/ruby", "-Eascii-8bit:ascii-8bit", "/opt/bitnami/fluentd/bin/fluentd", "--config", "/opt/bitnami/fluentd/conf/fluentd.conf", "--plugin", "/opt/bitnami/fluentd/plugins", "--under-supervisor"]
2021-07-29 09:25:15 +0000 [info]: adding match pattern="fluent.**" type="stdout"
2021-07-29 09:25:15 +0000 [info]: adding match pattern="fluentd.healthcheck" type="stdout"
2021-07-29 09:25:15 +0000 [info]: adding match pattern="**" type="copy"
2021-07-29 09:25:16 +0000 [info]: #0 Detected ES 6.x: ES 7.x will only accept `_doc` in type_name.
2021-07-29 09:25:16 +0000 [info]: adding source type="forward"
2021-07-29 09:25:16 +0000 [info]: adding source type="http"
2021-07-29 09:25:16 +0000 [warn]: #0 define <match fluent.**> to capture fluentd logs in top level is deprecated. Use <label @FLUENT_LOG> instead
2021-07-29 09:25:16 +0000 [info]: #0 starting fluentd worker pid=29 ppid=1 worker=0
2021-07-29 09:25:16 +0000 [info]: #0 listening port port=24224 bind="0.0.0.0"
2021-07-29 09:25:16 +0000 [info]: #0 fluentd worker is now running worker=0
2021-07-29 09:25:16.547156525 +0000 fluent.info: {"pid":29,"ppid":1,"worker":0,"message":"starting fluentd worker pid=29 ppid=1 worker=0"}
2021-07-29 09:25:16.558448292 +0000 fluent.info: {"port":24224,"bind":"0.0.0.0","message":"listening port port=24224 bind=\"0.0.0.0\""}
2021-07-29 09:25:16.559035109 +0000 fluent.info: {"worker":0,"message":"fluentd worker is now running worker=0"}
@ashie
Contributor

ashie commented Jul 29, 2021

Is it the same cause as #885?

ruby="2.6.7"

That is a version affected by #885.

@adrdimitrov
Author

Hmm seems similar:

(two screenshots attached in the original comment)

But my pod doesn't get restarted or anything like that. Anyway, how do I update Ruby if Fluentd is deployed with the Helm chart, or am I getting something wrong?
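A minimal way to check what the pod is actually running, and to move to a different image if the chart allows it (a sketch; the bitnami/fluentd chart reference, the release name placeholder and the image.tag value are assumptions, so check the chart's values.yaml):

    # check the Ruby bundled in the running pod (path taken from the startup log above)
    kubectl exec -it <fluentd-pod> -- /opt/bitnami/ruby/bin/ruby --version

    # if the chart exposes the image tag (typical for Bitnami charts),
    # point it at a newer fluentd image that ships a fixed Ruby
    helm upgrade <release-name> bitnami/fluentd --reuse-values --set image.tag=<newer-tag>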

@matcasx

matcasx commented Mar 2, 2022

I'm encountering a similar problem on a Kubernetes AKS installation using the Fluentd daemonset.
Configuration:

  • Kubernetes AKS v1.20.7
  • Fluentd Daemonset image fluent/fluentd-kubernetes-daemonset:v1.14-debian-elasticsearch7-1
  • ES plugin v5.1.5
  • Elastic cloud managed v7.15.1

Fluentd plugin versions

2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-concat' version '2.5.0'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-dedot_filter' version '1.0.0'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-detect-exceptions' version '0.0.14'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '5.1.5'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-grok-parser' version '2.6.2'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-json-in-json-2' version '1.0.2'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-kubernetes_metadata_filter' version '2.9.4'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-multi-format-parser' version '1.0.0'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-parser-cri' version '0.1.1'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-prometheus' version '2.0.2'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-record-modifier' version '2.1.0'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.4.0'
2022-03-02 09:03:34 +0000 [info]: gem 'fluent-plugin-systemd' version '1.0.5'
2022-03-02 09:03:34 +0000 [info]: gem 'fluentd' version '1.14.5'

Output Configuration

    <match **>
      @type elasticsearch_data_stream
      @id out_es
      @log_level "info"
      include_tag_key true
      request_timeout 15s
      data_stream_name "logs-cluster-test"
      host "my-elastic-cloud.com"
      port 9243
      path ""
      scheme https
      ssl_verify true
      ssl_version TLSv1_2
      user "elastic"
      password xxxxxx
      <buffer>
        flush_thread_count 8
        flush_interval 5s
        chunk_limit_size 2M
        queue_limit_length 32
        retry_max_interval 30
        retry_forever true
      </buffer>
    </match>

After some hours this error occurs:

2022-03-02 02:18:42 +0000 [warn]: #0 [out_es] failed to flush the buffer. retry_times=5 next_retry_time=2022-03-02 02:18:58 +0000 chunk="5d932e3443b0395a4f70f649ed37c03e" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster (logs-cluster-test): connect_write timeout reached"
  2022-03-02 02:18:42 +0000 [warn]: #0 suppressed same stacktrace
2022-03-02 02:18:42 +0000 [warn]: #0 [out_es] failed to flush the buffer. retry_times=5 next_retry_time=2022-03-02 02:18:58 +0000 chunk="5d932e52a843cbf6d6be5aa934c9f0be" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster (logs-cluster-test): connect_write timeout reached"
  2022-03-02 02:18:42 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluent-plugin-elasticsearch-5.1.5/lib/fluent/plugin/out_elasticsearch_data_stream.rb:249:in 'rescue in write'
  2022-03-02 02:18:42 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluent-plugin-elasticsearch-5.1.5/lib/fluent/plugin/out_elasticsearch_data_stream.rb:243:in 'write'
  2022-03-02 02:18:42 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluentd-1.14.5/lib/fluent/plugin/output.rb:1179:in 'try_flush'
  2022-03-02 02:18:42 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluentd-1.14.5/lib/fluent/plugin/output.rb:1495:in 'flush_thread_run'
  2022-03-02 02:18:42 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluentd-1.14.5/lib/fluent/plugin/output.rb:499:in 'block (2 levels) in start'
  2022-03-02 02:18:42 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.7.0/gems/fluentd-1.14.5/lib/fluent/plugin_helper/thread.rb:78:in 'block in thread_create

After this error occurs, Fluentd keeps trying to connect to ES, but no more logs are sent:

 2022-03-02 02:21:32 +0000 [warn]: #0 [out_es] failed to flush the buffer. retry_times=10 next_retry_time=2022-03-02 02:22:02 +0000 chunk="5d932e5a693fa1544cc32d138971f5f8" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster (logs-cluster-test): connect_write timeout reached"
 2022-03-02 02:21:32 +0000 [warn]: #0 suppressed same stacktrace

Eventually the buffer reaches its limit and Fluentd starts logging BufferOverflowError. Metrics don't show anything noticeable; there are no spikes in communication between Fluentd and ES before the error. After the error there is no output traffic and the buffer starts to fill up. There is no abnormal CPU usage by the pods, either before or after the error occurs.
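For what it's worth, one way to see whether the ES client is still issuing requests after the connect_write timeouts is to turn up logging on the output (a sketch against the config above; with_transporter_log is a fluent-plugin-elasticsearch debugging option and is verbose, so treat it as temporary):

    <match **>
      @type elasticsearch_data_stream
      @id out_es
      @log_level debug           # more detail from the output plugin itself
      with_transporter_log true  # log each request/response of the ES client
      # ... rest of the configuration unchanged ...
    </match>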

@AlKapkone

Hello, we have the same issue.
Fluentd version: v1.14.6-alpine-4.
The buffer just gets stuck a few minutes after the Fluentd pod starts.
