
Fluentd buffer gets overflow or queue limit size gets filled #7089

Closed
adrdimitrov opened this issue Jul 29, 2021 · 21 comments

@adrdimitrov

Hello team,

I am testing Fluentd for our logging purposes and I am facing an issue with my buffer configuration (I guess). The setup is as follows:

Deployment: I am deploying Fluentd on our Kubernetes cluster, which consists of 4 nodes, one of them generating almost 90% of the logs. Everything is done using Terraform and the bitnami/fluentd Helm chart. Fluentd runs in the kube-system namespace and sends logs to AWS Elasticsearch.

Input config on the forwarders:

<source>
  @type tail
  path /var/log/containers/*.log
  # exclude Fluentd logs
  exclude_path ["/var/log/containers/*fluentd*.log", "/var/log/containers/*kube-dash*.log"]
  pos_file /opt/bitnami/fluentd/logs/buffers/fluentd-docker.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
    keep_time_key true
  </parse>
</source>

Output config on the aggregator:

<match **>
  @type copy
  <store>
    @type elasticsearch
    hosts hostname
    request_timeout 30s
    resurrect_after 5
    # avoid https://discuss.elastic.co/t/elasitcsearch-ruby-raises-cannot-get
    # -new-connection-from-pool-error/36252/6
    reload_connections false
    reconnect_on_error true
    reload_on_failure true
    logstash_format true
    logstash_prefix logs-eks-s-test-1
    logstash_dateformat %Y.%m.%d
    # @timestamp: use event time, not time of indexing
    time_key time
    include_tag_key true
    include_timestamp true
    <buffer>
      @type file
      path /opt/bitnami/fluentd/logs/buffers
      flush_interval 1s
      flush_thread_count 20
      chunk_limit_size 16m
      total_limit_size 2048m
      queued_chunks_limit_size 4096
      overflow_action drop_oldest_chunk
      retry_forever true
    </buffer>
  </store>
</match>

Issue: Fluentd works fine for hours and then hits one of two failure modes: either the buffer reaches total_limit_size and Fluentd stops working (even though I have set overflow_action to drop_oldest_chunk), or queued_chunks_limit_size is reached and Fluentd again stops sending. I have tried many different configurations, including the defaults and a memory buffer, and in every case I hit one of these two issues. With the configuration above (my latest test) I hit queued_chunks_limit_size (over 8,200 files, counting the .meta files). The only errors I see in the logs are occasional slow_flush_threshold warnings. During the failure I do not observe excessive memory or CPU usage on either side, Fluentd or Elasticsearch. It seems like the connection is simply lost and never regained.

[screenshot: elasticsearch]

Restarting the pod brings Fluentd back to a normal working state, but that way I am losing logs, and it is not sustainable to restart it manually every time it stops sending.

@juan131
Contributor

juan131 commented Jul 30, 2021

Hi @adrdimitrov

Thanks so much for the detailed information!!

We mainly provide support in this repository to solve problems related to the Bitnami charts or the containers. For information regarding the application itself or customization of the content within the application, we highly recommend checking forums and user guides made available by the project behind the application.

In your case, I recommend posting your question in the forum below; there you'll find more experienced users who can help you customize the configuration:

@adrdimitrov
Author

Hello @juan131,

Thanks, I asked about this in the Fluentd channel and in a few other places and got this:

uken/fluent-plugin-elasticsearch#909

The issue seems to be caused by the Ruby version and has been fixed upstream, but it looks like the Helm chart is still deploying an old Ruby version that is affected. Is it possible to update the Ruby version for this chart? It is not very pleasant to deploy via Helm charts and then have to upgrade versions by hand, and doing this for multiple Kubernetes clusters would not scale.
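
For reference, a quick way to check which Ruby and excon versions a running pod actually ships is something like the commands below (the pod name is a placeholder; this assumes ruby and gem are on the image's PATH, as they are in the Bitnami image):

$ kubectl exec -n kube-system FLUENTD_POD_NAME -- ruby --version
$ kubectl exec -n kube-system FLUENTD_POD_NAME -- gem list excon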

@adrdimitrov
Author

Hey @juan131,

Have you managed to check the above?

@juan131
Contributor

juan131 commented Aug 2, 2021

Hi @adrdimitrov

I am using the latest image available (tag 1.13.3-debian-10-r0) in the latest chart version (4.1.3), see:

I checked the version of excon as suggested by @ashie:

$ docker run --rm -it bitnami/fluentd:1.13.3-debian-10-r0 -- bash
$ ruby --version
ruby 2.6.8p205 (2021-07-07 revision 67951) [x86_64-linux]
$ gem list excon

*** LOCAL GEMS ***

excon (0.85.0)

It's supposed to be fine and it shouldn't be affected by the issue unless I'm missing something. What version of the container and chart are you using?

@adrdimitrov
Author

Hey @juan131,

I am currently using helm_chart_version_fluentd = "3.7.5", so I will update and report back.
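
Concretely, since the chart version is driven through Terraform here, updating is just a matter of bumping that variable, e.g.:

helm_chart_version_fluentd = "4.1.3"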

Thanks very much for your prompt response.

@juan131
Contributor

juan131 commented Aug 2, 2021

Thanks! Please keep us updated on your findings with the latest chart version.

@adrdimitrov
Author

Hello @juan131

I redeployed my Fluentd with Helm chart version 4.1.3 and the result is the same:

[screenshots]

I saw that even with the 4.1.3 Helm chart the Ruby version is still the old one:
[screenshots]

Please note that I did not update it in place; I removed Fluentd and redeployed it using Terraform.

Am I doing something wrong?

@juan131
Contributor

juan131 commented Aug 6, 2021

Hi @adrdimitrov

We're including the latest Ruby version available in the 2.6.x branch when building the image, see:

The current chart points to image version 1.13.3-debian-10-r0, which uses Ruby 2.6.8. Could you please describe the Fluentd pod (running kubectl describe pod POD_NAME) and let us know the specific image tag you're using? It shouldn't list ruby="2.6.7" but ruby="2.6.8" instead.
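
For example (the kube-system namespace is taken from the deployment described above; Fluentd also prints its Ruby version in its startup log line):

$ kubectl describe pod POD_NAME -n kube-system | grep Image:
$ kubectl logs POD_NAME -n kube-system | grep 'ruby='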

By the way, I thought the problem was related to the Ruby gem excon. Could you please confirm that Ruby 2.7.x is required? If that's the case we can release a new version of the images bundling that branch.

@adrdimitrov
Author

Hey @juan131,

I just saw your answer; it turned out that I was still using the old Debian image. I had not noticed that the image is set in the values file. I changed it, and I am now running Fluentd with Ruby 2.6.8.
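
The override in my values.yaml was along these lines (following the chart's image block; the exact structure may vary between chart versions):

image:
  registry: docker.io
  repository: bitnami/fluentd
  tag: 1.13.3-debian-10-r0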

I cannot confirm whether 2.7.x is required or whether 2.6.8 fixes the issue, but I am now running 2.6.8; I will leave it like this and report back on whether it fixes the issue.

Thanks a lot for your time!

@juan131
Contributor

juan131 commented Aug 6, 2021

Thanks so much! Please keep us updated about your insights.

@adrdimitrov
Author

Hello @juan131,

I managed to deploy the latest Helm chart version and, as said above, changed the values file to use 1.13.3-debian-10-r0. I haven't faced the issue since then (5 days), so I guess I can now confirm that this issue is fixed with Ruby 2.6.8.

[screenshot]

Thanks a lot for your time and efforts! It's appreciated.

@juan131
Contributor

juan131 commented Aug 12, 2021

👏 !!! That's great!!! I'm very glad the problem was fixed using the latest image!

Thanks so much for sharing your insights @adrdimitrov. I'll keep the issue open for a few more days just in case you face it again.

@adrdimitrov
Author

Hey @juan131,

Some bad news: although I haven't stopped receiving logs (or at least I don't see gaps), the issue with 100% CPU utilization of the pod is still there, and the pod frequently gets restarted, being killed with SIGKILL:

[screenshot]

I will monitor this closely and try to catch the errors and the exact behaviour. Maybe upgrading to Ruby 2.7.x is a good idea.
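
(For anyone watching the same thing: CPU usage and restart counts can be followed with something like the commands below; kubectl top requires metrics-server to be available in the cluster.)

$ kubectl top pod FLUENTD_POD_NAME -n kube-system
$ kubectl get pod FLUENTD_POD_NAME -n kube-system -w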

@adrdimitrov
Author

Hello again,

As mentioned yesterday, I am still facing the CPU issue, but I left Fluentd running to see how it would deal with it. Unfortunately, last night it stopped in a scenario similar to the one before: it suddenly filled queued_chunks_limit_size and stopped sending logs to ES.

There are 8k files like this:
[screenshot]

The total size is 40 MB.
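
(With the file buffer, each chunk is stored as a buffer file plus a .meta file, so ~8k files corresponds to roughly 4k chunks, which lines up with queued_chunks_limit_size 4096 rather than with total_limit_size. The counts above can be reproduced inside the pod with something like the following; the path comes from the buffer config earlier in this issue.)

$ kubectl exec -n kube-system FLUENTD_POD_NAME -- sh -c 'ls /opt/bitnami/fluentd/logs/buffers | wc -l'
$ kubectl exec -n kube-system FLUENTD_POD_NAME -- du -sh /opt/bitnami/fluentd/logs/buffers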

As seen below, the moment we hit 100% CPU we lost the logs:

[screenshots]

I don't see errors in the logs; it just dies without any notification.

@juan131
Contributor

juan131 commented Aug 17, 2021

Hi @adrdimitrov

I can build an exact copy of the current container, replacing Ruby with the 2.7.x version, and share it with you. That way, you can install the chart with the replacement image and see whether the problem persists with this version.

What do you think? Does it make sense?

@adrdimitrov
Author

Hey @juan131

Sorry for the late response; I was on vacation.

Yes, it makes sense and would be great! Meanwhile my colleagues and I are monitoring this, and so far we haven't seen the issue again. I am not sure whether it happens under specific circumstances or is completely random.

Will keep you posted.

@juan131
Contributor

juan131 commented Aug 23, 2021

Thanks so much @adrdimitrov

By the way, I built and published an image based on Ruby 2.7.x for you to try. You can download it from my Docker Hub account; the image is juanariza131/fluentd:development. To test it in your chart, install it using the values.yaml below:

image:
  registry: docker.io
  repository: juanariza131/fluentd
  tag: development
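
For example, something along these lines (the release name and namespace are just examples):

$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm upgrade --install fluentd bitnami/fluentd -n kube-system -f values.yaml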

@adrdimitrov
Author

Hey @juan131,

Quick update: I managed to deploy this custom image and it has been running for a week now without issues. I will continue monitoring it for another week and keep you posted.

@adrdimitrov
Author

Hey @juan131,

Two weeks now with the image based on Ruby 2.7.x and I have faced no issues. I believe this is solved.

@juan131
Contributor

juan131 commented Sep 7, 2021

That's great news!!! I'll make the required changes in our system to release a new bitnami/fluentd image based on Ruby 2.7.x.

@juan131
Contributor

juan131 commented Sep 7, 2021

@adrdimitrov a new revision of the bitnami/fluentd image based on Ruby 2.7.x was released: 1.14.0-debian-10-r9. See:

Please give it a try when you have a chance! I'll proceed to close the issue as "solved", but please feel free to reopen it if you require further assistance.

@juan131 juan131 closed this as completed Sep 7, 2021