
Fluentd buffer gets overflow or queue limit size gets filled #7089

Closed
adrdimitrov opened this issue Jul 29, 2021 · 21 comments

@adrdimitrov

Hello team,

I am testing Fluentd for our logging purposes and I am facing an issue with my buffer configuration (I guess). The setup is as follows:

Deployment: I am deploying Fluentd on our Kubernetes cluster, which consists of 4 nodes, one of them generating almost 90% of the logs. Everything is done using Terraform and the bitnami/fluentd Helm chart. Fluentd runs in the kube-system namespace and sends logs to AWS Elasticsearch.

Input config on the forwarders:

<source>
  @type tail
  path /var/log/containers/*.log
  # exclude Fluentd logs
  exclude_path ["/var/log/containers/*fluentd*.log", "/var/log/containers/*kube-dash*.log"]
  pos_file /opt/bitnami/fluentd/logs/buffers/fluentd-docker.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
    keep_time_key true
  </parse>
</source>

Output config on the aggregator:

<match **>
  @type copy
  <store>
    @type elasticsearch
    hosts hostname
    request_timeout 30s
    resurrect_after 5
    # avoid https://discuss.elastic.co/t/elasitcsearch-ruby-raises-cannot-get
    # -new-connection-from-pool-error/36252/6
    reload_connections false
    reconnect_on_error true
    reload_on_failure true
    logstash_format true
    logstash_prefix logs-eks-s-test-1
    logstash_dateformat %Y.%m.%d
    # @timestamp: use event time, not time of indexing
    time_key time
    include_tag_key true
    include_timestamp true
    <buffer>
      @type file
      path /opt/bitnami/fluentd/logs/buffers
      flush_interval 1s
      flush_thread_count 20
      chunk_limit_size 16m
      total_limit_size 2048m
      queued_chunks_limit_size 4096
      overflow_action drop_oldest_chunk
      retry_forever true
    </buffer>
  </store>
</match>

Issue: Fluentd works fine for hours and then hits one of two failure modes: either the buffer reaches total_limit_size and Fluentd stops working (even though I have set overflow_action to drop_oldest_chunk), or queued_chunks_limit_size is reached and Fluentd again stops sending. I have tried many different configurations, including the defaults and a memory buffer, and in every case I hit one of these two issues. With the configuration above (my latest test) I hit queued_chunks_limit_size (over 8,200 files, counting the .meta files). The only errors I see in the logs are occasional slow_flush_threshold warnings. During the failure I do not observe excessive memory or CPU usage on either side, Fluentd or Elasticsearch. It seems like the connection is simply lost and never regained.

[screenshot: elasticsearch]

Restarting the pod brings Fluentd back to a normal working state, but that way I am losing logs, and it is not sustainable to restart it manually every time it stops sending.

@juan131
Contributor

juan131 commented Jul 30, 2021

Hi @adrdimitrov

Thanks so much for the detailed information!!

We mainly provide support in this repository to solve problems related to the Bitnami charts or the containers. For information regarding the application itself or customization of the content within the application, we highly recommend checking forums and user guides made available by the project behind the application.

In your case, I recommend posting your question in the forum below; there you'll find more experienced users who can help you customize the configuration:

@adrdimitrov
Author

Hello @juan131,

Thanks, I asked about this in the Fluentd channel and in a few other places and got this:

uken/fluent-plugin-elasticsearch#909

The issue seems to be caused by the Ruby version and has been fixed upstream, but it looks like the Helm chart is still deploying an old Ruby version that is affected. Is it possible to update the Ruby version for this chart? It is not very pleasant to deploy via Helm charts and then have to upgrade versions by hand, and doing this for multiple Kubernetes clusters would not scale.
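
For reference, a quick way to check which Ruby and excon versions a running pod actually ships is something like the commands below (the pod name is a placeholder; this assumes ruby and gem are on the image's PATH, as they are in the Bitnami image):

$ kubectl exec -n kube-system FLUENTD_POD_NAME -- ruby --version
$ kubectl exec -n kube-system FLUENTD_POD_NAME -- gem list excon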

@adrdimitrov
Author

Hey @juan131,

Have you managed to check the above?

@juan131
Contributor

juan131 commented Aug 2, 2021

Hi @adrdimitrov

I am using the latest image available (tag 1.13.3-debian-10-r0) in the latest chart version (4.1.3), see:

I checked the version of excon as suggested by @ashie:

$ docker run --rm -it bitnami/fluentd:1.13.3-debian-10-r0 -- bash
$ ruby --version
ruby 2.6.8p205 (2021-07-07 revision 67951) [x86_64-linux]
$ gem list excon

*** LOCAL GEMS ***

excon (0.85.0)

It's supposed to be fine and it shouldn't be affected by the issue unless I'm missing something. What version of the container and chart are you using?

@adrdimitrov
Author

Hey @juan131,

I am currently using helm_chart_version_fluentd = "3.7.5", so I will update and report back.
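
Concretely, since the chart version is driven through Terraform here, updating is just a matter of bumping that variable, e.g.:

helm_chart_version_fluentd = "4.1.3"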

Thanks very much for your prompt response.

@juan131
Contributor

juan131 commented Aug 2, 2021

Thanks! Please keep us updated on your findings with the latest chart version.

@adrdimitrov
Author

Hello @juan131

I redeployed my Fluentd with Helm chart version 4.1.3 and the result is the same:

[screenshots]

I saw that even with the 4.1.3 Helm chart the Ruby version is still the old one:
[screenshots]

Please note that I did not update it in place; I removed Fluentd and redeployed it using Terraform.

Am I doing something wrong?

@juan131
Contributor

juan131 commented Aug 6, 2021

Hi @adrdimitrov

We're including the latest Ruby version available in the 2.6.x branch when building the image, see:

The current chart points to image version 1.13.3-debian-10-r0, which uses Ruby 2.6.8. Could you please describe the Fluentd pod (running kubectl describe pod POD_NAME) and let us know the specific image tag you're using? It shouldn't list ruby="2.6.7" but ruby="2.6.8" instead.
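
For example (the kube-system namespace is taken from the deployment described above; Fluentd also prints its Ruby version in its startup log line):

$ kubectl describe pod POD_NAME -n kube-system | grep Image:
$ kubectl logs POD_NAME -n kube-system | grep 'ruby='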

By the way, I thought the problem was related to the Ruby gem excon. Could you please confirm that Ruby 2.7.x is required? If that's the case we can release a new version of the images bundling that branch.

@adrdimitrov
Author

Hey @juan131,

I just saw your answer; it turned out that I was still using the old Debian image. I had not noticed that the image is set in the values file. I changed it, and I am now running Fluentd with Ruby 2.6.8.
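
The override in my values.yaml was along these lines (following the chart's image block; the exact structure may vary between chart versions):

image:
  registry: docker.io
  repository: bitnami/fluentd
  tag: 1.13.3-debian-10-r0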

I cannot confirm whether 2.7.x is required or whether 2.6.8 fixes the issue, but I am now running 2.6.8; I will leave it like this and report back on whether it fixes the issue.

Thanks a lot for your time!

@juan131
Contributor

juan131 commented Aug 6, 2021

Thanks so much! Please keep us updated about your insights.

@adrdimitrov
Author

Hello @juan131,

I managed to deploy the latest Helm chart version and, as said above, changed the values file to use 1.13.3-debian-10-r0. I haven't faced the issue since then (5 days), so I guess I can now confirm that this issue is fixed with Ruby 2.6.8.

[screenshot]

Thanks a lot for your time and efforts! It's appreciated.

@juan131
Contributor

juan131 commented Aug 12, 2021

👏 !!! That's great!!! I'm very glad the problem was fixed using the latest image!

Thanks so much for sharing your insights @adrdimitrov. I'll keep the issue open for a few more days just in case you face it again.

@adrdimitrov
Author

Hey @juan131,

Some bad news: although I haven't stopped receiving logs (or at least I don't see gaps), the issue with 100% CPU utilization of the pod is still there, and the pod frequently gets restarted, being killed with SIGKILL:

[screenshot]

I will monitor this closely and try to catch the errors and the exact behaviour. Maybe upgrading to Ruby 2.7.x is a good idea.
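
(For anyone watching the same thing: CPU usage and restart counts can be followed with something like the commands below; kubectl top requires metrics-server to be available in the cluster.)

$ kubectl top pod FLUENTD_POD_NAME -n kube-system
$ kubectl get pod FLUENTD_POD_NAME -n kube-system -w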

@adrdimitrov
Author

Hello again,

As mentioned yesterday, I am still facing the CPU issue, but I left Fluentd running to see how it would deal with it. Unfortunately, last night it stopped in a scenario similar to the one before: it suddenly filled queued_chunks_limit_size and stopped sending logs to ES.

There are 8k files like this:
[screenshot]

The total size is 40 MB.
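
(With the file buffer, each chunk is stored as a buffer file plus a .meta file, so ~8k files corresponds to roughly 4k chunks, which lines up with queued_chunks_limit_size 4096 rather than with total_limit_size. The counts above can be reproduced inside the pod with something like the following; the path comes from the buffer config earlier in this issue.)

$ kubectl exec -n kube-system FLUENTD_POD_NAME -- sh -c 'ls /opt/bitnami/fluentd/logs/buffers | wc -l'
$ kubectl exec -n kube-system FLUENTD_POD_NAME -- du -sh /opt/bitnami/fluentd/logs/buffers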

As seen below, the moment we hit 100% CPU we lost the logs:

[screenshots]

I don't see errors in the logs; it just dies without any notification.

@juan131
Contributor

juan131 commented Aug 17, 2021

Hi @adrdimitrov

I can build an exact copy of the current container, replacing Ruby with the 2.7.x version, and share it with you. That way, you can install the chart with the replacement image and see whether the problem persists with this version.

What do you think? Does it make sense?

@adrdimitrov
Author

Hey @juan131

Sorry for the late response; I was on vacation.

Yes, it makes sense and would be great! Meanwhile my colleagues and I are monitoring this, and so far we haven't seen the issue again. I am not sure whether it happens under specific circumstances or is completely random.

Will keep you posted.

@juan131
Contributor

juan131 commented Aug 23, 2021

Thanks so much @adrdimitrov

By the way, I built and published an image based on Ruby 2.7.x for you to try. You can download it from my Docker Hub account; the image is juanariza131/fluentd:development. To test it in your chart, install it using the values.yaml below:

image:
  registry: docker.io
  repository: juanariza131/fluentd
  tag: development
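
For example, something along these lines (the release name and namespace are just examples):

$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm upgrade --install fluentd bitnami/fluentd -n kube-system -f values.yaml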

@adrdimitrov
Author

Hey @juan131,

Quick update: I managed to deploy this custom image and it has been running for a week now without issues. I will continue monitoring it for another week and keep you posted.

@adrdimitrov
Author

Hey @juan131,

Two weeks now with the image based on Ruby 2.7.x and I have faced no issues. I believe this is solved.

@juan131
Contributor

juan131 commented Sep 7, 2021

That's great news!!! I'll make the required changes in our system to release a new bitnami/fluentd image based on Ruby 2.7.x.

@juan131
Contributor

juan131 commented Sep 7, 2021

@adrdimitrov a new revision of the bitnami/fluentd image based on Ruby 2.7.x was released: 1.14.0-debian-10-r9. See:

Please give it a try when you have a chance! I'll proceed to close the issue as "solved", but please feel free to reopen it if you require further assistance.

@juan131 juan131 closed this as completed Sep 7, 2021