Logging from single k8s node stops and Fluentd cpu -> 100%. Log events lost. #3382
Comments
If you can, could you determine the exact version which introduced the issue?
This will be difficult as I have no way to deterministically reproduce the issue. I will attempt to run v1.12.3/5.0.3 and retrieve the output of sigdump.
I'm not sure of the relation, but another CPU & memory regression is happening at GoogleCloudPlatform/fluent-plugin-google-cloud#447 (comment)
uken/fluent-plugin-elasticsearch#885 updated with sigdump logs.
Thanks. It seems to have the same cause as #3387.
From uken/fluent-plugin-elasticsearch#885 (comment)
I'm now suspecting the following Ruby issue (https://bugs.ruby-lang.org/issues/17748) and a related excon issue.
@andrew-pickin-epi Me too. We downgraded to v1.11 and it looks OK. It's a puzzling problem.
@ndj888 Please try to get a stack trace of the worker process with v1.12 if you can.
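For reference, a rough sketch of one way to capture such a stack trace from a daemonset pod: fluentd dumps thread backtraces via the sigdump gem when it receives SIGCONT. The pod name, namespace, and PID below are placeholders.

```sh
# Shell into the affected fluentd pod (names are placeholders).
kubectl exec -it -n logging fluentd-xxxxx -- sh

# Inside the container: find the worker process (the child of the supervisor).
ps -ef | grep '[f]luentd'

# Ask the worker to dump its thread backtraces via the sigdump gem.
kill -CONT <worker-pid>

# By default sigdump writes to /tmp/sigdump-<pid>.log.
cat /tmp/sigdump-*.log
```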
OMG, I have been facing this issue in production for many weeks. So, I will try rolling back to v1.11 and will let you know the result.
@mrnonz It should be sufficient to roll back to 1.12.1, because it ships with a Ruby version (< 2.7.3) that doesn't have the bug. At least if one is using fluentd via the td-agent packages. See #3389 (comment) and #3389 (comment)
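If it helps, a quick way to confirm which Ruby a given td-agent install bundles (the path assumes a standard td-agent 4.x Linux package):

```sh
# Print the bundled Ruby version; per this thread, versions before 2.7.3
# and 2.7.4 or later do not have the resolv busy-loop bug.
/opt/td-agent/bin/ruby -v
```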
Thank you @mtbtrifork. After I have confirmed with v1.11, I will try v1.12.1 and let you know.
Good news. After running with v1.11, I don't face the issue anymore...
As I mentioned at uken/fluent-plugin-elasticsearch#885 (comment), @andrew-pickin-epi's issue is caused by Ruby's resolv and triggered by excon 0.80.0 or later; the stack trace indicates it. I'm not sure whether @mrnonz's & @ndj888's issue is the same, because there is less information.
Updating fluentd to v1.12 and its plugins might also upgrade excon, which triggers resolv's issue.
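A rough way to check, before rolling an image out, which excon and resolv versions it ships (the image name is a placeholder for whatever image you actually deploy):

```sh
# List the Ruby version and the excon/resolv gem versions baked into an image.
# Replace <your-fluentd-image> with the image you deploy; excon may only be
# present if one of your plugins pulls it in.
docker run --rm --entrypoint sh <your-fluentd-image> -c \
  'ruby -v; gem list | grep -E "^(excon|resolv) "'
```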
We encountered the same issue with fluentd pods running at 100% CPU after at least 3 hours of smooth running. We did lose the logs of the node while the pod was at 100% CPU. It happened on all our different Kubernetes clusters. Failures often happen at a multiple of 15 minutes (+0 to 5 minutes). It was a memorable 48-hour nightmare. The issue was triggered when we added a new gem to the Dockerfile, which triggered a rebuild of the upper layer, which updated Ruby and its gems. We are quite surprised we are the only ones to have the issue. Maybe we didn't set up the fluentd DNS cache as we should have. We didn't manage to downgrade the Ruby version, so we downgraded the fluentd base image instead. Find here an extract of our working Dockerfile:

FROM fluent/fluentd:v1.11-1
USER root
RUN apk add --no-cache --update --virtual .build-deps
Glad you all were able to pin this down. It looks like a new version of the resolv gem has been released with what should be a fix, so I think you should be able to upgrade things successfully now.
FYI: You may need to specify resolv's path by
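One hedged way to verify which resolv actually gets loaded (run it with the same Ruby that runs fluentd, e.g. the td-agent bundled one):

```sh
# Prints the path of the resolv.rb that gets required. If it points at the
# copy bundled with Ruby rather than the newly released resolv gem, the gem's
# path may still need to be made visible to the fluentd process.
ruby -e 'require "resolv"; puts $LOADED_FEATURES.grep(/resolv\.rb/)'
```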
I was thinking about this issue and the fact that the health check was still responding when the problem occurred, leaving the pod running at 100% CPU. Would lowering the health check thread priority be a good way to solve this problem?
See the following links for more detail:
* https://bugs.ruby-lang.org/issues/17748
* fluent/fluentd#3382

The patch for Ruby is taken from the following commit, with version.h's diff removed to avoid conflicts: ruby/ruby@87d02ea

Signed-off-by: Takuro Ashie <ashie@clear-code.com>
See the following links for more detail:
* https://bugs.ruby-lang.org/issues/17748
* fluent/fluentd#3382

The patch for Ruby is taken from the following commit: ruby/ruby@9edc162

Signed-off-by: Takuro Ashie <ashie@clear-code.com>
We'll close this after we release td-agent 4.2.0 (it will ship with Ruby 2.7.4).
td-agent 4.2.0 has been released: https://www.fluentd.org/blog/td-agent-v4.2.0-has-been-released
Describe the bug
v1.12 only. The Fluentd process hits 100% CPU usage on a single node.
Log events are lost. Other nodes do not fail and continue to log to the same store.
This is a critical issue: 100% CPU results in co-located pods restarting and in the loss of log events.
We have rolled back to v1.11 on all clusters.
To Reproduce
Unknown. There are no log entries that give indication as to why this occurs.
These events occur multiple times per day on different nodes and in multiple clusters.
There is no indication of the root cause.
There are no indicative events logged by fluentd, Elasticsearch, or the wider Kubernetes environment.
We have looked very hard over many weeks and the root cause still eludes us, even at debug log level.
Expected behavior
Reload/refresh the connection to the store.
Events not lost.
Improved diagnostics.
It should be noted that calling the /api/plugins.flushBuffers endpoint often causes the buffer to be written successfully and CPU usage to return to normal (see the sketch below).
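A minimal sketch of that workaround, assuming fluentd's HTTP RPC endpoint is enabled in the configuration (the address below is the documented example value, not something enabled by default):

```sh
# Enable the RPC endpoint in fluentd.conf first:
#   <system>
#     rpc_endpoint 127.0.0.1:24444
#   </system>
# Then ask all output plugins to flush their buffers:
curl http://127.0.0.1:24444/api/plugins.flushBuffers
```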
Your Environment
AWS EKS Cluster 1.19.6
Fluentd daemonset v1.12.3
Elasticsearch plugin 5.0.3 & 4.1.4
Note this is seen in multiple clusters.
Having rolled back to v1.11 (ES 4.1.1) the issue goes away (identical configuration).
See this link for full details.
uken/fluent-plugin-elasticsearch#885
Having created a v1.12.3/v4.1.4 image and seen the same issues repeated, I no longer believe that this is a plugin issue; rather, it is a reconnect/buffer-write issue introduced with v1.12.