Logging from single k8s node stops and Fluentd cpu -> 100%. Log events lost. #3382

Closed
andrew-pickin-epi opened this issue May 21, 2021 · 21 comments

Comments

@andrew-pickin-epi

Describe the bug
v1.12 only. The Fluentd process hits 100% CPU usage on a single node.
Log events are lost. Other nodes do not fail and continue to log to the same store.
This is a critical issue: 100% CPU results in co-located pods restarting and in log events being lost.
We have rolled back to v1.11 on all clusters.

To Reproduce
Unknown. There are no log entries that give any indication as to why this occurs.
These events occur multiple times per day, on different nodes and in multiple clusters.
There is no indication of the root cause.
There are no indicative events logged by fluentd, Elasticsearch or the wider Kubernetes environment.
We have looked very hard over many weeks and the root cause still evades us, even with log level debug.

Expected behavior
Reload/refresh the connection to the store.
Events should not be lost.
Improved diagnostics.

It should be noted that calling the /api/plugins.flushBuffers endpoint often causes the buffer to be written successfully and CPU usage to return to normal.
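
For context, /api/plugins.flushBuffers is part of Fluentd's built-in HTTP RPC, which must be enabled via rpc_endpoint in the <system> section. A minimal sketch of how we expose and call it (the bind address and port are our own choice, not fixed values):

  <system>
    rpc_endpoint 127.0.0.1:24444
  </system>

  # from inside the pod: ask every output plugin to flush its buffers
  $ curl http://127.0.0.1:24444/api/plugins.flushBuffers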

Your Environment

AWS EKS Cluster 1.19.6
Fluentd daemonset v1.12.3
Elasticsearch plugin 5.0.3 & 4.1.4

Note this is seen in multiple clusters.

Having rolled back to v1.11 (Elasticsearch plugin 4.1.1), the issue goes away with identical configuration.

See this link for full details.

uken/fluent-plugin-elasticsearch#885

Having created a v1.12.3/v4.1.4 image and seen the same issues repeated, I no longer believe that this is a plugin issue. Rather, this is a reconnect/buffer-write issue introduced with v1.12.

@ashie
Member

ashie commented May 21, 2021

If you can, could you determine the exact version that introduces the issue?
Does v1.12.0 cause it while v1.11.5 doesn't?

@andrew-pickin-epi
Author

If you can, could you determine the exact version that introduces the issue?
Does v1.12.0 cause it while v1.11.5 doesn't?

This will be difficult as I have no way to deterministically reproduce the issue.

I will attempt to run v1.12.3/5.0.3 and retrieve the output of sigdump.

@ashie
Member

ashie commented May 24, 2021

I'm not sure about the relation, but another CPU & memory regression has been reported at GoogleCloudPlatform/fluent-plugin-google-cloud#447 (comment)

@andrew-pickin-epi
Author

uken/fluent-plugin-elasticsearch#885 updated with sigdump logs.

@ashie
Member

ashie commented May 27, 2021

uken/fluent-plugin-elasticsearch#885 updated with sigdump logs.

Thanks. It seems to have the same cause as #3387.

@ashie
Member

ashie commented May 27, 2021

From uken/fluent-plugin-elasticsearch#885 (comment)

  Thread #<Thread:0x00007f09fb6813f0@flush_thread_3@/fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.12.3/lib/fluent/plugin_helper/thread.rb:70 run> status=run priority=0
      /usr/local/lib/ruby/2.6.0/securerandom.rb:107:in `gen_random_urandom'
      /usr/local/lib/ruby/2.6.0/securerandom.rb:56:in `bytes'
      /usr/local/lib/ruby/2.6.0/resolv.rb:601:in `random_number'
      /usr/local/lib/ruby/2.6.0/resolv.rb:601:in `random'
      /usr/local/lib/ruby/2.6.0/resolv.rb:620:in `block in allocate_request_id'
      /usr/local/lib/ruby/2.6.0/resolv.rb:617:in `synchronize'
      /usr/local/lib/ruby/2.6.0/resolv.rb:617:in `allocate_request_id'
      /usr/local/lib/ruby/2.6.0/resolv.rb:837:in `sender'
      /usr/local/lib/ruby/2.6.0/resolv.rb:522:in `block in fetch_resource'
      /usr/local/lib/ruby/2.6.0/resolv.rb:1120:in `block (3 levels) in resolv'
      /usr/local/lib/ruby/2.6.0/resolv.rb:1118:in `each'
      /usr/local/lib/ruby/2.6.0/resolv.rb:1118:in `block (2 levels) in resolv'
      /usr/local/lib/ruby/2.6.0/resolv.rb:1117:in `each'
      /usr/local/lib/ruby/2.6.0/resolv.rb:1117:in `block in resolv'
      /usr/local/lib/ruby/2.6.0/resolv.rb:1115:in `each'
      /usr/local/lib/ruby/2.6.0/resolv.rb:1115:in `resolv'
      /usr/local/lib/ruby/2.6.0/resolv.rb:517:in `fetch_resource'
      /usr/local/lib/ruby/2.6.0/resolv.rb:507:in `each_resource'
      /usr/local/lib/ruby/2.6.0/resolv.rb:404:in `each_address'
      /usr/local/lib/ruby/2.6.0/resolv.rb:116:in `block in each_address'
      /usr/local/lib/ruby/2.6.0/resolv.rb:115:in `each'
      /usr/local/lib/ruby/2.6.0/resolv.rb:115:in `each_address'
      /usr/local/lib/ruby/2.6.0/resolv.rb:58:in `each_address'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/excon-0.81.0/lib/excon/socket.rb:110:in `connect'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/excon-0.81.0/lib/excon/socket.rb:49:in `initialize'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/excon-0.81.0/lib/excon/connection.rb:463:in `new'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/excon-0.81.0/lib/excon/connection.rb:463:in `socket'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/excon-0.81.0/lib/excon/connection.rb:118:in `request_call'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/excon-0.81.0/lib/excon/middlewares/mock.rb:57:in `request_call'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/excon-0.81.0/lib/excon/middlewares/instrumentor.rb:34:in `request_call'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/excon-0.81.0/lib/excon/middlewares/idempotent.rb:19:in `request_call'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/excon-0.81.0/lib/excon/middlewares/base.rb:22:in `request_call'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/excon-0.81.0/lib/excon/middlewares/base.rb:22:in `request_call'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/excon-0.81.0/lib/excon/connection.rb:273:in `request'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/faraday-excon-1.1.0/lib/faraday/adapter/excon.rb:31:in `block in call'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/faraday-1.4.2/lib/faraday/adapter.rb:53:in `connection'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/faraday-excon-1.1.0/lib/faraday/adapter/excon.rb:31:in `call'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/faraday-1.4.2/lib/faraday/rack_builder.rb:154:in `build_response'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/faraday-1.4.2/lib/faraday/connection.rb:494:in `run_request'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/elasticsearch-transport-7.12.0/lib/elasticsearch/transport/transport/http/faraday.rb:48:in `block in perform_request'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/elasticsearch-transport-7.12.0/lib/elasticsearch/transport/transport/base.rb:288:in `perform_request'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/elasticsearch-transport-7.12.0/lib/elasticsearch/transport/transport/http/faraday.rb:37:in `perform_request'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/elasticsearch-transport-7.12.0/lib/elasticsearch/transport/client.rb:191:in `perform_request'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/elasticsearch-api-7.12.0/lib/elasticsearch/api/namespace/common.rb:38:in `perform_request'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/elasticsearch-api-7.12.0/lib/elasticsearch/api/namespace/common.rb:38:in `perform_request'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/elasticsearch-xpack-7.12.0/lib/elasticsearch/xpack/api/actions/index_lifecycle_management/get_policy.rb:38:in `get_policy'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/elasticsearch_index_lifecycle_management.rb:51:in `ilm_policy_exists?'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/elasticsearch_index_lifecycle_management.rb:30:in `create_ilm_policy'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/elasticsearch_index_lifecycle_management.rb:7:in `setup_ilm'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/elasticsearch_index_template.rb:188:in `create_rollover_alias'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/out_elasticsearch.rb:1034:in `block in template_installation_actual'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/elasticsearch_index_template.rb:40:in `retry_operate'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/out_elasticsearch.rb:1025:in `template_installation_actual'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/out_elasticsearch.rb:1049:in `send_bulk'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/out_elasticsearch.rb:875:in `block in write'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/out_elasticsearch.rb:874:in `each'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/out_elasticsearch.rb:874:in `write'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.12.3/lib/fluent/plugin/output.rb:1138:in `try_flush'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.12.3/lib/fluent/plugin/output.rb:1450:in `flush_thread_run'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.12.3/lib/fluent/plugin/output.rb:462:in `block (2 levels) in start'
      /fluentd/vendor/bundle/ruby/2.6.0/gems/fluentd-1.12.3/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'

@ashie
Member

ashie commented May 27, 2021

I'm now suspecting the following Ruby issue:

And a related excon issue:

@ndj888

ndj888 commented Jun 1, 2021

@andrew-pickin-epi Me too. We downgraded to v1.11 and it looks OK. It's a puzzling issue.

@ashie
Member

ashie commented Jun 1, 2021

@ndj888 Please try to get a stack trace of the worker process with v1.12 if you can.

$ kill -CONT <pid>
$ cat /tmp/sigdump-<pid>.log

@mrnonz

mrnonz commented Jun 2, 2021

OMG, I have been facing this issue in production for many weeks.

So, I will try rolling back to v1.11 and will let you know the result.

@mtbtrifork

@mrnonz It should be sufficient to roll back to 1.12.1, because it ships with a Ruby version (< 2.7.3) that doesn't have the bug, at least if one is using fluentd via the td-agent packages. See #3389 (comment) and #3389 (comment)
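
For anyone pinning the DaemonSet image as suggested above, a minimal sketch of the rollback (the namespace, DaemonSet name, container name and image tag are placeholders for illustration, not values taken from this thread):

  # pin the fluentd container back to a 1.12.1-based image
  $ kubectl -n logging set image daemonset/fluentd \
      fluentd=fluent/fluentd-kubernetes-daemonset:v1.12.1-debian-elasticsearch7-1.1
  # wait for the rollout to complete
  $ kubectl -n logging rollout status daemonset/fluentd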

@mrnonz

mrnonz commented Jun 2, 2021

@mrnonz It should be sufficient to roll back to 1.12.1, because it ships with a Ruby version (< 2.7.3) that doesn't have the bug, at least if one is using fluentd via the td-agent packages. See #3389 (comment) and #3389 (comment)

Thank you @mtbtrifork. After I have confirmed with v1.11, I will try v1.12.1 and let you know.

@mrnonz

mrnonz commented Jun 3, 2021

Good news. After running with v1.11, I no longer face the issue.

@ndj888

ndj888 commented Jun 4, 2021


We fell back to v1.11, and we can see from Grafana that it is very stable. Our logs grow by 20 GB per day.

@ashie
Member

ashie commented Jun 4, 2021

As I mentioned at uken/fluent-plugin-elasticsearch#885 (comment), @andrew-pickin-epi's issue is caused by Ruby's resolv and is triggered by excon 0.80.0 or later; the stack trace indicates it.
#3387 is the same issue, and @mtbtrifork's issue is probably the same too.

I'm not sure whether @mrnonz's and @ndj888's issues are the same, because there is less information.
Could you tell me the versions of the following modules?

  • Ruby
  • resolv (if you use the gem instead of Ruby's bundled one)
  • excon

Updating fluentd to v1.12 together with its plugins may also upgrade excon, which triggers resolv's issue.
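
A quick way to gather those versions from inside a running fluentd container (assuming the stock fluentd image, where ruby and fluent-gem are on the PATH; td-agent users would run td-agent-gem instead):

  $ ruby -v
  $ fluent-gem list | grep -E 'excon|resolv|elasticsearch'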

@LouisJULIEN

We encountered the same issue, with fluentd pods running at 100% CPU after at least 3 hours of smooth running. We lost the logs of the node whenever the pod reached 100% CPU. It happened on all of our different Kubernetes clusters. Failures often happened at multiples of 15 minutes (+0 to 5 minutes). It was a memorable 48-hour nightmare.

The issue was triggered when we added a new gem to the Dockerfile, which forced the upper layers to rebuild, which in turn updated Ruby and its gems.

We are quite surprised we are the only ones to have the issue. Maybe we didn't set up fluentd's DNS cache as we should have.

We didn't manage to downgrade the Ruby version, so we downgraded the excon gem to 0.79.0. That solved the issue.
(We couldn't downgrade Ruby because of ruby-dev, which is required by fluent-plugin-kubernetes_metadata_filter:2.4.1.)

Find here an extract of our working Dockerfile:

FROM fluent/fluentd:v1.11-1

USER root

RUN apk add --no-cache --update --virtual .build-deps \
      build-base ruby-dev \
    && apk add curl \
    && gem install \
      # https://bugs.ruby-lang.org/issues/17781
      excon:0.79.0 \
      fluent-plugin-kubernetes_metadata_filter:2.4.1 \
    && gem sources --clear-all

@geemus

geemus commented Jun 11, 2021

Glad you all were able to pin this down. It looks like a new version of the resolv gem has been released with what should be a fix, so I think you should be able to upgrade things successfully now.

@ashie
Member

ashie commented Jun 14, 2021

FYI: You may need to specify resolv's path via RUBYLIB to force-replace Ruby's bundled resolv with the resolv 0.2.1 gem:
fluent/fluentd-docker-image#275
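
A minimal sketch of that workaround in a custom image (the base image tag and the gem path are assumptions; the real path can be found with gem contents resolv):

  FROM fluent/fluentd:v1.12-1
  USER root
  # install the resolv gem carrying the fix
  RUN gem install resolv -v 0.2.1
  # make Ruby load the gem's resolv instead of the bundled stdlib copy;
  # adjust the path to wherever the gem lands in your image
  ENV RUBYLIB=/usr/lib/ruby/gems/2.7.0/gems/resolv-0.2.1/lib
  USER fluent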

@LouisJULIEN

I was thinking about this issue and the fact that the health check was still responding when the problem occurred, leaving the pod running at 100% CPU. Would lowering the health check thread priority be a good way to solve this problem?

ashie added a commit to fluent/fluent-package-builder that referenced this issue Jun 23, 2021
See following links for more detail:
* https://bugs.ruby-lang.org/issues/17748
* fluent/fluentd#3382

The patch for Ruby is taken from the following commit, with version.h's diff removed to avoid conflicts:
ruby/ruby@87d02ea

Signed-off-by: Takuro Ashie <ashie@clear-code.com>
ashie added a commit to fluent/fluent-package-builder that referenced this issue Jun 24, 2021
See following links for more detail:
* https://bugs.ruby-lang.org/issues/17748
* fluent/fluentd#3382

The patch for Ruby is taken from the following commit.
ruby/ruby@9edc162

Signed-off-by: Takuro Ashie <ashie@clear-code.com>
@ashie ashie mentioned this issue Jul 27, 2021
@ashie
Member

ashie commented Jul 27, 2021

We'll close this after we release td-agent 4.2.0 (it will ship with Ruby 2.7.4).

@ashie
Member

ashie commented Aug 13, 2021

td-agent 4.2.0 has been released: https://www.fluentd.org/blog/td-agent-v4.2.0-has-been-released
Sorry for the delay.
