Logging from single k8s node stops and Fluentd cpu -> 100%. Log events lost. #3382
Comments
If you can, could you determine the exact version which introduced the issue?
This will be difficult as I have no way to deterministically reproduce the issue. I will attempt to run v1.12.3/5.0.3 and retrieve the output of sigdump.
I'm not sure of the relation, but another CPU & memory regression is happening at GoogleCloudPlatform/fluent-plugin-google-cloud#447 (comment)
uken/fluent-plugin-elasticsearch#885 updated with sigdump logs.
Thanks. It seems to have the same cause as #3387.
From uken/fluent-plugin-elasticsearch#885 (comment)
I'm now suspecting the following Ruby issue (https://bugs.ruby-lang.org/issues/17748) and a related excon issue.
@andrew-pickin-epi Me too. We downgraded to v1.11 and it looks OK. It's a puzzling problem.
@ndj888 Please try to get a stack trace of the worker process with v1.12 if you can.
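For reference, a rough sketch of one way to capture such a stack trace from a daemonset pod: fluentd dumps thread backtraces via the sigdump gem when it receives SIGCONT. The pod name, namespace, and PID below are placeholders.

```sh
# Shell into the affected fluentd pod (names are placeholders).
kubectl exec -it -n logging fluentd-xxxxx -- sh

# Inside the container: find the worker process (the child of the supervisor).
ps -ef | grep '[f]luentd'

# Ask the worker to dump its thread backtraces via the sigdump gem.
kill -CONT <worker-pid>

# By default sigdump writes to /tmp/sigdump-<pid>.log.
cat /tmp/sigdump-*.log
```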
OMG, I have been facing this issue in production for many weeks. So, I will try rolling back to v1.11 and will let you know the result.
@mrnonz It should be sufficient to roll back to 1.12.1, because it ships with a Ruby version (< 2.7.3) that doesn't have the bug. At least if one is using fluentd via the td-agent packages. See #3389 (comment) and #3389 (comment)
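If it helps, a quick way to confirm which Ruby a given td-agent install bundles (the path assumes a standard td-agent 4.x Linux package):

```sh
# Print the bundled Ruby version; per this thread, versions before 2.7.3
# and 2.7.4 or later do not have the resolv busy-loop bug.
/opt/td-agent/bin/ruby -v
```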
Thank you @mtbtrifork. After I have confirmed with v1.11, I will try v1.12.1 and let you know.
Good news. After running with v1.11, I don't face the issue anymore...
As I mentioned at uken/fluent-plugin-elasticsearch#885 (comment), @andrew-pickin-epi's issue is caused by Ruby's resolv and triggered by excon 0.80.0 or later; the stack trace indicates it. I'm not sure whether @mrnonz's & @ndj888's issue is the same, because there is less information.
Updating fluentd to v1.12 and its plugins might also upgrade excon, which triggers resolv's issue.
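A rough way to check, before rolling an image out, which excon and resolv versions it ships (the image name is a placeholder for whatever image you actually deploy):

```sh
# List the Ruby version and the excon/resolv gem versions baked into an image.
# Replace <your-fluentd-image> with the image you deploy; excon may only be
# present if one of your plugins pulls it in.
docker run --rm --entrypoint sh <your-fluentd-image> -c \
  'ruby -v; gem list | grep -E "^(excon|resolv) "'
```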
We encountered the same issue with fluentd pods running at 100% CPU after at least 3 hours of smooth running. We did lose the logs of the node while the pod was at 100% CPU. It happened on all our different Kubernetes clusters. Failures often happen at a multiple of 15 minutes (+0 to 5 minutes). It was a memorable 48-hour nightmare. The issue was triggered when we added a new gem to the Dockerfile, which triggered a rebuild of the upper layer, which updated Ruby and its gems. We are quite surprised we are the only ones to have the issue. Maybe we didn't set up the fluentd DNS cache as we should have. We didn't manage to downgrade the Ruby version, so we downgraded the fluentd base image instead. Find here an extract of our working Dockerfile:

FROM fluent/fluentd:v1.11-1
USER root
RUN apk add --no-cache --update --virtual .build-deps
Glad you all were able to pin this down. It looks like a new version of the resolv gem has been released with what should be a fix, so I think you should be able to upgrade things successfully now.
FYI: You may need to specify resolv's path by
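One hedged way to verify which resolv actually gets loaded (run it with the same Ruby that runs fluentd, e.g. the td-agent bundled one):

```sh
# Prints the path of the resolv.rb that gets required. If it points at the
# copy bundled with Ruby rather than the newly released resolv gem, the gem's
# path may still need to be made visible to the fluentd process.
ruby -e 'require "resolv"; puts $LOADED_FEATURES.grep(/resolv\.rb/)'
```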
I was thinking about this issue and the fact that the health check was still responding when the problem occurred, leaving the pod running at 100% CPU. Would lowering the health check thread priority be a good way to solve this problem?
See the following links for more detail:
* https://bugs.ruby-lang.org/issues/17748
* fluent/fluentd#3382

The patch for Ruby is taken from the following commit, with version.h's diff removed to avoid conflicts: ruby/ruby@87d02ea

Signed-off-by: Takuro Ashie <ashie@clear-code.com>
See the following links for more detail:
* https://bugs.ruby-lang.org/issues/17748
* fluent/fluentd#3382

The patch for Ruby is taken from the following commit: ruby/ruby@9edc162

Signed-off-by: Takuro Ashie <ashie@clear-code.com>
We'll close this after we release td-agent 4.2.0 (it will ship with Ruby 2.7.4).
td-agent 4.2.0 has been released: https://www.fluentd.org/blog/td-agent-v4.2.0-has-been-released
Describe the bug
v1.12 only. The Fluentd process hits 100% CPU usage on a single node.
Log events are lost. Other nodes do not fail and continue to log to the same store.
This is a critical issue: 100% CPU results in co-located pods restarting and in the loss of log events.
We have rolled back to v1.11 on all clusters.
To Reproduce
Unknown. There are no log entries that give indication as to why this occurs.
These events occur multiple times per day on different nodes and in multiple clusters.
There is no indication of the root cause.
There are no indicative events logged by fluentd, Elasticsearch, or the wider Kubernetes environment.
We have looked very hard over many weeks and the root cause still eludes us, even at debug log level.
Expected behavior
Reload/refresh the connection to the store.
Events not lost.
Improved diagnostics.
It should be noted that calling the /api/plugins.flushBuffers endpoint often causes the buffer to be written successfully and CPU usage to return to normal (see the sketch below).
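A minimal sketch of that workaround, assuming fluentd's HTTP RPC endpoint is enabled in the configuration (the address below is the documented example value, not something enabled by default):

```sh
# Enable the RPC endpoint in fluentd.conf first:
#   <system>
#     rpc_endpoint 127.0.0.1:24444
#   </system>
# Then ask all output plugins to flush their buffers:
curl http://127.0.0.1:24444/api/plugins.flushBuffers
```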
Your Environment
AWS EKS Cluster 1.19.6
Fluentd daemonset v1.12.3
Elasticsearch plugin 5.0.3 & 4.1.4
Note this is seen in multiple clusters.
Having rolled back to v1.11 (ES 4.1.1) the issue goes away (identical configuration).
See this link for full details.
uken/fluent-plugin-elasticsearch#885
Having created a v1.12.3/v4.1.4 image and seen the same issues repeated, I no longer believe that this is a plugin issue; rather, it is a reconnect/buffer-write issue introduced with v1.12.