
CPU and memory usage regression between Fluentd 1.11.2 and 1.12.3 #3389

Closed
qingling128 opened this issue May 24, 2021 · 19 comments


qingling128 commented May 24, 2021

Describe the bug
When upgrading Fluentd from 1.11.2 to the latest 1.12.3, our soak tests detected that memory and CPU usage increased by a non-trivial percentage, to the point that Fluentd could no longer keep up with the 1000 log entries/second throughput used as the soak test baseline.

The only difference between the two setups is the Fluentd gem version, upgraded with:

gem uninstall fluentd -ax --force
gem install fluentd -v 1.12.3

[attached: CPU and memory usage charts from the soak tests]

Expected behavior
CPU and memory usage is stable and log entries are not dropped.

Your Environment
Fluentd runs inside a Debian 9 based container in GKE with a fixed throughput of 1000 log entries per second per VM. (There are 3 VMs in total, hence 3000 log entries per second overall.)

Your Configuration
https://github.com/Stackdriver/kubernetes-configs/blob/1d0b24b650d7d044899c3e958faeda62acbae9c6/logging-agent.yaml#L131

@qingling128 (Author)

I'm setting up more experiments for the versions in between these two to narrow down the root cause.


ashie commented May 25, 2021

I'm now suspecting that #3391 might be a cause of this issue, though there is no evidence yet.


ashie commented May 25, 2021

Or #3382 and #3387 may be related to this.

@mtbtrifork

At the company where I'm currently employed, we're using fluentd on 50+ servers via the td-agent .deb packages available from http://packages.treasuredata.com/4/ubuntu/bionic/. Almost all of the servers run .deb package version 4.1.0-1, i.e. fluentd 1.12.1, and so far we have only experienced the 100% CPU issue on servers that we happened to upgrade to 4.1.1-1 / 1.12.3. My point is that the issue can perhaps be narrowed down to somewhere between versions 1.12.1 and 1.12.3.

When the 100% CPU issue is present, a sigdump (kill -CONT) always seems to show a running thread busy with the randomisation code in Ruby's resolv.rb, which I have also seen mentioned in #3387 (comment). I find it difficult to believe, however, that Ruby's resolv.rb code is causing the issue, as the td-agent .deb package for 1.12.1 seems to ship with the exact same Ruby version (2.7.0, and thereby the exact same resolv.rb) as the package for 1.12.3. And, as mentioned, we believe we don't see the issue with 1.12.1.
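For anyone who wants to gather the same data: fluentd bundles the sigdump gem, so sending SIGCONT to the busy worker process writes thread backtraces to a file under /tmp. A rough sketch (process names, paths and the PID are install-specific placeholders):

$ pgrep -af td-agent                      # find the worker PID ("fluentd" for non-td-agent installs)
$ kill -CONT <worker_pid>                 # sigdump traps SIGCONT and writes the dump
$ less /tmp/sigdump-<worker_pid>.log      # default dump location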

Sadly, we have not found a way to reliably reproduce the issue, and I don't have anything else to add at this stage.


ashie commented May 27, 2021

@mtbtrifork Thanks for your report! It's very informative.
I'm now suspecting Ruby or excon.
As I mentioned at #3382 (comment), there are similar reports.

BTW, we should discuss this cause at #3382, since it appears to be the same cause as yours, whereas this issue (#3389) hasn't yet been confirmed to share that cause.

@mtbtrifork

@ashie I'm sorry about the confusion.

Let me correct myself: The two versions of td-agent don't ship with the same Ruby version:

$ apt install td-agent=4.1.1-1
$ td-agent --version
td-agent 4.1.1 fluentd 1.12.3 (b6e76c09d60ee866de6470f2bdd2551e7c8591d3)
$ /opt/td-agent/bin/ruby --version
ruby 2.7.3p183 (2021-04-05 revision 6847ee089d) [x86_64-linux]

$ apt install td-agent=4.1.0-1
$ td-agent --version
td-agent 4.1.0 fluentd 1.12.1 (e3effa337593618cbd7f0f4ef071766df1ec69a0)
$ /opt/td-agent/bin/ruby --version
ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]


ashie commented May 27, 2021

Yes, which means td-agent 4.1.1 can hit the Ruby issue mentioned at #3382 (comment).


ashie commented Jun 4, 2021

Now I believe that the following similar issues (not sure whether they're the same as yours) are caused by Ruby's resolv: #3387 #3382
The resolv 0.2.1 gem fixes it: ruby/resolv#9
It's triggered by a change in excon (excon/excon#739); excon uses Ruby's resolv as of 0.80.0.
excon might be upgraded unexpectedly when you upgrade fluentd & plugins.
So please check the following things (see the commands after this list):

  • Check your versions of Ruby, excon and resolv (if you use the resolv gem instead of Ruby's bundled one)
  • Try the resolv 0.2.1 gem

If you don't use excon 0.80.0 or later, or if resolv 0.2.1 doesn't resolve your problem, we should suspect another cause.
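For a td-agent 4.x install, a rough way to check and try this (td-agent-gem is td-agent's bundled gem command; plain fluentd installs would use gem directly, and paths may differ):

$ /opt/td-agent/bin/ruby --version
$ td-agent-gem list | grep -E '^(excon|resolv) '
$ td-agent-gem install resolv -v 0.2.1    # then restart td-agent so the change takes effect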

@qingling128 (Author)

Not sure how much it helps, but here are some results from the previous experiments (the only difference between them is the fluentd gem version):

[attached: chart comparing memory/CPU usage across fluentd versions]


ashie commented Jun 8, 2021

Thanks, it's very helpful!
Your data indicates that it's definitely caused by something introduced in v1.12.0.rc2, which added follow_inodes for in_tail: #3182 #3196
Although some other features were also introduced in v1.12.0.rc2, I think follow_inodes has the heaviest performance impact among them.
Now I think this is a different problem from #3382.

I'm interested in whether v1.12.4 fixes this issue, because v1.12.0.rc0 - v1.12.3 leak tail watchers after file rotation, which can cause CPU & memory spikes: #3393
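For context, follow_inodes is an opt-in in_tail parameter; a minimal sketch of a source section that enables it (path, pos_file and tag are placeholders):

<source>
  @type tail
  path /var/log/app/*.log                 # placeholder path
  pos_file /var/log/td-agent/app.pos
  tag app.*
  follow_inodes true                      # the v1.12.0 addition under suspicion: track files by inode across rotations
  <parse>
    @type none
  </parse>
</source>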

@qingling128 (Author)

Glad that helped. I just set up an experiment with 1.12.4 as well. Will report back once we get some results. It typically takes a few days for the issue to start revealing itself.

@cosmo0920 (Contributor)

@qingling128 How is your Fluentd 1.12.4 experiment going?

@qingling128 (Author)

It's a bit hard to tell. Eventually all three experiments' memory usage reached a higher plateau:

[attached: memory usage chart for the three experiments]


ashie commented Jun 29, 2021

Thanks for the report!
It seems 1.12.4 is at the same level as 1.12.3, so it doesn't resolve the issue.
We should check the changes between 1.11.5 and 1.12.0.rc2 again.

@qingling128 (Author)

Occasionally we see sudden memory bumps (in both #3389 (comment) and #3389 (comment)), and the timing looks arbitrary.

That might mean the sudden memory bump is non-deterministic, e.g. certain agent versions could show lower memory usage during the window I observed, but maybe I was just lucky. If that's the case, the test comparison probably only gives us a general trend rather than narrowing the root cause down to exactly versions A and B.

I agree that we probably need to look at a few more versions near that range.


kenhys commented Jul 30, 2021

Since https://github.com/Stackdriver/kubernetes-configs/blob/1d0b24b650d7d044899c3e958faeda62acbae9c6/logging-agent.yaml#L131 contains many factors (source/filter/match), I focused on in_tail only and tried to reproduce the problem with a simpler tail test case (globbing *log) over a shorter span, but had no luck observing the problematic spike (sampling every 60 sec with GC::Profiler). The setup combines 3000 log entries/sec generated with dummer, periodic rotation of the source files, and removal of sources.

[Figure_1: memory sampled every 60 sec with GC::Profiler]

Could someone reproduce it with a simple fluent.conf?
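Roughly, the kind of config I used looks like this (a sketch; paths and parameters are illustrative, and dummer plus the periodic rotation/removal run outside Fluentd):

<source>
  @type tail
  path /tmp/test/*log                     # dummer writes ~3000 lines/sec here; files are rotated and removed periodically
  pos_file /tmp/test/fluentd.pos
  tag test.*
  read_from_head true
  <parse>
    @type none
  </parse>
</source>

<match test.**>
  @type null                              # discard records; only in_tail's CPU/memory behaviour matters
</match>

Toggling follow_inodes true in the source section would also exercise the feature suspected earlier in the thread.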

@qingling128 (Author)

Looks like 1.13.3 has better memory usage:

[attached: memory usage chart showing 1.13.3]


ashie commented Aug 20, 2021


ashie commented Aug 20, 2021

Closing this.

If it reproduces again, please let us know.
We'll reopen this.

ashie closed this as completed on Aug 20, 2021