
v5 bug: Logging to ES Stops and Fluentd cpu -> 100% as ES buffer used #885

Closed
andrew-pickin-epi opened this issue May 17, 2021 · 15 comments

andrew-pickin-epi commented May 17, 2021

Kubernetes AWS/EKS installation; the same behaviour is seen on two separate clusters (albeit with the same config).

Problem

Periodically (and with no indication why) Fluentd's ES buffer starts to fill.

(screenshot: ES buffer usage graph)

CPU consumption by the fluentd pod swiftly reaches 100%.

(screenshot: fluentd pod CPU usage graph)

Sometimes fluentd restarts on its own, with no human intervention, as here:

2021-05-17 13:48:59 +0000 [info]: fluent/log.rb:329:info: Worker 0 finished unexpectedly with signal SIGKILL

and the connection with ES is re-established and logging begins again.
No intervention is performed by users. The config remains unchanged throughout.

Expected Behavior or What you need to ask

...

Using Fluentd and ES plugin versions

  • Kubernetes EKS v1.19.6
  • Fluentd Daemonset v0.12
  • ES plugin 5.0.3
  • ES version 7.8.1

fluent-gem list

addressable (2.7.0)
aws-eventstream (1.1.1)
aws-partitions (1.451.0)
aws-sdk-core (3.114.0)
aws-sdk-kms (1.43.0)
aws-sdk-s3 (1.94.1)
aws-sdk-sqs (1.38.0)
aws-sigv4 (1.2.3)
bigdecimal (default: 1.4.1)
bundler (2.2.6, default: 1.17.2)
cmath (default: 1.0.0)
concurrent-ruby (1.1.8)
cool.io (1.7.1)
csv (default: 3.0.9)
date (default: 2.0.0)
dbm (default: 1.0.0)
dig_rb (1.0.1)
domain_name (0.5.20190701)
e2mmap (default: 0.1.0)
elasticsearch (7.12.0)
elasticsearch-api (7.12.0)
elasticsearch-transport (7.12.0)
elasticsearch-xpack (7.12.0)
etc (default: 1.0.1)
excon (0.81.0)
faraday (1.4.1)
faraday-excon (1.1.0)
faraday-net_http (1.0.1)
faraday-net_http_persistent (1.1.0)
fcntl (default: 1.0.0)
ffi (1.15.0)
ffi-compiler (1.0.1)
fiddle (default: 1.0.0)
fileutils (default: 1.1.0)
fluent-config-regexp-type (1.0.0)
fluent-plugin-concat (2.4.0)
fluent-plugin-dedot_filter (1.0.0)
fluent-plugin-detect-exceptions (0.0.13)
fluent-plugin-elasticsearch (4.1.4)
fluent-plugin-geoip (1.3.2)
fluent-plugin-grok-parser (2.6.2)
fluent-plugin-json-in-json-2 (1.0.2)
fluent-plugin-kubernetes_metadata_filter (2.6.0)
fluent-plugin-multi-format-parser (1.0.0)
fluent-plugin-parser-cri (0.1.0)
fluent-plugin-prometheus (1.8.5)
fluent-plugin-record-modifier (2.1.0)
fluent-plugin-rewrite-tag-filter (2.4.0)
fluent-plugin-s3 (1.3.4)
fluent-plugin-systemd (1.0.4)
fluentd (1.12.3)
forwardable (default: 1.2.0)
gdbm (default: 2.0.0)
geoip-c (0.9.1)
geoip2_c (0.3.3)
http (4.4.1)
http-accept (1.7.0)
http-cookie (1.0.3)
http-form_data (2.3.0)
http-parser (1.2.3)
http_parser.rb (0.6.0)
io-console (default: 0.4.7)
ipaddr (default: 1.2.2)
irb (default: 1.0.0)
jmespath (1.4.0)
json (default: 2.1.0)
jsonpath (1.1.0)
kubeclient (4.9.1)
logger (default: 1.3.0)
lru_redux (1.1.0)
matrix (default: 0.1.0)
mime-types (3.3.1)
mime-types-data (3.2021.0225)
msgpack (1.4.2)
multi_json (1.15.0)
multipart-post (2.1.1)
mutex_m (default: 0.1.0)
netrc (0.11.0)
oj (3.11.0)
openssl (default: 2.1.2)
ostruct (default: 0.1.0)
prime (default: 0.1.0)
prometheus-client (0.9.0)
psych (default: 3.1.0)
public_suffix (4.0.6)
quantile (0.2.1)
rake (13.0.3)
rdoc (default: 6.1.2)
recursive-open-struct (1.1.3)
rest-client (2.1.0)
rexml (default: 3.1.9.1)
rss (default: 0.2.7)
ruby2_keywords (0.0.4)
scanf (default: 1.0.0)
sdbm (default: 1.0.0)
serverengine (2.2.3)
shell (default: 0.7)
sigdump (0.2.4)
stringio (default: 0.0.2)
strptime (0.2.5)
strscan (default: 1.0.0)
sync (default: 0.5.0)
systemd-journal (1.4.2)
thwait (default: 0.1.0)
tracer (default: 0.1.0)
tzinfo (2.0.4)
tzinfo-data (1.2021.1)
unf (0.1.4)
unf_ext (0.0.7.7)
webrick (1.7.0, default: 1.4.4)
yajl-ruby (1.4.1)
zlib (default: 1.0.0)



Boot log


2021-05-17 13:48:59 +0000 [info]: fluent/log.rb:329:info: Worker 0 finished unexpectedly with signal SIGKILL
2021-05-17 13:49:01 +0000 [info]: adding match in @FLUENT_LOG pattern="fluent.**" type="null"
2021-05-17 13:49:01 +0000 [info]: adding filter in @S3 pattern="**" type="grep"
2021-05-17 13:49:01 +0000 [info]: adding filter in @S3 pattern="**" type="record_transformer"
2021-05-17 13:49:01 +0000 [info]: adding match in @S3 pattern="**" type="s3"
2021-05-17 13:49:02 +0000 [info]: adding filter pattern="kubernetes.**" type="kubernetes_metadata"
2021-05-17 13:49:02 +0000 [debug]: #0 [filter_kube_metadata] Found directory with secrets: /var/run/secrets/kubernetes.io/serviceaccount
2021-05-17 13:49:02 +0000 [debug]: #0 [filter_kube_metadata] Found CA certificate: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
2021-05-17 13:49:02 +0000 [debug]: #0 [filter_kube_metadata] Found pod token: /var/run/secrets/kubernetes.io/serviceaccount/token
2021-05-17 13:49:02 +0000 [debug]: #0 [filter_kube_metadata] Creating K8S client
2021-05-17 13:49:02 +0000 [info]: adding filter pattern="**" type="grep"
2021-05-17 13:49:02 +0000 [info]: adding filter pattern="kubernetes.**" type="grep"
2021-05-17 13:49:02 +0000 [info]: adding filter pattern="**" type="record_transformer"
2021-05-17 13:49:02 +0000 [info]: adding match pattern="kubernetes.var.log.containers.**" type="rewrite_tag_filter"
2021-05-17 13:49:02 +0000 [info]: #0 [rewrite_container_name] adding rewrite_tag_filter rule: $.kubernetes.container_name [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8de42e38 @keys=["kubernetes", "container_name"], @last_key="container_name", @dig_keys=["kubernetes"]>, /^(.+)$/, "", "kubernetes.log.$1", nil]
2021-05-17 13:49:02 +0000 [info]: adding match pattern="kubernetes.log.alertmanager.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_alertmanager] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8fb3f950 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.consumer*.** kubernetes.log.podium*.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_consumer_podium] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8da381f8 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.dashboard-metrics-scraper.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [nginx_dashboard_scraper] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8dadb0b0 @keys="stream">, /stdout/, "", "kubernetes.log.nginx.stdout", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.elasticsearch.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_elasticsearch] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8db7fac0 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.flux.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_flux] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8de35328 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.fuseki.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_fuseki] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8deb6068 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.grafana.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_grafana] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8dfa11d0 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.helm-operator.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_helm] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8dc703d0 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.kibana.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_kibana] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8dd37070 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.manager.* kubernetes.log.postgres-consumer.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [spring_boot_manager] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8d6edde0 @keys="stream">, /^(.+)$/, "", "kubernetes.log.spring-boot.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.ingester.**" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.hydrology-data-explorer.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_hydrology_data_explorer] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8d418ca0 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.hydro-api.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_hydro_api] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8d4f8080 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1.ts", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.ingester.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_ingester] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8ddbf088 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1.ts", nil]
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.controller.**" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.controller.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [ingress_controller_log_format] adding rewrite_tag_filter rule: log_format [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8d62d7e8 @keys="log_format">, /^(.+)$/, "", "kubernetes.log.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.ingress.stdout.**" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.ingress.stdout.**" type="parser"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.ingress.stdout.**" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.json.**" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.json.**" type="parser"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.nginx.**" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.nginx.**" type="parser"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.redis" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.redis.**" type="parser"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.spring-boot.**" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.spring-boot.**" type="parser"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.spring-boot-sleuth.**" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.spring-boot-sleuth.**" type="parser"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.json.stdout.ts.**" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding match pattern="**" type="copy"
2021-05-17 13:49:03 +0000 [debug]: #0 adding store type="elasticsearch"
2021-05-17 13:49:03 +0000 [debug]: #0 adding store type="relabel"
2021-05-17 13:49:03 +0000 [info]: adding source type="systemd"
2021-05-17 13:49:03 +0000 [info]: adding source type="systemd"
2021-05-17 13:49:03 +0000 [info]: adding source type="systemd"
2021-05-17 13:49:03 +0000 [info]: adding source type="prometheus"
2021-05-17 13:49:03 +0000 [info]: adding source type="prometheus_output_monitor"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: #0 starting fluentd worker pid=1319 ppid=6 worker=0
Pod Env
  FLUENT_ELASTICSEARCH_HOST:               elasticsearch-master
  FLUENT_ELASTICSEARCH_PORT:               9200
  FLUENT_ELASTICSEARCH_SCHEME:             http
  FLUENT_ELASTICSEARCH_LOGSTASH_FORMAT:    true
  FLUENT_ELASTICSEARCH_TEMPLATE_NAME:      logstash_template
  FLUENT_ELASTICSEARCH_TEMPLATE_FILE:      /fluentd/index_template.json
  FLUENT_ELASTICSEARCH_INCLUDE_TIMESTAMP:  true
  FLUENT_ELASTICSEARCH_LOG_ES_400_REASON:  true
  FLUENT_ELASTICSEARCH_ENABLE_ILM:         true
  FLUENT_ELASTICSEARCH_ROLLOVER_INDEX:     false
  FLUENT_ELASTICSEARCH_ILM_POLICY_ID:      logstash_policy
  FLUENT_ELASTICSEARCH_ILM_POLICY:         {
                                              "policy": {
                                               "phases": {
                                                 "delete": {
                                                   "min_age": "7d",
                                                   "actions": {
                                                     "delete": {}              
                                                   }
                                                 }
                                               }
                                             }
                                           }

Plugin config

<match **>
  @type copy
   <store>

    @type elasticsearch
    @id out_es7
    @log_level warn
    include_tag_key true
    host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
    port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
    path "#{ENV['FLUENT_ELASTICSEARCH_PATH']}"
    scheme "#{ENV['FLUENT_ELASTICSEARCH_SCHEME'] || 'http'}"
    ssl_verify "#{ENV['FLUENT_ELASTICSEARCH_SSL_VERIFY'] || 'true'}"
    ssl_version "#{ENV['FLUENT_ELASTICSEARCH_SSL_VERSION'] || 'TLSv1_2'}"
    user "#{ENV['FLUENT_ELASTICSEARCH_USER'] || use_default}"
    password "#{ENV['FLUENT_ELASTICSEARCH_PASSWORD'] || use_default}"
    reload_connections "#{ENV['FLUENT_ELASTICSEARCH_RELOAD_CONNECTIONS'] || 'true'}"
    reconnect_on_error "#{ENV['FLUENT_ELASTICSEARCH_RECONNECT_ON_ERROR'] || 'true'}"
    reload_on_failure "#{ENV['FLUENT_ELASTICSEARCH_RELOAD_ON_FAILURE'] || 'true'}"
    log_es_400_reason "#{ENV['FLUENT_ELASTICSEARCH_LOG_ES_400_REASON'] || 'false'}"
    logstash_prefix "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_PREFIX'] || 'logstash'}"
    logstash_dateformat "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_DATEFORMAT'] || '%Y.%m.%d'}"
    logstash_format "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_FORMAT'] || 'true'}"
    index_name "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_INDEX_NAME'] || 'logstash'}"
    type_name "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_TYPE_NAME'] || 'fluentd'}"
    include_timestamp "#{ENV['FLUENT_ELASTICSEARCH_INCLUDE_TIMESTAMP'] || 'false'}"
    template_name "#{ENV['FLUENT_ELASTICSEARCH_TEMPLATE_NAME'] || use_nil}"
    template_file "#{ENV['FLUENT_ELASTICSEARCH_TEMPLATE_FILE'] || use_nil}"
    template_overwrite "#{ENV['FLUENT_ELASTICSEARCH_TEMPLATE_OVERWRITE'] || use_default}"
    sniffer_class_name "#{ENV['FLUENT_SNIFFER_CLASS_NAME'] || 'Fluent::Plugin::ElasticsearchSimpleSniffer'}"
    request_timeout "#{ENV['FLUENT_ELASTICSEARCH_REQUEST_TIMEOUT'] || '5s'}"
    suppress_type_name "#{ENV['FLUENT_ELASTICSEARCH_SUPPRESS_TYPE_NAME'] || 'true'}"
    enable_ilm "#{ENV['FLUENT_ELASTICSEARCH_ENABLE_ILM'] || 'false'}"
    ilm_policy_id "#{ENV['FLUENT_ELASTICSEARCH_ILM_POLICY_ID'] || use_default}"
    ilm_policy "#{ENV['FLUENT_ELASTICSEARCH_ILM_POLICY'] || use_default}"
    ilm_policy_overwrite "#{ENV['FLUENT_ELASTICSEARCH_ILM_POLICY_OVERWRITE'] || 'false'}"
    <buffer>
      flush_thread_count "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_FLUSH_THREAD_COUNT'] || '8'}"
      flush_interval "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_FLUSH_INTERVAL'] || '5s'}"
      chunk_limit_size "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_CHUNK_LIMIT_SIZE'] || '8M'}"
      queue_limit_length "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_QUEUE_LIMIT_LENGTH'] || '32'}"
      retry_max_interval "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_RETRY_MAX_INTERVAL'] || '30'}"
      retry_timeout "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_RETRY_TIMEOUT'] || '1h'}"
    </buffer>

   </store>

   <store>
      @type relabel
      @label @S3
   </store>

</match>
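For context on how the `<buffer>` settings above bound memory use and retries, here is a back-of-envelope sketch in Ruby. It assumes Fluentd's documented defaults of `retry_wait 1s` and exponential backoff with factor 2, and ignores `retry_randomize`; it is arithmetic only, not Fluentd code.

```ruby
# Rough arithmetic for the <buffer> settings above (assumptions: Fluentd's
# default retry_wait of 1 s and exponential backoff factor 2, no randomization).
chunk_limit_size   = 8 * 1024 * 1024   # '8M'
queue_limit_length = 32
max_buffered_bytes = chunk_limit_size * queue_limit_length   # 256 MiB per buffer queue

retry_wait         = 1      # seconds (Fluentd default)
retry_max_interval = 30     # from the config above
retry_timeout      = 3600   # '1h' from the config above

# Enumerate the retry waits until retry_timeout is exhausted.
waits   = []
elapsed = 0
n       = 0
while elapsed < retry_timeout
  w = [retry_wait * 2**n, retry_max_interval].min  # capped exponential backoff
  waits << w
  elapsed += w
  n += 1
end
# Retries ramp 1, 2, 4, 8, 16 s, then settle at 30 s apart for the rest of the hour.
```

So with these settings each buffer queue can hold up to 256 MiB before it is full, which matches the "buffer starts to fill" symptom above: if flushes stall, the queue grows toward that cap while retries fire every 30 s.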

ES log shows no errors. Other nodes in the daemonset continue to function normally.

Fluentd shows no errors during the period, is not trying to reconnect.

@andrew-pickin-epi andrew-pickin-epi changed the title Logging to ES Stopps as Fluentd cpu -> 100% as ES buffer used Logging to ES Stops and Fluentd cpu -> 100% as ES buffer used May 17, 2021
@andrew-pickin-epi (Author)

It should be noted that calling the /api/plugins.flushBuffers endpoint causes the buffer to be written successfully and CPU usage to return to normal.
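For anyone wanting to reproduce that manual flush: the endpoint comes from Fluentd's HTTP RPC feature, which must be enabled in the `<system>` section. A sketch, assuming the conventional RPC address 127.0.0.1:24444 (the address/port here are assumptions, not from this cluster's config):

```
# fluent.conf -- enable the HTTP RPC endpoint (address/port are assumptions)
<system>
  rpc_endpoint 127.0.0.1:24444
</system>
```

With that in place, `curl http://127.0.0.1:24444/api/plugins.flushBuffers` asks all output plugins to flush their buffers, which is what cleared the buffer and CPU here.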

So:
what is causing the buffer to grow?
why the excessive CPU usage?
why isn't the connection either timing out and reconnecting, or the buffer being gracefully written?

@andrew-pickin-epi (Author)

I have a cluster with identical configuration in production:
Fluentd Daemonset v0.11
ES plugin 4.1.1

This does not exhibit the same behaviour, which suggests a new bug introduced in v5.

@andrew-pickin-epi andrew-pickin-epi changed the title Logging to ES Stops and Fluentd cpu -> 100% as ES buffer used v5 bug: Logging to ES Stops and Fluentd cpu -> 100% as ES buffer used May 18, 2021
@andrew-pickin-epi (Author)

Rolled back to v1.11 on one cluster that was misbehaving, and thus far the CPU runaway has not been seen.

@cosmo0920 (Collaborator)

Rolled back to v1.11 on one cluster that was misbehaving, and thus far the CPU runaway has not been seen.

Which version is back to normal? Is it Fluentd v1.11.5?

@cosmo0920 (Collaborator)

It seems that this CPU usage spike comes from an in_tail plugin code change.
On the Elasticsearch plugin side, the difference from v4.3.3 to v5.0.3 is:
v4.3.3...master#diff-cf75cdd038558996eefb3a1e148524c313f9094460b38022d003bf7d1878233a

Just changes to support IPv6 strings in the host address.

I suspect that this spike originates from in_tail, which is built into Fluentd core.

@cosmo0920 (Collaborator)

Also, could you send a SIGCONT signal to the Fluentd process during the CPU spike?
I wonder what the culprit of the resource usage spike is.

@ashie (Contributor) commented May 24, 2021

Also, could you send a SIGCONT signal to the Fluentd process during the CPU spike?
I wonder what the culprit of the resource usage spike is.

Do you intend to get a stacktrace? (I'd also like to get one.)
If so, we need to explain how to get it:

https://github.com/frsyuki/sigdump#usage

@ashie (Contributor) commented May 24, 2021

fluent/fluentd#3387 might be the same issue.

@andrew-pickin-epi (Author)

Rolled back to v1.11 on one cluster that was misbehaving, and thus far the CPU runaway has not been seen.

Which version is back to normal? Is it Fluentd v1.11.5?

The fluentd I rolled back to is v1.11.1.

addressable (2.7.0)
aws-eventstream (1.1.0)
aws-partitions (1.351.0)
aws-sdk-core (3.104.3)
aws-sdk-kms (1.36.0)
aws-sdk-s3 (1.75.0)
aws-sdk-sqs (1.30.0)
aws-sigv4 (1.2.1)
bigdecimal (default: 1.4.1)
bundler (2.1.4, default: 1.17.2)
cmath (default: 1.0.0)
concurrent-ruby (1.1.6)
cool.io (1.6.0)
csv (default: 3.0.9)
date (default: 2.0.0)
dbm (default: 1.0.0)
dig_rb (1.0.1)
domain_name (0.5.20190701)
e2mmap (default: 0.1.0)
elasticsearch (7.8.1)
elasticsearch-api (7.8.1)
elasticsearch-transport (7.8.1)
elasticsearch-xpack (7.8.1)
etc (default: 1.0.1)
excon (0.76.0)
faraday (1.0.1)
fcntl (default: 1.0.0)
ffi (1.13.1)
ffi-compiler (1.0.1)
fiddle (default: 1.0.0)
fileutils (default: 1.1.0)
fluent-config-regexp-type (1.0.0)
fluent-plugin-concat (2.4.0)
fluent-plugin-dedot_filter (1.0.0)
fluent-plugin-detect-exceptions (0.0.13)
fluent-plugin-elasticsearch (4.1.1)
fluent-plugin-geoip (1.3.2)
fluent-plugin-grok-parser (2.6.1)
fluent-plugin-json-in-json-2 (1.0.2)
fluent-plugin-kubernetes_metadata_filter (2.3.0)
fluent-plugin-multi-format-parser (1.0.0)
fluent-plugin-prometheus (1.6.1)
fluent-plugin-record-modifier (2.0.1)
fluent-plugin-rewrite-tag-filter (2.2.0)
fluent-plugin-s3 (1.3.4)
fluent-plugin-systemd (1.0.2)
fluentd (1.11.1)
forwardable (default: 1.2.0)
gdbm (default: 2.0.0)
geoip-c (0.9.1)
geoip2_c (0.3.3)
http (4.4.1)
http-accept (1.7.0)
http-cookie (1.0.3)
http-form_data (2.3.0)
http-parser (1.2.1)
http_parser.rb (0.6.0)
io-console (default: 0.4.7)
ipaddr (default: 1.2.2)
irb (default: 1.0.0)
jmespath (1.4.0)
json (default: 2.1.0)
jsonpath (1.0.5)
kubeclient (4.9.0)
logger (default: 1.3.0)
lru_redux (1.1.0)
matrix (default: 0.1.0)
mime-types (3.3.1)
mime-types-data (3.2020.0512)
msgpack (1.3.3)
multi_json (1.15.0)
multipart-post (2.1.1)
mutex_m (default: 0.1.0)
netrc (0.11.0)
oj (3.8.1)
openssl (default: 2.1.2)
ostruct (default: 0.1.0)
prime (default: 0.1.0)
prometheus-client (0.9.0)
psych (default: 3.1.0)
public_suffix (4.0.5)
quantile (0.2.1)
rake (13.0.1)
rdoc (default: 6.1.2)
recursive-open-struct (1.1.2)
rest-client (2.1.0)
rexml (default: 3.1.9)
rss (default: 0.2.7)
scanf (default: 1.0.0)
sdbm (default: 1.0.0)
serverengine (2.2.1)
shell (default: 0.7)
sigdump (0.2.4)
stringio (default: 0.0.2)
strptime (0.2.4)
strscan (default: 1.0.0)
sync (default: 0.5.0)
systemd-journal (1.3.3)
thwait (default: 0.1.0)
to_regexp (0.2.1)
tracer (default: 0.1.0)
tzinfo (2.0.2)
tzinfo-data (1.2020.1)
unf (0.1.4)
unf_ext (0.0.7.7)
webrick (default: 1.4.2)
yajl-ruby (1.4.1)
zlib (default: 1.0.0)

@andrew-pickin-epi (Author)

It seems that this CPU usage spike comes from an in_tail plugin code change.
On the Elasticsearch plugin side, the difference from v4.3.3 to v5.0.3 is:
v4.3.3...master#diff-cf75cdd038558996eefb3a1e148524c313f9094460b38022d003bf7d1878233a

Just changes to support IPv6 strings in the host address.

I suspect that this spike originates from in_tail, which is built into Fluentd core.

I tried rolling back just the ES plugin (so fluentd v1.12.3 & ES plugin 4.1.4); this also failed, as recorded here: fluent/fluentd#3382

This also leads me to suspect that it is a core issue.

I will roll forward to v1.12.3/5.0.3 and attempt to get a sigdump output.
However, I have no deterministic way of provoking the issue.

@andrew-pickin-epi (Author) commented May 25, 2021

5-node EKS development cluster with very little load, Fluentd v1.12.3 / ES plugin 5.0.3.
After running for approx. 22 hrs without incident, 2 nodes went to 100% CPU within 30 minutes of each other.

(screenshots: 2021-05-25a, 2021-05-25b)

fluentd-kqlvw:

top - 10:59:17 up 12 days, 11:29,  0 users,  load average: 2.64, 2.08, 2.25
Tasks:   5 total,   1 running,   4 sleeping,   0 stopped,   0 zombie
%Cpu(s): 77.1 us, 21.4 sy,  0.0 ni,  0.5 id,  0.8 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :   3877.4 total,    183.8 free,   3079.8 used,    613.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    855.6 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   61 root      20   0  690624 120476   9192 S  94.0   3.0  86:02.55 ruby  
    1 root      20   0    8296   1156   1036 S   0.0   0.0   0:01.77 tini 
    7 root      20   0  159128  68112   5696 S   0.0   1.7   0:20.59 ruby
  297 root      20   0   11504   4400   3800 S   0.0   0.1   0:00.00 bash 
  302 root      20   0   15236   4376   3656 R   0.0   0.1   0:00.00 top 
UID        PID  PPID  C STIME TTY          TIME CMD 
root         1     0  0 May24 ?        00:00:01 tini -- /fluentd/entrypoint.sh
root         7     1  0 May24 ?        00:00:20 ruby /fluentd/vendor/bundle/ruby/2.6.0/bin/fluentd -c /fluentd/etc/fluent.conf -p /fluentd/plugins --gemfile /fluentd/Gemfile -r /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/elasticsearch_simple_sniffer.rb
root        61     7  6 May24 ?        01:27:06 /usr/local/bin/ruby -Eascii-8bit:ascii-8bit /fluentd/vendor/bundle/ruby/2.6.0/bin/fluentd -c /fluentd/etc/fluent.conf -p /fluentd/plugins --gemfile /fluentd/Gemfile -r /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/elasticsearch_simple_sniffer.rb --under-supervisor
root       297     0  0 10:58 pts/0    00:00:00 bash
root       304   297  0 11:00 pts/0    00:00:00 ps -ef

fluentd-kqlvw-sigdump-61.log

fluentd-wrpmk:

top - 11:09:33 up 60 days, 21:10,  0 users,  load average: 1.74, 1.61, 1.65
Tasks:   5 total,   1 running,   4 sleeping,   0 stopped,   0 zombie 
%Cpu(s): 42.8 us, 24.1 sy,  0.0 ni, 33.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   3877.4 total,    187.0 free,   2946.9 used,    743.6 buff/cache  
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    908.7 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   61 root      20   0  815268 136936  20916 S  98.7   3.4  64:16.89 ruby
    1 root      20   0    8296   1448   1328 S   0.0   0.0   0:02.09 tini
    6 root      20   0  157060  70768   8324 S   0.0   1.8   0:20.91 ruby
  314 root      20   0   11504   4440   3864 S   0.0   0.1   0:00.01 bash
  319 root      20   0   15236   4304   3564 R   0.0   0.1   0:00.00 top
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 May24 ?        00:00:02 tini -- /fluentd/entrypoint.sh
root         6     1  0 May24 ?        00:00:20 ruby /fluentd/vendor/bundle/ruby/2.6.0/bin/fluentd -c /fluentd/etc/fluent.conf -p /fluentd/plugins --gemfile /fluentd/Gemfile -r /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/elasticsearch_simple_sniffer.rb
root        61     6  4 May24 ?        01:05:02 /usr/local/bin/ruby -Eascii-8bit:ascii-8bit /fluentd/vendor/bundle/ruby/2.6.0/bin/fluentd -c /fluentd/etc/fluent.conf -p /fluentd/plugins --gemfile /fluentd/Gemfile -r /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/elasticsearch_simple_sniffer.rb --under-supervisor
root       314     0  0 11:09 pts/0    00:00:00 bash
root       321   314  0 11:10 pts/0    00:00:00 ps -ef

fluentd-wrpmk-sigdump-61.log

@markiewitch
@andrew-pickin-epi I can see the same endless loop of gen_random_urandom that I had. Try running strace -c -p <fluentd PID>; you'll probably see tens of thousands of calls. My workaround, until DNS resolution in Ruby is fixed, was to set http_backend typhoeus, which uses persistent TCP connections. I've linked the culprit code fragment here.
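For anyone trying this workaround, switching the HTTP backend is a one-line change in the elasticsearch `<store>`. A sketch (the surrounding settings are elided; the typhoeus gem must also be installed in the Fluentd image):

```
<store>
  @type elasticsearch
  # ... existing host/port/buffer settings ...
  http_backend typhoeus   # persistent connections, so far fewer DNS lookups
</store>
```

The default backend (excon) sets up connections in a way that triggers the resolv lookups discussed below, which is why reusing connections sidesteps the spike.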

@ashie (Contributor) commented Jun 4, 2021

fluent/fluentd#3387 (comment) by @dzikismigol

The problem lies somewhere in /usr/lib/ruby/2.7.0/resolv.rb, pasted below. Ruby keeps generating random request IDs, probably because of a collision caused by (?) not calling free_request_id.

I found a resolv issue for it: ruby/resolv#11
It's already fixed by ruby/resolv#9.
Please try resolv 0.2.1.

I think it's triggered by excon's change: excon/excon#739
excon uses Ruby's resolv instead of libc's resolver as of v0.80.0.
And @andrew-pickin-epi's gem lists also indicate it:

after:

excon (0.81.0)

before:

excon (0.76.0)
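The failure mode described above can be illustrated without resolv itself. Below is a minimal Ruby simulation (an assumption-laden sketch, not resolv.rb's actual code) of what happens when random 16-bit DNS request IDs are marked in-use but never freed: as the in-use set fills, each allocation needs more and more random draws, which is exactly the kind of busy loop that burns CPU without logging any error.

```ruby
# Simulate allocating random 16-bit DNS request IDs without ever freeing them.
require 'set'

def allocate_request_id(in_use, counter)
  loop do
    counter[:n] += 1                 # count every random draw
    id = rand(0x10000)               # DNS request IDs are 16-bit
    return id if in_use.add?(id)     # Set#add? returns nil on a collision
  end
end

in_use  = Set.new
counter = { n: 0 }

# Early phase: the ID space is mostly free, so collisions are rare.
60_000.times { allocate_request_id(in_use, counter) }
early_draws = counter[:n]

# Late phase: ~60k of 65,536 IDs are "in use", so collisions dominate
# and each allocation needs many draws -- CPU climbs, nothing fails.
counter[:n] = 0
5_000.times { allocate_request_id(in_use, counter) }
late_draws = counter[:n]
```

Freeing IDs after each lookup (what the ruby/resolv fix restores) keeps the set small, so allocation stays cheap; that matches "CPU drops back to normal once the buffer flushes and lookups stop piling up".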

@andrew-pickin-epi (Author)

I have not seen this error since moving to 1.13.

@danielhoherd
I can report the same; fixed in 1.13.
