
v5 bug: Logging to ES Stops and Fluentd cpu -> 100% as ES buffer used #885

Closed
andrew-pickin-epi opened this issue May 17, 2021 · 15 comments

andrew-pickin-epi commented May 17, 2021

Kubernetes AWS/EKS installation; the same behaviour is seen on two separate clusters (albeit with the same config).

Problem

Periodically (and with no indication why) Fluentd's ES buffer starts to fill.

(screenshot: ES buffer usage graph)

CPU consumption by the fluentd pod swiftly reaches 100%.

(screenshot: fluentd pod CPU usage graph)

Sometimes fluentd restarts on its own, with no human intervention, as here:

2021-05-17 13:48:59 +0000 [info]: fluent/log.rb:329:info: Worker 0 finished unexpectedly with signal SIGKILL

and the connection with ES is re-established and logging begins again.
No intervention is performed by users. The config remains unchanged throughout.

Expected Behavior or What you need to ask

...

Using Fluentd and ES plugin versions

  • Kubernetes EKS v1.19.6
  • Fluentd Daemonset v0.12
  • ES plugin 5.0.3
  • ES version 7.8.1

fluent-gem list

addressable (2.7.0)
aws-eventstream (1.1.1)
aws-partitions (1.451.0)
aws-sdk-core (3.114.0)
aws-sdk-kms (1.43.0)
aws-sdk-s3 (1.94.1)
aws-sdk-sqs (1.38.0)
aws-sigv4 (1.2.3)
bigdecimal (default: 1.4.1)
bundler (2.2.6, default: 1.17.2)
cmath (default: 1.0.0)
concurrent-ruby (1.1.8)
cool.io (1.7.1)
csv (default: 3.0.9)
date (default: 2.0.0)
dbm (default: 1.0.0)
dig_rb (1.0.1)
domain_name (0.5.20190701)
e2mmap (default: 0.1.0)
elasticsearch (7.12.0)
elasticsearch-api (7.12.0)
elasticsearch-transport (7.12.0)
elasticsearch-xpack (7.12.0)
etc (default: 1.0.1)
excon (0.81.0)
faraday (1.4.1)
faraday-excon (1.1.0)
faraday-net_http (1.0.1)
faraday-net_http_persistent (1.1.0)
fcntl (default: 1.0.0)
ffi (1.15.0)
ffi-compiler (1.0.1)
fiddle (default: 1.0.0)
fileutils (default: 1.1.0)
fluent-config-regexp-type (1.0.0)
fluent-plugin-concat (2.4.0)
fluent-plugin-dedot_filter (1.0.0)
fluent-plugin-detect-exceptions (0.0.13)
fluent-plugin-elasticsearch (4.1.4)
fluent-plugin-geoip (1.3.2)
fluent-plugin-grok-parser (2.6.2)
fluent-plugin-json-in-json-2 (1.0.2)
fluent-plugin-kubernetes_metadata_filter (2.6.0)
fluent-plugin-multi-format-parser (1.0.0)
fluent-plugin-parser-cri (0.1.0)
fluent-plugin-prometheus (1.8.5)
fluent-plugin-record-modifier (2.1.0)
fluent-plugin-rewrite-tag-filter (2.4.0)
fluent-plugin-s3 (1.3.4)
fluent-plugin-systemd (1.0.4)
fluentd (1.12.3)
forwardable (default: 1.2.0)
gdbm (default: 2.0.0)
geoip-c (0.9.1)
geoip2_c (0.3.3)
http (4.4.1)
http-accept (1.7.0)
http-cookie (1.0.3)
http-form_data (2.3.0)
http-parser (1.2.3)
http_parser.rb (0.6.0)
io-console (default: 0.4.7)
ipaddr (default: 1.2.2)
irb (default: 1.0.0)
jmespath (1.4.0)
json (default: 2.1.0)
jsonpath (1.1.0)
kubeclient (4.9.1)
logger (default: 1.3.0)
lru_redux (1.1.0)
matrix (default: 0.1.0)
mime-types (3.3.1)
mime-types-data (3.2021.0225)
msgpack (1.4.2)
multi_json (1.15.0)
multipart-post (2.1.1)
mutex_m (default: 0.1.0)
netrc (0.11.0)
oj (3.11.0)
openssl (default: 2.1.2)
ostruct (default: 0.1.0)
prime (default: 0.1.0)
prometheus-client (0.9.0)
psych (default: 3.1.0)
public_suffix (4.0.6)
quantile (0.2.1)
rake (13.0.3)
rdoc (default: 6.1.2)
recursive-open-struct (1.1.3)
rest-client (2.1.0)
rexml (default: 3.1.9.1)
rss (default: 0.2.7)
ruby2_keywords (0.0.4)
scanf (default: 1.0.0)
sdbm (default: 1.0.0)
serverengine (2.2.3)
shell (default: 0.7)
sigdump (0.2.4)
stringio (default: 0.0.2)
strptime (0.2.5)
strscan (default: 1.0.0)
sync (default: 0.5.0)
systemd-journal (1.4.2)
thwait (default: 0.1.0)
tracer (default: 0.1.0)
tzinfo (2.0.4)
tzinfo-data (1.2021.1)
unf (0.1.4)
unf_ext (0.0.7.7)
webrick (1.7.0, default: 1.4.4)
yajl-ruby (1.4.1)
zlib (default: 1.0.0)



Boot log


2021-05-17 13:48:59 +0000 [info]: fluent/log.rb:329:info: Worker 0 finished unexpectedly with signal SIGKILL
2021-05-17 13:49:01 +0000 [info]: adding match in @FLUENT_LOG pattern="fluent.**" type="null"
2021-05-17 13:49:01 +0000 [info]: adding filter in @S3 pattern="**" type="grep"
2021-05-17 13:49:01 +0000 [info]: adding filter in @S3 pattern="**" type="record_transformer"
2021-05-17 13:49:01 +0000 [info]: adding match in @S3 pattern="**" type="s3"
2021-05-17 13:49:02 +0000 [info]: adding filter pattern="kubernetes.**" type="kubernetes_metadata"
2021-05-17 13:49:02 +0000 [debug]: #0 [filter_kube_metadata] Found directory with secrets: /var/run/secrets/kubernetes.io/serviceaccount
2021-05-17 13:49:02 +0000 [debug]: #0 [filter_kube_metadata] Found CA certificate: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
2021-05-17 13:49:02 +0000 [debug]: #0 [filter_kube_metadata] Found pod token: /var/run/secrets/kubernetes.io/serviceaccount/token
2021-05-17 13:49:02 +0000 [debug]: #0 [filter_kube_metadata] Creating K8S client
2021-05-17 13:49:02 +0000 [info]: adding filter pattern="**" type="grep"
2021-05-17 13:49:02 +0000 [info]: adding filter pattern="kubernetes.**" type="grep"
2021-05-17 13:49:02 +0000 [info]: adding filter pattern="**" type="record_transformer"
2021-05-17 13:49:02 +0000 [info]: adding match pattern="kubernetes.var.log.containers.**" type="rewrite_tag_filter"
2021-05-17 13:49:02 +0000 [info]: #0 [rewrite_container_name] adding rewrite_tag_filter rule: $.kubernetes.container_name [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8de42e38 @keys=["kubernetes", "container_name"], @last_key="container_name", @dig_keys=["kubernetes"]>, /^(.+)$/, "", "kubernetes.log.$1", nil]
2021-05-17 13:49:02 +0000 [info]: adding match pattern="kubernetes.log.alertmanager.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_alertmanager] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8fb3f950 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.consumer*.** kubernetes.log.podium*.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_consumer_podium] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8da381f8 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.dashboard-metrics-scraper.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [nginx_dashboard_scraper] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8dadb0b0 @keys="stream">, /stdout/, "", "kubernetes.log.nginx.stdout", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.elasticsearch.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_elasticsearch] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8db7fac0 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.flux.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_flux] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8de35328 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.fuseki.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_fuseki] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8deb6068 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.grafana.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_grafana] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8dfa11d0 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.helm-operator.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_helm] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8dc703d0 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.kibana.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_kibana] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8dd37070 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.manager.* kubernetes.log.postgres-consumer.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [spring_boot_manager] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8d6edde0 @keys="stream">, /^(.+)$/, "", "kubernetes.log.spring-boot.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.ingester.**" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.hydrology-data-explorer.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_hydrology_data_explorer] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8d418ca0 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.hydro-api.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_hydro_api] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8d4f8080 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1.ts", nil]
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.ingester.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [json_ingester] adding rewrite_tag_filter rule: stream [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8ddbf088 @keys="stream">, /^(.+)$/, "", "kubernetes.log.json.$1.ts", nil]
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.controller.**" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding match pattern="kubernetes.log.controller.**" type="rewrite_tag_filter"
2021-05-17 13:49:03 +0000 [info]: #0 [ingress_controller_log_format] adding rewrite_tag_filter rule: log_format [#<Fluent::PluginHelper::RecordAccessor::Accessor:0x00007fed8d62d7e8 @keys="log_format">, /^(.+)$/, "", "kubernetes.log.$1", nil]
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.ingress.stdout.**" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.ingress.stdout.**" type="parser"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.ingress.stdout.**" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.json.**" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.json.**" type="parser"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.nginx.**" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.nginx.**" type="parser"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.redis" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.redis.**" type="parser"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.spring-boot.**" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.spring-boot.**" type="parser"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.spring-boot-sleuth.**" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.spring-boot-sleuth.**" type="parser"
2021-05-17 13:49:03 +0000 [info]: adding filter pattern="kubernetes.log.json.stdout.ts.**" type="record_transformer"
2021-05-17 13:49:03 +0000 [info]: adding match pattern="**" type="copy"
2021-05-17 13:49:03 +0000 [debug]: #0 adding store type="elasticsearch"
2021-05-17 13:49:03 +0000 [debug]: #0 adding store type="relabel"
2021-05-17 13:49:03 +0000 [info]: adding source type="systemd"
2021-05-17 13:49:03 +0000 [info]: adding source type="systemd"
2021-05-17 13:49:03 +0000 [info]: adding source type="systemd"
2021-05-17 13:49:03 +0000 [info]: adding source type="prometheus"
2021-05-17 13:49:03 +0000 [info]: adding source type="prometheus_output_monitor"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: adding source type="tail"
2021-05-17 13:49:03 +0000 [info]: #0 starting fluentd worker pid=1319 ppid=6 worker=0
Pod Env
  FLUENT_ELASTICSEARCH_HOST:               elasticsearch-master
  FLUENT_ELASTICSEARCH_PORT:               9200
  FLUENT_ELASTICSEARCH_SCHEME:             http
  FLUENT_ELASTICSEARCH_LOGSTASH_FORMAT:    true
  FLUENT_ELASTICSEARCH_TEMPLATE_NAME:      logstash_template
  FLUENT_ELASTICSEARCH_TEMPLATE_FILE:      /fluentd/index_template.json
  FLUENT_ELASTICSEARCH_INCLUDE_TIMESTAMP:  true
  FLUENT_ELASTICSEARCH_LOG_ES_400_REASON:  true
  FLUENT_ELASTICSEARCH_ENABLE_ILM:         true
  FLUENT_ELASTICSEARCH_ROLLOVER_INDEX:     false
  FLUENT_ELASTICSEARCH_ILM_POLICY_ID:      logstash_policy
  FLUENT_ELASTICSEARCH_ILM_POLICY:         {
                                              "policy": {
                                               "phases": {
                                                 "delete": {
                                                   "min_age": "7d",
                                                   "actions": {
                                                     "delete": {}              
                                                   }
                                                 }
                                               }
                                             }
                                           }

Plugin config

<match **>
  @type copy
   <store>

    @type elasticsearch
    @id out_es7
    @log_level warn
    include_tag_key true
    host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
    port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
    path "#{ENV['FLUENT_ELASTICSEARCH_PATH']}"
    scheme "#{ENV['FLUENT_ELASTICSEARCH_SCHEME'] || 'http'}"
    ssl_verify "#{ENV['FLUENT_ELASTICSEARCH_SSL_VERIFY'] || 'true'}"
    ssl_version "#{ENV['FLUENT_ELASTICSEARCH_SSL_VERSION'] || 'TLSv1_2'}"
    user "#{ENV['FLUENT_ELASTICSEARCH_USER'] || use_default}"
    password "#{ENV['FLUENT_ELASTICSEARCH_PASSWORD'] || use_default}"
    reload_connections "#{ENV['FLUENT_ELASTICSEARCH_RELOAD_CONNECTIONS'] || 'true'}"
    reconnect_on_error "#{ENV['FLUENT_ELASTICSEARCH_RECONNECT_ON_ERROR'] || 'true'}"
    reload_on_failure "#{ENV['FLUENT_ELASTICSEARCH_RELOAD_ON_FAILURE'] || 'true'}"
    log_es_400_reason "#{ENV['FLUENT_ELASTICSEARCH_LOG_ES_400_REASON'] || 'false'}"
    logstash_prefix "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_PREFIX'] || 'logstash'}"
    logstash_dateformat "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_DATEFORMAT'] || '%Y.%m.%d'}"
    logstash_format "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_FORMAT'] || 'true'}"
    index_name "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_INDEX_NAME'] || 'logstash'}"
    type_name "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_TYPE_NAME'] || 'fluentd'}"
    include_timestamp "#{ENV['FLUENT_ELASTICSEARCH_INCLUDE_TIMESTAMP'] || 'false'}"
    template_name "#{ENV['FLUENT_ELASTICSEARCH_TEMPLATE_NAME'] || use_nil}"
    template_file "#{ENV['FLUENT_ELASTICSEARCH_TEMPLATE_FILE'] || use_nil}"
    template_overwrite "#{ENV['FLUENT_ELASTICSEARCH_TEMPLATE_OVERWRITE'] || use_default}"
    sniffer_class_name "#{ENV['FLUENT_SNIFFER_CLASS_NAME'] || 'Fluent::Plugin::ElasticsearchSimpleSniffer'}"
    request_timeout "#{ENV['FLUENT_ELASTICSEARCH_REQUEST_TIMEOUT'] || '5s'}"
    suppress_type_name "#{ENV['FLUENT_ELASTICSEARCH_SUPPRESS_TYPE_NAME'] || 'true'}"
    enable_ilm "#{ENV['FLUENT_ELASTICSEARCH_ENABLE_ILM'] || 'false'}"
    ilm_policy_id "#{ENV['FLUENT_ELASTICSEARCH_ILM_POLICY_ID'] || use_default}"
    ilm_policy "#{ENV['FLUENT_ELASTICSEARCH_ILM_POLICY'] || use_default}"
    ilm_policy_overwrite "#{ENV['FLUENT_ELASTICSEARCH_ILM_POLICY_OVERWRITE'] || 'false'}"
    <buffer>
      flush_thread_count "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_FLUSH_THREAD_COUNT'] || '8'}"
      flush_interval "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_FLUSH_INTERVAL'] || '5s'}"
      chunk_limit_size "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_CHUNK_LIMIT_SIZE'] || '8M'}"
      queue_limit_length "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_QUEUE_LIMIT_LENGTH'] || '32'}"
      retry_max_interval "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_RETRY_MAX_INTERVAL'] || '30'}"
      retry_timeout "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_RETRY_TIMEOUT'] || '1h'}"
    </buffer>

   </store>

   <store>
      @type relabel
      @label @S3
   </store>

</match>
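For context on how the `<buffer>` settings above bound memory use and retries, here is a back-of-envelope sketch in Ruby. It assumes Fluentd's documented defaults of `retry_wait 1s` and exponential backoff with factor 2, and ignores `retry_randomize`; it is arithmetic only, not Fluentd code.

```ruby
# Rough arithmetic for the <buffer> settings above (assumptions: Fluentd's
# default retry_wait of 1 s and exponential backoff factor 2, no randomization).
chunk_limit_size   = 8 * 1024 * 1024   # '8M'
queue_limit_length = 32
max_buffered_bytes = chunk_limit_size * queue_limit_length   # 256 MiB per buffer queue

retry_wait         = 1      # seconds (Fluentd default)
retry_max_interval = 30     # from the config above
retry_timeout      = 3600   # '1h' from the config above

# Enumerate the retry waits until retry_timeout is exhausted.
waits   = []
elapsed = 0
n       = 0
while elapsed < retry_timeout
  w = [retry_wait * 2**n, retry_max_interval].min  # capped exponential backoff
  waits << w
  elapsed += w
  n += 1
end
# Retries ramp 1, 2, 4, 8, 16 s, then settle at 30 s apart for the rest of the hour.
```

So with these settings each buffer queue can hold up to 256 MiB before it is full, which matches the "buffer starts to fill" symptom above: if flushes stall, the queue grows toward that cap while retries fire every 30 s.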

ES log shows no errors. Other nodes in the daemonset continue to function normally.

Fluentd shows no errors during the period, is not trying to reconnect.

@andrew-pickin-epi andrew-pickin-epi changed the title Logging to ES Stopps as Fluentd cpu -> 100% as ES buffer used Logging to ES Stops and Fluentd cpu -> 100% as ES buffer used May 17, 2021
@andrew-pickin-epi (Author)

It should be noted that calling the /api/plugins.flushBuffers endpoint causes the buffer to be written successfully and CPU usage to return to normal.
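For anyone wanting to reproduce that manual flush: the endpoint comes from Fluentd's HTTP RPC feature, which must be enabled in the `<system>` section. A sketch, assuming the conventional RPC address 127.0.0.1:24444 (the address/port here are assumptions, not from this cluster's config):

```
# fluent.conf -- enable the HTTP RPC endpoint (address/port are assumptions)
<system>
  rpc_endpoint 127.0.0.1:24444
</system>
```

With that in place, `curl http://127.0.0.1:24444/api/plugins.flushBuffers` asks all output plugins to flush their buffers, which is what cleared the buffer and CPU here.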

So:
what is causing the buffer to grow?
why the excessive CPU usage?
why isn't the connection either timing out and reconnecting, or the buffer being gracefully written?

@andrew-pickin-epi (Author)

I have a cluster with identical configuration in production:
Fluentd Daemonset v0.11
ES plugin 4.1.1

This does not exhibit the same behaviour, which suggests a new bug introduced in v5.

@andrew-pickin-epi andrew-pickin-epi changed the title Logging to ES Stops and Fluentd cpu -> 100% as ES buffer used v5 bug: Logging to ES Stops and Fluentd cpu -> 100% as ES buffer used May 18, 2021
@andrew-pickin-epi (Author)

Rolled back to v1.11 on one cluster that was misbehaving, and thus far the CPU runaway has not been seen.

@cosmo0920 (Collaborator)

Rolled back to v1.11 on one cluster that was misbehaving, and thus far the CPU runaway has not been seen.

Which version is back to normal? Is it Fluentd v1.11.5?

@cosmo0920 (Collaborator)

It seems that this CPU usage spike comes from an in_tail plugin code change.
On the Elasticsearch plugin side, the difference from v4.3.3 to v5.0.3 is:
v4.3.3...master#diff-cf75cdd038558996eefb3a1e148524c313f9094460b38022d003bf7d1878233a

Just changes to support IPv6 strings in the host address.

I suspect that this spike originates from in_tail, which is built into Fluentd core.

@cosmo0920 (Collaborator)

Also, could you send a SIGCONT signal to the Fluentd process during the CPU spike?
I wonder what the culprit of the resource usage spike is.

@ashie (Contributor) commented May 24, 2021

Also, could you send a SIGCONT signal to the Fluentd process during the CPU spike?
I wonder what the culprit of the resource usage spike is.

Do you intend to get a stacktrace? (I'd also like to get one.)
If so, we need to explain how to get it:

https://github.com/frsyuki/sigdump#usage

@ashie (Contributor) commented May 24, 2021

fluent/fluentd#3387 might be the same issue.

@andrew-pickin-epi (Author)

Rolled back to v1.11 on one cluster that was misbehaving, and thus far the CPU runaway has not been seen.

Which version is back to normal? Is it Fluentd v1.11.5?

The fluentd I rolled back to is v1.11.1.

addressable (2.7.0)
aws-eventstream (1.1.0)
aws-partitions (1.351.0)
aws-sdk-core (3.104.3)
aws-sdk-kms (1.36.0)
aws-sdk-s3 (1.75.0)
aws-sdk-sqs (1.30.0)
aws-sigv4 (1.2.1)
bigdecimal (default: 1.4.1)
bundler (2.1.4, default: 1.17.2)
cmath (default: 1.0.0)
concurrent-ruby (1.1.6)
cool.io (1.6.0)
csv (default: 3.0.9)
date (default: 2.0.0)
dbm (default: 1.0.0)
dig_rb (1.0.1)
domain_name (0.5.20190701)
e2mmap (default: 0.1.0)
elasticsearch (7.8.1)
elasticsearch-api (7.8.1)
elasticsearch-transport (7.8.1)
elasticsearch-xpack (7.8.1)
etc (default: 1.0.1)
excon (0.76.0)
faraday (1.0.1)
fcntl (default: 1.0.0)
ffi (1.13.1)
ffi-compiler (1.0.1)
fiddle (default: 1.0.0)
fileutils (default: 1.1.0)
fluent-config-regexp-type (1.0.0)
fluent-plugin-concat (2.4.0)
fluent-plugin-dedot_filter (1.0.0)
fluent-plugin-detect-exceptions (0.0.13)
fluent-plugin-elasticsearch (4.1.1)
fluent-plugin-geoip (1.3.2)
fluent-plugin-grok-parser (2.6.1)
fluent-plugin-json-in-json-2 (1.0.2)
fluent-plugin-kubernetes_metadata_filter (2.3.0)
fluent-plugin-multi-format-parser (1.0.0)
fluent-plugin-prometheus (1.6.1)
fluent-plugin-record-modifier (2.0.1)
fluent-plugin-rewrite-tag-filter (2.2.0)
fluent-plugin-s3 (1.3.4)
fluent-plugin-systemd (1.0.2)
fluentd (1.11.1)
forwardable (default: 1.2.0)
gdbm (default: 2.0.0)
geoip-c (0.9.1)
geoip2_c (0.3.3)
http (4.4.1)
http-accept (1.7.0)
http-cookie (1.0.3)
http-form_data (2.3.0)
http-parser (1.2.1)
http_parser.rb (0.6.0)
io-console (default: 0.4.7)
ipaddr (default: 1.2.2)
irb (default: 1.0.0)
jmespath (1.4.0)
json (default: 2.1.0)
jsonpath (1.0.5)
kubeclient (4.9.0)
logger (default: 1.3.0)
lru_redux (1.1.0)
matrix (default: 0.1.0)
mime-types (3.3.1)
mime-types-data (3.2020.0512)
msgpack (1.3.3)
multi_json (1.15.0)
multipart-post (2.1.1)
mutex_m (default: 0.1.0)
netrc (0.11.0)
oj (3.8.1)
openssl (default: 2.1.2)
ostruct (default: 0.1.0)
prime (default: 0.1.0)
prometheus-client (0.9.0)
psych (default: 3.1.0)
public_suffix (4.0.5)
quantile (0.2.1)
rake (13.0.1)
rdoc (default: 6.1.2)
recursive-open-struct (1.1.2)
rest-client (2.1.0)
rexml (default: 3.1.9)
rss (default: 0.2.7)
scanf (default: 1.0.0)
sdbm (default: 1.0.0)
serverengine (2.2.1)
shell (default: 0.7)
sigdump (0.2.4)
stringio (default: 0.0.2)
strptime (0.2.4)
strscan (default: 1.0.0)
sync (default: 0.5.0)
systemd-journal (1.3.3)
thwait (default: 0.1.0)
to_regexp (0.2.1)
tracer (default: 0.1.0)
tzinfo (2.0.2)
tzinfo-data (1.2020.1)
unf (0.1.4)
unf_ext (0.0.7.7)
webrick (default: 1.4.2)
yajl-ruby (1.4.1)
zlib (default: 1.0.0)

@andrew-pickin-epi (Author)

It seems that this CPU usage spike comes from an in_tail plugin code change.
On the Elasticsearch plugin side, the difference from v4.3.3 to v5.0.3 is:
v4.3.3...master#diff-cf75cdd038558996eefb3a1e148524c313f9094460b38022d003bf7d1878233a

Just changes to support IPv6 strings in the host address.

I suspect that this spike originates from in_tail, which is built into Fluentd core.

I tried rolling back just the ES plugin (so fluentd v1.12.3 & ES plugin 4.1.4); this also failed, as recorded here: fluent/fluentd#3382

This also leads me to suspect that it is a core issue.

I will roll forward to v1.12.3/5.0.3 and attempt to get a sigdump output.
However, I have no deterministic way of provoking the issue.

@andrew-pickin-epi (Author) commented May 25, 2021

5-node EKS development cluster with very little load, Fluentd v1.12.3 / ES plugin 5.0.3.
After running for approx. 22 hrs without incident, 2 nodes went to 100% CPU within 30 minutes of each other.

(screenshots: 2021-05-25a, 2021-05-25b)

fluentd-kqlvw:

top - 10:59:17 up 12 days, 11:29,  0 users,  load average: 2.64, 2.08, 2.25
Tasks:   5 total,   1 running,   4 sleeping,   0 stopped,   0 zombie
%Cpu(s): 77.1 us, 21.4 sy,  0.0 ni,  0.5 id,  0.8 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :   3877.4 total,    183.8 free,   3079.8 used,    613.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    855.6 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   61 root      20   0  690624 120476   9192 S  94.0   3.0  86:02.55 ruby  
    1 root      20   0    8296   1156   1036 S   0.0   0.0   0:01.77 tini 
    7 root      20   0  159128  68112   5696 S   0.0   1.7   0:20.59 ruby
  297 root      20   0   11504   4400   3800 S   0.0   0.1   0:00.00 bash 
  302 root      20   0   15236   4376   3656 R   0.0   0.1   0:00.00 top 
UID        PID  PPID  C STIME TTY          TIME CMD 
root         1     0  0 May24 ?        00:00:01 tini -- /fluentd/entrypoint.sh
root         7     1  0 May24 ?        00:00:20 ruby /fluentd/vendor/bundle/ruby/2.6.0/bin/fluentd -c /fluentd/etc/fluent.conf -p /fluentd/plugins --gemfile /fluentd/Gemfile -r /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/elasticsearch_simple_sniffer.rb
root        61     7  6 May24 ?        01:27:06 /usr/local/bin/ruby -Eascii-8bit:ascii-8bit /fluentd/vendor/bundle/ruby/2.6.0/bin/fluentd -c /fluentd/etc/fluent.conf -p /fluentd/plugins --gemfile /fluentd/Gemfile -r /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/elasticsearch_simple_sniffer.rb --under-supervisor
root       297     0  0 10:58 pts/0    00:00:00 bash
root       304   297  0 11:00 pts/0    00:00:00 ps -ef

fluentd-kqlvw-sigdump-61.log

fluentd-wrpmk:

top - 11:09:33 up 60 days, 21:10,  0 users,  load average: 1.74, 1.61, 1.65
Tasks:   5 total,   1 running,   4 sleeping,   0 stopped,   0 zombie 
%Cpu(s): 42.8 us, 24.1 sy,  0.0 ni, 33.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   3877.4 total,    187.0 free,   2946.9 used,    743.6 buff/cache  
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    908.7 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   61 root      20   0  815268 136936  20916 S  98.7   3.4  64:16.89 ruby
    1 root      20   0    8296   1448   1328 S   0.0   0.0   0:02.09 tini
    6 root      20   0  157060  70768   8324 S   0.0   1.8   0:20.91 ruby
  314 root      20   0   11504   4440   3864 S   0.0   0.1   0:00.01 bash
  319 root      20   0   15236   4304   3564 R   0.0   0.1   0:00.00 top
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 May24 ?        00:00:02 tini -- /fluentd/entrypoint.sh
root         6     1  0 May24 ?        00:00:20 ruby /fluentd/vendor/bundle/ruby/2.6.0/bin/fluentd -c /fluentd/etc/fluent.conf -p /fluentd/plugins --gemfile /fluentd/Gemfile -r /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/elasticsearch_simple_sniffer.rb
root        61     6  4 May24 ?        01:05:02 /usr/local/bin/ruby -Eascii-8bit:ascii-8bit /fluentd/vendor/bundle/ruby/2.6.0/bin/fluentd -c /fluentd/etc/fluent.conf -p /fluentd/plugins --gemfile /fluentd/Gemfile -r /fluentd/vendor/bundle/ruby/2.6.0/gems/fluent-plugin-elasticsearch-5.0.3/lib/fluent/plugin/elasticsearch_simple_sniffer.rb --under-supervisor
root       314     0  0 11:09 pts/0    00:00:00 bash
root       321   314  0 11:10 pts/0    00:00:00 ps -ef

fluentd-wrpmk-sigdump-61.log

@markiewitch
@andrew-pickin-epi I can see the same endless loop of gen_random_urandom that I had. Try running strace -c -p <fluentd PID>; you'll probably see tens of thousands of calls. My workaround, until DNS resolution in Ruby is fixed, was to set http_backend typhoeus, which uses persistent TCP connections. I've linked the culprit code fragment here.
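For anyone trying this workaround, switching the HTTP backend is a one-line change in the elasticsearch `<store>`. A sketch (the surrounding settings are elided; the typhoeus gem must also be installed in the Fluentd image):

```
<store>
  @type elasticsearch
  # ... existing host/port/buffer settings ...
  http_backend typhoeus   # persistent connections, so far fewer DNS lookups
</store>
```

The default backend (excon) sets up connections in a way that triggers the resolv lookups discussed below, which is why reusing connections sidesteps the spike.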

@ashie (Contributor) commented Jun 4, 2021

fluent/fluentd#3387 (comment) by @dzikismigol

The problem lies somewhere in /usr/lib/ruby/2.7.0/resolv.rb, pasted below. Ruby keeps generating random request IDs, probably because of a collision caused by (?) not calling free_request_id.

I found a resolv issue for it: ruby/resolv#11
It's already fixed by ruby/resolv#9.
Please try resolv 0.2.1.

I think it's triggered by excon's change: excon/excon#739
excon uses Ruby's resolv instead of libc's resolver as of v0.80.0.
And @andrew-pickin-epi's gem lists also indicate it:

after:

excon (0.81.0)

before:

excon (0.76.0)
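The failure mode described above can be illustrated without resolv itself. Below is a minimal Ruby simulation (an assumption-laden sketch, not resolv.rb's actual code) of what happens when random 16-bit DNS request IDs are marked in-use but never freed: as the in-use set fills, each allocation needs more and more random draws, which is exactly the kind of busy loop that burns CPU without logging any error.

```ruby
# Simulate allocating random 16-bit DNS request IDs without ever freeing them.
require 'set'

def allocate_request_id(in_use, counter)
  loop do
    counter[:n] += 1                 # count every random draw
    id = rand(0x10000)               # DNS request IDs are 16-bit
    return id if in_use.add?(id)     # Set#add? returns nil on a collision
  end
end

in_use  = Set.new
counter = { n: 0 }

# Early phase: the ID space is mostly free, so collisions are rare.
60_000.times { allocate_request_id(in_use, counter) }
early_draws = counter[:n]

# Late phase: ~60k of 65,536 IDs are "in use", so collisions dominate
# and each allocation needs many draws -- CPU climbs, nothing fails.
counter[:n] = 0
5_000.times { allocate_request_id(in_use, counter) }
late_draws = counter[:n]
```

Freeing IDs after each lookup (what the ruby/resolv fix restores) keeps the set small, so allocation stays cheap; that matches "CPU drops back to normal once the buffer flushes and lookups stop piling up".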

@andrew-pickin-epi (Author)

I have not seen this error since moving to 1.13.

@danielhoherd
I can report the same; fixed in 1.13.
