Got "buffer flush took longer time than slow_flush_log_threshold" error #805
This error indicates that buffer operations are getting jammed.
These warnings are caused by high CPU usage on the Fluentd side and/or a lack of ingestion capacity on the Elasticsearch cluster. When I/O load on the Fluentd side is high, buffer operations run frequently, which raises the probability of exceeding the available buffer-operation capacity; so a high ingestion rate can trigger the slow_flush_log_threshold warning. Likewise, insufficient Elasticsearch ingestion capacity causes unprocessed buffers to pile up, and those piled-up buffers also lead to this error.
Thank you for your prompt reply!
Hi @cosmo0920, [1] https://docs.fluentd.org/configuration/buffer-section#buffering-parameters
Sorry for the late response.
The Elasticsearch bulk (write) queue is capped at 200 by default. Ref: https://www.elastic.co/blog/why-am-i-seeing-bulk-rejections-in-my-elasticsearch-cluster
Example, using cURL in my development environment:
$ curl localhost:9200/_nodes/thread_pool | jq .
{
"_nodes": {
"total": 1,
"successful": 1,
"failed": 0
},
"cluster_name": "elasticsearch_cosmo",
"nodes": {
"Sf-WnAEOQN-AbE8lj9YirA": {
"name": "Hiroshi-no-MacBook-Pro.local",
"transport_address": "127.0.0.1:9300",
"host": "127.0.0.1",
"ip": "127.0.0.1",
"version": "7.8.1",
"build_flavor": "oss",
"build_type": "tar",
"build_hash": "b5ca9c58fb664ca8bf9e4057fc229b3396bf3a89",
"roles": [
"data",
"ingest",
"master",
"remote_cluster_client"
],
"thread_pool": {
"force_merge": {
"type": "fixed",
"size": 1,
"queue_size": -1
},
"fetch_shard_started": {
"type": "scaling",
"core": 1,
"max": 8,
"keep_alive": "5m",
"queue_size": -1
},
"listener": {
"type": "fixed",
"size": 2,
"queue_size": -1
},
"refresh": {
"type": "scaling",
"core": 1,
"max": 2,
"keep_alive": "5m",
"queue_size": -1
},
"generic": {
"type": "scaling",
"core": 4,
"max": 128,
"keep_alive": "30s",
"queue_size": -1
},
"warmer": {
"type": "scaling",
"core": 1,
"max": 2,
"keep_alive": "5m",
"queue_size": -1
},
"search": {
"type": "fixed_auto_queue_size",
"size": 7,
"queue_size": 1000
},
"flush": {
"type": "scaling",
"core": 1,
"max": 2,
"keep_alive": "5m",
"queue_size": -1
},
"fetch_shard_store": {
"type": "scaling",
"core": 1,
"max": 8,
"keep_alive": "5m",
"queue_size": -1
},
"management": {
"type": "scaling",
"core": 1,
"max": 5,
"keep_alive": "5m",
"queue_size": -1
},
"get": {
"type": "fixed",
"size": 4,
"queue_size": 1000
},
"analyze": {
"type": "fixed",
"size": 1,
"queue_size": 16
},
"write": {
"type": "fixed",
"size": 4,
"queue_size": 200
},
"snapshot": {
"type": "scaling",
"core": 1,
"max": 2,
"keep_alive": "5m",
"queue_size": -1
},
"search_throttled": {
"type": "fixed_auto_queue_size",
"size": 1,
"queue_size": 100
}
}
}
}
}
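If you want to confirm whether the write (bulk) queue is actually rejecting requests, the _cat API gives a quick per-node view. This is a generic check, not taken from the original thread, and the host/port are placeholders:

```
$ curl "localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed"
```

A steadily increasing rejected count is the signal that bulk requests are being throttled on the Elasticsearch side.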
Hi @cosmo0920, thank you for your suggestion. I found that the thread_pool queue_size is 1000 now.
Also, according to @repeatedly's comment in https://groups.google.com/g/fluentd/c/3xrieNheguE, I have optimized my fluentd config (increased flush_thread_count to 16, increased timekey to 9s, used record_modifier instead of record_transformer, and removed the tag chunk_key).
In addition, to find the root cause of the slow_flush warning, I used PromQL to make the Prometheus data meaningful, following [1]. I can see that the number of incoming records is not very high (~2k/s). However, the "maximum buffer length in last 1min" always increases and reaches the max buffer length (I set flush_thread_count = 16). Is it normal behavior for the buffer length to keep increasing? Could anyone help check whether any data in the graphs below are abnormal? I have also consulted Elasticsearch support; they checked the health status of the Elasticsearch cluster and everything seems fine. [1] https://docs.fluentd.org/monitoring-fluentd/monitoring-prometheus Please help. Thank you very much!
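For reference, PromQL along these lines is what such charts are usually built from; the metric and label names come from fluent-plugin-prometheus and can vary between plugin versions, so treat them as assumptions to verify against your own exporter:

```
# Incoming record rate per output plugin (records/s)
sum(rate(fluentd_output_status_emit_records[1m])) by (plugin_id)

# Buffer queue length per output plugin (steady growth means flushes are not keeping up)
fluentd_output_status_buffer_queue_length

# Retry rate per output plugin
sum(rate(fluentd_output_status_retry_count[5m])) by (plugin_id)
```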
I'm not really sure how to fix it.
Hi @cosmo0920,
It seems the request time of the _bulk request caused the slow_flush_log warning. Is this log meaningful to you?
I am facing the same issue
This is my config right now.
@chikinchoi I am facing the same issue as yours, I would love to connect with you and try to understand more and fix this issue.
Hi, please try reducing the bulk_message_request_threshold size (from the default 20 MB to 1-2 MB). It seems to make delivery more stable.
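For anyone trying this, a minimal sketch of where that parameter lives in the match section; the tag, host, and buffer values below are placeholders, not recommendations:

```
<match app.**>
  @type elasticsearch
  host es.example.internal   # placeholder endpoint
  port 9200
  # Split bulk requests larger than this (default 20MB); -1 disables the size check.
  bulk_message_request_threshold 1MB
  <buffer>
    @type file
    path /var/log/fluentd/buffer/es   # placeholder path
    chunk_limit_size 8MB
    flush_interval 5s
    flush_thread_count 4
  </buffer>
</match>
```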
@cosmo0920 @chikinchoi how were you able to resolve this? Can you share your resolution?
@chikinchoi @cosmo0920 Hey guys, I am facing almost exactly the same issue and can't fix it. I see that this thread has been closed for almost half a year; were you able to fix the issue, and if so, how did you manage to do that? I run Fluentd on Kubernetes with 90% of the logs generated by one of the nodes. I have tried a lot of configurations with small or large flush_thread_count, different buffer sizes, and so on, but I couldn't find a configuration that works. Elasticsearch seems totally fine, but I constantly get the slow flush error mentioned here, and my buffer constantly overflows (the strange thing is that it does not overflow during moments of high log load). Any ideas or help will be much appreciated. Current aggregator config:
@adrdimitrov What version do you use? |
Seems to be related to #885; I have mentioned my versions there.
@yashumitsu, I have updated the fluentd-elasticsearch plugin from version 4.2.2 to 4.3.3 but can still see the warning: buffer flush took longer time than slow_flush_log_threshold: elapsed_time=939.1508707999999 slow_flush_log_threshold=20.0 plugin_id="firelens_es".
@yashumitsu I am using fluentd version 1.11.4 and fluentd-elasticsearch plugin version 5.0.0 and still experiencing this issue. |
I mentioned it here: I was thinking the bulk_message_request_threshold default value was changed in 4.3.1, but unfortunately (maybe I'm missing something) it seems to remain the same. Can you try with an explicit setting? This is our conf:
@yashumitsu Thank you for the response. You suggest setting bulk_message_request_threshold to -1? Can you please explain what this does?
@g3kr, I have upgraded the fluent-plugin-elasticsearch plugin to version 5.0.5 and added bulk_message_request_threshold -1 to the config, but I can still see the slow_flush warning message. Below is the match section config; is anything wrong?
@chikinchoi It's difficult for me to reproduce the issue, hence I am not able to test the new parameter. Under what circumstances do you see this error/warning?
I'm using 5.0.5 as well. Our team encountered slow buffer flush logs quite a few times; the fix wasn't always the same, but it was usually related to tuning buffer parameters and the number of tasks/workers accepting data. In general, turning on debug logging and watching the metrics helped narrow things down. Here are some of the things that I would try:
Thanks for your suggestions here. I have a couple of follow-up questions: If we reduce the chunk_limit_size to 10 MB, how does the flush to ES happen? I am assuming there will be too many chunks created that need to be flushed to ES. How do we determine the optimal value for slow_flush_log_threshold? Below is my buffer configuration:
What issues do you see with this?
As I understand it, the flush to ES still happens on an interval (based on your config), but it tries to send a bulk request to ES for each chunk. So, yes, if you have too much data incoming, or an ES cluster that cannot support that many bulk requests, you will start seeing errors/retries. If you don't have enough flush threads to actually output data fast enough, you might just see the buffers continually grow. Some "napkin math" should help you get the right configuration to make sure Fluentd can flush as fast as data is coming in (factor in that buffers and flush threads are per worker). This, or debug metrics, should also help you determine the max buffer size used for your configured flush interval.
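As a concrete illustration of that napkin math (the record size and thread count here are assumptions; only the ~2k records/s figure comes from earlier in this thread): at 2,000 records/s averaging 1 KB each, ingest is roughly 2 MB/s, so a 10 MB chunk fills about every 5 seconds per worker. With flush_thread_count 8, each thread only has to complete one bulk request roughly every 40 seconds to keep pace; if the buffer still grows steadily at that rate, the bottleneck is slow _bulk responses rather than too few flush threads.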
According to the docs, this is just a logging threshold: it controls when the slow-flush warning is emitted and does not change flush behavior.
Keep in mind, I don't know your ingest data rate or ES cluster size/details, but a few things jump out at me. QQ: I see you're using compression?
@renegaderyu Thank you so much for the detailed response. This is very helpful, and I will look into how to tweak these parameters. To your question on compression: we gzip chunks to save some space in the buffers, but you are right that when the data gets written to ES it is decompressed. Another thing I am trying is to include the monitor_agent input plugin to export Fluentd metrics via a REST API.
We run our service in AWS Fargate backed by an NLB. I was hoping to reach the load balancer endpoint at port 24220 to get some metrics, but that doesn't seem to work. Do you have some idea of what I might be missing? Again, I appreciate all your help in responding to these questions.
Sorry, I can't really say what you're missing. I can say that we run something very similar. If you're using multiple tasks, or autoscaling, you won't be able to associate the data from monitor_agent with the task behind the load balancer. We just opted to emit the monitor_agent events into the log stream and route them to stdout. Something like this:
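The original config was not preserved in this thread; a minimal sketch of that kind of setup might look like the following, where the tag, interval, and extra field are illustrative assumptions:

```
# Emit Fluentd's internal plugin metrics as ordinary events every 60s.
<source>
  @type monitor_agent
  tag fluent.monitor      # assumed tag
  emit_interval 60
</source>

# Optional: enrich the metric events, e.g. with a config/image version.
<filter fluent.monitor>
  @type record_transformer
  <record>
    config_version "v1"   # illustrative field
  </record>
</filter>

# Route the metric events to stdout so they land in the container logs.
<match fluent.monitor>
  @type stdout
</match>
```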
@renegaderyu I will try this configuration and route the logs to stdout. Thanks again.
I did not quite understand this. How do you turn this off with LOG_LEVEL? If I may also ask, what is the purpose of adding a record transformer here?
@g3kr LOG_LEVEL is sourced from an environment var. If you're using Fargate/ECS, environment vars are defined in the task definition for the service, and changing them will not affect running fluentd processes: you have to kill the task and let a new one spin up, trigger a new deployment for the service, or do something else to make fluentd pick up the new value for the env var. The record transformer is just there to add extra fields/data to the metrics. For instance, you can add the container/config version to the monitor_agent events so it's easy to compare differences if you were to use a rolling or canary type of deployment. I just provided that as an example; it can be removed if you don't need it.
@renegaderyu Makes sense. Thank you so much!
@renegaderyu After implementing the monitor_agent input plugin, this is the sample event I see for my elasticsearch output plugin:
{
I wasn't sure if this looked right from a metrics perspective. I couldn't find documentation on how to interpret these numbers. Can you please enlighten me on this? Thanks
Hi,
They don't have to match each other. Write operations are sometimes batched, so the counters do not move in lockstep.
@cosmo0920 Thanks for clarifying this. Much appreciated. Based on your definitions for these metrics, I am wondering how one would make use of them in times of anomalies. As per my understanding, unless your
Please let me know if I am wrong.
Yes, your understanding is almost correct.
For the below config
We observed something strange though I have
Any idea/thoughts on what might be going on here? When I filtered for chunk id "5cb089f6964324f13ccc8d321dfb6c83", these are the number of retries, while I would have expected it to retry only once and drop the chunk. Attached pic.
@g3kr I've never seen that, but the troubleshooting doc for this plugin, https://github.com/uken/fluent-plugin-elasticsearch/blob/master/README.Troubleshooting.md, mentions turning on transporter logs to get more info, and potentially setting the ssl_version.
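For reference, the transporter logging mentioned there is switched on with plugin options along these lines (a sketch only; the rest of the output config is omitted):

```
<match app.**>
  @type elasticsearch
  @log_level debug           # transporter traces are emitted at debug level
  with_transporter_log true  # log the underlying elasticsearch transport requests/responses
  # ssl_version TLSv1_2      # only if the troubleshooting doc's TLS advice applies to you
  # ... existing host/buffer settings ...
</match>
```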
@renegaderyu Thanks for responding. This is my entire output config.
@renegaderyu It would be helpful if you could answer one more question. Is there a way to look at the contents of the buffer log file to view the chunks it will process? I tried to open one and it appears encrypted. Please let me know. Thank you.
@g3kr I just meant, from the info provided, that I'm not convinced the
I've peeked at some buffer chunks before and IIRC they are not encrypted, but they do seem to use characters/bytes that do not print well as field/log separators. You could probably look at the fluentd source to figure out the format and write a quick script to help parse them. With that said, I suspect the usual fluentd debug logs, transporter logs, and emissions from monitor_agent should be sufficient to help you track down any problems.
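For example, a quick hypothetical script along those lines, assuming the chunk is an uncompressed MessagePack event stream (which is what the elasticsearch output iterates over); gzip-compressed buffers would need decompressing first:

```python
# dump_chunk.py -- print events from a Fluentd file-buffer chunk
import sys
import msgpack

with open(sys.argv[1], "rb") as f:
    # Each entry should be a msgpack-encoded [time, record] pair.
    for event in msgpack.Unpacker(f, raw=False):
        print(event)
```

Usage would be something like `python dump_chunk.py <path to chunk file>`.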
@cosmo0920 @renegaderyu |
Is there a particular reason for this? I came across this blog post when I was tuning and it was tremendously helpful. My experience has been pretty much the same: it seems the best throughput is achieved by having chunks appropriately, and consistently, sized and flushed. This is why I previously mentioned removing
I think a single record being larger than
I saw something similar while using file-based buffers (it's why I'm now biased against them). After some unknown threshold, the buffers would just continue to grow. I knew, based on back-of-the-envelope math, that ES could handle the ingestion rate and that fluentd should have been able to flush faster than the incoming data rate. The tasks didn't appear to be maxing out CPU, IOPS, I/O, or exhausting inodes on the SSDs. They were in an ASG and load-balancing seemed fine. After changing to memory buffers I never saw this again.
I can't say.
I updated the chunk_limit_size to 20MB and everything seems great! |
@chikinchoi, it's not a false alarm; it just needs to be tuned for your systems. As I understand Fluentd to work, each flush attempt tries to flush the entire buffer, all chunks. In your case, if fluentd seems fine and ES seems fine, I'd decrease the
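To make the knobs discussed throughout this thread concrete, a hedged buffer sketch; the values are illustrative starting points, not recommendations, and need to be sized against your own ingest rate and worker count:

```
<match app.**>
  @type elasticsearch
  # Log-only threshold: warn when a single chunk takes longer than this to flush.
  slow_flush_log_threshold 30s
  <buffer>
    @type memory
    chunk_limit_size 10MB    # smaller chunks => smaller, faster _bulk requests
    total_limit_size 512MB
    flush_interval 5s
    flush_thread_count 8     # threads (and buffers) are per worker
    retry_max_times 5
  </buffer>
</match>
```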
Problem
Hi Team,
I got a lot of
buffer flush took longer time than slow_flush_log_threshold: elapsed_time=60.16122445899964 slow_flush_log_threshold=20.0 plugin_id="firelens_es"
errors in my fluentd, which is connecting to Elasticsearch. I saw a document mentioning that the reason for this error is that Elasticsearch is exhausted [1]. However, I checked and all of the Elasticsearch clusters' CPU usage is low and healthy.
I asked in the Fluentd group and @repeatedly said that I should check the Fluentd CPU usage. I found that the Fluentd service CPU usage is very high (almost 100%) when the slow_flush error appears. I tried increasing the Fluentd CPU size from 1024 to 2048; although the number of slow_flush errors decreased, I can still see it sometimes.
In conclusion, I think the slow_flush error causes the Fluentd CPU usage to increase. Therefore, I would like to know the cause of this error and how to fix it. Also, I don't understand why I should check the Fluentd CPU usage instead of the Elasticsearch CPU usage.
Please advise. Thank you very much!!
Steps to replicate
Cannot reproduce, because I don't know the cause of this error.
Below is my Fluentd config:
Expected Behavior or What you need to ask
This error should not occur.
Using Fluentd and ES plugin versions
fluentd running in AWS ECS Fargate service
fluentd version: v1.11.1
fluent-plugin-elasticsearch version 4.1.0