Thanos receiver holds on to high memory utilisation for more than 16 hours after stress tests #7165
-
Are you sure that the blocks have been deleted? Maybe you can profile your memory usage through pprof and upload it to pprof.me?
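For reference, a minimal sketch of how to grab a heap profile from a receive pod (assuming the default Thanos HTTP port 10902; the pod name is taken from the thread above):

```sh
# Forward the receive pod's HTTP port (10902 is the Thanos default) to localhost.
oc port-forward pod/obs-thanos-receive-2 10902:10902 &

# Capture the in-use heap profile; the file can be uploaded to pprof.me
# or opened locally in the interactive pprof web UI.
curl -s http://localhost:10902/debug/pprof/heap > heap.pb.gz
go tool pprof -http=:8080 heap.pb.gz
```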
-
Not sure if I ran pprof the right way; here are the steps.
Please find the attached profile report for in-use memory. Pod utilisation from `oc adm top pod obs-thanos-receive-2`:

```
NAME                   CPU(cores)   MEMORY(bytes)
obs-thanos-receive-2   0m           1225Mi
```

Looks like pod memory utilisation is much higher than the in-use allocation. Please help to have a look; I have a feeling this might be related to the image I am using or to the configuration.
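One way to narrow down the gap between the Go heap and the pod's RSS is to compare the runtime's own memory stats with the container figure; a sketch, again assuming the default metrics endpoint on port 10902:

```sh
# Heap actually in use vs. memory obtained from the OS but not yet returned;
# a large gap between heap_idle and heap_released means the Go runtime is
# holding freed memory that the kernel still counts against the pod.
oc port-forward pod/obs-thanos-receive-2 10902:10902 &
curl -s http://localhost:10902/metrics \
  | grep -E 'go_memstats_(heap_inuse|heap_idle|heap_released|sys)_bytes'
```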
-
Hi @fpetkovski, I saw the great PR you made for TSDB pruning, #4329. I have also applied the configuration for the documented behaviour:

> A Receiver will automatically decommission a tenant once new samples have not been seen for longer than the --tsdb.retention period configured for the Receiver. The tenant decommission process includes flushing all in-memory samples for that tenant to disk, sending all unsent blocks to S3, and removing the tenant TSDB from the filesystem. If a tenant receives new samples after being decommissioned, a new TSDB will be created for the tenant.

However, it seems that our receivers don't release any memory even several days after writes stop. Could you please help here?
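To double-check whether decommissioning actually removed the tenant TSDB, one can inspect the receive data directory; a sketch, assuming the data path is /var/thanos/receive (an assumption; adjust to whatever --tsdb.path is set to in your deployment):

```sh
# Each tenant has its own TSDB directory under the receive data path
# (/var/thanos/receive is assumed here); after decommissioning, the
# tenant's directory should disappear.
oc exec obs-thanos-receive-2 -- ls -la /var/thanos/receive
```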
-
Hi folks, we have memory issues with Thanos Receive horizontal autoscaling: the pods don't actually scale down even 16 hours after load testing. We have an HPA with min=2 and max=9. During load testing, the receive cluster scales up to 9 pods. As the image below implies, after the initial peak from load testing around 17:00, the scaled-up pods never get scaled down. Instead, ALL pods' memory usage remains flat afterwards and never drops; the initial 2 pods remain even higher, at around 890Mi, against a memory limit of 2.5Gi! This has made the whole setup unusable, since every time we run load testing we have to reinstall the whole Thanos stack to avoid pod crashes. (We set each memory limit to 2.5Gi because we'd like to compare how well Thanos autoscales, and we try to keep the same configs as our initial Prometheus configuration.)
We are using the Bitnami Helm chart for Thanos on an OpenShift cluster, and the image is thanos:0.33.0-debian-11-r1. The receive configuration

```
--tsdb.retention=6h
```

was set up to keep samples for only 6 hours; samples get uploaded to an S3 bucket (MinIO). We have the default hashring with 1 tenant only, sketched below.
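For completeness, a sketch of what our single-tenant default hashring file amounts to (the service names and ports are illustrative placeholders, not our exact values):

```sh
# Default hashring with every receive replica as an endpoint; with no
# "tenants" list, the ring matches all tenants (we only have one).
cat <<'EOF' > hashrings.json
[
  {
    "hashring": "default",
    "endpoints": [
      "obs-thanos-receive-0.obs-thanos-receive-headless:10901",
      "obs-thanos-receive-1.obs-thanos-receive-headless:10901"
    ]
  }
]
EOF
```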
The HPA setup:
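(The equivalent of our HPA as a sketch; the resource name and the CPU target are illustrative placeholders:)

```sh
# HPA with min=2 / max=9 on the receive StatefulSet; the 80% CPU
# target stands in for whatever the chart actually configures.
oc autoscale statefulset obs-thanos-receive --min=2 --max=9 --cpu-percent=80
```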
We are not sure which configs we are missing such that Thanos Receive holds on to high memory for a long time (16 hours) despite

```
--tsdb.retention=6h
```

In my understanding, incoming data should be written to disk every 2 hours, uploaded to object storage periodically, and removed from local disk after 6 hours, so I see no reason for the receiver to keep holding data in memory. Please correct me if I am wrong. Thanks in advance.
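To verify that the 2h blocks are in fact being cut, shipped, and pruned, one can compare the bucket contents with what is still on local disk; a sketch, assuming an objstore config file for the MinIO bucket and the default tenant ID default-tenant:

```sh
# Blocks the receiver has uploaded to object storage (bucket.yml is a
# placeholder for your MinIO objstore config).
thanos tools bucket ls --objstore.config-file=bucket.yml

# Blocks still held locally; the data path here is an assumption and
# should match whatever --tsdb.path points at.
oc exec obs-thanos-receive-2 -- ls /var/thanos/receive/default-tenant
```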