Prometheus WAL files are not deleted and its size is 261GB now #10272
-
Prometheus WAL files have not been deleted since Prometheus was first set up. I am using Prometheus version 2.26.0 and have read about the same issue in forums. It is said to have been fixed in version 2.11, but my WAL files are still not being deleted on version 2.26.0. In the prometheus.yml file, I have not set the --storage.tsdb.retention.time and --storage.tsdb.wal-compression parameters. Because of these huge WAL files, every restart of Prometheus takes more than 2 hours to load all the segments. Please help me find out whether I am missing something or whether this is a bug. The WAL folder is currently 261GB.

I have verified this with an expression in the Prometheus UI. Output: {instance="xxx.xx.xx:9090", job="prometheus"} 15.376041666666667

* **Prometheus version:** prometheus, version 2.26.0 (branch: HEAD, revision: 3cafc58)
* **Prometheus configuration file:**
Replies: 11 comments 4 replies
-
Can you provide the Prometheus startup logs? If possible, with --log.level=debug.
-
I have attached the latest Prometheus logs here.
-
@roidelapluie Did you get a chance to look at the logs? Please help.
-
@codesome Can you please help me with this issue? It is occurring in production, so we want to fix it as soon as possible.
-
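Log has

WAL truncation in Compact: create checkpoint: create segment reader: open segment:00002392 in dir:/data/prometheus/wal: open /data/prometheus/wal/00002392: too many open files

You might have a low limit set at some level. A quick search gives me https://www.baeldung.com/linux/error-too-many-open-files/. Can you try increasing the limit to a high value and see if that fixes it?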
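
To see which open-file limit the running Prometheus process actually has (it can differ from the `ulimit` of a login shell), one option on Linux is to read `/proc/<pid>/limits`. The following is a minimal sketch, not an official tool: the function name `parse_max_open_files` is made up for illustration, and it assumes the usual Linux layout of the "Max open files" row.

```python
def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits.

    A value reported as 'unlimited' is returned as None.
    """
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # Remaining columns are: soft limit, hard limit, units.
            fields = line[len(label):].split()
            soft, hard = fields[0], fields[1]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")


def read_limits(pid):
    """Read the limits table of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return handle.read()
```

For example, `parse_max_open_files(read_limits(pid))` with the PID of the Prometheus process shows whether a raised limit has actually reached that process.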
-
@codesome Yes, I saw this error and increased the ulimit from 1024 to 65536 for the root user, since we run Prometheus as root. But after this change, when I tried to restart Prometheus, it spent more than 7 hours loading the WAL segments and still did not start. It usually takes 2 hours to load all the WAL segments under /data/prometheus/wal and start Prometheus. Since the service did not start after the ulimit change, we reverted the ulimit to 1024 and then started the Prometheus service. So if we resolve this WAL segment truncation issue, the old WAL segments will be deleted and Prometheus will start quickly. Am I right?
-
Since you have piled up so much WAL, yes, it is going to take a ton of time. You can delete some percentage of the older files to speed this up. If you are lucky there will be no data loss, but expect some data loss nevertheless. I would not recommend lowering the ulimit again, because from the looks of it this all started with the low ulimit.
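
As a rough illustration of picking "some % of older files", the sketch below only lists the oldest numbered segments in a WAL directory and does not delete or move anything. It is an assumption-laden example, not a Prometheus tool: the helper names are hypothetical, it treats files with all-digit names (such as 00002392) as segments, and it ignores everything else in the directory. Prometheus should be stopped before any of the listed files are actually touched.

```python
import os


def list_wal_segments(wal_dir):
    """Return (name, size_in_bytes) for numbered segment files, oldest first."""
    segments = []
    for entry in os.scandir(wal_dir):
        # Only plain files with all-digit names; directories are skipped.
        if entry.is_file() and entry.name.isdigit():
            segments.append((entry.name, entry.stat().st_size))
    segments.sort(key=lambda item: int(item[0]))
    return segments


def oldest_segments(wal_dir, fraction):
    """Return the oldest `fraction` (0.0 to 1.0) of segments as a dry-run list."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    segments = list_wal_segments(wal_dir)
    return segments[: int(len(segments) * fraction)]
```

Calling `oldest_segments("/data/prometheus/wal", 0.5)` would return the older half of the segments with their sizes, which can be reviewed before deciding what to move to a backup location.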
-
@codesome Sorry for the delayed response. I set LimitNOFILE=65536 in the prometheus.service file and moved all WAL files to a backup folder. Then I started Prometheus, and it is now running fine. Prometheus now picks up the 65536 limit, and WAL truncation is happening properly.
-
I am writing in Portuguese because I am from Brazil. I am having a problem: is there a way to reduce the WAL size every hour, automatically, with a flag?