Failure when querying region quotas #10
Comments
I have looked at the code a bit and came to a few conclusions. First of all, I think the actual problem is the following: the gcp-quota-exporter exposes its own …, which leads me to the solution I'd like to suggest here.

Since quota information changes only very, very slowly, but retrieving it is so exceedingly unreliable, I think it is OK to cache the result and return the cached version if the retrieval fails. Perhaps then add a metric that contains the timestamp of the last retrieval, so that you can detect overly stale quota information.

https://prometheus.io/docs/instrumenting/writing_exporters/#scheduling has some best practices on this topic, but it doesn't really cover the scenario "usually fast enough for synchronously pulling the source metrics, but quite often taking too long".
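A minimal sketch of what such a cache could look like, assuming a hypothetical wrapper collector; the `quotaScraper` type, the metric name, and all other identifiers here are illustrative, not part of the exporter:

```go
package main

import (
	"log"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// quotaScraper stands in for whatever function actually queries the GCP
// Compute API for quota data; illustrative only.
type quotaScraper func() ([]prometheus.Metric, error)

// cachingCollector serves the last successfully scraped metrics whenever a
// fresh retrieval fails, plus a timestamp of the last success so that overly
// stale quota data can still be detected.
type cachingCollector struct {
	mu          sync.Mutex
	scrape      quotaScraper
	cache       []prometheus.Metric
	lastSuccess prometheus.Gauge
}

func newCachingCollector(scrape quotaScraper) *cachingCollector {
	return &cachingCollector{
		scrape: scrape,
		lastSuccess: prometheus.NewGauge(prometheus.GaugeOpts{
			Name: "gcp_quota_last_successful_scrape_timestamp_seconds",
			Help: "Unix timestamp of the last successful quota retrieval.",
		}),
	}
}

// Describe sends no descriptors, making this an "unchecked" collector.
func (c *cachingCollector) Describe(ch chan<- *prometheus.Desc) {}

func (c *cachingCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if metrics, err := c.scrape(); err != nil {
		log.Printf("quota retrieval failed, serving cached result: %v", err)
	} else {
		c.cache = metrics
		c.lastSuccess.SetToCurrentTime()
	}
	for _, m := range c.cache {
		ch <- m
	}
	ch <- c.lastSuccess
}
```

With a timestamp gauge like that, staleness can be alerted on by checking how far the last successful retrieval lies in the past.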
In our particular case, just setting a much higher timeout and higher retry count might already help, but in any case, exposing a …
Thanks @beorn7.
+1 on the caching, it should mitigate the target-flapping completely.
Beorn has configured these values, which prevent it from flapping now:
It makes sense: since we scrape 2 different APIs (project and region), we could present 2 …
I am not so sure about this. An issue on the scraper side would look exactly like an issue on the Google API side, and it could potentially last for a long time.
This seems like a much better solution to me, and we could definitely update the default values to something more reasonable. I also wonder whether our use of rehttp is actually correct to handle retries on timeout (note this was mostly a copy of the stackdriver exporter), but see https://github.com/mintel/gcp-quota-exporter/blob/master/main.go#L129, where we set the status to … More investigation needed! But thanks for the work.
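For comparison, a retry transport that explicitly treats timeouts as retryable could look roughly like this with rehttp; the retry count, delays, and client timeout are illustrative values, not the exporter's current defaults:

```go
package main

import (
	"errors"
	"net"
	"net/http"
	"time"

	"github.com/PuerkitoBio/rehttp"
)

// newRetryingClient builds an HTTP client that retries temporary errors,
// network timeouts and 5xx responses with exponential backoff.
func newRetryingClient() *http.Client {
	transport := rehttp.NewTransport(
		http.DefaultTransport,
		rehttp.RetryAll(
			rehttp.RetryMaxRetries(5),
			rehttp.RetryAny(
				rehttp.RetryTemporaryErr(),
				rehttp.RetryStatuses(500, 502, 503, 504),
				// Timeouts are not always reported as temporary errors,
				// so match them explicitly.
				rehttp.RetryIsErr(func(err error) bool {
					var netErr net.Error
					return errors.As(err, &netErr) && netErr.Timeout()
				}),
			),
		),
		rehttp.ExpJitterDelay(500*time.Millisecond, 5*time.Second),
	)
	return &http.Client{Transport: transport, Timeout: 30 * time.Second}
}
```

Whether the current wiring in main.go behaves like this on timeouts is exactly the part that needs checking.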
given that …
We need to be careful to have default values that match …
With our manually increased timeouts and scrape interval, we are faring quite well so far. I guess it's best for now to not complicate things and refrain from caching for the time being.
@primeroz As beorn mentioned, we are faring quite well, so there is no reason for me to keep this issue open any longer. WDYT?
I wanted to have a better look at the retry logic, which I don't think is working as expected. So let's keep this open and I will close it when I am back from holiday?
We monitor our Prometheus scraping and this one causes a bit of noise as the scrape fails every now and then.
The error observed shows that the API is just a bit unresponsive:
The solution with the least amount of work I can come up with: swap `MustRegister` for `Register` and don't panic if the collector fails. I think it should suffice for the purpose of collecting the quota data; we can be fairly sure that if the quota API is not available, the quotas are still there. :-) The only thing the Go docs were not clear on: if the collector fails with `Register`, does it expose the last known metric or nothing at all?

Another solution (would be more work, I use this pattern in another exporter): we could simply run a loop that gathers the metrics (no `Register`/`MustRegister`) and have `http.ListenAndServe` in a goroutine. This just keeps exposing the data as last seen instead of nothing (for the purpose of collecting metrics on quotas or events, this should suffice). It would also simplify the code a lot, as the framework for the Collector is not needed for this.

WDYT?
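A rough sketch of that second pattern, under the assumption that plain gauges are set from the API response; the `fetchAndSetQuotas` function, metric name, labels, port, and interval are all illustrative (promauto still registers the gauges themselves, but there is no custom Collector that can fail at scrape time):

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Plain gauges keep their last value between updates, so a failed API call
// simply leaves the previously collected quota data in place.
var quotaLimit = promauto.NewGaugeVec(prometheus.GaugeOpts{
	Name: "gcp_quota_limit",
	Help: "Quota limit as last seen from the GCP API.",
}, []string{"project", "region", "metric"})

// fetchAndSetQuotas is a placeholder for the code that queries the compute
// API and writes the values into the gauges above.
func fetchAndSetQuotas() error {
	// Real code would call the project and region APIs here; a fixed value
	// just illustrates how the gauge would be set.
	quotaLimit.WithLabelValues("my-project", "europe-west1", "CPUS").Set(24)
	return nil
}

func main() {
	// Expose whatever has been collected so far; no custom Collector needed.
	http.Handle("/metrics", promhttp.Handler())
	go func() {
		log.Fatal(http.ListenAndServe(":8080", nil))
	}()

	// Gather in a loop; on failure the last successful values stay exposed.
	for {
		if err := fetchAndSetQuotas(); err != nil {
			log.Printf("quota collection failed, keeping last values: %v", err)
		}
		time.Sleep(5 * time.Minute)
	}
}
```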