
Return previous memstats if Go collector needs longer than 1s #568

Merged
3 commits merged on May 6, 2019

Conversation

beorn7 (Member) commented May 3, 2019

If memstats cannot be collected within 1s, they will simply be
skipped.

Also, document the issue comprehensively.

Sooooo, this is the simplest possible way to address #551. Or let's say it is the second simplest way.

The simplest way would be to not do anything about it (besides describing the problem in the doc comment) and wait/hope for Go1.13 to finally fix golang/go#19812. Until that has happened, users affected by #551 will have occasional scrape timeouts. That's at least clean, but since the Go collector is there by default, most of the affected users will probably not have read the warning in the doc comment and will be as puzzled by the timeouts as @bboreham was.

The idea in this PR is to simply skip the memstats metrics in the (rare) cases where the collection coincides with a GC cycle that takes more than about a second to complete. The obvious drawback is that in those cases, the time series will have gaps. (They will even be marked as stale, if I understood staleness handling correctly. But I might be wrong. @brian-brazil will know for sure.)
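
In code, this skip-on-timeout idea boils down to racing runtime.ReadMemStats against a timer. A minimal sketch under that assumption; readMemStatsWithTimeout is a hypothetical helper and not the code in this PR:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// readMemStatsWithTimeout is a hypothetical helper: it reads memstats in a
// separate goroutine and gives up after the given timeout, returning nil so
// that the caller can skip the memstats metrics for this scrape.
func readMemStatsWithTimeout(timeout time.Duration) *runtime.MemStats {
	done := make(chan *runtime.MemStats, 1)
	go func() {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms) // may block while a GC is in progress
		done <- &ms
	}()
	select {
	case ms := <-done:
		return ms
	case <-time.After(timeout):
		return nil
	}
}

func main() {
	if ms := readMemStatsWithTimeout(time.Second); ms != nil {
		fmt.Println("heap alloc:", ms.Alloc)
	} else {
		fmt.Println("memstats skipped for this scrape")
	}
}
```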

Other ideas that I contemplated:

  • Add GoCollectorOpts, similar to ProcessCollectorOpts, with options to switch off the memstats metrics altogether and to set a timeout, consciously opting into the "metric skip" behavior above (a hypothetical sketch follows after this list). A sane default needs to be discussed (could be the status quo). It's a nicely explicit solution, but it will (mildly) break all code using NewGoCollector.

  • In case of a timeout, return the memstats from the previous collection cycle. Concern here is that the previous collection cycle might be very long ago. We could safeguard by only returning previous values that are at most 1m old or something. But that's still quite a contraption. Seems attractive at first, but I think that's the most problematic solution after all.
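
For the first alternative, here is a purely hypothetical shape of such an options struct. Neither GoCollectorOpts nor these field names exist in client_golang; the idea was not pursued, and the sketch is only loosely modeled on the existing ProcessCollectorOpts pattern:

```go
package main

import (
	"fmt"
	"time"
)

// GoCollectorOpts is NOT a real client_golang type; it only illustrates what
// an explicit opt-in could have looked like. All field names are invented
// for this sketch.
type GoCollectorOpts struct {
	// DisableMemStats would switch off the memstats-based metrics entirely.
	DisableMemStats bool
	// MemStatsTimeout, if non-zero, would opt into skipping the memstats
	// metrics whenever reading them takes longer than this duration.
	MemStatsTimeout time.Duration
}

func main() {
	opts := GoCollectorOpts{MemStatsTimeout: time.Second}
	fmt.Printf("%+v\n", opts)
}
```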

I'm really not sure what the least evil is here. @bboreham @brian-brazil I'd be grateful for your input.

@beorn7 beorn7 requested a review from brian-brazil May 3, 2019 23:14
brian-brazil (Contributor) commented:

They will even be marked as stale, if I understood staleness handling correctly.

That's correct.

In case of a timeout, return the memstats from the previous collection cycle. Concern here is that the previous collection cycle might be very long ago.

We'd have the right stats at the next scrape, though, so this doesn't seem like a major issue to me. We wouldn't have updated stats without waiting until the GC completes anyway.

bboreham (Member) commented May 4, 2019

Couple of thoughts:

  • I was seeing >10s cycles several times an hour, so 1s doesn’t seem that unlikely.
  • Changes in the particular software have dropped the times, so I’m not as affected right now, personally.
  • Go is hard-coded to run GC every 2 minutes minimum.

I think I lean more towards returning the last data we have, maybe collected on a background goroutine. But I haven't thought about how hard that is.

beorn7 (Member, Author) commented May 4, 2019

In case of a timeout, return the memstats from the previous collection cycle. Concern here is that the previous collection cycle might be very long ago.

We'd have the right stats at the next scrape, though, so this doesn't seem like a major issue to me. We wouldn't have updated stats without waiting until the GC completes anyway.

That's right, and I think in scenarios with regular scrapes, it will work just fine. I'm more concerned about less conventional scenarios, e.g. a developer runs a test binary that isn't regularly scraped by Prometheus and manually looks at the metrics now and then, like every few hours. This might yield results that are hours old without any warning.

But as you both seem to like the idea of returning the last data, let's think a bit more about it.

I wouldn't like to proactively poll memstats in a goroutine, as I want an unscraped binary not to do any significant work merely because it is instrumented.

I do think we should have some sanity check on the age of the previous metrics, perhaps 5m, as that's the lookback delta of Prometheus; if the previous collection is longer ago than that, it's probably not a setup with regular scrapes.

Thus, the proposed algorithm would be (a rough sketch in code follows below):

  • Start runtime.ReadMemStats in a goroutine. No matter what happens below, this goroutine will save the result with the current timestamp.
    • If the goroutine finishes within 1s, collect metrics as usual.
    • If not, check if there are saved results from a previous run that are less than 5m old.
      • If yes, return those.
      • If not, wait for the goroutine to complete, even if it takes very long.

This will result in the following properties:

  • Returned results will never be older than 5m.
  • If there are fresh enough results from previous runs, the collection will not last for longer than ~1s (which I still think is a fair requirement).
  • If there are no fresh enough previous results available, the collection will revert to the old behavior, which will in extreme cases run into the scrape timeout of the Prometheus server. But it will be a one-time thing because the next scrape will have fresh enough previous results available.
  • Under no circumstances will it return incomplete metrics.
  • With the expected fix in Go1.13, the code will automatically do the right thing as memstats collection will be faster than 1s.
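
A rough sketch of this algorithm, assuming a hypothetical cache type; the code actually merged in this PR is organized differently, but the control flow is meant to be the same:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// memStatsCache is a hypothetical illustration of the proposal: it remembers
// the last successfully read memstats together with their timestamp.
type memStatsCache struct {
	mtx  sync.Mutex
	ms   runtime.MemStats
	last time.Time // zero value means "never collected"
}

// read starts runtime.ReadMemStats in a goroutine that always saves its
// result. If the goroutine finishes within timeout, the fresh stats are
// returned. Otherwise, a previously saved result younger than maxAge is
// returned, and only as a last resort does read block until the goroutine
// completes, however long that takes.
func (c *memStatsCache) read(timeout, maxAge time.Duration) runtime.MemStats {
	done := make(chan runtime.MemStats, 1)
	go func() {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms) // may block while a GC is in progress
		c.mtx.Lock()
		c.ms, c.last = ms, time.Now() // save the result no matter what
		c.mtx.Unlock()
		done <- ms
	}()

	select {
	case ms := <-done:
		return ms // finished within the timeout
	case <-time.After(timeout):
	}

	c.mtx.Lock()
	prev, last := c.ms, c.last
	c.mtx.Unlock()
	if !last.IsZero() && time.Since(last) < maxAge {
		return prev // previous result is fresh enough (never older than maxAge)
	}
	return <-done // no usable previous result: wait for the slow read
}

func main() {
	var c memStatsCache
	ms := c.read(time.Second, 5*time.Minute)
	fmt.Println("heap alloc:", ms.Alloc)
}
```

Note that the reading goroutine always saves its result, so even a scrape that gives up and falls back to older data still leaves a fresh result behind for the next scrape to use.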

Let me know what you think. If you like it, I'll implement it.

brian-brazil (Contributor) commented:

That sounds reasonable to me, presuming you're okay with it.

beorn7 (Member, Author) commented May 4, 2019

I'll give it a spin.

@beorn7 beorn7 changed the title Limit collection time of the Go collector to 1s Return previous memstats if Go collector needs longer than 1s May 5, 2019
beorn7 (Member, Author) commented May 5, 2019

Done. Please have another look, Bryan and Brian.

brian-brazil (Contributor) commented:

👍

Might need a make format.

beorn7 (Member, Author) commented May 6, 2019

Might need a make format.

What makes you think so?

beorn7 (Member, Author) commented May 6, 2019

(And wouldn't the CI catch that anyway?)

brian-brazil (Contributor) commented:

What makes you think so?

Some of the comments aren't lined up as I would expect.

(And wouldn't the CI catch that anyway?)

I had looked for that, but it looks like I missed that it was checked.

beorn7 added 3 commits May 6, 2019 23:28
Signed-off-by: beorn7 <bjoern@rabenste.in>
In this simple case, it's the fastest and easiest.

Signed-off-by: beorn7 <bjoern@rabenste.in>
tl;dr: Return previous memstats if reading new ones takes longer than
1s.

See the doc comment of NewGoCollector for details.

Signed-off-by: beorn7 <bjoern@rabenste.in>
beorn7 (Member, Author) commented May 6, 2019

I squashed the commits into a meaningful sequence. Will merge on green.

Successfully merging this pull request may close these issues:

runtime: cannot ReadMemStats during GC