
Metrics SDK: define standard circuit breakers for uncollected, infinite cardinality recordings #1891

Closed
jkwatson opened this issue Aug 26, 2021 · 12 comments · Fixed by #2960
Labels
area:sdk (Related to the SDK), release:allowed-for-ga (Editorial changes that can still be added before GA since they don't require action by SIGs), spec:metrics (Related to the specification/metrics directory), triaged-accepted (The issue is triaged and accepted by the OTel community, one can proceed with creating a PR proposal)

Comments

@jkwatson
Contributor

It is possible (perhaps even likely) that instrumentation will generate metrics with cardinality explosions. There may be backends that can handle this, so it's not necessarily an error in the instrumentation. However, it will often be a mistake in the instrumentation, and the SDK needs to include some sort of mechanism so that the application will not die with a memory leak in this case. This could especially happen with a pull-based exporter if nothing is pulling the data from it, for example.

The SDK specification should define the standard strategy for this circuit breaker. There are several possible approaches, and it would be good if all languages implemented the same default strategy.

Some options:
a) Count each instrument's cardinality and start dropping recordings once it exceeds a limit (see the sketch after this list).
b) Watch for specific dimensions that are causing the explosion and start automatically dropping those dimensions.
c) Something more clever.
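
A minimal Go sketch of option (a), purely for illustration (this is not any SDK's actual implementation; `maxCardinality`, the attribute-set key encoding, and the storage shape are invented):

```go
package sketch

import "log"

// instrumentStorage holds one aggregated value per attribute set and refuses
// to grow past maxCardinality, dropping further recordings instead.
type instrumentStorage struct {
	maxCardinality int
	values         map[string]float64 // keyed by an encoded attribute set
	warned         bool
}

func newInstrumentStorage(limit int) *instrumentStorage {
	return &instrumentStorage{maxCardinality: limit, values: make(map[string]float64)}
}

func (s *instrumentStorage) record(attrSetKey string, value float64) {
	if _, ok := s.values[attrSetKey]; !ok && len(s.values) >= s.maxCardinality {
		if !s.warned {
			log.Println("cardinality limit reached; dropping recordings for new attribute sets")
			s.warned = true
		}
		return // circuit breaker: drop rather than grow without bound
	}
	s.values[attrSetKey] += value
}
```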

@jkwatson jkwatson added the spec:metrics Related to the specification/metrics directory label Aug 26, 2021
@reyang reyang added the area:sdk Related to the SDK label Sep 8, 2021
@reyang reyang added this to the Metrics API/SDK Feature Freeze milestone Sep 8, 2021
@reyang reyang self-assigned this Sep 8, 2021
@reyang
Member

reyang commented Sep 30, 2021

I'm trying to address these topics in #1966, leaving them as guidance/suggestions rather than spec requirements.

@reyang reyang removed this from the Metrics API/SDK Feature Freeze milestone Oct 5, 2021
@reyang reyang added the release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs label Oct 5, 2021
@jack-berg
Member

An initial set of cardinality defenses was recently implemented in Java.

In summary, each view has a maximum number of metric streams. Currently, this max is hardcoded to 2000, which was chosen rather arbitrarily. If it is exceeded, we log a warning and drop subsequent recordings. At the same time we aggressively forget metric streams: any metric stream that doesn't have recordings during a collection is forgotten. This applies to cumulative/delta and sync/async cases.
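
Not the Java SDK's actual code, but a rough Go sketch of the forgetting half of that behavior, assuming a stream map keyed by an encoded attribute set (all names are made up):

```go
package sketch

type stream struct {
	value    float64
	recorded bool // set on record(), cleared after each collection
}

type viewStorage struct {
	streams map[string]*stream // keyed by encoded attribute set
}

func (v *viewStorage) record(key string, value float64) {
	s, ok := v.streams[key]
	if !ok {
		s = &stream{}
		v.streams[key] = s // the per-view stream cap (sketched earlier) would be checked here
	}
	s.value += value
	s.recorded = true
}

// collect exports the current streams and aggressively forgets any stream
// that saw no recordings since the previous collection.
func (v *viewStorage) collect() map[string]float64 {
	out := make(map[string]float64, len(v.streams))
	for key, s := range v.streams {
		if !s.recorded {
			delete(v.streams, key) // idle stream: forget it
			continue
		}
		out[key] = s.value
		s.recorded = false
	}
	return out
}
```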

@jmacd
Contributor

jmacd commented Jul 25, 2022

@jack-berg --

@MrAlias asked for guidance on this topic in open-telemetry/opentelemetry-go#3006 (comment)

I am aware of incompleteness in our specification around the appearance or disappearance of metrics from one collection to the next, depending on async/sync and depending on temporality. I came up with my own answers to this set of questions for my (i.e., Lightstep's) experimental SDK, and one of the deciding factors came from trying to pin down the behavior of synchronous gauge instruments -- while prototyping for #2318.

At the highest level, my approach to these questions is "when cumulative temporality, do what a Prometheus user expects" and "when delta temporality, do what a Statsd user expects".

To be clear, I'm interested in specifying how an SDK should handle the gradual accumulation of memory due to cardinality across multiple collection cycles. I am less interested in specifying how the SDK should handle sudden accumulation of memory in a single collection cycle (in this situation, I will advocate the use of sampling, but it's a separate topic).

By the above paragraph's logic, the solution for high-cardinality in a delta-temporality exporter is to shorten the collection interval because the SDK is allowed to forget all memory following every collection. The problem referred to in open-telemetry/opentelemetry-go#3006 is for cumulative exporters, and I suggest we solicit input from the Prometheus team (cc @RichiH @gouthamve).

The primary "tool" we have to address this cardinality problem is the use of start_time to signal reset of an individual timeseries. Questions for the audience, considering cumulative exporters and high cardinality:

  1. If a timeseries is "stale" meaning has not been used in many collection cycles, MAY we drop it? (how many cycles is "many"?)
  2. If the timeseries will be dropped, SHOULD we emit data points w/ the staleness marker (i.e., NO_DATA_PRESENT) for some period of time (determined how?)
  3. If the timeseries reappears at some future time, what will its new start time be?

The third question is the most difficult. When a process starts from scratch, it is understandable that we use the process start time as the start time when a metric first appears. But -- what does that mean? If a process starts, one hour passes, and then a Counter is incremented for the first time, at that moment we can say that the series has a lifetime rate of 1/hour. One hour later, we decide to drop that series. After another hour, the series has another count. Its lifetime rate is 2 events per 3 hours.

Note that if our goal is to eliminate memory, we can't exactly "keep" the timestamp when an individual series was forgotten. By this logic, we can't distinguish series that have been reset from series that have never reported. We can no longer use the process start time for "new" series, because we don't know what is new and what was previously used.

When reporting a new-or-previously-reused metric, any start time after the most recent data point's timestamp will be correct. One solution is to remember the last time any cumulative series was forgotten, call this time_of_last_reset (initially equal to the process start time). When any new series appears, set the start time to time_of_last_reset. When any series is forgotten, update time_of_last_reset.
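
A Go sketch of that time_of_last_reset bookkeeping, following the description above (names are invented; locking and error handling omitted):

```go
package sketch

import "time"

// cumulativeState tracks per-series start times plus the single extra
// timestamp needed to give forgotten-then-reappearing series a valid start.
type cumulativeState struct {
	startTimes      map[string]time.Time // keyed by encoded attribute set
	timeOfLastReset time.Time            // initially the process start time
}

// startTimeFor returns the start time to report for a series.
func (c *cumulativeState) startTimeFor(key string) time.Time {
	if start, ok := c.startTimes[key]; ok {
		return start
	}
	// New (or previously forgotten) series: any start time later than its most
	// recent possible data point is correct, so use timeOfLastReset.
	c.startTimes[key] = c.timeOfLastReset
	return c.timeOfLastReset
}

// forget drops a stale series and remembers when any series was last forgotten.
func (c *cumulativeState) forget(key string, now time.Time) {
	delete(c.startTimes, key)
	c.timeOfLastReset = now
}
```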

Another solution is to use timestamp-1 as the start time for all new series. This has unfortunate side-effects, however, since it gives the appearance of an extremely large instantaneous rate when in fact, the series is just rarely used.

As you can see, there is some loss introduced by allowing series to be forgotten. We can no longer rely on the process start time equaling the start time of every metric it produces. This leads to follow-on questions--in particular #1273.

@jack-berg
Member

The Java SDK changed its behavior after I posted that "we aggressively forget metric streams - any metric stream that doesn't have recordings during a collection is forgotten". We still purge series after collection when the temporality is delta, but for cumulative metrics we retain all series and have an upper limit of 2000 timeseries per instrument. If the timeseries limit is exceeded, we log a warning and drop recordings.

Another solution is to use timestamp-1 as the start time for all new series. This has unfortunate side-effects, however, since it gives the appearance of an extremely large instantaneous rate when in fact, the series is just rarely used.

What about using the timestamp of last collection as the start time?

@jmacd
Contributor

jmacd commented Jul 27, 2022

What about using the timestamp of last collection as the start time?

For push-based exporters, this works.

For pull-based exporters, in the common case when there is only one scraper pulling data from the SDK, this works.

For pull-based exporters with more than one scraper, the scraper would have to provide its last-scrape timestamp in the request to do this well. If the exporter simply tracks the last-export timestamp of any scraper, you might end up in a situation where the last-collection time is not meaningful. In a Prometheus environment, where rate calculations are typically done over timeseries vectors rather than point start times, I don't think you will see a meaningful difference, although this is a source of ambiguity when counters reset.

The purpose of the start timestamp is to avoid ambiguity over counter reset, so that rate calculations can be made easily. If the last-collection timestamp is used as the start timestamp for new series, you may end up with OTLP data points with extreme-value instantaneous rates.

@jmacd
Contributor

jmacd commented Aug 4, 2022

@legendecas made an insightful remark here, open-telemetry/opentelemetry-js#2997 (comment), noting that temporality is irrelevant to this problem. Cumulative/synchronous and delta/asynchronous configurations both have the same problem. In both cases, a user may stop reporting a set of attributes and then possibly resume reporting the same set at some point in the distant future.

Assuming we define a circuit-breaker mechanism, such as a configuration to "expunge" instrument/attributes that have become stale (without loss of information => addresses long-term risk) or to drop instrument/attributes that are presently overflowing (with loss of information => addresses short-term risk), we have an independent problem of how to convey that to the OTLP or PRW consumer.

I propose to leave this issue open as requesting a definition for the circuit-breaker mechanism, examples include:

  • Expunge series from the instrument after they have not been used for N collection cycles (N >= 1) or T seconds pass
  • Expunge series from the instrument if the current series count is >= R records.

I will file a new issue to discuss how the SDK should "safely" expunge these series.
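
For concreteness, one possible shape for such a circuit-breaker configuration, sketched in Go (nothing here is specified anywhere; the N/T/R values from the bullets above are spelled out as invented parameter names):

```go
package sketch

import "time"

// CardinalityLimits is one possible shape for the circuit-breaker
// configuration proposed above.
type CardinalityLimits struct {
	StaleAfterCycles  int           // expunge a series unused for N collection cycles (N >= 1)
	StaleAfterElapsed time.Duration // ...or unused for this long
	MaxSeries         int           // drop new series while the instrument holds >= R records
}

// shouldExpunge reports whether a series last used at lastUsed, and unused for
// unusedCycles collections, should be expunged at time now (no information loss).
func (l CardinalityLimits) shouldExpunge(lastUsed, now time.Time, unusedCycles int) bool {
	if l.StaleAfterCycles > 0 && unusedCycles >= l.StaleAfterCycles {
		return true
	}
	if l.StaleAfterElapsed > 0 && now.Sub(lastUsed) >= l.StaleAfterElapsed {
		return true
	}
	return false
}

// overflowing reports whether new series should be dropped (with information loss).
func (l CardinalityLimits) overflowing(currentSeries int) bool {
	return l.MaxSeries > 0 && currentSeries >= l.MaxSeries
}
```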

@MrAlias
Contributor

MrAlias commented Oct 27, 2022

It was pointed out in this discussion that Prometheus allows users to define their own mechanism to purge their own data.

Coming from Prometheus, you can create a MetricVec type which can delete specific metrics.

Might be worth specifying a similar mechanism instead of a prescriptive drop policy. That way the most informed person about the data, the user, can make the decision.
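
For reference, a small example of that Prometheus mechanism using the Go client, github.com/prometheus/client_golang (the metric and label names are made up):

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	// A *Vec metric keeps one child series per label combination.
	queueDepth := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "queue_depth", Help: "Items queued per tenant."},
		[]string{"tenant"},
	)
	prometheus.MustRegister(queueDepth)

	queueDepth.WithLabelValues("tenant-a").Set(42)

	// The user, who knows when a tenant goes away, can purge its series
	// explicitly instead of relying on an SDK-wide drop policy.
	queueDepth.DeleteLabelValues("tenant-a")
}
```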

@jmacd
Contributor

jmacd commented Nov 9, 2022

We discussed this in the Prometheus working group today because we feel it is important for OTel and Prometheus to align on how we handle cardinality explosions. We placed our attention on these configuration options documented for the Prometheus server, particularly sample_limit and label_limit.

```yaml
# An uncompressed response body larger than this many bytes will cause the
# scrape to fail. 0 means no limit. Example: 100MB.
# This is an experimental feature, this behaviour could
# change or be removed in the future.
[ body_size_limit: <size> | default = 0 ]

# Per-scrape limit on number of scraped samples that will be accepted.
# If more than this number of samples are present after metric relabeling
# the entire scrape will be treated as failed. 0 means no limit.
[ sample_limit: <int> | default = 0 ]

# Per-scrape limit on number of labels that will be accepted for a sample. If
# more than this number of labels are present post metric-relabeling, the
# entire scrape will be treated as failed. 0 means no limit.
[ label_limit: <int> | default = 0 ]

# Per-scrape limit on length of labels name that will be accepted for a sample.
# If a label name is longer than this number post metric-relabeling, the entire
# scrape will be treated as failed. 0 means no limit.
[ label_name_length_limit: <int> | default = 0 ]

# Per-scrape limit on length of labels value that will be accepted for a sample.
# If a label value is longer than this number post metric-relabeling, the
# entire scrape will be treated as failed. 0 means no limit.
[ label_value_length_limit: <int> | default = 0 ]

# Per-scrape config limit on number of unique targets that will be
# accepted. If more than this number of targets are present after target
# relabeling, Prometheus will mark the targets as failed without scraping them.
# 0 means no limit. This is an experimental feature, this behaviour could
# change in the future.
[ target_limit: <int> | default = 0 ]
```

We are interested in describing what should happen for Push and Pull exporters when a cardinality limit is reached. We know that when these limits are reached the entire scrape will be treated as failed (Pull). We also have standard gRPC/HTTP message-size limits that, however configured, can cause an entire batch of metrics to be "treated as failed" without further specification (as we have not specified any ways of splitting an OTLP metrics payload).

Therefore, we are interested in allowing the user to configure limits at a fine grain to avoid the entire loss of metrics scenario. Cardinality limits could be applied at:

  • the instrument level
  • the instrumentation-scope level.

That said, these configuration settings should be considered alongside other means of controlling cardinality. Currently, we have these other approaches:

  1. Configure a view to filter out the high-cardinality attribute (see the sketch after this list)
  2. For asynchronous instruments: the instrument's callback has the ability to forget timeseries; the callback can simply stop reporting them. (Note: this can be accomplished in Prometheus through a custom collector or the instrument's Delete() method.)
  3. For synchronous instruments: there is no OTel API to delete a timeseries, and we are not willing to change this detail before reaching a Metrics-1.0 milestone. Therefore, the standard behavior should follow Prometheus and fail the metrics.
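
As an illustration of approach 1, a sketch using the OpenTelemetry Go SDK's view API (exact constructor names may differ across SDK versions, and the instrument/attribute names are placeholders):

```go
package metricsetup

import (
	"go.opentelemetry.io/otel/attribute"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func newProvider(reader sdkmetric.Reader) *sdkmetric.MeterProvider {
	// Keep only low-cardinality attributes on this instrument; anything else
	// (e.g. a per-request ID) is dropped before aggregation.
	view := sdkmetric.NewView(
		sdkmetric.Instrument{Name: "http.server.duration"},
		sdkmetric.Stream{
			AttributeFilter: attribute.NewAllowKeysFilter("http.method", "http.status_code"),
		},
	)
	return sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(reader),
		sdkmetric.WithView(view),
	)
}
```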

Speaking as a user, I've already experienced the limit described above when pushing OTLP payloads over gRPC that exceed the configured limit. I would like to have isolation built in, so that a single instrumentation library cannot cause the loss of all metrics. I believe the instrumentation scope is the natural place for limits such as total byte size and data point count.

When an instrumentation scope exceeds its limits, the following should happen:

a. The entire batch of data from that scope fails. Presumably this will be observable in the future, in the sense that we want to have metrics about the SDK's successes and failures, and those metrics can include the instrumentation scope name/version/attributes.
b. The SDK should raise an error telling the user that limits were exceeded by the library, and to consider one of the following remedies: (b.1) use of a view with an attribute filter for the high-cardinality attribute, (b.2) use of delta temporality with synchronous instruments, (b.3) use of asynchronous instruments that forget cardinality.

These recommendations are in line with Prometheus in that views should be applied before comparing against limits, and "entire failure" is the consequence when limits are exceeded.

@asafm
Contributor

asafm commented Dec 29, 2022

there is no OTel API to delete a timeseries, and we are not willing to change this detail before reaching a Metrics-1.0 milestone.
@jmacd Can you please explain the timing preference? I need more context to understand that.

IMO it's imperative to have an explicit way to delete a time series, and I've set up an issue to discuss this: #3062 (I searched for an existing issue and couldn't find one; not sure GitHub search is that good). In that issue, I gave a real-world example of why it is needed.
IMO having a "manual" way to say "this time series won't be reported anymore" complements the issue discussed here of applying a limit to the number of time series measured and/or reported.

@dashpole
Contributor

dashpole commented Jan 18, 2023

The primary "tool" we have to address this cardinality problem is the use of start_time to signal reset of an individual timeseries. Questions for the audience, considering cumulative exporters and high cardinality:

  1. If a timeseries is "stale" meaning has not been used in many collection cycles, MAY we drop it? (how many cycles is "many"?)

For sync instruments, IMO you should never drop at the instrument level, and leave it to exporters to determine if they can support the cardinality produced.
For async instruments, IMO "stale" should mean it has not been used in a single (non-error) callback, and it should be dropped immediately.

  2. If the timeseries will be dropped, SHOULD we emit data points w/ the staleness marker (i.e., NO_DATA_PRESENT) for some period of time (determined how?)

At least for prometheus, this is unnecessary, as staleness tracking is done by the scraper, not by the client. The client simply stops exposing the timeseries in question on the scrape endpoint.
For a push exporter, it would be helpful to emit a single point with a staleness marker for each exporter (if possible). But I don't think this is a hard requirement.

  3. If the timeseries reappears at some future time, what will its new start time be?

I agree with @jack-berg above. It should be the time of the last collection (i.e. the last time the callback is invoked). At that point in time, a callback was invoked and no value was returned. If there are multiple scrapers, the start time is still correct in this case. If we had scrapes for a particular label set:

t0: "A" scrapes -> No value
t1: "B" scrapes -> No value
t2: "A" scrapes -> Has value X
t3: "B" scrapes -> Has value Y

"A" would receive (at t2 scrape):

  • start = t1
  • end = t2
  • value = X

"B" would receive (at t3 scrape):

  • start = t1
  • end = t3
  • value = Y

I believe this is correct, as the value X implicitly was accumulated between t1 and t2, and Y implicitly accumulated between t1 and t3. While a start time of t0 (instead of t1) for the "A" scraper might also be correct (if we were tracking start time per-scraper), t1 is also correct.
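
A sketch of that rule in Go, assuming the SDK tracks only a single previous-collection timestamp rather than per-scraper state (all names invented):

```go
package sketch

import "time"

type asyncState struct {
	prevCollection time.Time            // timestamp of the previous collection/callback
	startTimes     map[string]time.Time // per-series start times
}

// onCollect is called once per collection at time now with the set of series
// the callback reported. Reappearing series get the previous collection time
// as their start time; unreported series are forgotten immediately.
func (a *asyncState) onCollect(now time.Time, reported map[string]float64) {
	for key := range a.startTimes {
		if _, ok := reported[key]; !ok {
			delete(a.startTimes, key) // not reported in this callback: drop it
		}
	}
	for key := range reported {
		if _, ok := a.startTimes[key]; !ok {
			a.startTimes[key] = a.prevCollection // start = time of last collection
		}
	}
	a.prevCollection = now
}
```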

For Prometheus, the built-in standard sync instruments (e.g. Gauge) don't support Delete() (at least not in Go). You can use a raw MetricVec to delete label values, but that isn't very common in my experience; it is documented as a building block for building your own synchronous instruments. More common is to implement a custom collector (e.g. cadvisor), which is roughly the equivalent of a callback, as it allows the user to put the current value of metrics into a channel (in Go).

For users who need a mechanism to report metrics which have a bounded cardinality in a particular collection round, but unbounded cardinality over time (e.g. something like kube-state-metrics which reports a metric for each pod in a kubernetes cluster, but which is likely to have about the same number of pods in the cluster at a point-in-time), I would advocate they use asynchronous instruments to intentionally "forget" label sets that are no longer relevant, and that the SDK treat the non-reporting of a label set as a signal that it should no longer export or cache information about it.
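
A hedged sketch of that pattern with the OpenTelemetry Go API (listPods is a placeholder for e.g. reading the kube client's watch cache; the meter and metric names are made up, and exact API shapes may vary by version):

```go
package collector

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// listPods is a placeholder for listing objects from the kube client's cache.
func listPods() map[string]int64 { return map[string]int64{"pod-a": 1} }

func register() error {
	meter := otel.Meter("example.io/kube-state")
	podInfo, err := meter.Int64ObservableGauge("k8s.pod.info")
	if err != nil {
		return err
	}
	// The callback reports only the label sets that exist right now; anything
	// not observed in a given collection simply stops being exported, so
	// cardinality is bounded per collection even if unbounded over time.
	_, err = meter.RegisterCallback(func(ctx context.Context, o metric.Observer) error {
		for pod, v := range listPods() {
			o.ObserveInt64(podInfo, v, metric.WithAttributes(attribute.String("k8s.pod.name", pod)))
		}
		return nil
	}, podInfo)
	return err
}
```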

@asafm
Contributor

asafm commented Feb 19, 2023

@dashpole

For users who need a mechanism to report metrics which have a bounded cardinality in a particular collection round, but unbounded cardinality over time (e.g. something like kube-state-metrics which reports a metric for each pod in a kubernetes cluster, but which is likely to have about the same number of pods in the cluster at a point-in-time), I would advocate they use asynchronous instruments to intentionally "forget" label sets that are no longer relevant, and that the SDK treat the non-reporting of a label set as a signal that it should no longer export or cache information about it.

What can they use to report a histogram, as it doesn't have an asynchronous version?
Don't you think it would be better, specifically in the case of k8s pods, for them to remove the pod-related attribute sets explicitly, since they know when a pod was deleted - an explicit delete?

@carlosalberto carlosalberto added the triaged-accepted The issue is triaged and accepted by the OTel community, one can proceed with creating a PR proposal label Feb 20, 2023
@dashpole
Contributor

What can they use to report a histogram, as it doesn't have an asynchronous version?

They wouldn't be able to use the approach above today, and would need what you are proposing.

Don't you think it would be better, specifically in the case of k8s pods, for them to remove the pod-related attribute sets explicitly, since they know when a pod was deleted - an explicit delete?

I don't want to delve too deeply into k8s internals, but the kube client already maintains the set of watched objects in memory for us. It would likely be simpler to use an async instrument based on listing objects from the client's cache.

carlosalberto pushed a commit that referenced this issue May 8, 2023
Fixes #1891.

**EDIT: Updated to specify cardinality limits at the View/Instrument
level with a Reader-level default. Updated to use a hard limit**

## Changes

Adds optional support for a maximum cardinality limit.

The recommended default is 2000, based on this comment by @jack-berg: #1891 (comment).

The Prometheus-WG SIG discussed this on Nov 9, 2022 and reached this
recommended solution to the problem outlined in #1891. The consequence
of exceeding these limits is in line with the current Prometheus server
behavior, which drops targets that misbehave. The discussion was
summarized here:
#1891 (comment)