High cardinality metric observed from grpc instrumentation #7517

Closed
jaronoff97 opened this issue Apr 11, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@jaronoff97
Contributor

Describe the bug
When the telemetry.useOtelForInternalMetrics feature gate is enabled, gRPC metrics now come through because of this line. When a collector handles a sufficiently high number of connections, this floods the Prometheus exporter with metric series carrying net.sock.peer.addr attributes; in our case a single Prometheus scrape returned a 33 MB payload.

This issue has been reported and discussed here. I'm moving the discussion to this repository because any user enabling this flag can potentially hit the same cardinality explosion. As a temporary measure, it would be ideal to disable the line I linked in order to remediate the problem.
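
For context, the wiring in question looks roughly like the following. This is a minimal sketch using the public otelgrpc interceptor options, not the Collector's exact code; the point is that once a real meter provider is passed in, the otelgrpc server metrics (e.g. rpc.server.duration) carry per-connection attributes such as net.sock.peer.addr, one series per client address.

    // Sketch only: mirrors the kind of wiring the linked line performs;
    // not the Collector's actual implementation.
    package main

    import (
        "google.golang.org/grpc"

        "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
        "go.opentelemetry.io/otel/metric"
    )

    // newInstrumentedServer attaches otelgrpc instrumentation to a gRPC server.
    // With a real (non-noop) MeterProvider, every distinct client address shows
    // up as a separate net.sock.peer.addr attribute value on the gRPC metrics.
    func newInstrumentedServer(mp metric.MeterProvider) *grpc.Server {
        return grpc.NewServer(
            grpc.ChainUnaryInterceptor(
                otelgrpc.UnaryServerInterceptor(otelgrpc.WithMeterProvider(mp)),
            ),
        )
    }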

Steps to reproduce
Run a collector with the telemetry.useOtelForInternalMetrics feature gate enabled, load test it, and curl the internal metrics endpoint.

What did you expect to see?
A stable collector that doesn't constantly OOM.

What did you see instead?
Cardinality explosion

What version did you use?
v0.74.0

What config did you use?
Config:

    extensions:
      health_check:
        check_collector_pipeline:
          enabled: false
          exporter_failure_threshold: 5
          interval: 5m
        endpoint: 0.0.0.0:13133
        path: /
      pprof: null
    exporters:
      otlp/selfreport:
        endpoint: XXXXXXXXXXXXXXXXX
        headers:
          lightstep-access-token: ${LS_TOKEN}
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            include_metadata: true
      prometheus:
        config:
          scrape_configs:
          - job_name: spaningest
            scrape_interval: 5s
            static_configs:
            - labels:
                collector_name: ${KUBE_POD_NAME}
              targets:
                - 0.0.0.0:8888
    processors:
      batch:
        send_batch_max_size: 1500
        send_batch_size: 1000
        timeout: 1s
      memory_limiter:
        check_interval: 1s
        limit_percentage: 85
        spike_limit_percentage: 10
    service:
      extensions:
      - health_check
      - pprof
      pipelines:
        metrics:
          exporters:
          - otlp/selfreport
          processors:
          - memory_limiter
          - batch
          receivers:
          - prometheus
      telemetry:
        metrics:
          level: detailed
        resource:
          service.name: spaningest

Environment
OS: Kubernetes
Compiler (if manually compiled): go 1.20


@jaronoff97 added the bug label on Apr 11, 2023
@jaronoff97
Contributor Author

@mx-psi would you be open to me solving this by removing the line that adds the otel grpc metrics? (this line)

@mx-psi
Member

mx-psi commented Apr 17, 2023

That would be a breaking change for users, and we would be removing support for something that (as I understand it) is in the spec. I think this should be raised at the spec level if it has not been already. On the Collector, I guess the right solution for this would be to support defining views for the metrics, but this seems like something that would require careful design and take significant time.

@codeboten, what do you think is the right approach here? Do we have a clear schema for configuring views in YAML (e.g. from the Configuration WG)?

@codeboten
Contributor

@mx-psi views are part of the initial example. I agree that it is likely to take some time before view support is fully implemented.

Since this is a problem only for the otel instrumentation, would an acceptable interim solution be to configure a view in the SDK to drop problematic metrics? Even if users don't have an option to re-enable them?

I suppose there could be an "enablePotentiallyHighCardinalityMetrics" feature gate.
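
For illustration, a view along these lines could drop the peer-address attributes at the SDK level. This is a sketch against the Go metric SDK view API; the wildcard instrument selection and the exact set of attribute keys are assumptions, not a reviewed design.

    package main

    import (
        "go.opentelemetry.io/otel/attribute"
        sdkmetric "go.opentelemetry.io/otel/sdk/metric"
    )

    // highCardinalityFilter drops the per-connection attribute keys; the key
    // list here is illustrative.
    var highCardinalityFilter = attribute.NewDenyKeysFilter(
        "net.sock.peer.addr",
        "net.sock.peer.port",
    )

    // newMeterProvider applies the filter to all instruments via a wildcard
    // view; a real implementation might scope it to the otelgrpc scope only.
    func newMeterProvider(reader sdkmetric.Reader) *sdkmetric.MeterProvider {
        dropPeerAttrs := sdkmetric.NewView(
            sdkmetric.Instrument{Name: "*"},
            sdkmetric.Stream{AttributeFilter: highCardinalityFilter},
        )
        return sdkmetric.NewMeterProvider(
            sdkmetric.WithReader(reader),
            sdkmetric.WithView(dropPeerAttrs),
        )
    }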

@mx-psi
Member

mx-psi commented Apr 17, 2023

Since this is a problem only for the otel instrumentation, would an acceptable interim solution be to configure a view in the SDK to drop problematic metrics? Even if users don't have an option to re-enable them?

I suppose there could be an "enablePotentiallyHighCardinalityMetrics" feature gate.

That sounds like an acceptable solution to me in the short term.

codeboten pushed a commit that referenced this issue May 1, 2023
**Description:** Puts the grpc meter provider behind a feature flag for controlling high cardinality metrics.

**Link to tracking Issue:** #7517
codeboten pushed a commit to codeboten/opentelemetry-collector that referenced this issue May 2, 2023
**Description:** Puts the grpc meter provider behind a feature flag for controlling high cardinality metrics.

**Link to tracking Issue:** open-telemetry#7517
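
Conceptually, the change described in these commits amounts to gating which meter provider the gRPC instrumentation sees. The sketch below uses the Collector's featuregate package; the gate ID, description, and helper function are illustrative, not the merged names.

    package main

    import (
        "go.opentelemetry.io/collector/featuregate"
        "go.opentelemetry.io/otel/metric"
        "go.opentelemetry.io/otel/metric/noop"
    )

    // Illustrative gate registration; the merged gate's ID and stage may differ.
    var disableHighCardinalityGate = featuregate.GlobalRegistry().MustRegister(
        "telemetry.disableHighCardinalityMetrics",
        featuregate.StageAlpha,
        featuregate.WithRegisterDescription(
            "Drops high-cardinality gRPC instrumentation metrics when enabled."),
    )

    // meterProviderForGRPC hands the gRPC instrumentation a no-op provider when
    // the gate is enabled, so it records nothing; otherwise it passes through
    // the configured provider.
    func meterProviderForGRPC(configured metric.MeterProvider) metric.MeterProvider {
        if disableHighCardinalityGate.IsEnabled() {
            return noop.NewMeterProvider()
        }
        return configured
    }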
@jaronoff97
Contributor Author

This was closed by #7543.
