High cardinality metric observed from grpc instrumentation #7517

Closed
jaronoff97 opened this issue Apr 11, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@jaronoff97
Contributor

Describe the bug
When the telemetry.useOtelForInternalMetrics feature gate is enabled, gRPC metrics now come through because of this line. When a collector handles a sufficiently high number of connections, this floods the Prometheus exporter with metric series carrying net.sock.peer.addr attributes; in our case a single Prometheus scrape returned a 33 MB payload.

This issue has been reported and discussed here. I'm moving the discussion to this repository because any user enabling this flag can potentially hit the same cardinality explosion. As a temporary measure, it would be ideal to disable the line I linked in order to remediate the problem.
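
For context, the wiring in question looks roughly like the following. This is a minimal sketch using the public otelgrpc interceptor options, not the Collector's exact code; the point is that once a real meter provider is passed in, the otelgrpc server metrics (e.g. rpc.server.duration) carry per-connection attributes such as net.sock.peer.addr, one series per client address.

    // Sketch only: mirrors the kind of wiring the linked line performs;
    // not the Collector's actual implementation.
    package main

    import (
        "google.golang.org/grpc"

        "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
        "go.opentelemetry.io/otel/metric"
    )

    // newInstrumentedServer attaches otelgrpc instrumentation to a gRPC server.
    // With a real (non-noop) MeterProvider, every distinct client address shows
    // up as a separate net.sock.peer.addr attribute value on the gRPC metrics.
    func newInstrumentedServer(mp metric.MeterProvider) *grpc.Server {
        return grpc.NewServer(
            grpc.ChainUnaryInterceptor(
                otelgrpc.UnaryServerInterceptor(otelgrpc.WithMeterProvider(mp)),
            ),
        )
    }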

Steps to reproduce
Run a collector with the telemetry.useOtelForInternalMetrics feature gate enabled, load test it, and curl the internal metrics endpoint.

What did you expect to see?
A stable collector that doesn't constantly OOM.

What did you see instead?
Cardinality explosion

What version did you use?
v0.74.0

What config did you use?
Config:

    extensions:
      health_check:
        check_collector_pipeline:
          enabled: false
          exporter_failure_threshold: 5
          interval: 5m
        endpoint: 0.0.0.0:13133
        path: /
      pprof: null
    exporters:
      otlp/selfreport:
        endpoint: XXXXXXXXXXXXXXXXX
        headers:
          lightstep-access-token: ${LS_TOKEN}
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            include_metadata: true
      prometheus:
        config:
          scrape_configs:
          - job_name: spaningest
            scrape_interval: 5s
            static_configs:
            - labels:
                collector_name: ${KUBE_POD_NAME}
              targets:
                - 0.0.0.0:8888
    processors:
      batch:
        send_batch_max_size: 1500
        send_batch_size: 1000
        timeout: 1s
      memory_limiter:
        check_interval: 1s
        limit_percentage: 85
        spike_limit_percentage: 10
    service:
      extensions:
      - health_check
      - pprof
      pipelines:
        metrics:
          exporters:
          - otlp/selfreport
          processors:
          - memory_limiter
          - batch
          receivers:
          - prometheus
      telemetry:
        metrics:
          level: detailed
        resource:
          service.name: spaningest

Environment
OS: Kubernetes
Compiler (if manually compiled): go 1.20


@jaronoff97 added the bug label on Apr 11, 2023
@jaronoff97
Contributor Author

@mx-psi would you be open to me solving this by removing the line that adds the otel grpc metrics? (this line)

@mx-psi
Member

mx-psi commented Apr 17, 2023

That would be a breaking change for users, and we would be removing support for something that (as I understand it) is in the spec. I think this should be raised at the spec level if it has not been already. On the Collector, I guess the right solution for this would be to support defining views for the metrics, but this seems like something that would require careful design and take significant time.

@codeboten, what do you think is the right approach here? Do we have a clear schema for configuring views in YAML (e.g. from the Configuration WG)?

@codeboten
Contributor

@mx-psi views are part of the initial example. I agree that it is likely to take some time before view support is fully implemented.

Since this is a problem only for the otel instrumentation, would an acceptable interim solution be to configure a view in the SDK to drop problematic metrics? Even if users don't have an option to re-enable them?

I suppose there could be an "enablePotentiallyHighCardinalityMetrics" feature gate.
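
For illustration, a view along these lines could drop the peer-address attributes at the SDK level. This is a sketch against the Go metric SDK view API; the wildcard instrument selection and the exact set of attribute keys are assumptions, not a reviewed design.

    package main

    import (
        "go.opentelemetry.io/otel/attribute"
        sdkmetric "go.opentelemetry.io/otel/sdk/metric"
    )

    // highCardinalityFilter drops the per-connection attribute keys; the key
    // list here is illustrative.
    var highCardinalityFilter = attribute.NewDenyKeysFilter(
        "net.sock.peer.addr",
        "net.sock.peer.port",
    )

    // newMeterProvider applies the filter to all instruments via a wildcard
    // view; a real implementation might scope it to the otelgrpc scope only.
    func newMeterProvider(reader sdkmetric.Reader) *sdkmetric.MeterProvider {
        dropPeerAttrs := sdkmetric.NewView(
            sdkmetric.Instrument{Name: "*"},
            sdkmetric.Stream{AttributeFilter: highCardinalityFilter},
        )
        return sdkmetric.NewMeterProvider(
            sdkmetric.WithReader(reader),
            sdkmetric.WithView(dropPeerAttrs),
        )
    }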

@mx-psi
Member

mx-psi commented Apr 17, 2023

Since this is a problem only for the otel instrumentation, would an acceptable interim solution be to configure a view in the SDK to drop problematic metrics? Even if users don't have an option to re-enable them?

I suppose there could be an "enablePotentiallyHighCardinalityMetrics" feature gate.

That sounds like an acceptable solution to me in the short term.

codeboten pushed a commit that referenced this issue May 1, 2023
**Description:** Puts the grpc meter provider behind a feature flag for controlling high cardinality metrics.

**Link to tracking Issue:** #7517
codeboten pushed a commit to codeboten/opentelemetry-collector that referenced this issue May 2, 2023
**Description:** Puts the grpc meter provider behind a feature flag for controlling high cardinality metrics.

**Link to tracking Issue:** open-telemetry#7517
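
Conceptually, the change described in these commits amounts to gating which meter provider the gRPC instrumentation sees. The sketch below uses the Collector's featuregate package; the gate ID, description, and helper function are illustrative, not the merged names.

    package main

    import (
        "go.opentelemetry.io/collector/featuregate"
        "go.opentelemetry.io/otel/metric"
        "go.opentelemetry.io/otel/metric/noop"
    )

    // Illustrative gate registration; the merged gate's ID and stage may differ.
    var disableHighCardinalityGate = featuregate.GlobalRegistry().MustRegister(
        "telemetry.disableHighCardinalityMetrics",
        featuregate.StageAlpha,
        featuregate.WithRegisterDescription(
            "Drops high-cardinality gRPC instrumentation metrics when enabled."),
    )

    // meterProviderForGRPC hands the gRPC instrumentation a no-op provider when
    // the gate is enabled, so it records nothing; otherwise it passes through
    // the configured provider.
    func meterProviderForGRPC(configured metric.MeterProvider) metric.MeterProvider {
        if disableHighCardinalityGate.IsEnabled() {
            return noop.NewMeterProvider()
        }
        return configured
    }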
@jaronoff97
Contributor Author

This was closed by #7543.
