Work needed to enable useOtelForInternalMetrics by default #7454

Closed
1 task done
codeboten opened this issue Mar 29, 2023 · 16 comments · Fixed by #9037
Comments

@codeboten
Contributor

codeboten commented Mar 29, 2023

This issue is to discuss the work needed to enable using the OTel SDK by default.

Related issues:

@jmacd
Contributor

jmacd commented Mar 29, 2023

I've experimented with using metrics generated by an OTel collector running with the above-mentioned flag. I'm picking up work started by @paivagustavo. This is an experience report:

I modified this example to configure the Prometheus receiver inside a collector for self-observability, like so:

receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: self
        scrape_interval: 5s
        static_configs:
        - targets:
          - 127.0.0.1:8888
# ...
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [logging, otlp/arrow]
  telemetry:
    resource:
      "service.name": "example-bridge"
    metrics:
      address: 127.0.0.1:8888
      level: detailed
    logs:
      level: info

which is similar to the example configuration given in that receiver's README.

To address the problem in #6403, I use --feature-gates=telemetry.useOtelForInternalMetrics,pkg.translator.prometheus.NormalizeName flags together.

The result, viewed in Prometheus, is a mix of metrics produced by the OC and OTel paths. This is an example from the OC path:

# HELP otelcol_exporter_enqueue_failed_log_records Number of log records failed to be added to the sending queue.
# TYPE otelcol_exporter_enqueue_failed_log_records counter
otelcol_exporter_enqueue_failed_log_records{exporter="logging",service_instance_id="955ece9f-2577-410e-ab0a-8ea042c6de5d",service_name="example-bridge",service_version="0.0.1-dev"} 0
otelcol_exporter_enqueue_failed_log_records{exporter="otlp/arrow",service_instance_id="955ece9f-2577-410e-ab0a-8ea042c6de5d",service_name="example-bridge",service_version="0.0.1-dev"} 0

Note that service_name is the configured service.name, service_version was set via the builder, and service_instance_id was synthesized while setting up telemetry.

Here is an example from the OTel instrumentation with the associated target_info.

# HELP otelcol_exporter_sent_wire_total Number of bytes sent on the wire by the component.
# TYPE otelcol_exporter_sent_wire_total counter
otelcol_exporter_sent_wire_total{exporter="otlp/arrow"} 69203

# HELP otelcol_target_info Target metadata
# TYPE otelcol_target_info gauge
otelcol_target_info{service_instance_id="955ece9f-2577-410e-ab0a-8ea042c6de5d",service_name="example-bridge",service_version="0.0.1-dev"} 1

The same metrics are also pushed via OTLP through an example pipeline, which I've printed below. The resource comes through like this:

  resource: {
    attributes: {
      key: "service.name"
      value: {
        string_value: "self"
      }
    }
    attributes: {
      key: "service.instance.id"
      value: {
        string_value: "127.0.0.1:8888"
      }
    }
    attributes: {
      key: "net.host.port"
      value: {
        string_value: "8888"
      }
    }
    attributes: {
      key: "http.scheme"
      value: {
        string_value: "http"
      }
    }
  }

Note that the original service name has been lost, overridden by the Prometheus scrape config's job name. The original service instance ID is also lost. net.host.port has been inserted, but I'm not sure it should be in this case, because I'd prefer not to have lost the original instance ID. The http.scheme attribute is surely not relevant as a resource here.

I can't explain where the service.version attribute has gone -- it should have come through as a resource attribute.

Next, I experimented with whether you can fix some of these issues using Prometheus relabeling rules, e.g., using

        - regex: service_version
          action: labeldrop
        - regex: __scheme__
          action: labeldrop
        - source_labels: [service_instance_id]
          action: replace
          target_label: instance
        - source_labels: [service_name]
          action: replace
          target_label: job

but this doesn't help much, because anything you do at this step happens before the logic that converts the Prometheus data back to OTLP (which is where the problems arise).

Here's what the OC metric looks like after OTLP conversion:

    metrics: {
      name: "otelcol_exporter_enqueue_failed_log_records"
      description: "Number of log records failed to be added to the sending queue."
      sum: {
        data_points: {
          attributes: {
            key: "exporter"
            value: {
              string_value: "logging"
            }
          }
          attributes: {
            key: "service_name"
            value: {
              string_value: "example-bridge"
            }
          }
          attributes: {
            key: "service_version"
            value: {
              string_value: "0.0.1-dev"
            }
          }
          attributes: {
            key: "service_instance_id"
            value: {
              string_value: "955ece9f-2577-410e-ab0a-8ea042c6de5d"
            }
          }
          start_time_unix_nano: 1680124224552000000
          time_unix_nano: 1680124584556000000
          as_double: 0
        }

It appears that there are several issues, though not all of them need to be addressed:

  1. The Prometheus-to-OTLP conversion is leaving me a bit unhappy. I'd like a way for that conversion to honor the existing OTel conventions, somehow. Something like honor_labels: true in the Prometheus configuration, but for honoring OTel attributes: instead of honoring job and instance, I want to honor service.name, service.instance.id, and service.version from the scraped data (i.e., from the target_info). Possibly that is the intention behind pkg.translator.prometheus.PermissiveLabelSanitization. (See the scrape-config sketch after this list.)
  2. Somehow the target_info metric is not being respected; otherwise we'd see a service.version attribute in the printout.
  3. The OpenCensus-generated metrics have unwanted service_name, service_instance_id, and service_version attributes, which should be disabled when the OTel SDK is being used.
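
For reference, the existing Prometheus mechanism being analogized in point 1 sits in the scrape config, as sketched below. This is only a sketch mirroring the self-scrape config above; honor_labels governs conflicts with Prometheus's own job and instance target labels, and there is no equivalent today for the OTel service.* attributes carried in target_info.

receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: self
        scrape_interval: 5s
        # honor_labels controls whether label values already present in the
        # scraped data win over the labels Prometheus would otherwise attach
        # (job, instance). An analogous knob for OTel resource attributes
        # (service.name, service.instance.id, service.version) is what
        # point 1 above asks for.
        honor_labels: true
        static_configs:
        - targets:
          - 127.0.0.1:8888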

@codeboten
Contributor Author

codeboten commented Apr 3, 2023

@jmacd I believe this PR will help with the third point.

The issue is that currently both OC and OTel use Prometheus as a bridge to allow components to continue using OpenCensus instrumentation. The change I'm proposing disables the OC configuration when useOtelForInternalMetrics is enabled and relies on the OC-to-OTel bridge instead.

@mx-psi
Member

mx-psi commented Apr 12, 2023

We may need to address #7517 before closing this

@mx-psi
Member

mx-psi commented Aug 11, 2023

The check_collector_pipeline functionality in the health check extension depends on OpenCensus (see here). We would need to either deprecate this functionality or update it to use OTel before enabling useOtelForInternalMetrics.
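
For context, that functionality is configured roughly as follows (a sketch based on the extension's README; the endpoint and threshold values here are illustrative only):

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
    # The OpenCensus-based pipeline check discussed above.
    check_collector_pipeline:
      enabled: true
      interval: "5m"
      exporter_failure_threshold: 5

service:
  extensions: [health_check]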

codeboten pushed a commit to codeboten/opentelemetry-collector that referenced this issue Nov 20, 2023
This ensures backwards compatibility with the metrics generated via opencensus by default today.

Part of open-telemetry#7454

Signed-off-by: Alex Boten <aboten@lightstep.com>
mx-psi pushed a commit that referenced this issue Nov 21, 2023
This ensures backwards compatibility with the metrics generated via
opencensus by default today.

Part of #7454

Signed-off-by: Alex Boten <aboten@lightstep.com>
@codeboten
Contributor Author

With the PR for the _total suffix having merged yesterday, I started looking at the second issue @jmacd mentioned in his comment above. With OpenCensus telemetry, service labels are applied to each metric:

otelcol_exporter_queue_capacity{exporter="otlp",service_instance_id="aae63e2f-cad2-4224-bbb3-806689b0a2d0",service_name="otelcontribcol",service_version="0.89.0-dev"} 1000

With useOtelForInternalMetrics enabled, all of that information is instead set on target_info:

otelcol_target_info{service_instance_id="73a547db-f135-45c4-870e-cde704b3cf7d",service_name="otelcontribcol",service_version="0.89.0-dev"} 1

I'm not sure that there's a good mechanism today for applying info from target_info as resource attributes. @dashpole pointed me to a spec change proposed to address this for Prometheus exporters: open-telemetry/opentelemetry-specification#3761 (thanks David!)

I don't know what the right way forward is given the current state. I did test the OTLP exporter for internal metrics, and the resource attributes contained the service information I expected. The following is the output from the Go stdout exporter:

  "Resource": [
    {
      "Key": "service.instance.id",
      "Value": {
        "Type": "STRING",
        "Value": "cdc7786f-fcfc-4530-98cd-7278e6dbcf5c"
      }
    },
    {
      "Key": "service.name",
      "Value": {
        "Type": "STRING",
        "Value": "otelcontribcol"
      }
    },
    {
      "Key": "service.version",
      "Value": {
        "Type": "STRING",
        "Value": "0.89.0-dev"
      }
    }
  ],

In other words, for end users that want OTLP for their collector metrics (like @jmacd's original scenario), this would be a good replacement for what is only available today with Prometheus metrics (via OC). For users that want to continue using the Prometheus export functionality, I'm not sure what to recommend.
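
For completeness, here is a sketch of what pushing internal metrics over OTLP can look like via the service::telemetry configuration. This relies on experimental SDK-configuration support at the time of writing, so the exact schema, gating, and endpoint below are assumptions that may differ by collector version:

service:
  telemetry:
    metrics:
      level: detailed
      readers:
        - periodic:
            exporter:
              otlp:
                # Assumed local OTLP/HTTP endpoint; adjust for your backend.
                protocol: http/protobuf
                endpoint: http://127.0.0.1:4318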

@dashpole
Contributor

Do you know why otelcol_target_info has the otelcol prefix? We would really like to keep the name unmodified, if possible.

@codeboten
Contributor Author

Do you know why otelcol_target_info has the otelcol prefix? We would really like to keep the name unmodified, if possible.

It's prefixed here https://github.com/open-telemetry/opentelemetry-collector/blob/445960b82fb4f4a979bfa5e6ce88aa754d22b92d/service/internal/proctelemetry/config.go#L206C1-L206C84

@dashpole
Contributor

I think we can wrap Collect to add resource attributes as additional labels. That is what the prometheus exporter implements. In Collect, we would need to wrap each Metric, and override the Write function. There, we would need to add our additional labels.

It would end up looking similar to https://github.com/prometheus/client_golang/blob/80d3f0b5b36c891016370423516e353b1a5f0a55/prometheus/metric.go#L172, which wraps another metric and adds exemplars to it.

@jpkrohling
Member

During the SIG call, I mentioned that Java implemented a double-emit so that an application could use both a new and an old semantic convention at the same time, to make it easier to migrate between versions without breaking other parts of the observability pipeline. That feature is described here: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/http/README.md

@codeboten
Contributor Author

@jpkrohling I guess this raises the question of what the correct thing to emit is, since there are a couple of differences between the current metrics and the correct ones:

  1. _total suffix should be emitted
  2. target_info should be present and new metrics should not include resource labels

Are you proposing that we have two endpoints that emit these metrics at the same time? I'm not sure what the deprecation story would be here, since it would be hard for users to know this change is coming. We've seen metric changes surprise users in the past, when we tried to change the default behaviour of counter metrics to include the _total suffix.

Maybe the solution is to have a configuration option to use the legacy format?

@codeboten
Contributor Author

@dashpole I followed up on your question about the prefix, and made the change: #8988

I wanted to answer another question from the SIG meeting, which was whether the prometheus receiver is able to use this target_info. Before the change, it would treat otelcol_target_info as any other metric. After #8988, the prometheus receiver is able to use target_info and add its contents to the emitted telemetry. The following is the resource-attribute output of a prometheus receiver scraping the collector's internal metrics and emitting them via the debug exporter:

2023-11-23T12:04:37.361-0800	info	ResourceMetrics #0
Resource SchemaURL:
Resource attributes:
     -> service.name: Str(self)
     -> service.instance.id: Str(127.0.0.1:8889)
     -> net.host.port: Str(8889)
     -> http.scheme: Str(http)
     -> service_instance_id: Str(59dee129-67c4-4981-b077-e4e04aa3e4e6)
     -> service_name: Str(otelcorecol)
     -> service_version: Str(0.89.0-dev)

One more question the above raises is whether or not this is correct. As you can see, there are both a service.name and a service_name attribute, and they differ: service.name is the Prometheus job name configured in the receiver, while service_name comes from the collector's target_info.

@jpkrohling
Member

target_info should be present and new metrics should not include resource labels
...
Maybe the solution is to have a configuration option to use the legacy format?

Perhaps a feature flag can control the three possible scenarios: no target_info and all resource attributes on metrics; target_info and a minimal set of resource attributes on metrics; target_info and all resource attributes.

The third scenario would help people migrate from the previous format to the new one, as they can just ignore the extra resource attributes when building the integration. Once they are ready, they can move to scenario 2, which is what will be supported for the long term.

@dashpole
Contributor

One more question the above raises is whether or not this is correct.

This is correct. Unfortunately, until Prometheus supports UTF-8, we can't auto-translate service_name -> service.name. I would recommend using an additional processor to overwrite service.name with service_name, etc. (a sketch follows below).
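
A minimal sketch of that suggestion using the transform processor from collector-contrib, assuming the attribute names seen in the debug output above (the processor name and OTTL statements are illustrative, not a prescribed implementation):

processors:
  transform/promote-target-info:
    metric_statements:
      - context: resource
        statements:
          # Prefer the collector's own identity (from target_info) over the
          # job/instance-derived values, then drop the underscored duplicates.
          - set(attributes["service.name"], attributes["service_name"]) where attributes["service_name"] != nil
          - set(attributes["service.instance.id"], attributes["service_instance_id"]) where attributes["service_instance_id"] != nil
          - set(attributes["service.version"], attributes["service_version"]) where attributes["service_version"] != nil
          - delete_key(attributes, "service_name")
          - delete_key(attributes, "service_instance_id")
          - delete_key(attributes, "service_version")

In a pipeline this would sit right after the prometheus receiver, for example processors: [transform/promote-target-info, batch].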

@codeboten
Contributor Author

target_info should be present and new metrics should not include resource labels
...
Maybe the solution is to have a configuration option to use the legacy format?

Perhaps a feature flag can control the three possible scenarios: no target_info and all resource attributes on metrics; target_info and a minimal set of resource attributes on metrics; target_info and all resource attributes.

The third scenario would help people migrate from the previous format to the new one, as they can just ignore the extra resource attributes when building the integration. Once they are ready, they can move to scenario 2, which is what will be supported for the long term.

Are you suggesting a single gate that could be used for all three scenarios? If so, then I would recommend that we:

  1. wait until we can produce metrics that are identical using OTel,
  2. enable OTel by default and deprecate this feature gate,
  3. create a new gate that allows users to change the output (open to ideas for a name here),
  4. enable this new gate by default some time in the future.

In other words, I would like to decouple people having to change the metrics they're using from enabling OTel. That being said, it's a bit weird to enable OTel with the old-style metrics, but at least it unblocks that work.

@codeboten
Contributor Author

@dashpole with open-telemetry/opentelemetry-specification#3761 close to merging, do you have any thoughts on how soon this could be adopted in the otel-go prometheus exporter? I would like to avoid adding workarounds in the collector code if possible.

@dashpole
Contributor

I would guess a week or two? I don't think it will be hard to implement.

pantuza pushed a commit to pantuza/opentelemetry-collector that referenced this issue Dec 8, 2023
(open-telemetry#8965)

This ensures backwards compatibility with the metrics generated via
opencensus by default today.

Part of open-telemetry#7454

Signed-off-by: Alex Boten <aboten@lightstep.com>
codeboten pushed a commit that referenced this issue Dec 12, 2023
The metrics are now consistent with the metrics produced by OpenCensus.
We should move the featuregate forward.

Note that the OpenTelemetry generated metrics includes grpc
client/server metrics (for receivers/exporters that use grpc) and
`target_info` metrics

Fixes
#7454

---------

Signed-off-by: Alex Boten <aboten@lightstep.com>
sokoide pushed a commit to sokoide/opentelemetry-collector that referenced this issue Dec 18, 2023
The metrics are now consistent with the metrics produced by OpenCensus.
We should move the featuregate forward.

Note that the OpenTelemetry generated metrics includes grpc
client/server metrics (for receivers/exporters that use grpc) and
`target_info` metrics

Fixes
open-telemetry#7454

---------

Signed-off-by: Alex Boten <aboten@lightstep.com>