OpenTelemetry Support #2000

Open
jaronoff97 opened this issue Jun 13, 2023 · 16 comments
Labels: lifecycle/frozen (indicates that an issue or PR should not be auto-closed due to staleness)

Comments

@jaronoff97

Hello! I'm coming from the https://github.com/open-telemetry/opentelemetry-operator group, and I was wondering if there's a good way to integrate OpenTelemetry support into the OpenShift cluster monitoring operator. The OpenTelemetry operator can seamlessly pull scrape configurations via ServiceMonitors and PodMonitors and slots very easily into existing Prometheus setups. Would there be any interest in OpenTelemetry support here, running collectors so that this operator could push OpenTelemetry metrics?
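For reference, this is roughly what that looks like on the OpenTelemetry operator side today; a minimal sketch using the operator's v1alpha1 CR, with the OTLP endpoint as a placeholder:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: cmo-scraper
  namespace: openshift-monitoring
spec:
  mode: statefulset
  targetAllocator:
    enabled: true
    prometheusCR:
      enabled: true          # discover ServiceMonitors/PodMonitors
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs: []  # populated by the target allocator at runtime
    exporters:
      otlp:
        endpoint: otel-backend.example.com:4317  # placeholder backend
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [otlp]
```

The target allocator distributes the discovered ServiceMonitor/PodMonitor targets across the collector instances, so the collector only takes over the scraping side.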

@pavolloffay
Member

@jaronoff97 thanks for opening the issue.

Let's discuss the overall use case and maybe the alternatives that are available right now.

Is the intention to export metrics from the cluster to a 3rd-party observability vendor? E.g. use CMO to collect metrics, forward them to the OpenTelemetry collector, and from there integrate with 3rd-party vendors? For that use case, the CMO's existing support for the Prometheus remote-write protocol should cover it.
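For context, that remote-write path is configured through the cluster-monitoring-config ConfigMap; a minimal sketch, with the receiving endpoint as a placeholder:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
        - url: "https://otel-collector.example.com/api/v1/receive"  # placeholder endpoint
```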

@simonpasquier
Contributor

I agree with @pavolloffay; it would help if you could provide more details about your use case. Is the goal to replace the existing Prometheus stack with the OTel collector? Prometheus is a cornerstone of the cluster monitoring operator (alerts, dashboards, ...).

@jaronoff97
Author

jaronoff97 commented Jun 20, 2023

@pavolloffay that's the intention, yes. Right now the collector can be installed with the target allocator and Prometheus CR functionality enabled to pull down the ServiceMonitor configuration created by the CMO; however, the collector doesn't have access to the secrets/configmaps created by the CMO out of the box, which requires some pretty tedious wiring (sketched below). Further, the collector doesn't have a Prometheus remote-write receiver yet (issue), so we couldn't have the CMO use that to forward those metrics.

@simonpasquier I'm aware that the CMO exists to install Prometheus for alerts and dashboards; however, for a user who is sending data to a vendor for those capabilities, Prometheus mostly exists as a scraping mechanism (hence the use of the collector).
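To illustrate the wiring mentioned above: the scrape configs generated from the CMO's ServiceMonitors expect TLS material and tokens that the collector pod doesn't get by default, so the CR ends up needing extra mounts by hand. A hypothetical sketch only; the ConfigMap name and mount path here are assumptions for illustration:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: cmo-scraper
  namespace: openshift-monitoring
spec:
  mode: statefulset
  # Mount a CA bundle so the prometheus receiver can verify the TLS endpoints
  # that the generated scrape configs point at (illustrative only).
  volumes:
    - name: serving-certs-ca
      configMap:
        name: openshift-service-ca.crt   # assumed injected service CA bundle
  volumeMounts:
    - name: serving-certs-ca
      mountPath: /etc/prometheus/certs   # hypothetical path referenced by the scrape configs
  # targetAllocator and config omitted; same shape as the earlier sketch
```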

@simonpasquier
Contributor

Does that mean the OTel collector supporting the Prometheus remote-write protocol as a receiver would be good enough?

@jaronoff97
Author

It's not ideal; I would much rather this operator be able to run the OTel collector / OTel operator as an alternative to the default Prometheus components. It would make a customer's experience of using OpenShift with an external vendor much more seamless. If that's not possible, however, I think the Prometheus remote-write idea may be sufficient in the interim.

@smithclay

Chiming in to share some more details re: the ask and @jaronoff97's use case:

We're seeing teams with very large OpenShift clusters standardize all of their telemetry and telemetry infrastructure around OpenTelemetry. In OpenShift and Kubernetes, almost all of these teams also want to scrape the standard cluster metrics from kube-state-metrics, node-exporter, kubelet, etc. and send them to remote endpoints (vendors, internal telemetry pipelines, other clusters, etc.) via the OTLP protocol.

From an ease-of-use and cost perspective, what we hear is that the ideal state is a "drop-in replacement" that swaps out Prometheus for OpenTelemetry collector(s) -- storage/visualization/querying/alerting in this architecture is handled by tools and pipelines that live outside the clusters, so there's no need to run Prometheus.

We think the cluster-monitoring-operator is the best place to give teams the option to swap out Prometheus for a collector that manages and forwards the data outside the cluster. The goal isn't to replace Prometheus in the operator, just to better support teams standardizing on OpenTelemetry that want to adopt an OTel Collector-based approach.
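To make the "drop-in replacement" idea concrete, the collector pipeline looks roughly like the sketch below. The static target and OTLP endpoint are placeholders, and TLS/auth for the in-cluster endpoints is omitted; in practice the targets would come from the existing ServiceMonitors via the target allocator:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        # Placeholder static target; normally supplied by the target allocator
        # from the existing ServiceMonitors/PodMonitors.
        - job_name: kube-state-metrics
          static_configs:
            - targets: ["kube-state-metrics.openshift-monitoring.svc:8443"]
exporters:
  otlp:
    endpoint: telemetry-pipeline.example.com:4317   # vendor or internal pipeline
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```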

@jan--f
Contributor

jan--f commented Jun 23, 2023

@smithclay @jaronoff97 At this point OpenShift requires a Prometheus stack that can be queried, as Simon already mentioned. We are working on removing this interdependence; you can follow https://issues.redhat.com/browse/OBSDA-242 for details and status.

Whether CMO is then the right component for such a pipeline is still an open question, I think.

Do you know of any large-scale performance tests for an OTel collector deployment like this?

@jaronoff97
Author

Is OpenShift constantly querying Prometheus, or does it just expect a Prometheus instance to be present? Are we not able to run Prometheus and the OTel collector side by side? Our goal here is to make it easy for OpenShift customers to install and use an OTel collector when they're not going to be using Prometheus. Is there another OpenShift component that customers could install to accomplish this, which we should be looking into instead?

@jaronoff97
Author

@jan--f wanted to check in and see if there was any interest in doing the above. I'm happy to contribute to this project to enable it. The OTel collector should be able to serve as a drop-in replacement for the Prometheus instance you are currently installing.

@jan--f
Contributor

jan--f commented Jul 14, 2023

> The OTel collector should be able to serve as a drop-in replacement for the Prometheus instance you are currently installing.

That's unfortunately not the case: OpenShift expects a queryable Prometheus instance, and I don't really see the point of running both at the same time.
We are working on removing OpenShift's dependencies on Prometheus; when that is done, this is worth revisiting. It is tracked in https://issues.redhat.com/browse/MON-3152.
https://github.com/rhobs/observability-operator might be a better target for this. However, reusing the ServiceMonitors that CMO uses is not straightforward.

@jaronoff97
Author

jaronoff97 commented Jul 14, 2023

We could also run the OTel collector as the scraper and have it remote-write to the CMO Prometheus instance, so that Prometheus doesn't do any of the scraping, only the querying. This would let customers who want to export their metrics to other backends do so, and prevent the dual scraping you're worried about.
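A hedged sketch of that split is below. The in-cluster endpoint and port are assumptions, and it only works if the CMO Prometheus is run with its remote-write receiver enabled, which I don't believe it is by default:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs: []   # filled in from the CMO ServiceMonitors via the target allocator
exporters:
  prometheusremotewrite:
    endpoint: https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/write  # assumed in-cluster endpoint
  otlp:
    endpoint: vendor.example.com:4317   # optionally also forward to an external backend
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite, otlp]
```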

@jaronoff97
Author

@jan--f I'm running into this same issue again with another OTel installation. Any updates here?

@jan--f
Contributor

jan--f commented Aug 28, 2023

Sorry, what issue is that exactly?
No update on this. There is currently no plan to integrate the OTel collector into CMO; as of now we don't see the benefit of running an additional component. You mentioned exporting metrics to third-party backends, but as of now no one at Red Hat supports exporters that would add such functionality, AFAIK. Going by open-telemetry/opentelemetry-collector#3474, the OTel community has a similar issue.

@jaronoff97
Author

jaronoff97 commented Aug 28, 2023

I'm not sure what the linked issue implies – we have a successful, stable way of exporting traces. I have an open issue with Prometheus to see if they will let us export OTLP: prometheus/prometheus#12633. The issue is that the CMO installs a lot of Prometheus components that customers would like to export as OTLP, and currently that is impossible with the CMO as far as I can tell. Regardless, it's alright; I'm going to push the Prometheus issue and come back to it.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label (denotes an issue or PR has remained open with no activity and has become stale) on Nov 27, 2023
@pavolloffay
Member

/lifecycle frozen

I will have some updates to share soon.

openshift-ci bot added the lifecycle/frozen label and removed the lifecycle/stale label on Nov 27, 2023