
Proposal: Support exponential histograms in the Prometheus exporter #5940

Closed
fstab opened this issue Oct 26, 2023 · 15 comments
Labels: Feature Request

Comments

@fstab (Member) commented Oct 26, 2023

Problem

The Prometheus and OpenMetrics Compatibility Spec says:

An OpenTelemetry Exponential Histogram with a cumulative aggregation temporality MUST be converted to a Prometheus Native Histogram.

I understand this is currently not implemented in the Prometheus exporter in the OTel Java SDK: Exponential histograms are dropped.

I would like to add OpenTelemetry Exponential Histogram to Prometheus Native Histogram conversion. As Prometheus Native histograms can only be exposed in Prometheus Protobuf format, I would also add support for Prometheus protobuf as an additional exposition format.
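For readers less familiar with the data model, here is a small, self-contained illustration (my own sketch, not code from either project) of why a direct conversion is possible: an exponential histogram with scale s uses base = 2^(2^-s), and bucket index i covers the range (base^i, base^(i+1)]; Prometheus native histograms use the same base-2 exponential bucketing scheme, with only minor differences in index conventions.

```java
// Illustration only (not code from either project): bucket boundaries of an
// exponential histogram. For scale s, base = 2^(2^-s) and bucket index i
// covers (base^i, base^(i+1)]. Prometheus native histograms use the same
// base-2 exponential bucketing, which is what makes a direct conversion possible.
public class ExponentialBuckets {

  static double base(int scale) {
    return Math.pow(2.0, Math.pow(2.0, -scale));
  }

  static double upperBound(int scale, int index) {
    // inclusive upper boundary of bucket 'index'
    return Math.pow(base(scale), index + 1);
  }

  public static void main(String[] args) {
    int scale = 3; // base ≈ 1.0905
    for (int i = 0; i < 4; i++) {
      System.out.printf("bucket %d: (%.4f, %.4f]%n",
          i, Math.pow(base(scale), i), upperBound(scale, i));
    }
  }
}
```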

However, before I start working on it, I'm creating this issue. Please let me know what you think.

Proposed Solution

I'm one of the maintainers of the Prometheus Java client library, and we recently released version 1.0.0. The Prometheus Java client library is owned by the Prometheus team under CNCF governance.

The new architecture of the 1.0.0 release separates the implementation of the Prometheus data model and the Prometheus metrics library, as described on https://prometheus.github.io/client_java/internals/model/:

[Diagram: prometheus-client-java-model]

  • prometheus-metrics-core is the metrics library. The OpenTelemetry SDK would not need a dependency on this module.
  • prometheus-metrics-model contains the read-only, immutable Prometheus metric snapshots produced during scraping.
  • the exposition format modules convert the snapshots into different exposition formats, such as Prometheus text format, OpenMetrics text format, or Prometheus protobuf format.

My proposal is to refactor OpenTelemetry Java's Prometheus exporter to produce snapshots as defined in prometheus-metrics-model, and to use the Prometheus exposition format modules to convert these snapshots to Prometheus text format, OpenMetrics format, or Prometheus protobuf format, depending on the Accept header in the scrape request.
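To make the proposed data flow concrete, here is a rough sketch of what the scrape handler could look like. The Prometheus type and method names below reflect my reading of the client_java 1.x modules and may not match the released API exactly, and the MetricData-to-snapshot converter (the part the exporter would own) is left out.

```java
// Sketch only: the prometheus types below (MetricSnapshots, ExpositionFormats,
// ExpositionFormatWriter) are assumptions based on the client_java 1.x docs and
// may not match the released API exactly.
import java.io.IOException;
import java.io.OutputStream;
import java.util.Collection;

import io.opentelemetry.sdk.metrics.data.MetricData;
import io.prometheus.metrics.expositionformats.ExpositionFormatWriter;
import io.prometheus.metrics.expositionformats.ExpositionFormats;
import io.prometheus.metrics.model.snapshots.MetricSnapshots;

class PrometheusScrapeHandler {

  private final ExpositionFormats expositionFormats = ExpositionFormats.init();

  void handleScrape(Collection<MetricData> metrics, String acceptHeader, OutputStream out)
      throws IOException {
    // 1. Convert OTel MetricData into prometheus-metrics-model snapshots.
    MetricSnapshots snapshots = convert(metrics);

    // 2. Pick the exposition format from the Accept header: Prometheus text,
    //    OpenMetrics text, or Prometheus protobuf (required for native histograms).
    ExpositionFormatWriter writer = expositionFormats.findWriter(acceptHeader);

    // 3. Serialize the snapshots in the negotiated format.
    writer.write(out, snapshots);
  }

  private MetricSnapshots convert(Collection<MetricData> metrics) {
    // The actual mapping (sums, gauges, histograms, exponential histograms ->
    // snapshots) is the part this proposal would add; not sketched here.
    throw new UnsupportedOperationException("not sketched here");
  }
}
```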

I'm happy to maintain the Prometheus exporter long term if you need a maintainer.

The immediate benefit is support for Native histograms.

However, there's also a long-term benefit: Prometheus exposition formats are currently in flux. There is a decision to add a new OpenMetrics format, but it isn't specified yet. If the OpenTelemetry Java SDK uses the exposition format modules from the Prometheus Java client library, adding support for future Prometheus exposition formats will be a trivial change, because you can just use the upstream dependency from the Prometheus project.

Let me know what you think. I'm happy to create a PR if you think this is a good idea.

fstab added the Feature Request label on Oct 26, 2023
@jack-berg (Member) commented:
Hi @fstab - what you propose would definitely simplify the code. A couple of downsides I see:

  • Increased memory allocation resulting from transforming OpenTelemetry MetricData to the prometheus-metrics-model immutable snapshots. I know there are some users that are sensitive about memory allocation.
  • Potential for library version conflicts. What happens if a user relies on prometheus client library and OpenTelemetry API / SDK at the same time? OpenTelemetry would essentially be imposing some minimal prometheus client library requirement on users based on the version we need to transform to the prometheus-metrics-model.

Overall, I think I favor this solution, despite its downsides. I like the simplification this brings to the prometheus exporter. It allows maintainers of this project to more easily keep it up to date without requiring as much deep expertise on the different prometheus exposition formats. IMO, we can / should optimize the OTLP exporters, and encourage any users with strict memory allocation requirements or prometheus library version requirements to switch to OTLP instead.

@fstab (Member, Author) commented Oct 30, 2023

Thanks @jack-berg, I'll create a PR.

@asafm (Contributor) commented Oct 31, 2023

Hi @fstab - @jack-berg is helping with a big project I'm working on: integrating OTel into Apache Pulsar (a low-latency messaging/streaming system). As part of this project, I'm in the implementation phase of a proposal I made to OTel to introduce a new mode for latency-sensitive systems called Memory Mode: it keeps the existing behavior (IMMUTABLE) and adds a new one called REUSABLE. Its main goal is to eliminate any O(#Attributes) memory allocations, which otherwise cause the garbage collector to kick in more often. With a high enough #Attributes this affects the latency of latency-sensitive systems such as Apache Pulsar. You can read about it here. For async instruments (already merged) I reduced memory allocations by 99.98% compared to IMMUTABLE.

Your suggestion worries me because it cannot support the REUSABLE memory mode. If we use this method in IMMUTABLE memory mode, we're basically adding another source of O(#Attributes) memory allocations. Users choosing IMMUTABLE (presumably with very low #Attributes cardinality overall) probably wouldn't notice that.

The REUSABLE memory mode would need a different way of doing this. Aside from allocating data structures on every collection cycle, another big problem, I presume, is that you have to convert every Attributes object to its Prometheus flavor - either on each cycle, or by somehow caching the conversion, which costs more memory, probably twice as much, since it's one cached copy per Attributes.
Asking your library to support MetricData as-is probably wouldn't make sense.
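To illustrate what I mean, here is a toy sketch (not the actual SDK classes): IMMUTABLE builds a fresh point object per attribute set on every collection cycle, while REUSABLE keeps one mutable holder per series and overwrites it in place.

```java
// Toy illustration of the two memory modes (not the actual SDK classes).
// IMMUTABLE: a new point object per attribute set on every collection cycle.
// REUSABLE: one pre-allocated, mutable holder per series, overwritten in place,
// so steady-state collection allocates nothing per cycle.
import java.util.HashMap;
import java.util.Map;

class ReusableLongPoint {
  long value;
  long epochNanos;

  void set(long value, long epochNanos) {
    this.value = value;
    this.epochNanos = epochNanos;
  }
}

class CounterStorage {
  // One holder per attribute set (series), created once and then reused.
  private final Map<String, ReusableLongPoint> reusablePoints = new HashMap<>();

  ReusableLongPoint collectReusable(String attributesKey, long value, long nowNanos) {
    ReusableLongPoint point =
        reusablePoints.computeIfAbsent(attributesKey, k -> new ReusableLongPoint());
    point.set(value, nowNanos); // overwrite in place: no per-cycle allocation
    return point;
  }
}
```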

The way I see it:

  • OTel is an emerging standard that IMO will become the de facto way of using metrics. It will take time, but it will.
  • It starts by offering Prometheus and OTLP out of the box, and may offer more exporters in the future, although the logic seems to be that most of the heavy lifting should be done in the collector.
  • It should have almost zero dependencies: it is going to see such mass usage, across so many different devices and types of software, that you want to control every aspect of it as much as possible.
  • The OOTB protocols are well defined: both OTLP and Prometheus/OpenMetrics. Once you have spent the time supporting them, the rate at which they change shouldn't be high.
  • So the OTel Java SDK's current design choice - writing it on your own, end to end - makes sense.
  • Memory Mode is a clear example: I could never have designed it without the OTel Java SDK owning all the logic end to end. If I had needed to implement it in the Prometheus Java library or a protobuf serializer library, I would not have started.
  • It's a lot of power, and I think it makes a lot of sense. The future may bring more requirements, hence controlling things end to end makes total sense to me.

So I think this proposal collides with the new memory mode, which is in the middle of being implemented.

@fstab (Member, Author) commented Oct 31, 2023

Thanks @asafm. I'm not sure if I understand your project correctly. Is your goal to expose metrics in Prometheus format for scraping? If that's the case, keep in mind that the Prometheus server will request protobuf format when exponential histograms are enabled, so as soon as we add exponential histogram support (as required by the spec) we must add support for protobuf to the Prometheus exporter.

@jack-berg (Member) commented:
@asafm is trying to reduce allocations throughout the OpenTelemetry java metrics SDK pipeline. We've made a lot of progress in the SDK itself, and with some additional work in flight, allocations will be very low (near zero). The next area that needs attention would be the exporters themselves. You can imagine that if you have a large number of series, solving allocations in the SDK but not in the exporter only solves half the problem: a memory inefficient exporter may still cause many allocations while mapping to an internal representation before serialization.

I believe this is Asaf's concern with this proposed change to the prometheus exporter. The prometheus exporter implementation should currently be quite efficient from a memory standpoint, since it directly serializes MetricData to the prometheus representation without any (or minimal) additional internal representation. IIUC, this proposal would convert each MetricData to an equivalent representation from prometheus-metrics-model before serialization, causing allocations. As a maintainer of this repo, I like that this simplifies the prometheus exporter by offloading the job of serializing to the prometheus exposition format. In my head, I can justify the additional allocations in the prometheus exporter by instead focusing on a memory efficient OTLP exporter, and telling memory sensitive users to use the OTLP exporter if they care. For prometheus users, this really only works if prometheus continues the effort to support native OTLP ingest.

@asafm if prometheus support for native OTLP ingest continues to mature / stabilize, surely that diminishes the importance of an optimized prometheus exporter, no?

@fstab (Member, Author) commented Nov 3, 2023

Thanks a lot!

If caching and reusing prometheus-metrics-model objects helps improve performance, we could do this in the Prometheus library as well.

I don't find it easy to understand what effect object pools have. Garbage collection algorithms mark live objects, not unreferenced objects. So intuitively the more objects you keep alive the more work a GC has to do. But I guess this is not so straightforward in practice.

Anyway, looking forward to seeing the results of the experiment, and if it works well we could implement something similar upstream in the Prometheus library.

In the meantime, if there are no objections, I'm happy to create a PR.

@asafm (Contributor) commented Nov 5, 2023

@asafm if prometheus support for native OTLP ingest continues to mature / stabilize, surely that diminishes the importance of an optimized prometheus exporter, no?

This effort will be a very long one, I presume, not counting the time the entire ecosystem will need to align on it - this can easily be two years including the ecosystem (M3DB, Cortex, VictoriaMetrics).
In the meantime, I wouldn't want to sacrifice the Prometheus endpoint, which in my opinion will be the go-to exporter in the beginning for many users, since they will want to migrate their stack gradually. Apache Pulsar users are a good example: they all scrape /metrics. If in the next major Pulsar version I tell them they must install an OTel collector because effectively only OTLP is supported (since the Prometheus exporter would be an allocation hazard, or, as @fstab suggested, it would keep a cache of its own data structures/Attributes and hence double the memory required), it will create a heavy backlash.

As with anything in life, change is better in small iterations.
Minimizing memory allocations was introduced as another memory mode, so it does not impact existing users at all unless they choose it.

The same should apply to this proposal. As it stands, it is not an incremental step, because it carries a severe cost for users of the Prometheus exporter.

I suggest this proposal be implemented after OTLP support has been added and is GA across Prometheus and the major ecosystem players.
For now, to address your requirement of supporting exponential histograms: adding this is a small effort compared to what has already been contributed and exists in the current Prometheus exporter. We're talking about taking an existing data structure for exponential histograms and serializing it. Hence we have a way to move forward with the original requirement without harming the memory mode effort, which is already 50% implemented and in progress.
I presume once the code is done, you wouldn't change it unless the OTel spec itself or the protocol itself changes, which I think it's fair to assume would happen very infrequently. Hence the overall effort is not big.

Again, please take into account the other disadvantage of adding a dependency to the OTel Java SDK, as I wrote above:

  • OTel is an emerging standard that IMO will become the de facto way of using metrics. It will take time, but it will.
  • It starts by offering Prometheus and OTLP out of the box, and may offer more exporters in the future, although the logic seems to be that most of the heavy lifting should be done in the collector.
  • It should have almost zero dependencies: it is going to see such mass usage, across so many different devices and types of software, that you want to control every aspect of it as much as possible.
  • The OOTB protocols are well defined: both OTLP and Prometheus/OpenMetrics. Once you have spent the time supporting them, the rate at which they change shouldn't be high.
  • So the OTel Java SDK's current design choice - writing it on your own, end to end - makes sense.
  • Memory Mode is a clear example: I could never have designed it without the OTel Java SDK owning all the logic end to end. If I had needed to implement it in the Prometheus Java library or a protobuf serializer library, I would not have started.
  • It's a lot of power, and I think it makes a lot of sense. The future may bring more requirements, hence controlling things end to end makes total sense to me.

@fstab The JMH benchmark shows object pooling reducing allocation rates by 99.98%. Also, production experience with Pulsar shows that with 10k - 100k attribute sets, the latency it experiences due to GC pauses is severe. If you read the M3DB code, for example, you'll see that they also use object pooling quite extensively. Latency-sensitive systems such as databases and streaming systems are very sensitive to memory allocations.
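For reference, a minimal JMH setup for this kind of allocation measurement looks roughly like the sketch below; the benchmarked method is a placeholder rather than the actual SDK benchmark, and the allocation numbers come from the GC profiler's gc.alloc.rate.norm metric.

```java
// Minimal JMH sketch for measuring allocation rate. Run with the GC profiler
// (e.g. `java -jar benchmarks.jar -prof gc`) and compare gc.alloc.rate.norm
// between the two memory modes. The collectCycle body is a placeholder, not
// the real SDK benchmark.
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class CollectionBenchmark {

  @Benchmark
  public Object collectCycle() {
    // Placeholder: the real benchmark would run one metrics collection/export
    // cycle in either IMMUTABLE or REUSABLE memory mode.
    return new long[0];
  }
}
```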

@fstab (Member, Author) commented Nov 27, 2023

PR here: #6015

@asafm (Contributor) commented Nov 27, 2023

@fstab @jack-berg As mentioned before, this PR will kill the ability to continue with REUSABLE_DATA, a feature being developed right now. From my perspective it's a big problem to merge it.

@fstab (Member, Author) commented Nov 27, 2023

Hi @asafm, I'm not really sure what you are proposing. Do you have a better way of implementing support for exponential histograms in the Prometheus endpoint, or are you proposing not to implement support for exponential histograms at all?

@asafm (Contributor) commented Nov 27, 2023

@fstab Just implement it within the existing code like all the other instrument types. No external library. It's not such a big effort.

@fstab (Member, Author) commented Nov 27, 2023

I don't think it is possible to implement exponential histograms like all the other instrument types. Exponential histograms are fundamentally different; that's why the Prometheus community decided to introduce a new exposition format for them. I think it will be a big effort, and I believe re-implementing this in the OpenTelemetry SDK rather than using the upstream implementation will introduce a lot of additional maintenance effort without solving any issue.

@asafm (Contributor) commented Nov 27, 2023

I disagree and answered here.
Frankly, I don't understand why you submitted the PR without getting consensus on the proposal here first.

@asafm (Contributor) commented Nov 27, 2023

We can continue the thread on the PR.

@jack-berg (Member) commented:
Resolved in #6015.
