opentelemetry: support publishing metrics #2185

Merged
bryangarza merged 21 commits into tokio-rs:master from the opentelemetry-metrics branch on Jul 12, 2022

Conversation

@bryangarza (Member) commented Jun 29, 2022

Motivation

Currently, there is no way to publish metrics via tracing.

Solution

Rendered docs

Update the tracing-opentelemetry crate to publish metrics for event fields that contain specific prefixes in the name.

Right now, we lazily instantiate and store one metrics object per-callsite, but a future improvement that we should add to tracing itself is the ability to store data per-callsite, so that we don't have to do a HashMap lookup each time we want to publish a metric.
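
To make the caching idea concrete, here is a rough sketch of what "one metrics object per callsite" means in practice; the types are simplified stand-ins, not the PR's actual code, with a plain u64 standing in for an OpenTelemetry counter handle:

use std::collections::HashMap;

// Simplified per-callsite instrument cache: the instrument is created lazily
// the first time a field name is seen, and every later event pays a HashMap
// lookup keyed by that (static) name.
#[derive(Default)]
struct Instruments {
    u64_counters: HashMap<&'static str, u64>,
}

impl Instruments {
    fn record_u64(&mut self, name: &'static str, value: u64) {
        // First event from a callsite creates the instrument; every event
        // after that is just a lookup in the map.
        *self.u64_counters.entry(name).or_insert(0) += value;
    }
}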

Example

It's this simple:

info!(MONOTONIC_COUNTER_API_REQUESTS = 1 as u64);

Other notes

  • Consider this a V1 that works end-to-end, but we will probably want follow-up PRs on top of this.
  • If we had support for Valuable in the master branch, we could use an Enum instead of string prefixes (def. something we can follow up with)
  • I did not add support for OTel observers (docs), since the same metrics events should be consumable by other subscribers such as tokio-console; they should not be specific to OTel. If someone wants to use observers, they should use the rust-opentelemetry crate directly in their code (which might require changing the code slightly so that the application has access to the Meter)
  • Right now this is only for Events, but it should take very little additional work to also support Spans (follow-up PR?)
  • I have a separate repo where I was testing this out, you can find it here: https://github.com/bryangarza/tracing-metrics-feature-test

Currently, there is no way to publish metrics via tracing.

Update the tracing-opentelemetry crate to publish metrics for event
fields that contain specific prefixes in the name.

Right now, we lazily instantiate and store one metrics object
per-callsite, but a future improvement that we should add to tracing
itself is the ability to store data per-callsite, so that we don't have
to do a HashMap lookup each time we want to publish a metric.
@jtescher (Collaborator) commented:

Great to get started thinking about how otel-metrics fits into tracing. As the metrics impl is still in progress, you may have an easier time maintaining a separate subscriber/layer that you can update without worrying as much about breaking changes / stability guarantees; e.g. open-telemetry/opentelemetry-rust#819 is a fairly substantial change to the otel metrics API, and configuring metric views / multiple metric exporters is still to be built out.

Another option could be a feature flag that signals the instability / warns about using it currently, but it may be more trouble than it's worth until the metrics side stabilizes a bit more.

This patch moves out all the metrics-related code to its own layer, and
also adds integration tests. Unfortunately, the assertions don't
currently fail the tests because the panics are captured by the closure.

I tried using an Arc Mutex to store the Metrics data and assert from the
test functions themselves, but for some reason, the closure runs after
the assertions in that case, and the tests fail because of that race
condition. That is, the closure which should update the Arc Mutex does
so too late in the execution of the test.
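
For reference, this is roughly the Arc/Mutex pattern that was attempted; the names are illustrative, not the actual test code, and the point is that the assertion only works if the exporter closure has already run:

use std::sync::{Arc, Mutex};

fn shared_state_assertion_sketch() {
    let exported: Arc<Mutex<Vec<String>>> = Arc::new(Mutex::new(Vec::new()));

    let sink = Arc::clone(&exported);
    // In the real test this closure would run inside MetricsExporter::export.
    let export = move |metric_name: &str| {
        sink.lock().unwrap().push(metric_name.to_owned());
    };

    export("hello_world");

    // The race described above: if the exporter has not run yet when this
    // assertion executes, the Vec is still empty and the test fails.
    assert!(!exported.lock().unwrap().is_empty());
}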
@bryangarza (Member, Author) commented:

Great to get started thinking about how otel-metrics fits into tracing. As the metrics impl is still in progress, you may have an easier time maintaining a separate subscriber/layer that you can update without worrying as much about breaking changes / stability guarantees; e.g. open-telemetry/opentelemetry-rust#819 is a fairly substantial change to the otel metrics API, and configuring metric views / multiple metric exporters is still to be built out.

Another option could be a feature flag that signals the instability / warns about using it currently, but it may be more trouble than it's worth until the metrics side stabilizes a bit more.

Thanks @jtescher for the comment. I thought about what you said and liked the suggestion of putting this code in a separate Layer. I pushed a new commit that moves it out to OpenTelemetryMetricsSubscriber.

Tomorrow I will write some docs, then it should be in a mergeable state.

@bryangarza (Member, Author) commented Jul 1, 2022

I ran into an issue with my integration tests where if I put the assertions into the test function, the assertions seem to run before MetricsExporter::export, making it impossible to pass values from its closure into some shared state that the test function can check. As far as I know though, tokio::test runs single-threaded by default. Do I need to do something to force the MetricsExporter::export method to be called eagerly? I'm guessing that rust-opentelemetry is waiting for some buffer to fill before flushing the metrics out? cc @jtescher @carllerche

This patch adds documentation showing users how to utilize the new
subscriber, and also handles the case where we receive an `i64` for a
MonotonicCounter.
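
One plausible shape for the i64 handling is sketched below; this is a guess at the fix, not the patch itself, since a monotonic counter is u64-based and a negative increment is invalid for it:

// Hypothetical helper: convert an i64 field value for a monotonic counter,
// dropping negative values rather than wrapping them.
fn monotonic_increment_from_i64(value: i64) -> Option<u64> {
    u64::try_from(value).ok()
}

// monotonic_increment_from_i64(1)  == Some(1)
// monotonic_increment_from_i64(-1) == None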
@bryangarza bryangarza marked this pull request as ready for review July 1, 2022 23:58
@bryangarza bryangarza requested review from jtescher and a team as code owners July 1, 2022 23:58
@bryangarza (Member, Author) commented Jul 2, 2022

This morning I tried to improve the tests again, but the CheckpointSet behavior is not consistent: sometimes, when MetricsExporter::export is called, there is data available; most of the time, the CheckpointSet is empty. I have a separate branch in case anyone is curious, and there's a bit more explanation in the commit message bryangarza@4804dce

Other than that, I think this is ready to merge (if the CI passes) @carllerche

This patch just fixes lints that show up during the PR's CI, and adds a
couple more unit tests to make sure that `u64` and `i64` are handled
properly by the Subscriber.
Rustdocs now show how to instantiate and register an
`OpenTelemetryMetricsSubscriber` so that it can be used.
This patch reduces the number of allocations in the `MetricVisitor` by
making use of the fact that the metadata for each callsite has a static
name. This means that we don't need to convert to a `String` and can
instead directly use the `&'static str` in the `HashMap`s.
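
In other words, the change is roughly the following (illustrative type aliases, not the crate's exact definitions):

use std::collections::HashMap;

// Before: every lookup built an owned String from the field name.
type MetricsMapOwned<T> = HashMap<String, T>;

// After: the callsite's field name is already a &'static str, so the map can
// borrow it directly and no allocation is needed per recorded metric.
type MetricsMap<T> = HashMap<&'static str, T>;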
@bryangarza (Member, Author) commented:

Got some good feedback from @hlbarber (thanks!), which I've now incorporated into the PR (using &'static str in the MetricVisitor to reduce allocations).

This commit makes some changes that should significantly reduce the
performance overhead of the OpenTelemetry metrics layer. In particular:

* The current code will allocate a `String` with the metric name _every_
  time a metric value is recorded, even if this value already exists.
  This is in order to use the `HashMap::entry` API. However, the
  performance cost of allocating a `String` and copying the metric
  name's bytes into that string is almost certainly worse than
  performing the hashmap lookup a second time, and that overhead occurs
  *every* time a metric is recorded.

  This commit changes the code for recording metrics to perform hashmap
  lookups by reference. This way, in the common case (where the metric
  already exists), we won't allocate. The allocation only occurs when a
  new metric is added to the map, which is infrequent.

* The current code uses a `RwLock` to protect the map of metrics.
  However, because the current code uses the `HashMap::entry` API,
  *every* metric update must acquire a write lock, since it may insert a
  new metric. This essentially reduces the `RwLock` to a `Mutex` ---
  since every time a value is recorded, we must acquire a write lock, we
  are forcing global synchronization on every update, the way a `Mutex`
  would. However, an OpenTelemetry metric can have its value updated
  through an `&self` reference (presumably metrics are represented as
  atomic values?). This means that the write lock is not necessary when
  a metric has already been recorded once, and multiple metric updates
  can occur without causing all threads to synchronize.

  This commit changes the code for updating metrics so that the read
  lock is acquired when attempting to look up a metric. If that metric
  exists, it is updated through the read lock, allowing multiple metrics
  to be updated without blocking every thread. In the less common case
  where a new metric is added, the write lock is acquired to update the
  hashmap.

* Currently, a *single* `RwLock` guards the *entire* `Instruments`
  structure. This is unfortunate. Any given metric event will only touch
  one of the hashmaps for different metric types, so two distinct types
  of metric *should* be able to be updated at the same time. The big
  lock prevents this, as a global write lock is acquired that prevents
  *any* type of metric from being updated.

  This commit changes this code to use more granular locks, with one
  around each different metric type's `HashMap`. This way, updating (for
  example) an `f64` counter does not prevent a `u64` value recorder from
  being updated at the same time, even when both metrics are inserted
  into the map for the first time.
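
A condensed sketch of the locking scheme described above, using std types only (not the PR's exact code); an AtomicU64 stands in for an OpenTelemetry counter handle, which can likewise be updated through &self:

use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;

#[derive(Default)]
struct Instruments {
    // One lock per metric type, so updating a u64 counter never blocks
    // updates to the other maps (f64 counters, value recorders, ...).
    u64_counters: RwLock<HashMap<&'static str, AtomicU64>>,
}

impl Instruments {
    fn add_u64(&self, name: &'static str, value: u64) {
        // Fast path: the metric already exists, so a shared read lock is
        // enough and concurrent updates do not serialize on a write lock.
        {
            let read = self.u64_counters.read().unwrap();
            if let Some(counter) = read.get(name) {
                counter.fetch_add(value, Ordering::Relaxed);
                return;
            }
        } // read guard dropped here, before any write lock is requested

        // Slow path: first time this metric is seen; take the write lock to insert.
        self.u64_counters
            .write()
            .unwrap()
            .entry(name)
            .or_insert_with(|| AtomicU64::new(0))
            .fetch_add(value, Ordering::Relaxed);
    }
}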
@hawkw (Member) left a comment

i noticed some potential performance issues with the current implementation that seem fairly significant. i've opened a PR against this branch (bryangarza#1) that rewrites some of this code to reduce the overhead of string allocation and global locking.

(Three inline review comments on tracing-opentelemetry/src/metrics.rs, since resolved.)
@bryangarza (Member, Author) commented:

I cancelled the 5 CI tests that were running because it seems that they are now hanging. I tried this out locally as well, but need to dig into it further.

This patch makes the read lock go out-of-scope before we block on
obtaining the write lock. Without this change, trying to acquire a write
lock hangs forever.
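
The bug and the fix, in miniature (illustrative, using std's RwLock, whose write() can block forever if the same thread still holds a read guard on the same lock):

use std::collections::HashMap;
use std::sync::RwLock;

fn buggy(map: &RwLock<HashMap<&'static str, u64>>) {
    let read = map.read().unwrap();
    if !read.contains_key("requests") {
        // Bug: `read` is still alive here, so this call can hang forever.
        map.write().unwrap().insert("requests", 0);
    }
}

fn fixed(map: &RwLock<HashMap<&'static str, u64>>) {
    // The temporary read guard is dropped at the end of this statement.
    let missing = !map.read().unwrap().contains_key("requests");
    if missing {
        map.write().unwrap().insert("requests", 0);
    }
}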
@bryangarza bryangarza requested a review from hawkw July 7, 2022 00:54
@bryangarza bryangarza requested a review from hawkw July 8, 2022 16:55
This patch updates the `MetricsMap` type to use static strings; this
will reduce the number of allocations needed for the metrics processing.
@carllerche (Member) left a comment

LGTM, thanks

@hawkw (Member) requested changes on Jul 9, 2022

I have basically one remaining concern about this PR, which is that some aspects of the current implementation of the metrics code make it more or less impossible to implement metrics that follow OpenTelemetry's metric naming conventions.

In particular, the metric names in this PR are

  • required to start with a prefix, which is part of the name of the metric that's output to OpenTelemetry
  • uppercase, for some reason?

I think that using prefixes to indicate the type of the metric is fine for now (although I'm definitely open to nicer ways of specifying metric types in the future), but I think we should change the code for recognizing metrics based on field name prefixes so that the metric that's actually emitted to the OpenTelemetry collector doesn't include a prefix like MONOTONIC_COUNTER in its name. We could do this by using str::strip_prefix in the visitor implementation, rather than starts_with, and using the field name without the prefix as the hashmap key.

In addition, I think we may want to change the prefixes we recognize. Both tracing and OpenTelemetry support dots in field names, and semantically, a dot can be used as a namespacing operator. I think we should probably use prefixes like monotonic_counter. rather than MONOTONIC_COUNTER_ to more clearly separate the metric type from the rest of the metric name.

So, in summary, I think we should:

  • make the metric name prefixes lowercase
  • make the metric name prefixes end in a .
  • strip the prefix when matching metric names before using them as hashmap keys
  • change the metric names in the various examples and tests to be all lowercase

and then I'll be very happy to merge this PR!
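
As a sketch of the suggested matching, with strip_prefix instead of starts_with (the prefix constant and function name are illustrative, not the PR's code):

const MONOTONIC_COUNTER_PREFIX: &str = "monotonic_counter.";

// Returns the metric name with the type prefix removed, or None if the field
// is not a monotonic counter. The stripped name is what gets used as the
// hashmap key and exported to the OpenTelemetry collector.
fn as_monotonic_counter(field_name: &'static str) -> Option<&'static str> {
    field_name.strip_prefix(MONOTONIC_COUNTER_PREFIX)
}

// as_monotonic_counter("monotonic_counter.api_requests") == Some("api_requests")
// as_monotonic_counter("api_requests") == None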

(Seven inline review comments on tracing-opentelemetry/src/metrics.rs, since resolved.)
pub fn new(push_controller: PushController) -> Self {
    let meter = push_controller
        .provider()
        .meter(INSTRUMENTATION_LIBRARY_NAME, Some(CARGO_PKG_VERSION));
Member:

does this add additional dimensions to the emitted metrics?

Member Author:

yeah it's something that opentelemetry includes during metrics exporting

This patch changes the metric name prefixes from upper snake case to
lowercase with a `.` at the end, to align with the OpenTelemetry metric
naming conventions.
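
With the renamed prefixes, the example from the PR description becomes the following; the metric is exported as "api_requests", and the event only has an effect once the metrics subscriber is installed:

use tracing::info;

fn main() {
    info!(monotonic_counter.api_requests = 1u64);
}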
@bryangarza (Member, Author) commented Jul 12, 2022

  • make the metric name prefixes lowercase

  • make the metric name prefixes end in a .

  • strip the prefix when matching metric names before using them as hashmap keys

  • change the metric names in the various examples and tests to be all lowercase

@hawkw, thanks -- done in one of my recent commits

bryangarza and others added 6 commits July 11, 2022 17:16
Co-authored-by: David Barsky <me@davidbarsky.com>
This patch renames OpenTelemetryMetricsSubscriber to
MetricsSubscriber, to avoid the redundant word.
Co-authored-by: David Barsky <me@davidbarsky.com>
- Remove example of what *not* to do
- Change wording of a rustdoc that was too verbose
@bryangarza bryangarza requested a review from hawkw July 12, 2022 00:35
@bryangarza bryangarza enabled auto-merge (squash) July 12, 2022 00:38
@bryangarza bryangarza dismissed hawkw’s stale review July 12, 2022 01:09

I addressed the PR comments in a new commit, 1e26171

@bryangarza bryangarza merged commit 6af941d into tokio-rs:master Jul 12, 2022
@bryangarza bryangarza deleted the opentelemetry-metrics branch July 12, 2022 01:09
hawkw added a commit that referenced this pull request Jul 20, 2022
In the upstream `opentelemetry` crate, the `trace` and `metrics`
features are gated by separate feature flags. This allows users who are
only using OpenTelemetry for tracing, or who are only using it for
metrics, to pick and choose what they depend on.

Currently, the release version of `tracing-opentelemetry` only provides
tracing functionality, and therefore, it only depends on `opentelemetry`
with the `trace` feature enabled. However, the metrics support added in
#2185 adds a dependency on the `opentelemetry/metrics` feature. This is
currently always enabled. We should probably follow the same approach as
upstream `opentelemetry`, and allow enabling/disabling metrics and
tracing separately.

This branch adds a `metrics` feature to `tracing-opentelemetry`, and
makes the `MetricsSubscriber` from #2185 gated on the `metrics` feature.
This feature flag is on by default, like the upstream
`opentelemetry/metrics` feature, but it can be disabled using
`default-features = false`.

We should probably do something similar for the tracing components of
the crate, and make them gated on a `trace` feature flag, but adding a
feature flag to released APIs is not semver-compatible, so we should
save that until the next breaking release.
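
A minimal sketch of what that gating might look like inside the crate (illustrative module layout, not the actual source):

// Only compiled when the `metrics` Cargo feature is enabled, mirroring the
// upstream `opentelemetry/metrics` feature.
#[cfg(feature = "metrics")]
pub mod metrics {
    pub struct MetricsSubscriber { /* ... */ }
}

#[cfg(feature = "metrics")]
pub use metrics::MetricsSubscriber;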
hawkw added a commit that referenced this pull request Jul 29, 2022
Motivation:
Currently, there is no way to publish metrics via tracing.

Solution:
Update the tracing-opentelemetry crate to publish metrics for event
fields that contain specific prefixes in the name.

Right now, we lazily instantiate and store one metrics object
per-callsite, but a future improvement that we should add to tracing
itself is the ability to store data per-callsite, so that we don't have
to do a HashMap lookup each time we want to publish a metric.

Co-authored-by: Eliza Weisman <eliza@buoyant.io>
Co-authored-by: David Barsky <me@davidbarsky.com>
@hawkw hawkw mentioned this pull request Jul 29, 2022