opentelemetry: support publishing metrics #2185

Merged
bryangarza merged 21 commits into tokio-rs:master from the opentelemetry-metrics branch on Jul 12, 2022

Conversation

@bryangarza (Member) commented Jun 29, 2022

Motivation

Currently, there is no way to publish metrics via tracing.

Solution

Rendered docs

Update the tracing-opentelemetry crate to publish metrics for event fields that contain specific prefixes in the name.

Right now, we lazily instantiate and store one metrics object per-callsite, but a future improvement that we should add to tracing itself is the ability to store data per-callsite, so that we don't have to do a HashMap lookup each time we want to publish a metric.
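
To make the caching idea concrete, here is a rough sketch of what "one metrics object per callsite" means in practice; the types are simplified stand-ins, not the PR's actual code, with a plain u64 standing in for an OpenTelemetry counter handle:

use std::collections::HashMap;

// Simplified per-callsite instrument cache: the instrument is created lazily
// the first time a field name is seen, and every later event pays a HashMap
// lookup keyed by that (static) name.
#[derive(Default)]
struct Instruments {
    u64_counters: HashMap<&'static str, u64>,
}

impl Instruments {
    fn record_u64(&mut self, name: &'static str, value: u64) {
        // First event from a callsite creates the instrument; every event
        // after that is just a lookup in the map.
        *self.u64_counters.entry(name).or_insert(0) += value;
    }
}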

Example

It's this simple:

info!(MONOTONIC_COUNTER_API_REQUESTS = 1 as u64);

Other notes

  • Consider this a V1 that works end-to-end, but we will probably want follow-up PRs on top of this.
  • If we had support for Valuable in the master branch, we could use an Enum instead of string prefixes (def. something we can follow up with)
  • I did not add support for OTel observers (docs), since the same metrics events should be consumable by other subscribers such as tokio-console; they should not be specific to OTel. If someone wants to use observers, they should use the rust-opentelemetry crate directly in their code (which might require changing the code slightly so that the application has access to the Meter)
  • Right now this is only for Events, but it should take very little additional work to also support Spans (follow-up PR?)
  • I have a separate repo where I was testing this out, you can find it here: https://github.com/bryangarza/tracing-metrics-feature-test

Currently, there is no way to publish metrics via tracing.

Update the tracing-opentelemetry crate to publish metrics for event
fields that contain specific prefixes in the name.

Right now, we lazily instantiate and store one metrics object
per-callsite, but a future improvement that we should add to tracing
itself is the ability to store data per-callsite, so that we don't have
to do a HashMap lookup each time we want to publish a metric.
@jtescher (Collaborator) commented:

Great to get started thinking about how otel-metrics fits into tracing. As the metrics impl is still in progress, you may have an easier time maintaining a separate subscriber/layer that you can update without worrying as much about breaking changes / stability guarantees; e.g. open-telemetry/opentelemetry-rust#819 is a fairly substantial change to the otel metrics API, and configuring metric views / multiple metric exporters is still to be built out.

Another option could be a feature flag that signals the instability / warns about using it currently, but it may be more trouble than it's worth until the metrics side stabilizes a bit more.

This patch moves out all the metrics-related code to its own layer, and
also adds integration tests. Unfortunately, the assertions don't
currently fail the tests because the panics are captured by the closure.

I tried using an Arc Mutex to store the Metrics data and assert from the
test functions themselves, but for some reason, the closure runs after
the assertions in that case, and the tests fail because of that race
condition. That is, the closure which should update the Arc Mutex does
so too late in the execution of the test.
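
For reference, this is roughly the Arc/Mutex pattern that was attempted; the names are illustrative, not the actual test code, and the point is that the assertion only works if the exporter closure has already run:

use std::sync::{Arc, Mutex};

fn shared_state_assertion_sketch() {
    let exported: Arc<Mutex<Vec<String>>> = Arc::new(Mutex::new(Vec::new()));

    let sink = Arc::clone(&exported);
    // In the real test this closure would run inside MetricsExporter::export.
    let export = move |metric_name: &str| {
        sink.lock().unwrap().push(metric_name.to_owned());
    };

    export("hello_world");

    // The race described above: if the exporter has not run yet when this
    // assertion executes, the Vec is still empty and the test fails.
    assert!(!exported.lock().unwrap().is_empty());
}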
@bryangarza (Member, Author) commented:

Great to get started thinking about how otel-metrics fits into tracing. As the metrics impl is still in progress, you may have an easier time maintaining a separate subscriber/layer that you can update without worrying as much about breaking changes / stability guarantees; e.g. open-telemetry/opentelemetry-rust#819 is a fairly substantial change to the otel metrics API, and configuring metric views / multiple metric exporters is still to be built out.

Another option could be a feature flag that signals the instability / warns about using it currently, but it may be more trouble than it's worth until the metrics side stabilizes a bit more.

Thanks @jtescher for the comment. I thought about what you said and liked the suggestion of putting this code in a separate Layer. I pushed a new commit that moves it out to OpenTelemetryMetricsSubscriber.

Tomorrow I will write some docs, then it should be in a mergeable state.

@bryangarza (Member, Author) commented Jul 1, 2022

I ran into an issue with my integration tests where if I put the assertions into the test function, the assertions seem to run before MetricsExporter::export, making it impossible to pass values from its closure into some shared state that the test function can check. As far as I know though, tokio::test runs single-threaded by default. Do I need to do something to force the MetricsExporter::export method to be called eagerly? I'm guessing that rust-opentelemetry is waiting for some buffer to fill before flushing the metrics out? cc @jtescher @carllerche

This patch adds documentation showing users how to utilize the new
subscriber, and also handles the case where we receive an `i64` for a
MonotonicCounter.
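
One plausible shape for the i64 handling is sketched below; this is a guess at the fix, not the patch itself, since a monotonic counter is u64-based and a negative increment is invalid for it:

// Hypothetical helper: convert an i64 field value for a monotonic counter,
// dropping negative values rather than wrapping them.
fn monotonic_increment_from_i64(value: i64) -> Option<u64> {
    u64::try_from(value).ok()
}

// monotonic_increment_from_i64(1)  == Some(1)
// monotonic_increment_from_i64(-1) == None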
@bryangarza bryangarza marked this pull request as ready for review July 1, 2022 23:58
@bryangarza bryangarza requested review from jtescher and a team as code owners July 1, 2022 23:58
@bryangarza (Member, Author) commented Jul 2, 2022

This morning I tried to improve the tests again, but the CheckpointSet behavior is not consistent: sometimes, when MetricsExporter::export is called, there is data available; most of the time, the CheckpointSet is empty. I have a separate branch in case anyone is curious, and there's a bit more explanation in the commit message bryangarza@4804dce

Other than that, I think this is ready to merge (if the CI passes) @carllerche

This patch just fixes lints that show up during the PR's CI, and adds a
couple more unit tests to make sure that `u64` and `i64` are handled
properly by the Subscriber.
Rustdocs now show how to instantiate and register an
`OpenTelemetryMetricsSubscriber` so that it can be used.
This patch reduces the number of allocations in the `MetricVisitor` by
making use of the fact that the metadata for each callsite has a static
name. This means that we don't need to convert to a `String` and can
instead directly use the `&'static str` in the `HashMap`s.
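
In other words, the change is roughly the following (illustrative type aliases, not the crate's exact definitions):

use std::collections::HashMap;

// Before: every lookup built an owned String from the field name.
type MetricsMapOwned<T> = HashMap<String, T>;

// After: the callsite's field name is already a &'static str, so the map can
// borrow it directly and no allocation is needed per recorded metric.
type MetricsMap<T> = HashMap<&'static str, T>;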
@bryangarza (Member, Author) commented:

Got some good feedback from @hlbarber (thanks!), which I've now incorporated into the PR (using &'static str in the MetricVisitor to reduce allocations).

This commit makes some changes that should significantly reduce the
performance overhead of the OpenTelemetry metrics layer. In particular:

* The current code will allocate a `String` with the metric name _every_
  time a metric value is recorded, even if this value already exists.
  This is in order to use the `HashMap::entry` API. However, the
  performance cost of allocating a `String` and copying the metric
  name's bytes into that string is almost certainly worse than
  performing the hashmap lookup a second time, and that overhead occurs
  *every* time a metric is recorded.

  This commit changes the code for recording metrics to perform hashmap
  lookups by reference. This way, in the common case (where the metric
  already exists), we won't allocate. The allocation only occurs when a
  new metric is added to the map, which is infrequent.

* The current code uses a `RwLock` to protect the map of metrics.
  However, because the current code uses the `HashMap::entry` API,
  *every* metric update must acquire a write lock, since it may insert a
  new metric. This essentially reduces the `RwLock` to a `Mutex` ---
  since every time a value is recorded, we must acquire a write lock, we
  are forcing global synchronization on every update, the way a `Mutex`
  would. However, an OpenTelemetry metric can have its value updated
  through an `&self` reference (presumably metrics are represented as
  atomic values?). This means that the write lock is not necessary when
  a metric has already been recorded once, and multiple metric updates
  can occur without causing all threads to synchronize.

  This commit changes the code for updating metrics so that the read
  lock is acquired when attempting to look up a metric. If that metric
  exists, it is updated through the read lock, allowing multiple metrics
  to be updated without blocking every thread. In the less common case
  where a new metric is added, the write lock is acquired to update the
  hashmap.

* Currently, a *single* `RwLock` guards the *entire* `Instruments`
  structure. This is unfortunate. Any given metric event will only touch
  one of the hashmaps for different metric types, so two distinct types
  of metric *should* be able to be updated at the same time. The big
  lock prevents this, as a global write lock is acquired that prevents
  *any* type of metric from being updated.

  This commit changes this code to use more granular locks, with one
  around each different metric type's `HashMap`. This way, updating (for
  example) an `f64` counter does not prevent a `u64` value recorder from
  being updated at the same time, even when both metrics are inserted
  into the map for the first time.
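
A condensed sketch of the locking scheme described above, using std types only (not the PR's exact code); an AtomicU64 stands in for an OpenTelemetry counter handle, which can likewise be updated through &self:

use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;

#[derive(Default)]
struct Instruments {
    // One lock per metric type, so updating a u64 counter never blocks
    // updates to the other maps (f64 counters, value recorders, ...).
    u64_counters: RwLock<HashMap<&'static str, AtomicU64>>,
}

impl Instruments {
    fn add_u64(&self, name: &'static str, value: u64) {
        // Fast path: the metric already exists, so a shared read lock is
        // enough and concurrent updates do not serialize on a write lock.
        {
            let read = self.u64_counters.read().unwrap();
            if let Some(counter) = read.get(name) {
                counter.fetch_add(value, Ordering::Relaxed);
                return;
            }
        } // read guard dropped here, before any write lock is requested

        // Slow path: first time this metric is seen; take the write lock to insert.
        self.u64_counters
            .write()
            .unwrap()
            .entry(name)
            .or_insert_with(|| AtomicU64::new(0))
            .fetch_add(value, Ordering::Relaxed);
    }
}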
@hawkw (Member) left a comment

i noticed some potential performance issues with the current implementation that seem fairly significant. i've opened a PR against this branch (bryangarza#1) that rewrites some of this code to reduce the overhead of string allocation and global locking.

(Three inline review comments on tracing-opentelemetry/src/metrics.rs, since resolved.)
@bryangarza (Member, Author) commented:

I cancelled the 5 CI tests that were running because it seems that they are now hanging. I tried this out locally as well, but need to dig into it further.

This patch makes the read lock go out-of-scope before we block on
obtaining the write lock. Without this change, trying to acquire a write
lock hangs forever.
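
The bug and the fix, in miniature (illustrative, using std's RwLock, whose write() can block forever if the same thread still holds a read guard on the same lock):

use std::collections::HashMap;
use std::sync::RwLock;

fn buggy(map: &RwLock<HashMap<&'static str, u64>>) {
    let read = map.read().unwrap();
    if !read.contains_key("requests") {
        // Bug: `read` is still alive here, so this call can hang forever.
        map.write().unwrap().insert("requests", 0);
    }
}

fn fixed(map: &RwLock<HashMap<&'static str, u64>>) {
    // The temporary read guard is dropped at the end of this statement.
    let missing = !map.read().unwrap().contains_key("requests");
    if missing {
        map.write().unwrap().insert("requests", 0);
    }
}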
@bryangarza bryangarza requested a review from hawkw July 7, 2022 00:54
@bryangarza bryangarza requested a review from hawkw July 8, 2022 16:55
This patch updates the `MetricsMap` type to use static strings; this
will reduce the number of allocations needed for the metrics processing.
@carllerche (Member) left a comment

LGTM, thanks

@hawkw (Member) requested changes on Jul 9, 2022

I have basically one remaining concern about this PR, which is that some aspects of the current implementation of the metrics code make it more or less impossible to implement metrics that follow OpenTelemetry's metric naming conventions.

In particular, the metric names in this PR are

  • required to start with a prefix, which is part of the name of the metric that's output to OpenTelemetry
  • uppercase, for some reason?

I think that using prefixes to indicate the type of the metric is fine for now (although I'm definitely open to nicer ways of specifying metric types in the future), but I think we should change the code for recognizing metrics based on field name prefixes so that the metric that's actually emitted to the OpenTelemetry collector doesn't include a prefix like MONOTONIC_COUNTER in its name. We could do this by using str::strip_prefix in the visitor implementation, rather than starts_with, and using the field name without the prefix as the hashmap key.

In addition, I think we may want to change the prefixes we recognize. Both tracing and OpenTelemetry support dots in field names, and semantically, a dot can be used as a namespacing operator. I think we should probably use prefixes like monotonic_counter. rather than MONOTONIC_COUNTER_ to more clearly separate the metric type from the rest of the metric name.

So, in summary, I think we should:

  • make the metric name prefixes lowercase
  • make the metric name prefixes end in a .
  • strip the prefix when matching metric names before using them as hashmap keys
  • change the metric names in the various examples and tests to be all lowercase

and then I'll be very happy to merge this PR!
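
As a sketch of the suggested matching, with strip_prefix instead of starts_with (the prefix constant and function name are illustrative, not the PR's code):

const MONOTONIC_COUNTER_PREFIX: &str = "monotonic_counter.";

// Returns the metric name with the type prefix removed, or None if the field
// is not a monotonic counter. The stripped name is what gets used as the
// hashmap key and exported to the OpenTelemetry collector.
fn as_monotonic_counter(field_name: &'static str) -> Option<&'static str> {
    field_name.strip_prefix(MONOTONIC_COUNTER_PREFIX)
}

// as_monotonic_counter("monotonic_counter.api_requests") == Some("api_requests")
// as_monotonic_counter("api_requests") == None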

(Seven inline review comments on tracing-opentelemetry/src/metrics.rs, since resolved.)
pub fn new(push_controller: PushController) -> Self {
    let meter = push_controller
        .provider()
        .meter(INSTRUMENTATION_LIBRARY_NAME, Some(CARGO_PKG_VERSION));
Member:

does this add additional dimensions to the emitted metrics?

Member Author:

yeah it's something that opentelemetry includes during metrics exporting

This patch changes the metric name prefixes from upper snake case to
lowercase with a `.` at the end, to align with the OpenTelemetry metric
naming conventions.
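
With the renamed prefixes, the example from the PR description becomes the following; the metric is exported as "api_requests", and the event only has an effect once the metrics subscriber is installed:

use tracing::info;

fn main() {
    info!(monotonic_counter.api_requests = 1u64);
}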
@bryangarza (Member, Author) commented Jul 12, 2022

  • make the metric name prefixes lowercase

  • make the metric name prefixes end in a .

  • strip the prefix when matching metric names before using them as hashmap keys

  • change the metric names in the various examples and tests to be all lowercase

@hawkw, thanks -- done in one of my recent commits

bryangarza and others added 6 commits July 11, 2022 17:16
Co-authored-by: David Barsky <me@davidbarsky.com>
This patch renames OpenTelemetryMetricsSubscriber to
MetricsSubscriber, to avoid the redundant word.
Co-authored-by: David Barsky <me@davidbarsky.com>
- Remove example of what *not* to do
- Change wording of a rustdoc that was too verbose
@bryangarza bryangarza requested a review from hawkw July 12, 2022 00:35
@bryangarza bryangarza enabled auto-merge (squash) July 12, 2022 00:38
@bryangarza bryangarza dismissed hawkw’s stale review July 12, 2022 01:09

I addressed the PR comments in a new commit, 1e26171

@bryangarza bryangarza merged commit 6af941d into tokio-rs:master Jul 12, 2022
@bryangarza bryangarza deleted the opentelemetry-metrics branch July 12, 2022 01:09
hawkw added a commit that referenced this pull request Jul 20, 2022
In the upstream `opentelemetry` crate, the `trace` and `metrics`
features are gated by separate feature flags. This allows users who are
only using OpenTelemetry for tracing, or who are only using it for
metrics, to pick and choose what they depend on.

Currently, the release version of `tracing-opentelemetry` only provides
tracing functionality, and therefore, it only depends on `opentelemetry`
with the `trace` feature enabled. However, the metrics support added in
#2185 adds a dependency on the `opentelemetry/metrics` feature. This is
currently always enabled. We should probably follow the same approach as
upstream `opentelemetry`, and allow enabling/disabling metrics and
tracing separately.

This branch adds a `metrics` feature to `tracing-opentelemetry`, and
makes the `MetricsSubscriber` from #2185 gated on the `metrics` feature.
This feature flag is on by default, like the upstream
`opentelemetry/metrics` feature, but it can be disabled using
`default-features = false`.

We should probably do something similar for the tracing components of
the crate, and make them gated on a `trace` feature flag, but adding a
feature flag to released APIs is not semver-compatible, so we should
save that until the next breaking release.
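
A minimal sketch of what that gating might look like inside the crate (illustrative module layout, not the actual source):

// Only compiled when the `metrics` Cargo feature is enabled, mirroring the
// upstream `opentelemetry/metrics` feature.
#[cfg(feature = "metrics")]
pub mod metrics {
    pub struct MetricsSubscriber { /* ... */ }
}

#[cfg(feature = "metrics")]
pub use metrics::MetricsSubscriber;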
hawkw added a commit that referenced this pull request Jul 29, 2022
Motivation:
Currently, there is no way to publish metrics via tracing.

Solution:
Update the tracing-opentelemetry crate to publish metrics for event
fields that contain specific prefixes in the name.

Right now, we lazily instantiate and store one metrics object
per-callsite, but a future improvement that we should add to tracing
itself is the ability to store data per-callsite, so that we don't have
to do a HashMap lookup each time we want to publish a metric.

Co-authored-by: Eliza Weisman <eliza@buoyant.io>
Co-authored-by: David Barsky <me@davidbarsky.com>
@hawkw hawkw mentioned this pull request Jul 29, 2022