
Add initial experimental .NET CLR runtime metrics #1035

Open · wants to merge 11 commits into base: main
Conversation

stevejgordon
Contributor

@stevejgordon stevejgordon commented May 14, 2024

Fixes #956

Changes

Adds proposed experimental .NET CLR runtime metrics to the semantic conventions. Based on discussions with the .NET runtime team, the implementation plan will be to port the existing metrics from OpenTelemetry.Instrumentation.Runtime as directly as possible into the runtime itself. The names have been modified to align with the runtime environment metrics conventions.

Merge requirement checklist

@stevejgordon
Contributor Author

I'm not entirely sure why markdownlint fails on the autogenerated tables.

@trisch-me
Contributor

Have you run both attribute-registry-generation and table-generation?

@stevejgordon
Contributor Author

Have you run both attribute-registry-generation and table-generation?

I thought I had done so as the contents were updated, but I can try those again. I'm having to hack around the tooling a bit to get things running on Windows.

@stevejgordon
Contributor Author

@trisch-me I've updated the formatting per your suggestion and rerun both of those make targets. Neither made any changes to the markdown files.

@trisch-me
Contributor

Alternatively, you could wait until #1000 is merged; it includes a fix for this bug.

@trisch-me
Contributor

@stevejgordon #1000 is merged, so please re-run code generation locally and update your files. Thanks!

stevejgordon and others added 6 commits May 15, 2024 13:52
Co-authored-by: Alexandra Konrad <10500694+trisch-me@users.noreply.github.com>
@stevejgordon
Contributor Author

Thanks, @trisch-me. Looks good!

Member

@gregkalapos gregkalapos left a comment


Happy to see this 🎉

Left some comments below.

### Metric: `clr.gc.allocations.size`

This metric is [recommended][MetricRecommended].
This metric is obtained from [`GC.GetTotalAllocatedBytes()`](https://learn.microsoft.com/en-us/dotnet/api/system.gc.gettotalallocatedbytes).

@tarekgh tarekgh May 15, 2024


GC.GetTotalAllocatedBytes is not supported on .NET Framework. Just flagging this in case it needs to be captured here, or perhaps it doesn't matter for this doc? This applies to some other APIs in this doc too.

Contributor Author


Thanks for highlighting that. I guess some metrics will not be available on some targets unless we have another way to gather the data. I have some thoughts on that, which I'll add to the runtime issue. I am also reconsidering whether we should add these notes over the implementation used to collect the metric in the semantic conventions. On the one hand, it aligns with the JVM doc and helps provide context on how the value is calculated. On the other hand, it exposes implementation details that may change and are irrelevant to the point of a specification such as this one.


I am also reconsidering whether we should add these notes over the implementation used to collect the metric in the semantic conventions.

I suspect it's more helpful than harmful. You could also frame it as specification if you wanted: "This metric reports the same values as calling the XYZ API." It's not a promise that the implementation calls the API, just that the result is the same. (In practice, I assume the implementation will call the API.)
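To make the "reports the same values as calling XYZ" framing concrete, here is a minimal sketch of how such a metric could be published with the `System.Diagnostics.Metrics` API. The meter and metric names are illustrative only, not the final convention:

```csharp
using System;
using System.Diagnostics.Metrics;

static class AllocationMetrics
{
    // "Example.Dotnet.Runtime" is a placeholder meter name for this sketch.
    static readonly Meter Meter = new("Example.Dotnet.Runtime");

    // The spec would only state that the metric reports the same value as
    // GC.GetTotalAllocatedBytes(); whether an implementation literally calls
    // the API is an implementation detail.
    static readonly ObservableCounter<long> AllocatedBytes =
        Meter.CreateObservableCounter(
            "clr.gc.allocations.size",
            () => GC.GetTotalAllocatedBytes(precise: false),
            unit: "By",
            description: "Total bytes allocated on the managed GC heap.");
}
```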

stevejgordon and others added 3 commits May 16, 2024 08:48

linux-foundation-easycla bot commented May 17, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

Contributor

@lmolkova lmolkova left a comment


Thank you for working on this!

The main concerns from my side:

  • it seems we're designing 'proper' CLR metrics based on the information we can get today, but native runtime instrumentation can do much better and provide:
    • GC duration histogram
    • lock contention duration
    • ...
  • I'm no expert in all of the runtime details, but I assume that some metrics are used much more than others (e.g. CPU, GC, heap size, thread pool) while things like JIT metrics could be more advanced and specialized. I wonder if it's possible to start with basic perf analysis (e.g. CPU, memory/GC) and then move on to more specific metrics in follow-up PRs?

change_type: new_component

# The name of the area of concern in the attributes-registry, (e.g. http, cloud, db)
component: clr
Contributor


While CLR is technically precise, dotnet sounds more recognizable and user-friendly. WDYT?

Member


+1


Will these metrics be supported on .NET Framework? If so, is dotnet not accurate? I like dotnet, though, if it is acceptable.


I like dotnet. I expect there are lots of places where we use "dotnet" or ".NET" in a broad sense of the entire platform/ecosystem so I wouldn't expect users to be confused. For example you can find both .NET 8 and .NET Framework downloads by following the download link at "https://dotnet.microsoft.com/"


Now I'm having second thoughts. I see the java runtime metrics are labeled "jvm" and not "java". I am worried that while "dotnet" might be more recognizable, it may also be ambiguous. Does "dotnet.runtime.xyz" feel too verbose?

Contributor Author


I can accept dotnet if that's the consensus. On reflection, I don't think the specific runtime implementation matters. If a mono runtime added these metrics, they'd follow the conventions for naming and attributes. They can also choose not to implement the spec.

Arguably, "dotnet" is a product name rather than a runtime name, though, and these fall under "runtime" metrics in the conventions. That distinction may not matter here, but perhaps someone from the semantic conventions working group needs to weigh in on this as well.

Contributor


My understanding is that CPU shouldn't be treated as a runtime metric at all; instead we'd want process.cpu.time?

@trask @jack-berg do you happen to remember what were the reasons to have JVM-specific metrics for CPU?

Member


check out open-telemetry/opentelemetry-specification#3490 (and maybe open-telemetry/opentelemetry-specification#2291)

and semi-related open-telemetry/opentelemetry-specification#2388 (review) and open-telemetry/opentelemetry-specification#2388 (comment)

may be worth revisiting and documenting more clearly whatever decision is made around this

IIRC one of the influencing factors in JVM metric group decision was that it seemed like the least controversial approach

Contributor

@lmolkova lmolkova Jun 5, 2024


thanks a lot for the context!

If I understand correctly, the TL;DR of these discussions is that process.* and system.* metrics are intended to be reported by an external observer such as the collector, which does not know runtime specifics, can't add runtime-specific attributes, etc.

It's actually documented here

This document describes instruments and attributes for common OS process level
metrics in OpenTelemetry. Also consider the [general metric semantic
conventions](/docs/general/metrics.md#general-metric-semantic-conventions) when creating
instruments not explicitly defined in this document. OS process metrics are
not related to the runtime environment of the program, and should take
measurements from the operating system. For runtime environment metrics see
[semantic conventions for runtime environment
metrics](/docs/runtime/README.md#metrics).

If .NET implemented process.cpu.time it would in some way conflict or duplicate a metric collected on the OS level (if that's enabled) and would prevent .NET from adding any extra information that would not be available on the OS level.


Thanks, that was different from my previous understanding of how this would be done, but no problem, I want to follow the convention. To double-check I've understood this correctly, as a concrete example I'm looking at how the runtime would report CPU time on Windows. Based on this I'm expecting the runtime metrics would:

  • include a metric "dotnet.cpu.time"
  • its values would be populated by a call to the win32 function GetProcessTimes
  • it would include a single attribute process.cpu.state using the 'system' value for kernel CPU time and the 'user' value for user CPU time?

Assuming that is correct I assume similar logic would apply for a metric such as "dotnet.memory.virtual" that calls OS APIs to determine total process committed virtual memory?
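The three bullets above could be sketched roughly as follows. The metric name `dotnet.cpu.time` and the `process.cpu.state` attribute are the proposals from this discussion, not settled conventions, and the sketch uses the portable `Process` properties (which on Windows are backed by GetProcessTimes) rather than calling the win32 API directly:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;

static class CpuTimeMetrics
{
    // Placeholder meter name for this sketch.
    static readonly Meter Meter = new("Example.Dotnet.Runtime");

    static readonly ObservableCounter<double> CpuTime =
        Meter.CreateObservableCounter(
            "dotnet.cpu.time",
            ObserveCpuTime,
            unit: "s",
            description: "CPU time used by the process.");

    static IEnumerable<Measurement<double>> ObserveCpuTime()
    {
        // On Windows, these properties are populated via GetProcessTimes.
        using var process = Process.GetCurrentProcess();
        yield return new Measurement<double>(
            process.UserProcessorTime.TotalSeconds,
            new KeyValuePair<string, object?>("process.cpu.state", "user"));
        yield return new Measurement<double>(
            process.PrivilegedProcessorTime.TotalSeconds,
            new KeyValuePair<string, object?>("process.cpu.state", "system"));
    }
}
```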

- id: metric.clr.exceptions.count
type: metric
metric_name: clr.exceptions.count
brief: "The number of exceptions that have been thrown in managed code."
Contributor


probably not possible today, but if the runtime is instrumented natively, this metric should have an error.type attribute

Contributor Author


The current implementation that we'd likely port is:

AppDomain.CurrentDomain.FirstChanceException += (source, e) =>
{
    exceptionCounter.Add(1);
};

As we have access to FirstChanceExceptionEventArgs, we could get the Type for the Exception perhaps? That should generally be reasonably constrained in cardinality and might be useful for consumers to break down the exceptions which are being thrown.
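A hedged sketch of what that could look like, assuming the counter is defined with the Meter API and that `error.type` is the attribute name ultimately chosen (the meter name is a placeholder):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

static class ExceptionMetrics
{
    // Placeholder meter name for this sketch.
    static readonly Meter Meter = new("Example.Dotnet.Runtime");

    static readonly Counter<long> ExceptionCounter =
        Meter.CreateCounter<long>("clr.exceptions.count", unit: "{exception}");

    public static void Subscribe()
    {
        AppDomain.CurrentDomain.FirstChanceException += (source, e) =>
        {
            // FirstChanceExceptionEventArgs exposes the thrown exception, so
            // its type name is available as a candidate "error.type" value.
            ExceptionCounter.Add(1,
                new KeyValuePair<string, object?>(
                    "error.type", e.Exception.GetType().FullName));
        };
    }
}
```

Cardinality would be bounded by the number of distinct exception types thrown in the process, which in practice tends to be modest.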


For reference, this request has already shown up on the existing OTel instrumentation: open-telemetry/opentelemetry-dotnet-contrib#562 with a dozen upvotes and a proposed PR. Ultimately the PR was closed in favor of doing an opt-in mechanism given the potential for high cardinality on the exception type names.

Contributor

@lmolkova lmolkova Jun 4, 2024


In practice, the number of distinct exception types would probably be low. Even if it's in the hundreds range, this metric would not have any other attributes, so the cardinality problem wouldn't be too bad.

error.type would be super-useful. In other places we have a similar trade-off and choose usefulness when practical cardinality is low. E.g. http route can be of high-ish cardinality: it's fine to have 100 routes in the app (multiplied by method and status code variations, resulting in 1000s of time series), but it's so useful we add it anyway.


I'm fine having the dimension in principle assuming we confirm it doesn't cause implementation issues.


- id: metric.clr.gc.committed_memory.size
type: metric
metric_name: clr.gc.committed_memory.size
Contributor


does it have to be in the gc namespace?

A similar JVM metric is called jvm.memory.committed; would it make sense to use clr|dotnet.memory.committed too?


@noahfalk noahfalk May 31, 2024


There are other forms of memory that .NET processes use that aren't the GC, so using "gc" here does feel more precise. It allows us to add metrics for other forms of memory usage in the future. For example, loading assemblies allocates memory for modules, types, and jitted code. Even some .NET objects are not allocated on the GC heap: https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-8/#non-gc-heap

Member


There are other forms of memory that .NET processes use that isn't the gc so using "gc" here does feel more precise and allows us to add metrics for other forms of memory usage in the future. For example loading assemblies allocates memory for modules, types, and jitted code. Even some .NET objects are not allocated on the gc heap: https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-8/#non-gc-heap

+1, similar to #1035 (comment)

Contributor

@lmolkova lmolkova Jun 3, 2024


Something feels off (to me) about reporting memory usage in the gc namespace since the memory concept is wider than gc.

It's not a strong opinion, but I'd prefer to have dotnet.memory.heap vs dotnet.gc.memory.
Java uses a different approach by adding memory type attribute jvm.memory.type = heap | non_heap to distinguish those.


If we had a top-level committed memory metric, a few things seem awkward to me about that approach:

  1. It gets our runtime metrics into the business of process metrics to report the committed memory used by everything else that isn't the GC.
  2. In the future we might want to expose other pools of committed memory and there is no guarantee those pools are all mutually exclusive subsets of the memory. As a current example GC generation sizes are neither a subset nor a super-set of committed memory. Each generation may have part of its memory in the committed state and part of its memory in the reserved state.
  3. We have a bunch of other GC memory metrics so unless we are planning to rename all of them to have a new prefix it seems weird to have them in separate namespace prefixes.
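For reference, the values in question (committed bytes, heap size, fragmentation, per-generation sizes) are all surfaced by `GC.GetGCMemoryInfo()` (available since .NET 5). A small sketch that dumps them, illustrating point 2 above, that generation sizes and committed memory are overlapping but distinct quantities:

```csharp
using System;

static class GcMemorySnapshot
{
    public static void Print()
    {
        // GCMemoryInfo exposes the values under discussion. Generation sizes
        // are neither a subset nor a superset of committed memory: each
        // generation may be partly committed and partly reserved.
        GCMemoryInfo info = GC.GetGCMemoryInfo();
        Console.WriteLine($"committed:  {info.TotalCommittedBytes} By");
        Console.WriteLine($"heap size:  {info.HeapSizeBytes} By");
        Console.WriteLine($"fragmented: {info.FragmentedBytes} By");
        foreach (var gen in info.GenerationInfo)
            Console.WriteLine($"gen size:   {gen.SizeAfterBytes} By");
    }
}
```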

Contributor


let's step back from the naming for a second and think about a non-advanced (in .NET) user.

I do see what my .NET process consumes on the OS level from some generic external instrumentation.
Now I see a dotnet.gc.heap metric which is documented fairly well, but there is a gap between the sum of dimensions and what I see on the OS level - what is it? What are the metrics that report it?
It seems there will be

  • gc heap
  • non-gc heap
  • native?
  • what else?

If each of them is reported as a separate metric, I need to either know all of them or look into the docs.

Can we make this journey easy? What are the memory pools? Do we have an exhaustive list of metrics that describe all of them?

If we can't make it easy, are we ready to document it (in a format of semconv, not as a book)?

This is the same point I'm raising in
#1035 (comment)

Contributor

@lmolkova lmolkova Jun 5, 2024


I feel the fact that there is a gc heap and non-gc heap makes my point even stronger:
How is having dotnet.gc.heap and dotnet.non_gc.heap better than a single dotnet.heap metric with a type = gen-0, gen-1, loh, non-gc, etc. dimension?

You can always add new values to the type dimension in the same way as you can add more metrics.
But users see what's there without prior knowledge of heap gc/non gc types.

Move one step further and you have dotnet.memory.used(?) and users can learn about .net memory management from metric dimensions.


Alright, I think I follow better what you are after now, and it also sounds like I may have had the wrong understanding about the appropriateness of runtimes reporting process-level metrics. I'm waiting to get confirmation on the CPU discussion above to make sure I don't create further confusion before I continue working through this one with you :)

unit: "By"
stability: experimental

- id: metric.clr.gc.heap.size
Contributor


following JVM example (jvm.memory.used), can we consider clr|dotnet.heap.used?

It'd emphasize that it's the amount used (vs available, vs max, etc)

Member


I don't know how the JVM memory model works, but gc.heap does sound more accurate to me than dotnet.heap, since there are heaps managed by the .NET GC and heaps not managed by it.

Contributor


there is a balance between accuracy and user-friendliness.

E.g. if I'm looking at all metrics, how can I find the memory utilization? I'd be surprised to find it under anything.gc.memory.
Most of the users (myself included) rarely care about subtle technical differences.

If we had a metric that showed memory usage that users can slice and dice based on additional dimensions, it could be both, technically precise and also friendly.

In the case of Java metrics, they kinda achieved it in jvm.memory.used:

  • you can sum up by memory type to get both heap and non-heap
  • you can get heap only and break it down by GC generation
  • ...

If multiple metrics are provided, we'd depend on the user to know all the different memory consumption sources. As a typical non-advanced user, I won't be able to answer a simple question of how much memory my .NET app consumes and what the breakdown by different memory types looks like.


- id: metric.clr.gc.heap.fragmentation.size
type: metric
metric_name: clr.gc.heap.fragmentation.size
Contributor


maybe

Suggested change
metric_name: clr.gc.heap.fragmentation.size
metric_name: clr.heap.fragmented

?


@noahfalk noahfalk May 31, 2024


This is the prior art for naming of fragmentation in .NET that I could find:
BCL API: FragmentationBeforeBytes FragmentationAfterBytes
EventCounter: gc-fragmentation
OTel runtime instrumentation: process.runtime.dotnet.gc.heap.fragmentation.size

Although not identical, they all seem to agree on "fragmentation" as the term rather than "fragmented".
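The BCL prior art mentioned above is directly observable via `GC.GetGCMemoryInfo()`; a tiny sketch showing the "fragmentation" (not "fragmented") naming on the per-generation data:

```csharp
using System;

static class FragmentationSnapshot
{
    public static void Print()
    {
        // GCGenerationInfo exposes FragmentationBeforeBytes and
        // FragmentationAfterBytes, matching the prior-art naming listed above.
        GCMemoryInfo info = GC.GetGCMemoryInfo();
        foreach (var gen in info.GenerationInfo)
            Console.WriteLine(
                $"fragmentation before: {gen.FragmentationBeforeBytes} By, " +
                $"after: {gen.FragmentationAfterBytes} By");
    }
}
```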


- id: metric.clr.timer.count
type: metric
metric_name: clr.timer.count
Contributor


clr.timers.count

Contributor Author


The conventions prefer that metric names not be pluralized. I based this name on the existing contrib runtime metric, and I think it best to keep it as is.
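For context, the value behind this metric can be observed directly from `Timer.ActiveCount` (available since .NET 6). A minimal sketch using the .NET 7+ observable up-down counter API, with an illustrative meter name:

```csharp
using System;
using System.Diagnostics.Metrics;
using System.Threading;

static class TimerMetrics
{
    // Placeholder meter name; the metric name mirrors the existing contrib
    // runtime instrumentation and is not final.
    static readonly Meter Meter = new("Example.Dotnet.Runtime");

    static readonly ObservableUpDownCounter<long> TimerCount =
        Meter.CreateObservableUpDownCounter(
            "clr.timer.count",
            // Timer.ActiveCount reports the number of currently active
            // timers in the process.
            () => Timer.ActiveCount,
            unit: "{timer}");
}
```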

Contributor

@lmolkova lmolkova Jun 4, 2024


it's applied inconsistently (e.g. there are clr.assemblies.count and clr.exceptions.count), but also

Metric names SHOULD NOT be pluralized, unless the value being recorded represents discrete instances of a countable quantity. Generally, the name SHOULD be pluralized only if the unit of the metric in question is a non-unit (like {fault} or {operation}).

@trask
Member

trask commented May 28, 2024

/easycla

@lmolkova
Contributor

lmolkova commented Jun 3, 2024

Looking at the discussions in this PR, I want to reiterate the proposal #1035 (review):

Let's think about user experience: which metrics do users want to see first? My assumption is CPU, memory (from all sources, ideally in one or a few metrics), GC, and maybe threads. Let's try to come up with a few metrics that would not require users to have deep prior expertise in .NET memory management or know subtle differences between different .NET flavors.

If we need some advanced, precise things, they should come as an addition to basic CPU/mem/GC things.
Let's focus on basics first.

@noahfalk

noahfalk commented Jun 3, 2024

If we need some advanced, precise things, they should come as an addition to basic CPU/mem/GC things.
Let's focus on basics first.

Are you suggesting we review some things first and some things later? Or do you mean "Let's see if we can avoid including some of these metrics in .NET 9 at all"? If it's just review ordering, I'd say no problem. If the goal is to have these conventions not contain all the metrics in the current OTel .NET runtime instrumentation, I think that leads to trouble. One of the major scenarios will be people migrating from the older metrics to these metrics, and every metric we remove makes that migration harder or discourages them from migrating at all. If we want a simple set of metrics for folks just getting started, I think a good approach would be docs or a pre-made dashboard, not excluding metrics from the underlying instrumentation.

EDIT: Just to add, I know a few of the metrics may feel a bit advanced or niche, but they are there because customer feedback asked for them. We took a bunch of things out during the move from Windows Performance Counters to EventCounters, but customers told us we had cut too deep and asked us to restore some of the metrics that were important to them.

Co-authored-by: Liudmila Molkova <limolkova@microsoft.com>
@lmolkova
Contributor

lmolkova commented Jun 5, 2024

on reducing the scope:

@noahfalk thanks for the explanation. If it's either all or nothing, I believe we should still start from the basics (and try to do everything in .NET 9).

The main concerns on the basics from my side are:

The rest seems more cosmetic based on our discussions on duration/histograms in the dotnet/runtime#85372 (comment)

@noahfalk

noahfalk commented Jun 7, 2024

The main concerns on the basics from my side are...

Thanks @lmolkova, much appreciated! I'm just waiting to make sure I've got the right idea about the cpu portion of the discussion and then I'll respond back on the memory part as well.

Status: In Discussions

Successfully merging this pull request may close these issues.

Create CLR metrics semantic convention
10 participants