
Add initial experimental .NET CLR runtime metrics #1035

Open · wants to merge 11 commits into base: main
Conversation

stevejgordon
Contributor

@stevejgordon stevejgordon commented May 14, 2024

Fixes #956

Changes

Adds proposed experimental .NET CLR runtime metrics to the semantic conventions. Based on discussions with the .NET runtime team, the implementation plan will be to port the existing metrics from OpenTelemetry.Instrumentation.Runtime as directly as possible into the runtime itself. The names have been modified to align with the runtime environment metrics conventions.

Merge requirement checklist

@stevejgordon
Contributor Author

I'm not entirely sure why markdownlint fails on the autogenerated tables.

@trisch-me
Contributor

Have you run both attribute-registry-generation and table-generation?

@stevejgordon
Contributor Author

Have you run both attribute-registry-generation and table-generation?

I thought I had done so as the contents were updated, but I can try those again. I'm having to hack around the tooling a bit to get things running on Windows.

@stevejgordon
Contributor Author

@trisch-me I've updated the formatting per your suggestion and rerun both of those make targets. Neither made any changes to the markdown files.

@trisch-me
Contributor

Alternatively, you could wait until #1000 is merged; it includes a fix for this bug.

@trisch-me
Contributor

@stevejgordon #1000 is merged, so please re-run code generation locally and update your files. Thanks!

stevejgordon and others added 6 commits May 15, 2024 13:52
Co-authored-by: Alexandra Konrad <10500694+trisch-me@users.noreply.github.com>
@stevejgordon
Contributor Author

Thanks, @trisch-me. Looks good!

Member

@gregkalapos gregkalapos left a comment


Happy to see this 🎉

Left some comments below.

### Metric: `clr.gc.allocations.size`

This metric is [recommended][MetricRecommended].
This metric is obtained from [`GC.GetTotalAllocatedBytes()`](https://learn.microsoft.com/en-us/dotnet/api/system.gc.gettotalallocatedbytes).

@tarekgh tarekgh May 15, 2024


GC.GetTotalAllocatedBytes is not supported on .NET Framework. Just flagging this in case it needs to be captured here, or perhaps it doesn't matter for this doc? This applies to some other APIs in this doc too.

Contributor Author


Thanks for highlighting that. I guess some metrics will not be available on some targets unless we have another way to gather the data. I have some thoughts on that, which I'll add to the runtime issue. I am also reconsidering whether we should add these notes over the implementation used to collect the metric in the semantic conventions. On the one hand, it aligns with the JVM doc and helps provide context on how the value is calculated. On the other hand, it exposes implementation details that may change and are irrelevant to the point of a specification such as this one.


I am also reconsidering whether we should add these notes over the implementation used to collect the metric in the semantic conventions.

I suspect it's more helpful than harmful. You could also frame it as specification if you wanted: "This metric reports the same values as calling the XYZ API." It's not a promise that the implementation calls the API, just that the result is the same. (In practice, I assume the implementation will call the API.)
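To make the "reports the same values as calling XYZ" framing concrete, here is a minimal sketch of how such a metric could be published with the `System.Diagnostics.Metrics` API. The meter and metric names are illustrative only, not the final convention:

```csharp
using System;
using System.Diagnostics.Metrics;

static class AllocationMetrics
{
    // "Example.Dotnet.Runtime" is a placeholder meter name for this sketch.
    static readonly Meter Meter = new("Example.Dotnet.Runtime");

    // The spec would only state that the metric reports the same value as
    // GC.GetTotalAllocatedBytes(); whether an implementation literally calls
    // the API is an implementation detail.
    static readonly ObservableCounter<long> AllocatedBytes =
        Meter.CreateObservableCounter(
            "clr.gc.allocations.size",
            () => GC.GetTotalAllocatedBytes(precise: false),
            unit: "By",
            description: "Total bytes allocated on the managed GC heap.");
}
```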

stevejgordon and others added 3 commits May 16, 2024 08:48

linux-foundation-easycla bot commented May 17, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

Contributor

@lmolkova lmolkova left a comment


Thank you for working on this!

The main concerns from my side:

  • it seems we're designing 'proper' CLR metrics based on the information we can get today, but native runtime instrumentation can do much better and provide:
    • GC duration histogram
    • lock contention duration
    • ...
  • I'm no expert in all of the runtime details, but I assume that some metrics are used much more than others (e.g. CPU, GC, heap size, thread pool) while things like JIT metrics could be more advanced and specialized. I wonder if it's possible to start with basic perf analysis (e.g. CPU, memory/GC) and then move on to more specific metrics in follow-up PRs?

change_type: new_component

# The name of the area of concern in the attributes-registry, (e.g. http, cloud, db)
component: clr
Contributor


While CLR is technically precise, dotnet sounds more recognizable and user-friendly. WDYT?

Member


+1


Will these metrics be supported on .NET Framework? If so, is dotnet not accurate? I like dotnet, though, if it is acceptable.


I like dotnet. I expect there are lots of places where we use "dotnet" or ".NET" in a broad sense of the entire platform/ecosystem so I wouldn't expect users to be confused. For example you can find both .NET 8 and .NET Framework downloads by following the download link at "https://dotnet.microsoft.com/"


Now I'm having second thoughts. I see the java runtime metrics are labeled "jvm" and not "java". I am worried that while "dotnet" might be more recognizable, it may also be ambiguous. Does "dotnet.runtime.xyz" feel too verbose?

Contributor Author


I can accept dotnet if that's the consensus. On reflection, I don't think the specific runtime implementation matters. If a mono runtime added these metrics, they'd follow the conventions for naming and attributes. They can also choose not to implement the spec.

Arguably, "dotnet" is a product name rather than a runtime name, though, and these fall under "runtime" metrics in the conventions. That distinction may not matter here, but perhaps someone from the semantic conventions working group needs to weigh in on this as well.

Contributor


My understanding is that CPU shouldn't be treated as a runtime metric at all; instead we'd want process.cpu.time?

@trask @jack-berg do you happen to remember what were the reasons to have JVM-specific metrics for CPU?

Member


check out open-telemetry/opentelemetry-specification#3490 (and maybe open-telemetry/opentelemetry-specification#2291)

and semi-related open-telemetry/opentelemetry-specification#2388 (review) and open-telemetry/opentelemetry-specification#2388 (comment)

may be worth revisiting and documenting more clearly whatever decision is made around this

IIRC one of the influencing factors in JVM metric group decision was that it seemed like the least controversial approach

Contributor

@lmolkova lmolkova Jun 5, 2024


thanks a lot for the context!

If I understand correctly, the TL;DR of these discussions is that process.* and system.* metrics are intended to be reported by an external observer such as the collector, which does not know runtime specifics, can't add runtime-specific attributes, etc.

It's actually documented here

This document describes instruments and attributes for common OS process level
metrics in OpenTelemetry. Also consider the [general metric semantic
conventions](/docs/general/metrics.md#general-metric-semantic-conventions) when creating
instruments not explicitly defined in this document. OS process metrics are
not related to the runtime environment of the program, and should take
measurements from the operating system. For runtime environment metrics see
[semantic conventions for runtime environment
metrics](/docs/runtime/README.md#metrics).

If .NET implemented process.cpu.time it would in some way conflict or duplicate a metric collected on the OS level (if that's enabled) and would prevent .NET from adding any extra information that would not be available on the OS level.


Thanks, that was different from my previous understanding of how this would be done, but no problem, I want to follow the convention. To double-check I've understood this correctly, as a concrete example I'm looking at how the runtime would report CPU time on Windows. Based on this I'm expecting the runtime metrics would:

  • include a metric "dotnet.cpu.time"
  • its values would be populated by a call to the win32 function GetProcessTimes
  • it would include a single attribute process.cpu.state using the 'system' value for kernel CPU time and the 'user' value for user CPU time?

Assuming that is correct I assume similar logic would apply for a metric such as "dotnet.memory.virtual" that calls OS APIs to determine total process committed virtual memory?
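The three bullets above could be sketched roughly as follows. The metric name `dotnet.cpu.time` and the `process.cpu.state` attribute are the proposals from this discussion, not settled conventions, and the sketch uses the portable `Process` properties (which on Windows are backed by GetProcessTimes) rather than calling the win32 API directly:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;

static class CpuTimeMetrics
{
    // Placeholder meter name for this sketch.
    static readonly Meter Meter = new("Example.Dotnet.Runtime");

    static readonly ObservableCounter<double> CpuTime =
        Meter.CreateObservableCounter(
            "dotnet.cpu.time",
            ObserveCpuTime,
            unit: "s",
            description: "CPU time used by the process.");

    static IEnumerable<Measurement<double>> ObserveCpuTime()
    {
        // On Windows, these properties are populated via GetProcessTimes.
        using var process = Process.GetCurrentProcess();
        yield return new Measurement<double>(
            process.UserProcessorTime.TotalSeconds,
            new KeyValuePair<string, object?>("process.cpu.state", "user"));
        yield return new Measurement<double>(
            process.PrivilegedProcessorTime.TotalSeconds,
            new KeyValuePair<string, object?>("process.cpu.state", "system"));
    }
}
```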

- id: metric.clr.exceptions.count
type: metric
metric_name: clr.exceptions.count
brief: "The number of exceptions that have been thrown in managed code."
Contributor


probably not possible today, but if the runtime is instrumented natively, this metric should have an error.type attribute

Contributor Author


The current implementation that we'd likely port is:

AppDomain.CurrentDomain.FirstChanceException += (source, e) =>
{
    exceptionCounter.Add(1);
};

As we have access to FirstChanceExceptionEventArgs, we could get the Type for the Exception perhaps? That should generally be reasonably constrained in cardinality and might be useful for consumers to break down the exceptions which are being thrown.
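A hedged sketch of what that could look like, assuming the counter is defined with the Meter API and that `error.type` is the attribute name ultimately chosen (the meter name is a placeholder):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

static class ExceptionMetrics
{
    // Placeholder meter name for this sketch.
    static readonly Meter Meter = new("Example.Dotnet.Runtime");

    static readonly Counter<long> ExceptionCounter =
        Meter.CreateCounter<long>("clr.exceptions.count", unit: "{exception}");

    public static void Subscribe()
    {
        AppDomain.CurrentDomain.FirstChanceException += (source, e) =>
        {
            // FirstChanceExceptionEventArgs exposes the thrown exception, so
            // its type name is available as a candidate "error.type" value.
            ExceptionCounter.Add(1,
                new KeyValuePair<string, object?>(
                    "error.type", e.Exception.GetType().FullName));
        };
    }
}
```

Cardinality would be bounded by the number of distinct exception types thrown in the process, which in practice tends to be modest.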


For reference, this request has already shown up on the existing OTel instrumentation: open-telemetry/opentelemetry-dotnet-contrib#562 with a dozen upvotes and a proposed PR. Ultimately the PR was closed in favor of doing an opt-in mechanism given the potential for high cardinality on the exception type names.

Contributor

@lmolkova lmolkova Jun 4, 2024


In practice, the number of distinct exception types would probably be low. Even if it's in the hundreds range, this metric would not have any other attributes, so the cardinality problem wouldn't be too bad.

error.type would be super-useful. In other places we have a similar trade-off and choose usefulness when practical cardinality is low. E.g. http route can be of high-ish cardinality: it's fine to have 100 routes in the app (multiplied by method and status code variations, resulting in 1000s of time series), but it's so useful we add it anyway.


I'm fine having the dimension in principle assuming we confirm it doesn't cause implementation issues.


- id: metric.clr.gc.committed_memory.size
type: metric
metric_name: clr.gc.committed_memory.size
Contributor


does it have to be in the gc namespace?

A similar JVM metric is called jvm.memory.committed; would it make sense to use clr|dotnet.memory.committed too?


@noahfalk noahfalk May 31, 2024


There are other forms of memory that .NET processes use that aren't the GC, so using "gc" here does feel more precise. It allows us to add metrics for other forms of memory usage in the future. For example, loading assemblies allocates memory for modules, types, and jitted code. Even some .NET objects are not allocated on the GC heap: https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-8/#non-gc-heap

Member


There are other forms of memory that .NET processes use that isn't the gc so using "gc" here does feel more precise and allows us to add metrics for other forms of memory usage in the future. For example loading assemblies allocates memory for modules, types, and jitted code. Even some .NET objects are not allocated on the gc heap: https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-8/#non-gc-heap

+1, similar to #1035 (comment)

Contributor

@lmolkova lmolkova Jun 3, 2024


Something feels off (to me) about reporting memory usage in the gc namespace since the memory concept is wider than gc.

It's not a strong opinion, but I'd prefer to have dotnet.memory.heap vs dotnet.gc.memory.
Java uses a different approach by adding memory type attribute jvm.memory.type = heap | non_heap to distinguish those.


If we had a top-level committed memory metric, a few things seem awkward to me about that approach:

  1. It gets our runtime metrics into the business of process metrics to report the committed memory used by everything else that isn't the GC.
  2. In the future we might want to expose other pools of committed memory and there is no guarantee those pools are all mutually exclusive subsets of the memory. As a current example GC generation sizes are neither a subset nor a super-set of committed memory. Each generation may have part of its memory in the committed state and part of its memory in the reserved state.
  3. We have a bunch of other GC memory metrics so unless we are planning to rename all of them to have a new prefix it seems weird to have them in separate namespace prefixes.
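For reference, the values in question (committed bytes, heap size, fragmentation, per-generation sizes) are all surfaced by `GC.GetGCMemoryInfo()` (available since .NET 5). A small sketch that dumps them, illustrating point 2 above, that generation sizes and committed memory are overlapping but distinct quantities:

```csharp
using System;

static class GcMemorySnapshot
{
    public static void Print()
    {
        // GCMemoryInfo exposes the values under discussion. Generation sizes
        // are neither a subset nor a superset of committed memory: each
        // generation may be partly committed and partly reserved.
        GCMemoryInfo info = GC.GetGCMemoryInfo();
        Console.WriteLine($"committed:  {info.TotalCommittedBytes} By");
        Console.WriteLine($"heap size:  {info.HeapSizeBytes} By");
        Console.WriteLine($"fragmented: {info.FragmentedBytes} By");
        foreach (var gen in info.GenerationInfo)
            Console.WriteLine($"gen size:   {gen.SizeAfterBytes} By");
    }
}
```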

Contributor


let's step back from the naming for a second and think about a non-advanced (in .NET) user.

I do see what my .NET process consumes on the OS level from some generic external instrumentation.
Now I see a dotnet.gc.heap metric which is documented fairly well, but there is a gap between the sum of dimensions and what I see on the OS level - what is it? What are the metrics that report it?
It seems there will be

  • gc heap
  • non-gc heap
  • native?
  • what else?

If each of them is reported as a separate metric, I need to either know all of them or look into the docs.

Can we make this journey easy? What are the memory pools? Do we have an exhaustive list of metrics that describe all of them?

If we can't make it easy, are we ready to document it (in a format of semconv, not as a book)?

This is the same point I'm raising in
#1035 (comment)

Contributor

@lmolkova lmolkova Jun 5, 2024


I feel the fact that there is a gc heap and non-gc heap makes my point even stronger:
How is having dotnet.gc.heap and dotnet.non_gc.heap better than a single dotnet.heap metric with a type = gen-0, gen-1, loh, non-gc, etc. dimension?

You can always add new values to the type dimension in the same way as you can add more metrics.
But users see what's there without prior knowledge of heap gc/non gc types.

Move one step further and you have dotnet.memory.used(?) and users can learn about .net memory management from metric dimensions.


Alright, I think I follow better what you are after now, and it also sounds like I may have had the wrong understanding about the appropriateness of runtimes reporting process-level metrics. I'm waiting to get confirmation on the CPU discussion above to make sure I don't create further confusion before I continue working through this one with you :)

unit: "By"
stability: experimental

- id: metric.clr.gc.heap.size
Contributor


following JVM example (jvm.memory.used), can we consider clr|dotnet.heap.used?

It'd emphasize that it's the amount used (vs available, vs max, etc)

Member


I don't know how the JVM memory model works, but gc.heap does sound more accurate to me than dotnet.heap, since there are heaps managed by the .NET GC and heaps not managed by it.

Contributor


there is a balance between accuracy and user-friendliness.

E.g. if I'm looking at all metrics, how can I find the memory utilization? I'd be surprised to find it under anything.gc.memory.
Most of the users (myself included) rarely care about subtle technical differences.

If we had a metric that showed memory usage that users can slice and dice based on additional dimensions, it could be both, technically precise and also friendly.

In the case of Java metrics, they kinda achieved it in jvm.memory.used:

  • you can sum up by memory type to get both heap and non-heap
  • you can get heap only and break it down by GC generation
  • ...

If multiple metrics are provided, we'd depend on the user to know all the different memory consumption sources. As a typical non-advanced user, I won't be able to answer a simple question of how much memory my .NET app consumes and what the breakdown by different memory types looks like.


- id: metric.clr.gc.heap.fragmentation.size
type: metric
metric_name: clr.gc.heap.fragmentation.size
Contributor


maybe

Suggested change
metric_name: clr.gc.heap.fragmentation.size
metric_name: clr.heap.fragmented

?


@noahfalk noahfalk May 31, 2024


This is the prior art for naming of fragmentation in .NET that I could find:
BCL API: FragmentationBeforeBytes FragmentationAfterBytes
EventCounter: gc-fragmentation
OTel runtime instrumentation: process.runtime.dotnet.gc.heap.fragmentation.size

Although not identical, they all seem to agree on "fragmentation" as the term rather than "fragmented".
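The BCL prior art mentioned above is directly observable via `GC.GetGCMemoryInfo()`; a tiny sketch showing the "fragmentation" (not "fragmented") naming on the per-generation data:

```csharp
using System;

static class FragmentationSnapshot
{
    public static void Print()
    {
        // GCGenerationInfo exposes FragmentationBeforeBytes and
        // FragmentationAfterBytes, matching the prior-art naming listed above.
        GCMemoryInfo info = GC.GetGCMemoryInfo();
        foreach (var gen in info.GenerationInfo)
            Console.WriteLine(
                $"fragmentation before: {gen.FragmentationBeforeBytes} By, " +
                $"after: {gen.FragmentationAfterBytes} By");
    }
}
```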


- id: metric.clr.timer.count
type: metric
metric_name: clr.timer.count
Contributor


clr.timers.count

Contributor Author


The conventions prefer that metric names not be pluralized. I based this name on the existing contrib runtime metric, and I think it best to keep it as is.
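For context, the value behind this metric can be observed directly from `Timer.ActiveCount` (available since .NET 6). A minimal sketch using the .NET 7+ observable up-down counter API, with an illustrative meter name:

```csharp
using System;
using System.Diagnostics.Metrics;
using System.Threading;

static class TimerMetrics
{
    // Placeholder meter name; the metric name mirrors the existing contrib
    // runtime instrumentation and is not final.
    static readonly Meter Meter = new("Example.Dotnet.Runtime");

    static readonly ObservableUpDownCounter<long> TimerCount =
        Meter.CreateObservableUpDownCounter(
            "clr.timer.count",
            // Timer.ActiveCount reports the number of currently active
            // timers in the process.
            () => Timer.ActiveCount,
            unit: "{timer}");
}
```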

Contributor

@lmolkova lmolkova Jun 4, 2024


it's applied inconsistently (e.g. there are clr.assemblies.count and clr.exceptions.count), but also

Metric names SHOULD NOT be pluralized, unless the value being recorded represents discrete instances of a countable quantity. Generally, the name SHOULD be pluralized only if the unit of the metric in question is a non-unit (like {fault} or {operation}).

@trask
Member

trask commented May 28, 2024

/easycla

@lmolkova
Contributor

lmolkova commented Jun 3, 2024

Looking at the discussions in this PR, I want to reiterate the proposal #1035 (review):

Let's think about user experience: which metrics do users want to see first? My assumption is CPU, memory (from all sources, ideally in one or a few metrics), GC, and maybe threads. Let's try to come up with a few metrics that would not require users to have deep prior expertise in .NET memory management or know subtle differences between different .NET flavors.

If we need some advanced, precise things, they should come as an addition to basic CPU/mem/GC things.
Let's focus on basics first.

@noahfalk

noahfalk commented Jun 3, 2024

If we need some advanced, precise things, they should come as an addition to basic CPU/mem/GC things.
Let's focus on basics first.

Are you suggesting we review some things first and some things later? Or do you mean "Let's see if we can avoid including some of these metrics in .NET 9 at all"? If it's just review ordering, I'd say no problem. If the goal is to have these conventions not contain all the metrics in the current OTel .NET runtime instrumentation, I think that leads to trouble. One of the major scenarios will be people migrating from the older metrics to these metrics, and every metric we remove makes that migration harder or discourages them from migrating at all. If we want a simple set of metrics for folks just getting started, I think a good approach would be docs or a pre-made dashboard, not excluding metrics from the underlying instrumentation.

EDIT: Just to add, I know a few of the metrics may feel a bit advanced or niche, but they are there because customer feedback asked for them. We took a bunch of things out during the move from Windows Performance Counters to EventCounters, but customers told us we had cut too deep and asked us to restore some of the metrics that were important to them.

Co-authored-by: Liudmila Molkova <limolkova@microsoft.com>
@lmolkova
Contributor

lmolkova commented Jun 5, 2024

on reducing the scope:

@noahfalk thanks for the explanation. If it's either all or nothing, I believe we should still start from the basics (and try to do everything in .NET 9).

The main concerns on the basics from my side are:

The rest seems more cosmetic based on our discussions on duration/histograms in the dotnet/runtime#85372 (comment)

@noahfalk

noahfalk commented Jun 7, 2024

The main concerns on the basics from my side are...

Thanks @lmolkova, much appreciated! I'm just waiting to make sure I've got the right idea about the cpu portion of the discussion and then I'll respond back on the memory part as well.

Status: In Discussions

Successfully merging this pull request may close these issues.

Create CLR metrics semantic convention
10 participants