
Profiling SIG: No place for decodable auxiliary/binary data for further detailed analysis #3979

Open
beaubelgrave opened this issue Apr 2, 2024 · 11 comments
Labels: sig-issue, spec:profiling (Related to the specification/profiling directory), triage:deciding:needs-info

@beaubelgrave

Our proprietary profiling format has space for both sampled data, such as CPU samples, and supporting data, like when each thread was readied (and by whom), among many other things. Several existing documented formats/technologies for profiling allow for this. For example, ETW on Windows allows a profiler to get events about CPU sampling, but it can also get non-sampled data like context switches or page faults that are occurring. The Perf kernel system (and the matching CLI tool) on Linux also supports capturing profiling data alongside supporting information, such as other tracepoints (uprobes, user_events, kprobes, etc.) on the system.

Typically, auxiliary data comes in at a higher rate and doesn't have call stacks (although some of it does). These events don't quite match the model of a sample linked to addresses. They are more like a set of metadata linked to many instances of binary data that can be decoded after collection using the linked metadata.

I expected to see a way to store this type of auxiliary data within the profile format; however, I don't see a way to store binary data with the metadata needed to decode it, which is what supports understanding performance at a deeper level. While this auxiliary data might not be useful for aggregation, it is very useful when an anomalous profile comes in and a deep investigation is needed for that specific profile/time period.

For example, when CPU samples show that a function is hot, we may want to see which branches are being mispredicted or whether cache-line misses are occurring. These events will not have call stacks or even an associated instruction pointer; an event may be linked solely to a CPU at a point in time, and that data needs to be further mixed with other auxiliary data to truly understand the system.

@felixge
Member

felixge commented Apr 3, 2024

Thanks for raising this. In my mind, this use case sounds similar to that of preserving JFR events that currently don't map well to the profiling spec. Go execution traces are in a very similar situation.

Right now, the most natural "extension point" for transporting such data as part of OTel is the original_payload field in the spec. However, we're still debating in the SIG how flexible we want to be with this field. Generally speaking, the OTel architecture would prefer all data to be explicitly converted into OTLP protobuf messages, but creating a format suitable for holding a superset of JFR / Go execution traces / Microsoft's proprietary format / etc. is problematic from both a complexity and an efficiency perspective.

That being said, if you have specific ideas for supporting the data you have in mind, please sketch them out here!

@beaubelgrave
Author

I agree, it's similar to JFR (and CLR) events that don't map well.

Personally, I would prefer to have this within the OTLP protobuf message as a separate section, like the pprof-extended data is. For efficiency/complexity, I'm leaning toward how Linux has done tracepoints: you have a set of metadata that describes each event (by ID), and then you have data that is simply an event ID + payload. A set of metadata (which could be as simple as a set of KeyValue objects) just needs to be defined per ID.

Typically, the metadata in those events consists of pretty basic data types (string, UTF-8/UTF-16, u16, u32, u64, s16, s32, s64, etc.). Each field points to an offset within the event's binary data that holds that type (the type also defines the length). There is one special pair of types in the Linux tracepoint architecture (__rel_loc/__data_loc) that provides a header for variable-length types (strings, char arrays, structs, etc.). See this.

An alternative to the metadata -> event ID approach is an entirely self-described payload (we use both approaches in our formats). Check out this.

Depending on the approach, this could be as simple as just an array of byte sequences (the EventHeader approach) or a set of KeyValues linked by event ID to byte sequences.
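To make that concrete, here is a rough sketch of what such a metadata/record split could look like, using hypothetical Go types (the names are illustrative only and not taken from the spec or any existing proposal):

```go
package auxdata

// FieldDescriptor describes one field inside an event payload, mirroring the
// Linux tracepoint format entries (name, type, offset, size).
type FieldDescriptor struct {
	Name   string
	Type   string // e.g. "u32", "char[16]", "__data_loc char[]"
	Offset uint32
	Size   uint32
	Signed bool
}

// EventMetadata is defined once per event ID -- the "decoder ring".
type EventMetadata struct {
	ID     uint32
	Name   string            // e.g. "sched/sched_waking"
	Format string            // "fields" when described by Fields, or an opaque tag like "JFR"
	Fields []FieldDescriptor // empty for opaque pass-through payloads
}

// EventRecord is what gets captured at runtime: just an ID plus the raw bytes.
type EventRecord struct {
	MetadataID uint32
	Payload    []byte
}
```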

@reyang added the spec:profiling label (Related to the specification/profiling directory) and removed the spec:miscellaneous label (For issues that don't match any other spec label) on Apr 4, 2024
@tigrannajaryan
Member

We need input from the Profiling SIG on this. Will add it to the Profiling SIG agenda.

@felixge
Member

felixge commented Apr 22, 2024

@beaubelgrave how do you imagine the proposed encoding to work for JFR or Go Execution Traces? Both formats are emitted by the runtimes, and are heavily optimized for their respective data payloads. Converting them to an alternate representation is going to cause a significant amount of decoding/re-encoding overhead. The resulting data is likely to be less compact than before.

I'm assuming that for the data you're interested in, you can control the code that is producing the data?

@beaubelgrave
Author

beaubelgrave commented Apr 22, 2024

I don't think you change them at all; you just create a "decoder ring" once for those events. The minimal way to achieve this is to first add metadata that describes where JFR or Go has placed the various fields of each event you are capturing. Then, when those events are captured, you simply append them along with the appropriate metadata ID.

We don't often control the payload format; however, we do know these formats' details, which is enough to create the metadata. On Linux this can be found via the tracepoint/trace_event definition within tracefs (/sys/kernel/tracing/events). On Windows, it can be determined from the ETW manifest, or it can be self-described within the event itself (in that case, we'd need the metadata to state that it's using some well-known format instead of giving field descriptions).

For a concrete example, let's take a look at the sched_waking event on Linux, which tells us when a thread is waking up.

You can get the format from /sys/kernel/tracing/events/sched/sched_waking/format:

name: sched_waking
ID: 404
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:char comm[16];    offset:8;       size:16;        signed:0;
        field:pid_t pid;        offset:24;      size:4; signed:1;
        field:int prio; offset:28;      size:4; signed:1;
        field:int target_cpu;   offset:32;      size:4; signed:1;

In the above, each field has a type, name, offset, and size. We only need to capture these format details once; then, whenever a "sched_waking" payload is captured, we simply copy the bytes and link them to the above metadata. There is no re-shaping of data.
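As a rough illustration (not proposed spec wording), a consumer could later decode such a payload purely from those captured offsets. The helper below is hypothetical and assumes a little-endian capture:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// decodeSchedWaking extracts fields from a raw sched_waking payload using the
// offsets/sizes recorded from the tracefs format file shown above.
func decodeSchedWaking(raw []byte) (comm string, pid, prio, targetCPU int32) {
	comm = string(bytes.TrimRight(raw[8:24], "\x00"))         // char comm[16];  offset:8;  size:16
	pid = int32(binary.LittleEndian.Uint32(raw[24:28]))       // pid_t pid;      offset:24; size:4
	prio = int32(binary.LittleEndian.Uint32(raw[28:32]))      // int prio;       offset:28; size:4
	targetCPU = int32(binary.LittleEndian.Uint32(raw[32:36])) // int target_cpu; offset:32; size:4
	return
}

func main() {
	// Fabricated 36-byte payload, just to exercise the decoder.
	raw := make([]byte, 36)
	copy(raw[8:], "kworker/0:1")
	binary.LittleEndian.PutUint32(raw[24:], 1234)
	binary.LittleEndian.PutUint32(raw[28:], 120)
	binary.LittleEndian.PutUint32(raw[32:], 3)

	comm, pid, prio, cpu := decodeSchedWaking(raw)
	fmt.Printf("comm=%s pid=%d prio=%d target_cpu=%d\n", comm, pid, prio, cpu)
}
```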

I'm unfamiliar with the JFR and Go events you think would be hard to describe here, but I am familiar with the kernel events and user_events on Linux, as well as the CLR runtime events for C#. I believe those can all be described in some metadata block. The hardest case is when a dynamically sized array (strings, etc.) sits in the middle of the event (instead of at the end). The metadata format needs the proper language to describe this so the event can be decoded properly.

With this approach, the write/capture path is very fast. The decoding can be done later and, depending on the complexity of the events, may be costly. Typically, runtime events that come at high rates are small and don't contain dynamic data. However, it could be that Go or JFR didn't take that path.

@felixge Can you share the details of the events you think would be hard to describe in this format?

@felixge
Member

felixge commented Apr 22, 2024

@beaubelgrave ah, I think I understand your idea better now. But I think that even creating this metadata will be challenging because of the following:

  1. Both formats use LEB-128 encoding.
  2. Both runtimes buffer their events in per-thread buffers before flushing them to the underlying data stream.

That means the data becomes available to user space in batches of events. Adding a metadata ID in front of every event (or in a separate section) requires decoding the batches into individual events. This means the events have to be LEB128-decoded to some degree, which is not cheap. The next problem is that neither format has a specification, and the runtimes may change the encoding details between minor versions.
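For a sense of what that decode step involves, here is a generic unsigned-LEB128 sketch (not the actual Go execution trace or JFR parser); every byte of every varint has to be examined before the next field or event boundary is even known:

```go
package leb128

// Uleb128 decodes one unsigned LEB128 value from buf and reports how many
// bytes it consumed. Splitting a batch into individual events means running
// something like this over every varint-encoded field along the way.
func Uleb128(buf []byte) (value uint64, n int) {
	var shift uint
	for i, b := range buf {
		value |= uint64(b&0x7f) << shift
		if b&0x80 == 0 {
			return value, i + 1
		}
		shift += 7
	}
	return 0, 0 // truncated input
}
```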

JFR is also already self-describing; it's just really complex.

So I'm a bit unsure whether adding a metadata layer over these data sources is a good idea. But I'm open to considering it further!

@beaubelgrave
Author

beaubelgrave commented Apr 22, 2024

Regarding "The next problem is that both formats don't have a specification and the runtime may change the encoding details between minor versions.". I agree on this, perhaps those runtimes should have a stable event documentation like the CLR does.

@felixge Is there an alternative approach you were thinking about?

@beaubelgrave
Author

One possible approach could be for the metadata to also describe a set of formats. It could describe per-event details, but it could also simply say format = "JFR", and then the byte blob from the per-thread buffers is just copied. Tools that understand "JFR" could then parse it.

While not ideal, it would allow storing a mix of well-described data and data that can't be described with basic field/type/offset metadata.
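For illustration, reusing the hypothetical types sketched earlier, the same metadata array could then hold both shapes (again, just an assumption of how this could look, not proposed spec wording; the event names and IDs here are made up):

```go
package auxdata

// exampleMetadata mixes a fully described event with an opaque pass-through
// blob in the same section.
func exampleMetadata() []EventMetadata {
	return []EventMetadata{
		{
			ID:     404,
			Name:   "sched/sched_waking",
			Format: "fields", // decodable from the field descriptors alone
			Fields: []FieldDescriptor{
				{Name: "comm", Type: "char[16]", Offset: 8, Size: 16},
				{Name: "pid", Type: "pid_t", Offset: 24, Size: 4, Signed: true},
			},
		},
		{
			ID:     405,
			Name:   "jdk.jfr.recording",
			Format: "JFR", // opaque: the payload is the raw buffer bytes, copied verbatim
		},
	}
}
```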

@felixge
Member

felixge commented Apr 23, 2024

> perhaps those runtimes should have stable event documentation like the CLR does.

CLR ETW looks nice. What's the strategy when it comes to runtime internals changing and some events no longer making sense?

I'm asking because both JFR and Go don't specify their formats, since doing so is a maintenance PITA for them when it comes to evolving the internals of the runtime and the format.

> @felixge Is there an alternative approach you were thinking about?

Yeah, I was thinking of just using OpenTelemetry as an envelope format for these payloads. So, like you suggest above, just say "format = JFR" and then follow it with the raw data. IMO that can be done for the whole recording without worrying about the batches or other format internals.

This puts the burden of interpreting the data on upstream receivers. E.g. for Go there is an official library; for JFR, similar unofficial (AFAIK) libraries exist.

It's not ideal, but the alternatives are even less appealing IMO. But again, I'm open to ideas.

@beaubelgrave
Author

>> perhaps those runtimes should have stable event documentation like the CLR does.

> CLR ETW looks nice. What's the strategy when it comes to runtime internals changing and some events no longer making sense?
>
> I'm asking because both JFR and Go don't specify their formats, since doing so is a maintenance PITA for them when it comes to evolving the internals of the runtime and the format.

@noahfalk, do you want to take this? In general, ETW has a version byte that is used to version events if they need to change/append new data. However, I'm unsure about entirely deprecating events.

>> @felixge Is there an alternative approach you were thinking about?

> Yeah, I was thinking of just using OpenTelemetry as an envelope format for these payloads. So, like you suggest above, just say "format = JFR" and then follow it with the raw data. IMO that can be done for the whole recording without worrying about the batches or other format internals.
>
> This puts the burden of interpreting the data on upstream receivers. E.g. for Go there is an official library; for JFR, similar unofficial (AFAIK) libraries exist.
>
> It's not ideal, but the alternatives are even less appealing IMO. But again, I'm open to ideas.

I would like a way for technologies that are mature enough to have well-described events to represent them clearly in OTel. However, I totally understand the need for an opaque pass-through model as well. I think the metadata format would allow for both. If it's simply pass-through and nothing else, you'd just have a single byte array with a single metadata entry stating format = JFR. For well-described cases, you'd have an array of metadata and binary blobs. I think it can handle both.

@noahfalk

>> I'm asking because both JFR and Go don't specify their formats, since doing so is a maintenance PITA for them when it comes to evolving the internals of the runtime and the format.

> @noahfalk, do you want to take this? In general, ETW has a version byte that is used to version events if they need to change/append new data. However, I'm unsure about entirely deprecating events.

CLR has a couple of different approaches to this.

  1. We defined two different providers, a public one and a private one. The public one is intended to be relatively stable; the private one is intended for random internal details that might change at any time.
  2. Our events can be versioned in a back-compatible way by increasing a version number and appending new fields to the end. The reader can ignore trailing fields it doesn't understand (see the sketch after this list).
  3. If the runtime changed in a way that made some old event useless, we could stop generating that event and start generating a new one. No examples of us doing this come to mind, though, so I'm guessing it has been rare. I wasn't in charge of the event portion until ~5 years ago, so maybe it happened more in the past.
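A minimal sketch of the back-compat behavior in point 2, with an entirely hypothetical event layout (not the real CLR field definitions): a reader built against v1 keeps working on a v2 payload because it only looks at the leading bytes it knows about.

```go
package auxdata

import "encoding/binary"

// readExampleEventV1 decodes only the two fields a hypothetical v1 of an
// event defined. A v2 producer that appended extra fields still decodes fine
// here, because the trailing bytes are simply ignored.
func readExampleEventV1(payload []byte) (count, depth uint32, ok bool) {
	if len(payload) < 8 {
		return 0, 0, false
	}
	count = binary.LittleEndian.Uint32(payload[0:4])
	depth = binary.LittleEndian.Uint32(payload[4:8])
	return count, depth, true // any v2+ fields after offset 8 are skipped
}
```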

I think there are two different levels to the format. The doc you pointed at, Beau, is a set of .NET-specific semantic conventions defining specific fields and the meaning of those fields for different event types. There is also the Nettrace format, which describes how data for arbitrary events gets encoded in a file. This is CLR's platform-neutral tracing format and might be the analog of JFR, pprof, or the new format being standardized here. CLR can write the same events into the Nettrace format on any platform, or write to ETW on Windows / LTTng on Linux.
