Profiling SIG: No place for decodable auxiliary/binary data for further detailed analysis #3979
Comments
Thanks for raising this. In my mind this use case sounds similar to the one of preserving JFR events that currently don't map well to the profiling spec. Go execution traces are in a very similar situation. Right now the most natural "extension point" to transport such data as part of OTel is the That being said, if you have specific ideas for supporting the data you have in mind, please sketch them out here! |
I agree, it's similar to JFR (and CLR) events that don't map well. Personally, I would prefer to have this within the OTLP protobuf message as a separate section, like the pprof extended is. For efficiency/complexity, I'm leaning toward how Linux has done tracepoints: you have a set of metadata that describes each event (by ID), and then you have data that is simply an event ID + payload. A set of metadata (could be as simple as a set of KeyValue objects) just needs to be defined per ID. Typically, the fields described by the metadata in those events are pretty basic data types (string, UTF-8/UTF-16, u16, u32, u64, s16, s32, s64, etc.). Each field points to an offset within the event's binary data that holds that type (the type also defines the length). There is one special type in the Linux tracepoint architecture (__rel_loc/__data_loc) that offers a header for handling variable-length types (string, char, struct, etc.). See this. An alternative to the metadata -> event ID approach is an entirely self-defined payload (we use both approaches in our formats). Check out this. Depending on the approach, you could get as simple as just an array of byte sequences (EventHeader approach) or a set of KeyValues linked by event ID + byte sequence. |
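To make the "metadata per event ID, then event ID + payload" idea a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Field` and `EventMetadata` classes, the `TYPE_CODES` table, the `page_fault` event and its layout are hypothetical and are not part of OTLP, pprof, or any Linux API. It only shows that the payload bytes are stored untouched and decoded later from the metadata.

```python
import struct
from dataclasses import dataclass

# Hypothetical type table: type names follow the comment above,
# values are little-endian struct codes (an assumption of this sketch).
TYPE_CODES = {"u16": "<H", "u32": "<I", "u64": "<Q",
              "s16": "<h", "s32": "<i", "s64": "<q"}


@dataclass(frozen=True)
class Field:
    name: str
    type: str    # key into TYPE_CODES; the type also implies the length
    offset: int  # byte offset of the field within the event payload


@dataclass(frozen=True)
class EventMetadata:
    event_id: int
    name: str
    fields: tuple


def decode(metadata_by_id, event_id, payload):
    """Decode one (event ID, payload) record using the metadata for that ID."""
    meta = metadata_by_id[event_id]
    values = {}
    for f in meta.fields:
        (values[f.name],) = struct.unpack_from(TYPE_CODES[f.type], payload, f.offset)
    return meta.name, values


# Metadata is stored once per event ID ...
METADATA = {
    1: EventMetadata(1, "page_fault",
                     (Field("cpu", "u16", 0), Field("address", "u64", 8))),
}

# ... and each record is just the ID plus the raw bytes copied from the source.
payload = struct.pack("<H6xQ", 3, 0x7F00DEADBEEF)  # made-up sample bytes
records = [(1, payload)]

decoded = [decode(METADATA, event_id, data) for event_id, data in records]
print(decoded)
```

The capture side only appends `(event_id, bytes)` pairs; all interpretation happens in `decode`, which can run long after collection.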
We need input from the Profiling SIG on this. Will add to the Profiling SIG agenda. |
@beaubelgrave how do you imagine the proposed encoding to work for JFR or Go Execution Traces? Both formats are emitted by the runtimes, and are heavily optimized for their respective data payloads. Converting them to an alternate representation is going to cause a significant amount of decoding/re-encoding overhead. The resulting data is likely to be less compact than before. I'm assuming that for the data you're interested in, you can control the code that is producing the data? |
I don't think you change them at all; you just create a "decoder ring" once for those events. The minimal way to achieve this is to first add metadata that describes where JFR or Go has put the various fields of each event you are capturing. Then, when those events are captured, you simply append them with the appropriate metadata ID. We don't often control the payload format; however, we do know the details of these formats well enough to create the metadata. On Linux this can be found via the tracepoint/trace_event definition within tracefs (/sys/kernel/tracing/events). On Windows, this can be determined from the ETW manifest, or it can be self-described within the event itself (in that case, we'd need the metadata to state that it's using some well-known format instead of metadata field descriptions). For a concrete example, let's take a look at the sched_waking event on Linux, which tells us when a thread is waking up. You can get the format from /sys/kernel/tracing/events/sched/sched_waking/format:
In the above, each field has a type, name, offset, and finally size. We only need to capture the above format details once; then, when the payload for "sched_waking" is captured, we simply copy the bytes and link them to the above metadata. There is no re-shaping of data. I'm unfamiliar with the JFR and Go events you think would be hard to describe here, but I am familiar with the kernel events and user_events on Linux as well as the CLR runtime events for C#. I believe those can all be described in some metadata block. The hardest case for these is when a dynamically sized array (strings, etc.) is in the middle of the event (instead of at the end). The metadata format needs to have the proper language to describe this so it can be decoded properly. With this approach, the write/capture path is very fast. The decoding can be done later and, depending on the complexity of the events, may be costly. Typically, runtime events are small and don't have dynamic data within them when they come at high rates. However, it could be that Go or JFR didn't take that path. @felixge Can you share the details of the events you think would have a hard time being described in this format? |
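As a rough sketch of the "capture the format once, copy the bytes, decode later" flow for sched_waking: the Python below parses a tracefs-style format description and applies it to a raw payload. The `SCHED_WAKING_FORMAT` text is an illustrative assumption; the real fields, offsets, sizes, and ID depend on the kernel version, so read the actual file on the target system. The helper names (`parse_format`, `decode`) are hypothetical, the payload is fabricated sample bytes, and the sketch assumes a little-endian host and does not handle `__data_loc`/`__rel_loc` fields.

```python
import re
import struct

# Illustrative only: layout written in the style of a tracefs format file.
# Actual fields, offsets, and the ID vary between kernel versions.
SCHED_WAKING_FORMAT = """\
name: sched_waking
ID: 321
format:
    field:unsigned short common_type; offset:0; size:2; signed:0;
    field:unsigned char common_flags; offset:2; size:1; signed:0;
    field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
    field:int common_pid; offset:4; size:4; signed:1;

    field:char comm[16]; offset:8; size:16; signed:0;
    field:pid_t pid; offset:24; size:4; signed:1;
    field:int prio; offset:28; size:4; signed:1;
    field:int target_cpu; offset:32; size:4; signed:1;
"""

FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);"
    r"\s*size:(?P<size>\d+);\s*signed:(?P<signed>\d);"
)


def parse_format(text):
    """Turn the format text into a list of field descriptions (done once)."""
    fields = []
    for m in FIELD_RE.finditer(text):
        ctype, _, name = m["decl"].strip().rpartition(" ")
        fields.append({
            "name": name.split("[")[0],
            "is_char_array": ctype.endswith("char") and "[" in name,
            "offset": int(m["offset"]),
            "size": int(m["size"]),
            "signed": m["signed"] == "1",
        })
    return fields


def decode(fields, payload):
    """Decode a raw payload later, using only the captured field descriptions."""
    event = {}
    for f in fields:
        raw = payload[f["offset"]:f["offset"] + f["size"]]
        if f["is_char_array"]:
            event[f["name"]] = raw.split(b"\0", 1)[0].decode("ascii", "replace")
        else:
            event[f["name"]] = int.from_bytes(raw, "little", signed=f["signed"])
    return event


FIELDS = parse_format(SCHED_WAKING_FORMAT)

# Fabricated payload matching the illustrative layout above.
payload = struct.pack("<HBBi16siii", 321, 0, 0, 1234, b"kworker/0:1", 4321, 120, 2)
event = decode(FIELDS, payload)
print(event)
```

Note that `payload` is never rewritten: the collector would store it as-is next to a reference to the parsed metadata.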
@beaubelgrave ah, I think I understand your idea better now. But I think that even creating this metadata will be challenging because of the following:
That means the data becomes available to user space in batches of events. Adding a metadata ID in front of every event (or in a separate section) requires decoding the batches into individual events. This means that the events have to be LEB128 decoded to some degree, which is not cheap. The next problem is that both formats don't have a specification, and the runtime may change the encoding details between minor versions. JFR is also already self-describing; it's just really complex. So I'm a bit unsure whether adding a meta layer over these data sources would be a good idea. But I'm open to considering it further! |
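To illustrate why splitting a batch requires decoding, here is a small Python sketch of generic unsigned LEB128 decoding. It does not reproduce the JFR or Go execution trace layouts (neither is specified in this thread); the `decode_all` helper and the sample batch are hypothetical. The point is only that the end of one variable-length value can't be found without walking its bytes.

```python
def decode_uleb128(buf, pos=0):
    """Decode one unsigned LEB128 integer from buf starting at pos.

    Returns (value, next_pos).
    """
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift   # low 7 bits carry data
        if byte & 0x80 == 0:              # high bit clear: last byte of this value
            return value, pos
        shift += 7


def decode_all(buf):
    """Hypothetical batch: LEB128 values packed back to back with no length prefix."""
    values = []
    pos = 0
    while pos < len(buf):
        value, pos = decode_uleb128(buf, pos)
        values.append(value)
    return values


# 0x01 -> 1, 0xE5 0x8E 0x26 -> 624485, 0x7F -> 127
batch = bytes([0x01, 0xE5, 0x8E, 0x26, 0x7F])
values = decode_all(batch)
print(values)
```

Every value in the batch has to be visited byte by byte just to locate the next one, which is the per-event cost being discussed.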
Regarding "The next problem is that both formats don't have a specification and the runtime may change the encoding details between minor versions": I agree on this; perhaps those runtimes should have stable event documentation like the CLR does. @felixge Is there an alternative approach you were thinking about? |
One possible approach could be that the metadata can also describe a set of formats. It may be able to describe per-event details, but it could also say format = "JFR", and then the byte blob from the per-thread buffers is just copied. Tools that understand "JFR" could parse it. While not ideal, it would allow storing a mix of well-described data and data that can't be described with basic fields, types, and offsets. |
CLR ETW looks nice. What's the strategy when it comes to runtime internals changing and some events no longer making sense? I'm asking because both JFR and Go are not specifying their format because it's a maintenance PITA for them when it comes to evolving the internals of the runtime and the format.
Yeah, I was thinking to just use OpenTelemetry as an envelope format for these payloads. So, like you suggest above, just say "format = JFR", and then follow it with the raw data. IMO that can be done for the whole recording without worrying about the batches or other format internals. This puts the burden of interpreting the data on upstream receivers. E.g. for Go there is an official library. For JFR, similar unofficial (AFAIK) libraries exist. It's not ideal, but the alternatives are even less appealing IMO. But again, I'm open to ideas. |
@noahfalk, you want to take this? In general, ETW has a version byte that is used to version events if they need to change/append new data. However, I'm unsure about entirely deprecating events.
I would like a way for technology that is mature enough to have well-described events to be able to represent them clearly in OTel. However, I totally understand the need for some opaque pass-through models as well. I think the metadata format would allow for both. If it's simply pass-through and nothing else, you'd just have a single byte array with a single metadata entry stating format = JFR. For well-described cases, you'd have an array of metadata and binary blobs. I think it can handle both. |
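A possible shape for a container that handles both cases could look like the Python sketch below. This is only an assumption about how it might be modeled: `Metadata`, `AuxiliaryData`, and `describe` are hypothetical names, not OTLP messages, and the "JFR" blob is placeholder bytes rather than a real recording. A metadata entry either names a well-known format (opaque pass-through) or lists `(name, offset, size)` field descriptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Metadata:
    metadata_id: int
    # Either a well-known format name for opaque pass-through (e.g. "JFR") ...
    format: Optional[str] = None
    # ... or per-field descriptions as (name, offset, size) tuples.
    fields: tuple = ()


@dataclass
class AuxiliaryData:
    metadata: dict = field(default_factory=dict)  # metadata_id -> Metadata
    blobs: list = field(default_factory=list)     # (metadata_id, raw bytes)


def describe(aux):
    """Show how a receiver might treat each blob based on its metadata."""
    out = []
    for metadata_id, data in aux.blobs:
        meta = aux.metadata[metadata_id]
        if meta.format is not None:
            # Opaque: hand the bytes to a tool that understands the named format.
            out.append(("opaque", meta.format, len(data)))
        else:
            # Described: decode unsigned little-endian fields by offset and size.
            values = {name: int.from_bytes(data[off:off + size], "little")
                      for name, off, size in meta.fields}
            out.append(("described", values))
    return out


aux = AuxiliaryData()
aux.metadata[1] = Metadata(1, format="JFR")
aux.metadata[2] = Metadata(2, fields=(("cpu", 0, 2), ("cycles", 2, 4)))
aux.blobs.append((1, b"\x00" * 32))  # placeholder bytes, not a real JFR recording
aux.blobs.append((2, (5).to_bytes(2, "little") + (1000).to_bytes(4, "little")))

result = describe(aux)
print(result)
```

The pure pass-through case degenerates to one metadata entry and one blob, while the well-described case uses many entries of the second kind.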
CLR has a couple different approaches to this.
I think there are two different levels to the format. The docs you pointed at, Beau, are .NET-specific semantic conventions defining specific fields and the meaning of those fields for different event types. There is also the Nettrace format, which describes how data for arbitrary events gets encoded in a file. This is CLR's platform-neutral tracing format that might be the analog of JFR, pprof, or the new format being standardized here. CLR can write the same events into Nettrace format on any platform, or write to ETW on Windows/LTTng on Linux. |
Our proprietary profiling format has space for both sampled data, such as CPU samples, and supporting data, such as when each thread was readied (and by whom), among many other things. Several existing documented formats/technologies for profiling allow for this. For example, ETW on Windows allows a profiler to get events about CPU sampling, but it can also get non-sampled data like context switches or page faults that are occurring. The Perf kernel system (and the matching CLI tool) on Linux also supports capturing profiling data together with supporting information, such as other tracepoints (uprobes, user_events, kprobes, etc.) on the system.
Typically, auxiliary data comes in at a higher rate and doesn't have callstacks (although some events do). It doesn't quite map to a sample linked to addresses. It is more like a set of metadata linked to several instances of binary data that can be decoded after collection using the linked metadata.
I expected to see a way to store this type of auxiliary data within the profile format; however, I don't see a way to store binary data with metadata for decoding that supports understanding performance at a deeper level. While this auxiliary data might not be useful for aggregation, it is very useful when a profile comes in that is anomalous and a deep investigation is needed for that specific profile/time period.
For example, when CPU samples are hot in a function, we may want to see which branches are being mispredicted or whether cache-line misses are occurring. These events will not have callstacks or even an associated instruction pointer; they may be linked solely to a CPU at a point in time, and that data needs to be further mixed with other auxiliary data to truly understand the system.