Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document compatibility requirements for profiles #559

Conversation

tigrannajaryan
Copy link
Member

Profiles maintainers (@open-telemetry/profiling-maintainers), please take a look at the pprof compatibility criteria added to CONTRIBUTING.md.

@tigrannajaryan
Copy link
Member Author

CC @aalexand

All changes to profile proto should meet the following compatibility criteria:

- Every valid pprof profile is also a valid Otel profile AND,
- Every Otel profile is a valid pprof profile, after any new fields introduced in
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can expand if this is not clear enough.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That second requirement is interesting. What does "valid" mean here? That it doesn't cause a decoding error? Or that it allows the pprof CLI tool to render flame graph? Arguably there isn't much value in being "valid" if it doesn't enable interoperability.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"doesn't cause a decoding error" is not good enough for me. All we would need to accomplish this is to just assign all new field numbers for Otel pprofextended messages and there would be no decoding errors since Protobuf decoder happily ignores unknown fields. This is obviously of no practical value.

What I am aiming for here is the following. If we extend profiles in Otel by new capabilities then those new capabilities obviously can't be understood by pprof. However ignoring those new additional capabilities should still result in a meaningful profile from pprof perspective.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should discuss the meaning of 'meaningful' in the next SIG meeting. Taking into account that the scope of pprofextended is larger than pprof, having every OTel profile be a valid and meaningful pprof profile may not be workable.

Quick examples that come to mind include missing symbols due to delayed symbolization (see the agenda for recent discussions around standardizing a symbol upload protocol) and deprecation / non-usage of fields that pprof expects to be present (e.g. Sample.location_index).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, let's discuss in the SIG.

This criterion came up in a meeting last week (suggested by @rsc). This is essentially a forward compatibility requirement. Due to how unknown fields are by default handled by Protobuf deserializers existing pprof tooling can read pprofextended with no code changes. There is certain value in this, although I don't know what practical use cases would benefit from this.

@aalexand also suggested an alternate criteria: "All Otel pprofextended profiles must be convertible to valid pprof profiles" (the precise meaning of "convertible" TBD). This is a less restrictive criteria that gives more flexibility to Otel profiles, but means if we want existing pprof tools to work with Otel profiles we either need a "converter" tool or need to teach pprof to understand the new Otel profiles directly.

There is probably other options that we can discuss.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aalexand it would be great if you can also join the Profiling SIG on May 30 to discuss the options.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll try to join.

The main question is basically whether OTel's pprofextended can become a source of truth for existing pprof clients without requiring them to migrate all at once (which is infeasible). For that to happen there needs to be an arbitrarily long period of time where the pprof's subset of pprofextended functionality is semantically equivalent to what pprof has. Today this is not the case - the most prominent example is the change from ID referencing to index referencing, since it's wire and semantics breaking.

An alternative is to simply consider pprofextended a fork of pprof where some sort of converter would exist from pprofextended to pprof format for pprof to be able to read pprofextended format. The converter does not have to be a part of pprof in this case, it could be anywhere. I suspect that Go runtime would probably keep using the current pprof format as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pprofextended already uses different names than pprof. Is this considered as conflict?
E.g. pprofextended Location.mapping_index vs pprof Location.mapping_id.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative is to simply consider pprofextended a fork of pprof where some sort of converter would exist from pprofextended to pprof format for pprof to be able to read pprofextended format. The converter does not have to be a part of pprof in this case, it could be anywhere. I suspect that Go runtime would probably keep using the current pprof format as well.

If we do this I think we loose the benefits of having one standard. In that case we may as well give Otel a complete freedom to designs its own format. I would like to avoid this.

pprofextended already uses different names than pprof. Is this considered as conflict?
E.g. pprofextended Location.mapping_index vs pprof Location.mapping_id.

We should consider 2 aspects of compatibility: wire compatibility and symbolic compatibility. Renaming fields does not impact wire encoding, so if we aim just for wire compatibility renames are fine. However I would argue that we should probably aim for symbolic compatibility and avoid renames, so that we make pprof tooling evolution to newer pprofextended versions easier.

In this particular case it may be nice to rename from mapping_id to mapping_index but I think names are similar enough that we can live with the old name too (assuming of course we also preserve the semantics).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most important change in this particular case is not just the name change, but the change in meaning:

mapping_id in pprof means "the positive value of the Mapping.id field of the referenced mapping or zero if this is a nil reference".

mapping_index in pprofextended means "the index of the mapping in the repeated Profile.mappings field" (plus rather undefined currently way to express nil references)

This semantics change is going to be fairly hard to deal with in terms of switching existing pprof clients to pprofextended.

Profiles maintainers, please take a look at the pprof
compatibility criteria added to CONTRIBUTING.md.

## Profiles Compatibility Requirements

The Otel profile format in
Copy link
Member

@arminru arminru May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The Otel profile format in
The OTel profile format in

(nit as per https://github.com/open-telemetry/opentelemetry-specification/blob/v1.33.0/specification/README.md?plain=1#L72)

@tigrannajaryan tigrannajaryan marked this pull request as draft May 21, 2024 14:47
@tigrannajaryan
Copy link
Member Author

I marked this draft for now. We should discuss the options.

@tigrannajaryan
Copy link
Member Author

@open-telemetry/profiling-maintainers I think we need to place a short moratorium on merging changes to profiles until we have a resolution on this PR otherwise we risk making changes that we may need to undo.

If you have any thoughts on alternate definitions of "compatibility" that you would prefer instead please comment and suggest. I do not want to block other PRs for too long, so ideally we should use the time we have until the next SIG meeting for offline discussion.

cc @jsuereth

@@ -12,3 +12,16 @@ for general information about the project.

After making any changes to .proto files make sure to generate all
implementation by running `make gen-all`.

## Profiles Compatibility Requirements
Copy link
Member

@jack-berg jack-berg May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the Java SIG we've been discussing whether our in-memory representation of the profiling data model should include the deprecated pprof fields. I argued that they should not, since doing so is confusing and error prone for users who wish to implement the data model: Which field do I need to provide when two represent the same information? What happens if there's a conflict - are exporters expected to perform validation?

Its simpler to say that our in-memory data model needs to be guaranteed to be able to encode information to the pprof format. In other words, its compatible with pprof rather than a perfect apples to apples reflection of it.

This is a more relaxed compatibility requirement that what has been proposed so far with the profiling in this repo. If applied here, it would allow the profiling proto representation to completely diverge from pprof so long as it could maintain compatibility - as long as it could be translated losslessly to pprof. This wouldn't weaken otel from a tooling standpoint, since as long as we guaranteed that compatiblity and maintained OTLP exporters as well as pprof serializers, users could plug otel profiling data into all the tools that support pprof.

However, it would mean that we have yet another standard, since pprof and the new pprof compatible otel standard would exist where previously only pprof existed. So the value in maintaining a stricter compatibility requirement where we have a profiling message type which is a strict extension of the equivalent pprof proto message type is that it provides a route to having only a single standard.

I think this is valuable.

But what would this actually mean in practice? All opentelemetry OTLP profile exporters would export data which is wrapped in additional messages such that it can't be directly plugged into pprof tools. It seems like we need a way for the pprof Profile message to have some value on its own in OpenTelemetry? For example, we could define a export rpc for the OTLP ProfilesService which accepts Profile in addition to the rpc which accepts ExportProfilesServiceRequest. This would really force the issue and ensure that existing pprof tooling can interact with opentelemetry in material way.

@tigrannajaryan
Copy link
Member Author

The PR does not explain why I am suggesting to have the compatibility criteria and I think it is important to capture it. This has been discussed verbally but I don't think it is written down anywhere, so here is a summary of what I am aware of.

The Profiling SIG made a decision to derive Otel profiling format from pprof. As far as I remember the primary reason for this decision was to avoid creating yet another new standard and instead to reuse an existing one.

We discussed with the pprof team the option to extend pprof format in a way that is compatible with existing pprof format and can be adopted by pprof tooling in the future. This will ensure that we do not create one more format but instead evolve an existing one.

For this to be achievable obviously the changes we make need to be "compatible", and what exactly that compatibility means needs to be defined precisely so that we can follow that definition when introducing changes.

If we don't define these compatibility requirements and follow them then what we will likely end up with is an Otel profile format that was derived from pprof, but because of changes became incompatible with it and essentially is a fork of pprof, is yet another standard, so we will fail the original goal.

If we didn't have this goal we could have instead decided that we want to start from scratch and create a complete new Otel profiling format. This would be a very different design approach. It would be less restricting and potentially would allow Otel to design a more efficient format due to lifting off this restriction. However, we chose not to do this since we felt that the benefit of having one less standard outweighs the potential benefits that are possible from a greenfield design.

I strongly believe we must continue looking for ways to remain pprof-compatible.

That said, I recognize that there is more than one way to define the compatibility requirements. The version I present in this PR is one of the more restrictive versions. I can envision alternate approaches that are less restrictive. We would like to discuss these different approaches in the next Profiling SIG meeting.

@@ -12,3 +12,16 @@ for general information about the project.

After making any changes to .proto files make sure to generate all
implementation by running `make gen-all`.

Copy link
Contributor

@florianl florianl May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this section should also align with Compatibility With Original pprof in the data model.

The proposed changes would conflict as the original statement is:

It is not forward compatible, meaning that a pprof file generated by the new proto cannot be parsed by existing software. This is mainly due to the sharing of the call stacks between samples + new format for labels (more on these differences below).

I think, it is fine to change these original compatibility requirements, but it should be aligned and adapted correctly.

@tigrannajaryan
Copy link
Member Author

Closing for now, issue is recorded to figure this out: #567

@tigrannajaryan tigrannajaryan deleted the feature/tigran/pprofcompat branch June 5, 2024 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants