Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracing: Recorded flag not propagated between activities #9011

Open
Costo opened this issue May 17, 2024 · 6 comments
Open

Tracing: Recorded flag not propagated between activities #9011

Costo opened this issue May 17, 2024 · 6 comments
Assignees

Comments

@Costo
Copy link
Contributor

Costo commented May 17, 2024

Hi,

I'm experiencing something similar to this issue in Orleans: open-telemetry/opentelemetry-dotnet#4474
The recorded flag is not propagated between activities, making impossible to sample or exclude whole traces from the root activity.

I was able to fix the issue with a 1-line change to ActivityPropagationOutgoingGrainCallFilter.
See the diff here: https://github.com/dotnet/orleans/compare/main...Costo:orleans:fix/recorded-flag?expand=1

I'm not sure I understand what the problem is and why my change works. I tried to reproduce the issue in ActivityPropagationTests but did not succeed.

However, I can give the graph below as a "proof" that the fix works.
When we migrated to from Orleans 3.5 to Orleans 8.0 with OpenTelemetry, we quickly realized that something was wrong as we were quickly hitting the 100GB daily cap every day. Since the fix was applied, it is much better, even if we still have some root traces we can filter out.

Screenshot 2024-05-17 155703

@ReubenBond
Copy link
Member

Thanks for reporting, @Costo!
@noahfalk are you able to advise us on what we should be doing here - is the one-line fix @Costo used the right fix? https://github.com/dotnet/orleans/compare/main...Costo:orleans:fix/recorded-flag?expand=1

@ReubenBond ReubenBond self-assigned this May 17, 2024
@noahfalk
Copy link
Member

Yeah, I think that is correct. OpenTelemetry requested we add this extra IsRemote bit to ActivityContext that is intended to indicate whether an ID originated from a remote machine across the wire or it was captured from some locally running Activity. Their sampling logic allows folks to use different heuristics depending on whether the parent is remote or not. However we already had APIs like ActivitySource.CreateActivity(string traceId, ...) that didn't accept the extra bit so they had to make an arbitrary choice whether to assume isRemote=true or isRemote=false for these calls. We wound up with setting isRemote=false as the default. In hindsight I wish we'd gone the other way but we've got back compat so no plans to change it. This fix works around that questionable choice of default by creating the ActivityContext yourself with the bit set correctly and then calling an overload that takes the ActivityContext.

Let me know if you've got any questions. Hope that helps :)

@noahfalk
Copy link
Member

fyi @tarekgh

@ReubenBond
Copy link
Member

Thanks for the explanation, @noahfalk!

ActivityContext.TryParse(traceParent, traceState, isRemote: true, out ActivityContext parentContext);
activity = source.CreateActivity(context.Request.GetActivityName(), ActivityKind.Server, parentContext);

When can TryParse fail here? If it does, should we be passing traceParent to CreateActivity instread of parentContext?
Is this the ideal solution, or is there something superior we should be considering?

@noahfalk
Copy link
Member

It fails when the format of traceParent doesn't match the W3C spec (code here: https://source.dot.net/#System.Diagnostics.DiagnosticSource/System/Diagnostics/ActivityContext.cs,70)

If it does, should we be passing traceParent to CreateActivity instread of parentContext?

There is an older .NET specific id format called hierarchical ID that ActivitySource.CreateActivity() also accepts, but I'm not sure if you'd every encounter it in your scenarios? For .NET Framework apps it is the default format produced when creating Activities but on .NET Core apps you'd only get it by explicitly opting into the legacy behavior which I suspect almost nobody does. If your code is .NET Core only I don't think you'd be giving up much by saying you only support W3C TraceParent IDs.

https://learn.microsoft.com/en-us/dotnet/core/diagnostics/distributed-tracing-concepts#activity-ids

@ReubenBond
Copy link
Member

Great, thanks again. We'll likely merge this, then. @Costo would you be happy to open a PR based on your change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants