
Switch pipeline trace to DiagnosticSource #1169

Merged
merged 47 commits into v6 from issue-788 on Jul 21, 2020

Conversation

@tillig (Member) commented Jul 17, 2020

This PR is part of #788 and is the first step in getting the Autofac.Diagnostics.DotGraph package out.

It updates the diagnostic mechanism to use System.Diagnostics.DiagnosticSource to emit events and moves the diagnostics classes into an Autofac.Diagnostics namespace.

Subscribing to events in this system means creating a System.Diagnostics.DiagnosticListener and attaching it to the container's DiagnosticSource. Internally this follows an observer pattern that receives a stream of events, keyed by string IDs, with which the listener may do... whatever they want. There are convenience methods for subscribing to the container's diagnostics, like:

container.SubscribeToDiagnostics<MyListener>();

The default diagnostic tracer was updated to be a diagnostic listener in this format. Rather than paste an example in here, check that out.
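
For anyone who wants to hand-roll a subscriber instead of deriving from the shipped tracer, here's a rough sketch of the underlying observer mechanism. It assumes the container-level source is a System.Diagnostics.DiagnosticListener (an IObservable<KeyValuePair<string, object>>) exposed via the DiagnosticSource property discussed later in this thread; the observer and wiring below are illustrative, not the exact Autofac API.

using System;
using System.Collections.Generic;
using System.Diagnostics;

// Illustrative observer: receives the string-keyed event stream and logs the IDs.
public sealed class ConsoleDiagnosticObserver : IObserver<KeyValuePair<string, object>>
{
    public void OnNext(KeyValuePair<string, object> evt)
    {
        // evt.Key is the event ID string; evt.Value is the payload for that event.
        Console.WriteLine($"Autofac diagnostic event: {evt.Key}");
    }

    public void OnCompleted() { }
    public void OnError(Exception error) { }
}

// Hypothetical wiring - the SubscribeToDiagnostics<T>() extension above is the
// convenience path; this just shows the raw observer subscription underneath.
// if (container.DiagnosticSource is DiagnosticListener listener)
// {
//     listener.Subscribe(new ConsoleDiagnosticObserver());
// }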

I tried to follow patterns I'd seen in areas like ASP.NET Core with respect to things like:

  • Checking to see if there's a listener subscribed prior to emitting an event.
  • Keeping some things internal, like the strings that are the list of event IDs.

That latter one was sort of interesting. I wondered why it was internal but realized a couple of things:

  1. Constants are faster because they get compiled into your code, but... they get compiled into your code. So changing the constant later doesn't actually take effect in consuming code outside the assembly. You'd have to use static readonly, but that's not as fast access-wise and this needs to be super low overhead.
  2. Adding, removing, or changing events means adding, removing, or changing these event IDs. If there's not a public constants class, we can do that pretty freely (within reason), and the most "breaking" it gets for a consumer is that they stop receiving that event. If it were part of the public API, it would break in a more fatal way, and that's not cool. (There's a rough sketch of the internal event ID pattern right after this list.)
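
For illustration, the internal event ID pattern from point 2 looks roughly like this; the class name and strings are made up for the example, not the actual identifiers in the PR.

// Hypothetical shape of the internal event ID list; names are illustrative only.
internal static class ExampleDiagnosticEventIds
{
    public const string OperationStart = "Autofac.Operation.Start";
    public const string OperationSuccess = "Autofac.Operation.Success";
    public const string OperationFailure = "Autofac.Operation.Failure";
}

// Because the class is internal, these strings can be added, renamed, or removed
// without a binary-breaking change for consumers; a listener filtering on an old
// string simply stops receiving that event.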

The history is pretty messy because I started with the DOT graph tracer in this PR and then it moved out to a separate package. I tried to squash/rebase it but that, too, got kinda dicey since I'd been pulling changes from the v6 branch and... yeah.

tillig and others added 30 commits July 6, 2020 15:07
Remove IsTopLevelOperation and ITracingIdentifier

@tillig (Member Author) commented Jul 17, 2020

A few extra things did sneak in here:

  • The .editorconfig got updated based on styles we're hoping to make uniform. They do provide some "info" level suggestions in VS Code; we'll want to look at fixing some of that up once v6 is in develop.
  • We have a .github repo now with things like the bug/issue templates, the code of conduct, and so on. We need to remove them from here to allow that central repo to work.
  • I found a couple of PSScriptAnalyzer warnings I fixed in the build.ps1 script.
  • The MetadataKeys file had a TODO that'd been in there for quite some time and was actually obsolete. I removed the associated "TODO: delete this stuff" stuff. I did that because I was trying to use TODO to track the stuff I was working on and... my OCD took over.

I didn't run the perf metrics against it to see if things changed, though I can't imagine it's significant enough to make a huge impact. Then again, I've been wrong before.

One thing that is different is that the diagnostic source is shared from the container down. You attach at the container level, not at any arbitrary lifetime scope level. I'm not sure if this reduction in functionality matters too much, but it's important to be aware of - it's different from the pipeline tracer approach we had before.

@alistairjevans (Member) left a comment

Functionally looks great, I can see the benefits of using the DiagnosticSource system.

Unfortunately, there is a performance regression from v6. I've run the DeepGraph and ChildScope benchmarks comparing v6 against issue-788:

DeepGraphResolve - ~1.2μs / 13% slower

Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
v6 9.513 μs 0.0358 μs 0.0299 μs 2.0599 0.0458 - 16.92 KB
issue-788 10.78 μs 0.092 μs 0.087 μs 2.0599 0.0458 - 16.91 KB

ChildScopeResolve - ~2μs / 7% slower

Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
v6 30.47 μs 0.176 μs 0.156 μs 3.9063 1.9531 - 32.09 KB
issue-788 32.70 μs 0.456 μs 0.427 μs 4.0894 2.0142 - 33.88 KB

There are some thoughts and proposed changes in the individual comments, including on the performance; let me know what you think!

Code under review (old version, then new version):

// Old:
// Optimise the path depending on whether a tracer is attached.
if (ctxt.TracingEnabled)

// New:
// Optimise the path depending on whether diagnostics are enabled.
if (ctxt.DiagnosticSource.MiddlewareDiagnosticsEnabled())

Member

I think this line may be at least partly responsible for the performance regression when there are no listeners; this line of code is possibly the hottest path in the entire pipeline.

In the original version of tracing I had to add a boolean cache of this 'Enabled' flag and store it in the Request Context. Can we do that again, and cache the result of this method when the Request begins?
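
A rough illustration of the suggested cache; the type and member names here are hypothetical stand-ins, not the actual request context in the codebase.

// Hypothetical request context: the subscriber check happens once when the request
// begins, and the hot middleware path reads a plain bool instead of re-querying
// the diagnostic source for every event.
internal sealed class RequestContextSketch
{
    public bool DiagnosticsEnabled { get; private set; }

    public void BeginRequest(System.Diagnostics.DiagnosticListener source)
    {
        // Cached up front; subscription changes made mid-request won't be seen
        // until the next request, which is the trade-off discussed below.
        DiagnosticsEnabled = source.IsEnabled();
    }
}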

Member Author

This is tough because the thing being checked is looking to see if any subscribers have this enabled, right? And there could be multiple subscribers, which could be added at any time.

I could simplify the checks - check for literally the exact event being subscribed instead of one of a logical set. Right now I check for start or success or fail to determine if any subscribers exist, but in the majority case I'm guessing that's three "false" checks every time. I could switch it to check only for the specific event being issued. It'll change the flow a bit since you could subscribe to a failure event without subscribing to start or success, but maybe that's okay.
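
Sketched out, that change would look something like this; the event name string is illustrative and MiddlewareDiagnosticsEnabled() stands in for the current aggregate helper.

// Current aggregate check: true if start, success, or failure has any subscriber.
// if (ctxt.DiagnosticSource.MiddlewareDiagnosticsEnabled()) { ... }

// Proposed: only ask about the specific event that's about to be written.
if (ctxt.DiagnosticSource.IsEnabled("Autofac.Middleware.Start"))
{
    ctxt.DiagnosticSource.Write("Autofac.Middleware.Start", payload);
}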

Member

I think it's probably justified for performance reasons to say "If you change diagnostic subscriptions to middleware events, that change may not take effect until the next resolve operation."

Especially since the 99% use case is going to be turning it on once, for all events, and never changing it.

Member Author

So evaluate all the subscriptions at the operation level and pass that all down through the stack? I think that might add some complexity. Let me see about making the checks more precise and see where that leaves us. I also have a not-totally-formed idea about using flags internally to do bitwise checks and speed up comparison.
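
The flags idea isn't implemented in this PR, but the rough shape would be something like the following (all names hypothetical):

// Hypothetical: evaluate subscriptions once, collapse the result into a bit mask,
// and test individual events with a cheap bitwise AND instead of string lookups.
[System.Flags]
internal enum EnabledDiagnostics
{
    None = 0,
    MiddlewareStart = 1 << 0,
    MiddlewareSuccess = 1 << 1,
    MiddlewareFailure = 1 << 2,
}

// Hot-path check:
// if ((enabled & EnabledDiagnostics.MiddlewareStart) != 0) { /* write the event */ }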

Member

Honestly I was just thinking that if someone changes the set of subscribers or subscribed events such that our checking method returns a different value, we might not raise some middleware events during that transition. But then, if someone did change that set during a request, they would have missed related operation and request messages anyway, which could introduce different problems.

Is this sort of thing what the DiagnosticSource Activity functionality is for?

Fundamentally it's whatever gets us the perf saving with no subscribers, so whatever approach works.

Are we anticipating encouraging the use of diagnostics in production? If so, do we need to bench the diagnostic source capture?

Member Author

I thought about using Activity but it didn't seem worth the allocations.

I'm less concerned about changing subscriptions in the middle of the request than I am about trying to be too smart in the caching mechanism such that we stop responding correctly to observables or it becomes hard to add a new event in here. I could see wanting diagnostic events for all the regular lifetime events, too.

I don't think diagnostics in prod would be recommended. More like "turn this on if you find a problem and turn it back off" or something.

Member Author

Realized now that I was basically checking enabled twice for any given write.

First in if (source.MiddlewareDiagnosticsEnabled()) and again in the actual call to write the event. Same for operation and request.

I've removed the aggregate checks and now just leave it to the call that writes the event. This should remove some of the overhead. Let's see where this puts us perf-wise and I can try to tweak some more if it looks like it still needs help.

I'm trying to not cart too much additional context around if possible, and while a single Boolean isn't that big a deal, the orchestration of keeping the cache up to date with the actual settings seems like it could be a challenge long-term, similar to how container rebuilding got to be a bit sticky and unpredictable.

I can see where there is a use case for success or failure instrumentation, like perf counters or something, to track stats. That'd mean subscribing to just success/failure events but maybe not start events. The ability to have that granularity seems interesting to me, rather than just "all on" or "all off" for a whole request.

I can also see that we may want to support (at some point) the ability to conditionally determine if you're 'enabled' based on the data itself which wouldn't be possible if we try to pre-cache things too much. Like, "I want to see events that involve resolving a certain type" to diagnose an issue would be a use case for being enabled based on certain data.
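
That data-based scenario maps onto the predicate overload that DiagnosticListener.Subscribe already provides; a sketch, reusing the listener and observer from the earlier example, with a purely illustrative filter:

// The predicate participates in the listener's IsEnabled checks, so when the
// producer passes the payload into IsEnabled(name, payload), a subscriber can opt
// in based on that data rather than just the event name.
IDisposable subscription = listener.Subscribe(
    observer,
    (eventName, payload, _) => payload?.ToString()?.Contains("MyService") == true);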

Member Author

Just pushed an update to have the event source DiagnosticListener check for IsEnabled() before checking if the specific named subscription is enabled. Benchmark the crap out of it; it should still basically boil down to a check for subscriptions != null before doing anything, which should be just as fast as what we had before. Granted, it's going to do that null check for each event rather than just one time, but that should still be pretty fast.
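
In DiagnosticSource terms, the emit path described here boils down to something like the following; the event name is illustrative and the payload is a placeholder.

// IsEnabled() with no arguments is effectively a "does anyone subscribe at all?"
// check (subscriptions != null), so the common no-listener case bails out before
// any string comparison or payload allocation happens.
if (diagnosticListener.IsEnabled() && diagnosticListener.IsEnabled("Autofac.Middleware.Start"))
{
    diagnosticListener.Write("Autofac.Middleware.Start", new { context = ctxt, middleware });
}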

Files with resolved review threads:

  • src/Autofac/Diagnostics/DiagnosticTracerBase.cs (two threads, outdated)
  • src/Autofac/Autofac.csproj
  • src/Autofac/Core/Container.cs (outdated)
  • src/Autofac/Core/Resolving/ResolveOperationBase.cs (two threads, outdated)
  • src/Autofac/ILifetimeScope.cs (outdated)

@tillig (Member Author) commented Jul 20, 2020

First push today fixes the this references. Diving into other stuff now.

@tillig (Member Author) commented Jul 20, 2020

A few more pushes address:

  • "this" usage - removed all the explicit this prefixes from files under src/**
  • documentation updates
  • switching things so only IContainer exposes DiagnosticSource
  • disposal of the DiagnosticSource when the container is disposed
  • updating the event handling methods of the base tracer to be protected

@tillig (Member Author) commented Jul 20, 2020

Pushed an update to the way things get checked for subscription enablement to try to speed that up.

I think it's ready for another review and benchmarking.

@alsami (Member) commented Jul 21, 2020

I think it's ready for another review and benchmarking.

Just ran the Benchmarks on Linux, hope that's fine.

DeepGraphResolve Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
v6 16.67 us 0.206 us 0.183 us 0.4883 - - 16.92 KB
issue-788 17.50 us 0.106 us 0.099 us 0.4883 - - 16.91 KB

Increase of around 4%

ChildScopeResolve Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
v6 57.70 us 0.163 us 0.127 us 1.0986 0.5493 - 37.62 KB
issue-788 58.22 us 0.277 us 0.259 us 1.0986 0.5493 - 37.61 KB

Increase of around 0.9%

@alistairjevans (Member) left a comment

All the functionality changes look good, API surface on the tracer is nice and compact now.

Perf-wise, some improvement, although not back to v6 levels. It's probably worth pointing out, though, that we're still slightly faster than develop even after these changes (I've included the develop results here).

DeepGraphResolve - ~0.9μs / 8% slower (before was ~1.2μs / 13% slower)

Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
develop 10.82 us 0.056 us 0.052 us 2.2430 0.0458 - 18.45 KB
v6 9.569 μs 0.0553 μs 0.0517 μs 2.0599 0.0458 - 16.92 KB
issue-788 10.42 μs 0.080 μs 0.071 μs 2.0599 0.0458 - 16.91 KB

ChildScopeResolve - ~1.5μs / 5% slower (before was ~2μs / 7% slower)

Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
develop 37.33 us 0.229 us 0.214 us 6.4697 0.4272 - 52.86 KB
v6 30.92 μs 0.152 μs 0.142 μs 3.9063 1.9531 - 32.09 KB
issue-788 32.48 μs 0.141 μs 0.132 μs 4.0894 2.0142 - 33.88 KB

Regardless, I'm going to approve the changes, since we're down to refining it now. If you wanted to merge and address any remaining perf investigation separately (or under #1127) that's fine with me.

@tillig merged commit 7c7c773 into v6 on Jul 21, 2020
@tillig deleted the issue-788 branch on Jul 21, 2020 at 16:39

@alistairjevans (Member)

🍻

@tillig (Member Author) commented Jul 21, 2020

I appreciate the patience on this - it was definitely a long and winding road to get here, but I think it's good overall. Now let me see about getting that DOT graph tracer updated and PR'd based on these changes. 🎉

@alistairjevans linked an issue (DiagnosticSource Tracing) on Jul 22, 2020 that may be closed by this pull request
@alistairjevans added this to the v6.0 milestone on Sep 26, 2020