New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Monotonic clock #1173
Monotonic clock #1173
Conversation
When measuring durations, it's preferable to use the system monotonic clock rather than `Time.now` in order to avoid problems with the wall clock potentially moving forwards or backwards for corrections that would throw off calculation of elapsed time. https://blog.dnsimple.com/2018/03/elapsed-time-with-ruby-the-right-way/
This extracts the stateful duration attributes from the Span class into a Duration class that is able to calculate duration for itself using the wall clock if times are explicitly set, or a monotonic clock if no times are explicitly set. This was extracted partially because the logic in Span was getting messy, and partly because Rubocop was complaining that the Span class was too long.
|
||
def duration | ||
if @wall_clock_duration | ||
(@end_time - @start_time).to_f rescue 0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method can be simplified to
def duration
(@duration_end - @duration_start).to_f rescue 0.0
end
since we already do the work behind the scenes to update duration_(start|end)
to represent wall clock if when flag is toggled.
For some reason this is causing specific tests to return negative times, need to figure out whether this is an edge case or just an issue specific to tests. either way we should never return a negative time here so this requires further investigation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I recognize now that there is a very tiny breaking change here, in that we've removed the ability to set start_time
multiple times. This was popping up in the case of our grape tests because grape has some very old code which both creates/starts a span (tracer.trace) and then later manually sets start_time
to what active support says was the start time. By enforcing start_time
as when the span was created from tracer.trace, without explicitly supplying the active support start
start_time
option to tracer.trace, we end up using a start time of Time.now
which, is later than the active support notifications finish
time.
I'd argue this is appropriate, I've updated the grape instrumentation to correctly pass in start_time
to tracer.trace
when creating the span. Perhaps we should have a safeguard that returns a duration of 0 when Span.duration
is negative, although i'm relatively sure this gets accounted for at the agent level anyway
This should overall improve users timing accuracy, but I do have some concerns that a user with some complex "incorrect" async manual instrumentation may see their timing change as a result. I'm not sure we should be worrying about this case since in this situation they've been using an undocumented approach anyway, but it's something to be aware of.
@marcotc what do you think here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How easy is it to mess up that timing again? What if someone in their manual instrumentation sets start_time
after the span has started?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if someone in their manual instrumentation sets start_time after the span has started?
calling span.start_time =
or span.start(<start_time>)
after the span has started will no-op. start_time
can effectively be set only once.
I think this makes it harder to mess up the timing and is a more correct way to handle this case, but to be clear, this is a change in behavior. Previously you could modify start_time
as many times as you wanted. If we want to maintain the "set start_time as much as you want behavior" I think that's technically possible, but my two cents is that it makes more sense to enforce setting start_time once while we still can (pre-1.0).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, we might not be able to make that change then.
being able to dynamically change the start/end time might be necessary, and given it is possible, it means a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes more sense to enforce setting start_time once while we still can (pre-1.0)
I actually think the opposite, if we make this change pre-1.0 then we might unknowingly break people's tracing since they might today rely on the ability to set more than once. so silently we will break their instrumentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 fair enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm more-aligned with @ericmustin 's perspective BOTH that the behavior of changing a start time is not-reasonable and arguably undocumented and therefor unsupported (though it did apparently trip the Grape test(s)). But also that pre-1.0 this is an easier change to make. Once you're 1.0, ANY change like this will cause major version upgrades going by SemVer.
Argh, I appreciate the difficulty in ANY behavior changes with a library like this, but... I seriously wonder if it's essentially hiding errors as-is which, honestly, could result in DD support tickets.
EDIT: I guess I'd add.. are there really good cases for needing to do this? I couldn't think of any, but that doesn't mean there aren't. Wish there was a way to find out who (and how many) might be doing this "multi-start" and why. Still scratching my head about that, because it really seems like it would be an error!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the perspective and feedback here.
Broadly speaking, I'd like to be in a place where this behavior is not possible. Marking the start_time
and telling the span to start
should be tightly coupled, imo. Telling the span to start kicks off a number of actions on top of timing, but I think timing is the most important. Intuitively it just makes sense, but there are also some assumptions made by other components of the tracers, like Runtime Metric collection and Log/Trace Correlation, where we need confidence that 'the span we say is active right now' does not later have it's timing details changed.
That being said, given that our own instrumentation had this antipattern (in a very minor edge case, but it was present nonetheless), it's reasonable to assume that users of Manual instrumentation, especially those doing instrumentation of async workloads, have accidentally put themselves in the same situation. We should work from the assumption that changing the behavior will change users instrumentation unexpectedly. There are a number of times similar situations have occcured in the past and unless it is introducing clear perf or reliability issues, we do our best to maintain compatibility, or provide a long runway of deprecation.
After syncing with @marcotc internally, I think we've found a reasonable middle ground. I think we want to leverage the chance to move to 1.0 to push through these sort of changes. There's an open issue on the subject here: #1164 . Internally we're working to compile a list of breaking changes that will both improve the tracer in the long term, and that we should group in the 1.0 release. Making it part of the move to 1.0 will help set more clear messaging and expectation management for users, since upgrading to 1.0 will be an explicit choice, which gives us the highest likelihood for them to read any release note warnings/instructions/breaking changes/etc. And most importantly, doing it after 1.0 will be considerably more difficult.
Does that help give context here? tl;dr i think it's something we want to do but probably will decouple from this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel better that this is on the list "for 1.0" rather than "post-1.0". I appreciate the good stewardship!
at this point we need to write a test for the span behavior so that there's not a regression in the future. Other than that this should be good to go. One benefit here is we're also cleaning up a very unusual edge case where a user could set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks great to me. I'm happy.
benchmarks for context...we're making a few extra calls in span but don't see any real bump in memory/timing/gc for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking this to the finish line, @ericmustin. 🎉
And thank you, @soulcutter, for doing all the groundwork and still coming back years later to help with code review!
👋 @soulcutter This has been released in 0.41.0 https://github.com/DataDog/dd-trace-rb/releases/tag/v0.41.0, thanks again for the contribution! |
Builds on top of #424 and updates with current state of dd-trace-rb, attempts to remove dependancy on concurrent-ruby.
WiP as we'd still like to improve perf here, and ensure compatibility with all supported rubies