Monotonic clock #1173

ericmustin · 2020-09-14T10:31:18Z

Builds on top of #424 and updates with current state of dd-trace-rb, attempts to remove dependancy on concurrent-ruby.

WiP as we'd still like to improve perf here, and ensure compatibility with all supported rubies

When measuring durations, it's preferable to use the system monotonic clock rather than `Time.now` in order to avoid problems with the wall clock potentially moving forwards or backwards for corrections that would throw off calculation of elapsed time. https://blog.dnsimple.com/2018/03/elapsed-time-with-ruby-the-right-way/

…bies

This extracts the stateful duration attributes from the Span class into a Duration class that is able to calculate duration for itself using the wall clock if times are explicitly set, or a monotonic clock if no times are explicitly set. This was extracted partially because the logic in Span was getting messy, and partly because Rubocop was complaining that the Span class was too long.

lib/ddtrace/span/duration.rb

lib/ddtrace/span.rb

ericmustin · 2020-09-15T09:11:44Z

lib/ddtrace/span.rb

+
+    def duration
+      if @wall_clock_duration
+        (@end_time - @start_time).to_f rescue 0.0


this method can be simplified to

def duration (@duration_end - @duration_start).to_f rescue 0.0 end

since we already do the work behind the scenes to update duration_(start|end) to represent wall clock if when flag is toggled.

For some reason this is causing specific tests to return negative times, need to figure out whether this is an edge case or just an issue specific to tests. either way we should never return a negative time here so this requires further investigation

Ok, I recognize now that there is a very tiny breaking change here, in that we've removed the ability to set start_time multiple times. This was popping up in the case of our grape tests because grape has some very old code which both creates/starts a span (tracer.trace) and then later manually sets start_time to what active support says was the start time. By enforcing start_time as when the span was created from tracer.trace, without explicitly supplying the active support start start_time option to tracer.trace, we end up using a start time of Time.now which, is later than the active support notifications finish time.

I'd argue this is appropriate, I've updated the grape instrumentation to correctly pass in start_time to tracer.trace when creating the span. Perhaps we should have a safeguard that returns a duration of 0 when Span.duration is negative, although i'm relatively sure this gets accounted for at the agent level anyway

This should overall improve users timing accuracy, but I do have some concerns that a user with some complex "incorrect" async manual instrumentation may see their timing change as a result. I'm not sure we should be worrying about this case since in this situation they've been using an undocumented approach anyway, but it's something to be aware of.

@marcotc what do you think here?

How easy is it to mess up that timing again? What if someone in their manual instrumentation sets start_time after the span has started?

What if someone in their manual instrumentation sets start_time after the span has started?

calling span.start_time = or span.start(<start_time>) after the span has started will no-op. start_time can effectively be set only once.

I think this makes it harder to mess up the timing and is a more correct way to handle this case, but to be clear, this is a change in behavior. Previously you could modify start_time as many times as you wanted. If we want to maintain the "set start_time as much as you want behavior" I think that's technically possible, but my two cents is that it makes more sense to enforce setting start_time once while we still can (pre-1.0).

hmm, we might not be able to make that change then.

being able to dynamically change the start/end time might be necessary, and given it is possible, it means a breaking change.

makes more sense to enforce setting start_time once while we still can (pre-1.0)

I actually think the opposite, if we make this change pre-1.0 then we might unknowingly break people's tracing since they might today rely on the ability to set more than once. so silently we will break their instrumentation.

👍 fair enough

I'm more-aligned with @ericmustin 's perspective BOTH that the behavior of changing a start time is not-reasonable and arguably undocumented and therefor unsupported (though it did apparently trip the Grape test(s)). But also that pre-1.0 this is an easier change to make. Once you're 1.0, ANY change like this will cause major version upgrades going by SemVer.

Argh, I appreciate the difficulty in ANY behavior changes with a library like this, but... I seriously wonder if it's essentially hiding errors as-is which, honestly, could result in DD support tickets.

EDIT: I guess I'd add.. are there really good cases for needing to do this? I couldn't think of any, but that doesn't mean there aren't. Wish there was a way to find out who (and how many) might be doing this "multi-start" and why. Still scratching my head about that, because it really seems like it would be an error!

Thanks for the perspective and feedback here.

Broadly speaking, I'd like to be in a place where this behavior is not possible. Marking the start_time and telling the span to start should be tightly coupled, imo. Telling the span to start kicks off a number of actions on top of timing, but I think timing is the most important. Intuitively it just makes sense, but there are also some assumptions made by other components of the tracers, like Runtime Metric collection and Log/Trace Correlation, where we need confidence that 'the span we say is active right now' does not later have it's timing details changed.

That being said, given that our own instrumentation had this antipattern (in a very minor edge case, but it was present nonetheless), it's reasonable to assume that users of Manual instrumentation, especially those doing instrumentation of async workloads, have accidentally put themselves in the same situation. We should work from the assumption that changing the behavior will change users instrumentation unexpectedly. There are a number of times similar situations have occcured in the past and unless it is introducing clear perf or reliability issues, we do our best to maintain compatibility, or provide a long runway of deprecation.

After syncing with @marcotc internally, I think we've found a reasonable middle ground. I think we want to leverage the chance to move to 1.0 to push through these sort of changes. There's an open issue on the subject here: #1164 . Internally we're working to compile a list of breaking changes that will both improve the tracer in the long term, and that we should group in the 1.0 release. Making it part of the move to 1.0 will help set more clear messaging and expectation management for users, since upgrading to 1.0 will be an explicit choice, which gives us the highest likelihood for them to read any release note warnings/instructions/breaking changes/etc. And most importantly, doing it after 1.0 will be considerably more difficult.

Does that help give context here? tl;dr i think it's something we want to do but probably will decouple from this PR.

I feel better that this is on the list "for 1.0" rather than "post-1.0". I appreciate the good stewardship!

lib/ddtrace/span.rb

ericmustin · 2020-09-15T21:49:30Z

at this point we need to write a test for the span behavior so that there's not a regression in the future. Other than that this should be good to go.

One benefit here is we're also cleaning up a very unusual edge case where a user could set span.end_time and prevent their span from ever getting finished/flushed, which could potentially cause memory bloat if done frequently enough. That's no longer possible now, we explicitly finish spans when setting span.end_time

soulcutter

Overall this looks great to me. I'm happy.

ericmustin · 2020-09-16T14:45:42Z

benchmarks for context...we're making a few extra calls in span but don't see any real bump in memory/timing/gc for the nested traces microbench vs master
benchmark_master_09162020.txt
benchmark_soulcutter-monotonic_clock.txt

marcotc

Thanks for taking this to the finish line, @ericmustin. 🎉
And thank you, @soulcutter, for doing all the groundwork and still coming back years later to help with code review!

ericmustin · 2020-09-30T15:35:15Z

👋 @soulcutter This has been released in 0.41.0 https://github.com/DataDog/dd-trace-rb/releases/tag/v0.41.0, thanks again for the contribution!

soulcutter and others added 6 commits May 10, 2018 16:18

Placate Rubocop and add some method comments

259da18

Use concurrent-ruby's monotonic clock for compatibility with older ru…

c44e391

…bies

bring monotonic clock work up to date with master

f9b295d

replace concurrent ruby timing with utils

2a3962a

ericmustin requested a review from a team September 14, 2020 10:31

rubocop fixes for monotonic clock work

c058e72

brettlangdon reviewed Sep 14, 2020

View reviewed changes

lib/ddtrace/span/duration.rb Outdated Show resolved Hide resolved

lib/ddtrace/span/duration.rb Outdated Show resolved Hide resolved

ericmustin commented Sep 14, 2020

View reviewed changes

lib/ddtrace/span/duration.rb Outdated Show resolved Hide resolved

ericmustin added 2 commits September 14, 2020 16:19

ensure duration is calculated using wall clock or monotonic depending

ff61e4e

migrate duration class into span

64b094c

brettlangdon reviewed Sep 14, 2020

View reviewed changes

lib/ddtrace/span.rb Show resolved Hide resolved

ericmustin added 2 commits September 14, 2020 23:27

remove references to duration class

5c252c8

update duration to do correct math

8dba2be

ericmustin commented Sep 15, 2020

View reviewed changes

make sure we start grape spans with accurate timing

fab7f8f

ericmustin changed the title ~~[WIP] monotonic clock~~ Monotonic clock Sep 15, 2020

ericmustin added 2 commits September 15, 2020 16:03

simplify duration tracking

201d7ca

rubocop cleaning duration work

f0ad99c

brettlangdon reviewed Sep 15, 2020

View reviewed changes

lib/ddtrace/span.rb Show resolved Hide resolved

lib/ddtrace/span.rb Outdated Show resolved Hide resolved

lib/ddtrace/span.rb Show resolved Hide resolved

dry up helper methods on span and document code internally

3f78071

ericmustin added 2 commits September 16, 2020 00:35

add tests for monotonic vs wall time

5498343

skip tests on 2.0 since monotonic time is not supported

60877c9

soulcutter approved these changes Sep 16, 2020

View reviewed changes

marcotc approved these changes Sep 16, 2020

View reviewed changes

ericmustin merged commit d08c600 into master Sep 17, 2020

marcotc added this to the 0.41.0 milestone Sep 30, 2020

ivoanjo deleted the soulcutter-monotonic_clock branch July 16, 2021 09:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Monotonic clock #1173

Monotonic clock #1173

ericmustin commented Sep 14, 2020

ericmustin Sep 15, 2020

ericmustin Sep 15, 2020 •

edited

brettlangdon Sep 15, 2020

ericmustin Sep 15, 2020 •

edited

brettlangdon Sep 15, 2020

brettlangdon Sep 15, 2020

ericmustin Sep 15, 2020

soulcutter Sep 16, 2020 •

edited

ericmustin Sep 16, 2020

soulcutter Sep 16, 2020

ericmustin commented Sep 15, 2020

soulcutter left a comment

ericmustin commented Sep 16, 2020

marcotc left a comment

ericmustin commented Sep 30, 2020

Monotonic clock #1173

Monotonic clock #1173

Conversation

ericmustin commented Sep 14, 2020

ericmustin Sep 15, 2020

Choose a reason for hiding this comment

ericmustin Sep 15, 2020 • edited

Choose a reason for hiding this comment

brettlangdon Sep 15, 2020

Choose a reason for hiding this comment

ericmustin Sep 15, 2020 • edited

Choose a reason for hiding this comment

brettlangdon Sep 15, 2020

Choose a reason for hiding this comment

brettlangdon Sep 15, 2020

Choose a reason for hiding this comment

ericmustin Sep 15, 2020

Choose a reason for hiding this comment

soulcutter Sep 16, 2020 • edited

Choose a reason for hiding this comment

ericmustin Sep 16, 2020

Choose a reason for hiding this comment

soulcutter Sep 16, 2020

Choose a reason for hiding this comment

ericmustin commented Sep 15, 2020

soulcutter left a comment

Choose a reason for hiding this comment

ericmustin commented Sep 16, 2020

marcotc left a comment

Choose a reason for hiding this comment

ericmustin commented Sep 30, 2020

ericmustin Sep 15, 2020 •

edited

ericmustin Sep 15, 2020 •

edited

soulcutter Sep 16, 2020 •

edited