Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Monotonic clock #1173

Merged
merged 17 commits into from Sep 17, 2020
Merged

Monotonic clock #1173

merged 17 commits into from Sep 17, 2020

Conversation

ericmustin
Copy link
Contributor

Builds on top of #424 and updates with current state of dd-trace-rb, attempts to remove dependancy on concurrent-ruby.

WiP as we'd still like to improve perf here, and ensure compatibility with all supported rubies

soulcutter and others added 6 commits May 10, 2018 16:18
When measuring durations, it's preferable to use the system monotonic
clock rather than `Time.now` in order to avoid problems with the wall
clock potentially moving forwards or backwards for corrections that
would throw off calculation of elapsed time.

https://blog.dnsimple.com/2018/03/elapsed-time-with-ruby-the-right-way/
This extracts the stateful duration attributes from the Span class into
a Duration class that is able to calculate duration for itself using the
wall clock if times are explicitly set, or a monotonic clock if no times
are explicitly set.

This was extracted partially because the logic in Span was getting
messy, and partly because Rubocop was complaining that the Span class
was too long.
@ericmustin ericmustin requested a review from a team September 14, 2020 10:31
lib/ddtrace/span/duration.rb Outdated Show resolved Hide resolved
lib/ddtrace/span/duration.rb Outdated Show resolved Hide resolved

def duration
if @wall_clock_duration
(@end_time - @start_time).to_f rescue 0.0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method can be simplified to

    def duration
      (@duration_end - @duration_start).to_f rescue 0.0
    end

since we already do the work behind the scenes to update duration_(start|end) to represent wall clock if when flag is toggled.

For some reason this is causing specific tests to return negative times, need to figure out whether this is an edge case or just an issue specific to tests. either way we should never return a negative time here so this requires further investigation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I recognize now that there is a very tiny breaking change here, in that we've removed the ability to set start_time multiple times. This was popping up in the case of our grape tests because grape has some very old code which both creates/starts a span (tracer.trace) and then later manually sets start_time to what active support says was the start time. By enforcing start_time as when the span was created from tracer.trace, without explicitly supplying the active support start start_time option to tracer.trace, we end up using a start time of Time.now which, is later than the active support notifications finish time.

I'd argue this is appropriate, I've updated the grape instrumentation to correctly pass in start_time to tracer.trace when creating the span. Perhaps we should have a safeguard that returns a duration of 0 when Span.duration is negative, although i'm relatively sure this gets accounted for at the agent level anyway

This should overall improve users timing accuracy, but I do have some concerns that a user with some complex "incorrect" async manual instrumentation may see their timing change as a result. I'm not sure we should be worrying about this case since in this situation they've been using an undocumented approach anyway, but it's something to be aware of.

@marcotc what do you think here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How easy is it to mess up that timing again? What if someone in their manual instrumentation sets start_time after the span has started?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if someone in their manual instrumentation sets start_time after the span has started?

calling span.start_time = or span.start(<start_time>) after the span has started will no-op. start_time can effectively be set only once.

I think this makes it harder to mess up the timing and is a more correct way to handle this case, but to be clear, this is a change in behavior. Previously you could modify start_time as many times as you wanted. If we want to maintain the "set start_time as much as you want behavior" I think that's technically possible, but my two cents is that it makes more sense to enforce setting start_time once while we still can (pre-1.0).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, we might not be able to make that change then.

being able to dynamically change the start/end time might be necessary, and given it is possible, it means a breaking change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes more sense to enforce setting start_time once while we still can (pre-1.0)

I actually think the opposite, if we make this change pre-1.0 then we might unknowingly break people's tracing since they might today rely on the ability to set more than once. so silently we will break their instrumentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 fair enough

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm more-aligned with @ericmustin 's perspective BOTH that the behavior of changing a start time is not-reasonable and arguably undocumented and therefor unsupported (though it did apparently trip the Grape test(s)). But also that pre-1.0 this is an easier change to make. Once you're 1.0, ANY change like this will cause major version upgrades going by SemVer.

Argh, I appreciate the difficulty in ANY behavior changes with a library like this, but... I seriously wonder if it's essentially hiding errors as-is which, honestly, could result in DD support tickets.

EDIT: I guess I'd add.. are there really good cases for needing to do this? I couldn't think of any, but that doesn't mean there aren't. Wish there was a way to find out who (and how many) might be doing this "multi-start" and why. Still scratching my head about that, because it really seems like it would be an error!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the perspective and feedback here.

Broadly speaking, I'd like to be in a place where this behavior is not possible. Marking the start_time and telling the span to start should be tightly coupled, imo. Telling the span to start kicks off a number of actions on top of timing, but I think timing is the most important. Intuitively it just makes sense, but there are also some assumptions made by other components of the tracers, like Runtime Metric collection and Log/Trace Correlation, where we need confidence that 'the span we say is active right now' does not later have it's timing details changed.

That being said, given that our own instrumentation had this antipattern (in a very minor edge case, but it was present nonetheless), it's reasonable to assume that users of Manual instrumentation, especially those doing instrumentation of async workloads, have accidentally put themselves in the same situation. We should work from the assumption that changing the behavior will change users instrumentation unexpectedly. There are a number of times similar situations have occcured in the past and unless it is introducing clear perf or reliability issues, we do our best to maintain compatibility, or provide a long runway of deprecation.

After syncing with @marcotc internally, I think we've found a reasonable middle ground. I think we want to leverage the chance to move to 1.0 to push through these sort of changes. There's an open issue on the subject here: #1164 . Internally we're working to compile a list of breaking changes that will both improve the tracer in the long term, and that we should group in the 1.0 release. Making it part of the move to 1.0 will help set more clear messaging and expectation management for users, since upgrading to 1.0 will be an explicit choice, which gives us the highest likelihood for them to read any release note warnings/instructions/breaking changes/etc. And most importantly, doing it after 1.0 will be considerably more difficult.

Does that help give context here? tl;dr i think it's something we want to do but probably will decouple from this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel better that this is on the list "for 1.0" rather than "post-1.0". I appreciate the good stewardship!

@ericmustin ericmustin changed the title [WIP] monotonic clock Monotonic clock Sep 15, 2020
lib/ddtrace/span.rb Show resolved Hide resolved
lib/ddtrace/span.rb Outdated Show resolved Hide resolved
lib/ddtrace/span.rb Show resolved Hide resolved
@ericmustin
Copy link
Contributor Author

at this point we need to write a test for the span behavior so that there's not a regression in the future. Other than that this should be good to go.

One benefit here is we're also cleaning up a very unusual edge case where a user could set span.end_time and prevent their span from ever getting finished/flushed, which could potentially cause memory bloat if done frequently enough. That's no longer possible now, we explicitly finish spans when setting span.end_time

Copy link
Contributor

@soulcutter soulcutter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks great to me. I'm happy.

@ericmustin
Copy link
Contributor Author

benchmarks for context...we're making a few extra calls in span but don't see any real bump in memory/timing/gc for the nested traces microbench vs master
benchmark_master_09162020.txt
benchmark_soulcutter-monotonic_clock.txt

Copy link
Member

@marcotc marcotc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this to the finish line, @ericmustin. 🎉
And thank you, @soulcutter, for doing all the groundwork and still coming back years later to help with code review!

@ericmustin ericmustin merged commit d08c600 into master Sep 17, 2020
@marcotc marcotc added this to the 0.41.0 milestone Sep 30, 2020
@ericmustin
Copy link
Contributor Author

👋 @soulcutter This has been released in 0.41.0 https://github.com/DataDog/dd-trace-rb/releases/tag/v0.41.0, thanks again for the contribution!

@ivoanjo ivoanjo deleted the soulcutter-monotonic_clock branch July 16, 2021 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants