New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ethon & Typhoeus tracing support #778
Conversation
22394f8
to
2f94ed6
Compare
Fixed ruby 1.9 support & rebased to get rid of failed mongo tests |
@al-kudryavtsev The pull request is looking quite filled out! Is this ready for review? Also, since I'm not familiar with Typhoeus/Ethon, could you explain the relationship between the two? What are the responsibilities of each library? Would a user ever want to activate Typhoeus tracing without Ethon or vice versa? I saw Typhoeus added as a dependency, and Ethon getting some instrumentation, but I didn't see any instrumentation explicitly defined for Typhoeus, and it wasn't clear what this means for Typhoeus support. |
@delner Yes, the PR is ready for review, thanks! Ethon is a Initially I wanted to support Typhoeus because that's what we use for HTTP requests. The problem is that with extra features it is actually harder to implement proper & reliable instrumentation, especially with multiple parallel asynchronous requests. At the same time, Ethon's Easy object is a good candidate for tracking asynchronous requests because it is used when request starts and when response is ready. What I did instead is instrumenting Ethon. This way HTTP requests coming through Typhoeus are fully supported and other use cases of Ethon are also supported (or support can be added relatively easy). I added Typhoeus as a dependency in order to unit test the instrumentation with Typhoeus primitives. So, to answer your question, if the user wants to instrument HTTP requests made with Typhoeus, they will need to enable Ethon tracing. If they need to trace some other features of Typhoeus (like caching of responses), they will need a different instrumentation. Also I'm not an author of any of these libraries, so it is possible I'm missing something and there is an easier way. There are hooks in both libraries, however it doesn't seem to be a reliable approach to tracing. Please let me know what you think. |
@al-kudryavtsev This is a super helpful description, thank you! Sounds like it'll be good then to have this Ethon instrumentation either way, only major question to me outstanding is: "does this constitute 'Typhoeus' support?" I don't think I have this answer; I asked the folks on #527 what they think. If it does, then we somehow have to make it obvious/intuitive to users who want Typhoeus traces to activate the Ethon instrumentation. This might require us to add something like If it doesn't, I will happily accept this PR as Ethon instrumentation on its own merits, but we won't be able to close #527, and as a community we'll need to find a better way to give the proper feature coverage to Typhoeus at a later date. In any case, I'll give this a review a bit later! |
@delner Did you have a chance to look at the PR by any chance? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this is looking very good; the tests are thorough, and pretty much all the boxes on features appear to be checked. :)
My feedback primarily is about developing my understanding of how the instrumentation works in the Ethon framework, and if there's any way to simplify it (e.g. avoiding use of state/instance variables, etc), and a little bit about reorganizing the tests into smaller groupings.
I think once I can get my understanding to "click" at bit more and we make a few tweaks, we should be able to get this approved.
module InstanceMethods | ||
include InstanceMethodsCompatibility unless Gem::Version.new(RUBY_VERSION) >= Gem::Version.new('2.0.0') | ||
|
||
def http_request(url, action_name, options = {}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you walk me through the flow of a request here a bit? Order the of the functions being called, sync vs async requests, etc.
I see a number of methods being patched here, each collecting and storing some information as instance variables. I'd like to understand how the state is built. Is this information not available on perform
without first collecting in between all these methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sure! I'll start with sync requests.
To make a sync request, the client uses Easy
object. First it is created, and as a part of initialization it calls the set_attributes
method which will in turn call headers=
, but only if headers are provided. The client can also set the headers directly using headers=
method. The problem is that as soon as headers are set they are converted to FFI pointer, that's why I store their original version. And we don't have a span yet to inject the tracing headers.
http_request
method is used as a helper to populate the easy
with necessary information for HTTP request. In particular, HTTP method is passed (for example, easy.http_request("www.example.com", :post, { params: { a: 1 }, body: { b: 2 } })
). The problem is that this data is not preserved on easy
in a way that can be used to recover the HTTP method easily. The factories set various low-level attributes (https://github.com/typhoeus/ethon/tree/master/lib/ethon/easy/http), and libcurl
itself figures out HTTP method based on all these attributes. So I figured it is easier to just store the method instead of trying to recover it from libcurl
attributes.
The sync way to execute an easy
is calling perform
. I patch the perform
to create the span and to inject the headers if needed (that's why I stored the original headers on the object, to avoid reading them back from FFI). perform
uses libcurl
's easy_perform
method and then calls the complete
method on easy
that I patch to finish the span.
Async requests are executed using Multi
. Individual requests are represented by the same Easy
objects, however the execution flow is different. Easy
instance is added to the Multi
using add
method. I consider this to be the beginning of the request execution, that's why the span is created there. There is no easy way to know if it is the beginning of the request or if multi
is still on hold because multi
execution can be in progress in the other thread. I follow here the way Typhoeus
uses Multi
: it adds the easy objects to Multi
right before executing the perform
. It also has hooks for before-request event which are executed in the add
call.
The execution of multi
happens when the perform
method is called. This method uses libcurl
's multi_perform
and calls complete
on easy
objects which finished executing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay this is pretty interesting, and exactly the kind of explanation I was looking for. (Thank you!)
I can see why you'd want to store this info before it gets turned into a format that's hard to read from; as long these objects are 1-1 with requests, then this seems okay. Just have to be careful in any scenario in which an Easy
object is re-used or re-ran? Not sure if that happens, but some food for thought.
If Multi
uses multiple Easy
objects, would it make sense to add an additional parent span to the multi operation? Thinking about the case if users want to see the batch as an operation, in addition to its constituent parts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see why you'd want to store this info before it gets turned into a format that's hard to read from; as long these objects are 1-1 with requests, then this seems okay. Just have to be careful in any scenario in which an Easy object is re-used or re-ran? Not sure if that happens, but some food for thought.
That's a good point. I added extra safety net by patching the reset
method which is used before re-using easy instance. This way I can be sure that HTTP method or headers are not mistakenly passed in the next request.
def set_attributes(options) | ||
return super unless tracer_enabled? | ||
|
||
# Make sure headers= will get called |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for forcing headers=
to be called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to force the @datadog_original_headers
to be set as soon as possible, so that later on I can assume it is set. Now it seems to me that it's not essential & we can reduce the number of patched methods. I can store them in headers=
and then check if variable is set or not when injecting the headers.
handles = super(easy) | ||
return handles if handles.nil? || !tracer_enabled? | ||
|
||
easy.datadog_before_request |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not very familiar with the construction of the framework, but is Multi
composed of several Easy
instances? And when it executes, will it call Easy#perform
? If so, would that call datadog_before_request
making this redundant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a more detailed description that I put in a different comment; Multi
uses a different approach to execute Easy
objects that it holds, it won't call Easy#perform
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my response above, is there an equivalent #perform
method for Multi
? Might want to consider adding a parent span around the lifetime of this multi operation which bundles up the Easy
requests. Just a thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the parent span support. The logic is the following: it is created when the first easy
is added to multi
and it is finalized when multi#perform
has finished. Multi
can be reused - every time easy
is added it creates a parent span if it doesn't exist yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review! I left some comments & will work on the tweaks
module InstanceMethods | ||
include InstanceMethodsCompatibility unless Gem::Version.new(RUBY_VERSION) >= Gem::Version.new('2.0.0') | ||
|
||
def http_request(url, action_name, options = {}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sure! I'll start with sync requests.
To make a sync request, the client uses Easy
object. First it is created, and as a part of initialization it calls the set_attributes
method which will in turn call headers=
, but only if headers are provided. The client can also set the headers directly using headers=
method. The problem is that as soon as headers are set they are converted to FFI pointer, that's why I store their original version. And we don't have a span yet to inject the tracing headers.
http_request
method is used as a helper to populate the easy
with necessary information for HTTP request. In particular, HTTP method is passed (for example, easy.http_request("www.example.com", :post, { params: { a: 1 }, body: { b: 2 } })
). The problem is that this data is not preserved on easy
in a way that can be used to recover the HTTP method easily. The factories set various low-level attributes (https://github.com/typhoeus/ethon/tree/master/lib/ethon/easy/http), and libcurl
itself figures out HTTP method based on all these attributes. So I figured it is easier to just store the method instead of trying to recover it from libcurl
attributes.
The sync way to execute an easy
is calling perform
. I patch the perform
to create the span and to inject the headers if needed (that's why I stored the original headers on the object, to avoid reading them back from FFI). perform
uses libcurl
's easy_perform
method and then calls the complete
method on easy
that I patch to finish the span.
Async requests are executed using Multi
. Individual requests are represented by the same Easy
objects, however the execution flow is different. Easy
instance is added to the Multi
using add
method. I consider this to be the beginning of the request execution, that's why the span is created there. There is no easy way to know if it is the beginning of the request or if multi
is still on hold because multi
execution can be in progress in the other thread. I follow here the way Typhoeus
uses Multi
: it adds the easy objects to Multi
right before executing the perform
. It also has hooks for before-request event which are executed in the add
call.
The execution of multi
happens when the perform
method is called. This method uses libcurl
's multi_perform
and calls complete
on easy
objects which finished executing.
def set_attributes(options) | ||
return super unless tracer_enabled? | ||
|
||
# Make sure headers= will get called |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to force the @datadog_original_headers
to be set as soon as possible, so that later on I can assume it is set. Now it seems to me that it's not essential & we can reduce the number of patched methods. I can store them in headers=
and then check if variable is set or not when injecting the headers.
handles = super(easy) | ||
return handles if handles.nil? || !tracer_enabled? | ||
|
||
easy.datadog_before_request |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a more detailed description that I put in a different comment; Multi
uses a different approach to execute Easy
objects that it holds, it won't call Easy#perform
.
@delner I updated the PR - dropped ruby 1.9 support, removed |
@al-kudryavtsev Cool, I'll review this again soon. Can you also rebase this against |
def7d78
to
ae9d7f6
Compare
@delner 👍 Rebased |
@al-kudryavtsev This is looking great; very happy with the thoroughness here, and the tests! Just left one more suggestion regarding |
@delner Thanks! I made some changes to improve the correctness and added parent span support. |
@@ -19,19 +19,48 @@ def add(easy) | |||
handles = super(easy) | |||
return handles if handles.nil? || !tracer_enabled? | |||
|
|||
easy.datadog_before_request | |||
easy.datadog_before_request(parent_span: datadog_multi_span) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible there's a delay between #add
and #perform
? When the span is created, it's auto created with the start time. I'm thinking about the possibility where someone creates a Multi
request with Easy
children, but then delays a while before actually executing it; could that cause the timing to be recorded incorrectly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is possible, more precise solution would probably require storing some state on Multi
object. The complexity comes from the fact that extra Easy
object can be added to Multi
during the execution of perform
(in easy complete callback, this is how Typhoeus implementation works). Current solution will work for Typhoeus since it always adding easy
objects to multi
right before executing it.
The same issue as you described can happen with Easy
span because it is started when easy object is added to multi
(which is not necessarily executing yet).
We can probably solve the issue by storing some flag on multi
saying whether it is performing now or not.
Please let me know what you think. I'll be away from computer for one more week and will be able to work on this feature afterwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth it, as long as there isn't too much coupling between Easy
and Multi
instrumentation.
Is there any way to create the span at perform
time for Easy
? Even if that means storing other state on Easy
initialization beforehand? Given these objects sound idempotent anyway, some state would be permissible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turned out extra state is not needed. Easy
handles are stored on Multi
instances so I can go through them and start child spans. When adding easy
to multi
, I can check if there is a span on multi
which is an indication of multi
's perform
execution in progress.
@al-kudryavtsev I noticed the builds broke on Mongo: you'll probably want to rebase this PR onto the top of |
1d6c7d2
to
d5ff7f6
Compare
Wow, that screenshot looks awesome! I'll take another pass over the code, see if we can get this merged. |
@delner Did you have a chance to look at the code? |
@al-kudryavtsev Apologies, was on PTO. I'm going to try to wrap this one up for our next release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great, couple of minor changes I think, but nothing serious, should be quick. Once those are addressed we'll merge it.
|
||
# Store headers to call this method again when span is ready | ||
@datadog_original_headers = headers | ||
super headers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor but can you just call super
here?
def datadog_tag_request | ||
span = @datadog_span | ||
uri = URI.parse(url) | ||
method = defined?(@datadog_method) ? @datadog_method.to_s : '' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be instance_variable_defined?
?
span = @datadog_span | ||
uri = URI.parse(url) | ||
method = defined?(@datadog_method) ? @datadog_method.to_s : '' | ||
span.resource = "#{method} #{uri.path}".lstrip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only want to set the HTTP method here, e.g. GET
, but omit the path.
Sounds crazy, and I see the value of including the path. However, the resource is meant to be a GROUP BY
like key, which in practice requires values that meaningfully group traces on some user-defined dimension. HTTP paths often contain unique input (such as numeric IDs, API tokens), which tend to flatten these traces into groups too small to be meaningful. Hence, as a default behavior, we've been only setting the method as the resource for HTTP client integrations.
If the omission of path
as a default behavior is an issue for users, the concession we could offer is the option to customize their span and set their own resource using a callback. See HTTP as an example: https://github.com/DataDog/dd-trace-rb/blob/master/lib/ddtrace/contrib/http/instrumentation.rb#L80. Implementing something like this is optional though, not necessary to merge this PR.
Also, it looks like the method
can be nil
; if you remove the path, it looks like the resource could then be an empty string, which will cause the trace to be malformed and dropped. We'll want to make sure resource is never blank, even if we have to give it some kind of default value instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, I was wondering why the resource contains only method name in other instrumentations. I changed the default value of method
to N/A
& handled the case when it is nil
.
let(:configuration_options) { { tracer: tracer } } | ||
|
||
before do | ||
Datadog::Contrib::Ethon::Patcher.patch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be necessary: c.use
will call patch
automatically for the integration, and I think we want to verify that users don't need to manually patch.
let(:configuration_options) { { tracer: tracer } } | ||
|
||
before do | ||
Datadog::Contrib::Ethon::Patcher.patch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
@delner Thanks for the review! I addressed the comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good! We'll merge this for our next version. Thanks a ton for this contribution @al-kudryavtsev! 🎉
Goal
Support distributed tracing when making requests through
Ethon
library (includingTyphoeus
requests which useEthon
)The related issue is #527
Implementation details
Links to the libraries:
https://github.com/typhoeus/ethon
https://github.com/typhoeus/typhoeus
The implementation was initially based on
rest_client
integration, but I had to change the approach becauseEthon
is a wrapper aroundlibcurl
which can execute many requests concurrently. I had to patch quite a few methods onEasy
to support more use cases. There are different examples of the requests in the spec (made through Ethon'sEasy
&Multi
and through Typhoeus'sRequest
&Hydra
).This code has been in production for a few weeks, didn't see any issues. However we don't use concurrent requests (done through
Multi
orHydra
) so relying on unit tests for other use cases.Ethon
also supports non-http requests, this PR does not address these use cases.Here is a screenshot of distributed tracing achieved through new integration