
add experimental tracing using a native event collector #4250

Open · wants to merge 6 commits into master from collector-beta
Conversation

@rochdev (Member) commented Apr 20, 2024

What does this PR do?

Add experimental tracing using a native event collector that stores and processes traces outside of the application.

Motivation

This was previously just a proof of concept. The goal of this PR is to have at least something on the main branch, with proper CI tests, so that we stay compatible with the use of events.

Additional Notes


github-actions bot commented Apr 20, 2024

Overall package size

Self size: 6.48 MB
Deduped: 59.19 MB
No deduping: 59.47 MB

Dependency sizes

| name | version | self size | total size |
| --- | --- | --- | --- |
| @datadog/native-iast-taint-tracking | 2.1.0 | 14.91 MB | 14.92 MB |
| @datadog/native-appsec | 7.1.1 | 14.39 MB | 14.4 MB |
| @datadog/pprof | 5.2.0 | 8.84 MB | 9.21 MB |
| protobufjs | 7.2.5 | 2.77 MB | 6.56 MB |
| @datadog/native-iast-rewriter | 2.3.0 | 2.15 MB | 2.24 MB |
| @opentelemetry/core | 1.14.0 | 872.87 kB | 1.47 MB |
| @datadog/native-metrics | 2.0.0 | 898.77 kB | 1.3 MB |
| @opentelemetry/api | 1.4.1 | 780.32 kB | 780.32 kB |
| import-in-the-middle | 1.7.4 | 70.19 kB | 739.86 kB |
| msgpack-lite | 0.1.26 | 201.16 kB | 281.59 kB |
| opentracing | 0.14.7 | 194.81 kB | 194.81 kB |
| semver | 7.5.4 | 93.4 kB | 123.8 kB |
| pprof-format | 2.1.0 | 111.69 kB | 111.69 kB |
| @datadog/sketches-js | 2.1.0 | 109.9 kB | 109.9 kB |
| lodash.sortby | 4.7.0 | 75.76 kB | 75.76 kB |
| lru-cache | 7.14.0 | 74.95 kB | 74.95 kB |
| ipaddr.js | 2.1.0 | 60.23 kB | 60.23 kB |
| ignore | 5.2.4 | 51.22 kB | 51.22 kB |
| int64-buffer | 0.1.10 | 49.18 kB | 49.18 kB |
| shell-quote | 1.8.1 | 44.96 kB | 44.96 kB |
| istanbul-lib-coverage | 3.2.0 | 29.34 kB | 29.34 kB |
| tlhunter-sorted-set | 0.1.0 | 24.94 kB | 24.94 kB |
| limiter | 1.1.5 | 23.17 kB | 23.17 kB |
| dc-polyfill | 0.1.4 | 23.1 kB | 23.1 kB |
| retry | 0.13.1 | 18.85 kB | 18.85 kB |
| node-abort-controller | 3.1.1 | 16.89 kB | 16.89 kB |
| jest-docblock | 29.7.0 | 8.99 kB | 12.76 kB |
| crypto-randomuuid | 1.0.0 | 11.18 kB | 11.18 kB |
| path-to-regexp | 0.1.7 | 6.78 kB | 6.78 kB |
| koalas | 1.0.2 | 6.47 kB | 6.47 kB |
| methods | 1.1.2 | 5.29 kB | 5.29 kB |
| module-details-from-path | 1.0.3 | 4.47 kB | 4.47 kB |

🤖 This report was automatically generated by heaviest-objects-in-the-universe

pr-commenter bot commented Apr 20, 2024

Benchmarks

Benchmark execution time: 2024-05-10 20:16:51

Comparing candidate commit 0ce5417 in PR branch collector-beta with baseline commit 52da415 in branch master.

Found 1 performance improvement and 0 performance regressions! Performance is the same for 259 metrics; 6 metrics are unstable.

scenario:plugin-graphql-with-depth-on-max-18

  • 🟩 max_rss_usage [-156.073MB; -89.875MB] or [-16.340%; -9.410%]


codecov bot commented Apr 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 45.19%. Comparing base (7000ae5) to head (12d68ac).
Report is 4 commits behind head on master.

❗ Current head 12d68ac differs from the pull request's most recent head 0ce5417. Consider uploading reports for commit 0ce5417 to get more accurate results.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #4250       +/-   ##
===========================================
- Coverage   69.19%   45.19%   -24.00%     
===========================================
  Files           1       89       +88     
  Lines         198     2518     +2320     
  Branches       33       33               
===========================================
+ Hits          137     1138     +1001     
- Misses         61     1380     +1319     


@rochdev rochdev marked this pull request as ready for review April 23, 2024 09:48
@rochdev rochdev requested review from a team as code owners April 23, 2024 09:48
@rochdev rochdev requested a review from wconti27 April 23, 2024 09:48
@simon-id (Member) left a comment:

Didn't do a super in-depth review (especially for encoder.js); a few nitpicks and questions, but LGTM.

@simon-id (Member) commented May 3, 2024

Comments closed, LGTM 👍

.filter(val => val)
.join(' ')

span.setTag(RESOURCE_NAME, resource)
Contributor:

Why is RESOURCE_NAME tag no longer added?

We're kinda-sorta using it here for establishing endpoint names to associate with profiles, although we can also compute the value from HTTP_METHOD and HTTP_ROUTE tags when RESOURCE_NAME isn't present.

@rochdev (Member, Author):

It's still set, but later in the pipeline. At this point the resource name cannot be trusted anyway, because we allow changing it up until the span is finished, which customers often do.

@rochdev (Member, Author):

Would it be possible to simply always use HTTP_METHOD and HTTP_ROUTE for profiling, since the real resource name would already be ignored anyway even with the current code? (It could be argued these can also be changed by a span hook, but at that point the profile would need to wait for the span to actually finish before looking at the resource name.)
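For illustration, a minimal sketch of that fallback (the tag keys and helper name here are assumptions, not the PR's code):

```js
// Hypothetical helper: derive an endpoint name for profiling when
// RESOURCE_NAME cannot be trusted yet, since user code may still
// overwrite it until the span finishes.
const HTTP_METHOD = 'http.method' // assumed tag key
const HTTP_ROUTE = 'http.route'   // assumed tag key

function endpointName (tags) {
  // Build "<METHOD> <route>", skipping whichever part is missing,
  // mirroring the .filter(val => val).join(' ') pattern in the diff above.
  return [tags[HTTP_METHOD], tags[HTTP_ROUTE]]
    .filter(val => val)
    .join(' ')
}

// endpointName({ 'http.method': 'GET', 'http.route': '/users/:id' })
// => 'GET /users/:id'
```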

@rochdev rochdev marked this pull request as draft May 7, 2024 22:44
@rochdev rochdev force-pushed the collector-beta branch 2 times, most recently from 39e4e9f to f44dbab Compare May 8, 2024 01:48
@rochdev rochdev marked this pull request as ready for review May 8, 2024 01:51
Comment on lines +34 to +36
const ticks = finishTime
? now() + finishTime - trace.ticks - trace.startTime
: now() - trace.ticks
Collaborator:

Can we do this on the collector side? Otherwise we need to keep the timing state in the library. I feel like it should be safe to just send a Date.now() in the start event and monotonic timestamps in each of the start and end, then compute the delta in the collector.

@rochdev (Member, Author):

That's basically what this is doing, except that it happens at the trace level instead of the span level. This is for efficiency, but also because doing it per span was causing intermittent slight drifts. It's also worth noting that the computation only happens when a finish time is provided; otherwise only the monotonic timestamp is sent.

Collaborator:

In both cases though you're holding the trace.ticks value. I'm suggesting we should try to avoid holding that in the library if at all possible. And I'm not sure why there would be drift coming from the same process on the same machine between monotonic timestamps that both came from that same process. 🤔

@rochdev (Member, Author):

This is also an optimization: for applications under high load that generate traces quickly, events only need to send <=32 bits for the ticks instead of a constant 64 bits.

It also gives the ticks an easier-to-reason-about floor, which is always 0 for the segment. However, since the number is arbitrary anyway, that doesn't actually matter too much.

Basically, it seemed to me that a single extra subtraction didn't add enough complexity to justify the alternative, which saves that subtraction but ends up sending a lot more data. With the current structure of events, the timestamp is otherwise basically the largest piece of data in an event. I even considered going with microseconds, since nanoseconds are rarely (read: never) useful for tracing data.
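For illustration, a rough sketch of the trace-level delta scheme discussed in this thread (the names and shapes are assumptions, not the PR's actual code): one wall-clock anchor and one monotonic anchor are held per trace segment, and each event carries only a small tick offset.

```js
const { performance } = require('perf_hooks')

// One anchor pair per trace segment (illustrative shape).
function startSegment () {
  return {
    startTime: Date.now(),   // wall clock, sent once with the start event
    ticks: performance.now() // monotonic anchor for the whole segment
  }
}

// Per-event offset: starts at 0 for the segment and typically fits in
// <= 32 bits, instead of a full 64-bit timestamp on every event.
function eventTicks (trace) {
  return performance.now() - trace.ticks
}

// The collector can then reconstruct absolute times as
// trace.startTime + ticks, without per-span timing state in the library.
```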

Collaborator:

But you're trading a potential extra 32 bits of data over the wire for more than that held in process memory for the duration of the span. That adds up quickly when there are potentially hundreds of thousands of concurrent spans. Process memory is a lot scarcer than bandwidth.

More concerning to me though is that we need a place to store any data like this which is held in-process for the span duration. I'm trying to get us away from storing the big bag of data that is SpanContext, but things like this further entangle it.

@rochdev (Member, Author):

It's actually the opposite. It trades the memory that would otherwise be held for every span (because the data is buffered) for a single integer held once for the whole segment. So this actually saves memory in-process, and it also has the benefit of reducing bandwidth. You could argue it increases CPU usage a tiny bit by adding a subtraction, but that's minimal enough to justify the trade-off.

Collaborator:

I'm not clear in what way this is reducing memory. We're still holding data in the SpanContext and therefore holding the SpanContext, are we not? So don't we still have the same problem with holding this big bag of data that we've always had? 😕


let segmentId = 0

class DatadogCollectorSpan extends DatadogSpan {
Collaborator:

This sub-classing is rather difficult to follow. Can we not just put all this logic in-place in DatadogSpan and use channel.hasSubscribers to skip it? There's a lot of circular dependency between this and the parent class calling methods on this that makes it challenging to understand the flow. I feel like it'd be a lot easier to understand if it was done in the parent class instead, or even with plain functions called from the parent class. 🤔
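For reference, a sketch of what that could look like with Node's diagnostics_channel (the channel name and payload are assumptions; dd-trace ships dc-polyfill for older runtimes):

```js
const dc = require('diagnostics_channel')

// Assumed channel name; the collector subscribes to it only when enabled.
const spanStartCh = dc.channel('datadog:collector:span:start')

class DatadogSpan {
  _initialize (startTime) {
    // Collector-specific work is skipped entirely when nothing listens,
    // with no sub-class and no circular parent/child calls.
    if (spanStartCh.hasSubscribers) {
      spanStartCh.publish({ span: this, startTime })
    }
  }
}
```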

@rochdev (Member, Author):

The idea was to isolate the logic in a separate class instead of mixing it with the existing logic. I could put all the code in the same place instead, but it seemed like that would be even more confusing. I also felt that keeping them separate makes it more obvious what is shared versus what isn't.

Collaborator:

It could be separate but as functions that are just called from the other. I feel like the flow would be a lot clearer that way. As a sub-class you have to do a lot of spelunking into the parent class to understand what is being called when. 😐

@rochdev (Member, Author):

The parent calling a protected method on the child is how most of the tracer works, since it comes from an OpenTracing background and OT does that extensively, so I don't think it should surprise anyone reading the code.

With that said, I can change it if you really want, but I'm not sure of the best approach then. I don't necessarily want to do it in the same file, because one of the ideas here was to not load this code at all when it's not needed. Not doing this is currently a huge problem in the library especially for serverless, and I didn't want to exacerbate the issue. What approach would you suggest with that in mind?

Collaborator:

Not going to block on this one, just wanted to point out it's really hard to follow so it would be ideal if we could make it more coherent somehow.

One possibility to enable only loading when you need it is multiple layers of diagnostics channel. You could have the existing class publish events with subscribers that do the transformations these functions are doing and then publish the result to those collector channels. Not sure that would be any clearer, but it would enable optional loading.

We could also consider some sort of dependency injection type thing where we register export strategies on the tracer and the span class could just loop over registered strategies and call each. Then the flow goes in a more singular direction rather than this confusing circular dependency.
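A rough sketch of that dependency-injection idea (all names hypothetical): the tracer holds a list of export strategies, the collector's strategy is registered only when enabled, and the span simply iterates over them, keeping the flow one-directional.

```js
// Hypothetical strategy registration; none of these names are from the PR.
class Tracer {
  constructor () {
    this._exportStrategies = []
  }

  // Called at startup only when a given exporter is enabled, so its code
  // is never loaded otherwise.
  registerExportStrategy (strategy) {
    this._exportStrategies.push(strategy)
  }
}

class DatadogSpan {
  constructor (tracer) {
    this._tracer = tracer
  }

  finish (finishTime) {
    // The span pushes to every registered strategy; no parent/child
    // method calls in either direction.
    for (const strategy of this._tracer._exportStrategies) {
      strategy.onSpanFinish(this, finishTime)
    }
  }
}
```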

@rochdev (Member, Author):

These sound like great ideas to explore, but in my opinion are out of the scope of this PR. Internals are already loaded in various ways that can end up being hard to follow, and standardizing on a way to do this would be more beneficial than introducing yet another way to load things for this PR specifically.

Collaborator:

I'm more concerned that piling on more complexity will make it even harder to untangle in the future whenever we get to cleaning that up. The present existence of tech debt doesn't justify making more. 😅

packages/dd-trace/src/collector/span.js (resolved thread)

this._timer = clearTimeout(this._timer)

collector.send_events(data)
Collaborator:

❓ Is collector.send_events sync? Do we need a done callback then? Can't we just return? Or is this done to preserve the flush signature shared with the other exporters?
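For context, a guess at the shape being questioned here (names are illustrative): if collector.send_events is synchronous, the done callback would exist purely to keep the flush signature aligned with the other exporters.

```js
// Stand-in for the native binding (assumption).
const collector = { send_events (events) {} }

class CollectorExporter {
  flush (done = () => {}) {
    const data = this._events.splice(0)

    // clearTimeout() returns undefined, so this also resets the handle.
    this._timer = clearTimeout(this._timer)

    collector.send_events(data) // assumed synchronous
    done() // nothing to wait for; kept only for signature compatibility
  }
}
```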
