
tracing: allow turning the opentelemetry layer on and off (off by default) #13002

Closed
Tracked by #13019
guswynn opened this issue Jun 8, 2022 · 6 comments
guswynn (Contributor) commented Jun 8, 2022

Otel traces for cloud instances are already eating up honeycomb limits. To prevent problems, and to let engineers turn tracing on only when it is actually needed, we should be able to toggle the opentelemetry layer dynamically at runtime.

Steps:

  • Wrap the otel layer in a reload::Layer<Option<[otel layer]>>, defaulting to None if a --dynamic-opentelemetry flag is on (see the sketch after this list)
    • this flag will be on in cloud
  • Add an HTTP endpoint that allows people to swap the layer in and out.
    • We can later also add a way to dynamically adjust the filter for this layer after the fact.
      • Note that this needs to be communicated to the sub-services as well; how should we do that?
    • decide how to gate this to engineers only in the future
  • Write docs on how engineers can use this endpoint on their instances to turn tracing on and off. Stress that tracing should only be on when it is needed.
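
A minimal sketch of the reload wiring described in the first step, assuming the tracing-subscriber crate's reload module. A plain fmt layer stands in for the real OpenTelemetry layer so the example is self-contained, and init_tracing, stand_in_otel_layer, and the dynamic_opentelemetry argument are illustrative names rather than the actual Materialize code:

```rust
use tracing_subscriber::{fmt, layer::SubscriberExt, reload, util::SubscriberInitExt, Registry};

// The real layer would be a tracing_opentelemetry::OpenTelemetryLayer built from
// the configured exporter; a plain fmt layer stands in so the sketch compiles on
// its own.
type StandInOtelLayer = fmt::Layer<Registry>;

fn stand_in_otel_layer() -> StandInOtelLayer {
    fmt::layer()
}

fn init_tracing(dynamic_opentelemetry: bool) -> reload::Handle<Option<StandInOtelLayer>, Registry> {
    // With the flag on, start as `None`: no trace data is exported until an
    // operator explicitly swaps a layer in through the reload handle.
    let initial: Option<StandInOtelLayer> = if dynamic_opentelemetry {
        None
    } else {
        Some(stand_in_otel_layer())
    };
    let (otel_layer, handle) = reload::Layer::new(initial);

    tracing_subscriber::registry()
        .with(otel_layer)
        .with(fmt::layer()) // ordinary log output stays on regardless
        .init();

    handle
}

fn main() {
    let handle = init_tracing(true);
    tracing::info!("emitted while the (stand-in) otel layer is off");

    // Later (e.g. from an internal HTTP endpoint) the layer can be toggled:
    handle
        .reload(Some(stand_in_otel_layer()))
        .expect("global subscriber still alive");
    tracing::info!("emitted while the layer is on");

    handle
        .reload(None::<StandInOtelLayer>)
        .expect("global subscriber still alive");
}
```

The key point is that the returned reload::Handle is the piece the HTTP endpoint in the second step would hold on to.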
guswynn added the C-refactoring (Category: replacing or reorganizing code) label and then removed it on Jun 8, 2022
benesch (Member) commented Jun 9, 2022

Do we need to make the reload layer configurable? I see that it says "adds a small amount of overhead", but if we're going to run with it on in cloud I think we should also run with it on locally so that trace performance isn't somehow better when running locally!

Adding the HTTP endpoint sounds good, though. Every service already has an internal HTTP server that's perfect for this. I think I would shy away from doing anything too fancy to propagate dynamic trace enablement across process boundaries. Instead we can push the complexity into a script in MaterializeInc/cloud that loops through all the pods in a namespace and frobs the trace-enablement endpoints on all of them.
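
A rough sketch of what such a toggle endpoint could look like, assuming an axum-based internal HTTP server and the reload handle from the earlier sketch; the routes, the OtelHandle alias, and the stand-in fmt layer are illustrative, not Materialize's actual internal API:

```rust
use std::sync::Arc;

use axum::{extract::Extension, routing::put, Router};
use tracing_subscriber::{fmt, reload, Registry};

// Same stand-in as in the earlier sketch; the real handle would wrap an
// Option of the OpenTelemetry layer instead.
type OtelHandle = reload::Handle<Option<fmt::Layer<Registry>>, Registry>;

async fn enable_tracing(Extension(handle): Extension<Arc<OtelHandle>>) -> &'static str {
    // The real implementation would build the OpenTelemetry layer from the
    // configured exporter here.
    let layer: fmt::Layer<Registry> = fmt::layer();
    match handle.reload(Some(layer)) {
        Ok(()) => "tracing enabled\n",
        Err(_) => "failed to enable tracing\n",
    }
}

async fn disable_tracing(Extension(handle): Extension<Arc<OtelHandle>>) -> &'static str {
    match handle.reload(None::<fmt::Layer<Registry>>) {
        Ok(()) => "tracing disabled\n",
        Err(_) => "failed to disable tracing\n",
    }
}

// Routes that could hang off each service's existing internal HTTP server.
fn internal_tracing_routes(handle: OtelHandle) -> Router {
    Router::new()
        .route("/internal/tracing/enable", put(enable_tracing))
        .route("/internal/tracing/disable", put(disable_tracing))
        .layer(Extension(Arc::new(handle)))
}
```

The cross-process story then stays simple: a cloud-side script just iterates over the pods in a namespace and PUTs to each pod's endpoint.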

pH14 (Contributor) commented Jun 9, 2022

Another option would be to push diagnostic knobs like telemetry settings, debug levels, etc. into a ConfigMap that is mounted into all Materialize services. If MZ occasionally reloads those values from disk, that gives us a way to declaratively state whether certain features are enabled or not.
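
A hedged sketch of this ConfigMap-style approach: a background task that periodically re-reads a mounted file and applies it through the reload handle from the earlier sketches. The file path, the true/false file format, and the 30-second polling interval are all assumptions made for illustration, not an agreed-upon design:

```rust
use std::time::Duration;

use tracing_subscriber::{fmt, reload, Registry};

// Same stand-in handle type as in the earlier sketches.
type OtelHandle = reload::Handle<Option<fmt::Layer<Registry>>, Registry>;

async fn watch_diagnostics_config(handle: OtelHandle) {
    let mut enabled = false;
    loop {
        // A mounted ConfigMap shows up as plain files; a "true"/"false" file
        // stands in for whatever declarative format is actually chosen.
        let want_enabled =
            tokio::fs::read_to_string("/etc/materialize/diagnostics/tracing-enabled")
                .await
                .map(|contents| contents.trim() == "true")
                .unwrap_or(false);

        if want_enabled != enabled {
            let new_layer: Option<fmt::Layer<Registry>> =
                if want_enabled { Some(fmt::layer()) } else { None };
            // Only record the new state if the subscriber is still alive.
            if handle.reload(new_layer).is_ok() {
                enabled = want_enabled;
            }
        }

        // Kubernetes refreshes mounted ConfigMaps eventually; poll on a
        // similar order of magnitude.
        tokio::time::sleep(Duration::from_secs(30)).await;
    }
}
```

A service would tokio::spawn a task like this at startup alongside its other background work.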

guswynn (Contributor, Author) commented Jun 9, 2022

@benesch yes, we can set it to None by default when there is no opentelemetry_endpoint!

aalexandrov (Contributor) commented Jun 9, 2022

Do we need to make the reload layer configurable?

I can imagine conditionally emitting the somewhat heavy traces of the various representations a query takes during the optimizer lifecycle, and a configurable reload layer is probably the way to do it.

I should be able to do that if the Reload layer also carries configuration similar to TracingCliArgs.

guswynn (Contributor, Author) commented Jun 10, 2022

Blocked on tokio-rs/tracing#2159

benesch (Member) commented Jul 11, 2022

Fixed by #13361.

benesch closed this as completed on Jul 11, 2022