
tracing: allow turning the opentelemetry layer on and off (off by default) #13002

Closed
Tracked by #13019
guswynn opened this issue Jun 8, 2022 · 6 comments
guswynn (Contributor) commented Jun 8, 2022

Otel traces for cloud instances are already eating up honeycomb limits. To prevent problems, and to let engineers turn tracing on only when it is actually needed, we should be able to toggle the opentelemetry layer dynamically at runtime.

Steps:

  • Wrap the otel layer in a reload::Layer<Option<[otel layer]>>, defaulting to None if a --dynamic-opentelemetry flag is on (see the sketch after this list)
    • this flag will be on in cloud
  • Add an HTTP endpoint that allows people to swap the layer in and out.
    • We can later also add a way to dynamically adjust the filter for this layer after the fact.
      • Note that this needs to be communicated to the sub-services as well; how should we do that?
    • decide how to gate this to engineers only in the future
  • Write docs on how engineers can use this endpoint on their instances to turn tracing on and off. Stress that tracing should only be on when it is needed.
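
A minimal sketch of the reload wiring described in the first step, assuming the tracing-subscriber crate's reload module. A plain fmt layer stands in for the real OpenTelemetry layer so the example is self-contained, and init_tracing, stand_in_otel_layer, and the dynamic_opentelemetry argument are illustrative names rather than the actual Materialize code:

```rust
use tracing_subscriber::{fmt, layer::SubscriberExt, reload, util::SubscriberInitExt, Registry};

// The real layer would be a tracing_opentelemetry::OpenTelemetryLayer built from
// the configured exporter; a plain fmt layer stands in so the sketch compiles on
// its own.
type StandInOtelLayer = fmt::Layer<Registry>;

fn stand_in_otel_layer() -> StandInOtelLayer {
    fmt::layer()
}

fn init_tracing(dynamic_opentelemetry: bool) -> reload::Handle<Option<StandInOtelLayer>, Registry> {
    // With the flag on, start as `None`: no trace data is exported until an
    // operator explicitly swaps a layer in through the reload handle.
    let initial: Option<StandInOtelLayer> = if dynamic_opentelemetry {
        None
    } else {
        Some(stand_in_otel_layer())
    };
    let (otel_layer, handle) = reload::Layer::new(initial);

    tracing_subscriber::registry()
        .with(otel_layer)
        .with(fmt::layer()) // ordinary log output stays on regardless
        .init();

    handle
}

fn main() {
    let handle = init_tracing(true);
    tracing::info!("emitted while the (stand-in) otel layer is off");

    // Later (e.g. from an internal HTTP endpoint) the layer can be toggled:
    handle
        .reload(Some(stand_in_otel_layer()))
        .expect("global subscriber still alive");
    tracing::info!("emitted while the layer is on");

    handle
        .reload(None::<StandInOtelLayer>)
        .expect("global subscriber still alive");
}
```

The key point is that the returned reload::Handle is the piece the HTTP endpoint in the second step would hold on to.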
guswynn added the C-refactoring (Category: replacing or reorganizing code) label and then removed it on Jun 8, 2022
benesch (Member) commented Jun 9, 2022

Do we need to make the reload layer configurable? I see that it says "adds a small amount of overhead", but if we're going to run with it on in cloud I think we should also run with it on locally so that trace performance isn't somehow better when running locally!

Adding the HTTP endpoint sounds good, though. Every service already has an internal HTTP server that's perfect for this. I think I would shy away from doing anything too fancy to propagate dynamic trace enablement across process boundaries. Instead we can push the complexity into a script in MaterializeInc/cloud that loops through all the pods in a namespace and frobs the trace-enablement endpoints on all of them.
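
A rough sketch of what such a toggle endpoint could look like, assuming an axum-based internal HTTP server and the reload handle from the earlier sketch; the routes, the OtelHandle alias, and the stand-in fmt layer are illustrative, not Materialize's actual internal API:

```rust
use std::sync::Arc;

use axum::{extract::Extension, routing::put, Router};
use tracing_subscriber::{fmt, reload, Registry};

// Same stand-in as in the earlier sketch; the real handle would wrap an
// Option of the OpenTelemetry layer instead.
type OtelHandle = reload::Handle<Option<fmt::Layer<Registry>>, Registry>;

async fn enable_tracing(Extension(handle): Extension<Arc<OtelHandle>>) -> &'static str {
    // The real implementation would build the OpenTelemetry layer from the
    // configured exporter here.
    let layer: fmt::Layer<Registry> = fmt::layer();
    match handle.reload(Some(layer)) {
        Ok(()) => "tracing enabled\n",
        Err(_) => "failed to enable tracing\n",
    }
}

async fn disable_tracing(Extension(handle): Extension<Arc<OtelHandle>>) -> &'static str {
    match handle.reload(None::<fmt::Layer<Registry>>) {
        Ok(()) => "tracing disabled\n",
        Err(_) => "failed to disable tracing\n",
    }
}

// Routes that could hang off each service's existing internal HTTP server.
fn internal_tracing_routes(handle: OtelHandle) -> Router {
    Router::new()
        .route("/internal/tracing/enable", put(enable_tracing))
        .route("/internal/tracing/disable", put(disable_tracing))
        .layer(Extension(Arc::new(handle)))
}
```

The cross-process story then stays simple: a cloud-side script just iterates over the pods in a namespace and PUTs to each pod's endpoint.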

pH14 (Contributor) commented Jun 9, 2022

Another option would be to push diagnostic knobs like telemetry settings, debug levels, etc. into a ConfigMap that is mounted into all Materialize services. If MZ occasionally reloads those values from disk, that gives us a way to declaratively state whether certain features are enabled or not.
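
A hedged sketch of this ConfigMap-style approach: a background task that periodically re-reads a mounted file and applies it through the reload handle from the earlier sketches. The file path, the true/false file format, and the 30-second polling interval are all assumptions made for illustration, not an agreed-upon design:

```rust
use std::time::Duration;

use tracing_subscriber::{fmt, reload, Registry};

// Same stand-in handle type as in the earlier sketches.
type OtelHandle = reload::Handle<Option<fmt::Layer<Registry>>, Registry>;

async fn watch_diagnostics_config(handle: OtelHandle) {
    let mut enabled = false;
    loop {
        // A mounted ConfigMap shows up as plain files; a "true"/"false" file
        // stands in for whatever declarative format is actually chosen.
        let want_enabled =
            tokio::fs::read_to_string("/etc/materialize/diagnostics/tracing-enabled")
                .await
                .map(|contents| contents.trim() == "true")
                .unwrap_or(false);

        if want_enabled != enabled {
            let new_layer: Option<fmt::Layer<Registry>> =
                if want_enabled { Some(fmt::layer()) } else { None };
            // Only record the new state if the subscriber is still alive.
            if handle.reload(new_layer).is_ok() {
                enabled = want_enabled;
            }
        }

        // Kubernetes refreshes mounted ConfigMaps eventually; poll on a
        // similar order of magnitude.
        tokio::time::sleep(Duration::from_secs(30)).await;
    }
}
```

A service would tokio::spawn a task like this at startup alongside its other background work.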

guswynn (Contributor, Author) commented Jun 9, 2022

@benesch yes, we can set it to None by default when there is no opentelemetry_endpoint!

aalexandrov (Contributor) commented Jun 9, 2022

Do we need to make the reload layer configurable?

I can imagine conditionally emitting the somewhat heavy traces of the various representations a query takes during the optimizer lifecycle, and a configurable reload layer is probably the way to do it.

I should be able to do that if the Reload layer also carries configuration similar to TracingCliArgs.

guswynn (Contributor, Author) commented Jun 10, 2022

Blocked on tokio-rs/tracing#2159

benesch (Member) commented Jul 11, 2022

Fixed by #13361.

benesch closed this as completed on Jul 11, 2022