Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Distributed tracing using OpenTelemtry #4029

Merged
merged 2 commits into from Dec 13, 2021

Conversation

rvalkenaers
Copy link
Contributor

@rvalkenaers rvalkenaers commented Nov 17, 2021

This commit implements tracing using the net/http automatic instrumentation wrappers on the server and topdown/http packages.

Following configuration flags are added:
--distributed-tracing enable distributed tracing using OpenTelemetry Tracing
--distributed-tracing-address string address of the OpenTelemetry Collector gRPC endpoint (default "localhost:4317")
--distributed-tracing-sample-rate int precentage of traces that are sampled and exported (default 100)
--distributed-tracing-service-name string logical name of the service used by OpenTelemetry Tracing (default "opa")
--distributed-tracing-tls {off,tls,mtls} set distributed tracing tls scheme (default off)
--distributed-tracing-tls-cert-file string set path of TLS certificate file of the tracing exporter
--distributed-tracing-tls-private-key-file string set path of TLS private key file of the tracing exporter
--distributed-tracing-tls-skip-verify disable certificate chain verification
Fixes #1469
Signed-off-by: Rien Valkenaers rien.valkenaers@gmail.com

This is an example trace for a "Pull Data during Evaluation" scenario:
opa_trace

Screenshot 2021-11-17 at 17 03 56

Copy link
Contributor

@srenatus srenatus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pushing this. It'll be great to have tracing support soon. 🎉

Two things that caught my eye on a quick glance, please see comments inline 👇 🙃

distributedtracing/distributedtracing.go Outdated Show resolved Hide resolved
distributedtracing/distributedtracing.go Outdated Show resolved Hide resolved
@rvalkenaers rvalkenaers force-pushed the opentelemetry branch 2 times, most recently from f80fdcc to cab362b Compare November 19, 2021 11:08
@anderseknert
Copy link
Member

Love this! 👍

One thing I'm not too sure about though is all the new arguments added to opa run. Not in the sense that they aren't necessary, but I wonder if it'd be a good idea to instead have them added to the OPA configuration? In addition to adding clutter to the opa run --help output, command line arguments have some disadvantages over config, including:

  1. These attributes aren't available for review at the /v1/config endpoint.
  2. They can't be set/modified by discovery bundles.
  3. They can't be set programmatically / by the SDK (not sure if applicable here though).

We've already seen this being problematic elsewhere, so perhaps this would be a good candidate for config rather than CLI arguments? Note that it's still possible to set config attributes through the CLI with --set config=value if one prefers that.

Thoughts, @srenatus @tsandall ?

@srenatus
Copy link
Contributor

@anderseknert I'm on board with making this part of the config instead 👍 Thanks for compiling the pros and cons.

Copy link
Contributor

@srenatus srenatus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two comments about what "keeping the embeddings unaffected" means for this PR. What do you think?

distributedtracing/distributedtracing.go Outdated Show resolved Hide resolved
distributedtracing/distributedtracing.go Outdated Show resolved Hide resolved
Copy link
Contributor

@srenatus srenatus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The setup without the package-level global looks good to me.

I believe there are a few points we should still get figured out:

  1. Can we have some sort of e2e test? There're a bunch of them in test/e2e, and I would imagine that some sort of mock (or in-memory) trace collector could be used to assert that the right things get sent for requests.
  2. The config via json/yaml instead of CLI args would be nice. I'll try to find an example of how we do it in other places...

I've seen the additions to the v1 data API, would you consider that a first step, and that the other APIs would eventually follow, or is this something you'd perhaps want to include in this PR? I'd thing that the v0 data API could use this, too, and perhaps also the policy API. What do you think? If it's getting unwieldy, I suppose we could tackle that in follow-up PRs, too.

cmd/run.go Outdated
runCommand.Flags().BoolVarP(&cmdParams.rt.DistributedTracingEncryptionSkipVerify, "distributed-tracing-tls-skip-verify", "", false, "disable certificate chain verification")
runCommand.Flags().StringVarP(&cmdParams.distributedTracingTLSCertFile, "distributed-tracing-tls-cert-file", "", "", "set path of TLS certificate file of the tracing exporter")
runCommand.Flags().StringVarP(&cmdParams.distributedTracingTLSPrivateKeyFile, "distributed-tracing-tls-private-key-file", "", "", "set path of TLS private key file of the tracing exporter")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @anderseknert mentioned, it would be nice to put these into the yaml/json config instead. I'm trying to find a few pointers on what to do where 👀

@srenatus
Copy link
Contributor

@rvalkenaers thanks for bearing with me! This is a tremendous feature, but also quite a bit of work, I'm afraid 😅

@rvalkenaers
Copy link
Contributor Author

rvalkenaers commented Nov 30, 2021

@srenatus

I've seen the additions to the v1 data API, would you consider that a first step, and that the other APIs would eventually follow, or is this something you'd perhaps want to include in this PR? I'd thing that the v0 data API could use this, too, and perhaps also the policy API. What do you think? If it's getting unwieldy, I suppose we could tackle that in follow-up PRs, too.

I did a quick test and we're actually creating spans for all API endpoints.

@srenatus
Copy link
Contributor

I did a quick test and we're actually creating spans for all API endpoints.

Also the v0 data API with http.send and all that? I thought we hadn't wired that config into the call to rego.New.

@rvalkenaers
Copy link
Contributor Author

Also the v0 data API with http.send and all that? I thought we hadn't wired that config into the call to rego.New.

For the v0 data API we only have the server spans, not the http.send spans. I would prefer to implement those in follow-up PRs.

@srenatus
Copy link
Contributor

srenatus commented Dec 3, 2021

@rvalkenaers I'm not actually completely sure how the config change would look like -- @anderseknert, would making the config a piece of JSON mean that this would be a plugin? Like status, logs, bundles...

@anderseknert
Copy link
Member

I hadn't thought about that. I guess it makes sense from an architectural POV, but you're probably better equipped to judge that :) On the other hand, if this attaches to both the server component and things like http.send, maybe treating it like an integrated component makes things simpler..

@srenatus
Copy link
Contributor

srenatus commented Dec 4, 2021

On the other hand, if this attaches to both the server component and things like http.send, maybe treating it like an integrated component makes things simpler..

Yup I think that's the case. I think we'll need something along these lines, then, right? 👉 https://github.com/open-policy-agent/opa/blob/v0.35.0/topdown/cache/cache.go#L22-L60 Wired up in the same way, it would be configurable either via CLI --set flags, or via a discovery bundle.

Let's also ensure that this gets documented so that people are aware of this cool new feature. But we should probably first settle on how to configure it.

@rvalkenaers rvalkenaers force-pushed the opentelemetry branch 2 times, most recently from d2d3d54 to 0fd87de Compare December 7, 2021 16:21
@srenatus
Copy link
Contributor

srenatus commented Dec 9, 2021

Sorry for the wait, I haven't forgotten about this 😅 I'll play with this some locally, but it looks good!

Copy link
Contributor

@srenatus srenatus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've stumbled upon something we'll need to fix here: logging.

When I run this locally, with DT enabled, and the endpoint isn't available, I see a bunch of these:

{"client_addr":"127.0.0.1:50944","level":"info","msg":"Received request.","req_id":7,"req_method":"GET","req_path":"/health","time":"2021-12-09T13:34:23+01:00"}
{"client_addr":"127.0.0.1:50944","level":"info","msg":"Sent response.","req_id":7,"req_method":"GET","req_path":"/health","resp_bytes":2,"resp_duration":0.291205,"resp_status":200,"time":"2021-12-09T13:34:23+01:00"}
2021/12/09 13:34:27 context deadline exceeded
2021/12/09 13:34:27 traces exporter is disconnected from the server 127.0.0.1:4317: context deadline exceeded
2021/12/09 13:34:42 context deadline exceeded
{"client_addr":"127.0.0.1:50981","level":"info","msg":"Received request.","req_id":8,"req_method":"GET","req_path":"/health","t

It's a mix of OPA's logs, using its logging infrastructure, and some other logs, coming from the DT client. We somehow need to get the latter logs to use OPA's logging, too.

I haven't dug in deep yet to find a fix, but maybe you know what to do right away? 😃

test/e2e/distributedtracing/distributedtracing_test.go Outdated Show resolved Hide resolved
docs/content/configuration.md Outdated Show resolved Hide resolved
@srenatus
Copy link
Contributor

srenatus commented Dec 9, 2021

👉 https://pkg.go.dev/go.opentelemetry.io/otel#SetErrorHandler could using this be the way to go?

The default ErrorHandler instance returned will log all errors to STDERR until an override ErrorHandler is set with SetErrorHandler. All ErrorHandler returned prior to this will automatically forward errors to the set instance instead of logging.

@rvalkenaers
Copy link
Contributor Author

rvalkenaers commented Dec 9, 2021

👉 https://pkg.go.dev/go.opentelemetry.io/otel#SetErrorHandler could using this be the way to go?

The default ErrorHandler instance returned will log all errors to STDERR until an override ErrorHandler is set with SetErrorHandler. All ErrorHandler returned prior to this will automatically forward errors to the set instance instead of logging.

I had this in place initially
https://github.com/open-policy-agent/opa/blob/e93a88d7c4c74abde5d8546f79b31b171afa9ad4/distributedtracing/distributedtracing.go#L45

I removed it when I got rid of the global vars but I didn't found another way to do it.

@srenatus
Copy link
Contributor

srenatus commented Dec 9, 2021

Ahh, alright. Are there non-global ways to provide an error handler?

runtime/runtime.go Outdated Show resolved Hide resolved
@srenatus
Copy link
Contributor

srenatus commented Dec 9, 2021

Can we put the call to set up the global handler in a place where it won't affect people embedding OPA? I'm thinking about cmd/run.go... That way, if you use the SDK, start a runtime, etc, it'll all not be affected; but if you use opa run -s, you'll get the proper logging setup. What do you think?

@rvalkenaers
Copy link
Contributor Author

Can we put the call to set up the global handler in a place where it won't affect people embedding OPA? I'm thinking about cmd/run.go... That way, if you use the SDK, start a runtime, etc, it'll all not be affected; but if you use opa run -s, you'll get the proper logging setup. What do you think?

Sounds good to me.

@srenatus
Copy link
Contributor

Thank you, logging looks alright now:

{"client_addr":"127.0.0.1:53744","level":"info","msg":"Received request.","req_id":12,"req_method":"GET","req_path":"/health","time":"2021-12-10T11:17:56+01:00"}
{"client_addr":"127.0.0.1:53744","level":"info","msg":"Sent response.","req_id":12,"req_method":"GET","req_path":"/health","resp_bytes":2,"resp_duration":0.280679,"resp_status":200,"time":"2021-12-10T11:17:56+01:00"}
{"level":"error","msg":"Distributed tracing: context deadline exceeded","time":"2021-12-10T11:18:04+01:00"}
{"level":"error","msg":"Distributed tracing: max retry time elapsed: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:4317: connect: connection refused\"","time":"2021-12-10T11:18:09+01:00"}

Copy link
Contributor

@srenatus srenatus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of small-impact nitpicks, I think we're good to go then. 🚀 Thanks for bearing with me!

internal/distributedtracing/distributedtracing.go Outdated Show resolved Hide resolved
internal/distributedtracing/distributedtracing.go Outdated Show resolved Hide resolved
Copy link
Member

@tsandall tsandall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rvalkenaers first of all, thank you SO much for putting in the work to enable tracing in OPA, it's much appreciated 🙏

I agree with @srenatus that this is very close to being mergeable--there's just one thing I'd like us to discuss before merging and that has to do w/ package dependencies:

  1. The rego and topdown packages have been extended to include the internal/distributedtracing#Options type in exported functions. The problem with that is that callers outside the top-level OPA package will not be able to provide values of this type, because the type is behind an internal package in OPA. I think we'll need to create a new top-level package (e.g., github.com/open-policy-agent/opa/tracing) where we export the necessary types... I think Options is the only type but I could be wrong.

  2. These changes put a direct dependency on OpenTelemetry into OPA, and not just OPA but also the low-level packages (rego and topdown) that users rely on solely for policy evaluation. Today those packages are constructed intentionally to avoid large 3rdparty dependencies. I'd like to know how feasible it would be to define a small shim interface in OPA--this would go into the exported package I mentioned in (1). OPA would need to implement some boilerplate but it would remove the hard dependency on opentelemetry from the rego and topdown packages.

@srenatus PTAL and let me know if this makes sense. EDIT: If you'd rather sort this out post-merge, that also works.

@srenatus
Copy link
Contributor

@tsandall I agree, the dependency thing is a good point. As far as I am concerned, I'd leave this up to @rvalkenaers: I could take on dealing with those two points between merge and 0.36.0 (i.e. next week); or you (@rvalkenaers) could deal with it before we merge. Either way is fine with me. 🔀 😄

@rvalkenaers
Copy link
Contributor Author

@tsandall I agree, the dependency thing is a good point. As far as I am concerned, I'd leave this up to @rvalkenaers: I could take on dealing with those two points between merge and 0.36.0 (i.e. next week); or you (@rvalkenaers) could deal with it before we merge. Either way is fine with me. 🔀 😄

@srenatus If you could pick it up it would be much appreciated 🙏

@srenatus
Copy link
Contributor

@rvalkenaers OK sure! Would you mind squashing these commits, adding a short commit message, so I can push the merge button?

This commit implements tracing using the net/http automatic instrumentation wrappers on the server and topdown/http packages.

Fixes open-policy-agent#1469
Signed-off-by: Rien Valkenaers <rien.valkenaers@gmail.com>
Co-Authored-By: Stephan Renatus <stephan@styra.com>
Copy link
Contributor

@srenatus srenatus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your contribution that get's the ball rolling with OPA and OpenTelemetry 👏 🎉 🥳

I'll merge this, and take on the remaining little cleanup we've discussed in another PR.

@srenatus srenatus merged commit ce50274 into open-policy-agent:main Dec 13, 2021
@anderseknert
Copy link
Member

This is great! Thanks @rvalkenaers 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add OpenTracing/Census/Telemetry support for OPA API's
5 participants