proposal: add tracing support #3057

Closed
fuweid opened this issue Mar 1, 2019 · 27 comments
@fuweid
Member

fuweid commented Mar 1, 2019

Proposal

containerd is made up of many component subsystems, such as content, snapshots, containers, and tasks. On top of them, the containerd package provides a smart client with powerful functionality, something like a backend-for-frontend. For instance, Client.Pull looks like a single function call, but it already combines several calls to different components under the hood.

If we add tracing to both the containerd smart client and the daemon, it becomes easy to see the whole request call chain. When we run into issues, the tracing data can help us locate the problem quickly.
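
For illustration only (this is not part of the proposal, and the function and operation names below are made up), span-wrapping a Pull-like smart-client call with the OpenTracing Go API could look roughly like this, so the nested component calls show up as children in a single trace:

package main

import (
    "context"

    "github.com/opentracing/opentracing-go"
)

// pullImage is a hypothetical wrapper: one parent span for the smart-client
// call, with each component call it makes becoming a child span.
func pullImage(ctx context.Context, ref string) error {
    span, ctx := opentracing.StartSpanFromContext(ctx, "client.Pull")
    defer span.Finish()
    span.SetTag("image.ref", ref)

    // Each component call made with this ctx (resolver, fetcher, unpacker,
    // ...) would start its own child span the same way, so the whole call
    // chain is visible in one trace.
    resolveSpan, _ := opentracing.StartSpanFromContext(ctx, "client.resolve")
    // ... resolve the reference against the registry here ...
    resolveSpan.Finish()
    return nil
}

func main() {
    // With no tracer registered this falls back to the no-op tracer.
    _ = pullImage(context.Background(), "docker.io/library/nginx:latest")
}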

Document

To make this easier to discuss, I have written a Google doc for this proposal.

https://docs.google.com/document/d/12l0dEa9W8OSZdUYUNZJZ7LengDlJvO5J2ZLategEdl4/edit?usp=sharing

Comments are welcome.

@cpuguy83
Member

cpuguy83 commented Mar 1, 2019

I agree tracing would be helpful.
I am concerned about integrating with Jaeger on the client. How data is exported should be left up to the consumer of the client.

It would be tremendously helpful to also propagate tracing over gRPC to the daemon.

@fuweid
Member Author

fuweid commented Mar 1, 2019

Hi @cpuguy83, thanks for the comment.

I am concerned about integrating with Jaeger on the client. How data is exported should be left up to the consumer of the client.

Yes, absolutely, I agree. For the client, it is always up to the user.
But for the ctr command, to make it easier to use, we could add an option that enables a Jaeger tracer. What do you think?

It would be tremendously helpful to also propagate tracing over gRPC to the daemon.

I was thinking that the containerd components don't call each other, so the client might be good enough. However, it would be good to add this to the daemon as well.

@cpuguy83
Member

cpuguy83 commented Mar 1, 2019

What do you think?
💯

I was thinking that the containerd components don't call each other.

Where this becomes useful is that a client can receive tracing data propagated from the daemon.

@jterry75
Contributor

jterry75 commented Mar 1, 2019

@kevpar - FYI.

We have been looking into this for distributed tracing as well, because the runtime v2 shim calls happen out of process, and it would be amazing to flow activity IDs across the gRPC boundaries for log correlation.

@Random-Liu
Member

Random-Liu commented Mar 2, 2019

/cc @dashpole, who is working on tracing in Kubernetes and has also experimented with tracing down into containerd.

@dashpole

dashpole commented Mar 2, 2019

I had an intern last fall who did a PoC of something similar. His containerd code changes are here: master...Monkeyanator:opencensus-tracing.

We ended up using OpenCensus instead of OpenTracing because it meant we didn't need to pick a tracing backend in open-source components such as containerd. You can then run the "OC Agent" to handle pushing traces to a tracing backend, and ship the components independently.

Since containerd uses a stripped-down version of gRPC for its shims, we explicitly added the trace context to the API rather than trying to add context propagation back in. But if we actually add this, we should do it properly. Other than that, adding tracing to containerd was pretty simple.
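
For reference, here is a minimal sketch (my reading of that approach, not code from the branch above) of passing a span context explicitly across the shim boundary with OpenCensus: the caller serializes its SpanContext into bytes carried as an ordinary request field, and the shim starts a span with a remote parent. The function and span names are illustrative.

package shimtrace

import (
    "context"

    "go.opencensus.io/trace"
    "go.opencensus.io/trace/propagation"
)

// traceContextForRequest runs on the client side: it serializes the current
// span context into bytes that can be attached to a shim API request.
func traceContextForRequest(ctx context.Context) []byte {
    if span := trace.FromContext(ctx); span != nil {
        return propagation.Binary(span.SpanContext())
    }
    return nil
}

// startShimSpan runs in the shim: it rebuilds the remote parent from the
// request field and starts a child span under it.
func startShimSpan(ctx context.Context, tc []byte) (context.Context, *trace.Span) {
    if parent, ok := propagation.FromBinary(tc); ok {
        return trace.StartSpanWithRemoteParent(ctx, "shim.Create", parent)
    }
    return trace.StartSpan(ctx, "shim.Create")
}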

If you are curious how I think this should work in kubernetes, feel free to check out my proposal: kubernetes/enhancements#650.

[Screenshot: pod creation trace]

We used tracing to help us debug a slow runc start issue that our autoscaling team sent us, so I definitely think this has value. Here is the trace from the reproduction in containerd alone:
[Screenshot: containerd trace, captured 2018-11-20]

@cpuguy83
Member

cpuguy83 commented Mar 2, 2019

I've spent some time specifically thinking about how to propagate traces over ttrpc.
The gRPC way (through the metadata mechanism) is not great, since it necessarily requires allocating often-unnecessary objects on every request.
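
For context, a generic sketch of the gRPC approach being referred to (not a proposal for ttrpc): a client interceptor that copies the trace context into per-request metadata, which is where the per-call allocations come from. The "traceparent" key and the encodeTraceContext parameter are placeholders.

package grpctrace

import (
    "context"

    "google.golang.org/grpc"
    "google.golang.org/grpc/metadata"
)

// traceUnaryInterceptor injects an encoded trace context into the outgoing
// metadata of every unary call.
func traceUnaryInterceptor(encodeTraceContext func(context.Context) string) grpc.UnaryClientInterceptor {
    return func(ctx context.Context, method string, req, reply interface{},
        cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
        if tc := encodeTraceContext(ctx); tc != "" {
            // Appending metadata here allocates on every request, which is
            // the overhead mentioned above.
            ctx = metadata.AppendToOutgoingContext(ctx, "traceparent", tc)
        }
        return invoker(ctx, method, req, reply, cc, opts...)
    }
}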

@fuweid
Member Author

fuweid commented Mar 2, 2019

Thanks @dashpole and @cpuguy83 for the input.

I will update the proposal to include the shim part and the projects involved. :)

@crosbymichael
Member

Sounds good to me

@jterry75
Contributor

jterry75 commented Mar 5, 2019

It seems the industry is moving toward OpenCensus. @kevpar did some investigation on our side and thought it was the likely candidate for us to move to as well. The one thing we dislike about it is that it only seems to export a span at End, which means that if you hit a panic you lose all of its data because there was no entry logging. OpenTracing, however, did show the entry trace down to the point of the panic.

@kevpar
Member

kevpar commented Mar 5, 2019

OpenCensus does have the momentum. However, the logging support in OpenTracing is better, and the API is much more flexible. OpenCensus is working on a logging API, but they seem to be leaning towards unstructured logging, which is unfortunate.

@cpuguy83
Member

cpuguy83 commented Mar 5, 2019

FYI, I created a rough tracing package here: https://github.com/virtual-kubelet/virtual-kubelet/blob/master/trace/trace.go

In VK we use OpenCensus and logrus, and have interfaces for tracing and logging. These are primarily for supporting logging through a span rather than for providing multiple backing implementations of tracing. Here is how it works:

ctx, span := trace.StartSpan(ctx)
log.G(ctx).WithField("foo", "bar").Debug("whatever")

In this case, trace.StartSpan has injected its own logger, which wraps the logger already configured in ctx.
This outputs a structured message both to the span and to logrus (which is what we configure as the logger).
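
A stripped-down sketch of the same idea (not the actual virtual-kubelet code), using logrus and OpenCensus directly: the same structured fields go to both the log entry and the current span.

package spanlog

import (
    "context"

    "github.com/sirupsen/logrus"
    "go.opencensus.io/trace"
)

// debugWithSpan writes a structured debug message to logrus and mirrors the
// same fields onto the span in ctx as an annotation.
func debugWithSpan(ctx context.Context, msg string, fields map[string]string) {
    entry := logrus.WithContext(ctx)
    attrs := make([]trace.Attribute, 0, len(fields))
    for k, v := range fields {
        entry = entry.WithField(k, v)
        attrs = append(attrs, trace.StringAttribute(k, v))
    }
    entry.Debug(msg)

    if span := trace.FromContext(ctx); span != nil {
        span.Annotate(attrs, msg)
    }
}

Usage would be something like debugWithSpan(ctx, "whatever", map[string]string{"foo": "bar"}).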

@cpuguy83
Member

cpuguy83 commented Mar 5, 2019

I guess godoc would be better: https://godoc.org/github.com/virtual-kubelet/virtual-kubelet/trace

@cpuguy83
Member

cpuguy83 commented Mar 5, 2019

All this to say: I don't think we should worry about what OpenCensus does or doesn't do with respect to logging. Use whatever is most convenient and has the most community support.

@fuweid
Member Author

fuweid commented Mar 7, 2019

Hi @cpuguy83 and @jterry75

I made a doc for the proposal. Please take a look. Thanks!

https://docs.google.com/document/d/12l0dEa9W8OSZdUYUNZJZ7LengDlJvO5J2ZLategEdl4/edit?usp=sharing

/cc @containerd/containerd-reviewers @containerd/containerd-maintainers

@cpuguy83
Member

cpuguy83 commented Mar 7, 2019

+1 doc looks good.

OC is very flexible, and it's nice that you can write exporters, or even just use an agent exporter, without having to pull in a ton of dependencies.

@jterry75
Contributor

jterry75 commented Mar 8, 2019

+1 for the doc. Thanks @fuweid

@alakesh
Contributor

alakesh commented Apr 5, 2021

How far along are we with tracing support? Is anyone actively working on it, @kevpar? I would like to help/collaborate on tasks. Appreciate your comments!

@kevpar
Member

kevpar commented Apr 5, 2021

How far along are we with tracing support? Is anyone actively working on it, @kevpar? I would like to help/collaborate on tasks. Appreciate your comments!

Previously it was unclear whether we should be using OpenTracing or OpenCensus, so we got bogged down in that. Now that they have merged into OpenTelemetry, it's much clearer what we should use. I've been intending to work on this again once Go OpenTelemetry is production ready, but maybe we don't really need to wait for that.

@alakesh
Contributor

alakesh commented Apr 5, 2021

Thanks @kevpar. I will spend some time this week coming up to speed on how we can use OpenTelemetry. I guess coming up with a detailed plan for how to implement it is the next step?

@kevpar
Member

kevpar commented Apr 5, 2021

I think the first steps are:

  • Hook up gRPC and ttrpc with hooks to auto-create/propagate spans
  • Add code to create spans in important areas and add useful attributes to them (e.g. adding the image name to the image pull span); a rough sketch of these two steps follows below
  • Figure out which exporters should be supported (I think adding a simple logrus exporter to start would be useful)
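
A rough sketch of what the first two bullets could look like with the OpenTelemetry Go SDK and its gRPC instrumentation (assumption: the names and the image.ref attribute key here are illustrative, not containerd's actual tracing package):

package oteltrace

import (
    "context"

    "go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

// dialWithTracing hooks automatic span creation/propagation into every gRPC
// call on the connection (ttrpc would need equivalent hooks of its own).
func dialWithTracing(addr string) (*grpc.ClientConn, error) {
    return grpc.Dial(addr,
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor()),
        grpc.WithStreamInterceptor(otelgrpc.StreamClientInterceptor()),
    )
}

// tracedPull creates a manual span around an image pull and attaches a useful
// attribute to it.
func tracedPull(ctx context.Context, ref string, pull func(context.Context, string) error) error {
    ctx, span := otel.Tracer("containerd").Start(ctx, "image.Pull")
    defer span.End()
    span.SetAttributes(attribute.String("image.ref", ref))
    return pull(ctx, ref)
}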

@alakesh
Contributor

alakesh commented Jun 29, 2021

PR #5570 still has a few TODO items:
a. Fix the issues identified by reviewers.
b. Update the PR to use the most recent v1.0.0-RC1 of the OpenTelemetry Go SDK.
c. Add support for containerd to send tracing data to an OTLP collector rather than being tied to a specific exporter. That way any exporter can obtain the tracing data from the collector (see the sketch below).
d. Remove the stdout exporter support.
e. Make sampling configurable instead of using AlwaysSample by default (also covered in the sketch below).
f. Add as many attributes as possible to make the traces useful.
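
A sketch of what items (c) and (e) could look like with the OpenTelemetry Go SDK: an OTLP gRPC exporter pointed at a collector, plus a parent-based ratio sampler instead of AlwaysSample. The endpoint and ratio values are placeholders, not containerd config keys.

package otlpsetup

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// initTracing sends spans to an OTLP collector and samples a configurable
// fraction of traces rather than everything.
func initTracing(ctx context.Context, endpoint string, ratio float64) (*sdktrace.TracerProvider, error) {
    exp, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint(endpoint), // e.g. "localhost:4317"
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exp),
        sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(ratio))),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}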

@lengrongfu
Contributor

Will containerd pass tracing on to the registry, and to CSI and CRI downstream?

@lengrongfu
Contributor

I found that there is still a lot to do, and I can take on the trace propagation related to image pull. I am currently discussing adding tracing in the distribution project:
distribution/distribution#3473

@kzys
Member

kzys commented Oct 7, 2021

The remaining items are tracked by #6083.

@estesp
Member

estesp commented Feb 23, 2022

Now that basic OTEL support is in containerd (as of 1.6.0), should we close this general proposal and handle follow-ups via #6550? What do you think, @fuweid?

@fuweid fuweid removed this from the 1.7 milestone Feb 23, 2022
@fuweid
Member Author

fuweid commented Feb 23, 2022

@estesp sgtm. I removed the milestone and am closing it.

@fuweid fuweid closed this as completed Feb 23, 2022