Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move tracing to OpenTelemetry #1972

Open
bwplotka opened this issue Jan 9, 2020 · 40 comments
Open

Move tracing to OpenTelemetry #1972

bwplotka opened this issue Jan 9, 2020 · 40 comments

Comments

@bwplotka
Copy link
Member

bwplotka commented Jan 9, 2020

It would be nice to reuse ready libraries then invent and maintain our own (:

AC:

  • Our tracing library is replaced by https://opentelemetry.io/
  • Support for existing providers is preserved: Jaeger, Elastic, Lightstep, Stackdriver
@squat
Copy link
Member

squat commented Jan 9, 2020

Would love to see this. And we’ll be in line with other projects moving in this direction eg prom

@kakkoyun
Copy link
Member

kakkoyun commented Jan 9, 2020

I'm happy to work on this.

Turns out, I'm too busy to spare time for this one.
So if anyone up for the task, feel free.

@daixiang0
Copy link
Member

daixiang0 commented Jan 30, 2020

@bwplotka i am happy to work on this.

Update:
With a quick look, here is a lib in OpenTelemetry to forward OpenTracing API calls to the OpenTelemetry SDK.

So there are two ways to move:

  • just use bridge lib to forward, smallest change, but still keep OpenTracing dependence
  • totally use OpenTelemetry SDK, a big change

BTW, OpenTelemetry is still under active development, some features in OpenTracing may be not supported i think.

@bwplotka
Copy link
Member Author

Awesome! Thanks for taking a look. My thoughts:

  • What about different backends (e.g lightstep, ELK APM) that OpenTelemetry does not support?
  • Definitely IF we are doing the change, let's start small, so option 1.

BTW, OpenTelemetry is still under active development, some features in OpenTracing may be not supported i think.

  • Yes, that's my worry. That's why we should be careful. Let's focus now on what is missing that we are using? I guess small try does not hurt, if you want to give it a go - but you need to make sure you test it against e.g some open source provider like Jaeger.

This is kind of worrying state of Go right now. Maybe it's worth to put a friendly issue on https://github.com/open-telemetry/opentelemetry-go github with a question for current dates/blockers?

image

@daixiang0
Copy link
Member

I file a issue about Elastic/Lightstep support, i would try the bridge.

@lawrencejones
Copy link

Given opencencus and opentracing are both mature libraries with LTS guarantees, would it make more sense to pick one of those for now? I've held off using opentelemetry while it's in alpha to avoid needless changes, and under the expectation that opencencus/opentracing -> opentelemetry will eventually become a well worn migration path.

@bwplotka
Copy link
Member Author

bwplotka commented Feb 6, 2020

We use opentracing right now, but it does not define actual RPC protocol, only client API. So we are forced to maintain client code against different tracing backends ourselves. Something that we would like to avoid

@kakkoyun
Copy link
Member

kakkoyun commented Feb 6, 2020

As far as I know, OpenTelemetry has a fairly stable API for tracing.
However, they are still in alpha, maybe we should wait for a little bit more. https://opentelemetry.io/project-status/

@daixiang0
Copy link
Member

Yep, here is the roadmap for opentelemetry, maybe we can wait for the 1.0 version of golang based on feedback.

@stale
Copy link

stale bot commented Mar 8, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Mar 8, 2020
@bwplotka
Copy link
Member Author

bwplotka commented Mar 8, 2020 via email

@stale stale bot removed the stale label Mar 8, 2020
@stale
Copy link

stale bot commented Apr 7, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Apr 7, 2020
@stale stale bot closed this as completed Apr 14, 2020
Thanos Roadmap automation moved this from To do to Done Apr 14, 2020
@brancz
Copy link
Member

brancz commented Apr 15, 2020

We still want this.

@brancz brancz reopened this Apr 15, 2020
Thanos Roadmap automation moved this from Done to In progress Apr 15, 2020
@stale stale bot removed the stale label Apr 15, 2020
@stale
Copy link

stale bot commented May 15, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label May 15, 2020
@kakkoyun kakkoyun removed the stale label May 15, 2020
@stale
Copy link

stale bot commented Jun 14, 2020

Hello 👋 Looks like there was no activity on this issue for last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jun 2, 2021
@kakkoyun kakkoyun removed the stale label Jun 3, 2021
@stale
Copy link

stale bot commented Aug 2, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Aug 2, 2021
@stale
Copy link

stale bot commented Aug 17, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Aug 17, 2021
Thanos Roadmap automation moved this from In progress to Done Aug 17, 2021
@bwplotka bwplotka reopened this Aug 20, 2021
Thanos Roadmap automation moved this from Done to In progress Aug 20, 2021
@stale stale bot removed the stale label Aug 20, 2021
@bwplotka
Copy link
Member Author

On it, but missing Otel API -> opentracing tracer/exporters support that will allow us to use our custom exporter to Elastic APM/lightstep etc... .This is required to make sure we don't break compatibility of existing Tracing integrations when moving to Otel lib ):

@bwplotka
Copy link
Member Author

@matej-g
Copy link
Collaborator

matej-g commented Oct 4, 2021

I had a quick look and it seems like the configuration should be translatable relatively good to ensure smooth transition from OpenTracing to OpenTelemetry (even if not completely 1:1). I think that would make more sense than having a 'translating' exporter (that would necessitate we still rely on OpenTracing packages, which I think we want to avoid, since the expectation is these will be deprecated).

To me it seems that:

  1. For Jaeger we can switch to OpenTelemetry Jaeger exporter and this should be relatively smooth.
  2. For GCP we can leverage the Cloud Trace exporter.
  3. Other proprietary providers seems to support the OTLP format, which means we could leverage the OTLP exporter and where I envision two possibilities:
    a) We 'wrap' the proprietary tracer configurations into the OTLP exporter and use that to send spans.
    b) We create a stand-alone OTLP tracer and keep the existing OpenTracing-based proprietary exporters with the understanding, that these will be deprecated and removed (after time) and users will be expected to use the OTLP exporter.
    c) We'll do both a) and b)

I personally would be leaning towards option b). I'm wondering how many Thanos users are actually using these exporters. Since they are proprietary, I would expect the respective commercial providers to contribute the code and maintain it instead of community having to use resources to do that. Otherwise it makes more sense to have an OTLP exporter (which would be usable with any provider accepting it, seeing as it might become the standard and which is also accepted by the actual OTEL collector).

@bwplotka
Copy link
Member Author

Nice, thanks for the investigation. Let's summarize our options:

a) We 'wrap' the proprietary tracer configurations into the OTLP exporter and use that to send spans.

So using existing configuration and translate into new provider code available in Otel packages. Assuming it's possible, (I don't think it's trivial due to different sampling techniques), we either can do this while sticking to (a1 option) OpenTracing instrumentation for now, OR replace it with Otel (a2 option) which will require us to switch all exporters at once.. which is not fun. In a1 option, we could try to translate each provider one by one and when all are using Otel we can switch instrumentation to Otel too. That would make some sense.

b) We create a stand-alone OTLP tracer and keep the existing OpenTracing-based proprietary exporters with the understanding, that these will be deprecated and removed (after time) and users will be expected to use the OTLP exporter.

We cannot do this if we also replace our instrumentation to Otel from OpenTracing because there is no shim for OpenTracing to use Otel exporters. But this is fine if we would do similar to option a1 where we don't change instrumentation to Otel UNTIL we deprecate all.

Overall, I think it's nice you looked at it excluding instrumentation because this kind of tells us that replacing the OpenTracing instrumentation part should be postponed until we switch all providers. Technically I would stick to b, mainly because we give time still for people to move AND it might impossible to translate configuration..

@bwplotka
Copy link
Member Author

bwplotka commented Oct 14, 2021

cc Cortex and Loki devs since they might laverage our tracing clients. Are you ok with option b on this one? cc @pracucci @pstibrany @cyriltovena @slim-bean

@pracucci
Copy link
Contributor

I took a quick look at the discussion and, as far as I can see, option b is the one giving more time to upgrade. If so, I'm ok.

@matej-g
Copy link
Collaborator

matej-g commented Nov 7, 2021

I'm posting to provide updates on the progress here. I started to migrate the exporters:

  • We are currently limited to OTEL packages with version v0.20.0 - this is due to the dependency chain, as we are importing (via Cortex and Grafana Dskit packages) module go.etcd.io/etcd/client/v3 in version v3.5.0, which uses v0.20.0 OTEL packages for its tracing. Due to changes in paths for some packages, I am not able to import latest OTEL version in Thanos since I'm getting module * found, but does not contain package *. The latest master branch of ETCD however imports newer OTEL version, so I'm expecting to be reconciled once this is released.
  • For Jaeger, the OpenTracing client will soon be sunset, however there are still
    some features which are missing between the old client and the new OpenTelemetry exporter,
    most notably we're missing remote sampler (which however is underway - see Add Jaeger remote sampler open-telemetry/opentelemetry-go-contrib#936).
    I have therefore postponed migrating this exporter until we can migrate with all features.
  • I moved onto Google Cloud (Stackdriver) exporter as next, this one seems to be relatively simple to migrate, so good first candidate - opened a PR for this

I'll also use this comment to track the progress below:

  • Migrate Jaeger client
  • Migrate Google Cloud (Stackdriver) client -> done
  • Add OTLP exporter
  • Migrate proprietary providers (ELM and Lighstep)
    • Alternative: (if cannot easily be migrated): Deprecate and point users to OTLP
  • Migrate the instrumentation

Next step would be adding OTLP and once all necessary work to add Jaeger features on OTEL side is done, that will also be migrated.

@yeya24
Copy link
Contributor

yeya24 commented Jan 7, 2022

It would great if we can start with Add OTLP exporter step as this seems independent with other tasks.
If no one is working on this one right now then I can try to take it.

@matej-g
Copy link
Collaborator

matej-g commented Jan 28, 2022

Sorry for the delay on this @yeya24, I updated my related PR and commented there as well. I believe that PR should give us a common set of utilities to bridge OpenTelemetry exporters while we keep using OpenTracing instrumentation, for now. So I believe we can work on the other exporters independently once those bridging utilities land, as you say.

@matej-g
Copy link
Collaborator

matej-g commented Mar 22, 2022

So to move this forward again, the Google Cloud (Stackdriver) provider has now been move (#4838) and we should have all the code necessary to support move for other providers as well, including adding OTLP exporter.

@bwplotka
Copy link
Member Author

bwplotka commented Jun 9, 2022

FYI: I played with much simpler API for own use https://github.com/bwplotka/tracing-go (it using Otel underneath)

@squat
Copy link
Member

squat commented Jun 9, 2022

FYI: I played with much simpler API for own use https://github.com/bwplotka/tracing-go (it using Otel underneath)

ooOOooh!

@stale
Copy link

stale bot commented Aug 13, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Aug 13, 2022
@matej-g matej-g removed the stale label Aug 15, 2022
@Symbianx
Copy link

Symbianx commented Oct 7, 2022

Looks like a PR for this issue was merged some time ago. When can we expect a new release with it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Thanos Roadmap
  
In progress
Development

No branches or pull requests