Knative blog copyedits #1473

Merged · 2 commits · Jun 22, 2022

2 changes: 2 additions & 0 deletions .vscode/cspell.json
@@ -10,7 +10,9 @@
"Dynatrace",
"Grafana",
"initializers",
"Istio",
"Javadoc",
"Knative",
"Kubernetes",
"lifecycles",
"Lightstep",
186 changes: 138 additions & 48 deletions content/en/blog/2022/knative.md
@@ -1,53 +1,112 @@
---
title: Distributed tracing in Knative
linkTitle: Tracing in Knative
@pavolloffay - notice the shortened title, which appears in the blog left-nav. Hopefully you're ok with this.

date: 2022-04-12
spelling: cSpell:ignore Cloudevents istio Loffay Pavol pavolloffay
author: Pavol Loffay
---

In this article, you will learn how distributed tracing works in Knative and how
the OpenTelemetry project can make tracing support in this environment easier.
We will look at Knative under the hood to understand what distributed tracing
capabilities it provides out of the box and which parts of the system need
additional instrumentation.

## About Knative

Knative is a serverless platform built on top of Kubernetes as a set of
`CustomResourceDefinitions` (CRDs). The project is split into two logical parts:

- serving - facilitates the creation, deployment, and scaling of
  workloads/services
- eventing - facilitates event-driven communication between workloads to enable
  loosely coupled architectures

In this article we will not cover Knative fundamentals; please refer to the
[Knative documentation](https://knative.dev/docs/) to get familiar with the
project.

### Knative data flow

Before we dive into tracing, let's take a look at a data flow example. It will
help us understand the Knative architecture and which parts of the system need
to be instrumented in order to understand the timing characteristics of a
request or transaction. In the diagram below there are two user workloads (first
and second) and an incoming request, marked as (1. HTTP), that goes to the
workload first and then to the workload second as a cloud event message.

![Knative data flow: incoming HTTP request goes through Knative service and queue-proxy sidecar container before it reaches a workload](/img/blog-knative/knative-data-flow.jpg)

There are two important facts about this diagram:

1. All traffic goes through the queue-proxy sidecar.
2. All traffic goes through Knative component(s). The Knative components in the
   diagram are abstract; they can be a Knative activator service, a Knative
   event broker, a dispatcher, etc.

From the telemetry perspective, the purpose of queue-proxy is similar to that of
istio-proxy in the Istio service mesh. It is a proxy that intercepts all traffic
going to the workload and emits telemetry data for any communication to or from
the workload.

## Distributed tracing in Knative

The Knative project comes with a solid distributed tracing integration. Major
parts of the system are already instrumented and the system creates trace data
for transactions/requests that go to user workloads.

At the moment, Knative internally uses OpenCensus instrumentation libraries that
export data in Zipkin format. The inter-process context propagation uses the
[Zipkin B3](https://github.com/openzipkin/b3-propagation) and
[W3C Trace-Context](https://www.w3.org/TR/trace-context/) standards. The Zipkin
B3 propagation format is most likely kept for legacy reasons, to allow trace
context propagation with older workloads instrumented with older technology. As
a best practice, use the standard W3C Trace-Context, which is natively used by
the OpenTelemetry project.
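
For example, a workload instrumented with the OpenTelemetry Go SDK can register
W3C Trace-Context as its primary propagation format and keep B3 only for older
peers. A minimal sketch, assuming the B3 propagator published by the
OpenTelemetry Go contrib project; whether you need B3 at all depends on your
workloads:

```go
package tracing

import (
	"go.opentelemetry.io/contrib/propagators/b3"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// configurePropagators registers W3C Trace-Context (and Baggage) as the
// primary propagation format, with Zipkin B3 kept only for compatibility with
// older workloads that still expect the X-B3-* headers.
func configurePropagators() {
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, // W3C traceparent/tracestate headers
		propagation.Baggage{},      // W3C baggage header
		b3.New(),                   // Zipkin B3 headers, for legacy peers
	))
}
```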

Now let's take a look at an example trace with two workloads (first and second).
The workflow is similar to the diagram from the previous section: the first
service receives an HTTP call and sends a cloud event to the second service. The
full demo source code can be found in
[pavolloffay/knative-tracing](https://github.com/pavolloffay/knative-tracing).

![Jaeger screenshot showing a Knative trace](/img/blog-knative/jaeger-knative-trace.jpg)

The trace shows the following services interacting: activator, first workload,
broker-ingress, imc-dispatcher, broker-filter, activator, and second workload.
There are many services, right? A simple interaction of two workloads resulted
in a trace that shows many Knative internal components. From the observability
perspective, this is great because it can reveal issues in the infrastructure
and show the cost associated with Knative request processing.

Let's briefly examine the data flow. The incoming HTTP request first goes
through an activator service that is responsible for scaling up a workload;
then execution reaches the first workload. The first workload sends a cloud
event, which goes through the broker and dispatcher and finally reaches the
second workload.

Now let's take a closer look at the user workloads. The first service is a
Golang service with a single REST API endpoint. The endpoint implementation
creates a cloud event and sends it to the broker. Here are the important facts
from the observability perspective (a combined sketch follows the list):

- The REST API is instrumented with OpenTelemetry. This allows us to link traces
  started in the Knative activator service with spans created in the workload,
  and to further link them with outbound spans - e.g. calls to the second
  service.
- The workload uses the instrumented
  [Cloudevents client/SDK](https://github.com/cloudevents/sdk-go/tree/main/observability/opentelemetry/v2) -
  similarly to the previous point, it allows us to continue the trace in the
  outbound request (in this scenario, to the second service).
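
A minimal sketch of how the first service might combine the two, assuming the
`otelhttp` handler wrapper from OpenTelemetry Go contrib and the `NewClientHTTP`
constructor from the CloudEvents observability module linked above; the broker
URL, event type, and source are illustrative:

```go
package main

import (
	"log"
	"net/http"

	otelObs "github.com/cloudevents/sdk-go/observability/opentelemetry/v2/client"
	cloudevents "github.com/cloudevents/sdk-go/v2"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	// The CloudEvents client from the OpenTelemetry observability module
	// creates a span for every send and propagates the trace-context on the
	// outgoing HTTP request that carries the event.
	ceClient, err := otelObs.NewClientHTTP(nil, nil)
	if err != nil {
		log.Fatalf("failed to create CloudEvents client: %v", err)
	}

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		event := cloudevents.NewEvent()
		event.SetType("com.example.hello") // illustrative event type
		event.SetSource("first-service")   // illustrative source
		_ = event.SetData(cloudevents.ApplicationJSON, map[string]string{"msg": "hello from first"})

		// r.Context() carries the server span started by otelhttp below, so
		// the outgoing event continues the same trace.
		ctx := cloudevents.ContextWithTarget(r.Context(),
			"http://broker-ingress.knative-eventing.svc.cluster.local/default/default")
		if result := ceClient.Send(ctx, event); cloudevents.IsUndelivered(result) {
			log.Printf("failed to send event: %v", result)
			http.Error(w, "failed to send event", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	})

	// otelhttp extracts the incoming trace-context (propagated through the
	// Knative activator and queue-proxy) and starts a server span per request.
	log.Fatal(http.ListenAndServe(":8080", otelhttp.NewHandler(handler, "first-service")))
}
```

With this in place, the server span created by `otelhttp` becomes the parent of
the CloudEvents send span, which is what links the activator trace to the call
towards the broker.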

How is the trace-context (`traceId`, `spanId`, `sampled` flag) propagated in our
example applications? The trace-context is propagated in HTTP headers, both for
incoming HTTP requests into the first service and for cloud events sent to the
second service. The trace-context is not attached directly to the event
extensions/attributes.

The following log output shows the request headers received by the first
service:

```nocode
2022/02/17 12:53:48 Request headers:
2022/02/17 12:53:48 X-B3-Sampled: [1]
2022/02/17 12:53:48 X-B3-Spanid: [af6c239eb7b39349]
@@ -70,8 +129,11 @@
2022/02/17 12:53:48
```

Now let's take a look at the logging from the second service, which exposes an
API to consume Knative events. The event API in this case is just an HTTP
endpoint, which is a cloud event implementation detail:

```nocode
2022/02/17 13:39:36 Event received: Context Attributes,
specversion: 1.0
type: httpbody
@@ -85,39 +147,67 @@ Data,
hello from first, traceid=5f2c4775e0e36efc1d554a0b6c456cc1
```

We see that the trace context is not directly present in the event object.
However, it is encoded in the incoming transport message - HTTP headers.
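
If a receiving workload does not use an instrumented CloudEvents client, it can
still continue the trace by extracting the context from those transport headers
with the configured propagator. A minimal sketch with the OpenTelemetry Go SDK;
the handler and span names are illustrative:

```go
package main

import (
	"log"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
)

// eventHandler consumes Knative events delivered as plain HTTP POST requests.
// The trace-context travels in the transport headers rather than in the event
// payload, so it can be recovered with the registered propagator.
func eventHandler(w http.ResponseWriter, r *http.Request) {
	// If peers only send B3 headers, register the B3 propagator as shown earlier.
	ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))

	// Continue the trace started by the Knative components and the first service.
	ctx, span := otel.Tracer("second-service").Start(ctx, "process-event")
	defer span.End()

	log.Printf("processing event in trace %s", trace.SpanContextFromContext(ctx).TraceID())
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/", eventHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```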

### Future improvements

In the previous section, it was mentioned that the Knative serving and eventing
components are instrumented with the OpenCensus SDK. The instrumentation will
change to OpenTelemetry in the future, which is tracked in
[knative/eventing/#3126](https://github.com/knative/eventing/issues/3126) and
[knative/pkg#855](https://github.com/knative/pkg/issues/855). The SDK change
might not have an immediate impact on users; however, it will enable them to
natively report data in the OpenTelemetry format (OTLP).
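
For a user workload, reporting OTLP natively is already possible today with the
OpenTelemetry SDK, for example exporting to an OpenTelemetry Collector that
forwards the data to the tracing backend. A minimal sketch for Go; the collector
endpoint, service name, and package versions are illustrative:

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

// initTracerProvider exports spans in OTLP over gRPC, e.g. to an OpenTelemetry
// Collector running next to the Knative workloads.
func initTracerProvider(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector.observability:4317"), // illustrative address
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("first-service"), // illustrative name
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
```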

Another recently merged change is the addition of
[Cloudevents semantic attributes to the OpenTelemetry specification](/docs/reference/specification/trace/semantic_conventions/cloudevents).
The document standardizes span attributes related to CloudEvents. The screenshot
below is from the demo application, which does not yet use the standardized
attribute names:

![A screenshot from Jaeger that shows Knative attributes](/img/blog-knative/jaeger-knative-attributes.jpg)
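
As a sketch of what adopting the convention could look like in the demo
workloads, the standardized names can be set on the send or receive span; the
helper below is hypothetical and only illustrates the attribute names from the
specification:

```go
package tracing

import (
	cloudevents "github.com/cloudevents/sdk-go/v2"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// setCloudEventAttributes records the CloudEvents semantic convention
// attributes on a span, using the names standardized in the OpenTelemetry
// specification.
func setCloudEventAttributes(span trace.Span, e cloudevents.Event) {
	span.SetAttributes(
		attribute.String("cloudevents.event_id", e.ID()),
		attribute.String("cloudevents.event_source", e.Source()),
		attribute.String("cloudevents.event_type", e.Type()),
		attribute.String("cloudevents.event_spec_version", e.SpecVersion()),
	)
	if e.Subject() != "" {
		span.SetAttributes(attribute.String("cloudevents.event_subject", e.Subject()))
	}
}
```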

### Configuration

Tracing in Knative can be easily enabled. Please follow the
[official documentation](https://knative.dev/docs/) for a step-by-step guide.
Let's briefly describe the process here:

1. Deploy a tracing system that can ingest tracing data in Zipkin format -
   Zipkin, Jaeger, or the OpenTelemetry Collector
2. Enable tracing in
[Knative eventing](https://knative.dev/docs/eventing/accessing-traces/)
3. Enable tracing in
[Knative serving](https://knative.dev/docs/serving/accessing-traces/)

In the beginning, I recommend using a 100% sampling rate to capture trace data
for all traffic in the cluster. This helps avoid any issues with sampling, but
do not forget to change this configuration once you move to a production
environment.
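
On the workload side, a parent-based sampler complements this setting: the
workload then honors the sampling decision already made by the Knative
components, so its spans are not missing from traces that Knative sampled in. A
minimal sketch for the OpenTelemetry Go SDK, with a 100% ratio for new traces
that you would lower in production:

```go
package tracing

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newWorkloadSampler follows the sampling decision of the incoming trace (made
// by the Knative components) and samples 100% of the traces it has to start
// itself. Lower the ratio before moving to production.
func newWorkloadSampler() sdktrace.Sampler {
	return sdktrace.ParentBased(sdktrace.TraceIDRatioBased(1.0))
}
```

Pass it to the tracer provider via `sdktrace.WithSampler(newWorkloadSampler())`.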

## Conclusion

We have learned what distributed tracing capabilities the Knative project
provides out of the box and which parts need more work from the user. Generally
speaking, Knative emits rich tracing data; however, as always, the user is
responsible for instrumenting the workload and making sure the trace-context is
propagated from inbound to outbound requests or events. This is exactly the same
situation as implementing distributed tracing in a service mesh.

OpenTelemetry can help to instrument the user workload and correctly propagate
the trace-context. Depending on the language, the user can initialize
instrumentation libraries explicitly in the code or even
[dynamically inject OpenTelemetry auto-instrumentation into the workload](https://medium.com/opentelemetry/using-opentelemetry-auto-instrumentation-agents-in-kubernetes-869ec0f42377).

## References

- [Knative docs](https://knative.dev/docs/)
- [Knative serving tracing config](https://knative.dev/docs/serving/accessing-traces/)
- [Knative eventing tracing config](https://knative.dev/docs/eventing/accessing-traces/)
- [CloudEvents](https://cloudevents.io)
- [Zipkin B3](https://github.com/openzipkin/b3-propagation)
- [W3C Trace-Context](https://www.w3.org/TR/trace-context/)
- [OpenTelemetry instrumentation for Cloudevents Golang SDK](https://github.com/cloudevents/sdk-go/tree/main/observability/opentelemetry/v2)
- [Cloudevents OpenTelemetry attributes](/docs/reference/specification/trace/semantic_conventions/cloudevents/)
- [Knative tracing demo](https://github.com/pavolloffay/knative-tracing)