Add logging related metrics to Containerd CRI plugin #7546

sophieliu15 · 2022-10-17T19:09:40Z

This PR adds logging-related metrics to the Containerd CRI plugin. The metrics are per Containerd instance. They can be used to estimate the logging completeness of Containerd and, when combined with logging agent metrics, the logging completeness of the whole logging pipeline. See the design details below.

Context

graph LR
A[Customer Workload] -- Output --> B((stdout/stderr))
B -- Retrieve --> C[Containerd]
C -- Output --> D((disk))
D -- Retrieve --> E[Logging Agent]
E -- Output --> F[Sink]
G[Log Rotator] -- Rotate --> D

The graph above shows a common logging pipeline architecture. Customer workloads write their logs to stdout/stderr. Containerd reads those logs and writes them to log files on disk. Logging agents read the log files from disk and export the logs to various sinks (e.g., stdout, Cloud Logging). Meanwhile, a log rotator rotates the files on disk to prevent logs from filling up the disk.

In the logging pipeline above, logs can be lost or duplicated at any stage. For example, Containerd might fail to read a log entry or to write it to disk, which would cause log loss. Another example we have encountered is logging agents unexpectedly duplicating logs during a node restart.

Motivation

Adding logging-related instrumentation to Containerd helps us achieve the following two goals:

Estimate logging completeness of Containerd
Estimate logging completeness on disk before logs are registered by the logging agent

Currently we have zero visibility into these two areas. See the section below for details about how we use the proposed Containerd metrics to achieve the goals above.

Detailed Design

Logging Related Metrics

We will calculate the following metrics in the Containerd CRI plugin. Metrics are per Containerd instance.

1. containerd_cri_input_entries_total: Number of log entries Containerd receives
2. containerd_cri_input_bytes_total: Size of logs Containerd receives
3. containerd_cri_output_entries_total: Number of log entries Containerd successfully writes to disk
4. containerd_cri_output_bytes_total: Size of logs Containerd successfully writes to disk
5. containerd_cri_split_entries_total: Number of extra log entries created by splitting the original log entry. This happens when the original log entry exceeds the length limit. This metric does not count the original log entry.
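To make the split counter concrete, here is a minimal sketch of how many extra entries one oversized log entry would contribute. It assumes the entry is cut into consecutive chunks of at most `max_len` bytes; the function name `extra_entries_from_split` and the parameter `max_len` are hypothetical and are not taken from the PR's code.

```python
def extra_entries_from_split(entry_len: int, max_len: int) -> int:
    """Return the number of extra entries created when an entry of
    entry_len bytes is split into chunks of at most max_len bytes.

    The original entry itself is not counted, matching the description
    of containerd_cri_split_entries_total.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if entry_len <= max_len:
        return 0  # short enough: no split, no extra entries
    chunks = -(-entry_len // max_len)  # integer ceiling division
    return chunks - 1


# Hypothetical 16 KiB limit, chosen only for illustration.
print(extra_entries_from_split(100, 16384))    # prints 0
print(extra_entries_from_split(40000, 16384))  # prints 2
```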

Usage of metrics

This section discusses some use cases of the logging metrics. The bracketed numbers refer to each metric's position in the list above.

Estimate logging completeness of Containerd

This can be achieved by comparing input_entries [1], output_entries [3] and split_entries [5]. In the case of no log processing errors, input_entries + split_entries = output_entries.

input_bytes [2] and output_bytes [4] can also be used for the estimation. However, the result won't be exact because the Containerd CRI plugin adds metadata to each log entry, so the output size exceeds the input size even when nothing is lost.
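The entry-count check can be sketched as follows. This is an illustration only, not part of the PR: the function name `missing_entries` is hypothetical, and the three counter values are assumed to have been scraped at the same moment.

```python
def missing_entries(input_entries: int, split_entries: int, output_entries: int) -> int:
    """Return how many entries Containerd received (or created by
    splitting) but did not write to disk. Zero means no loss observed."""
    return input_entries + split_entries - output_entries


# Values from the example metrics in this PR description.
print(missing_entries(input_entries=26, split_entries=0, output_entries=26))  # prints 0
```

A positive result suggests entries were dropped inside Containerd. A small non-zero value can also appear briefly if the counters are read while entries are still in flight.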

Estimate logging completeness on disk

This estimates logging loss/duplication after logs leave Containerd but before they are registered by logging agents. The estimation can be achieved by comparing Containerd output logging metrics ([3] and [4]) with logging agent input metrics.
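A rough sketch of that comparison is shown below. The PR does not name any logging agent metric, so `agent_input_entries` stands for whatever input counter the agent exposes, and the function name `classify_disk_stage` is hypothetical.

```python
def classify_disk_stage(containerd_output_entries: int, agent_input_entries: int) -> str:
    """Compare entries Containerd wrote to disk with entries the logging
    agent read, and describe the difference."""
    delta = agent_input_entries - containerd_output_entries
    if delta < 0:
        return f"possible loss of {-delta} entries"
    if delta > 0:
        return f"possible duplication of {delta} entries"
    return "counts match"


print(classify_disk_stage(26, 26))  # prints counts match
print(classify_disk_stage(26, 24))  # prints possible loss of 2 entries
```

Both counters should cover the same time window. An agent that is still catching up on a file will temporarily look like loss.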

Estimate logging volume on the node

This can be achieved via output_bytes_total [4].
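Because output_bytes_total is a counter, volume over time comes from the difference between two samples. The sketch below assumes two scrapes with timestamps in seconds; the function name `bytes_per_second` is hypothetical, and treating a decrease as a counter reset is a common convention for counters, not something the PR specifies.

```python
def bytes_per_second(prev_bytes: float, prev_ts: float,
                     curr_bytes: float, curr_ts: float) -> float:
    """Estimate logging throughput between two counter samples."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    if curr_bytes < prev_bytes:
        # The counter went backwards, e.g. Containerd restarted.
        # Assume it restarted from zero and count only the new value.
        increase = curr_bytes
    else:
        increase = curr_bytes - prev_bytes
    return increase / elapsed


print(bytes_per_second(1000, 0.0, 4000, 60.0))  # prints 50.0
```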

Estimate logging completeness of the entire pipeline

This can be achieved by comparing Containerd input logging metrics with logging agent output metrics.

Example metrics

The following metrics were collected by deploying a custom Containerd binary to a GKE node.

$ curl http://127.0.0.1:1338/v1/metrics | egrep "input_entries"
# HELP containerd_cri_input_entries_total Number of log entries received
# TYPE containerd_cri_input_entries_total counter
containerd_cri_input_entries_total 26
 
$ curl http://127.0.0.1:1338/v1/metrics | egrep "output_entries"
# HELP containerd_cri_output_entries_total Number of log entries successfully written to disks
# TYPE containerd_cri_output_entries_total counter
containerd_cri_output_entries_total 26
 
$ curl http://127.0.0.1:1338/v1/metrics | egrep "input_bytes"
# HELP containerd_cri_input_bytes_total Size of logs received
# TYPE containerd_cri_input_bytes_total counter
containerd_cri_input_bytes_total 2933
 
$ curl http://127.0.0.1:1338/v1/metrics | egrep "output_bytes"
# HELP containerd_cri_output_bytes_total Size of logs successfully written to disks
# TYPE containerd_cri_output_bytes_total counter
containerd_cri_output_bytes_total 4141
 
 
$ curl http://127.0.0.1:1338/v1/metrics | egrep "split_entries"
# HELP containerd_cri_split_entries_total Number of extra log entries created by splitting the original log entry. This happens when the original log entry exceeds length limit. This metric does not count the original log entry.
# TYPE containerd_cri_split_entries_total counter
containerd_cri_split_entries_total 0
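Output like the above can be turned into numbers for the checks described earlier. The sketch below is a minimal parser that assumes unlabeled samples with no trailing timestamp, as in the example output; the function name `parse_counters` is hypothetical, and the parser does not cover the full Prometheus text format.

```python
def parse_counters(text: str) -> dict:
    """Parse 'name value' sample lines into a dict, skipping the
    '# HELP' and '# TYPE' comment lines and blank lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space: metric name on the left, value on the right.
        name, _, value = line.rpartition(" ")
        values[name] = float(value)
    return values


sample = """\
# HELP containerd_cri_input_entries_total Number of log entries received
# TYPE containerd_cri_input_entries_total counter
containerd_cri_input_entries_total 26
containerd_cri_output_entries_total 26
containerd_cri_split_entries_total 0
"""
counters = parse_counters(sample)
print(counters["containerd_cri_input_entries_total"])  # prints 26.0
```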

k8s-ci-robot · 2022-10-17T19:09:49Z

Hi @sophieliu15. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

samuelkarp · 2022-10-17T20:22:10Z

/ok-to-test

Signed-off-by: Sophie Liu <sophieliu@google.com>

samuelkarp · 2022-10-20T02:02:38Z

/retest

samuelkarp · 2022-10-21T20:10:08Z

With 1.6 as an LTS, I think we should cherry-pick this commit over. @kzys and @dcantah, do you have thoughts on whether we should also cherry-pick this into 1.5?

kzys · 2022-10-21T20:45:32Z

The change seems safe enough to have in 1.6. Regarding 1.5, I'd like to keep the version open for only bug/security fixes.

dcantah · 2022-10-21T20:46:57Z

1.5 seems in maintenance mode which in my mind drifts towards just security (and bugs).

1.6 seems perfectly fine though

dcantah · 2022-10-21T20:48:02Z

Oh nice @kzys beat me to it 🤣

samuelkarp · 2022-10-21T20:52:57Z

1.6 cherry-pick: #7571

JeffLuoo · 2022-10-31T16:54:33Z

A question about the metric labels: is namespace a label on these metrics? I would like to get containerd_cri_input_bytes_total for a specific namespace only.

k8s-ci-robot added the needs-ok-to-test label Oct 17, 2022

sophieliu15 force-pushed the metrics_playground_1 branch from a4ff021 to aaf3201 Compare October 17, 2022 19:25

samuelkarp added the area/cri Container Runtime Interface (CRI) label Oct 17, 2022

k8s-ci-robot added ok-to-test and removed needs-ok-to-test labels Oct 17, 2022

samuelkarp self-requested a review October 17, 2022 20:22

kzys reviewed Oct 17, 2022

samuelkarp added the kind/enhancement label Oct 18, 2022

samuelkarp approved these changes Oct 18, 2022

samuelkarp added this to New in Code Review via automation Oct 18, 2022

samuelkarp moved this from New to Ready For Review in Code Review Oct 18, 2022

sophieliu15 force-pushed the metrics_playground_1 branch from aaf3201 to be92283 Compare October 18, 2022 20:46

samuelkarp approved these changes Oct 18, 2022

dcantah reviewed Oct 19, 2022

Add logging volume metrics to Containerd CRI plugin

3e44498

Signed-off-by: Sophie Liu <sophieliu@google.com>

sophieliu15 force-pushed the metrics_playground_1 branch from be92283 to 3e44498 Compare October 19, 2022 14:49

dcantah approved these changes Oct 20, 2022

kzys approved these changes Oct 20, 2022

kzys merged commit 72177ca into containerd:main Oct 20, 2022

Code Review automation moved this from Ready For Review to Done Oct 20, 2022

samuelkarp added the cherry-pick/1.6.x Change to be cherry picked to release/1.6 branch label Oct 21, 2022
