Add logging related metrics to Containerd CRI plugin #7546

Merged
merged 1 commit into from Oct 20, 2022

Conversation

sophieliu15
Contributor

@sophieliu15 sophieliu15 commented Oct 17, 2022

This PR adds logging-related metrics to the Containerd CRI plugin. The metrics are reported per Containerd instance. They can be used to estimate the logging completeness of Containerd itself and, when combined with logging agent metrics, of the logging pipeline as a whole. See the design details below.

Context

graph LR
A[Customer Workload] -- Output --> B((stdout/stderr))
B -- Retrieve --> C[Containerd]
C -- Output --> D((disk))
D -- Retrieve --> E[Logging Agent]
E -- Output --> F[Sink]
G[Log Rotator] -- Rotate --> D

The graph above shows a common logging pipeline architecture. Customer workloads write their logs to stdout/stderr. Containerd reads those logs and writes them to log files on disk. Logging agents read the log files from disk and export the logs to various sinks (e.g., stdout, Cloud Logging). Meanwhile, a log rotator rotates the files on disk to keep logs from filling the disk.

In the logging pipeline above, logs can be lost or duplicated at any stage. For example, Containerd might fail to read a log entry or fail to write it to disk, which would cause log loss. Another example we encountered in the past is logging agents unexpectedly duplicating logs during a node restart.

Motivation

Adding logging-related instrumentation to Containerd helps us achieve the following two goals:

  • Estimate logging completeness of Containerd
  • Estimate logging completeness on disk before logs are registered by the logging agent

Currently we have zero visibility into these two areas. See the section below for details about how we use the proposed Containerd metrics to achieve the goals above.

Detailed Design

Logging Related Metrics

We will calculate the following metrics in the Containerd CRI plugin. Metrics are per Containerd instance.

  1. containerd_cri_input_entries_total: Number of log entries Containerd receives

  2. containerd_cri_input_bytes_total: Size of logs Containerd receives

  3. containerd_cri_output_entries_total: Number of log entries Containerd successfully writes to disks

  4. containerd_cri_output_bytes_total: Size of logs Containerd successfully writes to disks

  5. containerd_cri_split_entries_total: Number of extra log entries created by splitting the original log entry. This happens when the original log entry exceeds the length limit. This metric does not count the original log entry.
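To make the meaning of metric 5 concrete, here is a minimal Go sketch of how many extra entries one oversized entry would contribute. It assumes simple fixed-size chunking at the length limit; `extraSplitEntries` and the 16384-byte limit are illustrative names and values, not taken from this PR's code.

```go
package main

import "fmt"

// extraSplitEntries returns how many additional entries result from cutting
// one entry of entryLen bytes into chunks of at most maxLen bytes.
// Assumption: plain fixed-size chunking; maxLen <= 0 is treated as "no limit".
// The original entry itself is not counted, matching the metric description.
func extraSplitEntries(entryLen, maxLen int) int {
	if maxLen <= 0 || entryLen <= maxLen {
		return 0
	}
	chunks := (entryLen + maxLen - 1) / maxLen // ceiling division
	return chunks - 1
}

func main() {
	const limit = 16384 // illustrative length limit in bytes

	fmt.Println(extraSplitEntries(100, limit))   // fits within the limit
	fmt.Println(extraSplitEntries(40000, limit)) // cut into three chunks
}
```

Under these assumptions, a 40000-byte entry becomes three chunks, so it adds 2 to split_entries and 1 to input_entries.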

Usage of metrics

This section discusses some use cases for the logging metrics. Numbers in brackets refer to the numbered metrics listed above.

Estimate logging completeness of Containerd

This can be achieved by comparing input_entries [1], output_entries [3] and split_entries [5]. In the case of no log processing errors, input_entries + split_entries = output_entries.

input_bytes [2] and output_bytes [4] can also be used for the estimation. However, it won’t be accurate because the Containerd CRI plugin adds additional metadata to log entries.
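The entry-count check above can be sketched in Go as follows. This is only an illustration of the arithmetic: `unaccountedEntries` is a hypothetical helper, and the three arguments are assumed to be counter values read at roughly the same moment.

```go
package main

import "fmt"

// unaccountedEntries returns input + split - output.
// With no log processing errors the result is expected to be 0.
// A positive result suggests entries that were received but not written;
// a negative result suggests more entries were written than expected.
func unaccountedEntries(input, split, output int64) int64 {
	return input + split - output
}

func main() {
	// Values taken from the example metrics in this PR description.
	fmt.Println(unaccountedEntries(26, 0, 26)) // prints 0
}
```

Because the counters are scraped separately, a small non-zero value may only reflect entries in flight between two reads, so comparing several samples over time is likely more reliable than a single snapshot.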

Estimate logging completeness on disk

This estimates logging loss/duplication after logs leave Containerd but before they are registered by logging agents. The estimation can be achieved by comparing Containerd output logging metrics ([3] and [4]) with logging agent input metrics.
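A rough Go sketch of that comparison is shown below. The logging agent's input counter is an assumption here: this PR does not define any agent metrics, so `agentInput` stands in for whatever entry counter a particular agent exposes, and `classify` is a hypothetical helper.

```go
package main

import "fmt"

// classify compares the number of entries Containerd wrote to disk with the
// number of entries a logging agent reports having read.
// agentInput is a hypothetical value; its source depends on the agent in use.
func classify(containerdOutput, agentInput uint64) string {
	switch {
	case agentInput < containerdOutput:
		return "possible loss"
	case agentInput > containerdOutput:
		return "possible duplication"
	default:
		return "complete"
	}
}

func main() {
	fmt.Println(classify(26, 26))
	fmt.Println(classify(26, 20))
	fmt.Println(classify(26, 31))
}
```

The agent may lag behind Containerd at any given moment, so a shortfall is only meaningful if it persists after the agent has had time to catch up.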

Estimate logging volume on the node

This can be achieved via output_bytes_total [4].

Estimate logging completeness of the entire pipeline

This can be achieved by comparing Containerd input logging metrics with logging agent output metrics.

Example metrics

The following metrics were collected by deploying a custom Containerd binary to a GKE node.

$ curl http://127.0.0.1:1338/v1/metrics | egrep "input_entries"
# HELP containerd_cri_input_entries_total Number of log entries received
# TYPE containerd_cri_input_entries_total counter
containerd_cri_input_entries_total 26
 
$ curl http://127.0.0.1:1338/v1/metrics | egrep "output_entries"
# HELP containerd_cri_output_entries_total Number of log entries successfully written to disks
# TYPE containerd_cri_output_entries_total counter
containerd_cri_output_entries_total 26
 
$ curl http://127.0.0.1:1338/v1/metrics | egrep "input_bytes"
# HELP containerd_cri_input_bytes_total Size of logs received
# TYPE containerd_cri_input_bytes_total counter
containerd_cri_input_bytes_total 2933
 
$ curl http://127.0.0.1:1338/v1/metrics | egrep "output_bytes"
# HELP containerd_cri_output_bytes_total Size of logs successfully written to disks
# TYPE containerd_cri_output_bytes_total counter
containerd_cri_output_bytes_total 4141
 
 
$ curl http://127.0.0.1:1338/v1/metrics | egrep "split_entries"
# HELP containerd_cri_split_entries_total Number of extra log entries created by splitting the original log entry. This happens when the original log entry exceeds length limit. This metric does not count the original log entry.
# TYPE containerd_cri_split_entries_total counter
containerd_cri_split_entries_total 0
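For anyone who wants to run the completeness check on output like the above, here is a minimal Go sketch that parses the text and evaluates input_entries + split_entries - output_entries. It is a deliberately naive parser (`parseCounters` is a hypothetical helper) that assumes unlabeled samples of the form `name value`, as in the example output; a real consumer would more likely use a Prometheus client library.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseCounters extracts "name value" samples from metrics text,
// skipping blank lines and "#" comment lines (HELP/TYPE).
// Assumption: samples carry no labels, as in the example output.
func parseCounters(text string) map[string]float64 {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue // ignore lines whose value is not numeric
		}
		out[fields[0]] = v
	}
	return out
}

// sample reuses the counter values shown in the example output.
const sample = `# TYPE containerd_cri_input_entries_total counter
containerd_cri_input_entries_total 26
# TYPE containerd_cri_output_entries_total counter
containerd_cri_output_entries_total 26
# TYPE containerd_cri_split_entries_total counter
containerd_cri_split_entries_total 0
`

func main() {
	m := parseCounters(sample)
	diff := m["containerd_cri_input_entries_total"] +
		m["containerd_cri_split_entries_total"] -
		m["containerd_cri_output_entries_total"]
	fmt.Println("unaccounted entries:", diff)
}
```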

@k8s-ci-robot

Hi @sophieliu15. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@samuelkarp samuelkarp added the area/cri Container Runtime Interface (CRI) label Oct 17, 2022
@samuelkarp
Member

/ok-to-test

pkg/cri/io/logger.go Outdated Show resolved Hide resolved
pkg/cri/io/metrics.go Outdated Show resolved Hide resolved
pkg/cri/io/metrics.go Outdated Show resolved Hide resolved
@samuelkarp samuelkarp added this to New in Code Review via automation Oct 18, 2022
@samuelkarp samuelkarp moved this from New to Ready For Review in Code Review Oct 18, 2022
Signed-off-by: Sophie Liu <sophieliu@google.com>
@samuelkarp
Member

/retest

@kzys kzys merged commit 72177ca into containerd:main Oct 20, 2022
Code Review automation moved this from Ready For Review to Done Oct 20, 2022
@samuelkarp samuelkarp added the cherry-pick/1.6.x Change to be cherry picked to release/1.6 branch label Oct 21, 2022
@samuelkarp
Member

With 1.6 as an LTS, I think we should cherry-pick this commit over. @kzys and @dcantah, do you have thoughts on whether we should also cherry-pick this into 1.5?

@kzys
Member

kzys commented Oct 21, 2022

The change seems safe enough to have in 1.6. Regarding 1.5, I'd like to keep the version open for only bug/security fixes.

@dcantah
Member

dcantah commented Oct 21, 2022

1.5 seems in maintenance mode which in my mind drifts towards just security (and bugs).

1.6 seems perfectly fine though

@dcantah
Member

dcantah commented Oct 21, 2022

Oh nice @kzys beat me to it 🤣

@samuelkarp
Member

1.6 cherry-pick: #7571

@JeffLuoo

A question about the metric labels: is namespace a label on these metrics? I only want containerd_cri_input_bytes_total for a specific namespace.

Labels
area/cri Container Runtime Interface (CRI) cherry-pick/1.6.x Change to be cherry picked to release/1.6 branch kind/enhancement ok-to-test
6 participants