Add logging related metrics to Containerd CRI plugin #7546
Hi @sophieliu15. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
a4ff021 to aaf3201
/ok-to-test
aaf3201 to be92283
Signed-off-by: Sophie Liu <sophieliu@google.com>
be92283 to 3e44498
/retest
The change seems safe enough to have in 1.6. Regarding 1.5, I'd like to keep the version open for only bug/security fixes.
1.5 seems in maintenance mode which in my mind drifts towards just security (and bugs). 1.6 seems perfectly fine though
Oh nice @kzys beat me to it 🤣
1.6 cherry-pick: #7571
A question about the metrics labels: I wonder if |
This PR adds logging-related metrics to the Containerd CRI plugin. Metrics are per Containerd instance. They can be used to estimate the logging completeness of Containerd and, when combined with logging agent metrics, the logging completeness of the whole logging pipeline. See design details below.
Context
The graph above shows a common architecture of a logging pipeline. Customer workloads write their logs to stdout/stderr. Containerd reads those logs and writes them to log files on disk. Logging agents read the log files from disk and export the logs to various sinks (e.g., stdout, Cloud Logging). Meanwhile, a log rotator rotates the files on disk to prevent logs from filling up the disk.
In the logging pipeline above, logs could be lost or duplicated at any stage. For example, Containerd might fail to read a log entry or fail to write it to disk, which would cause log loss. Another example we encountered in the past is that logging agents could duplicate logs unexpectedly during a node restart.
Motivation
Adding logging-related instrumentation to Containerd helps us achieve the following two goals:
Currently we have zero visibility into these two areas. See the section below for details about how we use the proposed Containerd metrics to achieve the goals above.
Detailed Design
Logging Related Metrics
We will calculate the following metrics in the Containerd CRI plugin. Metrics are per Containerd instance.
1. containerd_cri_input_entries_total: Number of log entries Containerd receives
2. containerd_cri_input_bytes_total: Size of logs Containerd receives
3. containerd_cri_output_entries_total: Number of log entries Containerd successfully writes to disk
4. containerd_cri_output_bytes_total: Size of logs Containerd successfully writes to disk
5. containerd_cri_split_entries_total: Number of extra log entries created by splitting the original log entry. This happens when the original log entry exceeds the length limit. This metric does not count the original log entry.
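To make the counting rules above concrete, here is a minimal Go sketch. It is not the code from this PR: the `logMetrics` type, the `record` method, and the `maxLen` and `write` parameters are hypothetical names chosen for illustration. The sketch also assumes output bytes equal the raw chunk size, whereas the real CRI plugin adds metadata to each entry.

```go
package main

import "fmt"

// logMetrics is a hypothetical stand-in for the five counters described above.
// Plain fields keep the sketch short; it is not safe for concurrent use.
type logMetrics struct {
	inputEntries, inputBytes   int64
	outputEntries, outputBytes int64
	splitEntries               int64
}

// record counts one incoming entry, splits it into chunks of at most maxLen
// bytes (maxLen <= 0 means no limit), and passes each chunk to write.
// Every chunk after the first counts as a split entry; the original does not.
func (m *logMetrics) record(entry []byte, maxLen int, write func([]byte) error) {
	m.inputEntries++
	m.inputBytes += int64(len(entry))
	for first := true; ; first = false {
		chunk := entry
		if maxLen > 0 && len(chunk) > maxLen {
			chunk = entry[:maxLen]
		}
		entry = entry[len(chunk):]
		if !first {
			m.splitEntries++
		}
		// Only successful writes are counted as output.
		if err := write(chunk); err == nil {
			m.outputEntries++
			m.outputBytes += int64(len(chunk))
		}
		if len(entry) == 0 {
			break
		}
	}
}

func main() {
	m := &logMetrics{}
	discard := func([]byte) error { return nil } // pretend every write succeeds
	m.record([]byte("0123456789"), 4, discard)   // split into 4 + 4 + 2 bytes
	m.record([]byte("abc"), 4, discard)          // fits, no split
	fmt.Printf("input=%d split=%d output=%d\n", m.inputEntries, m.splitEntries, m.outputEntries)
}
```

In this run, the 10-byte entry produces two extra chunks, so the split counter is 2 while the input counter is 2 and the output counter is 4.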
Usage of metrics
This section discusses some use cases of the logging metrics. Bracketed numbers refer to the position of each metric in the list above.
Estimate logging completeness of Containerd
This can be achieved by comparing input_entries [1], output_entries [3] and split_entries [5]. In the case of no log processing errors, input_entries + split_entries = output_entries.
input_bytes [2] and output_bytes [4] can also be used for the estimation. However, it won’t be accurate because the Containerd CRI plugin adds additional metadata to log entries.
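The entry-count comparison can be sketched in Go as follows. This is an illustrative helper, not part of the PR: the `entryCounters` type and both function names are assumptions, and the values are assumed to have been sampled from the three entry counters at the same moment.

```go
package main

import "fmt"

// entryCounters holds one sample of the three entry counters.
// The type and field names are hypothetical.
type entryCounters struct {
	Input, Output, Split uint64
}

// missingEntries returns (input + split) - output. Zero means the identity
// holds; a positive value suggests entries were lost inside Containerd.
func missingEntries(c entryCounters) int64 {
	return int64(c.Input+c.Split) - int64(c.Output)
}

// completeness returns output / (input + split), or 1 if nothing was expected.
func completeness(c entryCounters) float64 {
	expected := c.Input + c.Split
	if expected == 0 {
		return 1
	}
	return float64(c.Output) / float64(expected)
}

func main() {
	healthy := entryCounters{Input: 100, Output: 110, Split: 10}
	lossy := entryCounters{Input: 100, Output: 99, Split: 10}
	fmt.Println(missingEntries(healthy), completeness(healthy))
	fmt.Println(missingEntries(lossy), completeness(lossy))
}
```

Because the counters are sampled rather than read atomically together, a small transient difference does not necessarily indicate loss; comparing over a longer window is likely more robust.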
Estimate logging completeness on disk
This estimates logging loss/duplication after logs leave Containerd but before they are registered by logging agents. The estimation can be achieved by comparing Containerd output logging metrics ([3] and [4]) with logging agent input metrics.
Estimate logging volume on the node
This can be achieved via output_bytes_total [4].
Estimate logging completeness of the entire pipeline
This can be achieved by comparing Containerd input logging metrics with logging agent output metrics.
Example metrics
The following metrics were collected by deploying a custom Containerd binary to a GKE node.