Unify internal observability documentation - 2 of 3 #4322

Merged on May 16, 2024 (38 commits)
Commits
7616b21
Add TODOs and copy content from Collector repo
tiffany76 Apr 18, 2024
2178c84
Add table of internal metrics
tiffany76 Apr 30, 2024
f5b4090
Add relative link and make prettier fixes
tiffany76 Apr 30, 2024
8a5b940
Add word to cSpell ignore
tiffany76 Apr 30, 2024
3fd2a7b
Reorder cSpell ignore
tiffany76 Apr 30, 2024
08063ac
Copy edit verbosity additions to enabling section
tiffany76 Apr 30, 2024
bbffadf
Add intro paragraph
tiffany76 Apr 30, 2024
90349fc
Add log events section
tiffany76 May 2, 2024
8183913
Add types of metrics section
tiffany76 May 2, 2024
a04e050
Modify list of internal metrics
tiffany76 May 2, 2024
511e381
Make more edits to section intro and table
tiffany76 May 2, 2024
9454293
Edit page intro
tiffany76 May 2, 2024
937feca
Update word choice based on review suggestion
tiffany76 May 3, 2024
cc88a54
Make prettier fix
tiffany76 May 3, 2024
c301cde
Add metrics-by-level in tabbed panes
tiffany76 May 3, 2024
6dca24d
Add line breaks to long metric names
tiffany76 May 3, 2024
92753a5
Add comment for component telemetry
tiffany76 May 3, 2024
5afa418
Edit content about default level for metrics
tiffany76 May 6, 2024
5797aa3
Apply suggestions from Patrice's review
tiffany76 May 7, 2024
af6b8d3
Fix tabbed panes section
tiffany76 May 7, 2024
c616feb
Edit metrics table
tiffany76 May 7, 2024
decb027
Fix self-monitoring alert
tiffany76 May 8, 2024
82ebc25
Add comment on how to compile list of metrics
tiffany76 May 8, 2024
b2ed01f
Remove word from self-monitoring caution alert
tiffany76 May 8, 2024
f7b8b23
Remove metric tabbed panes and refer to table
tiffany76 May 8, 2024
738b8df
Split metrics table into three tables, by level
tiffany76 May 8, 2024
0cd4039
Change self-monitoring wording again
tiffany76 May 8, 2024
3ca456f
Add cspell ignore word
tiffany76 May 8, 2024
2319388
Fix prettier issue
tiffany76 May 8, 2024
e2290b0
Apply suggestions from Patrice's second review
tiffany76 May 9, 2024
8b1e5f9
Make prettier fixes
tiffany76 May 9, 2024
aca64b0
Change yaml notation
tiffany76 May 9, 2024
0920384
Merge branch 'main' into internal-obs-2
tiffany76 May 13, 2024
f2588c3
Update content/en/docs/collector/internal-telemetry.md
tiffany76 May 13, 2024
a04a6e7
Merge branch 'main' into internal-obs-2
tiffany76 May 14, 2024
ee87db1
Apply suggestions from Pablo's review
tiffany76 May 14, 2024
c11de76
Merge branch 'main' into internal-obs-2
tiffany76 May 15, 2024
741eb80
Merge branch 'main' into internal-obs-2
svrnm May 16, 2024
180 changes: 164 additions & 16 deletions content/en/docs/collector/internal-telemetry.md
@@ -1,12 +1,14 @@
---
title: Internal telemetry
weight: 25
cSpell:ignore: journalctl kube otecol pprof tracez zpages
# prettier-ignore
cSpell:ignore: alloc journalctl kube otecol pprof tracez underperforming zpages
---

You can monitor the health of any OpenTelemetry Collector instance by checking
its own internal telemetry. Read on to learn how to configure this telemetry to
help you [troubleshoot](/docs/collector/troubleshooting/) Collector issues.
its own internal telemetry. Read on to learn about this telemetry and how to
configure it to help you [troubleshoot](/docs/collector/troubleshooting/)
Collector issues.

## Activate internal telemetry in the Collector

@@ -25,7 +27,7 @@ endpoint to one specific or all network interfaces when needed. For
containerized environments, you might want to expose this port on a public
interface.

Set the address in the config `service::telemetry::metrics`:
Set the address in the config `service.telemetry.metrics`:

```yaml
service:
@@ -34,23 +36,26 @@ service:
address: '0.0.0.0:8888'
```

You can enhance the metrics telemetry level using the `level` field. The
following is a list of all possible values and their explanations.
You can adjust the verbosity of the Collector metrics output by setting the
`level` field to one of the following values:

- `none` indicates that no telemetry data should be collected.
- `basic` is the recommended value and covers the basics of the service
telemetry.
- `normal` adds other indicators on top of basic.
- `detailed` adds dimensions and views to the previous levels.
- `none`: no telemetry is collected.
- `basic`: essential service telemetry.
- `normal`: the default level, adds standard indicators on top of basic.
- `detailed`: the most verbose level, includes dimensions and views.

For example:
Each verbosity level represents a threshold at which certain metrics are
emitted. For the complete list of metrics, with a breakdown by level, see
[Lists of internal metrics](#lists-of-internal-metrics).

The default level for metrics output is `normal`. To use another level, set
`service.telemetry.metrics.level`:

```yaml
service:
telemetry:
metrics:
level: detailed
address: ':8888'
```

The Collector can also be configured to scrape its own metrics and send them
@@ -80,15 +85,18 @@ service:

{{% alert title="Caution" color="warning" %}}

Self-monitoring is a risky practice. If an issue arises, the source of the
problem is unclear and the telemetry is unreliable.
When self-monitoring, the Collector collects its own telemetry and sends it to
the desired backend for analysis. This can be a risky practice. If the Collector
is underperforming, its self-monitoring capability could be impacted. As a
result, the self-monitored telemetry might not reach the backend in time for
critical analysis.

{{% /alert %}}
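
If you do configure the Collector to scrape its own metrics, a minimal sketch
of such a pipeline might look like the following. It assumes a build that
includes the `prometheus` receiver and an OTLP exporter, and `my-backend:4317`
is a placeholder for your backend's endpoint:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otelcol
          scrape_interval: 10s
          static_configs:
            # Internal telemetry metrics endpoint configured above
            - targets: ['0.0.0.0:8888']

exporters:
  otlp:
    # Placeholder endpoint; point this at your own backend
    endpoint: my-backend:4317

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```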

### Configure internal logs

You can find log output in `stderr`. The verbosity level for logs defaults to
`INFO`, but you can adjust it in the config `service::telemetry::logs`:
`INFO`, but you can adjust it in the config `service.telemetry.logs`:

```yaml
service:
@@ -113,3 +121,143 @@ journalctl | grep otelcol | grep Error
```

{{% /tab %}} {{< /tabpane >}}
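
For reference, adjusting the log verbosity mentioned above might look like the
following minimal sketch, assuming the standard zap level names such as
`debug`:

```yaml
service:
  telemetry:
    logs:
      # Defaults to info; debug produces more verbose output
      level: debug
```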

## Types of internal observability

The OpenTelemetry Collector aims to be a model of an observable service by clearly
exposing its own operational metrics. Additionally, it collects host resource
metrics that can help you understand if problems are caused by a different
process on the same host. Specific components of the Collector can also emit
their own custom telemetry. In this section, you will learn about the different
types of observability emitted by the Collector itself.

### Values observable with internal metrics

The Collector emits internal metrics for the following **current values**:

- Resource consumption, including CPU, memory, and I/O.
- Data reception rate, broken down by receiver.
- Data export rate, broken down by exporters.
- Data drop rate due to throttling, broken down by data type.
- Data drop rate due to invalid data received, broken down by data type.
- Throttling state, including Not Throttled, Throttled by Downstream, and
Internally Saturated.
- Incoming connection count, broken down by receiver.
- Incoming connection rate showing new connections per second, broken down by
receiver.
- In-memory queue size in bytes and in units.
- Persistent queue size.
- End-to-end latency from receiver input to exporter output.
- Latency broken down by pipeline elements, including exporter network roundtrip
latency for request/response protocols.

Rate values are averages over 10-second periods, measured in bytes/sec or
units/sec (for example, spans/sec).

{{% alert title="Caution" color="warning" %}}

Byte measurements can be expensive to compute.

{{% /alert %}}

The Collector also emits internal metrics for these **cumulative values**:

- Total received data, broken down by receivers.
- Total exported data, broken down by exporters.
- Total dropped data due to throttling, broken down by data type.
- Total dropped data due to invalid data received, broken down by data type.
- Total incoming connection count, broken down by receiver.
- Uptime since start.

### Lists of internal metrics

The following tables group each internal metric by level of verbosity: `basic`,
`normal`, and `detailed`. Each metric is identified by name and description and
categorized by instrumentation type.

<!---To compile this list, configure a Collector instance to emit its own metrics to the localhost:8888/metrics endpoint. Select a metric and grep for it in the Collector core repository. For example, the `otelcol_process_memory_rss` metric can be found using: `grep -Hrn "memory_rss" .` Make sure to eliminate from your search string any words that might be prefixes. Look through the results until you find the .go file that contains the list of metrics. In the case of `otelcol_process_memory_rss`, it and other process metrics can be found in https://github.com/open-telemetry/opentelemetry-collector/blob/31528ce81d44e9265e1a3bbbd27dc86d09ba1354/service/internal/proctelemetry/process_telemetry.go#L92. Note that the Collector's internal metrics are defined in several different files in the repository.--->

#### `basic`-level metrics

| Metric name | Description | Type |
| ------------------------------------------------------ | --------------------------------------------------------------------------------------- | --------- |
| `otelcol_exporter_enqueue_failed_`<br>`log_records`    | Number of log records that exporter(s) failed to enqueue.                                | Counter   |
| `otelcol_exporter_enqueue_failed_`<br>`metric_points` | Number of metric points that exporter(s) failed to enqueue. | Counter |
| `otelcol_exporter_enqueue_failed_`<br>`spans` | Number of spans that exporter(s) failed to enqueue. | Counter |
| `otelcol_exporter_queue_capacity` | Fixed capacity of the retry queue, in batches. | Gauge |
| `otelcol_exporter_queue_size` | Current size of the retry queue, in batches. | Gauge |
| `otelcol_exporter_send_failed_`<br>`log_records` | Number of logs that exporter(s) failed to send to destination. | Counter |
| `otelcol_exporter_send_failed_`<br>`metric_points` | Number of metric points that exporter(s) failed to send to destination. | Counter |
| `otelcol_exporter_send_failed_`<br>`spans` | Number of spans that exporter(s) failed to send to destination. | Counter |
| `otelcol_exporter_sent_log_records` | Number of logs successfully sent to destination. | Counter |
| `otelcol_exporter_sent_metric_points` | Number of metric points successfully sent to destination. | Counter |
| `otelcol_exporter_sent_spans` | Number of spans successfully sent to destination. | Counter |
| `otelcol_process_cpu_seconds` | Total CPU user and system time in seconds. | Counter |
| `otelcol_process_memory_rss` | Total physical memory (resident set size). | Gauge |
| `otelcol_process_runtime_heap_`<br>`alloc_bytes` | Bytes of allocated heap objects (see 'go doc runtime.MemStats.HeapAlloc'). | Gauge |
| `otelcol_process_runtime_total_`<br>`alloc_bytes` | Cumulative bytes allocated for heap objects (see 'go doc runtime.MemStats.TotalAlloc'). | Counter |
| `otelcol_process_runtime_total_`<br>`sys_memory_bytes` | Total bytes of memory obtained from the OS (see 'go doc runtime.MemStats.Sys'). | Gauge |
| `otelcol_process_uptime` | Uptime of the process. | Counter |
| `otelcol_processor_accepted_`<br>`log_records` | Number of logs successfully pushed into the next component in the pipeline. | Counter |
| `otelcol_processor_accepted_`<br>`metric_points` | Number of metric points successfully pushed into the next component in the pipeline. | Counter |
| `otelcol_processor_accepted_spans` | Number of spans successfully pushed into the next component in the pipeline. | Counter |
| `otelcol_processor_batch_batch_`<br>`send_size_bytes` | Number of bytes in the batch that was sent. | Histogram |
| `otelcol_processor_dropped_`<br>`log_records` | Number of logs dropped by the processor. | Counter |
| `otelcol_processor_dropped_`<br>`metric_points` | Number of metric points dropped by the processor. | Counter |
| `otelcol_processor_dropped_spans` | Number of spans dropped by the processor. | Counter |
| `otelcol_receiver_accepted_`<br>`log_records` | Number of logs successfully ingested and pushed into the pipeline. | Counter |
| `otelcol_receiver_accepted_`<br>`metric_points` | Number of metric points successfully ingested and pushed into the pipeline. | Counter |
| `otelcol_receiver_accepted_spans` | Number of spans successfully ingested and pushed into the pipeline. | Counter |
| `otelcol_receiver_refused_`<br>`log_records` | Number of logs that could not be pushed into the pipeline. | Counter |
| `otelcol_receiver_refused_`<br>`metric_points` | Number of metric points that could not be pushed into the pipeline. | Counter |
| `otelcol_receiver_refused_spans` | Number of spans that could not be pushed into the pipeline. | Counter |
| `otelcol_scraper_errored_`<br>`metric_points` | Number of metric points the Collector failed to scrape. | Counter |
| `otelcol_scraper_scraped_`<br>`metric_points` | Number of metric points scraped by the Collector. | Counter |

#### Additional `normal`-level metrics

| Metric name | Description | Type |
| ------------------------------------------------------- | --------------------------------------------------------------- | --------- |
| `otelcol_processor_batch_batch_`<br>`send_size` | Number of units in the batch. | Histogram |
| `otelcol_processor_batch_batch_`<br>`size_trigger_send` | Number of times the batch was sent due to a size trigger. | Counter |
| `otelcol_processor_batch_metadata_`<br>`cardinality` | Number of distinct metadata value combinations being processed. | Counter |
| `otelcol_processor_batch_timeout_`<br>`trigger_send` | Number of times the batch was sent due to a timeout trigger. | Counter |

#### Additional `detailed`-level metrics

| Metric name | Description | Type |
| --------------------------------- | ----------------------------------------------------------------------------------------- | --------- |
| `http_client_active_requests` | Number of active HTTP client requests. | Counter |
| `http_client_connection_duration` | Measures the duration of the successfully established outbound HTTP connections. | Histogram |
| `http_client_open_connections` | Number of outbound HTTP connections that are active or idle on the client. | Counter |
| `http_client_request_body_size` | Measures the size of HTTP client request bodies. | Histogram |
| `http_client_request_duration` | Measures the duration of HTTP client requests. | Histogram |
| `http_client_response_body_size` | Measures the size of HTTP client response bodies. | Histogram |
| `http_server_active_requests` | Number of active HTTP server requests. | Counter |
| `http_server_request_body_size` | Measures the size of HTTP server request bodies. | Histogram |
| `http_server_request_duration` | Measures the duration of HTTP server requests. | Histogram |
| `http_server_response_body_size` | Measures the size of HTTP server response bodies. | Histogram |
| `rpc_client_duration` | Measures the duration of outbound RPC. | Histogram |
| `rpc_client_request_size` | Measures the size of RPC request messages (uncompressed). | Histogram |
| `rpc_client_requests_per_rpc` | Measures the number of messages received per RPC. Should be 1 for all non-streaming RPCs. | Histogram |
| `rpc_client_response_size` | Measures the size of RPC response messages (uncompressed). | Histogram |
| `rpc_client_responses_per_rpc` | Measures the number of messages sent per RPC. Should be 1 for all non-streaming RPCs. | Histogram |
| `rpc_server_duration` | Measures the duration of inbound RPC. | Histogram |
| `rpc_server_request_size` | Measures the size of RPC request messages (uncompressed). | Histogram |
| `rpc_server_requests_per_rpc` | Measures the number of messages received per RPC. Should be 1 for all non-streaming RPCs. | Histogram |
| `rpc_server_response_size` | Measures the size of RPC response messages (uncompressed). | Histogram |
| `rpc_server_responses_per_rpc` | Measures the number of messages sent per RPC. Should be 1 for all non-streaming RPCs. | Histogram |

### Events observable with internal logs

The Collector logs the following internal events:

- A Collector instance starts or stops.
- Data dropping begins due to throttling for a specified reason, such as local
  saturation, downstream saturation, or downstream unavailability.
- Data dropping due to throttling stops.
- Data dropping begins due to invalid data. A sample of the invalid data is
included.
- Data dropping due to invalid data stops.
- A crash is detected, differentiated from a clean stop. Crash data is included
if available.