Revisit how collector internal metrics are distributed across telemetry levels #7890

Open
dmitryax opened this issue Jun 14, 2023 · 6 comments
@dmitryax
Member

dmitryax commented Jun 14, 2023

There are several verbosity levels that can be used to configure how many metrics the collector exposes:

	// Level is the level of telemetry metrics, the possible values are:
	//  - "none" indicates that no telemetry data should be collected;
	//  - "basic" is the recommended and covers the basics of the service telemetry.
	//  - "normal" adds some other indicators on top of basic.
	//  - "detailed" adds dimensions and views to the previous levels.

The problem is that they are barely used. Most of the metrics are exposed at the basic level. Only one metric, in the batch processor, is exposed at the detailed level, and the normal level is not used at all. The suggestion is to revisit all the metrics and distribute them further across the levels.

The default level is basic, the lowest level that emits any metrics, so it is what most users run and it exposes almost every metric. We can move a significant portion of the metrics to the normal level and make normal the new default, so the default behavior doesn't change for end users. The basic level can then be kept to the bare minimum reserved for the collector core:

  • process metrics
  • accepted and refused data by receivers
  • sent and failed to send data by exporters
  • sending queue size and enqueue failures

Metrics from OTel components will use only the normal or detailed levels.
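
The gating rule implied by this proposal can be sketched as follows. This is only an illustration, not the collector's implementation: the `Level` type, the `metricGroup` struct, and the `Enabled` and `EnabledGroups` helpers are hypothetical names, and the group labels simply paraphrase the list above.

```go
package main

import "fmt"

// Level is a hypothetical stand-in for the collector's telemetry level.
// Higher values are more verbose.
type Level int

const (
	LevelNone Level = iota
	LevelBasic
	LevelNormal
	LevelDetailed
)

// metricGroup pairs an illustrative metric group with the lowest level
// at which the proposal would report it.
type metricGroup struct {
	name  string
	floor Level
}

// groups paraphrases the proposed split: collector core at basic,
// component-specific metrics starting at normal.
var groups = []metricGroup{
	{"process metrics", LevelBasic},
	{"receiver accepted/refused data", LevelBasic},
	{"exporter sent/failed data", LevelBasic},
	{"sending queue size and enqueue failures", LevelBasic},
	{"component-specific metrics", LevelNormal},
}

// Enabled reports whether a metric with the given floor is emitted
// when the collector runs at the configured level.
func Enabled(configured, floor Level) bool {
	return configured != LevelNone && configured >= floor
}

// EnabledGroups lists the groups emitted at the configured level.
func EnabledGroups(configured Level) []string {
	var out []string
	for _, g := range groups {
		if Enabled(configured, g.floor) {
			out = append(out, g.name)
		}
	}
	return out
}

func main() {
	fmt.Println("basic:", EnabledGroups(LevelBasic))
	fmt.Println("normal:", EnabledGroups(LevelNormal))
}
```

Under this sketch, a user who stays on the default sees the same metrics as before once the default moves to normal, while explicitly choosing basic trims the output to the core set.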

Long-term, we can consider giving users an option to override this level per component.

@dmitryax dmitryax added discussion-needed Community discussion needed and removed discussion-needed Community discussion needed labels Jun 14, 2023
@hughesjj

From collector wg/sig:

  • basic should only be for internal collector telemetry
  • normal should be the default for component authors
  • detailed should be ... well, the detailed view

I'm strongly in favor of better utilizing verbosity levels

@dmitryax dmitryax self-assigned this Jun 14, 2023
@victoramsantos

Agree. I was testing the difference between these levels and found the same thing: the detailed level has just one more metric than the others (namely otelcol_processor_batch_batch_send_size_bytes_bucket).

@dmitryax
Member Author

dmitryax commented Mar 14, 2024

According to the proposed guidelines, I think we can change the default level from Basic to Normal and move the following metric sets from Basic to Normal:

  1. GRPC/HTTP server/client metrics.
otelcol_http_client_duration histogram
otelcol_http_client_request_size counter
otelcol_http_client_response_size counter
otelcol_http_server_duration histogram
otelcol_http_server_request_size counter
otelcol_http_server_response_size counter
otelcol_rpc_server_duration histogram
otelcol_rpc_server_request_size histogram
otelcol_rpc_server_requests_per_rpc histogram
otelcol_rpc_server_response_size histogram
otelcol_rpc_server_responses_per_rpc histogram

These metrics were not emitted before the transition to OTel instrumentation and can be pretty noisy even with the telemetry.disableHighCardinalityMetrics feature gate enabled. It may even be worth moving them (or a portion of them) to the Detailed level.

  2. Batch processor metrics:
otelcol_processor_batch_batch_send_size histogram
otelcol_processor_batch_metadata_cardinality gauge
otelcol_processor_batch_timeout_trigger_send counter
otelcol_processor_batch_size_trigger_send counter

Enabling them at the Basic level isn't aligned with the suggested guidelines: "custom" component metrics (those not generated by the helpers) should be emitted starting with the Normal level.

@TylerHelmuth
Member

@dmitryax I agree that component metrics should be emitted with Normal and that Normal should be the default level. I agree with Basic being only those generic receiver/processor/exporter metrics.

For the GRPC/HTTP server/client metrics, are they broken down per component? If so, they feel like Detailed metrics to me.

Side note about Detailed metrics: I foresee users wanting to be selective about which detailed metrics are emitted. Should we provide a way to specify which metrics, of any level, are enabled or disabled, as we do for scrapers with mdatagen?

@dmitryax
Member Author

> For the GRPC/HTTP server/client metrics, are they broken down per component? If so, they feel like Detailed metrics to me.

Not every component exposes them, only receivers and exporters with HTTP/GRPC clients or servers, but they can be more granular than per component. Client metrics are per net.peer.name, so there is one set per component -> external endpoint pair. GRPC server metrics have rpc.method and rpc.service, so there is at least one per data type. If the telemetry.disableHighCardinalityMetrics feature gate isn't enabled, port and host attributes are added as well, which can cause a high-cardinality problem; see #7517.
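
To make the cardinality concern concrete, here is a rough sketch of how the worst-case number of time series grows as attributes are added. It assumes every combination of attribute values can occur, which overstates real traffic. The `SeriesUpperBound` helper, the `host` and `port` keys, and all the counts are made up for illustration; only rpc.method and rpc.service come from the comment above.

```go
package main

import "fmt"

// SeriesUpperBound returns the worst-case number of time series for one
// metric, given how many distinct values each attribute can take.
// It is simply the product of the per-attribute counts.
func SeriesUpperBound(distinct map[string]int) int {
	total := 1
	for _, n := range distinct {
		total *= n
	}
	return total
}

func main() {
	// Hypothetical counts: one service per data type, one method each.
	gated := map[string]int{"rpc.service": 3, "rpc.method": 1}

	// The same metric with extra per-connection attributes.
	// "host" and "port" are placeholder keys, not the real attribute names.
	ungated := map[string]int{
		"rpc.service": 3,
		"rpc.method":  1,
		"host":        10,
		"port":        50,
	}

	fmt.Println("with feature gate:", SeriesUpperBound(gated))
	fmt.Println("without feature gate:", SeriesUpperBound(ungated))
}
```

Because the bound is multiplicative, a single unbounded attribute such as a client port dominates everything else, which is the argument for keeping these metrics out of the default level.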

I would agree with moving the HTTP/GRPC client/server metrics to the Detailed level. @open-telemetry/collector-approvers WDYT?

> Should we provide a way to specify which specific metrics, of any level, are enabled/disabled like we do for scrapers with mdatagen?

There is nothing like that available now. Users can further reduce the set with the filter processor or with metric_relabel_configs on the prometheus receiver. But I agree it would be nice to have this capability right in service::telemetry::metrics.

@TylerHelmuth
Member

> But I agree it would be nice to have this capability right in service::telemetry::metrics.

Out of scope for this issue for sure, but I agree.

dmitryax added a commit that referenced this issue Apr 16, 2024
**Description:**

This change distributes the reported internal metrics across available
levels and updates the level set by default:

1. The default level is changed from `basic` to `normal`, which can be
overridden with `service::telemetry::metrics::level` configuration.

2. The following batch processor metrics are updated to be reported
starting from `normal` level instead of `basic` level:
  - `processor_batch_batch_send_size`
  - `processor_batch_metadata_cardinality` 
  - `processor_batch_timeout_trigger_send` 
  - `processor_batch_size_trigger_send` 

3. The following GRPC/HTTP server and client metrics are updated to be
reported starting from `detailed` level:
  - `http.client.*` metrics 
  - `http.server.*` metrics 
  - `rpc.server.*` metrics 
  - `rpc.client.*` metrics

**Link to tracking Issue:**
#7890
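
The distribution described in this commit message can be summarized as a lookup. The following is a minimal sketch, not the code from the commit: the `Level` type, `MinLevelFor`, and `ReportedByDefault` are hypothetical names, and falling back to `basic` for every metric not listed is an assumption made here for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Level is a hypothetical stand-in for the collector's telemetry level.
type Level int

const (
	LevelNone Level = iota
	LevelBasic
	LevelNormal
	LevelDetailed
)

// DefaultLevel reflects the new default named in the commit message.
const DefaultLevel = LevelNormal

// detailedPrefixes are the metric families moved to the detailed level.
var detailedPrefixes = []string{
	"http.client.", "http.server.", "rpc.server.", "rpc.client.",
}

// normalMetrics are the batch processor metrics moved to the normal level.
var normalMetrics = map[string]bool{
	"processor_batch_batch_send_size":      true,
	"processor_batch_metadata_cardinality": true,
	"processor_batch_timeout_trigger_send": true,
	"processor_batch_size_trigger_send":    true,
}

// MinLevelFor returns the lowest level at which a metric is reported.
// Treating every unlisted metric as basic is an assumption of this sketch.
func MinLevelFor(name string) Level {
	for _, p := range detailedPrefixes {
		if strings.HasPrefix(name, p) {
			return LevelDetailed
		}
	}
	if normalMetrics[name] {
		return LevelNormal
	}
	return LevelBasic
}

// ReportedByDefault reports whether a metric appears at DefaultLevel.
func ReportedByDefault(name string) bool {
	return DefaultLevel >= MinLevelFor(name)
}

func main() {
	// "http.client.duration" is an illustrative name matching http.client.*.
	for _, name := range []string{
		"http.client.duration",
		"processor_batch_batch_send_size",
	} {
		fmt.Printf("%s reported by default: %t\n", name, ReportedByDefault(name))
	}
}
```

With the default at `normal`, the batch processor metrics are still reported out of the box, while the HTTP and RPC families appear only when the level is set to `detailed`.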