Add 'reason' attribute to otelcol_exporter_send_failed_* metrics #10157

Open
0x006EA1E5 opened this issue May 15, 2024 · 1 comment · May be fixed by #10158
Labels
collector-telemetry healthchecker and other telemetry collection issues

Comments

@0x006EA1E5

Is your feature request related to a problem? Please describe.
I am interested in monitoring data loss which occurs when exporting data from one instance of the Collector to another, specifically using the loadbalancingexporter.

At the moment I only see a coarse-grained metric that counts export failures but gives me no data on the cause. Was it a permanent or a retryable error? Was it a badly configured endpoint, or did the downstream receiver actively reject the data?

I can look into the logs for details on specific failures, but this is tedious and harder to interpret than a metric.

Describe the solution you'd like

I propose that we add a reason dimension to the otelcol_exporter_send_failed_* metrics. This reason could be the gRPC status of the response (I understand that gRPC status is used as the internal representation of these kinds of problems).

Describe alternatives you've considered

It is possible to try to correlate export failure metrics with downstream receiver error metrics. We can also try to correlate with "known failure causes", such as memorylimiterprocessor errors, which could mean the upstream export failed.

We can also check the logs, and even, depending on the system, extract metrics from these logs.

However, all of this is much harder work.

Additional context

We could also consider adding a similar attribute to the otelcol_receiver_refused_spans metric.

I have had a look at the code, and it seems like a fairly small change in or around exporter/exporterhelper/obsexporter.go.

@0x006EA1E5
Author

/label area:exporter exporter/otlp exporter/otlphttp receiver/otlp area:receiver

@TylerHelmuth TylerHelmuth added the collector-telemetry healthchecker and other telemetry collection issues label May 16, 2024
@mx-psi mx-psi added this to the Self observability milestone May 24, 2024
3 participants