Move partialsuccess code to internal package #3146

MrAlias · 2022-09-06T19:57:14Z

Addresses this unresolved comment originally posted by @MrAlias in #3106 (comment)

codecov · 2022-09-06T20:00:53Z

Codecov Report

Merging #3146 (bcc4302) into main (569f743) will increase coverage by 0.0%.
The diff coverage is 100.0%.

Additional details and impacted files

@@          Coverage Diff          @@
##            main   #3146   +/-   ##
=====================================
  Coverage   76.4%   76.5%           
=====================================
  Files        180     180           
  Lines      12014   12014           
=====================================
+ Hits        9190    9192    +2     
+ Misses      2583    2581    -2     
  Partials     241     241

Impacted Files	Coverage Δ
exporters/otlp/internal/partialsuccess.go	`100.0% <ø> (ø)`
exporters/otlp/otlptrace/otlptracegrpc/client.go	`92.5% <100.0%> (ø)`
exporters/otlp/otlptrace/otlptracehttp/client.go	`76.2% <100.0%> (ø)`
sdk/trace/batch_span_processor.go	`81.9% <0.0%> (+0.8%)`	⬆️

jmacd · 2022-09-07T14:33:16Z

I'll say that I didn't understand the reason to make the error private in the first place. @MadVikingGod had asked for a way to access the structure of the error, which is lost when it's internal.
As a user, how will I know the difference between a partial-success error and another kind of error? I'll have to parse the error string which is what @MadVikingGod wanted to avoid, IIUC.

jmacd · 2022-09-07T14:35:00Z

@MrAlias why not expose the new error? I can't understand what you're trying to avoid making public. There are good reasons for users to want to count the number of dropped metric points, which is exactly what this new structure gave you.

MrAlias · 2022-09-07T15:07:51Z

> @MrAlias why not expose the new error? I can't understand what you're trying to avoid making public. There are good reasons for users to want to count the number of dropped metric points, which is exactly what this new structure gave you.

Can you help explain the good reasons a user would want this?

I do not understand what a user will do with this information when it is sent to the global error handler. They cannot retry the send; all I can see is that they can log this information. You don't need all these new exported types to do that.

jmacd · 2022-09-07T15:19:27Z

When it comes time to do what I'd like done, it can be done from inside the OTLP exporter with access to the internal package; however, it would be great to be able to (a) write alternative instrumentation and (b) experiment with such instrumentation before standardizing it.

The metric I'd like produced is a count of the spans, metric data points, and logs that are accepted by the server successfully. To do that I want to subtract the number that are dropped due to repeated failure and the number that were rejected by the server as being malformed in some way. The field contained in the partial-success structure is one of the inputs.

I'm fiercely opposed to excessive logging. Handling errors in this way is already bad, in my opinion. By leaving a publicly-accessible struct here, I'm able to tailor my error handler to suppress partial-success errors, perhaps, meaning I could decide to sample them or rate-limit them differently than unrecognized errors, or even simply just count the number of points rejected.

MadVikingGod · 2022-09-07T15:30:56Z

Could we split the difference: make an interface that the error will implement, but keep the implementation internal?

This allows this information to be represented in some way beyond just a log, but also minimizes the exposed API surface.
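A minimal sketch of that split, assuming a hypothetical exported `PartialSuccess` interface with a `RejectedItems` method (neither name comes from this PR). The concrete type stays unexported, standing in for code that would live in the internal package, and callers recover the count with `errors.As`:

```go
package main

import (
	"errors"
	"fmt"
)

// PartialSuccess is a hypothetical exported interface; the name and the
// RejectedItems method are assumptions made for this sketch.
type PartialSuccess interface {
	error
	RejectedItems() int64
}

// partialSuccessErr stands in for the implementation that would remain in
// an internal package and never be named by users.
type partialSuccessErr struct {
	msg      string
	rejected int64
}

func (e partialSuccessErr) Error() string {
	return fmt.Sprintf("partial success: %s (%d rejected)", e.msg, e.rejected)
}

func (e partialSuccessErr) RejectedItems() int64 { return e.rejected }

// newPartialSuccess plays the role of the exporter producing the error.
func newPartialSuccess(msg string, rejected int64) error {
	return partialSuccessErr{msg: msg, rejected: rejected}
}

// rejectedCount reports the rejected count if err, or any error it wraps,
// implements PartialSuccess.
func rejectedCount(err error) (int64, bool) {
	var ps PartialSuccess
	if errors.As(err, &ps) {
		return ps.RejectedItems(), true
	}
	return 0, false
}

func main() {
	err := fmt.Errorf("export: %w", newPartialSuccess("invalid span", 2))
	if n, ok := rejectedCount(err); ok {
		fmt.Println("rejected:", n)
	}
}
```

Only the interface would become part of the public API in this arrangement; the struct layout could still change freely.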

MrAlias · 2022-09-07T16:08:11Z

Thanks for providing the use-cases, they are insightful.

> When it comes time to do what I'd like done, it can be done from inside the OTLP exporter with access to the internal package; however, it would be great to be able to (a) write alternative instrumentation and (b) experiment with such instrumentation before standardizing it.
>
> The metric I'd like produced is a count of the spans, metric data points, and logs that are accepted by the server successfully. To do that I want to subtract the number that are dropped due to repeated failure and the number that were rejected by the server as being malformed in some way. The field contained in the partial-success structure is one of the inputs.

Adding metrics to the OTLP exporters is a great idea. I definitely agree we should add this into the exporter itself as the project progresses. However, I do not think the global error handler should be used in the prototyping of this data. Because the handler is global, it cannot discern where these errors came from or attribute each error to an exporter. I would expect wrapping the exporter with a span processor that exposed these metrics to be a better approach (similar to this but it would wrap the SpanExporter itself).

In that situation, the error would need to be returned from the ExportSpans method itself. I would support adding an error to the package API if that were the case.
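As a rough sketch of that wrapping idea, assuming the partial-success error were returned from `ExportSpans`: `Span` and `SpanExporter` below are simplified stand-ins for the SDK types, and `PartialSuccessError`, `countingExporter`, and `fakeExporter` are hypothetical names introduced only for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
)

// Span and SpanExporter are simplified stand-ins for the SDK trace types.
type Span struct{ Name string }

type SpanExporter interface {
	ExportSpans(ctx context.Context, spans []Span) error
}

// PartialSuccessError is a hypothetical exported error type.
type PartialSuccessError struct {
	Rejected int64
	Message  string
}

func (e PartialSuccessError) Error() string {
	return fmt.Sprintf("partial success: %s (%d rejected)", e.Message, e.Rejected)
}

// countingExporter wraps one exporter, so its totals are per-exporter
// rather than global.
type countingExporter struct {
	next                       SpanExporter
	accepted, rejected, failed atomic.Int64
}

func (c *countingExporter) ExportSpans(ctx context.Context, spans []Span) error {
	err := c.next.ExportSpans(ctx, spans)
	total := int64(len(spans))
	var ps PartialSuccessError
	switch {
	case err == nil:
		c.accepted.Add(total)
	case errors.As(err, &ps):
		c.accepted.Add(total - ps.Rejected)
		c.rejected.Add(ps.Rejected)
	default:
		// Any other error is treated as the whole batch failing.
		c.failed.Add(total)
	}
	return err // pass the error on so the processor can still handle it
}

// fakeExporter rejects a fixed number of spans from every batch.
type fakeExporter struct{ reject int64 }

func (f fakeExporter) ExportSpans(context.Context, []Span) error {
	if f.reject > 0 {
		return PartialSuccessError{Rejected: f.reject, Message: "malformed"}
	}
	return nil
}

// exportN sends a batch of n empty spans through e.
func exportN(e SpanExporter, n int) error {
	return e.ExportSpans(context.Background(), make([]Span, n))
}

func main() {
	c := &countingExporter{next: fakeExporter{reject: 2}}
	if err := exportN(c, 5); err != nil {
		fmt.Println("export error:", err)
	}
	fmt.Printf("accepted=%d rejected=%d failed=%d\n",
		c.accepted.Load(), c.rejected.Load(), c.failed.Load())
}
```

Because the wrapper knows the batch size, it can derive the accepted count, which a global handler receiving only the error cannot do.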

> I'm fiercely opposed to excessive logging. Handling errors in this way is already bad, in my opinion. By leaving a publicly-accessible struct here, I'm able to tailor my error handler to suppress partial-success errors, perhaps, meaning I could decide to sample them or rate-limit them differently than unrecognized errors, or even simply just count the number of points rejected.

I can see the desire to not have excessive logging. But I think we have made a misstep then in using the global error handler here instead of the global logger. If we were to log these things with the structured logging interface it provides it would natively allow loggers with rate limiters to handle this. As it is now we need to export these types, send types along the global error handler, parse the types with a registered error handler, and then send to a rate limited logger.

I think based on these use-cases we may need to not only rethink these types, but I would like to consider again how we are "exporting" the partial success response itself. I think we could refactor these types and use them as return values from ExportSpans and log their presence to better achieve what was proposed in #3106. @jmacd thoughts?

MrAlias · 2022-09-07T16:14:07Z

> I think we could refactor these types and use them as return values from ExportSpans ...

Also, I think if we do this it should be done at the SDK trace level. That way any exporter would be able to report these errors.

MadVikingGod · 2022-09-07T16:57:05Z

> Also, I think if we do this it should be done at the SDK trace level. That way any exporter would be able to report these errors.

This probably shouldn't be part of the SDK, because this is an application-level error for OTLP. Not every exporter will have this kind of error.

I support this PR because it allows us time to make the "right" decision after this is released. Currently OTLP only returns protocol-level errors because there weren't any application errors. It would be a change of behavior to start returning the new application errors, so I think we need to understand how this might break current usage, if at all.

At some practical level we will have to expose some API to make use of these errors, whether they are exposed via an ErrorHandler or directly from ExportSpans. I would personally prefer to use errors.Is(), but an interface with FailedSpanCount() or something similar would be just as effective.

After we expose the error in some way, I can see a migration path: first, expose it via otel.Handle() inside the exporter. Next, if we accept the change in behavior, return the error and let the SpanProcessor call otel.Handle(). Finally, create some tool to wrap the exporter that can measure this, with the advantage of knowing how many spans were sent.

jmacd · 2022-09-07T16:57:05Z

> global error handler here instead of the global logger

Not sure I agree. If there's a dedicated logger for logging errors produced inside the SDK, then maybe. Without more semantic conventions on errors produced by OTel SDKs, I think these have to be handled specially.

> the structured logging interface it provides it would natively allow loggers with rate limiters to handle this

I was expecting code in a handler like:

if errors.Is(err, otlp.PartialSuccessError{}) {
   // ... extract the count for something useful; do not log this
}

Admittedly, use of a global handler is not ideal here. I would want a per-exporter handler. You used the return value from ExportSpans in your example, but think about how this will work for the metrics SDK? I would be glad for a per-exporter ErrorHandler.

FWIW I argued against an unstructured error message being returned in the first place. The actual partial success errors being produced in the Lightstep metrics ingest path are structured to begin with, and say which metrics are failing and for which reasons. However, since the information is so dreadfully repetitive, it responds with at most one example per response. Then, because OTLP doesn't support that structure, it ends up as a single example of a formatted error message. 🤷 I just want to count the number of successful/failed metric points.
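One way such a handler could look, as a sketch only: `PartialSuccessError` is a hypothetical exported type, and `countingHandler` merely has the same `Handle(error)` shape as the `otel.ErrorHandler` interface without importing it. It uses `errors.As` rather than `errors.Is` because the count has to be read off the matched value.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// PartialSuccessError is a hypothetical exported error type.
type PartialSuccessError struct {
	Rejected int64
	Message  string
}

func (e PartialSuccessError) Error() string {
	return fmt.Sprintf("partial success: %s (%d rejected)", e.Message, e.Rejected)
}

// countingHandler counts partial-success rejections instead of logging
// them, and records every other error as a log line.
type countingHandler struct {
	mu       sync.Mutex
	rejected int64
	logged   []string
}

// Handle has the same shape as the Handle method of otel.ErrorHandler.
func (h *countingHandler) Handle(err error) {
	h.mu.Lock()
	defer h.mu.Unlock()

	var ps PartialSuccessError
	if errors.As(err, &ps) {
		h.rejected += ps.Rejected // count only; suppress the log entry
		return
	}
	h.logged = append(h.logged, err.Error())
}

// partial and other build the two kinds of error used in the demo.
func partial(n int64) error    { return PartialSuccessError{Rejected: n, Message: "malformed"} }
func other(msg string) error   { return errors.New(msg) }

func main() {
	h := &countingHandler{}
	h.Handle(partial(3))
	h.Handle(fmt.Errorf("export: %w", partial(2)))
	h.Handle(other("connection refused"))
	fmt.Printf("rejected=%d logged=%v\n", h.rejected, h.logged)
}
```

A handler like this still cannot tell which exporter produced the error, which is the limitation both sides of the thread acknowledge.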

MrAlias · 2022-09-07T18:51:45Z

> > Also, I think if we do this it should be done at the SDK trace level. That way any exporter would be able to report these errors.
>
> This probably shouldn't be part of the SDK, because this is an application-level error for OTLP. Not every exporter will have this kind of error.

I don't think that is the correct way to think about the problem. If more than one exporter reports partial success, which seems likely, they should all report the same error type. Otherwise, code interpreting the OTLP error will need to be updated for every exporter that reports this type of error. This would follow suit with the Reader errors we have in the SDK, which every `Reader` implementation can return.

> I support this PR because it allows us time to make the "right" decision after this is released. Currently OTLP only returns protocol-level errors because there weren't any application errors. It would be a change of behavior to start returning the new application errors, so I think we need to understand how this might break current usage, if at all.

I don't follow this. An error returned from ExportSpans is an error from that function call. There is no distinction about the error category; that is the benefit of Go defining errors as interfaces. How the error is handled can depend on the error, but the behavior of returning an error when an error occurred in the function call would remain the same.

> After we expose the error in some way, I can see a migration path: first, expose it via otel.Handle() inside the exporter. Next, if we accept the change in behavior, return the error and let the SpanProcessor call otel.Handle(). Finally, create some tool to wrap the exporter that can measure this, with the advantage of knowing how many spans were sent.

Reporting the error with the global error handler now means we will either double-report the error later or stop reporting it with the handler in the future. Neither is ideal.

MrAlias · 2022-09-07T18:51:50Z

> [...] If there's a dedicated logger for logging errors produced inside the SDK, then maybe.

There is, namely this.

> > the structured logging interface it provides it would natively allow loggers with rate limiters to handle this
>
> I was expecting code in a handler like:
>
> if errors.Is(err, otlp.PartialSuccessError{}) {
>    // ... extract the count for something useful; do not log this
> }

Right, this is what I was expecting. But if you used the error logger to log this event that code would not be needed, nor the new PartialSuccessError.

> Admittedly, use of a global handler is not ideal here. I would want a per-exporter handler.

I must be missing something. The caller of the ExportSpans function is the per-exporter handler. That function receives an error as a return value from calling the function and determines how it should be handled.

> You used the return value from ExportSpans in your example, but think about how this will work for the metrics SDK? I would be glad for a per-exporter ErrorHandler.

Why would this not work in the metric SDK? An error is similarly returned from Export:

opentelemetry-go/sdk/metric/exporter.go

Line 44 in bdb917e

Export(context.Context, metricdata.ResourceMetrics) error

> I just want to count the number of successful/failed metric points.

I think this is key! I want that as well (and so did @dashpole). I see returning this information as an error from the call to ExportSpans as the best way to do this. Even when we add metrics about the partial success to the exporter, what if a user wanted to count these numbers and log them locally, or report them to the otel ErrorHandler? They would still be able to do that with a SpanProcessor wrapping the SpanExporter. It provides the most functionality and follows standard Go error handling practices.

jmacd · 2022-09-07T19:38:00Z

I think you mean to replace the PartialSuccessError with either something like:

  otel.Info("partial success", attribute.Int("number_rejected", ...), attribute.String("error_message", ...), attribute.String("signal", "metrics"))

or maybe

  otel.Error(PartialSuccess{...}, "exporter partial success", attribute.String("signal", "metrics"))

Those are OK with me.
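For illustration, the structured-logging alternative might look roughly like the following. The `otel.Info` and `otel.Error` calls above are proposals rather than existing functions, so this sketch substitutes the standard library's `log/slog` (Go 1.21+) as a stand-in logger; `logPartialSuccess`, `newLogger`, and `render` are hypothetical helper names.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
)

// newLogger returns a text logger writing to buf. The timestamp attribute
// is dropped so the output is stable between runs.
func newLogger(buf *bytes.Buffer) *slog.Logger {
	h := slog.NewTextHandler(buf, &slog.HandlerOptions{
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if a.Key == slog.TimeKey && len(groups) == 0 {
				return slog.Attr{}
			}
			return a
		},
	})
	return slog.New(h)
}

// logPartialSuccess emits one structured record per partial success. A
// rate-limiting or sampling handler could be substituted without the
// caller knowing about any exported error type.
func logPartialSuccess(l *slog.Logger, signal string, rejected int64, msg string) {
	l.Info("exporter partial success",
		slog.String("signal", signal),
		slog.Int64("number_rejected", rejected),
		slog.String("error_message", msg),
	)
}

// render logs a single record and returns the formatted line.
func render(signal string, rejected int64, msg string) string {
	var buf bytes.Buffer
	logPartialSuccess(newLogger(&buf), signal, rejected, msg)
	return buf.String()
}

func main() {
	fmt.Print(render("metrics", 3, "invalid data point"))
}
```

The count travels as a typed attribute here, so a log pipeline can aggregate it, but code that wants the number programmatically would still need a custom handler.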

I don't think we should be RETURNING these as errors because the export itself is not failing, so I don't expect to have to provide a new span-exporter, metrics-exporter, etc., just in order to get these messages to show on the console.

Later, someone in OTel will organize a way to report on each signal -- at that point, where a non-nil error means a total failure, we will I assume have to inject some kind of error-handler to count the total loss of a batch of spans/metrics/logs, etc.

MrAlias · 2022-09-07T20:07:59Z

> I think you mean to replace the PartialSuccessError with either something like:
>
>   otel.Info("partial success", attribute.Int("number_rejected", ...), attribute.String("error_message", ...), attribute.String("signal", "metrics"))
>
> or maybe
>
>   otel.Error(PartialSuccess{...}, "exporter partial success", attribute.String("signal", "metrics"))
>
> Those are OK with me.

Right. This was the approach I would defer to if we wanted to support logging (rate limited, or otherwise) of the partial success.

> I don't think we should be RETURNING these as errors because the export itself is not failing

I disagree. The export network call may not have failed, but the call to ExportSpans did fail. The payload the caller passed did not successfully export.

Errors in Go are used to communicate when aberrant and unexpected events happen. Similar to an io.Writer, io.Reader, "html/template".ExecuteTemplate, or "text/template".ExecuteTemplate, returning an error explaining why part of the passed data was not handled is a common Go practice.
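The io.Writer precedent can be illustrated with a small sketch. The `limitedWriter` type is hypothetical; `io.ErrShortWrite` and the rule that `Write` must return a non-nil error when it returns `n < len(p)` come from the standard `io` package.

```go
package main

import (
	"fmt"
	"io"
)

// limitedWriter is a hypothetical writer that accepts at most limit bytes
// in total. It handles what it can and reports the rest as an error.
type limitedWriter struct {
	buf   []byte
	limit int
}

// Write follows the io.Writer contract: it returns the number of bytes
// accepted and a non-nil error whenever that number is less than len(p).
func (w *limitedWriter) Write(p []byte) (int, error) {
	room := w.limit - len(w.buf)
	if room >= len(p) {
		w.buf = append(w.buf, p...)
		return len(p), nil
	}
	w.buf = append(w.buf, p[:room]...)
	return room, io.ErrShortWrite
}

// Compile-time check that limitedWriter satisfies io.Writer.
var _ io.Writer = (*limitedWriter)(nil)

func main() {
	w := &limitedWriter{limit: 5}
	n, err := w.Write([]byte("hello world"))
	fmt.Printf("accepted %d bytes (%q), err: %v\n", n, w.buf, err)
}
```

The call did useful work and still returned an error, which is the same shape as an export where the server accepted only part of the payload.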

> so I don't expect to have to provide a new span-exporter, metrics-exporter, etc., just in order to get these messages to show on the console.

I don't follow this. Currently, both the simple span processor and the batch span processor send ExportSpans errors to the global error handler. Why would you need to provide a new span-exporter?

MrAlias · 2022-09-08T00:00:40Z

I created a proof-of-concept showing how returning the error from a call to ExportSpans allows the creation of span processors where metric export experiments can happen, while the default span processors still register the error with the default handler.

MrAlias added 2 commits September 6, 2022 12:52

Move partialsuccess code to internal package

38bb74e

Fix imports to new pkg

bcc4302

MrAlias added the pkg:exporter:otlp Related to the OTLP exporter package label Sep 6, 2022

MrAlias added this to the Release v1.10.0 milestone Sep 6, 2022

MrAlias requested review from jmacd, Aneurysm9, evantorrie, XSAM, dashpole, MadVikingGod, pellared, hanyuancheung and dmathieu as code owners September 6, 2022 19:57

MrAlias added the Skip Changelog PRs that do not require a CHANGELOG.md entry label Sep 6, 2022

dashpole approved these changes Sep 6, 2022

jmacd approved these changes Sep 6, 2022

hanyuancheung approved these changes Sep 7, 2022

dmathieu approved these changes Sep 7, 2022

MadVikingGod approved these changes Sep 7, 2022

MrAlias mentioned this pull request Sep 7, 2022

Proof-of-Concept for partial export err returns #3153

Closed

MrAlias merged commit 13906ac into open-telemetry:main Sep 8, 2022

MrAlias deleted the mv-otlp-part-success branch September 8, 2022 17:46
