feat(otelgin): add support for recording panics #5090

Open · wants to merge 2 commits into base: main
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -27,6 +27,7 @@ The next release will require at least [Go 1.21].
- Add support for Summary metrics to `go.opentelemetry.io/contrib/bridges/prometheus`. (#5089)
- Add support for Exponential (native) Histograms in `go.opentelemetry.io/contrib/bridges/prometheus`. (#5093)
- Implemented setting the `cloud.resource_id` resource attribute in `go.opentelemetry.io/detectors/aws/ecs` based on the ECS Metadata v4 endpoint. (#5091)
- Add support to record panics in `go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin`. (#5090)

### Removed

11 changes: 10 additions & 1 deletion instrumentation/github.com/gin-gonic/gin/otelgin/gintrace.go
@@ -75,7 +75,16 @@ func Middleware(service string, opts ...Option) gin.HandlerFunc {
			opts = append(opts, oteltrace.WithAttributes(rAttr))
		}
		ctx, span := tracer.Start(ctx, spanName, opts...)
		defer span.End()
Contributor

This behavior is already in the default SDK's End method:

https://github.com/open-telemetry/opentelemetry-go/blob/e8973b75b230246545cdae072a548c83877cba09/sdk/trace/span.go#L403-L420

Why is this being added here? It seems like an SDK concern, not an instrumentation one.
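
For reference, a minimal sketch of relying on that SDK behavior directly; it assumes the default SDK is installed as the tracer provider, and `oteltrace.WithStackTrace` is the existing option from go.opentelemetry.io/otel/trace:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	oteltrace "go.opentelemetry.io/otel/trace"
)

func handle(ctx context.Context) {
	_, span := otel.Tracer("example").Start(ctx, "handle")
	// With the default SDK, a deferred End recovers a panic, records an
	// "exception" event (including a stack trace when requested), and then
	// re-panics so the caller still sees it.
	defer span.End(oteltrace.WithStackTrace(true))

	panic("boom")
}

func main() {
	defer func() { _ = recover() }() // keep the example process from crashing
	handle(context.Background())
}
```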

Author

That's a really good question. The general setup I typically deal with has a middleware chain that looks something like this (a rough sketch follows the list):

  1. panic recovery - pretty simple; it attempts to recover any panics, log them, and set the response code to 500. This is always at the top so it catches anything and everything it can.
  2. otelgin (this middleware)
  3. auth and other middlewares
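
A minimal sketch of that ordering; `recoveryMiddleware` and `authMiddleware` are hypothetical stand-ins, not part of this PR:

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"

	"go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin"
)

// recoveryMiddleware is a hypothetical recovery handler: recover the panic and
// return a 500 (logging omitted for brevity).
func recoveryMiddleware(c *gin.Context) {
	defer func() {
		if r := recover(); r != nil {
			c.AbortWithStatus(http.StatusInternalServerError)
		}
	}()
	c.Next()
}

// authMiddleware is a hypothetical placeholder for auth and other middlewares.
func authMiddleware(c *gin.Context) { c.Next() }

func main() {
	r := gin.New()
	r.Use(recoveryMiddleware)               // 1. panic recovery, always first
	r.Use(otelgin.Middleware("my-service")) // 2. otelgin (this middleware)
	r.Use(authMiddleware)                   // 3. auth and other middlewares
	_ = r.Run(":8080")
}
```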

Panics show up in logs via the panic recovery middleware described above, and also show up in the spans via the mechanism you linked. Thinking about this more, the gap I currently have is less that panics get added to this specific span somehow, and more that the spans generated by the otelgin middleware don't get an error span status and don't set http.status_code to anything. When an alert fires for an increase in 5XX error codes, I usually go to Jaeger first and query something like http.status_code=500 or error=true. In the case of panics, those kinds of queries don't work.

Given that, I think this leaves a few options:

  1. Don't update this middleware. Update the middleware ordering I currently use to be something like panic recovery -> otelgin -> panic recovery. This would give the net result I want, but using the same middleware twice doesn't feel great.
  2. Update this middleware roughly as it's currently written (still need to apply the feedback about not having configs within configs, etc.). This would mean the http.status_code field isn't guaranteed, but that's not necessarily this middleware's responsibility to set. The span status would still be set to error, which is a strong signal.
  3. Update this middleware roughly as written, but also give the user the ability to set a status code if a panic is caught. This feels like adding too much responsibility to this middleware.

I'm leaning towards 2, but I'd love to get other opinions as well :)

Contributor

Looking at the latest semantic conventions, the error.type attribute should be used here to capture the panic information.

It is a known thing that our HTTP instrumentation is quite behind on semantic conventions, but I think this should be our target. Instead of having configuration to enable this, we should always add the error.type attribute when a panic occurs.

This will require this instrumentation to go through the process of upgrading semconv, which will require a migration path.
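
A rough sketch of what that could look like in the middleware body. It uses the raw "error.type" attribute key rather than a pinned semconv package, and it is an assumption about the eventual shape, not code from this PR:

```go
// Assumed imports: "fmt", "go.opentelemetry.io/otel/attribute", "go.opentelemetry.io/otel/codes".
defer func() {
	if r := recover(); r != nil {
		// Always record the panic type per the error.type semantic convention,
		// mark the span as errored, then re-panic so outer recovery middleware runs.
		span.SetAttributes(attribute.String("error.type", fmt.Sprintf("%T", r)))
		span.SetStatus(codes.Error, fmt.Sprint(r))
		span.End()
		panic(r)
	}
	span.End()
}()
```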

Author

Sorry for the late reply - getting back to this now. I think that makes sense. I'm not familiar with these migrations, but I'll happily give it a shot. @MrAlias, I may poke you for help with the migrations if I run into issues or have questions. I'll apply these changes, as well as the suggestions from @dmathieu.

Author

Actually, it looks like #5092 will take care of the migration plan. I might wait until that PR is merged and then make the changes to this package on top of that. How does that sound?

		defer func() {
			if r := recover(); r != nil {
				err := fmt.Errorf("%+v", r)
				span.RecordError(err, oteltrace.WithStackTrace(cfg.RecordPanicStackTrace))
				span.SetStatus(codes.Error, err.Error())
				span.End()
				panic(r)
			}
			span.End()
		}()

		// pass the span through the request context
		c.Request = c.Request.WithContext(ctx)
16 changes: 12 additions & 4 deletions instrumentation/github.com/gin-gonic/gin/otelgin/option.go
@@ -13,10 +13,11 @@ import (
)

type config struct {
	TracerProvider oteltrace.TracerProvider
	Propagators propagation.TextMapPropagator
	Filters []Filter
	SpanNameFormatter SpanNameFormatter
	TracerProvider        oteltrace.TracerProvider
	Propagators           propagation.TextMapPropagator
	Filters               []Filter
	SpanNameFormatter     SpanNameFormatter
	RecordPanicStackTrace bool
}

// Filter is a predicate used to determine whether a given http.request should
@@ -77,3 +78,10 @@ func WithSpanNameFormatter(f func(r *http.Request) string) Option {
		c.SpanNameFormatter = f
	})
}

// WithRecordPanicStackTrace specifies whether to record the stack trace of a panic.
func WithRecordPanicStackTrace(stackTrace bool) Option {
	return optionFunc(func(c *config) {
		c.RecordPanicStackTrace = stackTrace
	})
}
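
A usage sketch of the new option as written in this PR; the service name and route are placeholders:

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"

	"go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin"
)

func main() {
	r := gin.New()
	// Opt in to capturing a stack trace on the recorded panic event.
	r.Use(otelgin.Middleware("my-service",
		otelgin.WithRecordPanicStackTrace(true),
	))
	r.GET("/ping", func(c *gin.Context) { c.String(http.StatusOK, "pong") })
	_ = r.Run(":8080")
}
```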
@@ -251,3 +251,57 @@ func TestWithFilter(t *testing.T) {
		assert.Len(t, sr.Ended(), 1)
	})
}

func TestRecordPanic(t *testing.T) {
	recoveryMiddleware := func(c *gin.Context) {
		// Ensure panics are recovered so they don't crash the tests or get logged to stdout.
		defer func() {
			_ = recover()
		}()
		c.Next()
	}

	testCases := []struct {
		name             string
		expectStackTrace bool
	}{
		{
			name:             "should record stack trace",
			expectStackTrace: true,
		},
		{
			name:             "should not record stack trace",
			expectStackTrace: false,
		},
	}

	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			sr := tracetest.NewSpanRecorder()
			provider := sdktrace.NewTracerProvider()
			provider.RegisterSpanProcessor(sr)
			router := gin.New()
			router.Use(recoveryMiddleware, otelgin.Middleware("potato", otelgin.WithTracerProvider(provider), otelgin.WithRecordPanicStackTrace(tc.expectStackTrace)))
			router.GET("/user/:id", func(c *gin.Context) { panic("corn") })
			router.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", "/user/123", nil))

			require.Len(t, sr.Ended(), 1, "should emit a span")
			span := sr.Ended()[0]
			assert.Equal(t, codes.Error, span.Status().Code, "should set Error status for panics")
			require.Len(t, span.Events(), 1, "should emit an event")
			event := span.Events()[0]
			assert.Equal(t, "exception", event.Name)

			var foundStackTrace bool

			for _, attr := range event.Attributes {
				if attr.Key == "exception.stacktrace" {
					foundStackTrace = true
					break
				}
			}

			assert.Equal(t, tc.expectStackTrace, foundStackTrace, "should record a stack trace")
		})
	}
}