Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Structured Error type #357

Open
tallclair opened this issue Nov 3, 2022 · 20 comments · May be fixed by #361
Open

[RFC] Structured Error type #357

tallclair opened this issue Nov 3, 2022 · 20 comments · May be fixed by #361
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@tallclair
Copy link
Member

/kind feature

Describe the solution you'd like
Sometimes I want to be able to add structured logging key-value pairs where an error is generated, but it isn't logged until way up the stack. It would be nice if structured logging information could be attached to the error itself, and logged with the error.

Rough proposal:

+ type StructuredError struct {
+   error
+   KeysAndValues []any
+ }
+ 
+ func (err StructuredError) RecursiveKeysAndValues() []any {
+	kvs := err.KeysAndValues
+	var inner *StructuredError
+	if errors.As(err.error, &inner) {
+		// When duplicate keys are present, should the inner most or outermost error win?
+		kvs = serialize.MergeKVs(inner.RecursiveKeysAndValues(), kvs)
+ 	}
+ 	return kvs
+ }

func (l *loggingT) errorS(err error, logger *logr.Logger, filter LogFilter, depth int, msg string, keysAndValues ...interface{}) {
+	var structuredErr *StructuredError
+	if errors.As(err, &structuredErr) {
+ 		keysAndValues = append(keysAndValues, structuredErr.RecursiveKeysAndValues()...)
+ 	}
	if filter != nil {
		msg, keysAndValues = filter.FilterS(msg, keysAndValues)
	}
	if logger != nil {
		logger.WithCallDepth(depth+2).Error(err, msg, keysAndValues...)
		return
	}
	l.printS(err, severity.ErrorLog, depth+1, msg, keysAndValues...)
}

Open Question: should the message returned by structuredErr.Error() include the keyvalue pairs? In that case they should be omitted from the message when logged with klog, but that gets complicated when you have a non-structured error wrapping a structured error wrapping a non-structured error. My inclination is to not include them in the default Error() message for this reason.

Open Question: should the StructuredError type be hoisted to logr? cc @thockin

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 3, 2022
@thockin
Copy link
Member

thockin commented Nov 15, 2022

logr supports types that define a MarshalLog() method, but I don't know if that is plumbed all the way to klog, or which modes support it properly.

@thockin thockin self-assigned this Nov 15, 2022
@pohly
Copy link

pohly commented Nov 18, 2022

The klog text output currently checks for error first and then logs the message from Error. Changing that is not hard:

diff --git a/internal/serialize/keyvalues.go b/internal/serialize/keyvalues.go
index ad6bf11..33813b4 100644
--- a/internal/serialize/keyvalues.go
+++ b/internal/serialize/keyvalues.go
@@ -127,8 +127,6 @@ func KVListFormat(b *bytes.Buffer, keysAndValues ...interface{}) {
                        writeStringValue(b, true, StringerToString(v))
                case string:
                        writeStringValue(b, true, v)
-               case error:
-                       writeStringValue(b, true, ErrorToString(v))
                case logr.Marshaler:
                        value := MarshalerToValue(v)
                        // A marshaler that returns a string is useful for
@@ -147,6 +145,8 @@ func KVListFormat(b *bytes.Buffer, keysAndValues ...interface{}) {
                        default:
                                writeStringValue(b, false, fmt.Sprintf("%+v", value))
                        }
+               case error:
+                       writeStringValue(b, true, ErrorToString(v))
                case []byte:
                        // In https://github.com/kubernetes/klog/pull/237 it was decided
                        // to format byte slices with "%+q". The advantages of that are:

However, that then only has the effect that the structured error gets formatted with fmt.Sprintf("%+v"), like any other struct:

// Structured error with additional values that are not part of the message
// returned by Error().
type structuredError struct {
	error
	Attributes map[string]interface{}
}

// Overrides the Error function as the preferred way for logging
// this error.
func (err structuredError) MarshalLog() interface{} {
	return struct {
		Message    string
		Attributes map[string]interface{}
	}{
		Message:    err.Error(),
		Attributes: err.Attributes,
	}
}

structuredError{error: errors.New("fake error"), Attributes: map[string]interface{}{"x": 1, "y": 2}}
=>
E output.go:422] "structured error" err={Message:fake error Attributes:map[x:1 y:2]}

The extra information is there, but it's not getting formatted well. Improving that would be more complicated.

zapr also uses Error. Patch to change that:

diff --git a/zapr.go b/zapr.go
index 8bb7fce..eb9d44b 100644
--- a/zapr.go
+++ b/zapr.go
@@ -213,7 +213,18 @@ func (zl *zapLogger) Info(lvl int, msg string, keysAndVals ...interface{}) {
 
 func (zl *zapLogger) Error(err error, msg string, keysAndVals ...interface{}) {
        if checkedEntry := zl.l.Check(zap.ErrorLevel, msg); checkedEntry != nil {
-               checkedEntry.Write(zl.handleFields(noLevel, keysAndVals, zap.NamedError(zl.errorKey, err))...)
+               field := zap.NamedError(zl.errorKey, err)
+               if logMarshaler, ok := err.(logr.Marshaler); ok {
+                       func() {
+                               defer func() {
+                                       // Ignore internal error in MarshalLog call.
+                                       // err will get logged as normal error field.
+                                       recover()
+                               }()
+                               field = zap.Any(zl.errorKey, logMarshaler.MarshalLog())
+                       }()
+               }
+               checkedEntry.Write(zl.handleFields(noLevel, keysAndVals, field)...)
        }
 }

The output then becomes:

{"caller":"test/output.go:422","msg":"structured error","err":{"Message":"fake error","Attributes":{"x":1,"y":2}}}

@tallclair: is that good enough?

I'm a bit worried about "err" changing its schema depending on the type of error. I've been told that some log processing tools get confused when that happens. We don't guarantee it for other keys either, but "err" is a pretty common one.

@thockin
Copy link
Member

thockin commented Nov 18, 2022

It's a good point that perhaps the err argument should be treated with Error(). Having it be string in some cases and struct in others may confuse systems.

funcr, for example, goes out of its way to call Error() and then log the result, rather than treat it as any other key-value

@pohly
Copy link

pohly commented Nov 19, 2022

Perhaps we can log both with separate keys:

  • "err": <Error()>
  • "errObject": <MarshalLog>

To achieve what @tallclair originally asked for (providing additional key/value pairs) we could add a new type to logr:

type KeyValuePairs map[string]interface{}

If, and only if, MarshalLog returns an object of that type, then a logger that supports this could ignore the key that normally would be used for the object (in the example above, "errObject") and instead log the key/value pairs as if they had been provided directly.

I'm undecided whether that special support should be limited to MarshalLog. It may also be useful to capture multiple key/value pairs in a single value without having to flatten them into a slice with other key/value pairs. This may even be recursive.

Key name conflicts could become a problem:

logger.Info("hello", "a": 1, "b": KeyValuePairs{"a": 2})
->
{"msg": "hello", "a":1, "a":2} ?!

I think I would instead concatenate the keys while expanding:

{"msg": "hello", "a":1, "b.a":2}
"structured error" a=1 b.a=2

While conceptually simple, using a map makes the output non-deterministic unless the logger sorts by key. A better type might be a slice:

type KeyValuePairs []KeyValuePair
type KeyValuePair struct {Key string, Value interface{}}

I'm choosing to limit keys to strings because it simplifies the logging. We are not forced to use interface{} as in the WithValues and Info/Error parameters.

@thockin
Copy link
Member

thockin commented Nov 19, 2022 via email

@pohly
Copy link

pohly commented Nov 19, 2022

funcr has PseudoStruct which is similar to what you are describing. I did it in funcr because it felt kind of dirty to try to push that into every sink.

Not every sink has to support it. If they don't, they'll just log the KeyValuePairs like they normally do (map or list, depending on what we choose), which still exposes the same information.

Maybe a construction-time flag to klog/etc like AlsoErrorMarshalLog
meaning, to log both err=string and errRaw=? when er supports
MarshalLog?

Are there situations where we would want such a AlsoErrorMarshalLog to be false?

@pohly
Copy link

pohly commented Nov 23, 2022

I implemented logging a "structured error" as "err" + "errDetails":

This is work in progress (lacks some unit tests, depends on release of logr with those changes first), but should be complete enough to discuss whether this makes sense conceptually.

The name of "errDetails" is configurable in zapr. Conceptually this is similar to how zap handles an error that implements fmt.Formatter: for those it adds "errVerbose".

@tallclair
Copy link
Member Author

Thanks for following up on this @pohly ! I want to make sure I understand how all these pieces connect. My understanding is that with your changes above, I could do something like this:

type StructuredError struct {
  err error
  detail logr.KeysAndValues
}

func (e *StructuredError) Error() string {
  return e.err.Error()
}

func (e *StructuredError) MarshalLog() interface{} {
  return detail
}

And then something like &StructuredError{errors.New("my error"), logr.KeysAndValues{{"extra": "foo-bar"}}} would get logged like:

err="my error" errDetails={extra="foo-bar"}

Does that look right? If I understood correctly, I think this works for my use case.

@pohly
Copy link

pohly commented Dec 3, 2022

That's correct. One caveat is that there's no consensus yet on the logr.KeysAndValues type.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 3, 2023
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 2, 2023
@pohly
Copy link

pohly commented Apr 2, 2023

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 2, 2023
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 1, 2023
@pohly
Copy link

pohly commented Sep 19, 2023

Update: waiting for log/slog in Go 1.21 IMHO was worthwhile. Instead of adding a new special logr.KeysAndValues, we can instead use slog.GroupValue and the LogValue() slog.Value method. Support for those is needed anyway if we want to support slog in a LogSink and was already implemented (PRs pending for klog and zapr).

I've updated go-logr/zapr#56 and could do the same for klog.

@tallclair: is depending on slog and thus Go 1.21 okay?

@pohly
Copy link

pohly commented Nov 21, 2023

/unassign thockin
/assign

@pohly
Copy link

pohly commented Nov 21, 2023

The following PRs are ready for merging, if we agree on the approach:

TODO: klog text format

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 20, 2024
@pohly
Copy link

pohly commented Jan 22, 2024

/remove-lifecycle rotten

ping @tallclair: still interested?

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 22, 2024
@tallclair
Copy link
Member Author

Yes. Nothing (that I'm aware of) is blocked on this, but I've run into a few places where this would be useful, either enabling more detailed error logging, or reducing the amount of context that needs to be plumbed through.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants