Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add LastError in the TargetSnapShot for debugging and monitoring #420

Closed
wants to merge 4 commits into from

Conversation

rockspore
Copy link

It seems any error that occurs during fetching the JWKS in the background gets swallowed.

In my use case of this library, the JWKS url is provided by users and misconfiguration of the JWKS url would be very difficult to troubleshoot if those errors are not visible unless Fetch() or Refresh() gets specifically invoked. So I added an option which allows library users to inject an error handling function to deal with the error with their own logic.

One open question outside this PR is: It seems the backoff interval is ineffective if it's smaller than the MinRefreshInterval. This basically means any failure caused by internet interruption or temporary server downtime will not be able to recover asynchronously till quite some time after. Is this intended? Thanks.

@lestrrat
Copy link
Collaborator

I'm coming off of a being sick, so I'm not going to make any technical decisions yet, but a few things that I can immediately answer..

  • seems like tests are failing :)
  • I generally don't like callbacks in my API. Maybe pass a channel as a parameter, and the consumer can monitor it?
  • I'm not sure passing the error alone would suffice. Wouldn't you need the URL, at least?
  • re: backoff, the default settings could have been set better, but the idea is that the user could tweak it by passing their own backoff object. if you think the default should be more dynamic, PRs welcome.

@rockspore rockspore changed the title Add WithFetchErrorHandler to make errors during background fetching visible. Add LastError in the TargetSnapShot for debugging and monitoring Jul 28, 2021
@rockspore
Copy link
Author

I'm coming off of a being sick, so I'm not going to make any technical decisions yet, but a few things that I can immediately answer..

  • seems like tests are failing :)
  • I generally don't like callbacks in my API. Maybe pass a channel as a parameter, and the consumer can monitor it?
  • I'm not sure passing the error alone would suffice. Wouldn't you need the URL, at least?
  • re: backoff, the default settings could have been set better, but the idea is that the user could tweak it by passing their own backoff object. if you think the default should be more dynamic, PRs welcome.

Thanks for the quick reply! Yeah there was a race condition in the previously added test and I forgot to use -race when testing locally (also didn't realize there was the Makefile)

After rethinking about our need, the existing TargetSnapShot seems to be a good fit for us to set up periodic monitoring on the client side. So I reverted all the early change and added LastError into that structure along with some logic in doRefreshRequest() to make sure the it's correctly populated and updated.

Since now nothing goes inside the fetch() function, any error inside, namely during the backoff retry loop, will not be visible unless returned. But now I think that piece of information is not necessary.

Speaking of SnapShot() itself, I wonder if you would like to have another function that persists the channel to stream target snapshots whenever there is a change, such that no polling is required from the client side.

@lestrrat
Copy link
Collaborator

lestrrat commented Jul 28, 2021

Adding LastError seems to be minimal and if it fits your needs I'm generally okay with this approach.

One last comment I'd make is about

Speaking of SnapShot() itself, I wonder if you would like to have another function that persists the channel to stream target snapshots whenever there is a change, such that no polling is required from the client side.

This is close to what I was suggesting when I said "maybe pass a channel?"
So the following is sort of what I was thinking for your error use case (note: I haven't really thought this through yet) :

type AutoRefreshError struct {
  URL string
  Error error
}
ch := make(chan AutoRefreshError) // Note: user must drain it otherwise AutoRefresh will block.
ar := jwk.NewAutoRefresh( jwk.WithAutoRefreshErrorChannel(ch) )
// ... normal operation ...

// somewhere else
for arErr := range ch {
  log.Printf("error while fetching %s: %s", arErr.URL, ar.Err.Error)
}

I don't know. I think we can keep the LastError bit regardless, but if your use case benefits from an asynchronous error stream, I thought it might be worth thinking about.

@codecov-commenter
Copy link

codecov-commenter commented Jul 28, 2021

Codecov Report

Merging #420 (d7534a6) into main (90b0d09) will increase coverage by 0.10%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #420      +/-   ##
==========================================
+ Coverage   69.61%   69.72%   +0.10%     
==========================================
  Files          80       80              
  Lines        9186     9189       +3     
==========================================
+ Hits         6395     6407      +12     
+ Misses       1942     1933       -9     
  Partials      849      849              
Impacted Files Coverage Δ
jwk/refresh.go 76.95% <100.00%> (+3.82%) ⬆️
jwk/jwk.go 58.50% <0.00%> (+0.28%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 90b0d09...d7534a6. Read the comment docs.

@rockspore
Copy link
Author

This is close to what I was suggesting when I said "maybe pass a channel?"

I was thinking of something like this:

func (af *AutoRefresh) SnapshotOnChange() <-chan TargetSnapshot {
  // The channel is not closed here, and a snapshot will be
  // sent whenever there is a change including refresh time and error.
}

// client side
ch := ar.SnapshotOnChange()
var snapshot jwk.TargetSnapshot
for {
  select {
    case snapshot <- ch:
      // process the snapshot
    default:
  }
}

which is basically with the same idea as what you suggested. This wouldn't need any new structs or options. Otherwise, what you proposed makes more sense to me.

Async error stream means less polling so it will surely bring benefits but perhaps marginal. If the current PR looks fine to you, I'd love to see it merged and I'm happy to contribute on the async error streaming as a separate task.

@lestrrat
Copy link
Collaborator

Hmmmm, okay, so I was going to merge, but then I thought,
you're trying to get the last error, but what happens when the error is overwritten before you get to it?
This is totally possible in case a retry occurs, and a fetch is successful. It's an intermittent error, but would you want to catch that?

@lestrrat
Copy link
Collaborator

lestrrat commented Jul 29, 2021

@rockspore here's a possible solution from me: #421. This doesn't block, you can turn it on and off, and should be race free. Please check if this works for you

@rockspore
Copy link
Author

rockspore commented Jul 29, 2021

This is totally possible in case a retry occurs, and a fetch is successful. It's an intermittent error, but would you want to catch that?

Yes I am aware of the race condition when I tested this on the client side. Then I thought it would be very rare and not cause panic so I didn't pay attention to it. My purpose is mostly to catch incorrect JWKS URL configuration. But you're right it'd be nice to catch intermittent error as long as it doesn't require frequent polling.

#421 looks good! I will do some manual test with it shortly and get back to you. Thanks for the swift turnaround ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants