
Logging improvements #944

Merged: 5 commits into kroxylicious:main, Feb 8, 2024
Conversation

@robobario (Contributor) commented Feb 5, 2024

Type of change

  • Enhancement / new feature

Description

Closes #919

  1. Reduce the verbosity of KMS exceptions so that stack traces are not logged under the default configuration.
  2. Include details of the last exception when ResilientKms ceases retrying.
  3. Route JDK platform logging to SLF4J and quieten caffeine logs, which dumped stack traces at WARN on async load failure (we already have logging covering those failures).
  4. Redundantly cancel timeout futures in FilterHandler: when the filter failed exceptionally we also saw a misleading timeout log some time later, because the timeout was not cancelled on the exceptional path.

Additional Context

We want to expose some level of logging by default when the Filter completes its futures exceptionally or fails to communicate with the KMS. However, given the volume of events potentially traversing the proxy, we should avoid logging full stack traces under the default configuration, as that would produce a lot of noise. Instead, stack traces are logged at DEBUG level.
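A minimal sketch of this pattern, using JDK `java.util.logging` as a stand-in for the project's SLF4J loggers (the class and method names here are illustrative, not the actual Kroxylicious code):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class KmsFailureLogging {
    static final Logger LOGGER = Logger.getLogger(KmsFailureLogging.class.getName());

    // Default configuration: a one-line summary without a stack trace.
    // Debug configuration: the full exception (and its stack trace) is attached.
    static void logKmsFailure(Exception cause) {
        if (LOGGER.isLoggable(Level.FINE)) {
            LOGGER.log(Level.FINE, "KMS operation failed", cause);
        } else {
            LOGGER.warning("KMS operation failed: " + cause.getMessage());
        }
    }
}
```

With the default level (INFO), only the summary line is emitted; raising the logger to FINE (the analogue of DEBUG) restores the full stack trace.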

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Review performance test results. Ensure that any degradations to performance numbers are understood and justified.
  • Make sure all Sonar-Lint warnings are addressed or are justifiably ignored.
  • Update documentation
  • Reference relevant issue(s) and close them after merging
  • For user facing changes, update CHANGELOG.md (remember to include changes affecting the API of the test artefacts too).

kroxylicious-app/pom.xml (review thread, outdated, resolved)
@robobario robobario force-pushed the log-more-useful branch 2 times, most recently from 4ed76d1 to c7b7eaa Compare February 5, 2024 02:21
@robobario robobario added this to the 0.5.0 milestone Feb 5, 2024
@@ -46,3 +46,8 @@ Configuration:
additivity: false
AppenderRef:
- ref: STDOUT
- name: com.github.benmanes.caffeine.cache.LocalAsyncCache
Contributor:

what kind of errors do we expect?
what would a user do if they saw one?
i don't know caffeine, but seeing this makes me wonder if there is a programmatic way to discover cache errors that we should be tuning in to.

Contributor Author (robobario):

This block disables the WARN-level logging (with stack traces) that occurs when an exception is thrown during async cache loading. The class produces no ERROR logs. Caffeine in general takes the approach of minimal logging, except in these async loading cases where a failure could be invisible to a user who doesn't log the result of exceptionally completed futures. In our case these failures flow into our own logging.

Caffeine does support micrometer [1][2], so we should make a ticket to enable this to be switched on.

Member:

Does SLF4J have a FATAL level? I'm kinda on the fence about considering ERROR to be disabled. I guess it makes sense that we would still want to capture ERROR logs if they were to start being emitted by caffeine.

Just to write down for posterity the discussion we had on a call: we already attach to the future and capture the failures, so the logging from caffeine is redundant to us.

Contributor Author (robobario):

SLF4J has FATAL, yes, though I've rarely seen it used: "The FATAL level designates very severe error events that will presumably lead the application to abort."

LocalAsyncCache only produces WARN logs (currently); there are other usages of platform logging in caffeine, but this is the only one we know would log in our failed-KMS-operation case. Caffeine logging in general still goes to the root logger; only this one logger is targeted.

Contributor:

I don't understand the context of when we are seeing this exception. However, this thread makes me wonder: if we improve our exception handling, can we avoid relying on their logging for regular use cases?

I am surprised that Caffeine doesn't have an error listener. That seems odd to me.

Contributor Author (@robobario), Feb 7, 2024:

relying on their logging for regular use-cases

We aren't; we chain off their futures, and when they fail we handle and log the failure in ResilientKms, or eventually in FilterHandler if it causes the whole stage to fail. This change silences the redundant logging from within caffeine.

Contributor Author (@robobario), Feb 7, 2024:

The context in which this would happen: we instantiate an AsyncLoadingCache with a loader function like:

this.decryptorCache = Caffeine.newBuilder()
                .buildAsync((edek, executor) -> makeDecryptor(edek));

where makeDecryptor returns CompletableFuture<AesGcmEncryptor>. Calling the loader function is caffeine's responsibility when a client calls CompletableFuture<V> get(K key). Caffeine takes care of ensuring there is one in-flight future per key. If the future supplied by the loader function fails, all the futures from get that are waiting on it are failed as well. But if the user doesn't handle the exceptionally completed future properly (log it, etc.), the problem could be obscured.
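The behaviour described above (one in-flight future per key, with a failed load failing every waiter) can be emulated with JDK primitives. This is an illustrative sketch of the contract, not caffeine's actual implementation (caffeine additionally evicts failed entries, collects stats, and so on):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical class name; sketches the AsyncLoadingCache contract only.
public class AsyncCacheSketch<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> futures = new ConcurrentHashMap<>();
    private final Function<K, CompletableFuture<V>> loader;

    public AsyncCacheSketch(Function<K, CompletableFuture<V>> loader) {
        this.loader = loader;
    }

    public CompletableFuture<V> get(K key) {
        // computeIfAbsent guarantees the loader is invoked at most once per key,
        // so concurrent callers share one future; if that future completes
        // exceptionally, every waiter sees the same failure.
        return futures.computeIfAbsent(key, loader);
    }
}
```

If nobody attaches a handler to the returned future, a load failure is silent; caffeine's WARN log exists to catch exactly that case, which is why it is safe for this project to suppress it.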

pom.xml (review thread, resolved)
@SamBarker (Member) left a comment:

LGTM


Signed-off-by: Robert Young <robeyoun@redhat.com>
Signed-off-by: Robert Young <robeyoun@redhat.com>
Signed-off-by: Robert Young <robeyoun@redhat.com>
Why:
We already have logging covering failures to obtain KMS results

Signed-off-by: Robert Young <robeyoun@redhat.com>
Why:
If the filter completed its future exceptionally, the thenApplyAsync block
that cancelled the timeout future was never called. By using whenComplete
we cancel the timeout on both the success and failure paths. It is also safe
to redundantly cancel it if the failure was triggered by the timeout future
itself.

Signed-off-by: Robert Young <robeyoun@redhat.com>
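The whenComplete fix described in that commit message can be sketched with plain JDK CompletableFuture; the class and method names below are illustrative, not the real FilterHandler code:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutCancelSketch {
    static final ScheduledExecutorService SCHEDULER = Executors.newSingleThreadScheduledExecutor();

    static <T> CompletableFuture<T> withTimeout(CompletableFuture<T> work, long timeoutMs) {
        // Fail the work future if it is still incomplete when the timer fires.
        ScheduledFuture<?> timeout = SCHEDULER.schedule(
                () -> work.completeExceptionally(new TimeoutException("filter timed out")),
                timeoutMs, TimeUnit.MILLISECONDS);
        // whenComplete runs on both the success and the exceptional path, so the
        // timer is always cancelled; cancelling it redundantly (when the timeout
        // itself caused the failure) is harmless, since cancel() on a completed
        // ScheduledFuture is a no-op.
        return work.whenComplete((result, error) -> timeout.cancel(false));
    }
}
```

With a thenApply-style chain the cancel step is skipped on exceptional completion, leaving the timer to fire later and emit a misleading timeout log; whenComplete closes that gap.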

sonarcloud bot commented Feb 7, 2024

Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

1 New issue
0 Security Hotspots
89.7% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

@k-wall k-wall self-requested a review February 7, 2024 21:35
@k-wall (Contributor) left a comment:

lgtm

@robobario robobario merged commit 6e31322 into kroxylicious:main Feb 8, 2024
2 checks passed
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Ensure logging is adequate for debugging some envelope encryption issues
3 participants