
Logging improvements #944

Merged: 5 commits into kroxylicious:main, Feb 8, 2024
Conversation

@robobario (Contributor) commented Feb 5, 2024

Type of change

  • Enhancement / new feature

Description

Closes #919

  1. Reduce the verbosity of KMS exceptions so that stack traces are not logged under the default configuration.
  2. Include details of the last exception when ResilientKms ceases retrying.
  3. Route JDK platform logging to SLF4J and quieten caffeine logs, which dumped stack traces at WARN on async load failure (we already have logging covering those failures).
  4. Redundantly cancel timeout futures in FilterHandler: when the filter failed exceptionally we also saw a misleading timeout log some time later, because the timeout was not cancelled on the exceptional path.

Additional Context

We want to expose some level of logging by default when the Filter completes its futures exceptionally or fails to communicate with the KMS. However, given the volume of events potentially traversing the proxy, we should avoid logging full stack traces under the default configuration, as that would produce a lot of noise. Instead, stack traces are logged at DEBUG level.
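A minimal sketch of this pattern, using JDK `java.util.logging` as a stand-in for the project's SLF4J loggers (the class and method names here are illustrative, not the actual Kroxylicious code):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class KmsFailureLogging {
    static final Logger LOGGER = Logger.getLogger(KmsFailureLogging.class.getName());

    // Default configuration: a one-line summary without a stack trace.
    // Debug configuration: the full exception (and its stack trace) is attached.
    static void logKmsFailure(Exception cause) {
        if (LOGGER.isLoggable(Level.FINE)) {
            LOGGER.log(Level.FINE, "KMS operation failed", cause);
        } else {
            LOGGER.warning("KMS operation failed: " + cause.getMessage());
        }
    }
}
```

With the default level (INFO), only the summary line is emitted; raising the logger to FINE (the analogue of DEBUG) restores the full stack trace.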

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Review performance test results. Ensure that any degradations to performance numbers are understood and justified.
  • Make sure all Sonar-Lint warnings are addressed or are justifiably ignored.
  • Update documentation
  • Reference relevant issue(s) and close them after merging
  • For user facing changes, update CHANGELOG.md (remember to include changes affecting the API of the test artefacts too).

kroxylicious-app/pom.xml (review thread, outdated, resolved)
@robobario robobario force-pushed the log-more-useful branch 2 times, most recently from 4ed76d1 to c7b7eaa Compare February 5, 2024 02:21
@robobario robobario added this to the 0.5.0 milestone Feb 5, 2024
@@ -46,3 +46,8 @@ Configuration:
additivity: false
AppenderRef:
- ref: STDOUT
- name: com.github.benmanes.caffeine.cache.LocalAsyncCache
Contributor:

what kind of errors do we expect?
what would a user do if they saw one?
i don't know caffeine, but seeing this makes me wonder if there is a programmatic way to discover cache errors that we should be tuning in to.

Contributor Author (robobario):

This block disables the WARN-level logging (with stack traces) that occurs when an exception is thrown during async cache loading. The class produces no ERROR logs. Caffeine in general takes the approach of minimal logging, except in these async loading cases where a failure could be invisible to a user who doesn't log the result of exceptionally completed futures. In our case these failures flow into our own logging.

Caffeine does support micrometer [1][2], so we should make a ticket to enable this to be switched on.

Member:

Does SLF4J have a FATAL level? I'm kinda on the fence about considering ERROR to be disabled. I guess it makes sense that we would still want to capture ERROR logs if they were to start being emitted by caffeine.

Just to write down for posterity the discussion we had on a call: we already attach to the future and capture the failures, so the logging from caffeine is redundant to us.

Contributor Author (robobario):

SLF4J has FATAL, yes, though I've rarely seen it used: "The FATAL level designates very severe error events that will presumably lead the application to abort."

LocalAsyncCache only produces WARN logs (currently); there are other usages of platform logging in caffeine, but this is the only one we know would log in our failed-KMS-operation case. Caffeine logging in general still goes to the root logger; only this one logger is targeted.

Contributor:

I don't understand the context of when we are seeing this exception. However, this thread makes me wonder: if we improve our exception handling, can we avoid relying on their logging for regular use cases?

I am surprised that Caffeine doesn't have an error listener. That seems odd to me.

Contributor Author (@robobario), Feb 7, 2024:

relying on their logging for regular use-cases

We aren't; we chain off their futures, and when they fail we handle and log the failure in ResilientKms, or eventually in FilterHandler if it causes the whole stage to fail. This change silences the redundant logging from within caffeine.

Contributor Author (@robobario), Feb 7, 2024:

The context in which this would happen: we instantiate an AsyncLoadingCache with a loader function like:

this.decryptorCache = Caffeine.newBuilder()
                .buildAsync((edek, executor) -> makeDecryptor(edek));

where makeDecryptor returns CompletableFuture<AesGcmEncryptor>. Calling the loader function is caffeine's responsibility when a client calls CompletableFuture<V> get(K key). Caffeine takes care of ensuring there is one in-flight future per key. If the future supplied by the loader function fails, all the futures from get that are waiting on it are failed as well. But if the user doesn't handle the exceptionally completed future properly (log it, etc.), the problem could be obscured.
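The behaviour described above (one in-flight future per key, with a failed load failing every waiter) can be emulated with JDK primitives. This is an illustrative sketch of the contract, not caffeine's actual implementation (caffeine additionally evicts failed entries, collects stats, and so on):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical class name; sketches the AsyncLoadingCache contract only.
public class AsyncCacheSketch<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> futures = new ConcurrentHashMap<>();
    private final Function<K, CompletableFuture<V>> loader;

    public AsyncCacheSketch(Function<K, CompletableFuture<V>> loader) {
        this.loader = loader;
    }

    public CompletableFuture<V> get(K key) {
        // computeIfAbsent guarantees the loader is invoked at most once per key,
        // so concurrent callers share one future; if that future completes
        // exceptionally, every waiter sees the same failure.
        return futures.computeIfAbsent(key, loader);
    }
}
```

If nobody attaches a handler to the returned future, a load failure is silent; caffeine's WARN log exists to catch exactly that case, which is why it is safe for this project to suppress it.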

pom.xml (review thread, resolved)
@SamBarker (Member) left a comment:

LGTM


Signed-off-by: Robert Young <robeyoun@redhat.com>
Signed-off-by: Robert Young <robeyoun@redhat.com>
Signed-off-by: Robert Young <robeyoun@redhat.com>
Why:
We already have logging covering failures to obtain KMS results

Signed-off-by: Robert Young <robeyoun@redhat.com>
Why:
If the filter completed its future exceptionally, the thenApplyAsync block
that cancelled the timeout future was never called. By using whenComplete
we cancel the timeout on both the success and failure paths. It is also safe
to redundantly cancel it if the failure was triggered by the timeout future
itself.

Signed-off-by: Robert Young <robeyoun@redhat.com>
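The whenComplete fix described in that commit message can be sketched with plain JDK CompletableFuture; the class and method names below are illustrative, not the real FilterHandler code:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutCancelSketch {
    static final ScheduledExecutorService SCHEDULER = Executors.newSingleThreadScheduledExecutor();

    static <T> CompletableFuture<T> withTimeout(CompletableFuture<T> work, long timeoutMs) {
        // Fail the work future if it is still incomplete when the timer fires.
        ScheduledFuture<?> timeout = SCHEDULER.schedule(
                () -> work.completeExceptionally(new TimeoutException("filter timed out")),
                timeoutMs, TimeUnit.MILLISECONDS);
        // whenComplete runs on both the success and the exceptional path, so the
        // timer is always cancelled; cancelling it redundantly (when the timeout
        // itself caused the failure) is harmless, since cancel() on a completed
        // ScheduledFuture is a no-op.
        return work.whenComplete((result, error) -> timeout.cancel(false));
    }
}
```

With a thenApply-style chain the cancel step is skipped on exceptional completion, leaving the timer to fire later and emit a misleading timeout log; whenComplete closes that gap.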

sonarcloud bot commented Feb 7, 2024

Quality Gate passed

The SonarCloud Quality Gate passed, but some issues were introduced.

1 New issue
0 Security Hotspots
89.7% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

@k-wall k-wall self-requested a review February 7, 2024 21:35
@k-wall (Contributor) left a comment:

lgtm

@robobario robobario merged commit 6e31322 into kroxylicious:main Feb 8, 2024
2 checks passed
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Ensure logging is adequate for debugging some envelope encryption issues
3 participants