
Refresh k8s client on 'Unauthorized' exceptions #337

Merged: 2 commits merged into fabric8io:master on Jun 24, 2022

Conversation

@PettitWesley (Contributor)

#323

Signed-off-by: Wesley Pettit <wppttt@amazon.com>
@PettitWesley (Contributor Author) commented Jun 9, 2022

@jcantrill Is this design for solving expired tokens acceptable to you? I've been testing it myself and it works. I work for/with Amazon EKS, and we have a lot of customers who need some sort of fix for expired tokens. Happy to change the code here to whatever would suit your standards.

@jimmidyson

@@ -121,6 +121,12 @@ def fetch_pod_metadata(namespace_name, pod_name)
rescue StandardError => e
@stats.bump(:pod_cache_api_nil_error)
log.debug "Exception '#{e}' encountered fetching pod metadata from Kubernetes API #{@apiVersion} endpoint #{@kubernetes_url}"
if e.message == "Unauthorized"
Contributor:

Is there a transport error of some kind that is thrown where we can evaluate a response code (e.g. 401)? This would seem to be more consistent.

Contributor Author:

Let me explain how I chose this option. But first, a disclaimer: to be honest, I am not a Ruby dev. I actually mostly work on Fluent Bit, but I was asked to work on this since it's important to our customers, so I mostly don't know what I'm doing here.

I had the same thought as you, though: what's the most canonical way to match this specific error?

So I ran this code and recorded what it output: https://github.com/PettitWesley/fluent-plugin-kubernetes_metadata_filter/blob/attempt_2/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb#L53

The result is the screenshot below. Basically, the string representation of the exception (what gets printed when you print it) contains an HTTP code, but the message and full_message fields do not, which is interesting. It felt wrong to match on that full string, and "Unauthorized" was easy to match on, so I picked it. It didn't seem like there was an actual field on the exception object that would give me the code, but maybe I just didn't know how to find it.

(Screenshot, 2022-06-06: exception output showing the HTTP status code in the exception's string representation, but not in its message or full_message fields.)
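For reference, a rough sketch of the kind of debug rescue that produces output like that (illustrative only; it assumes a kubeclient call such as @client.get_pod, and the error_code check assumes the raised exception may be a kubeclient HttpError exposing that attribute):

begin
  @client.get_pod(pod_name, namespace_name)
rescue StandardError => e
  # Compare the different representations of the exception: to_s tends to
  # include the HTTP status, while message may only contain "Unauthorized".
  log.debug "class=#{e.class} to_s=#{e} message=#{e.message}"
  log.debug "error_code=#{e.error_code}" if e.respond_to?(:error_code)
end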

Contributor:

I do believe using the 401 error_code to decide the correct path is the proper option here. You'll probably be able to use e.error_code instead of e.message.

Side note: this is most probably a temporary fix, as the underlying work should be done in the kubeclient library. The bigger issue is that there is no backport coming to 4.9.4+ and they are going straight to 5.x, which in turn means that all developers using this library will have to update to 5.x to get this feature.

Contributor:

@larivierec are you confirming this fix will be part of 5.x? Is there any reason we could not consume those changes in this library in lieu of making these larger changes to this plugin?


No, as I'm not a contributor I cannot guarantee that the fix will be in the library. However, based on master, you'll see that this support is already on the master branch.

Look at the following issue to be sure:
ManageIQ/kubeclient#561

Contributor Author:

My understanding was that the maintainers of that repo have not given any timeline for a release. Hence, it's not something we can use to get a fix out to users ASAP, which is what AWS Kubernetes users have requested and why I am working on this.

If there is any way you can think of that we can reach out to them to speed up that release, or any way that this project can consume that code from master, it's definitely preferable to my change.


I don't see a problem with the PR at all. imo, I would probably add this fix using the error_code rather than message. it's never a bad thing to rely solely on underlying libraries.

Contributor Author:

Cool!

it's never a bad thing to rely solely on underlying libraries.

Wait, I think you meant: it is a bad thing to rely solely on the underlying library, and we should have protections for basic things like unauthorized exceptions in this code base as well? That's what I was thinking too after I gave it some more thought... I also just updated this PR with the error_code change and I'm testing it now.
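For context, here is a rough sketch of what that error_code-based check might look like, mirroring the pod-metadata rescue in the diff above (illustrative only; it assumes the raised exception is a kubeclient HttpError exposing error_code, and that the plugin's create_client re-reads the service account token and rebuilds the client):

rescue StandardError => e
  @stats.bump(:pod_cache_api_nil_error)
  log.debug "Exception '#{e}' encountered fetching pod metadata from Kubernetes API #{@apiVersion} endpoint #{@kubernetes_url}"
  # 401 means the service account token was rejected (likely expired):
  # drop the cached client so it gets rebuilt with a freshly read token.
  if e.respond_to?(:error_code) && e.error_code == 401
    @client = nil
    create_client
  end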

@@ -153,12 +159,20 @@ def fetch_namespace_metadata(namespace_name)
rescue StandardError => e
@stats.bump(:namespace_cache_api_nil_error)
log.debug "Exception '#{e}' encountered fetching namespace metadata from Kubernetes API #{@apiVersion} endpoint #{@kubernetes_url}"
if e.message == "Unauthorized"
@client = nil
Contributor:

Maybe this should be in "create_client"
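To illustrate that suggestion (a sketch only, with a hypothetical helper name): the reset could live next to the client-creation logic, so every rescue block refreshes the client the same way instead of repeating @client = nil at each call site.

# Hypothetical helper, assuming the plugin's existing create_client method
# re-reads the service account token and rebuilds @client.
def refresh_client_on_unauthorized(error)
  return unless error.respond_to?(:error_code) && error.error_code == 401
  @client = nil
  create_client
end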

sleep(Thread.current[:namespace_watch_retry_backoff_interval])
Thread.current[:namespace_watch_retry_count] += 1
Thread.current[:namespace_watch_retry_backoff_interval] *= @watch_retry_exponential_backoff_base
if e.message == "Unauthorized"
Contributor:

same question here regarding evaluating against a known code instead of a string

@PettitWesley (Contributor Author)

I created this little test script and I am running it on a cluster where the token expiration is set to 1 hour. This should prove whether or not it works and is safe:

while true
do
  kubectl apply -f app.yaml
  sleep 444
  POD=$(kubectl get pods -o name)
  kubectl label --overwrite  $POD status=hi
  sleep 444
  kubectl label --overwrite  $POD status=testing
  sleep 444
  kubectl label --overwrite  $POD status=hopefullyitworks
  sleep 444
  kubectl apply -f app2.yaml
  kubectl create namespace test
  sleep 444
  kubectl delete -f app.yaml
  POD=$(kubectl get pods -o name)
  kubectl label --overwrite  $POD status=hi
  sleep 444
  kubectl label --overwrite  $POD status=testing
  sleep 444
  kubectl label --overwrite  $POD status=hopefullyitworks
  sleep 444
  kubectl delete -f app2.yaml
  kubectl delete namespace test
  sleep 444
done

Signed-off-by: Wesley Pettit <wppttt@amazon.com>
@jcantrill merged commit b80d65e into fabric8io:master on Jun 24, 2022
@PettitWesley (Contributor Author) commented Jun 26, 2022

@jcantrill I actually didn't think this was ready yet. I am still performing the final long running tests.

Here is the output after running for almost 100 hours with watch true, on a cluster with the token expiration set to 1 hour, so the token would need to be refreshed about 100 times in this run. There are no errors or failures that I can see, and below is the current log output. This was with my script that continually restarts the same 2 pods, creates and deletes a namespace, and changes labels on the pods.

2022-06-26 00:55:17 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 00:55:17.846993520 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 00:55:51.846743737 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 00:55:51 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 00:56:27 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 00:56:27.847043383 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 00:57:03 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 00:57:03.848472347 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 00:57:37 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 00:57:37.846396057 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 00:58:13 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 00:58:13.850007351 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 00:58:49.849233695 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 00:58:49 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 00:59:23.850891448 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 00:59:23 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 00:59:59 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 00:59:59.847002656 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:00:35 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:00:35.847178135 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:01:09 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:01:09.847133076 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:01:45 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:01:45.847048708 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:02:21 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:02:21.845743474 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:02:55.847117226 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:02:55 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:03:31.846258958 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:03:31 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:04:07.851858517 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:04:07 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:04:41.846592617 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:04:41 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:05:18.847502683 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:05:18 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:05:54 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:05:54.854894194 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:06:28 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:06:28.848381563 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:07:04 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:07:04.856832648 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:07:40 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:07:40.846404396 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:08:14.846693561 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:08:14 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:08:50 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:08:50.846042861 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}
2022-06-26 01:09:26 +0000 [info]: #0 stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292
2022-06-26 01:09:26.847326543 +0000 fluent.info: {"message":"stats - namespace_cache_size: 3, pod_cache_size: 3, namespace_cache_api_updates: 292, pod_cache_api_updates: 292, id_cache_miss: 292"}

@larivierec

This was with my script that continually restarts the same 2 pods and creates and deletes a namespace and changes labels on the pods.

I'd like a small clarification: do you restart your pods? I don't believe this would be the right way, considering a restarted pod will always reload the token from the new file.

Would it be possible to test without restarting the pods?

@PettitWesley (Contributor Author)

@larivierec

I'd like a small clarification, do you restart your pods?

My script restarts the app pods, not the Fluentd pod. This tests Fluentd picking up new pods, which requires a new request to the API server. See the kubectl delete -f app.yaml and kubectl delete -f app2.yaml lines in my script; those are my app pods. My Fluentd pod is fluentd.yaml and it was running continuously without restarts for the 100 hours. I checked.

@larivierec

Alright thanks for the clarification, just wanted to make sure! 👍

@PettitWesley (Contributor Author) commented Jun 27, 2022

@larivierec @jcantrill Unfortunately, this fix may not fully work. I sincerely apologize for this. There are two reasons why this happened:

  1. My EKS cluster that I used for the final testing was set up by my partner on this project, and I didn't fully understand how it works. Every time the instances refresh, the 1 hour token expiration is apparently removed, so during the time when I did this testing the cluster didn't actually work how I thought it did, and my results are not valid :( I am very sorry about this, and my team and I will work to remedy it and do better.
  2. I was expecting that you would wait to merge until I had stated that I had completed the full testing.

Let's fix this. I will post an update within a few hours on the status once I get the testing cluster working again. Again I apologize.

EDIT: OK actually I might have panicked too much, I think it fully works I just need to redo the testing to feel safe.

@PettitWesley (Contributor Author)

OK actually I might have panicked too much, I think it fully works I just need to redo the testing to feel safe.

@larivierec

I cannot really confirm that this is working at the moment. The version is properly updated in the gems, so it should be running the latest build. I tested this on a 1.21 EKS cluster, and since with the BoundServiceAccountTokenVolume refresh on AWS the token only fully expires after 90 days (I think), there's no way for me to know for sure within a valid time frame.

I know for a fact that I still receive these messages:

annotations.authentication.k8s.io/stale-token
subject: system:serviceaccount:logging:fluentd-es, seconds after warning threshold: 1895
annotations.authorization.k8s.io/decision: allow
annotations.authorization.k8s.io/reason: RBAC: allowed by ClusterRoleBinding "fluentd-es" of ClusterRole "fluentd-es" to ServiceAccount "fluentd-es/logging"

@PettitWesley (Contributor Author)

@larivierec Yeah, so this is why my testing was invalid. The code I wrote is reactive, not proactive: it refreshes when the token expires, not when it's stale. Since the 1 hour token expiration on my cluster was removed due to a communication mistake, my testing isn't valid, since the token never expired in my latest tests. (I did do an earlier test with the 1 hour expiration and it worked, but then I made a few changes, and in that earlier test I didn't check changing labels and all the other things my script does... hence I think it works, but I can't be certain yet.)

@PettitWesley (Contributor Author)

We have fixed our cluster to expire tokens in 1 hour. So we will know soon if this works. I'm not too worried...

@PettitWesley (Contributor Author)

We have now validated that the watch true behavior works... so it should be good. To be safe I will let this test run a bit longer and then put in a test for watch false overnight.

@PettitWesley (Contributor Author)

And we have now validated that watch false fully works! Sorry for the panic! But thorough pre-release testing is always important!

@hvaghani221

@PettitWesley From what I have understood, the token will be updated when the client receives an unauthorized error. So it is expected that we will see "seconds after warning threshold" in the API server audit logs once every hour.

@smrutimandal

@PettitWesley Thanks for this. Just to clarify, does it also mean that for EKS 1.21, the token won't be refreshed for 90 days as the kube-apiserver doesn't throw a 401 for the stale tokens for 90 days?

I am running the following:

eks: 1.21
fluentd image: v1.14.6-debian-1.1
fluent-plugin-kubernetes_metadata_filter gem: 2.11.1

@PettitWesley (Contributor Author)

@smrutimandal Correct.

@Ga13Ou commented Sep 27, 2023

We have fixed our cluster to expire tokens in 1 hour. So we will know soon if this works. I'm not too worried...

@PettitWesley Can you tell us how you managed to configure your EKS cluster to expire tokens in 1 hour? It seems like such configuration can only be done at the apiserver level, which is not customizable when using EKS.

@PettitWesley (Contributor Author)

Sorry, I no longer remember.

@PettitWesley (Contributor Author)

There was some API command that allowed us to set that. Search the docs and GitHub.
