Refresh k8s client on 'Unauthorized' exceptions #337
Conversation
Signed-off-by: Wesley Pettit <wppttt@amazon.com>
@jcantrill Is this design for solving expired tokens acceptable to you? I've been testing it myself and it works. I work for/with Amazon EKS and we have a lot of customers who need some sort of fix for expired tokens. Happy to change the code here to whatever would suit your standards.
@@ -121,6 +121,12 @@ def fetch_pod_metadata(namespace_name, pod_name)
rescue StandardError => e
@stats.bump(:pod_cache_api_nil_error)
log.debug "Exception '#{e}' encountered fetching pod metadata from Kubernetes API #{@apiVersion} endpoint #{@kubernetes_url}"
if e.message == "Unauthorized"
Is there a transport error of some kind that is thrown where we can evaluate a response code (e.g. 401)? This would seem more consistent.
Let me explain how I chose this option. But first, a disclaimer: to be honest, I am not a Ruby dev. I actually mostly work on Fluent Bit, but I was asked to work on this since it's important to our customers, so I mostly don't know what I'm doing here.
I had the same thought as you though: what's the most canonical way to match this specific error?
So I ran this code and recorded what it outputted: https://github.com/PettitWesley/fluent-plugin-kubernetes_metadata_filter/blob/attempt_2/lib/fluent/plugin/kubernetes_metadata_watch_pods.rb#L53
And the result is this screenshot. Basically, the string representation of the exception, which is what gets printed when you print it, includes an HTTP code (but not in the message or full_message fields, which is interesting). It felt wrong to match on that full string, and "Unauthorized" was easy to match on, so I picked it. It didn't seem like there was an actual field on the exception object that would give me the code, but maybe I just didn't know how to find it.
@PettitWesley have you tried rescuing on KubeClient::HttpError? https://github.com/ManageIQ/kubeclient/blob/master/lib/kubeclient/http_error.rb#L6
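A sketch of what rescuing on the error class and matching the numeric code could look like. Since kubeclient may not be installed here, FakeHttpError below is a stand-in that only mirrors the shape of Kubeclient::HttpError linked above (an error_code attribute plus the message); the helper name unauthorized? is illustrative, not part of the plugin.

```ruby
# Stand-in mirroring Kubeclient::HttpError's shape (error_code + message),
# so the matching logic can be exercised without a cluster or the gem.
class FakeHttpError < StandardError
  attr_reader :error_code

  def initialize(error_code, message)
    @error_code = error_code
    super(message)
  end
end

# Prefer the numeric code over string matching: it survives wording changes
# in the exception message and ignores unrelated errors entirely.
def unauthorized?(e)
  e.respond_to?(:error_code) && e.error_code == 401
end

e = FakeHttpError.new(401, 'HTTP status code 401, Unauthorized')
puts unauthorized?(e)   # => true
```

The respond_to? guard keeps the check safe when a plain StandardError (with no error_code) reaches the same rescue block.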
I do believe using the 401 error_code to decide the correct path is the proper option here. You'll probably be able to use e.error_code instead of e.message.
Side note: this is most probably a temporary fix, as the underlying work should be done in the kubeclient library.
The bigger issue is that there is no backport coming to 4.9.4+ and they are going straight to 5.x, which in turn means that all developers using this library will have to update to 5.x to get this feature.
@larivierec are you confirming this fix will be part of 5.x? Is there any reason we could not consume those changes in this library in lieu of making these larger changes to this plugin?
No, as I'm not a contributor I cannot guarantee that the fix will be in the library.
However, you'll see that this support already exists on the master branch.
Look at the following issue to be sure.
ManageIQ/kubeclient#561
My understanding was that the maintainers of that repo have not given any timeline for a release. Hence, it's not something we can use to get a fix out to users ASAP, which is what AWS Kubernetes users have requested and why I am working on this.
If there is any way you can think of to reach out to them to speed up that release, or any way this project can consume that code from master, that is definitely preferable to my change.
I don't see a problem with the PR at all. IMO, I would probably add this fix using the error_code rather than the message. it's never a bad thing to rely solely on underlying libraries.
Cool!
it's never a bad thing to rely solely on underlying libraries.
Wait, I think you meant: it is a bad thing to rely solely on the underlying library; we should have protections for basic things like unauthorized exceptions in this code base as well? That's what I was thinking too after I gave it some more thought... I also just updated this PR with the error_code change, and I'm testing it now.
@@ -153,12 +159,20 @@ def fetch_namespace_metadata(namespace_name)
rescue StandardError => e
@stats.bump(:namespace_cache_api_nil_error)
log.debug "Exception '#{e}' encountered fetching namespace metadata from Kubernetes API #{@apiVersion} endpoint #{@kubernetes_url}"
if e.message == "Unauthorized"
@client = nil
Maybe this should be in "create_client"
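A minimal sketch of the "reset and re-create" idea suggested here: route all access through an accessor that lazily rebuilds the client, so handling an expired token is just dropping the cached instance. MetadataFetcher, the builder block, and reset_client! are hypothetical names for illustration, not the plugin's actual structure.

```ruby
# Sketch: callers never hold the client directly; they go through #client,
# which rebuilds it on demand. A 401 handler then only needs reset_client!.
class MetadataFetcher
  def initialize(&builder)
    @builder = builder   # e.g. a block that constructs a Kubeclient::Client
    @client = nil
  end

  def client
    @client ||= @builder.call   # re-created (re-reading the token) after a reset
  end

  def reset_client!
    @client = nil
  end
end

f = MetadataFetcher.new { Object.new }   # Object.new stands in for a real client
first = f.client
f.reset_client!
puts first.equal?(f.client)   # => false (a fresh client after the reset)
```

This keeps the refresh logic in one place (the accessor) instead of scattering nil-assignments through every rescue block.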
sleep(Thread.current[:namespace_watch_retry_backoff_interval])
Thread.current[:namespace_watch_retry_count] += 1
Thread.current[:namespace_watch_retry_backoff_interval] *= @watch_retry_exponential_backoff_base
if e.message == "Unauthorized"
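An illustrative, self-contained version of the retry pattern in the diff above: exponential backoff between attempts, capped at a maximum retry count. A real version would also reset the client when the rescued error carries a 401 error_code; all names here are assumptions for the sketch, and the sleeper is injectable so the demo runs instantly.

```ruby
# Retry a watch-style block with exponential backoff, as the plugin's
# watch loop does with Thread-local counters. Returns the retry count.
def watch_with_retry(max_retries:, base: 2, sleeper: method(:sleep))
  interval = 1
  retries = 0
  begin
    yield
  rescue StandardError
    raise if retries >= max_retries   # give up: re-raise the original error
    sleeper.call(interval)            # back off before retrying
    retries += 1
    interval *= base                  # exponential growth, as in the plugin
    # a real version would reset the client here if the error was a 401
    retry
  end
  retries
end

calls = 0
# No-op sleeper so the example finishes immediately.
puts watch_with_retry(max_retries: 3, sleeper: ->(_s) {}) { calls += 1; raise 'boom' if calls < 3 }
# => 2 (two failures, then success on the third attempt)
```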
same question here regarding evaluating against a known code instead of a string
I created this little test script and I am running it on a cluster where the token expiration is set to 1 hour. This should prove whether it works and is safe:
Signed-off-by: Wesley Pettit <wppttt@amazon.com>
Force-pushed from 7f16705 to d9d5ba1
@jcantrill I actually didn't think this was ready yet. I am still performing the final long-running tests. Here is the output after running for almost 100 hours, with
I'd like a small clarification: do you restart your pods? Would it be possible to test without restarting the pods?
My script restarts the app pods, not the Fluentd pod. This tests Fluentd picking up new pods, which requires a new request to the API server. See in my script the
Alright, thanks for the clarification, just wanted to make sure! 👍
@larivierec @jcantrill Unfortunately, this fix may not fully work. I sincerely apologize for this. There are two reasons why this happened:
Let's fix this. I will post an update on the status within a few hours, once I get the testing cluster working again. Again, I apologize. EDIT: OK, actually I might have panicked too much; I think it fully works, I just need to redo the testing to feel safe.
As I cannot really confirm that this is working at the moment: the version is properly updated in the gems, so it should be running the latest build, yet I know for a fact that I still receive these messages:
@larivierec Yeah, so this is why my testing was invalid. The code I wrote is reactive, not proactive: it refreshes when the token expires, not when it's stale. Since the 1-hour token refresh on my cluster was removed due to a communication mistake, my testing isn't valid, as the token never expired in my latest tests. (I did do a test earlier with the 1-hour expiration and it worked, but then I made a few changes, and in that earlier test I didn't check changing labels and all the things my script does... hence I think it works, but I can't be certain yet.)
We have fixed our cluster to expire tokens in 1 hour, so we will know soon if this works. I'm not too worried...
We have now validated that the
And we have now validated that
@PettitWesley From what I have understood, the token will be updated when the client receives an unauthorised error. So it is expected that we will see
@PettitWesley Thanks for this. Just to clarify, does it also mean that for EKS 1.21, the token won't be refreshed for 90 days, since the kube-apiserver doesn't throw a 401 for stale tokens for 90 days? I am running the following:
@smrutimandal Correct.
@PettitWesley Can you tell how you managed to configure your EKS cluster to expire tokens in 1 hour? It seems like such a configuration can only be done at the apiserver level, which is not customizable when using EKS.
Sorry, I no longer remember.
There was some API command that allowed us to set that. Search the docs and GitHub.
Signed-off-by: Wesley Pettit wppttt@amazon.com
#323