New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for TLS certificate rotation in OPA's HTTP server #2500
Comments
OPA does load the certificate and key from disk on every call to the remote service. This method is called on every call. It would be helpful to see OPA's debug logs to figure out what's going on. |
@patoarvizu thanks for filing this. As @ashutosh-narkar mentioned, other parts of OPA (e.g., all of the service clients for bundle downloading and decision log uploading, etc.) do reload certs (and do so on each request for simplicity.) However, in the case of the HTTP server, you're right that the certs are only loaded once at startup. I'd be wary about relying on file watching mechanisms due to flakyness across platforms. Perhaps a periodic reload (e.g., check every 10 seconds) would be enough. @patrick-east probably has some thoughts here. |
Thanks for the response @ashutosh-narkar @tsandall For additional reference, it seems like Gatekeeper is also rotating certs using periodic checks (https://github.com/open-policy-agent/gatekeeper/blob/master/pkg/webhook/certs.go), although the Gatekeeper case is a bit different because it is generating and injecting its own certificates (as opposed to using something external like cert-manager), so it has a little more control over it, but it seems like a similar approach could be used here. |
Sorry @patoarvizu, I missed that you were referring to the server. +1 for using using periodic checks. |
Any updates on this? This is a big blocker for us as well. Our certificates are valid for 24 hours only. |
@wma1729 there hasn't been any progress on this though we could prioritize it over the next release or two. In your environment, would it work if OPA just periodically reloaded the certificates from disk? In other words, OPA would re-read the certificates every X seconds. If the read succeeds, it would be update the certificate used by the server. If the read fails, OPA would minimally log something to the console at ERROR level to indicate the certificate reload could not be performed. OPA would continue using the last successfully loaded certificate. The reload period could be configurable. What default would you like to see? |
Yes. That should be fine. We use autocert in AKS. Autocert runs as a
sidecar and generates cert every 24 hours. The file containing the
certificate is updated every time the cert is renewed. So yes, if opa can
reload the file periodically, that would be fine.
…On Mon, Nov 29, 2021, 6:47 PM Torin Sandall ***@***.***> wrote:
@wma1729 <https://github.com/wma1729> there hasn't been any progress on
this though we could prioritize it over the next release or two. In your
environment, would it work if OPA just periodically reloaded the
certificates from disk? In other words, OPA would re-read the certificates
every X seconds. If the read succeeds, it would be update the certificate
used by the server. If the read fails, OPA would minimally log something to
the console at ERROR level to indicate the certificate reload could not be
performed. OPA would continue using the last successfully loaded
certificate. The reload period could be configurable. What default would
you like to see?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#2500 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AEULWO6F6PM74EOSJGNGBPLUOQNIXANCNFSM4OLUNXIA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
Thought some more on this. I, somehow, am okay with the watch approach as well. In fact, I like it more. May I know your concerns with watchers? "I'd be wary about relying on file watching mechanisms due to flakyness across platforms." What platforms have troubled you? Is Windows your concern? Usually the systems that update cert file periodically updates the file way before the cert actually expires... For example if the cert expires in 24 hours... the cert is usually renewed after (24 - T) hours where T could be 1 hour or 30 minutes or 5 minutes but rarely 30 seconds... so we should have enough time to detect the change. If we go with periodic reload option, please record the sha digest of the file and reload the certificate only when the sha changes. And a default of 5 minutes should be good IMHO. But as long as it is configurable, we should be okay. |
Let's go ahead with a periodic reload. Hashing the cert file sounds fine assuming the reload on the server is expensive. I don't know enough about the http/tls package in Go to say which approach is better... @srenatus what do you think?
I'm concerned about relying on inotify() under the hood. Maybe it's improved and more reliable these days but in the past it was not something I'd want to depend on (e.g., if the watch doesn't fire as expected, we'll get a bug report). |
Related to #4107, I ran a quick benchmark of what the costs of different scenarios, given the certs have not changed, would be:
codepackage server
import (
"bytes"
"crypto/sha256"
"crypto/tls"
"io"
"os"
"sync/atomic"
"testing"
)
func BenchmarkCertReload(b *testing.B) {
certFile := "../test/e2e/certrefresh/testdata/server-cert.pem"
certKeyFile := "../test/e2e/certrefresh/testdata/server-key.pem"
b.Run("Load and Store", func(b *testing.B) {
var val atomic.Value
b.ResetTimer()
for n := 0; n < b.N; n++ {
cert, err := tls.LoadX509KeyPair(certFile, certKeyFile)
if err != nil {
b.Fatal(err)
}
val.Store(&cert)
}
})
b.Run("Load and Compare bytes", func(b *testing.B) {
oldCert, err := tls.LoadX509KeyPair(certFile, certKeyFile)
if err != nil {
b.Fatal(err)
}
b.ResetTimer()
for n := 0; n < b.N; n++ {
cert, err := tls.LoadX509KeyPair(certFile, certKeyFile)
if err != nil {
b.Fatal(err)
}
if !bytes.Equal(oldCert.Certificate[0], cert.Certificate[0]) {
b.Error("expected equal certs")
}
}
})
b.Run("Compare sums of files", func(b *testing.B) {
hashCert := hash(b, certFile)
hashKey := hash(b, certKeyFile)
b.ResetTimer()
for n := 0; n < b.N; n++ {
newHashCert := hash(b, certFile)
newHashKey := hash(b, certKeyFile)
if !bytes.Equal(newHashKey, hashKey) || !bytes.Equal(newHashCert, hashCert) {
b.Error("expected equal certs")
}
}
})
}
func hash(b *testing.B, file string) []byte {
b.Helper()
f, err := os.Open(file)
if err != nil {
b.Fatal(err)
}
defer f.Close()
h := sha256.New()
if _, err := io.Copy(h, f); err != nil {
b.Fatal(err)
}
return h.Sum(nil)
} I don't think there's a clear winner here, is there? The checksumming approach (3rd one) has less allocations but more memory usage... but overall, it doesn't look terrible. If any of these things happen every 5 minutes, no harm is done, I believe. The first approach is the simplest when it comes to the underlying code: it's just a |
This adds a new flag to `opa run`, intended for server usage with HTTPS listeners: `--tls-cert-refresh-period`. If used with a positive duration, such as "5m" (5 minutes), "24h", etc, the server will track the certificate and key files' contents. When their content changes, the certificates will be reloaded. On an error in reloading, it will log (info) the error and try again in the next round. Fixes #2500. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
When is the new release coming? thanks |
Tomorrow, most likely. |
Expected Behavior
If an OPA server is running HTTPS (i.e. with
--tls-cert-file
) and the file on disk changes, OPA should have a mechanism for reloading the cert. This is useful for when the certificate is rotated periodically either manually or dynamically (e.g. withcert-manager
).Actual Behavior
OPA only loads the certificate once at startup time, and if the life of the server outlasts the validity period of the certificate it originally loaded, requests will fail, even if a new certificate with an extended expiration time exists on disk in the same location.
Steps to Reproduce the Problem
(Full example manifests below)
ClusterIssuer
and aCertificate
. Make theCertificate
very short-lived, (e.g. 5m).Certificate
above and passing the appropriate--tls
flags to use that certificate, and make sure the container listens on the TLS port. Make sure you have the appropriate configuration, service account, roles, role bindings, etc.Service
pointing to the HTTPS port on theDeployment
.ValidatingWebhookConfiguration
to capturepods
and point them to theopa
service created above.Pod
, it doesn't matter if a policy was applied properly or not. Could be something like:kubectl -n opa logs deployment/opa -c opa
. There should be no errors.2020/06/29 21:28:44 http: TLS handshake error from 10.42.0.1:33201: remote error: tls: bad certificate
Full manifests to deploy OPA:
Additional Info
This could be solved by having a "watch" on the file on disk and update a cached version of the certificate if it changes, and implementing
tls.Config.GetCertificate
to run a function to return the cached certificate. As a reference, other tools that implement this are vault-k8s (https://github.com/hashicorp/vault-k8s/blob/master/subcommand/injector/command.go, https://github.com/hashicorp/vault-k8s/blob/master/helper/cert/source_disk.go), or cert-manager itself (https://github.com/jetstack/cert-manager/blob/master/pkg/webhook/server/tls/file_source.go). I also implemented a somewhat simpler version of it on vault-agent-auto-inject-webhook (https://github.com/patoarvizu/vault-agent-auto-inject-webhook/blob/master/cmd/webhook.go). In the case of vault-k8s and vault-agent-auto-inject-webhook the watch is done using https://github.com/radovskyb/watcher, and cert-manager implements it withtime.Ticker
.An alternative to implementing
GetCertificate
would be to force the pod to restart withos.Exit(0)
on a file update event, although depending on how frequently the pod is forced to restart, it can cause unintended consequences, so the approach above is probably cleaner.The text was updated successfully, but these errors were encountered: