Add support for TLS certificate rotation in OPA's HTTP server #2500

patoarvizu · 2020-06-29T22:12:57Z

Expected Behavior

If an OPA server is running HTTPS (i.e. with --tls-cert-file) and the file on disk changes, OPA should have a mechanism for reloading the cert. This is useful for when the certificate is rotated periodically either manually or dynamically (e.g. with cert-manager).

Actual Behavior

OPA only loads the certificate once at startup time, and if the life of the server outlasts the validity period of the certificate it originally loaded, requests will fail, even if a new certificate with an extended expiration time exists on disk in the same location.

Steps to Reproduce the Problem

(Full example manifests below)

Have a cluster with cert-manager deployed.
Create a cert-manager ClusterIssuer and a Certificate. Make the Certificate very short-lived (e.g. duration: 10m with renewBefore: 5m, as in the manifests below).
Deploy OPA mounting the secret created by the Certificate above and passing the appropriate --tls flags to use that certificate, and make sure the container listens on the TLS port. Make sure you have the appropriate configuration, service account, roles, role bindings, etc.
Create an OPA Service pointing to the HTTPS port on the Deployment.
Create a ValidatingWebhookConfiguration to capture pods and point them to the opa service created above.
Deploy any Pod; it doesn't matter whether a policy was applied properly or not. It could be something like:

apiVersion: v1
kind: Pod
metadata:
  name: echo
  namespace: default
spec:
  containers:
  - name: echo
    image: hashicorp/http-echo:latest
    args:
    - -listen
    - ":8080"
    - -text
    - "hello world"

Delete the pod.
Check the OPA logs, i.e. kubectl -n opa logs deployment/opa -c opa. There should be no errors.
Wait 10-11 minutes to make sure cert-manager rotated the certificates.
Try to delete the same pod as above.
You'll see an error along the lines of 2020/06/29 21:28:44 http: TLS handshake error from 10.42.0.1:33201: remote error: tls: bad certificate

Full manifests to deploy OPA:

apiVersion: v1
kind: Namespace
metadata:
  name: opa
  labels:
    opa-control-plane: "true"
---
apiVersion: cert-manager.io/v1alpha2
kind: ClusterIssuer
metadata:
  name: selfsigning-issuer
spec:
  selfSigned: {}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: opa-role
rules:
- apiGroups:
  - '*'
  resources:
  - '*'
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: opa
  namespace: opa
  labels:
    opa-control-plane: "true"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: null
  name: opa-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: opa-role
subjects:
- kind: ServiceAccount
  name: opa
  namespace: opa
---
kind: Service
apiVersion: v1
metadata:
  name: opa
  namespace: opa
  labels:
    opa-control-plane: "true"
spec:
  selector:
    app: opa
  ports:
  - name: https
    protocol: TCP
    port: 443
    targetPort: https
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: opa
    opa-control-plane: "true"
  name: opa
  namespace: opa
spec:
  selector:
    matchLabels:
      app: opa
  template:
    metadata:
      labels:
        app: opa
        opa-control-plane: "true"
      name: opa
    spec:
      serviceAccountName: opa
      containers:
      - name: opa
        image: openpolicyagent/opa:0.21.0
        args:
        - run
        - --server
        - --tls-cert-file=/certs/tls.crt
        - --tls-private-key-file=/certs/tls.key
        - --addr=https://0.0.0.0:443
        - --addr=http://127.0.0.1:8181
        - --log-level=error
        volumeMounts:
          - readOnly: true
            mountPath: /certs
            name: opa-server
        readinessProbe:
          httpGet:
            path: /health
            scheme: HTTPS
            port: https
          initialDelaySeconds: 3
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            scheme: HTTPS
            port: https
        ports:
        - containerPort: 443
          name: https
      - name: kube-mgmt
        image: openpolicyagent/kube-mgmt:0.11
        args:
        - --policies=opa
        - --enable-data=true
      volumes:
      - name: opa-server
        secret:
          secretName: opa-server-secret
---
apiVersion: cert-manager.io/v1alpha2
kind: Certificate
metadata:
  name: opa-webhook
  namespace: opa
spec:
  secretName: opa-server-secret
  duration: 10m
  renewBefore: 5m
  commonName: opa
  dnsNames:
  - opa
  - opa.opa
  - opa.opa.svc
  issuerRef:
    name: selfsigning-issuer
    kind: ClusterIssuer
---
kind: ValidatingWebhookConfiguration
apiVersion: admissionregistration.k8s.io/v1beta1
metadata:
  name: opa-validating-webhook
  annotations:
    cert-manager.io/inject-ca-from: opa/opa-webhook
webhooks:
- name: validating-webhook.openpolicyagent.org
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    resources:
    - pods
  clientConfig:
    caBundle: Cg==
    service:
      namespace: opa
      name: opa
  namespaceSelector:
    matchExpressions:
    - key: opa-control-plane
      operator: DoesNotExist
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: opa-default-system-main
  namespace: opa
  labels:
    openpolicyagent.org/policy: rego
data:
  main: |
    package system

    import data.kubernetes.admission

    main = {
      "apiVersion": "admission.k8s.io/v1beta1",
      "kind": "AdmissionReview",
      "response": response,
    }

    default response = {"allowed": true}

    response = {
        "allowed": false,
        "status": {
            "reason": reason,
        },
    }{
        reason = concat(", ", admission.deny)
        reason != ""
    }

Additional Info

This could be solved by keeping a "watch" on the file on disk, updating a cached version of the certificate whenever the file changes, and implementing tls.Config.GetCertificate to return the cached certificate. As a reference, other tools that implement this are vault-k8s (https://github.com/hashicorp/vault-k8s/blob/master/subcommand/injector/command.go, https://github.com/hashicorp/vault-k8s/blob/master/helper/cert/source_disk.go), and cert-manager itself (https://github.com/jetstack/cert-manager/blob/master/pkg/webhook/server/tls/file_source.go). I also implemented a somewhat simpler version of it in vault-agent-auto-inject-webhook (https://github.com/patoarvizu/vault-agent-auto-inject-webhook/blob/master/cmd/webhook.go). In the case of vault-k8s and vault-agent-auto-inject-webhook, the watch is done using https://github.com/radovskyb/watcher, while cert-manager implements it with time.Ticker.

An alternative to implementing GetCertificate would be to force the pod to restart with os.Exit(0) on a file update event, although depending on how frequently the pod is forced to restart, it can cause unintended consequences, so the approach above is probably cleaner.


ashutosh-narkar · 2020-07-07T07:39:12Z

OPA does load the certificate and key from disk on every call to the remote service. This method is called on every call. It would be helpful to see OPA's debug logs to figure out what's going on.

tsandall · 2020-07-07T12:50:56Z

@patoarvizu thanks for filing this. As @ashutosh-narkar mentioned, other parts of OPA (e.g., all of the service clients for bundle downloading and decision log uploading, etc.) do reload certs (and do so on each request for simplicity.) However, in the case of the HTTP server, you're right that the certs are only loaded once at startup.

I'd be wary about relying on file watching mechanisms due to flakiness across platforms. Perhaps a periodic reload (e.g., check every 10 seconds) would be enough. @patrick-east probably has some thoughts here.

patoarvizu · 2020-07-07T13:30:34Z

Thanks for the response @ashutosh-narkar @tsandall

For additional reference, it seems like Gatekeeper is also rotating certs using periodic checks (https://github.com/open-policy-agent/gatekeeper/blob/master/pkg/webhook/certs.go), although the Gatekeeper case is a bit different because it is generating and injecting its own certificates (as opposed to using something external like cert-manager), so it has a little more control over it, but it seems like a similar approach could be used here.

ashutosh-narkar · 2020-07-07T16:29:29Z

Sorry @patoarvizu, I missed that you were referring to the server. +1 for using periodic checks.

wma1729 · 2021-11-29T22:18:18Z

Any updates on this? This is a big blocker for us as well. Our certificates are valid for 24 hours only.

tsandall · 2021-11-30T00:46:55Z

@wma1729 there hasn't been any progress on this, though we could prioritize it over the next release or two. In your environment, would it work if OPA just periodically reloaded the certificates from disk? In other words, OPA would re-read the certificates every X seconds. If the read succeeds, it would update the certificate used by the server. If the read fails, OPA would minimally log something to the console at ERROR level to indicate the certificate reload could not be performed. OPA would continue using the last successfully loaded certificate. The reload period could be configurable. What default would you like to see?

wma1729 · 2021-11-30T01:35:49Z

Yes. That should be fine. We use autocert in AKS. Autocert runs as a sidecar and generates a cert every 24 hours. The file containing the certificate is updated every time the cert is renewed. So yes, if OPA can reload the file periodically, that would be fine.


wma1729 · 2021-11-30T01:57:16Z

Thought some more on this. I, somehow, am okay with the watch approach as well. In fact, I like it more. May I know your concerns with watchers? "I'd be wary about relying on file watching mechanisms due to flakiness across platforms." What platforms have troubled you? Is Windows your concern?

Usually, systems that periodically update the cert file do so well before the cert actually expires. For example, if the cert expires in 24 hours, it is usually renewed after (24 - T) hours, where T could be 1 hour, 30 minutes, or 5 minutes, but rarely 30 seconds. So we should have enough time to detect the change.

If we go with periodic reload option, please record the sha digest of the file and reload the certificate only when the sha changes. And a default of 5 minutes should be good IMHO. But as long as it is configurable, we should be okay.

tsandall · 2021-12-07T21:13:19Z

Let's go ahead with a periodic reload. Hashing the cert file sounds fine assuming the reload on the server is expensive. I don't know enough about the http/tls package in Go to say which approach is better... @srenatus what do you think?

"What platforms have troubled you? Is Windows your concern?"

I'm concerned about relying on inotify() under the hood. Maybe it's improved and more reliable these days but in the past it was not something I'd want to depend on (e.g., if the watch doesn't fire as expected, we'll get a bug report).

srenatus · 2021-12-08T11:06:54Z

Related to #4107, I ran a quick benchmark of what the different scenarios would cost when the certs have not changed:

U @ 2.30GHz
BenchmarkCertReload
BenchmarkCertReload/Load_and_Store
BenchmarkCertReload/Load_and_Store-16         	   17380	     67021 ns/op	   17740 B/op	     156 allocs/op
BenchmarkCertReload/Load_and_Compare_bytes
BenchmarkCertReload/Load_and_Compare_bytes-16 	   16795	     71509 ns/op	   17612 B/op	     155 allocs/op
BenchmarkCertReload/Compare_sums_of_files
BenchmarkCertReload/Compare_sums_of_files-16  	   25821	     44358 ns/op	   66176 B/op	      12 allocs/op

code

package server

import (
	"bytes"
	"crypto/sha256"
	"crypto/tls"
	"io"
	"os"
	"sync/atomic"
	"testing"
)

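// BenchmarkCertReload compares three ways of handling a periodic reload
// when the certificate and key files on disk have not changed.
func BenchmarkCertReload(b *testing.B) {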
	certFile := "../test/e2e/certrefresh/testdata/server-cert.pem"
	certKeyFile := "../test/e2e/certrefresh/testdata/server-key.pem"

	b.Run("Load and Store", func(b *testing.B) {
		var val atomic.Value
		b.ResetTimer()
		for n := 0; n < b.N; n++ {
			cert, err := tls.LoadX509KeyPair(certFile, certKeyFile)
			if err != nil {
				b.Fatal(err)
			}
			val.Store(&cert)
		}
	})

	b.Run("Load and Compare bytes", func(b *testing.B) {
		oldCert, err := tls.LoadX509KeyPair(certFile, certKeyFile)
		if err != nil {
			b.Fatal(err)
		}
		b.ResetTimer()
		for n := 0; n < b.N; n++ {
			cert, err := tls.LoadX509KeyPair(certFile, certKeyFile)
			if err != nil {
				b.Fatal(err)
			}
			if !bytes.Equal(oldCert.Certificate[0], cert.Certificate[0]) {
				b.Error("expected equal certs")
			}
		}
	})

	b.Run("Compare sums of files", func(b *testing.B) {
		hashCert := hash(b, certFile)
		hashKey := hash(b, certKeyFile)

		b.ResetTimer()
		for n := 0; n < b.N; n++ {
			newHashCert := hash(b, certFile)
			newHashKey := hash(b, certKeyFile)
			if !bytes.Equal(newHashKey, hashKey) || !bytes.Equal(newHashCert, hashCert) {
				b.Error("expected equal certs")
			}
		}
	})
}

// hash returns the SHA-256 digest of the file's contents.
func hash(b *testing.B, file string) []byte {
	b.Helper()
	f, err := os.Open(file)
	if err != nil {
		b.Fatal(err)
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		b.Fatal(err)
	}

	return h.Sum(nil)
}

I don't think there's a clear winner here, is there? The checksumming approach (3rd one) has fewer allocations but more memory usage... but overall, it doesn't look terrible. If any of these things happen every 5 minutes, no harm is done, I believe. The first approach is the simplest when it comes to the underlying code: it's just an (atomic.Value).Load() and (atomic.Value).Store(). Keeping track of checksums on the side requires a more involved setup, I believe, but I'll try to shed some light on that, too, experimenting around.

This adds a new flag to `opa run`, intended for server usage with HTTPS listeners: `--tls-cert-refresh-period`. If used with a positive duration, such as "5m" (5 minutes), "24h", etc, the server will track the certificate and key files' contents. When their content changes, the certificates will be reloaded. On an error in reloading, it will log (info) the error and try again in the next round. Fixes #2500. Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>

superff · 2022-01-03T23:07:51Z

When is the new release coming? thanks

anderseknert · 2022-01-03T23:18:49Z

Tomorrow, most likely.

ashutosh-narkar added the enhancement label Jun 29, 2020

tsandall added this to TODO (Things That Should Be Done) in Open Policy Agent via automation Jul 7, 2020

tsandall moved this from TODO (Things That Should Be Done) to Planned (Things We Are Going To Do) in Open Policy Agent Nov 30, 2021

tsandall added runtime and removed enhancement labels Dec 3, 2021

srenatus self-assigned this Dec 7, 2021

srenatus moved this from Planned - v0.36 to In Progress in Open Policy Agent Dec 7, 2021

This was referenced Dec 7, 2021

Sr/server+runtime/add cert refreshing #4106

Closed

server+runtime: add cert refreshing #4107

Merged

tsandall changed the title ~~Reload TLS certificate for HTTPS server~~ Add support for TLS certificate rotation in OPA's HTTP server Dec 8, 2021

srenatus closed this as completed in #4107 Dec 9, 2021

Open Policy Agent automation moved this from In Progress to Done Dec 9, 2021
