Add support for TLS certificate rotation in OPA's HTTP server #2500

Closed
patoarvizu opened this issue Jun 29, 2020 · 12 comments · Fixed by #4107

@patoarvizu

Expected Behavior

If an OPA server is running HTTPS (i.e. with --tls-cert-file) and the file on disk changes, OPA should have a mechanism for reloading the cert. This is useful for when the certificate is rotated periodically either manually or dynamically (e.g. with cert-manager).

Actual Behavior

OPA only loads the certificate once at startup time, and if the life of the server outlasts the validity period of the certificate it originally loaded, requests will fail, even if a new certificate with an extended expiration time exists on disk in the same location.

Steps to Reproduce the Problem

(Full example manifests below)

  • Have a cluster with cert-manager deployed.
  • Create a cert-manager ClusterIssuer and a Certificate. Make the Certificate very short-lived (e.g. 5m).
  • Deploy OPA mounting the secret created by the Certificate above and passing the appropriate --tls flags to use that certificate, and make sure the container listens on the TLS port. Make sure you have the appropriate configuration, service account, roles, role bindings, etc.
  • Create an OPA Service pointing to the HTTPS port on the Deployment.
  • Create a ValidatingWebhookConfiguration to capture pods and point them to the opa service created above.
  • Deploy any Pod; it doesn't matter whether a policy is applied properly or not. It could be something like:
apiVersion: v1
kind: Pod
metadata:
  name: echo
  namespace: default
spec:
  containers:
  - name: echo
    image: hashicorp/http-echo:latest
    args:
    - -listen
    - ":8080"
    - -text
    - "hello world"
  • Delete the pod.
  • Check the OPA logs, i.e. kubectl -n opa logs deployment/opa -c opa. There should be no errors.
  • Wait 10-11 minutes to make sure cert-manager rotated the certificates.
  • Try to delete the same pod as above.
  • You'll see an error along the lines of 2020/06/29 21:28:44 http: TLS handshake error from 10.42.0.1:33201: remote error: tls: bad certificate

Full manifests to deploy OPA:

apiVersion: v1
kind: Namespace
metadata:
  name: opa
  labels:
    opa-control-plane: "true"
---
apiVersion: cert-manager.io/v1alpha2
kind: ClusterIssuer
metadata:
  name: selfsigning-issuer
spec:
  selfSigned: {}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: opa-role
rules:
- apiGroups:
  - '*'
  resources:
  - '*'
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: opa
  namespace: opa
  labels:
    opa-control-plane: "true"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: null
  name: opa-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: opa-role
subjects:
- kind: ServiceAccount
  name: opa
  namespace: opa
---
kind: Service
apiVersion: v1
metadata:
  name: opa
  namespace: opa
  labels:
    opa-control-plane: "true"
spec:
  selector:
    app: opa
  ports:
  - name: https
    protocol: TCP
    port: 443
    targetPort: https
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: opa
    opa-control-plane: "true"
  name: opa
  namespace: opa
spec:
  selector:
    matchLabels:
      app: opa
  template:
    metadata:
      labels:
        app: opa
        opa-control-plane: "true"
      name: opa
    spec:
      serviceAccountName: opa
      containers:
      - name: opa
        image: openpolicyagent/opa:0.21.0
        args:
        - run
        - --server
        - --tls-cert-file=/certs/tls.crt
        - --tls-private-key-file=/certs/tls.key
        - --addr=https://0.0.0.0:443
        - --addr=http://127.0.0.1:8181
        - --log-level=error
        volumeMounts:
          - readOnly: true
            mountPath: /certs
            name: opa-server
        readinessProbe:
          httpGet:
            path: /health
            scheme: HTTPS
            port: https
          initialDelaySeconds: 3
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            scheme: HTTPS
            port: https
        ports:
        - containerPort: 443
          name: https
      - name: kube-mgmt
        image: openpolicyagent/kube-mgmt:0.11
        args:
        - --policies=opa
        - --enable-data=true
      volumes:
      - name: opa-server
        secret:
          secretName: opa-server-secret
---
apiVersion: cert-manager.io/v1alpha2
kind: Certificate
metadata:
  name: opa-webhook
  namespace: opa
spec:
  secretName: opa-server-secret
  duration: 10m
  renewBefore: 5m
  commonName: opa
  dnsNames:
  - opa
  - opa.opa
  - opa.opa.svc
  issuerRef:
    name: selfsigning-issuer
    kind: ClusterIssuer
---
kind: ValidatingWebhookConfiguration
apiVersion: admissionregistration.k8s.io/v1beta1
metadata:
  name: opa-validating-webhook
  annotations:
    cert-manager.io/inject-ca-from: opa/opa-webhook
webhooks:
- name: validating-webhook.openpolicyagent.org
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    - UPDATE
    resources:
    - pods
  clientConfig:
    caBundle: Cg==
    service:
      namespace: opa
      name: opa
  namespaceSelector:
    matchExpressions:
    - key: opa-control-plane
      operator: DoesNotExist
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: opa-default-system-main
  namespace: opa
  labels:
    openpolicyagent.org/policy: rego
data:
  main: |
    package system

    import data.kubernetes.admission

    main = {
      "apiVersion": "admission.k8s.io/v1beta1",
      "kind": "AdmissionReview",
      "response": response,
    }

    default response = {"allowed": true}

    response = {
        "allowed": false,
        "status": {
            "reason": reason,
        },
    }{
        reason = concat(", ", admission.deny)
        reason != ""
    }

Additional Info

This could be solved by keeping a "watch" on the file on disk, updating a cached version of the certificate when it changes, and implementing tls.Config.GetCertificate to return the cached certificate. As a reference, other tools that implement this are vault-k8s (https://github.com/hashicorp/vault-k8s/blob/master/subcommand/injector/command.go, https://github.com/hashicorp/vault-k8s/blob/master/helper/cert/source_disk.go), or cert-manager itself (https://github.com/jetstack/cert-manager/blob/master/pkg/webhook/server/tls/file_source.go). I also implemented a somewhat simpler version of it in vault-agent-auto-inject-webhook (https://github.com/patoarvizu/vault-agent-auto-inject-webhook/blob/master/cmd/webhook.go). In the case of vault-k8s and vault-agent-auto-inject-webhook, the watch is done using https://github.com/radovskyb/watcher, and cert-manager implements it with time.Ticker.

An alternative to implementing GetCertificate would be to force the pod to restart with os.Exit(0) on a file update event, although depending on how frequently the pod is forced to restart, it can cause unintended consequences, so the approach above is probably cleaner.

@ashutosh-narkar
Member

OPA does load the certificate and key from disk on every call to the remote service. This method is called on every call. It would be helpful to see OPA's debug logs to figure out what's going on.

@tsandall tsandall added this to TODO (Things That Should Be Done) in Open Policy Agent via automation Jul 7, 2020
@tsandall
Member

tsandall commented Jul 7, 2020

@patoarvizu thanks for filing this. As @ashutosh-narkar mentioned, other parts of OPA (e.g., all of the service clients for bundle downloading and decision log uploading, etc.) do reload certs (and do so on each request for simplicity.) However, in the case of the HTTP server, you're right that the certs are only loaded once at startup.

I'd be wary about relying on file watching mechanisms due to flakiness across platforms. Perhaps a periodic reload (e.g., check every 10 seconds) would be enough. @patrick-east probably has some thoughts here.

@patoarvizu
Author

Thanks for the response @ashutosh-narkar @tsandall

For additional reference, it seems like Gatekeeper is also rotating certs using periodic checks (https://github.com/open-policy-agent/gatekeeper/blob/master/pkg/webhook/certs.go), although the Gatekeeper case is a bit different because it is generating and injecting its own certificates (as opposed to using something external like cert-manager), so it has a little more control over it, but it seems like a similar approach could be used here.

@ashutosh-narkar
Member

Sorry @patoarvizu, I missed that you were referring to the server. +1 for using periodic checks.

@wma1729

wma1729 commented Nov 29, 2021

Any updates on this? This is a big blocker for us as well. Our certificates are valid for 24 hours only.

@tsandall
Member

@wma1729 there hasn't been any progress on this, though we could prioritize it over the next release or two. In your environment, would it work if OPA just periodically reloaded the certificates from disk? In other words, OPA would re-read the certificates every X seconds. If the read succeeds, it would update the certificate used by the server. If the read fails, OPA would minimally log something to the console at ERROR level to indicate the certificate reload could not be performed. OPA would continue using the last successfully loaded certificate. The reload period could be configurable. What default would you like to see?

@tsandall tsandall moved this from TODO (Things That Should Be Done) to Planned (Things We Are Going To Do) in Open Policy Agent Nov 30, 2021
@wma1729

wma1729 commented Nov 30, 2021 via email

@wma1729

wma1729 commented Nov 30, 2021

Thought some more on this. I, somehow, am okay with the watch approach as well. In fact, I like it more. May I know your concerns with watchers? "I'd be wary about relying on file watching mechanisms due to flakiness across platforms." What platforms have troubled you? Is Windows your concern?

Usually, systems that rotate the cert file update it well before the cert actually expires. For example, if the cert expires in 24 hours, it is usually renewed after (24 - T) hours, where T could be 1 hour, 30 minutes, or 5 minutes, but rarely 30 seconds, so we should have enough time to detect the change.

If we go with periodic reload option, please record the sha digest of the file and reload the certificate only when the sha changes. And a default of 5 minutes should be good IMHO. But as long as it is configurable, we should be okay.

@srenatus srenatus self-assigned this Dec 7, 2021
@srenatus srenatus moved this from Planned - v0.36 to In Progress in Open Policy Agent Dec 7, 2021
@tsandall
Member

tsandall commented Dec 7, 2021

Let's go ahead with a periodic reload. Hashing the cert file sounds fine assuming the reload on the server is expensive. I don't know enough about the http/tls package in Go to say which approach is better... @srenatus what do you think?

What platforms have troubled you? Is Windows your concern?

I'm concerned about relying on inotify() under the hood. Maybe it's improved and more reliable these days but in the past it was not something I'd want to depend on (e.g., if the watch doesn't fire as expected, we'll get a bug report).

@srenatus
Contributor

srenatus commented Dec 8, 2021

Related to #4107, I ran a quick benchmark of what the costs of different scenarios, given the certs have not changed, would be:

U @ 2.30GHz
BenchmarkCertReload
BenchmarkCertReload/Load_and_Store
BenchmarkCertReload/Load_and_Store-16         	   17380	     67021 ns/op	   17740 B/op	     156 allocs/op
BenchmarkCertReload/Load_and_Compare_bytes
BenchmarkCertReload/Load_and_Compare_bytes-16 	   16795	     71509 ns/op	   17612 B/op	     155 allocs/op
BenchmarkCertReload/Compare_sums_of_files
BenchmarkCertReload/Compare_sums_of_files-16  	   25821	     44358 ns/op	   66176 B/op	      12 allocs/op
code
package server

import (
	"bytes"
	"crypto/sha256"
	"crypto/tls"
	"io"
	"os"
	"sync/atomic"
	"testing"
)

func BenchmarkCertReload(b *testing.B) {
	certFile := "../test/e2e/certrefresh/testdata/server-cert.pem"
	certKeyFile := "../test/e2e/certrefresh/testdata/server-key.pem"

	b.Run("Load and Store", func(b *testing.B) {
		var val atomic.Value
		b.ResetTimer()
		for n := 0; n < b.N; n++ {
			cert, err := tls.LoadX509KeyPair(certFile, certKeyFile)
			if err != nil {
				b.Fatal(err)
			}
			val.Store(&cert)
		}
	})

	b.Run("Load and Compare bytes", func(b *testing.B) {
		oldCert, err := tls.LoadX509KeyPair(certFile, certKeyFile)
		if err != nil {
			b.Fatal(err)
		}
		b.ResetTimer()
		for n := 0; n < b.N; n++ {
			cert, err := tls.LoadX509KeyPair(certFile, certKeyFile)
			if err != nil {
				b.Fatal(err)
			}
			if !bytes.Equal(oldCert.Certificate[0], cert.Certificate[0]) {
				b.Error("expected equal certs")
			}
		}
	})

	b.Run("Compare sums of files", func(b *testing.B) {
		hashCert := hash(b, certFile)
		hashKey := hash(b, certKeyFile)

		b.ResetTimer()
		for n := 0; n < b.N; n++ {
			newHashCert := hash(b, certFile)
			newHashKey := hash(b, certKeyFile)
			if !bytes.Equal(newHashKey, hashKey) || !bytes.Equal(newHashCert, hashCert) {
				b.Error("expected equal certs")
			}
		}
	})
}

func hash(b *testing.B, file string) []byte {
	b.Helper()
	f, err := os.Open(file)
	if err != nil {
		b.Fatal(err)
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		b.Fatal(err)
	}

	return h.Sum(nil)
}

I don't think there's a clear winner here, is there? The checksumming approach (3rd one) has fewer allocations but more memory usage... but overall, it doesn't look terrible. If any of these things happen every 5 minutes, no harm is done, I believe. The first approach is the simplest when it comes to the underlying code: it's just an (atomic.Value).Load() and an (atomic.Value).Store(). Keeping track of checksums on the side requires a more involved setup, I believe, but I'll try to shed some light on that, too, by experimenting a bit.

@tsandall tsandall changed the title Reload TLS certificate for HTTPS server Add support for TLS certificate rotation in OPA's HTTP server Dec 8, 2021
Open Policy Agent automation moved this from In Progress to Done Dec 9, 2021
srenatus added a commit that referenced this issue Dec 9, 2021
This adds a new flag to `opa run`, intended for server usage with HTTPS listeners:
`--tls-cert-refresh-period`. If used with a positive duration, such as "5m" (5 minutes),
"24h", etc, the server will track the certificate and key files' contents. When their
content changes, the certificates will be reloaded.

On an error in reloading, it will log (info) the error and try again in the next round.

Fixes #2500.

Signed-off-by: Stephan Renatus <stephan.renatus@gmail.com>
@superff

superff commented Jan 3, 2022

When is the new release coming? thanks

@anderseknert
Member

Tomorrow, most likely.
