Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KMSv2] Use status key ID to determine staleness of encrypted data #114544

Merged
merged 1 commit into from
Jan 19, 2023

Conversation

ritazh
Copy link
Member

@ritazh ritazh commented Dec 16, 2022

What type of PR is this?

/kind feature

What this PR does / why we need it:

KMSv2: Use status key ID to determine staleness of encrypted data

Which issue(s) this PR fixes:

Fixes #111922

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

[KEP]: https://github.com/kubernetes/enhancements/issues/3299

/milestone v1.27
/sig auth
/triage accepted

@k8s-ci-robot k8s-ci-robot added sig/auth Categorizes an issue or PR as relevant to SIG Auth. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Dec 16, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.27 milestone Dec 16, 2022
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Dec 16, 2022
@k8s-ci-robot k8s-ci-robot added area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Dec 16, 2022
@ritazh
Copy link
Member Author

ritazh commented Dec 16, 2022

/assign @enj

@ritazh ritazh changed the title Kmsv2 keyid staleness [KMSv2] Use status key ID to determine staleness of encrypted data Dec 16, 2022
@ritazh
Copy link
Member Author

ritazh commented Dec 16, 2022

/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Dec 16, 2022
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 9, 2023
@ritazh
Copy link
Member Author

ritazh commented Jan 9, 2023

/test pull-kubernetes-unit

@aramase
Copy link
Member

aramase commented Jan 10, 2023

/cc

@@ -274,6 +280,15 @@ func (h *kmsv2PluginProbe) check(ctx context.Context) error {
return nil
}

// getCurrentKeyID returns the latest keyID from the Status() method or err if keyID is empty
func (h *kmsv2PluginProbe) getCurrentKeyID(ctx context.Context) (string, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt this PR makes the API server coast (forever) on the last successful key ID it got from the plugin. Thus staleness checks for read calls will not fail if the plugin is down transiently. At a high level I believe we have three choices here:

  1. Coast on the last key ID, no error, staleness based on last observed key ID
  2. Error and fail the read request (would partially defeat the purpose of having a cache since we would no longer tolerate plugin downtime)
  3. Mark the read request as stale=true - for no-op updates this would cause a write (which would presumably fail since the plugin is down)

What do you think is the best approach here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe another option is to make stale a *bool to try and express the "I do not know" state?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's hard to imagine life being better for a cluster admin if we make the cluster start failing reads ... I wouldn't do that.

failing no-op writes doesn't seem useful to me either... it pushes errors to users unlikely to be able to do anything about them

I think I would expect to coast, and for an error here to increment a metric to make it clear something is not working as expected (either transiently or persistently)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #115188 to add metrics as a follow up.

keyID := ""
probe.keyID.Store(&keyID)

go wait.PollImmediateUntilWithContext(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to think if there is a way to have an integration test that proves that this loop is running. 🤔

@enj
Copy link
Member

enj commented Jan 13, 2023

Running the tests locally I see that https://github.com/ritazh/kubernetes/blob/c5f62be64070e68ca8f7f8c38af0a52ce119107d/staging/src/k8s.io/apiserver/pkg/server/options/encryptionconfig/config_test.go#L569 causes a panic every so often in the new go wait.PollImmediateUntilWithContext function. Here is a workaround for that:

diff --git a/staging/src/k8s.io/apiserver/pkg/server/options/encryptionconfig/config_test.go b/staging/src/k8s.io/apiserver/pkg/server/options/encryptionconfig/config_test.go
index bd57f31ac77..4f93314e37a 100644
--- a/staging/src/k8s.io/apiserver/pkg/server/options/encryptionconfig/config_test.go
+++ b/staging/src/k8s.io/apiserver/pkg/server/options/encryptionconfig/config_test.go
@@ -26,8 +26,10 @@ import (
 	"time"
 
 	"github.com/google/go-cmp/cmp"
+
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/runtime/schema"
+	"k8s.io/apimachinery/pkg/util/wait"
 	apiserverconfig "k8s.io/apiserver/pkg/apis/config"
 	"k8s.io/apiserver/pkg/features"
 	"k8s.io/apiserver/pkg/storage/value"
@@ -243,7 +245,7 @@ func TestEncryptionProviderConfigCorrect(t *testing.T) {
 
 		for _, transformer := range transformers {
 			untransformedData, stale, err := transformer.Transformer.TransformFromStorage(ctx, transformedData, dataCtx)
-			if err != nil && err.Error() != "got unexpected empty keyID" {
+			if err != nil {
 				t.Fatalf("%s: error while reading using %s transformer: %s", testCase.Name, transformer.Name, err)
 			}
 			if stale != (transformer.Name != testCase.Name) {
@@ -551,7 +553,9 @@ func TestKMSPluginHealthz(t *testing.T) {
 				return
 			}
 
-			_, got, kmsUsed, err := getTransformerOverridesAndKMSPluginProbes(config, testContext(t).Done())
+			ctx, cancel := context.WithCancel(context.Background())
+			cancel() // cancel this upfront so the kms v2 healthz check poll only runs once
+			_, got, kmsUsed, err := getTransformerOverridesAndKMSPluginProbes(config, ctx.Done())
 			if err != nil {
 				t.Fatal(err)
 			}
@@ -565,6 +569,7 @@ func TestKMSPluginHealthz(t *testing.T) {
 					p.l = nil
 					p.lastResponse = nil
 				case *kmsv2PluginProbe:
+					waitForOneKMSv2Check(t, p) // make sure the kms v2 healthz check poll is done
 					p.service = nil
 					p.l = nil
 					p.lastResponse = nil
@@ -595,6 +600,19 @@ func TestKMSPluginHealthz(t *testing.T) {
 	}
 }
 
+func waitForOneKMSv2Check(t *testing.T, p *kmsv2PluginProbe) {
+	t.Helper()
+
+	if err := wait.PollImmediate(100*time.Millisecond, wait.ForeverTestTimeout, func() (done bool, err error) {
+		p.l.Lock()
+		defer p.l.Unlock()
+
+		return !p.lastResponse.received.IsZero(), nil
+	}); err != nil {
+		t.Fatal(err)
+	}
+}
+
 func TestKMSPluginHealthzTTL(t *testing.T) {
 	ctx := testContext(t)
 

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 16, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 16, 2023
@ritazh
Copy link
Member Author

ritazh commented Jan 17, 2023

@enj @aramase I've addressed your comments. When you get a chance, PTAL

Copy link
Member

@enj enj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is close.

@ritazh
Copy link
Member Author

ritazh commented Jan 19, 2023

@enj @aramase I've addressed your comments. When you get a chance, PTAL

Signed-off-by: Rita Zhang <rita.z.zhang@gmail.com>
@enj
Copy link
Member

enj commented Jan 19, 2023

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 19, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: f26e0356dab57c67edc5b38085ae2ea05b91fce0

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 19, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enj, ritazh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 285e796 into kubernetes:master Jan 19, 2023
@ritazh ritazh deleted the kmsv2-keyid-staleness branch January 19, 2023 23:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Status: Done
Development

Successfully merging this pull request may close these issues.

[KMSv2] Use status key ID to determine staleness of encrypted data
5 participants