Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

writing-good-e2e-tests.md: polling, contexts and failure messages #7021

Merged
merged 1 commit into from
Mar 1, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
320 changes: 316 additions & 4 deletions contributors/devel/sig-testing/writing-good-e2e-tests.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,8 +25,22 @@ time, and delayed merges.

#### Debuggability ####

If your test fails, it should provide as detailed as possible reasons
for the failure in its output. "Timeout" is not a useful error
If your test fails, it should provide as detailed as possible reasons for the
failure in its failure message. The failure message is the string that gets
passed (directly or indirectly) to `ginkgo.Fail(f)`. That text is what gets
shown in the overview of failed tests for a Prow job and what gets aggregated
by https://go.k8s.io/triage.

A good failure message:
- identifies the test failure
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we say whether text should not needs to include names of dynamic properties like pod names

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment actually conflicts with the later suggestion:

It is good practice to include details like the object that failed some assertion in the failure message because

it seems like aggregation handles it all right. You can disregard the comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. Aggregation simplifies the failure message. I have added a paragraph with a link to the relevant code.

- has enough details to provide some initial understanding of what went wrong

It's okay for it to contain information that changes during each test
run. Aggregation [simplifies the failure message with regular
expressions](https://github.com/kubernetes/test-infra/blob/d56bc333ae8acf176887a3249f750e7a8e0377f0/triage/summarize/text.go#L39-L69)
before looking for similar failures.

"Timeout" is not a useful error
message. "Timed out after 60 seconds waiting for pod xxx to enter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this example does pod xxx contains the unique pod name? WOuld it make it harder to aggregate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

running state, still in pending state" is much more useful to someone
trying to figure out why your test failed and what to do about it.
Expand All @@ -38,17 +52,315 @@ like the following generates rather useless errors:
Expect(err).NotTo(HaveOccurred())
```

Rather
Errors returned by client-go can be very detailed. A better way to test for
errors is with `framework.ExpectNoError` because it logs the full error and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see , this seems to be the only helper not wrapping gomega
kubernetes/kubernetes#115961

then includes only the shorter `err.Error()` in the failure message. To make
that failure message more informative,
[annotate](https://onsi.github.io/gomega/#annotating-assertions) your assertion with something like this:

```
Expect(err).NotTo(HaveOccurred(), "Failed to create %d foobars, only created %d", foobarsReqd, foobarsCreated)
framework.ExpectNoError(err, "tried creating %d foobars, only created %d", foobarsReqd, foobarsCreated)
```

On the other hand, overly verbose logging, particularly of non-error conditions, can make
it unnecessarily difficult to figure out whether a test failed and if
so why? So don't log lots of irrelevant stuff either.

Except for this special case, using Gomega assertions directly is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that there is wide consensus on using Gomega, I personally prefer to use stdlib directly, and I know some other contributors too

if !strings.Contains(actualStr, expectedSubStr) {
   framework.Failf("expected %s to contain %s",actualStr, expectedSubStr) 
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you expand on this later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this already shows how hand-written code can be problematic: is the failure message still readable when actualStr is empty? When using %q, how readable is it for multi-line strings?

The other problem is inconsistency.

I understand that we cannot require the usage of gomega, but I still think that we should encourage it where it makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to write about kubernetes/kubernetes#113298 here, which implies that we need to get that merged first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added something about that.

encouraged. They are more versatile than the (few) wrappers that were added to
the E2E framework. Use assertions that match the check in the test. Using Go
code to evaluate some condition and then checking the result often isn't
informative. For example this check should be avoided:

```
gomega.Expect(strings.Contains(actualStr, expectedSubStr)).To(gomega.Equal(true))
```

Better pass both values to Gomega, which will automatically include them in the
failure message. Add an annotation that explains what the assertion is about:

```
gomega.Expect(actualStr).To(gomega.ContainSubstring("xyz"), "checking log output")
```

This produces the following failure message:
```
[FAILED] checking log output
Expected
<string>: hello world
to contain substring
<string>: xyz
```

If there is no suitable Gomega assertion, call `ginkgo.Failf` directly:
```
import "github.com/onsi/gomega/format"

ok := someCustomCheck(abc)
if !ok {
ginkgo.Failf("check xyz failed for object:\n%s", format.Object(abc))
}
```

[Comparing a boolean](https://github.com/kubernetes/kubernetes/issues/105678)
like this against `true` or `false` with `gomega.Equal` or
`framework.ExpectEqual` is not useful because dumping the actual and expected
value just distracts from the underlying failure reason.

Dumping structs with `format.Object` is recommended. Starting with Kubernetes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this mean for an user? can we add an example of how to use it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

1.26, `format.Object` will pretty-print Kubernetes API objects or structs [as
YAML and omit unset
fields](https://github.com/kubernetes/kubernetes/pull/113384), which is more
readable than other alternatives like `fmt.Sprintf("%+v")`.

import (
"fmt"
"k8s.io/api/core/v1"
"k8s.io/kubernetes/test/utils/format"
)

var pod v1.Pod
fmt.Printf("Printf: %+v\n\n", pod)
fmt.Printf("format.Object:\n%s", format.Object(pod, 1 /* indent one level */))

=>

Printf: {TypeMeta:{Kind: APIVersion:} ObjectMeta:{Name: GenerateName: Namespace: SelfLink: UID: ResourceVersion: Generation:0 CreationTimestamp:0001-01-01 00:00:00 +0000 UTC DeletionTimestamp:<nil> DeletionGracePeriodSeconds:<nil> Labels:map[] Annotations:map[] OwnerReferences:[] Finalizers:[] ManagedFields:[]} Spec:{Volumes:[] InitContainers:[] Containers:[] EphemeralContainers:[] RestartPolicy: TerminationGracePeriodSeconds:<nil> ActiveDeadlineSeconds:<nil> DNSPolicy: NodeSelector:map[] ServiceAccountName: DeprecatedServiceAccount: AutomountServiceAccountToken:<nil> NodeName: HostNetwork:false HostPID:false HostIPC:false ShareProcessNamespace:<nil> SecurityContext:nil ImagePullSecrets:[] Hostname: Subdomain: Affinity:nil SchedulerName: Tolerations:[] HostAliases:[] PriorityClassName: Priority:<nil> DNSConfig:nil ReadinessGates:[] RuntimeClassName:<nil> EnableServiceLinks:<nil> PreemptionPolicy:<nil> Overhead:map[] TopologySpreadConstraints:[] SetHostnameAsFQDN:<nil> OS:nil HostUsers:<nil> SchedulingGates:[] ResourceClaims:[]} Status:{Phase: Conditions:[] Message: Reason: NominatedNodeName: HostIP: PodIP: PodIPs:[] StartTime:<nil> InitContainerStatuses:[] ContainerStatuses:[] QOSClass: EphemeralContainerStatuses:[] Resize:}}

format.Object:
<v1.Pod>:
metadata:
creationTimestamp: null
spec:
containers: null
status: {}

Consider writing a [Gomega
matcher](https://onsi.github.io/gomega/#adding-your-own-matchers). It combines
the custom check and generating the failure messages and can make tests more
readable.

It is good practice to include details like the object that failed some
assertion in the failure message because then a) the information is available
when analyzing a failure that occurred in the CI and b) it only gets logged
when some assertion fails. Always dumping objects via log messages can make the
test output very large and may distract from the relevant information.

#### Recovering from test failures ####

All tests should ensure that a cluster is restored to the state that it was in
before the test ran. [`ginkgo.DeferCleanup`
](https://pkg.go.dev/github.com/onsi/ginkgo/v2#DeferCleanup) is recommended for
this because it can be called similar to `defer` directly after setting up
something. It is better than `defer` because Ginkgo will show additional
details about which cleanup code is running and (if possible) handle timeouts
for that code (see next section). Is is better than `ginkgo.AfterEach` because
it is not necessary to define additional variables and because
`ginkgo.DeferCleanup` executes code in the more useful last-in-first-out order,
i.e. things that get set up first get removed last.

Objects created in the test namespace do not need to be deleted because
deleting the namespace will also delete them. However, if deleting an object
may fail, then explicitly cleaning it up is better because then failures or
timeouts related to it will be more obvious.

In cases where the test may have removed the object, `framework.IgnoreNotFound`
can be used to ignore the "not found" error:
```
podClient := f.ClientSet.CoreV1().Pods(f.Namespace.Name)
pod, err := podClient.Create(ctx, testPod, metav1.CreateOptions{})
framework.ExpectNoError(err, "create test pod")
ginkgo.DeferCleanup(framework.IgnoreNotFound(podClient.Delete), pod.Name, metav1.DeleteOptions{})
```

#### Interrupting tests ####

When aborting a manual `gingko ./test/e2e` invocation with CTRL-C or a signal,
the currently running test(s) should stop immediately. This gets achieved by
accepting a `ctx context.Context` as first parameter in the Ginkgo callback
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is by far the best addition ❤️

function and then passing that context through to all code that might
block. When Ginkgo notices that it needs to shut down, it will cancel that
context and all code trying to use it will immediately return with a `context
canceled` error. Cleanup callbacks get a new context which will time out
eventually to ensure that tests don't get stuck. For a detailed description,
see https://onsi.github.io/ginkgo/#interrupting-aborting-and-timing-out-suites.
Most of the E2E tests [were update to use the Ginkgo
context](https://github.com/kubernetes/kubernetes/pull/112923) at the start of
the 1.27 development cycle.

There are some gotchas:

- Don't use the `ctx` passed into `ginkgo.It` in a `ginkgo.DeferCleanup`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that was a bit of a footgun ...

callback because the context will be canceled when the cleanup code
runs. This is wrong:

ginkgo.It("something", func(ctx context.Context) {
...
ginkgo.DeferCleanup(func() {
// do something with ctx
})
})

Instead, register a function which accepts a new context:

ginkgo.DeferCleanup(func(ctx context.Context) {
// do something with the new ctx
})

Anonymous functions can be avoided by passing some existing function and its
parameters directly to `ginkgo.DeferCleanup`. Again, beware to *not* pass the
wrong `ctx`. This is wrong:

ginkgo.It("something", func(ctx context.Context) {
...
ginkgo.DeferCleanup(myDeleteFunc, ctx, objName)
})

Instead, just pass the other parameters and let `ginkgo.DeferCleanup`
add a new context:

ginkgo.DeferCleanup(myDeleteFunc, objName)

- When starting some background goroutine in a `ginkgo.BeforeEach` callback,
use `context.WithCancel(context.Background())`. The context passed into the
callback will get canceled when the callback returns, which would cause the
background goroutine to stop before the test runs. This works:

backgroundCtx, cancel := context.WithCancel(context.Background())
ginkgo.DeferCleanup(cancel)
_, controller = cache.NewInformer( ... )
go controller.Run(backgroundCtx.Done())

- When adding a timeout to the context for one particular operation,
beware of overwriting the `ctx` variable. This code here applies
the timeout to the next call and everything else that follows:

ctx, cancel := context.WithTimeout(ctx, 5 * time.Second)
defer cancel()
someOperation(ctx)
...
anotherOperation(ctx)

Better use some other variable name:

timeoutCtx, cancel := context.WithTimeout(ctx, 5 * time.Second)
defer cancel()
someOperation(timeoutCtx)

When the intention is to set a timeout for the entire callback, use
[`ginkgo.NodeTimeout`](https://pkg.go.dev/github.com/onsi/ginkgo/v2#NodeTimeout):

ginkgo.It("something", ginkgo.NodeTimeout(30 * time.Second), func(ctx context.Context) {
})

There is also a `ginkgo.SpecTimeout`, but that then also includes the time
taken for `BeforeEach`, `AfterEach` and `DeferCleanup` callbacks.

#### Polling and timeouts ####

When waiting for something to happen, use a reasonable timeout. Without it, a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should encourage to use the framework timeouts https://github.com/kubernetes/kubernetes/blob/1edbb8cf1a864c2dd71a1e5b7282de4bc3d8f5fa/test/e2e/framework/timeouts.go#L21-L39 so we have consistency on the operations.

We also have to set some upper bounds, is not ok that a test takes more than 5 minutes to run a pod , per example

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about that, but it leads to another can of worms: if we recommend using NewTimeoutContextWithDefaults, what should we tell developers when they need a timeout that isn't covered by the existing ones? Should such timeouts be configurable? How?

The ones referenced above are not configurable. Others are.

It's a mess 😭

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then there are timeouts hard-coded even inside the framework:

https://github.com/kubernetes/kubernetes/blob/1edbb8cf1a864c2dd71a1e5b7282de4bc3d8f5fa/test/e2e/framework/pod/wait.go#L41-L65

Did I mention "mess" already?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, but the common timeouts like podStart and podDelete should be the same, those has demonstrated to be useful to surface CRI regressions ... also, people tend to increase timeouts to infinite instead of debugging, and that use to hide underly issues ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kubernetes/kubernetes#114783 is my attempt to clean this up. If we can agree on that and merge it first, then I am more comfortable with pointing developers towards TimeoutContext.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

test might keep running until the entire test suite gets killed by the
CI. Beware that the CI under load may take a lot longer to complete some
operation compared to running the same test locally. On the other hand, a too
long timeout can be annoying when trying to debug tests locally.

The framework provides some [common
timeouts](https://github.com/kubernetes/kubernetes/blob/eba98af1d8b19b120e39f3/test/e2e/framework/timeouts.go#L44-L109)
through the [framework
instance](https://github.com/kubernetes/kubernetes/blob/1e84987baccbccf929eba98af1d8b19b120e39f3/test/e2e/framework/framework.go#L122-L123).
When writing a test, check whether one of those fits before defining a custom
timeout in the test.

Good code that waits for something to happen meets the following criteria:
- accepts a context for test timeouts
- informative during interactive use (i.e. intermediate reports, either
periodically or on demand)
- little to no output during a CI run except when it fails
- full explanation when it fails: when it observes some state and then
encounters errors reading the state, then dumping both the latest
observed state and the latest error is useful
- extension mechanism for writing custom checks
- early abort when condition cannot be reached anymore

[`gomega.Eventually`](https://pkg.go.dev/github.com/onsi/gomega#Eventually)
satisfies all of these criteria and therefore is recommended, but not required.
In https://github.com/kubernetes/kubernetes/pull/113298,
[test/e2e/framework/pods/wait.go](https://github.com/kubernetes/kubernetes/blob/222f65506252354da012c2e9d5457a6944a4e681/test/e2e/framework/pod/wait.go)
and the framework were modified to use gomega. Typically, `Eventually` is
passed a function which gets an object or lists several of them, then `Should`
checks against the expected result. Errors and retries specific to Kubernetes
are handled by [wrapping client-go
functions](https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/get.go).

Using gomega assertions in helper packages is problematic for two reasons:
- The stacktrace associated with the failure starts with the helper unless
extra care is take to pass in a stack offset.
- Additional explanations for a potential failure must be prepared beforehand
and passed in.

The E2E framework therefore uses a different approach:
- [`framework.Gomega()`](https://github.com/kubernetes/kubernetes/blob/222f65506252354da012c2e9d5457a6944a4e681/test/e2e/framework/expect.go#L80-L101)
offers similar functions as the `gomega` package, except that they return a
normal error instead of failing the test.
- That error gets wrapped with `fmt.Errorf("<explanation>: %w)` to
add additional information, just as in normal Go code.
- Wrapping the error (`%w` instead of `%v`) is important because then
`framework.ExpectNoError` directly uses the error message as failure without
additional boiler plate text. It also is able to log the stacktrace where
the error occurred and not just where it was finally treated as a test
failure.

Some tips for writing and debugging long-running tests:

- Use `ginkgo.By` to record individual steps. Ginkgo will use that information
when describing where a test timed out.

- Invoke the `ginkgo` CLI with `--poll-progress-after=30s` or some other
suitable duration to [be informed
early](https://onsi.github.io/ginkgo/#getting-visibility-into-long-running-specs)
why a test doesn't complete and where it is stuck. A SIGINFO or SIGUSR1
signal can be sent to the CLI and/or e2e.test processes to trigger an
immediate progress report.

- Use [`gomega.Eventually`](https://pkg.go.dev/github.com/onsi/gomega#Eventually)
to wait for some condition. When it times out or
gets stuck, the last failed assertion will be included in the report
automatically. A good way to invoke it is:

gomega.Eventually(ctx, func(ctx context.Context) (book Book, err error) {
// Retrieve book from API server and return it.
...
}).WithPolling(5 * time.Second).WithTimeout(30 * time.Second).
Should(gomega.HaveField("Author.DOB.Year()", BeNumerically("<", 1900)))

Avoid testing for some condition inside the callback and returning a boolean
because then failure messages are not informative (see above). See
https://github.com/kubernetes/kubernetes/pull/114640 for an example where
[gomega/gcustom](https://pkg.go.dev/github.com/onsi/gomega@v1.27.2/gcustom)
was used to write assertions.

Some of the E2E framework sub-packages have helper functions that wait for
certain domain-specific conditions. Currently most of these functions don't
follow best practices (not using gomega.Eventually, error messages not very
informative). [Work is
planned](https://github.com/kubernetes/kubernetes/issues/106575) in that
area, so beware that these APIs may
change at some point.
Comment on lines +342 to +348
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this is true but I'm afraid we leave stale information forever, but considering the pace we move I think that it may be ok :)


- Use `gomega.Consistently` to ensure that some condition is true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a great advice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reading it again and considering a PR that I just reviewed, this statement might be interpreted as doing something like Eventually(func() bool {...}).Should(BeTrue()). That's the "check boolean" anti-pattern.

I'll rephrase...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my experience, examples that people can copy paste is the best solution for documentation ... see stackoverflow 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extended and linked to an example in the previous section.

for a while. As with `gomega.Eventually`, make assertions about the
value instead of checking the value with Go code and then asserting
that the code returns true.

- Both `gomega.Consistently` and `gomega.Eventually` can be aborted early via
[`gomega.StopPolling`](https://onsi.github.io/gomega/#bailing-out-early---polling-functions).

- Avoid polling with functions that don't take a context (`wait.Poll`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add the linter for this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

`wait.PollImmediate`, `wait.Until`, ...) and replace with their counterparts
that do (`wait.PollWithContext`, `wait.PollImmediateWithContext`,
`wait.UntilWithContext`, ...) or even better, with `gomega.Eventually`.


#### Ability to run in non-dedicated test clusters ####

To reduce end-to-end delay and improve resource utilization when
Expand Down