Writing Good Integration Tests

Why is this important?

Istio has a long list of supported topologies and maintaining an integration test for every possible combination isn't scalable. Instead, we've chosen to make our tests generic so that they can be run against a variety of topologies. This document identifies some basic guidelines to help you get started.

Rule 1: Keep test times low

Integration tests are run for every PR. Long test times can significantly hurt the developer experience. In general, we want each test make target to complete in under 30 minutes on Prow (10-15 minutes is ideal).

Avoid creating new test suites

If possible, add your test to an existing suite (i.e. TestMain). Creating a new suite means we have to tear down and redeploy Istio, as well as any other resources shared by the tests in the suite.

Avoid custom setup

Tests that require custom configuration of the control plane require a separate suite. Additional suites add to the overall test time, as described above. If there is a specific setting required, we should consider adding it to our default IOP file.
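
For reference, this is roughly what a custom-setup suite looks like: the control plane is reconfigured in the suite's Setup, which is exactly what forces a dedicated TestMain and a separate Istio deployment. A minimal sketch (the meshConfig override shown here is only an illustrative assumption, not a recommended setting):

func TestMain(m *testing.M) {
  framework.
    NewSuite(m).
    // overriding control-plane values like this is what requires a separate suite
    Setup(istio.Setup(nil, func(_ resource.Context, cfg *istio.Config) {
      cfg.ControlPlaneValues = `
meshConfig:
  outboundTrafficPolicy:
    mode: REGISTRY_ONLY`
    })).
    Run()
}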

Re-use echos

Spinning up services is one of the slowest parts of a test. Wherever possible, keep this setup in your suite's Setup to avoid long test times. Doing so lets the services start in parallel, which isn't possible if they're created by each individual test.

An echo Builder will start all of its instances in parallel. If using multiple Builders, make sure that they each build in parallel to reduce the overall startup time.

var (
   a, b echo.Instance
)

func TestMain(m *testing.M) {
  framework.
    NewSuite(m).
    Setup(istio.Setup(nil, nil)).
    Setup(func(ctx resource.Context) (err error) {
      // a single Builder starts both echo instances in parallel
      _, err = echoboot.NewBuilder(ctx).
        With(&a, cfgA).
        With(&b, cfgB).
        Build()
      return err
    }).
    Run()
}

func TestFeatureA(t *testing.T) {
  framework.
    NewTest(t).
    Run(func(ctx framework.TestContext) {
      ctx.Config().ApplyYAMLOrFail(...)
      ctx.WhenDone(func() error {
        // make sure to clean up test-specific config
        return ctx.Config().DeleteYAML(...)
      })
      // test the config against the existing echo services instead of
      // creating/deleting new ones just for this test
    })
}

Rule 2: Don't be flaky

Address test flakiness immediately. Use retries to make sure your tests pass consistently; spurious failures slow down development across all PRs.
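
Config propagation and endpoint discovery are eventually consistent, so wrap assertions in the framework's retry helper (pkg/test/util/retry) rather than sleeping or asserting on the first attempt. A rough sketch, reusing the a and b echo instances from the earlier example (the port name and timing options are illustrative):

// retry an eventually-consistent check instead of failing immediately
retry.UntilSuccessOrFail(ctx, func() error {
  resp, err := a.Call(echo.CallOptions{Target: b, PortName: "http"})
  if err != nil {
    return err
  }
  return resp.CheckOK()
}, retry.Timeout(30*time.Second), retry.Delay(time.Second))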

Rule 3: Use feature labels

We want to track which features are covered. Don't forget to add feature labels to your tests!
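
Labels are attached when the test is declared, for example (the label value below is illustrative; real values come from the framework's feature list):

func TestFeatureA(t *testing.T) {
  framework.
    NewTest(t).
    Features("traffic.routing"). // feature(s) this test covers
    Run(func(ctx framework.TestContext) {
      // ...
    })
}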

Rule 4: Use all clusters

The test framework is provided with a number of clusters at startup, which may be divided into any number of networks with control planes running in any number of clusters.

New tests should leverage all of the available clusters. If traffic is being sent, it should ideally be sent from every cluster to every other cluster. This implies deploying service workloads in every cluster.

Tests should not depend on any particular network or control plane topology.

Currently, plenty of existing tests don't follow these guidelines; as a result, they are skipped in multicluster runs using RequireSingleCluster or other mechanisms. Going forward, we should avoid this to make sure features work everywhere.

Supporting multicluster requires:

  • Avoiding RequireSingleCluster or RequireMaxClusters on the suite/test.

  • Deploying services in all clusters from the environment (using ctx.Clusters() from the test context).

  • Ensuring all interactions between services are tested between every pair of clusters.

  • Using ParsedResponse from echo calls to investigate where traffic actually went.

Setup echos in all clusters:

func TestMain(m *testing.M) {
  framework.
    NewSuite(m).
    Setup(istio.Setup(nil, nil)).
    Setup(func(ctx resource.Context) (err error) {
      builder := echoboot.NewBuilder(ctx)
      for _, c := range ctx.Clusters() {
        builder.
          With(nil, echo.Config{Service: "a", Cluster: c}).
          With(nil, echo.Config{Service: "b", Cluster: c}).
          With(nil, echo.Config{Service: fmt.Sprintf("c-%d", c.Index()), Cluster: c})
      }
      echos, err := builder.Build()
      if err != nil {
        return err
      }
      // assign to the package-level echo.Instances
      a = echos.Match(echo.Service("a"))
      b = echos.Match(echo.Service("b"))
      c = echos.Match(echo.ServicePrefix("c-"))
      return nil
    }).
    Run()
}

Test cross-cluster reachability with a service that is the same across clusters:

func TestCrossClusterLoadBalancing(t *testing.T) {
  framework.NewTest(t).Run(func(ctx framework.TestContext) {
    // loop through the source services
    for _, a := range a {
      a := a
      ctx.NewSubtest(a.Config().Cluster.Name()).Run(func(ctx framework.TestContext) {
        // we only need to target b[0]; all `b` services should be hit since they have the same name
        // set Count to something proportional to the number of possible targets to give load-balancing a chance to work
        res := a.CallOrFail(ctx, echo.CallOptions{Target: b[0], Count: 2 * len(b)})
        // ensure 100% success
        res.CheckOKOrFail(ctx)
        // verify we reached every cluster by inspecting the ParsedResponses
        clusterHits := map[string]int{}
        for _, r := range res {
          clusterHits[r.Cluster]++
        }
        if len(clusterHits) < len(b) {
          ctx.Fatal("did not hit all clusters")
        }
      })
    }
  })
}

Test cross-cluster reachability with services that are unique per cluster:

func TestCrossClusterLoadBalancing(t *testing.T) {
  framework.NewTest(t).Run(func(ctx framework.TestContext) {
    // loop through all sources and destinations
    for _, a := range a {
      for _, c := range c {
        a, c := a, c
        ctx.NewSubtest(fmt.Sprintf("%s->%s", a.Config().Cluster.Name(), c.Config().Cluster.Name())).
          Run(func(ctx framework.TestContext) {
            // no need to set Count or verify that responses went to all instances
            a.CallOrFail(ctx, echo.CallOptions{Target: c}).CheckOKOrFail(ctx)
          })
      }
    }
  })
}

Rule 5: Use built-in framework features

Prefer framework.TestContext over *testing.T

Wherever possible, use the Istio test framework for tasks like creating subtests or marking tests as failed. Failure to do so can break assumptions about when things are cleaned up or cause a test to miss out on valuable features of the test framework, like context dumping.

// bad
t.Run("subtest", func(t *testing.T) {...})
// good
ctx.NewSubtest("subtest").Run(func(ctx framework.TestContext) {...})

Prefer WhenDone/CleanupOrFail over defer

The test framework will dump the current state of pods, proxies and logs on test failures. To take full advantage of this, avoid cleaning up resources with defer statements and prefer ctx.WhenDone. This will cause the cleanup to execute after the framework has a chance to dump the current state of things.

NOTE: WhenDone will not be skipped by --istio.test.nocleanup. ctx.Cleanup or ctx.CleanupOrFail should be used for cleanup that can be safely skipped without affecting other tests.

// bad - the config will be removed before we have a chance to dump it; you'll be missing out on valuable debug info
ctx.Config().ApplyYAMLOrFail(ctx, ns, cfg...)
defer ctx.Config().DeleteYAMLOrFail(ctx, ns, cfg...)

// good - we will dump the logs, proxy configs, and pods before we start removing the config under test
ctx.Config().ApplyYAMLOrFail(ctx, ns, cfg...)
ctx.WhenDone(func() error {
  return ctx.Config().DeleteYAML(ns, cfg...)
})
