Skip to content

Reducing Test Flakiness

keyonghan edited this page Jan 24, 2024 · 22 revisions

Flakiness issue has caused a large portion of the Flutter tree redness, and below workflow will be enforced to reduce flaky issues. The framework post-submit DeviceLab tests will be focused on in the beginning, and the logic will be extended to other host only tests in the future.

From Flutter tree dashboard, a flake is identified as a box with an exclamation icon. There are two types that will result in same flaky box.

  • Multiple reruns on the same commit and same task (earlier run fails, but the last run succeeds). For this case, check logs by clicking different build runs.

Preventing flaky tests

DeviceLab tests are located under /dev/devicelab/bin/tasks. If you plan to add a new DeviceLab test, please follow

  • Create a PR to add test files
    • Make sure an ownership entry is created for the test in TESTOWNERS file
  • Enable the test in staging pool first
  • If no flakiness issue pops up, then enable the test in the prod env.
    • Switch bringup to true.

Detecting flaky tests

On a weekly basis, an automation script will scan through test execution statistics over the past 15 days and identify top flaky ones

  • If there are any test builders whose Flaky Ratio >= 2%
    • Create a tracking bug if not existing in the bug pool.
      • The sub-team TL will be assigned by default for further triage/re-assign.
      • P0 will be labeled
    • If it is not a shard test, the script marks the tests as flaky by updating the entry in .ci.yaml.
      • Add a # TODO(username): github issue url above the bringup: true line

If an issue is closed, there will be a grace period of 15 days before the automation script refile the issue if the same flakiness persists.

Triaging flaky tests

Figuring out how and why a set of tests is failing can be tricky. Here are a few tips to help kickstart the process.

Use the generated ticket

The auto-generated ticket will provide links to:

  • An example of recent flakes on the same commit with a link to its Luci build page if available
  • A list of flaky builds with links to their Luci build page
  • A link to the recent test runs on the flutter dashboard

All of these pieces of information are helpful for further narrowing down the issue.

Identify infrastructure issues

Things not directly related to the tests being run can be difficult to determine from the stdout logs. Things like timeouts are easier to determine using the execution details rather than test_stdout. From the Luci build page:

  • The failing step should have a red icon with an exclamation point and be expanded
  • Click on the execution details link
  • At the bottom of these logs, look for non-zero exit codes

Some other common infra issues:

Identify failing tests

Sometimes, the reported error is not immediately obvious as to what test has failed. In these cases, digging into the test_stdout for clues can be helpful. From the Luci build page:

  • The failing step should have a red icon with an exclamation point and be expanded, e.g. step run flutter_view_ios__start_up.
  • Click on the test_stdout link
    • Example 1
      • Search for "ERROR:" in the logs (don't skip the ":")
      • There will be a "RUNNING:" above any "ERROR:"s that are found, between the preceding "RUNNING:" and the "ERROR:" there are a few different things that could point to a test failure:
        • Keep an eye out for "Failed assertion:" which points to specific failures in dart/flutter tests
        • For non-dart/flutter tests, the output between "RUNNING:" and "ERROR:" can vary greatly, but is usually shorter and should provide some guidance as to what the failure was and how to address it
    • Example 2
      • Search for "[E]" for shard tests flaking on a unit test

Identifying a pattern

Flakes, by their nature, are inconsistent. Determining a pattern can be very helpful in figuring out what causes the issue, and may help in identifying a fix or workaround. The most useful tool for this is the flutter dashboard link provided by the ticket under "Recent test runs" which will filter to just the relevant flaky test set. E.g. Linux_android flutter_gallery__transition_perf_with_semantics

Verify repeating failures

Flakes are reported on the set of tests run, not specific test failures. This can mean that the issue raising us above our 2% threshold is actually more than one issue. This is pretty rare, but does happen on occasion. There's no need to waste a lot of time verifying that every failure is the same, but taking some time to verify some of the failures can save time in the long run. The list of "Flaky builds:" on the automated ticket is very useful for this, and you can often find additional flaky builds in the flutter dashboard.

Find a first instance

Flakes can be under the 2% threshold for a long time and slowly build up from various small increases in flakiness. Other times they can have a specific cause that started in an obvious location. The latter are much easier to resolve, so it's a good idea to check for a root cause. In the Flutter dashboard:

  • Scroll down to load more builds
  • Scroll down until you have a long string of green builds (say, a full screen or two)
    • This is a heuristic, so it may not truly be the first instance, but it works well for triaging
  • Scroll back up to the first red build or green build with an exclamation point
  • Clicking on the profile picture next to the failed/flaky build pops up a link to the commit where this failure occurred
    • See if the commit directly affects any of the tests that are failing by adding randomness, new asserts
    • Keep an eye out for rolls that might affect the tests that are failing as well
  • Compare the first instance of the flake with the previous successful run (see "Compare against successful runs" below for tips!)

Compare against successful runs

Sometimes it can be easier to identify patterns of what's different between test runs than it is to identify a reason for the failure just from the failing runs. After you've identified what's failing, if you can't figure out what's causing the test to fail, keep an eye out for these common patterns:

  • Do the tests fail when they're run in a certain order?
    • Our test sets' order is randomized, so sometimes interaction between tests can cause issues that only come up occasionally
    • If you notice that the failing builds fail tests in a certain order, also validate that the successful builds do not fail in that same order
  • Do the tests tend to flake at certain times?
    • If the test isn't using something like DateTime.now(), then this may be a coincidence, but it's a good thing to check if you're stumped
    • If the test flakes the first build on a VM each day, it is likely due to the daily fresh provisioning of the bots.
  • Do the tests flake on same bot/device?
    • If yes, it is most likely due to hardware issues, and can be routed to infra team for help.

Fixing flaky tests

The TL will help triage, reassign, and attempt to fix the flakiness.

If the test was marked flaky in CI and then fixed, the test can be re-enabled after being validated for 50 consecutive runs without flakiness issues (task without exclamation point in flutter build dashboard and task not failed due to the same flaky failure). This can be done by updating the test entry in .ci.yaml to remove the bringup: true entry for that target.

If not fixable, the test will be removed from the flutter build dashboard or deleted from CI completely depending on specific cases.

Flutter Wiki

Process

Framework repo

The Flutter CLI Tool

Engine repo

Packages repo

Engineering Productivity

User documentation

Clone this wiki locally