Retry tests #3229

Merged
merged 13 commits into pytorch:master from retry-tests on May 23, 2024

Conversation


@leej3 leej3 commented Apr 2, 2024

@vfdev-5

This is an attempt to rerun some of the CI tests using the nick-fields retry GitHub Action. The objective is to rerun steps in the GitHub workflows that have failed due to flaky tests or tests that hang. The pytest argument --last-failed helps here because tests that have already passed are not re-executed across reruns.
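
As a rough sketch of the rerun pattern (illustrative only: in CI the retry action re-invokes the failing workflow step rather than running a second command in the same script, and the real pytest arguments and test paths live in the workflow/test scripts; "tests/" below is a placeholder):

# first attempt: some tests may fail or hang
pytest tests/
# a retry re-executes only the tests recorded as failed in the pytest cache
pytest tests/ --last-failed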

An edge case that is trickier to handle is when the first pytest invocation passes and the subsequent one fails. In some circumstances this will lead to all tests being executed for the first invocation. The pytest argument "--last-failed-no-failures none" can help here, though I have observed undesirable behaviour in some cases when using it (see the reproducer below). Sometimes it skips all tests when I would expect (or at least hope for) it to run some tests. Additionally, if all tests are deselected, pytest exits with code 5. For now I catch and ignore this exit code for the first invocation.

Bash script to set up the reproducer
#!/bin/bash

# Clear any existing pytest cache, then recreate the reproducer directory
pytest --cache-clear || true
rm -rf lf-behaviour
mkdir -p lf-behaviour

# Create the Python test file
cat << EOF > lf-behaviour/test_example.py
#!/usr/bin/env python3
import pytest

@pytest.fixture
def cache(pytestconfig):
    return pytestconfig.cache

def test_common_pass():
    assert True

def test_common_flaky_fail(cache):
    retrieved_cache = cache.get("common_flaky_fail", 0)
    if not retrieved_cache > 2:
        cache.set("common_flaky_fail", retrieved_cache + 1)
        assert False
    # Passes once the cached failure count exceeds 2 (i.e. after three failures)
    assert True

def test_unique1_pass(cache):
    assert True

def test_unique1_flaky_fail(cache):
    retrieved_cache = cache.get("unique1_flaky_fail", 0)
    if not retrieved_cache > 2:
        cache.set("unique1_flaky_fail", retrieved_cache + 1)
        assert False
    # Passes once the cached failure count exceeds 2 (i.e. after three failures)
    assert True

def test_unique2_pass():
    assert True

def test_unique2_fail():
    assert False

EOF

# Create the Python test file unique to the first run
cat << EOF > lf-behaviour/test_example_unique3.py
#!/usr/bin/env python3
import pytest

@pytest.fixture
def cache(pytestconfig):
    return pytestconfig.cache

def test_separately_unique3_pass(cache):
    assert True

def test_separately_unique3_fail(cache):
    retrieved_cache = cache.get("unique3_flaky_fail", 0)
    if not retrieved_cache > 2:
        cache.set("unique3_flaky_fail", retrieved_cache + 1)
        assert False
    # Flaky: passes once the cached failure count exceeds 2 (i.e. after three failures)
    assert True


EOF

# Create the Python test file unique to the second run
cat << EOF > lf-behaviour/test_example_unique4.py
#!/usr/bin/env python3
import pytest
def test_separately_unique4_pass():
    assert True

def test_separately_unique4_fail():
    assert False

EOF

# Create the shell script that runs the unique tests individually
cat << EOF > lf-behaviour/run_unique_individually.sh
#!/bin/bash
set -ex

pytest \${EXTRA_PYTEST_ARGS} -k "unique1" test_example.py || { exit_code=\$?; if [ "\$exit_code" -eq 5 ]; then echo "All tests deselected"; else exit \$exit_code; fi;}
pytest \${EXTRA_PYTEST_ARGS} test_example_unique3.py || { exit_code=\$?; if [ "\$exit_code" -eq 5 ]; then echo "All tests deselected"; else exit \$exit_code; fi;}
pytest \${EXTRA_PYTEST_ARGS} -k "unique2" test_example.py || { exit_code=\$?; if [ "\$exit_code" -eq 5 ]; then echo "All tests deselected"; else exit \$exit_code; fi;}
pytest \${EXTRA_PYTEST_ARGS} test_example_unique4.py || { exit_code=\$?; if [ "\$exit_code" -eq 5 ]; then echo "All tests deselected"; else exit \$exit_code; fi;}
EOF

# Create the shell script that runs combined selections of tests
cat << EOF > lf-behaviour/run_tests.sh
#!/bin/bash
set -ex

pytest \${EXTRA_PYTEST_ARGS} -k "test_common or unique1 or unique3" . || { exit_code=\$?; if [ "\$exit_code" -eq 5 ]; then echo "All tests deselected"; else exit \$exit_code; fi;}
pytest \${EXTRA_PYTEST_ARGS} -k "test_common or unique2 or unique4" . || { exit_code=\$?; if [ "\$exit_code" -eq 5 ]; then echo "All tests deselected"; else exit \$exit_code; fi;}
EOF


cd lf-behaviour
bash run_unique_individually.sh

# Echo the command to run the tests
echo "To run the tests, use the following command (you can also add the pytest args --cache-clear and --last-failed-no-failures [all/none]):"
echo 'cd lf-behaviour'
echo 'EXTRA_PYTEST_ARGS="--last-failed --last-failed-no-failures none" bash run_unique_individually.sh'

That leaves one remaining problem before this approach is usable (i.e. before this PR could be merged): currently some of the pytest invocations hang. When that happens, the tests that were never executed are treated as not having failed, so a subsequent --last-failed run can skip most of the test suite completely. I haven't thought of a way to address this yet.

The simpler solution of removing these extra pytest arguments would work, but the timeout for the full job would need to be increased well beyond its current value (45 minutes).

EDIT: opting for the simpler solution for now. Perhaps using pytest-timeout could help with the hanging tests. Previous work here.
EDIT 2: using separate pytest cache directories for the different runs/executions of pytest on a particular CI machine allows the --last-failed functionality to be used effectively.
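
For illustration, that looks roughly like the following (a sketch; the actual cache directory names and paths used by the CI scripts may differ):

# Give each logical group of tests its own pytest cache directory so that
# --last-failed state from one invocation does not leak into another.
cache_dir=.pytest-cache-cpu-tests   # hypothetical name
# first attempt
pytest -o cache_dir="${cache_dir}" tests/ignite
# on a retry, only the tests recorded as failed for this cache directory are rerun
pytest -o cache_dir="${cache_dir}" tests/ignite --last-failed --last-failed-no-failures none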

@leej3 leej3 requested a review from vfdev-5 April 2, 2024 15:20
@github-actions github-actions bot added the ci label Apr 2, 2024
@leej3 leej3 force-pushed the retry-tests branch 4 times, most recently from fc938a9 to 855e6f1 on April 4, 2024 15:26
@leej3 leej3 mentioned this pull request Apr 4, 2024
@leej3 leej3 changed the title from "WIP: Retry tests" to "Retry tests" Apr 4, 2024
@leej3 leej3 force-pushed the retry-tests branch 2 times, most recently from 9238a9f to e5498ba on April 25, 2024 12:03
@github-actions github-actions bot added the module: metrics label Apr 25, 2024
@leej3 leej3 changed the title from "Retry tests" to "[WIP] Retry tests" Apr 25, 2024
@leej3 leej3 changed the title from "[WIP] Retry tests" to "Retry tests" Apr 29, 2024

leej3 commented Apr 29, 2024

This is working reasonably well now.

  • There seems to be some Neptune cleanup that is slowing things down; that might be worth debugging.
  • I believe there are other sources of hanging tests.
  • Distributed tests during the first pytest invocation of run_cpu_tests.sh are still hanging, but with the timeout of the retry GitHub Action they now get rerun successfully.
  • One potential issue that I see locally is that tests that produce an error (rather than a failure) don't get rerun.

Also, I abandoned "--dist=loadgroup" because of the various hangs. I can't figure out what's causing them, but if we could turn that option on for the Linux tests it would drop their execution time considerably.
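
For context, "--dist=loadgroup" refers to the pytest-xdist distribution mode that keeps tests sharing an xdist_group mark on the same worker, i.e. an invocation along the lines of (a sketch; the worker count and test path are assumptions):

# requires pytest-xdist
pytest -n 4 --dist=loadgroup tests/ignite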


vfdev-5 commented Apr 29, 2024

Do you know why this one is still failing: https://github.com/pytorch/ignite/actions/runs/8878484399/job/24374218672?pr=3229 ?

one potential issue that I see locally is that tests that cause an error don't get rerun.

Can you elaborate on this one?


leej3 commented Apr 29, 2024

Do you know why this one is still failing

After all tests completed, a "shutting down" message from Neptune was displayed and that hung for another 10 minutes or so.

Currently, when tests hang they get restarted by the GitHub Action at the 25-minute mark. That means that if a particular machine hangs twice, the job is cancelled due to the 45-minute timeout. We could adjust the retry timeout down to 20 minutes to improve the chances of success, but overall, figuring out the causes of the hanging is the ideal fix.

Regarding the errors: if an invocation of pytest has failures (invalid assertions etc.) and errors (failed fixture execution etc.), it seems that pytest reruns the failures but ignores the errors. That's likely a deal breaker... I haven't thought of a workaround yet. (EDIT: I think I was wrong about this; a more thorough investigation shows errors do get rerun.)


leej3 commented May 9, 2024

I think this is ready to go now. Overall it:

  • retries failed test runs
  • only reruns tests from the pytest suite that previously failed
  • adjusts the timeouts for some of the tests/CI test steps to make things more robust
  • consolidates the core test functionality into a script shared across the pre-existing test scripts

Test failures were too frequent to avoid using pytest's "--last-failed" functionality. This introduced some extra details, such as needing to specify the pytest cache directory and trapping an exit code of 5 when a previous pytest invocation had no failures. To keep things maintainable I moved these details into a function shared across all of the scripts.
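
A simplified sketch of the kind of shared helper described here (illustrative only; the real function lives in tests/common-test-functionality.sh and takes more options, and the variable names below follow the snippet quoted in the review further down):

run_tests() {
    local cache_dir="$1"      # per-invocation pytest cache directory
    local pytest_args="$2"    # test path plus any -k / -m selections and --last-failed flags
    local last_failed_no_failures_code=5
    # Treat "all tests deselected" (pytest exit code 5) as success so that a retry
    # following a fully passing previous invocation does not fail the whole job.
    eval "pytest -o cache_dir=${cache_dir} ${pytest_args}" || {
        exit_code=$?
        if [ "$exit_code" -eq "$last_failed_no_failures_code" ]; then
            echo "All tests deselected"
        else
            exit "$exit_code"
        fi
    }
}

A test script would then call something like run_tests ".cpu-tests-cache" "--last-failed tests/ignite" (the arguments here are hypothetical).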

@leej3 leej3 force-pushed the retry-tests branch 2 times, most recently from 426d4a3 to f37e188 on May 13, 2024 11:40
@github-actions github-actions bot added the module: handlers label May 13, 2024
@leej3 leej3 force-pushed the retry-tests branch 3 times, most recently from b9015f6 to 8cd5c65 on May 21, 2024 13:24
@leej3 leej3 changed the title from "Retry tests" to "[WIP] Retry tests" May 21, 2024
leej3 and others added 8 commits May 22, 2024 11:08
  • greatly speeds up reruns of tests as only previously failed tests are rerun
  • define pytest cachedir for each pytest invocation to prevent interaction between different selections of tests
  • protect against exit code of 5 when a previous pytest invocation had no failed tests, which results in all tests being deselected
  • use eval to avoid issues with the -k and -m expansions
@leej3 leej3 force-pushed the retry-tests branch 2 times, most recently from 15d4c95 to 475be56 on May 22, 2024 15:06

leej3 commented May 22, 2024

Success!

A significant issue with retrying the test suite was that the retry action sends a SIGTERM, which pytest completely ignores. The result was overlapping pytest sessions that caused a bit of a mess. I abandoned a failed strategy of trying to kill the pytest process (SIGINT and then SIGKILL) upon retrying. It is easier to modify how pytest interprets the SIGTERM (though it's worth noting this behaviour has been modified, in case it causes other undesirable side effects).

Regardless of how the test session is interrupted, one tricky detail was to include "unrun" tests in subsequent retries. These tests are not run because a session hangs and is interrupted; pytest appropriately classifies them as not failed, but for our purposes we do in fact want to include them in the next "--last-failed" run. My solution was to add an option "--treat-unrun-as-failed" that classifies such tests as failed for subsequent runs.
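
For example, a retried invocation might then look roughly like the following (a sketch; the exact combination of flags used by the CI scripts is defined in the shared test script and tests/ignite/conftest.py):

# on retries, rerun everything that previously failed or never ran
pytest -o cache_dir="${cache_dir}" tests/ignite \
    --last-failed --last-failed-no-failures none --treat-unrun-as-failed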

The overall PR summary:

  • Failures on CI are now rerun up to 5 times. Tests that have previously passed are not rerun.
  • Common functionality in the bash test scripts has been consolidated into tests/common-test-functionality.sh
  • tests/ignite is passed to the pytest command. This pattern ensures the correct conftest.py is used to configure the CLI.

@leej3 leej3 changed the title from "[WIP] Retry tests" to "Retry tests" May 22, 2024

@vfdev-5 vfdev-5 left a comment

Thanks a lot, John, for working on this PR all this time!
Now CI is passing :)
I left a few comments and questions; can you please check them?

Review comments were left on the following files (now resolved or outdated):
.github/workflows/hvd-tests.yml
.github/workflows/pytorch-version-tests.yml
.github/workflows/tpu-tests.yml
.github/workflows/unit-tests.yml
tests/ignite/conftest.py

# Run the command
if [ "$trap_deselected_exit_code" -eq "1" ]; then
CUDA_VISIBLE_DEVICES="" eval "pytest ${pytest_args}" || { exit_code=$?; if [ "$exit_code" -eq ${last_failed_no_failures_code} ]; then echo "All tests deselected"; else exit $exit_code; fi; }

vfdev-5 commented:

Why do we need the eval call here? Can't we make the call without eval?


leej3 replied:

I added an eval here because of some bugs I was running into where things like "-k ''" end up as a bare -k in the final command. The horrors of using eval are somewhat mitigated by the bash "-x" flag, so bugs in the command can be spotted more quickly. I think consistently using arrays for assembling commands in bash would be a better alternative, but using Python is the best long-term solution.
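
For reference, the array-based alternative mentioned above would look something like this (a sketch, not the code in this PR; the variable names are hypothetical):

# Assemble the command in a bash array instead of a single string passed to eval;
# empty selections are simply never appended, and quoting is preserved on expansion.
pytest_cmd=(pytest -o "cache_dir=${cache_dir}")
if [ -n "${match_expr}" ]; then
    pytest_cmd+=(-k "${match_expr}")
fi
"${pytest_cmd[@]}"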

@@ -0,0 +1,102 @@
#!/bin/bash

vfdev-5 commented:

I wonder whether it would be simpler, for later maintenance, to write this script in Python instead of bash, and how much effort that would take.
If you think this is feasible we can do that in a follow-up PR.


leej3 replied:

I think it would be easy enough: largely just providing a simple CLI, setting environment variables, and assembling commands to run as a subprocess. A few hours probably, perhaps a few more to iron out issues and add some tests for it.


vfdev-5 replied:

OK, sounds good. Let's make a Python script instead of this bash script in a follow-up PR.


@vfdev-5 vfdev-5 left a comment

LGTM, thanks John!

@leej3 leej3 enabled auto-merge (squash) May 23, 2024 09:44
@leej3 leej3 merged commit 8db318b into pytorch:master May 23, 2024
20 checks passed