
GitHub Actions: "Re-run failed jobs" will run the entire test suite #531

Open
BioCarmen opened this issue Apr 4, 2022 · 64 comments

Labels: Cypress Cloud · triaged (Issue has been routed to backlog. This is not a commitment to have it prioritized by the team.) · type: feature (feature request)

Comments

@BioCarmen

BioCarmen commented Apr 4, 2022

We have our e2e tests configured to run in parallel on the Cypress Dashboard.
I was following this thread and added a custom build id to the command to distinguish runs by build id. Everything worked fine until GitHub Actions rolled out the ability to "Re-run failed jobs".
If I set the custom build id to ${{ github.run_id }}, the second attempt always marks the tests as passing with 'Run Finished', but no tests are triggered at all.
So I set the custom build id to ${{ github.run_id }}-${{ github.run_attempt }}; now the re-run executes the entire test suite instead of the originally allocated subset of tests.

E2E_tests:
  runs-on: ubuntu-latest
  name: E2E tests
  strategy:
    fail-fast: false
    matrix:
      ci_node_total: [6]
      ci_node_index: [0, 1, 2, 3, 4, 5]
  timeout-minutes: 45
  steps:
    - uses: actions/checkout@v2

    - name: Use Node.js
      uses: actions/setup-node@v2

    - name: Install Dependencies
      run: npm ci

    - name: Start app
      run: make start-app-for-e2e
      timeout-minutes: 5

    - name: Cypress Dashboard - Cypress run
      run: |
        npm run cypress
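
For reference, a minimal sketch of how the build id described above can be passed through to the Cypress CLI, assuming the npm script forwards extra arguments to cypress run (--record, --parallel, and --ci-build-id are standard Cypress CLI options; the exact script wiring is an assumption):

    - name: Cypress Dashboard - Cypress run
      run: |
        # --ci-build-id ties all parallel matrix jobs into one logical Dashboard run;
        # appending run_attempt makes each GitHub re-run attempt a distinct run
        npm run cypress -- --record --parallel --ci-build-id "${{ github.run_id }}-${{ github.run_attempt }}"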

@tanimaroy2012

I'm facing the same issue: on "Re-run failed jobs", Cypress runs all the tests again without the parallel setup that it used in the first run.

@ninasarabia

I'm having this same issue as well -- anyone found any workarounds to this?

@dannyskoog

+1 👍

@tebeco

tebeco commented May 13, 2022

That's definitely something critical.
Given how the billing works (Cypress and GitHub included), it sounds like we're getting billed for something that already passed.

@admah
Contributor

admah commented May 16, 2022

@BioCarmen @ninasarabia do you have any links to runs that you could share where this is happening?

@samanthablasbalg

Is there any update on this? Getting charged for an entire test suite re-run when one test fails on one parallel job is really upsetting, given the size of our test suite.

@VinceBarresi

I'm experiencing this also. Any updates on a fix or workaround?

@ilovegithub2

ilovegithub2 commented Jul 4, 2022

We are seeing the same issue - here is our configuration

      - name: Run integration tests
        timeout-minutes: 20
        uses: cypress-io/github-action@v4
        env:
          CYPRESS_RECORD_KEY: ${{ secrets.CYPRESS_RECORD_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          ci-build-id: ${{ needs.prepare.outputs.uuid }}
          config: baseUrl=${{ format('https://pr{0}-www.build.{1}', github.event.number, env.CBR_PROJECT_DOMAIN) }}
          wait-on: ${{ format('https://pr{0}-www.build.{1}', github.event.number, env.CBR_PROJECT_DOMAIN) }}
          wait-on-timeout: 120
          browser: chrome
          record: true
          parallel: true
          group: merge
          install: false
          working-directory: tests/web
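
The prepare job referenced by needs.prepare.outputs.uuid is not shown above; a minimal sketch of what such a job might look like (the job name, step id, and output name are assumptions inferred from the reference above):

  prepare:
    runs-on: ubuntu-latest
    outputs:
      uuid: ${{ steps.uuid.outputs.value }}
    steps:
      # one UUID per workflow run, shared as the ci-build-id by every parallel container
      - id: uuid
        run: echo "value=$(uuidgen)" >> "$GITHUB_OUTPUT"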

@thijsnado

Reading up on https://docs.cypress.io/guides/guides/parallelization, I think this may be a side effect of how Cypress load-balances things. Tests aren't split up evenly in a deterministic way; Cypress tries to find un-run tests and distribute them to whichever workers are free. When you re-run only failures, you are essentially just reducing the total number of workers available to process all the jobs.

@tebeco

tebeco commented Aug 5, 2022

So this means the Cypress code base might need more work to address that issue?

We're still getting over-billed because of an internal implementation detail; is that what you're saying?

@thijsnado

thijsnado commented Aug 5, 2022

I can't comment since I don't work for Cypress, but this part https://docs.cypress.io/guides/guides/parallelization#CI-parallelization-interactions makes me think that each individual container won't always run the same tests. The re-run is just interpreted as another parallel test run, but with fewer containers to run the tests.

@tebeco

tebeco commented Aug 5, 2022

It's slightly worse than that if you use the standard GitHub "Re-run failed jobs" feature (the title of this issue). It will:

  • re-run literally the entire test suite
  • not parallelize the re-run as it did originally
  • fail 100% of the time if you had a 5-minute timeout per parallel run and 5 runs, because of both previous points: it would need 25 minutes to run what originally finished in less than 5

So you're down to re-running the whole matrix (not the title of this issue), and you're billed for 25 minutes instead of 5 if only one job failed.

@thijsnado

@tebeco is it that it doesn't parallelize, or that only one of the containers failed so all the tests get run in that one container? I'd have to try a few more times to know for certain, but I think if you have multiple containers fail, it will parallelize across those containers; more tests will just run per container, since the "passing" containers don't do anything.

@tebeco

tebeco commented Aug 5, 2022

Both are bad; think about the billing.
One test fails: it should take 1-2 minutes, but in my previous example you're now billed roughly 25x more.

And that is regardless of whether the parallel matrix is respected, since all test/run minutes are accounted for.

I think not re-parallelizing would be less critical if:

  • it only re-ran the failed tests
  • or there were a threshold on re-runs for splitting them
  • or it re-ran the same container count but only what failed in each, so that "job 1" would still be "job 1"

For now it's unpredictable: a full re-run with full billing, nothing else.

@admah
Contributor

admah commented Sep 13, 2022

There were recently some changes in our services repo that may have taken care of this issue. Can someone retest with 10.7.0 or later and post results? Thanks!

@kinson

kinson commented Sep 14, 2022

There were recently some changes in our services repo that may have taken care of this issue. Can someone retest with 10.7.0 or later and post results? Thanks!

@admah I just tested this after upgrading to 10.8.0 and still saw all of the tests run in a single job when one of the parallelized containers had a failed test.

To give some more detail, the codebase I am working on uses the Cypress parallelization feature, attached to Cypress dashboard, to split our test suite into 5 different jobs. In this situation, one test failed in one of the parallelized jobs. To retry this test, I clicked the "re-run failed jobs" button in GitHub and that kicked off the Cypress tests again in the same job containing the failed test. But, instead of running the same set of tests, it re-ran all of the tests in the single job. I have included a screenshot that should hopefully illustrate this a little better.

Thanks for looking into this, it would be a huge improvement to our CI pipeline if this issue was resolved!

[Screenshot: Screen Shot 2022-09-14 at 6 44 52 AM]

@ashanka-singh-qatalog

Thanks for looking into this, it would be a huge improvement to our CI pipeline if this issue was resolved!

Agree. This fix is very much needed to optimize the CI run time and avoid unnecessarily triggering tests that have already passed in a prior attempt, thereby reducing the billing.

@samanthablasbalg

Yes, I also tried to replicate this last night and saw this same behavior:

I clicked the "re-run failed jobs" button in GitHub and that kicked off the Cypress tests again in the same job containing the failed test. But, instead of running the same set of tests, it re-ran all of the tests in the single job. I have included a screenshot that should hopefully illustrate this a little better.

This by default will fail the job, because a single worker can't possibly run all of the tests before the job timeout kicks in (which is why the suite is parallelized in the first place). We are on 10.7.0.

I believe what would need to happen is for Cypress to remember which tests get allocated to which workers so that if there is a failure on worker 3 of 5, and "re-run failed jobs" is selected on the GHA side, the same set of tests will get re-run on that worker.

@admah
Contributor

admah commented Sep 14, 2022

I was able to get some more clarity on this from our Cloud team. Issue #574 also has some additional context.

Here is the current status:

  • Before, there was an issue where all re-runs got a PASS, regardless of actual status. This issue has been fixed.
  • Currently, if a re-run is initiated, all specs get run on the machines available. That is not optimal. The Cloud team is looking into the connection between GH Actions and Cypress in order to set up re-runs to be accurate and efficient.

I will be updating this issue as new information is available.

@trevor-bennett

For the issue I wrote that is linked above, it turns out that any re-run job skips every single test, regardless of whether it is re-run from start or from failed, when run against Cypress 10.9.0 and the cypress-io/cypress@2.1.0 orb.

@dannyskoog

@admah Any news? Thanks

@piotrekkr
Contributor

All those problems could be fixed if the dashboard worked as follows for the same dashboard run KEY.

7 tests, 3 workers

First run:

  • run all tests and load-balance them across all workers
  • 5 / 7 tests green, 2 workers failed

Next runs with the same Cypress KEY:

  • check the dashboard result for the given KEY and collect the failed tests
  • run all failed tests and load-balance them on the two available runners (two workers failed, so on re-run GitHub provides only those two)

At least this would work fine with GitHub, imho.

I'm not sure how hard it is to implement, but it is the dashboard that orchestrates and sends tests to workers, so my guess is that this should not be very hard, unless completed test-suite runs somehow cannot be updated...

@piotrekkr
Contributor

piotrekkr commented Nov 21, 2022

I was able to get some more clarity on this from our Cloud team. Issue #574 also has some additional context.

Here is the current status:

* Before, there was an issue where all re-runs got a PASS, regardless of actual status. This issue has been fixed.

@admah Does this mean that if I re-run failed workers with the same Cypress run KEY (the ci-build-id param in the cypress-io/github-action@v4 action), the Cypress action will now fail? Previously it returned success after a few seconds without running any actual tests on the workers. I needed to add a workaround to fail the job myself in that case, before even triggering the Cypress action (a sketch of such a guard follows at the end of this comment).

* Currently, if a re-run is initiated, all specs get run on the machines available. That is not optimal. The Cloud team is looking into the connection between GH Actions and Cypress in order to set up re-runs to be accurate and efficient.

@admah You mean a re-run with the same Cypress run KEY? If so, this seems to contradict the first point.
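
A minimal sketch of the kind of guard I mean, assuming the goal is simply to fail fast on any GitHub re-run attempt (the step name and message are illustrative):

      - name: Guard against GHA re-run attempts
        if: ${{ github.run_attempt != '1' }}
        run: |
          # recorded parallel runs cannot safely reuse the same ci-build-id,
          # so refuse to proceed on "Re-run failed jobs" attempts
          echo "Re-runs are not supported for recorded parallel Cypress jobs; use 'Re-run all jobs' instead."
          exit 1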

@kinson

kinson commented Dec 16, 2022

@admah is there a planned release version for this yet?

Currently, if a re-run is initiated, all specs get run on the machines available. That is not optimal. The Cloud team is looking into the connection between GH Actions and Cypress in order to set up re-runs to be accurate and efficient.

@davisg123

davisg123 commented Dec 20, 2022

The ability to re-run failed tests is becoming more and more necessary as we scale; it's making us consider alternatives to Cypress cloud

@Git2DaChoppaa

@admah Any news on this?

@MikeMcC399
Collaborator

@Git2DaChoppaa

According to https://www.linkedin.com/in/amu/ Adam Murray (@admah) doesn't work for Cypress.io any more.

@spark-annes

We're experiencing the same issue, and I'm commenting both to add another user for prioritization numbers and to get updates.

@jaffrepaul
Member

Thank you for bumping this. This is currently under discussion as it relates to other work and Cypress Cloud features. There is a sense of urgency, but a solution isn't quite as simple as it sounds. I'll continue to update this thread as information is available.

@tvthatsme

Any news on this thread? Seems like it went quiet for a couple of months. My team is experiencing this issue as well.

@mjhenkes added the "triaged" label (Issue has been routed to backlog. This is not a commitment to have it prioritized by the team.) Apr 21, 2023
@mjhenkes removed their assignment Apr 24, 2023
@ryanpei

ryanpei commented Apr 29, 2023

Hi! As @jaffrepaul mentioned, we're exploring various ways to solve this on Cypress Cloud. These are a few of the potential directions towards which we are leaning:

  1. Add a "spec retries" feature, whereby specs with the failing tests will be automatically retried at the end of the test run. This could also include some scoping configurations, for instance, if you want to only retry tests which just failed in the latest run, but passed in previous runs (as opposed to "known issue" tests which failed over multiple consecutive runs).

  2. Add an "environment stability test and cooldown" feature, whereby this custom-to-your-testing-environment test would execute, perhaps under conditions like the occurrence of "new" failures as described above or just periodically in between batches of tests, and, if it fails, would pause the execution of more tests until a specified wait period passes, thereby allowing your infrastructure sufficient time to cool down from memory leaks etc. and become stable again.

As you can tell, we're leaning towards more automated, less CI-specific solutions in these cases, as opposed to a "retry button" integration. Requiring you to inspect test failures and click these buttons on a per-job basis seems like a good opportunity for automation. Also, and this is mostly a Cypress concern, we would need to not only add support for GitHub Actions' "Re-run failed jobs", but also other CI providers. And even if we did use an automated GHA "Re-run failed jobs" (as opposed to the manual mode), that implementation adds challenges for us in terms of adding more cross-run test-result linking in Cloud.

I can't yet give an estimate of when we would address this specific issue, however we are currently actively working on improvements to Cypress' failure retries and there's a strong chance this could be worked on in one of the next phases of this project.

@alexjoeyyong

Any update on this? Or does anyone have a workaround for getting a re-run to ONLY run the failed tests?

@ryanpei

ryanpei commented Jun 27, 2023

Hi all, I’ll surface our latest thinking thus far about this:

We’ve heard from some of you that you don’t want your team to use GHA’s Rerun at all, for any Cypress test run. It may not be desirable because the test scenario is not idempotent, because it creates performance issues, or because it simply hides potentially real issues.

GHA unfortunately does not have a means to disable their “Rerun” functionality on specific jobs. I’ve posed a question in the GHA community about this. If you agree it would be better for your team to disable GHA reruns for specific jobs, please upvote/comment on that question and it hopefully gets more attention.

As for use-cases where the rerun is still a useful feature, the ideal solution has been a challenge to pinpoint due to differing use-cases.

Keep in mind, the only valid situation for a rerun is when there are false failures (failed tests that you expect to pass on a 2nd run, without any code or environment changes). There are two features which already exist in Cypress for this:

  • Configurable timeouts: Try increasing the timeouts for tests which may need that.
  • Test retries: Add multiple attempts for failing tests; useful in cases like this where the test is actually flaky, and not consistently failing.

If you have not yet tried any of the above workarounds, I highly recommend checking them out to see if they would work for you.
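
As a concrete illustration of both workarounds in a GitHub Actions setup (a minimal sketch; the values are examples, not recommendations, and the config input of cypress-io/github-action accepts comma-separated overrides):

      - name: Cypress run with raised timeouts and test retries
        uses: cypress-io/github-action@v4
        with:
          record: true
          # raise the default command timeout and retry failing tests twice in run mode
          config: defaultCommandTimeout=10000,retries=2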

If the above workarounds don’t work for you, then there’s likely a more elusive problem for your test environment. It’s probably either infrastructure-related or due to sporadic issues with the machines the test ran on. We want to gather feedback in the future (you can also let us know now, but I don’t want to imply we’re solving for this right now) that indicates to us the most common root cause, so we can later solve for that first.

If you give your feedback here in this GH issue to tell us whether your “false” failures are due to infrastructure or to specific machines experiencing issues on a given run, or something else, please also provide a little context about these issues, such as:

  • how you can differentiate between the “false” and “true” failing tests (if you can)
  • what status code/message are you seeing in the error(s)
  • if it's a timeout, what is the timeout error (i.e. cypress error, network error, etc.)
  • if multiple failures happen all on the same machine(s)
  • if the “false” failures are always due to this same issue or a number of different ones
  • frequency (e.g. every other run, every ~10 runs, etc.)
  • if your team knows the specific root cause of the issues in the respective test environment, and reasons why it won’t be addressed anytime soon
  • if any of the solutions mentioned in my earlier comment sound like they would solve your issue
  • anything else that may be relevant to this

Answers to the above questions give us both contextual clarity as well as tell us whether your specific issue is correctly categorized (in the past we’ve seen confusion about which is which).

@kj-brown

It feels like Cypress Dashboard is able to store the GHA run (which changes by commit). Is it too far a leap to store the failures for a given run ID and, if the number is greater than 0, simply use the array of failed specs when it's re-run? I doubt anyone's really re-running their tests for the fun of it.

@richardsondx

+1 to bump this ticket to top priority. What's the latest update about this issue? For anyone who experienced this issue; did you end up finding a solution that you could share here?

@cgraham-rs

If the above workarounds don’t work for you, then there’s likely a more elusive problem for your test environment. It’s probably either infrastructure-related or due to sporadic issues with the machines the test ran on. We want to gather feedback in the future (you can also let us know now, but I don’t want to imply we’re solving for this right now) that indicates to us the most common root cause, so we can later solve for that first.

@ryanpei You are correct. In my last project the failures were not application or test related but due to infrastructure problems. In those scenarios we really just want to re-run the failed tests in the existing workflow.

Note that there are two observed behaviors of this issue:

  • Cypress re-runs the whole suite instead of JUST the failed jobs, per the title of this issue
  • If you're parallelizing, Cypress takes your entire test suite and sticks it into ONLY the runners with a failure. So, if there's a single test failure in a single runner, the issue is even worse, as your entire test suite is jammed into a single runner on the re-run. That could take hours; in our case it was consistently longer than simply re-running the entire workflow.

In that scenario the only "solution" was to opt to "Re-run all jobs" for GitHub workflows that encounter a Cypress error. Cypress is going to run the whole suite anyway, but at least it will successfully parallelize the tests across all runners.

@donacross

Thank you for bumping this. This is currently under discussion as it relates to other work and Cypress Cloud features. There is a sense of urgency, but a solution isn't quite as simple as it sounds. I'll continue to update this thread as information is available.

Hello @jaffrepaul
Thank you for the great work accomplished by the team so far.
Would you have any visibility on when the feature of re-running only failed tests may be shipped?
Do you know which price plan it may land on as well?
Thank you for your time

@jaffrepaul
Member

Thank you for bumping this. This is currently under discussion as it relates to other work and Cypress Cloud features. There is a sense of urgency, but a solution isn't quite as simple as it sounds. I'll continue to update this thread as information is available.

Hello @jaffrepaul Thank you for the great work accomplished by the team so far. Would you have any visibility on when the feature of re-running only failed tests may be shipped? Do you know which price plan it may land on as well? Thank you for your time

@ryanpei

@rodrigo-rac2

+1 to bump this ticket to top priority.

@nitishvu

nitishvu commented Nov 4, 2023

+1 to bump this ticket to top priority.

@MikeMcC399
Collaborator

MikeMcC399 commented Nov 5, 2023

The GitHub Actions documentation for the Re-run failed jobs option says:

Re-running a workflow or jobs in a workflow uses the same GITHUB_SHA (commit SHA) and GITHUB_REF (Git ref) of the original event that triggered the workflow run.

This means that if a CI run on GitHub Actions fails and then offers you the "Re-run failed jobs" option, this is not useful if your Application under Test (AUT) is also part of the same repository.


If you now correct the error in your application which caused the Cypress test failure and then push the correction to GitHub, selecting "Re-run failed jobs" will not use the application correction, since it is tied to the previous state of the application (commit SHA) and it does not use the new commit with the correction.

GitHub Actions "Re-run failed jobs" is useful to repeat workflow runs that have failed due to GitHub issues such as temporary network connectivity. It does not seem to be helpful to re-run Cypress tests in general, because they would need to run on a corrected version of the AUT.

Cypress Cloud offers Spec Prioritization and Auto Cancellation. Used together, they ensure that, on the next run, the failed specs will run first and then, depending on your Auto Cancellation threshold, the run stops after X failures. This way, you can ensure that you're only going to see the whole test suite run again once the failures are fixed. And you do want to make sure no other tests fail because of the "fix", so it's recommended to re-run everything at the end anyway.

Auto Cancellation is supported by the Cypress GitHub Action with the parameter auto-cancel-after-failures since cypress-io/github-action@v5.1.0
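
For example, enabling Auto Cancellation through the action might look like this (a minimal sketch; the threshold of 1 is illustrative, and the feature requires a Cypress Cloud plan that includes Smart Orchestration):

      - name: Cypress run
        uses: cypress-io/github-action@v5
        with:
          record: true
          parallel: true
          # cancel the whole recorded run after the first test failure
          auto-cancel-after-failures: 1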

So the recommendation would be to use Spec Prioritization and Auto Cancellation and avoid the use of GitHub Action's "Re-run failed jobs".

If there is any enhancement to cypress-io/github-action it would be to support possible new features from Cypress Cloud. "Re-run failed jobs" itself does not lend itself to an enhancement in this action.

@MikeMcC399 added the "type: feature" (feature request) label Nov 5, 2023
@cbookg

cbookg commented Nov 6, 2023

GitHub Actions "Re-run failed jobs" is useful to repeat workflow runs that have failed due to GitHub issues such as temporary network connectivity. It does not seem to be helpful to re-run Cypress tests in general, because they would need to run on a corrected version of the AUT.

Aside from GitHub issues, it's useful to be able to re-run due to issues with any other external dependency.

@MikeMcC399
Collaborator

The README section > Parallel has been revised to reflect how the action currently works together with Cypress Cloud:


The Cypress GH Action does not spawn or create any additional containers - it only links the multiple containers spawned using the matrix strategy into a single logical Cypress Cloud run where it splits the specs amongst the machines. See the Cypress Cloud Smart Orchestration guide for a detailed explanation.

If you use the GitHub Actions facility for Re-running workflows and jobs, note that Re-running failed jobs in a workflow is not suited for use with parallel recording into Cypress Cloud. Re-running failed jobs in this situation does not simply re-run failed Cypress tests. Instead it re-runs all Cypress tests, load-balanced over the containers with failed jobs.

To optimize runs when there are failing tests present, refer to the optional Cypress Cloud Smart Orchestration Premium features Spec Prioritization and Auto Cancellation.

@kawsugiarto

+1 to bump this ticket to top priority.

@MikeMcC399
Collaborator

MikeMcC399 commented Apr 17, 2024

@kawsugiarto

+1 to bump this ticket to top priority.

This issue is left open so that users can more easily find the discussion, although the dependency is on Cypress Cloud, and any improvement would need to be initiated in that software, not in the Cypress GitHub Action.

At this time there is nothing which can be done to improve the Cypress GitHub Action in this regard. You can find the recommendations about how to use the action optimally in the documentation (and in the previous posting #531 (comment)).

See also the previous discussions in this thread.

@kinson

kinson commented May 2, 2024

@MikeMcC399 is there another repository where we should raise this issue instead? If Cypress controls how tests are distributed with "Smart Orchestration", then hopefully they can distribute the same tests from the failing container back onto the single container being re-run 🤞🏻

@MikeMcC399
Collaborator

@kinson

is there another repository where we should instead raise this issue to be fixed?

Cypress Cloud is not open source, and so there is no public repository containing the Cypress Cloud code or one which can accept external issues.

The Cypress Help Center however describes the support available for Team, Business and Enterprise Cypress Cloud subscribers including the ability to submit tickets. The page also links to the Cypress technical user community on Discord where there is a dedicated Cypress Cloud channel. Discord is available to all users, including those using Cypress Cloud under the Free plan.

If Cypress controls how tests are distributed with "Smart Orchestration" then hopefully they can distribute the same tests on the failing container back on the single container being re-run 🤞🏻

It seems that there is no simple solution for this requirement in a GitHub Actions environment, so the recommendation remains to take advantage of the facilities currently offered by "Smart Orchestration" as described in previous postings.

@theomelo

theomelo commented May 2, 2024

@MikeMcC399, thank you for always taking the time and patience to answer questions and provide solutions when possible. I really appreciate you 🙏

@MikeMcC399
Collaborator

@theomelo

thank you for always taking the time and patience to answer questions and provide solutions when possible. I really appreciate you 🙏

Thank you for your kind comments! 😄
