Manager tests failures related to Spot instances #7413

Closed
mikliapko opened this issue May 8, 2024 · 9 comments

@mikliapko
Contributor

Issue description

Over the last month I've seen multiple failures in manager tests related to Spot instances (manager jobs use this instance type by default).
A couple of examples:

15:34:02  ----- LAST ERROR EVENT -------------------------------------------------------
15:34:02  2024-05-02 13:33:44.128: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=27a9d539-dc2b-42ac-bc8d-51dc742f28ea, source=MgmtCliTest.SetUp()
15:34:02  exception=Failed to get spot instances: capacity-not-available
23:40:29  ----- LAST CRITICAL EVENT ----------------------------------------------------
23:40:29  2024-04-17 20:54:31.807: (SpotTerminationEvent Severity.CRITICAL) period_type=one-time event_id=edb9ca5a-4f3c-4ce2-a04d-d717c16dd97c: node=Node manager-regression-master-loader-node-9e05767c-1 [44.193.201.15 | 10.12.2.105] (dc name: us-east-1) message={'action': 'terminate', 'time': '2024-04-17T20:56:29Z', 'time-left': 117.1921648979187}

Switching to the on_demand instance type solved the problem in each case.

@fruch

  1. Do we have any way to mitigate this issue while keeping the spot instance type in place?
  2. If not, I'd suggest switching all manager pipelines to use on_demand instance types.

Impact

High

How frequently does it reproduce?

I'd say ~50% of test executions last month.

@fruch
Contributor

fruch commented May 8, 2024

@fruch

Do we have any way to mitigate this issue while keeping the spot instance type in place?
If not, I'd suggest switching all manager pipelines to use on_demand instance types.

No, there's no magic fix for it: spot instances can be reclaimed during a test, and sometimes they aren't available at all.
Some regions are less likely to hit this, and some AZs have better availability.
It's also worth checking whether switching to a different instance type helps, since some instance types are more available than others.

In the end it's a matter of cost. In core we also run the longer tests with on_demand, and when there's a release we do the same, but for a very small set of regularly triggered jobs,
so they are triggered with spot, and if needed we trigger them again with on_demand to get a result.

In your case you might want the release triggers to use on_demand, and the ones on master to use spot.
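
A minimal sketch of that split, assuming the pipeline can pass some label describing what triggered it. The function name `provision_type_for` and the `"release"` label are hypothetical and not part of any existing pipeline code; only the `spot` and `on_demand` values come from this thread.

```python
SPOT = "spot"
ON_DEMAND = "on_demand"


def provision_type_for(trigger: str) -> str:
    """Pick an instance provision type from the kind of trigger.

    `trigger` is a hypothetical label: "release" for release runs,
    anything else (for example "master") for regular runs.
    """
    # Release runs need a result, so pay for on_demand.
    # Regular runs stay on the cheaper spot instances.
    return ON_DEMAND if trigger == "release" else SPOT
```

With this, `provision_type_for("release")` selects on_demand and `provision_type_for("master")` selects spot; a failed spot run can still be re-triggered by hand with on_demand.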

@mikliapko
Contributor Author

mikliapko commented May 9, 2024

In your case you might want the release triggers to use on_demand, and the ones on master to use spot.

Agree, sounds like a good solution in our case.
Also, in case of future failures I'll try to experiment with different instance types.

@mykaul
Contributor

mykaul commented May 15, 2024

I'd argue there's a difference between no spot available and spot termination. The former we can easily handle by falling back to on_demand, and I think that makes sense. The latter is harder to deal with, but I'd like to hope it's less common and happens when the tests are 1h or longer, I reckon?
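
A rough sketch of what such a fallback could look like for the "no spot available" case. Everything here is an assumption made for illustration: `SpotCapacityError` and the `provision` callable are hypothetical stand-ins for whatever the framework actually uses to request instances.

```python
from typing import Callable, List, Tuple


class SpotCapacityError(Exception):
    """Hypothetical error raised when no spot capacity is available."""


def provision_with_fallback(
    provision: Callable[[str], List[str]],
) -> Tuple[str, List[str]]:
    """Try spot first and fall back to on_demand if there is no capacity.

    `provision` is a hypothetical callable that takes a provision type
    ("spot" or "on_demand") and returns the created node names.
    """
    try:
        return "spot", provision("spot")
    except SpotCapacityError:
        # Only the capacity failure is handled here. A spot termination
        # in the middle of a test is a different problem and is not covered.
        return "on_demand", provision("on_demand")
```

The returned provision type lets the caller record which kind of instance the run ended up on.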

@fruch
Contributor

fruch commented May 15, 2024

I'd argue there's a difference between no spot available and spot termination. The former we can easily handle by falling back to on_demand, and I think that makes sense. The latter is harder to deal with, but I'd like to hope it's less common and happens when the tests are 1h or longer, I reckon?

Those tests are longer than 1 hour, and it's currently the same in both cases, since someone needs to manually re-run the tests if they fail (we are not doing it automatically).
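
If the re-run were ever automated, a first step could be telling the two failure kinds apart from the last error event. The sketch below is only an illustration: the function name and return labels are hypothetical, and the matched substrings are taken from the two log examples in the issue description.

```python
def classify_spot_failure(event_line: str) -> str:
    """Classify a failure event line as a spot-related failure or not.

    Returns "no_capacity", "terminated", or "other". These labels are
    hypothetical; the substrings come from the logs quoted in this issue.
    """
    if "capacity-not-available" in event_line:
        # Spot request was never fulfilled.
        return "no_capacity"
    if "SpotTerminationEvent" in event_line:
        # Spot instance was reclaimed while the test was running.
        return "terminated"
    return "other"
```

Both spot-related outcomes would be candidates for a re-run with on_demand, while "other" points at a genuine test failure.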

@mykaul
Contributor

mykaul commented May 16, 2024

I'd argue there's a difference between no spot available and spot termination. The former we can easily handle by falling back to on_demand, and I think that makes sense. The latter is harder to deal with, but I'd like to hope it's less common and happens when the tests are 1h or longer, I reckon?

Those tests are longer than 1 hour, and it's currently the same in both cases, since someone needs to manually re-run the tests if they fail (we are not doing it automatically).

All of them are longer than 1h?

@rayakurl
Contributor

rayakurl commented May 28, 2024

longer

@mykaul - all of them.
@mikliapko is working on shorter tests, but they have not been merged yet - #7456

@mykaul
Contributor

mykaul commented May 28, 2024

longer

@mykaul - all of them.

That's too bad. I don't have time now, but I'd be happy to review this at a later point. It makes little sense to me - we should be able to be more efficient.

@rayakurl
Contributor

longer

@mykaul - all of them.

That's too bad. I don't have time now, but I'd be happy to review this at a later point. It makes little sense to me - we should be able to be more efficient.

@mikliapko is working on shorter tests, but they have not been merged yet - #7456

@rayakurl
Contributor

@mykaul - if you wish to review - https://docs.google.com/spreadsheets/d/1enOmxToYVXQEQgGPBCPZIA0JblV5zaBBEFNRaKMS5Ho/edit#gid=605769695
@mikliapko created a document with all the pipelines that existed before he joined the project. @mikliapko - it's worth creating a new tab with all the new pipelines, both on master and 3.2.
