ec2: Wait for describe_spot_instance_requests() #5401

tonyhutter · 2020-07-13T22:49:35Z

I frequently see this error when starting up buildbot:

An error occurred (InvalidSpotInstanceRequestID.NotFound) when calling the DescribeSpotInstanceRequests operation: The spot instance request ID 'sir-abcd1234' does not exist

After the error, I'll see "zombie" instances running with no tags in AWS. This is caused by EC2LatentWorker._wait_for_request() calling describe_spot_instance_requests() before the request is ready. I noticed it can sometimes take a second or so for the request to show up.

This patch waits up to five seconds for describe_spot_instance_requests() to return successfully.
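For readers who want to see the idea in isolation, here is a minimal sketch of the waiting behaviour, not the actual patch. It polls a describe call and treats a "not found" error as "not ready yet" until a deadline passes. `RequestNotFound`, `wait_for_request`, and `fake_describe` are hypothetical names used for illustration only; the real code talks to EC2 through boto3.

```python
import time


class RequestNotFound(Exception):
    """Hypothetical stand-in for InvalidSpotInstanceRequestID.NotFound."""


def wait_for_request(describe, timeout=5.0, poll_interval=0.5, sleep=time.sleep):
    """Call describe() until it stops raising RequestNotFound or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return describe()
        except RequestNotFound:
            if time.monotonic() >= deadline:
                raise  # still missing after the grace period
            sleep(poll_interval)


# Simulated API (assumption): the request becomes visible on the third call.
calls = {"n": 0}


def fake_describe():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RequestNotFound("sir-abcd1234")
    return {"State": "open"}


result = wait_for_request(fake_describe, poll_interval=0)
```

Passing `sleep` and `poll_interval` as parameters keeps the loop easy to exercise in a unit test without real delays.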

Contributor Checklist:

I have updated the unit tests
I have created a file in the master/buildbot/newsfragments directory (and read the README.txt in that directory)
I have updated the appropriate documentation

codecov · 2020-07-15T00:06:31Z

Codecov Report

Merging #5401 (f6cd9d7) into master (fe0d017) will increase coverage by 0.09%.
The diff coverage is 93.75%.

@@            Coverage Diff             @@
##           master    #5401      +/-   ##
==========================================
+ Coverage   91.75%   91.84%   +0.09%     
==========================================
  Files         345      345              
  Lines       36861    36866       +5     
==========================================
+ Hits        33820    33860      +40     
+ Misses       3041     3006      -35

Impacted Files	Coverage Δ
master/buildbot/worker/ec2.py	`73.29% <93.75%> (+13.35%)`	⬆️
master/buildbot/util/queue.py	`90.69% <0.00%> (-6.98%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

p12tic · 2020-07-15T07:22:59Z

Is it possible to wait for the request to show up specifically instead of relying on describe_spot_instance_requests erroring out when it does not exist?

tardyp

5s looks like a pretty low threshold.

I would have just put the try/except at line 554 and made a fake status out of that exception.

master/buildbot/worker/ec2.py

tonyhutter · 2020-07-15T22:15:29Z

@p12tic Unfortunately I didn't see another function that would work better than describe_spot_instance_requests().

@tardyp my latest push refactors the code to incorporate the changes you were requesting.

tardyp · 2020-07-16T21:04:18Z

logic looks better. We still need to adapt the unit tests

tonyhutter · 2020-07-16T23:31:51Z

I assume some of these test failures are to be expected?:

[ERROR]
Traceback (most recent call last):
  File "/buildbot/buildbot-job/build/sandbox/lib/python3.8/site-packages/twisted/trial/_synctest.py", line 1078, in run
    warnings.warn_explicit(**w)
builtins.UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first.
buildbot.test.unit.test_version.VersioningUtilsTests_PKG.test_getVersionFromArchiveIdNoTag
-------------------------------------------------------------------------------
Ran 6110 tests in 444.243s
FAILED (skips=29, errors=1, successes=6081)
Exception ignored in: <socket.socket fd=6, family=AddressFamily.AF_INET, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('0.0.0.0', 0)>
ResourceWarning: unclosed <socket.socket fd=6, family=AddressFamily.AF_INET, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('0.0.0.0', 0)>
Exception ignored in: <_io.FileIO name='/dev/null' mode='rb' closefd=True>
ResourceWarning: unclosed file <_io.TextIOWrapper name='/dev/null' mode='r' encoding='UTF-8'>

(https://buildbot.buildbot.net/#builders/2/builds/1796)

tonyhutter · 2020-07-21T19:59:31Z

@tardyp what is left to do to get this merged?

tardyp · 2020-07-22T08:38:29Z

Hi @tonyhutter, we had a regression in our CI due to a new warning from setuptools. It took me a while to fix it, which is why this took so long. Sorry about that. I rebased your work to let the tests run again. Let's see what happens.

tardyp · 2020-07-26T13:27:37Z

@tonyhutter now, I can see that there are a few issues in flake8/pylint, and that the coverage of the diff is not very good. Can you please look at it?

tardyp

We would need a unit test which hooks that API to simulate the case you are trying to add. This is important for maintenance, so that your feature is not broken when somebody else refactors this code later.

master/buildbot/worker/ec2.py

tonyhutter · 2020-07-28T01:01:17Z

We would need a unit test which hooks that API to simulate the case you are trying to add.

Ok, I'll add it in

tonyhutter · 2020-07-28T19:17:09Z

@tardyp I was planning to use https://github.com/buildbot/buildbot/blob/master/master/buildbot/test/unit/test_worker_ec2.py#L392
as a template for my test, but I notice that it doesn't actually request a spot instance:

    @mock_ec2
    def test_start_spot_instance(self):
        c, r = self.botoSetup('latent_buildbot_slave')
        amis = list(r.images.all())
        product_description = 'Linux/Unix'
        bs = ec2.EC2LatentWorker('bot1', 'sekrit', 'm1.large',
                                 identifier='publickey',
                                 secret_identifier='privatekey',
                                 keypair_name='keypair_name',
                                 security_name='security_name',
                                 ami=amis[0].id, spot_instance=True,
                                 max_spot_price=1.5,
                                 product_description=product_description
                                 )
        bs._poll_resolution = 0

        instance_id, _, _ = bs._start_instance()
        instances = r.instances.filter(
            Filters=[{'Name': 'instance-state-name', 'Values': ['running']}])
        instances = list(instances)
        self.assertTrue(bs.spot_instance)
        self.assertEqual(bs.product_description, product_description)
        self.assertEqual(len(instances), 1)
        self.assertEqual(instances[0].id, instance_id)
        self.assertIsNone(instances[0].tags)

It should be calling:

        instance_id, _, _ = bs._request_spot_instance()

... to actually exercise the spot instance code. I assume that's because you don't want to rack up a bunch of bills with AWS running test cases. So I'm not sure how I should proceed testing my changes to _thd_wait_for_request(). Suggestions?

tardyp · 2020-07-28T20:02:49Z

Hi @tonyhutter, notice the @mock_ec2 decorator... We don't really call EC2; instead we mock every call to it using this handy mock_ec2 library.
So what we have to do here is patch the mock_ec2 library so that request_spot_instance actually raises the needed exception.
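One way to force a mocked call to fail once before succeeding is `unittest.mock.patch.object` with a `side_effect` list. The sketch below is illustrative only: `FakeClient` and `NotFoundError` are hypothetical stand-ins for the mocked EC2 client and its error, and which method to patch in the real test is an assumption.

```python
from unittest import mock


class NotFoundError(Exception):
    """Hypothetical stand-in for the EC2 'request not found' error."""


class FakeClient:
    """Hypothetical stand-in for the mocked EC2 client."""

    def describe_spot_instance_requests(self, SpotInstanceRequestIds):
        return {"SpotInstanceRequests": [{"State": "active"}]}


client = FakeClient()
request_ids = ["sir-abcd1234"]
real_response = client.describe_spot_instance_requests(SpotInstanceRequestIds=request_ids)

# side_effect is consumed one item per call: the first call raises the
# exception instance, the second call returns the normal response.
with mock.patch.object(
    client,
    "describe_spot_instance_requests",
    side_effect=[NotFoundError("does not exist"), real_response],
) as patched:
    try:
        client.describe_spot_instance_requests(SpotInstanceRequestIds=request_ids)
        first_call_raised = False
    except NotFoundError:
        first_call_raised = True
    second = client.describe_spot_instance_requests(SpotInstanceRequestIds=request_ids)
    call_count = patched.call_count
```

A test built this way can assert both that the code under test survives the first failure and that it retried exactly once.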

tonyhutter · 2020-08-01T00:32:52Z

@tardyp thanks for the info, I now see what you're talking about (using boto3 as a fake EC2). Give me some time to put together a test case.

tardyp · 2020-08-01T09:45:48Z

Hi. It's actually moto which does the mocking of boto.

tonyhutter · 2020-08-18T23:06:57Z

@tardyp I wrote a test case, but it appears describe_spot_price_history() is not implemented in moto:

 TestEC2LatentWorker
    test_constructor_minimal ...                                           [OK]
    test_constructor_region ...                                            [OK]
    test_constructor_tags ...                                              [OK]
    test_fail_mixing_classic_and_vpc_ec2_settings ...                      [OK]
    test_fail_multiplier_and_max_are_none ...                              [OK]
    test_get_image_ami ...                                                 [OK]
    test_get_image_location ...                                            [OK]
    test_get_image_location_not_found ...                                  [OK]
    test_get_image_owners ...                                         [SKIPPED]
    test_start_instance ...                                                [OK]
    test_start_instance_attach_volume ...                                  [OK]
    test_start_instance_ip ...                                             [OK]
    test_start_instance_tags ...                                           [OK]
    test_start_instance_volumes ...                                        [OK]
    test_start_instance_volumes_deprecated ...                             [OK]
    test_start_spot_instance ...                                           [OK]
    test_start_spot_instance_slow_startup ...                           [ERROR]
...
Traceback (most recent call last):
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/moto/core/models.py", line 88, in wrapper
    result = func(*args, **kwargs)
  File "/home/buildbot/test/buildbot/master/buildbot/test/unit/test_worker_ec2.py", line 478, in test_start_spot_instance_slow_startup
    instance_id, _, _ = bs._request_spot_instance()
  File "/home/buildbot/test/buildbot/master/buildbot/worker/ec2.py", line 477, in _request_spot_instance
    bid_price = self._bid_price_from_spot_price_history()
  File "/home/buildbot/test/buildbot/master/buildbot/worker/ec2.py", line 457, in _bid_price_from_spot_price_history
    spot_prices = self.ec2.meta.client.describe_spot_price_history(
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/client.py", line 316, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/client.py", line 621, in _make_api_call
    http, parsed_response = self._make_request(
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/client.py", line 641, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/endpoint.py", line 102, in make_request
    return self._send_request(request_dict, operation_model)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/endpoint.py", line 136, in _send_request
    while self._needs_retry(attempts, operation_model, request_dict,
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/endpoint.py", line 253, in _needs_retry
    responses = self._event_emitter.emit(
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 183, in __call__
    if self._checker(attempts, response, caught_exception):
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 250, in __call__
    should_retry = self._should_retry(attempt_number, response,
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 269, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 316, in __call__
    checker_response = checker(attempt_number, response,
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 222, in __call__
    return self._check_caught_exception(
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 359, in _check_caught_exception
    raise caught_exception
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/endpoint.py", line 197, in _do_get_response
    responses = self._event_emitter.emit(event_name, request=request)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/moto/core/models.py", line 271, in __call__
    status, headers, body = response_callback(
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/moto/core/responses.py", line 197, in dispatch
    return cls()._dispatch(*args, **kwargs)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/moto/core/responses.py", line 295, in _dispatch
    return self.call_action()
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/moto/core/responses.py", line 380, in call_action
    response = method()
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/moto/ec2/responses/spot_instances.py", line 38, in describe_spot_price_history
    raise NotImplementedError(
builtins.NotImplementedError: SpotInstances.describe_spot_price_history is not yet implemented

buildbot.test.unit.test_worker_ec2.TestEC2LatentWorker.test_start_spot_instance_slow_startup

Is it ok to skip the test case?

Conan-Kudo · 2020-08-20T20:57:47Z

@tonyhutter I would suggest implementing the test and adding a skip condition at the beginning that prints a warning explaining that the function is unimplemented in moto. You can make it skip when it detects that the function raises NotImplementedError.

p12tic · 2020-08-20T21:36:52Z

I wonder if we could use the mock library to mock moto itself :-) Can we maybe just return some dummy data out of that function?

tonyhutter · 2020-08-20T22:39:34Z

Thanks all for the suggestions. @tardyp do you have a preference on how to proceed with the test case?

tardyp · 2020-08-21T06:40:40Z

I agree with @p12tic that for methods moto hasn't implemented, we should just use mock. Skipping is not really an option, as the test would be skipped unconditionally, so it would never run and would add no value.
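As a rough illustration of that approach, the sketch below replaces a method that raises NotImplementedError with one that returns dummy data. `SpotBackend` and `average_spot_price` are hypothetical stand-ins, not moto or Buildbot code; the dummy payload only assumes the `SpotPriceHistory` / `SpotPrice` shape of the EC2 response.

```python
from unittest import mock


class SpotBackend:
    """Hypothetical stand-in for a library class with a stubbed-out method."""

    def describe_spot_price_history(self):
        raise NotImplementedError("describe_spot_price_history is not yet implemented")


def average_spot_price(backend):
    """Hypothetical code under test: averages the returned spot prices."""
    history = backend.describe_spot_price_history()
    prices = [float(entry["SpotPrice"]) for entry in history["SpotPriceHistory"]]
    return sum(prices) / len(prices)


# Dummy data standing in for a real price history response.
dummy = {"SpotPriceHistory": [{"SpotPrice": "1.0"}, {"SpotPrice": "2.0"}]}

# Patch the unimplemented method on the class for the duration of the block.
with mock.patch.object(SpotBackend, "describe_spot_price_history", return_value=dummy):
    price = average_spot_price(SpotBackend())
```

Unlike a skip, the patched test always runs, so the surrounding logic stays covered even while the underlying method is missing.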

tonyhutter · 2020-08-27T01:14:29Z

Quick updates:

EC2 spot instances fail on InvalidSpotInstanceRequestID.NotFound #4617 appears to be the same bug as this PR.
I went down the rabbit hole of updating moto to implement describe_spot_price_history(), and was able to make it work. The bad news is that once I fixed that, I found other bugs in moto (like in their request_spot_instance() and describe_spot_instance()), which I'm still looking into. It's been a slow slog...

tardyp · 2020-08-27T07:20:50Z

@tonyhutter thanks for update.. much appreciated!

spulec · 2020-09-04T20:54:51Z

Let me know if I can help with getting this added to Moto or with the bugs in the other endpoints.

(long-time Buildbot user. thank you for all of your work)

tardyp · 2020-09-05T08:25:22Z

@spulec I guess any help is appreciated. You can say here that you are working on this branch in order to avoid duplicate work, and then create a new PR on the side to show your progress.

tonyhutter · 2020-09-09T00:28:36Z

Looks like there's an existing describe_spot_price_history() moto issue: getmoto/moto#783. I'll post my moto updates in that bug until I get my PR for them ready.

stale · 2020-12-25T16:30:51Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Conan-Kudo · 2020-12-25T16:31:39Z

@tonyhutter Any progress here?

tonyhutter · 2020-12-28T16:41:36Z

I've been working on some higher priority projects for the last couple months and unfortunately this has fallen by the wayside. It's probably going to be at least a month or more before I can look at it again, but I do plan to get back to it eventually.

hixio-mh

#5401

tonyhutter · 2021-05-03T16:27:31Z

Just to summarize where this PR is at:

This patch works fine and fixes the issue. I even tested it on the 3.0.x branch where the problem still exists and can confirm that this patch fixes it.
I was asked in this PR to add a test case.
The test case needs to call describe_spot_price_history() but moto hasn't implemented that function yet (it's currently just a stub function).
I started implementing describe_spot_price_history() in moto, and got it to work and return some mock data, but then hit further problems in the test case where the instance wasn't starting.
I spent a ton of time messing with moto trying to figure out why the instance wasn't starting, but no dice. At this point I'm done banging my head against the wall trying to get the moto stuff to work.

I've rebased this PR on master so if you want to pull it in without the test case, great. If not, no big deal, I'll just include this PR's patch in our project's buildbot repo, and apply it manually. I'd recommend you do pull it though since it's a fairly low-risk patch and fixes a real issue.

Conan-Kudo · 2021-05-03T16:31:16Z

@tonyhutter Do you have your work on moto where the problem exists available somewhere? Maybe someone here can help with that.

tonyhutter · 2021-05-03T18:34:15Z

@Conan-Kudo: Here's my attempt at test case for posterity (doesn't work):

buildbot: https://gist.github.com/tonyhutter/63e1a8a0ae7cb442c685097969009083
moto: https://gist.github.com/tonyhutter/47173e2f2e810920b5ee264cabb5e171

p12tic · 2021-05-13T17:56:16Z

@tonyhutter I've hacked the missing support into moto as part of the Buildbot tests. As part of that I had to change exception capturing to the following:

except ClientError as e:
    if 'InvalidSpotInstanceRequestID.NotFound' in str(e):
    <...>

That is, we catch botocore.client.ClientError instead of self.ec2.meta.errors.InvalidSpotInstanceRequestID.NotFound.

Could you please check if this still works with real ec2? If yes, then this PR is ready for merge, thanks a lot for the work you put in.

tonyhutter · 2021-05-14T00:00:36Z

@p12tic thanks, I'll give that a test

tonyhutter · 2021-05-14T16:05:56Z

@p12tic I just tested and verified that your code works. Do you want me to make the change in my PR to:

            except ClientError as e:
                if 'InvalidSpotInstanceRequestID.NotFound' in str(e):
                    requests = None
                else:
                    raise

... or just leave my PR as-is?
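For anyone reading along, here is a self-contained sketch of how that except clause behaves. The `ClientError` class below is a stand-in that only mirrors the shape of botocore's exception (an assumption made so the sketch runs without botocore installed), and `describe_or_none`, `not_found`, and `other_error` are hypothetical helpers.

```python
class ClientError(Exception):
    """Stand-in mirroring the shape of botocore's ClientError (assumption)."""

    def __init__(self, error_response, operation_name):
        self.response = error_response
        self.operation_name = operation_name
        err = error_response["Error"]
        super().__init__(
            f"An error occurred ({err['Code']}) when calling the "
            f"{operation_name} operation: {err['Message']}"
        )


def describe_or_none(describe):
    """Return describe()'s result, or None if the request is not visible yet."""
    try:
        return describe()
    except ClientError as e:
        if "InvalidSpotInstanceRequestID.NotFound" in str(e):
            return None
        raise  # any other client error is a real failure


def not_found():
    raise ClientError(
        {"Error": {"Code": "InvalidSpotInstanceRequestID.NotFound",
                   "Message": "The spot instance request ID 'sir-abcd1234' does not exist"}},
        "DescribeSpotInstanceRequests",
    )


def other_error():
    raise ClientError(
        {"Error": {"Code": "UnauthorizedOperation", "Message": "denied"}},
        "DescribeSpotInstanceRequests",
    )


missing = describe_or_none(not_found)
try:
    describe_or_none(other_error)
    reraised = False
except ClientError:
    reraised = True
```

The bare `raise` matters: only the "not found" case is swallowed, and every other error still propagates to the caller.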

p12tic · 2021-05-15T04:36:35Z

@tonyhutter Actually I've already pushed this change to your branch, I just needed a confirmation that it works and will merge the PR soon. Thanks a lot!

tonyhutter · 2021-05-17T16:26:14Z

@p12tic thanks 👍

tonyhutter force-pushed the fix-wait-for-request branch from 6805398 to c1d9da7 Compare July 14, 2020 23:54

tardyp reviewed Jul 15, 2020

tonyhutter force-pushed the fix-wait-for-request branch 3 times, most recently from 81c37aa to c75ff0c Compare July 15, 2020 22:10

tardyp requested changes Jul 26, 2020

tonyhutter mentioned this pull request Jul 31, 2020

Update buildbot to 2.8.x openzfs/zfs-buildbot#200

Closed

tonyhutter force-pushed the fix-wait-for-request branch from 59dd5ca to 4f68182 Compare August 18, 2020 20:50

tonyhutter mentioned this pull request Aug 21, 2020

Monitor Spot Instance Price in Real Time using Moto getmoto/moto#783

Open

stale bot added the stalled label Dec 25, 2020

stale bot removed the stalled label Dec 25, 2020

tonyhutter force-pushed the fix-wait-for-request branch from 4f68182 to ab63bb5 Compare January 12, 2021 22:13

hixio-mh approved these changes May 3, 2021

tonyhutter force-pushed the fix-wait-for-request branch from ab63bb5 to 15647bb Compare May 3, 2021 16:23

tonyhutter and others added 2 commits May 13, 2021 20:49

ec2: Actually test spot instances endpoints in spot_* tests

613a48a

p12tic force-pushed the fix-wait-for-request branch from 15647bb to 613a48a Compare May 13, 2021 17:49

Add newsfragment

f6cd9d7

p12tic approved these changes May 15, 2021

p12tic merged commit 1c4f045 into buildbot:master May 15, 2021

tonyhutter mentioned this pull request Jul 1, 2021

Update buildbot to 3.2.0 openzfs/zfs-buildbot#233

Merged
