ec2: Wait for describe_spot_instance_requests() #5401

Merged
merged 3 commits into from
May 15, 2021

Conversation

tonyhutter
Contributor

I frequently see this error when starting up buildbot:

An error occurred (InvalidSpotInstanceRequestID.NotFound) when calling the DescribeSpotInstanceRequests operation: The spot instance request ID 'sir-abcd1234' does not exist

After the error, I'll see "zombie" instances running with no tags in AWS. This is caused by EC2LatentWorker._wait_for_request() calling describe_spot_instance_requests() before the request is ready. I noticed it can sometimes take a second or so for the request to show up.

This patch waits up to five seconds for describe_spot_instance_requests() to return successfully.
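
The waiting behaviour described above could be sketched roughly as follows. This is an illustration only, not the code from the patch: `wait_for_spot_request`, `RequestNotFound`, and the `describe` callable are hypothetical stand-ins (the real code would call describe_spot_instance_requests() on the EC2 client and catch the corresponding client error), and the default five-second limit is taken from the description.

```python
import time


class RequestNotFound(Exception):
    """Hypothetical stand-in for InvalidSpotInstanceRequestID.NotFound."""


def wait_for_spot_request(describe, request_id, timeout=5.0, poll_interval=0.5):
    """Call describe(request_id) until it succeeds or the timeout expires.

    `describe` is an assumed wrapper around describe_spot_instance_requests().
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return describe(request_id)
        except RequestNotFound:
            # The request may simply not be visible yet; give up only
            # once the deadline has passed.
            if time.monotonic() >= deadline:
                raise
            time.sleep(poll_interval)
```

Any error other than the "not found" case propagates immediately, so real failures are not hidden by the retry.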

Contributor Checklist:

  • I have updated the unit tests
  • I have created a file in the master/buildbot/newsfragments directory (and read the README.txt in that directory)
  • I have updated the appropriate documentation

@codecov

codecov bot commented Jul 15, 2020

Codecov Report

Merging #5401 (f6cd9d7) into master (fe0d017) will increase coverage by 0.09%.
The diff coverage is 93.75%.


@@            Coverage Diff             @@
##           master    #5401      +/-   ##
==========================================
+ Coverage   91.75%   91.84%   +0.09%     
==========================================
  Files         345      345              
  Lines       36861    36866       +5     
==========================================
+ Hits        33820    33860      +40     
+ Misses       3041     3006      -35     
Impacted Files                   Coverage Δ
master/buildbot/worker/ec2.py    73.29% <93.75%> (+13.35%) ⬆️
master/buildbot/util/queue.py    90.69% <0.00%> (-6.98%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@p12tic
Member

p12tic commented Jul 15, 2020

Is it possible to wait for the request to show up specifically instead of relying on describe_spot_instance_requests erroring out when it does not exist?

Member

@tardyp tardyp left a comment


5s looks like a pretty low threshold.

I would have just put the try/except at line 554 and made a fake status out of that exception.
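
One way to read the "fake status" suggestion above is sketched below. It is a guess at the shape of the idea, not the code at line 554: `RequestNotFound`, `request_status`, `poll_until_fulfilled`, and both status constants are hypothetical names chosen for this sketch.

```python
class RequestNotFound(Exception):
    """Hypothetical stand-in for InvalidSpotInstanceRequestID.NotFound."""


# Placeholder status names chosen for this sketch only.
NOT_VISIBLE_YET = "not-visible-yet"
FULFILLED = "fulfilled"


def request_status(describe, request_id):
    """Return the request's status, or a fake one if it is not visible yet."""
    try:
        return describe(request_id)
    except RequestNotFound:
        # Turn the exception into a status the polling loop already understands.
        return NOT_VISIBLE_YET


def poll_until_fulfilled(describe, request_id, max_polls=10):
    """Poll the status; return True once fulfilled, False if polls run out."""
    for _ in range(max_polls):
        if request_status(describe, request_id) == FULFILLED:
            return True
    return False
```

With this shape the existing polling loop needs no separate retry path: a request that is not visible yet is treated like any other pending state.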

@tonyhutter tonyhutter force-pushed the fix-wait-for-request branch 3 times, most recently from 81c37aa to c75ff0c Compare July 15, 2020 22:10
@tonyhutter
Contributor Author

@p12tic Unfortunately I didn't see another function that would work better than describe_spot_instance_requests().

@tardyp my latest push refactors the code to incorporate the changes you were requesting.

@tardyp
Member

tardyp commented Jul 16, 2020

logic looks better. We still need to adapt the unit tests

@tonyhutter
Contributor Author

I assume some of these test failures are to be expected?:

[ERROR]
Traceback (most recent call last):
  File "/buildbot/buildbot-job/build/sandbox/lib/python3.8/site-packages/twisted/trial/_synctest.py", line 1078, in run
    warnings.warn_explicit(**w)
builtins.UserWarning: Distutils was imported before Setuptools. This usage is discouraged and may exhibit undesirable behaviors or errors. Please use Setuptools' objects directly or at least import Setuptools first.
buildbot.test.unit.test_version.VersioningUtilsTests_PKG.test_getVersionFromArchiveIdNoTag
-------------------------------------------------------------------------------
Ran 6110 tests in 444.243s
FAILED (skips=29, errors=1, successes=6081)
Exception ignored in: <socket.socket fd=6, family=AddressFamily.AF_INET, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('0.0.0.0', 0)>
ResourceWarning: unclosed <socket.socket fd=6, family=AddressFamily.AF_INET, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('0.0.0.0', 0)>
Exception ignored in: <_io.FileIO name='/dev/null' mode='rb' closefd=True>
ResourceWarning: unclosed file <_io.TextIOWrapper name='/dev/null' mode='r' encoding='UTF-8'>

(https://buildbot.buildbot.net/#builders/2/builds/1796)

@tonyhutter
Contributor Author

@tardyp what is left to do to get this merged?

@tardyp
Member

tardyp commented Jul 22, 2020

Hi @tonyhutter, we had a regression in our CI due to a new warning from setuptools. It took me a while to fix it, which is why this took so long. Sorry about that. I rebased your work to let the tests run again. Let's see what happens.

@tardyp
Member

tardyp commented Jul 26, 2020

@tonyhutter now, I can see that there are a few issues in flake8/pylint, and that the coverage of the diff is not very good. Can you please look at it?

Member

@tardyp tardyp left a comment


We would need a unit test which hooks that API to simulate the case you are trying to add. This is important for maintenance, so that your feature is not broken when somebody else refactors this code later.

@tonyhutter
Contributor Author

We would need a unit test which hooks that API to simulate the case you are trying to add.

Ok, I'll add it in

@tonyhutter
Contributor Author

@tardyp I was planning to use https://github.com/buildbot/buildbot/blob/master/master/buildbot/test/unit/test_worker_ec2.py#L392
as a template for my test, but I notice that it doesn't actually request a spot instance:

    @mock_ec2
    def test_start_spot_instance(self):
        c, r = self.botoSetup('latent_buildbot_slave')
        amis = list(r.images.all())
        product_description = 'Linux/Unix'
        bs = ec2.EC2LatentWorker('bot1', 'sekrit', 'm1.large',
                                 identifier='publickey',
                                 secret_identifier='privatekey',
                                 keypair_name='keypair_name',
                                 security_name='security_name',
                                 ami=amis[0].id, spot_instance=True,
                                 max_spot_price=1.5,
                                 product_description=product_description
                                 )
        bs._poll_resolution = 0

        instance_id, _, _ = bs._start_instance()
        instances = r.instances.filter(
            Filters=[{'Name': 'instance-state-name', 'Values': ['running']}])
        instances = list(instances)
        self.assertTrue(bs.spot_instance)
        self.assertEqual(bs.product_description, product_description)
        self.assertEqual(len(instances), 1)
        self.assertEqual(instances[0].id, instance_id)
        self.assertIsNone(instances[0].tags)

It should be calling:

        instance_id, _, _ = bs._request_spot_instance()

... to actually exercise the spot instance code. I assume that's because you don't want to rack up a bunch of bills with AWS running test cases. So I'm not sure how I should proceed testing my changes to _thd_wait_for_request(). Suggestions?

@tardyp
Member

tardyp commented Jul 28, 2020

Hi @tonyhutter, notice the @mock_ec2 decorator... We don't really call EC2; instead we mock every call to it using this handy mock_ec2 library.
So what we have to do here is patch the mock_ec2 library to make sure request_spot_instance actually raises the needed exception.
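
The idea of making a mocked call raise the needed exception can be sketched with the standard library's `unittest.mock`. This is only an illustration of the technique, not the eventual Buildbot test: `NotFoundError` and `first_successful` are hypothetical, and the sketch assumes the exception should come from the describe call (the one that fails in the original report) on its first invocation.

```python
from unittest import mock


class NotFoundError(Exception):
    """Hypothetical stand-in for the 'request ID does not exist' client error."""


def first_successful(client, request_id, attempts=3):
    """Simplified stand-in for the code under test: retry the describe call."""
    for attempt in range(attempts):
        try:
            return client.describe_spot_instance_requests(
                SpotInstanceRequestIds=[request_id])
        except NotFoundError:
            # Re-raise only when the last attempt has also failed.
            if attempt == attempts - 1:
                raise


client = mock.Mock()
# First call raises, second call returns a response: a "slow startup".
client.describe_spot_instance_requests.side_effect = [
    NotFoundError("sir-abcd1234"),
    {"SpotInstanceRequests": [{"SpotInstanceRequestId": "sir-abcd1234"}]},
]
response = first_successful(client, "sir-abcd1234")
```

Because `side_effect` is a list, each call consumes the next item, and exception instances in the list are raised rather than returned.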

@tonyhutter
Contributor Author

@tardyp thanks for the info, I now see what you're talking about (using boto3 as a fake EC2). Give me some time to put together a test case.

@tardyp
Member

tardyp commented Aug 1, 2020

Hi. It's actually moto which does the mocking of boto.

@tonyhutter
Contributor Author

@tardyp I wrote a test case, but it appears describe_spot_price_history() is not implemented in moto:

 TestEC2LatentWorker
    test_constructor_minimal ...                                           [OK]
    test_constructor_region ...                                            [OK]
    test_constructor_tags ...                                              [OK]
    test_fail_mixing_classic_and_vpc_ec2_settings ...                      [OK]
    test_fail_multiplier_and_max_are_none ...                              [OK]
    test_get_image_ami ...                                                 [OK]
    test_get_image_location ...                                            [OK]
    test_get_image_location_not_found ...                                  [OK]
    test_get_image_owners ...                                         [SKIPPED]
    test_start_instance ...                                                [OK]
    test_start_instance_attach_volume ...                                  [OK]
    test_start_instance_ip ...                                             [OK]
    test_start_instance_tags ...                                           [OK]
    test_start_instance_volumes ...                                        [OK]
    test_start_instance_volumes_deprecated ...                             [OK]
    test_start_spot_instance ...                                           [OK]
    test_start_spot_instance_slow_startup ...                           [ERROR]
...
Traceback (most recent call last):
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/moto/core/models.py", line 88, in wrapper
    result = func(*args, **kwargs)
  File "/home/buildbot/test/buildbot/master/buildbot/test/unit/test_worker_ec2.py", line 478, in test_start_spot_instance_slow_startup
    instance_id, _, _ = bs._request_spot_instance()
  File "/home/buildbot/test/buildbot/master/buildbot/worker/ec2.py", line 477, in _request_spot_instance
    bid_price = self._bid_price_from_spot_price_history()
  File "/home/buildbot/test/buildbot/master/buildbot/worker/ec2.py", line 457, in _bid_price_from_spot_price_history
    spot_prices = self.ec2.meta.client.describe_spot_price_history(
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/client.py", line 316, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/client.py", line 621, in _make_api_call
    http, parsed_response = self._make_request(
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/client.py", line 641, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/endpoint.py", line 102, in make_request
    return self._send_request(request_dict, operation_model)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/endpoint.py", line 136, in _send_request
    while self._needs_retry(attempts, operation_model, request_dict,
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/endpoint.py", line 253, in _needs_retry
    responses = self._event_emitter.emit(
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 183, in __call__
    if self._checker(attempts, response, caught_exception):
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 250, in __call__
    should_retry = self._should_retry(attempt_number, response,
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 269, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 316, in __call__
    checker_response = checker(attempt_number, response,
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 222, in __call__
    return self._check_caught_exception(
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/retryhandler.py", line 359, in _check_caught_exception
    raise caught_exception
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/endpoint.py", line 197, in _do_get_response
    responses = self._event_emitter.emit(event_name, request=request)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/moto/core/models.py", line 271, in __call__
    status, headers, body = response_callback(
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/moto/core/responses.py", line 197, in dispatch
    return cls()._dispatch(*args, **kwargs)
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/moto/core/responses.py", line 295, in _dispatch
    return self.call_action()
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/moto/core/responses.py", line 380, in call_action
    response = method()
  File "/home/buildbot/test/buildbot/.venv/lib/python3.8/site-packages/moto/ec2/responses/spot_instances.py", line 38, in describe_spot_price_history
    raise NotImplementedError(
builtins.NotImplementedError: SpotInstances.describe_spot_price_history is not yet implemented

buildbot.test.unit.test_worker_ec2.TestEC2LatentWorker.test_start_spot_instance_slow_startup

Is it ok to skip the test case?

@Conan-Kudo
Contributor

@tonyhutter I would suggest implementing the test and adding a skip condition at the beginning of it, printing a warning that the function is unimplemented in moto. You can make it skip when it detects that the function raises NotImplementedError.
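
A conditional skip of this kind might look like the sketch below. It uses the standard library's `unittest` as an assumption (Buildbot's own test base classes may differ), and `describe_spot_price_history` here is a hypothetical stub standing in for the unimplemented moto call.

```python
import unittest


def describe_spot_price_history():
    """Hypothetical stand-in for a mock-backend call that is still a stub."""
    raise NotImplementedError("describe_spot_price_history is not yet implemented")


class SlowStartupTest(unittest.TestCase):
    def test_slow_startup(self):
        try:
            describe_spot_price_history()
        except NotImplementedError as e:
            # Skip with a visible reason instead of failing the run.
            self.skipTest(f"mock backend lacks support: {e}")
        # The real assertions would follow here once the call is implemented.


# Run the test programmatically and collect the outcome.
result = unittest.TestResult()
unittest.defaultTestLoader.loadTestsFromTestCase(SlowStartupTest).run(result)
```

The skip is conditional on the call actually raising, so the test would start running on its own once the backend implements the function.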

@p12tic
Member

p12tic commented Aug 20, 2020

I wonder if we could use the mock library to mock moto itself :-) Can we maybe just return some dummy data out of that function?

@tonyhutter
Contributor Author

Thanks all for the suggestions. @tardyp do you have a preference on how to proceed with the test case?

@tardyp
Member

tardyp commented Aug 21, 2020

I agree with @p12tic that for methods moto does not implement, we should just use mock. Skipping is not really an option, as it would just skip unconditionally, so the test would never run and would add no value.
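
Returning dummy data from an unimplemented method could be sketched like this with `unittest.mock`. The names `FakeEC2Client` and `highest_spot_price` and the dummy prices are hypothetical; only the `SpotPriceHistory` response key and the `SpotPrice` string field follow the shape of the EC2 API response.

```python
from unittest import mock


class FakeEC2Client:
    """Hypothetical stand-in for a mocked EC2 client whose method is a stub."""

    def describe_spot_price_history(self, **kwargs):
        raise NotImplementedError("describe_spot_price_history is not yet implemented")


def highest_spot_price(client):
    """Simplified stand-in for code that reads the spot price history."""
    history = client.describe_spot_price_history(InstanceTypes=["m1.large"])
    return max(float(entry["SpotPrice"]) for entry in history["SpotPriceHistory"])


# Dummy data invented for this sketch.
dummy_history = {
    "SpotPriceHistory": [
        {"InstanceType": "m1.large", "SpotPrice": "1.0"},
        {"InstanceType": "m1.large", "SpotPrice": "1.5"},
    ]
}

client = FakeEC2Client()
with mock.patch.object(FakeEC2Client, "describe_spot_price_history",
                       return_value=dummy_history):
    price = highest_spot_price(client)
```

Unlike a skip, the patched test always runs, and the patch is removed automatically when the `with` block exits.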

@tonyhutter
Contributor Author

Quick updates:

  1. EC2 spot instances fail on InvalidSpotInstanceRequestID.NotFound #4617 appears to be the same bug as this PR.
  2. I went down the rabbit hole of updating moto to implement describe_spot_price_history(), and was able to make it work. The bad news is that once I fixed that, I found other bugs in moto (like in their request_spot_instance() and describe_spot_instance()), which I'm still looking into. It's been a slow slog...

@tardyp
Member

tardyp commented Aug 27, 2020

@tonyhutter thanks for the update, much appreciated!

@spulec

spulec commented Sep 4, 2020

Let me know if I can help with getting this added to Moto or with the bugs in the other endpoints.

(long-time Buildbot user. thank you for all of your work)

@tardyp
Member

tardyp commented Sep 5, 2020

@spulec I guess any help is appreciated. You can say here that you are working on this branch, to avoid duplicated work, and then create a separate PR to show your progress.

@tonyhutter
Contributor Author

Looks like there's an existing describe_spot_price_history() moto issue: getmoto/moto#783. I'll post my moto updates in that bug until I get my PR for them ready.

@stale

stale bot commented Dec 25, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stalled label Dec 25, 2020
@Conan-Kudo
Contributor

@tonyhutter Any progress here?

@stale stale bot removed the stalled label Dec 25, 2020
@tonyhutter
Contributor Author

I've been working on some higher priority projects for the last couple months and unfortunately this has fallen by the wayside. It's probably going to be at least a month or more before I can look at it again, but I do plan to get back to it eventually.


@hixio-mh hixio-mh left a comment


@tonyhutter
Contributor Author

Just to summarize where this PR is at:

  1. This patch works fine and fixes the issue. I even tested it on the 3.0.x branch where the problem still exists and can confirm that this patch fixes it.
  2. I was asked in this PR to add a test case.
  3. The test case needs to call describe_spot_price_history() but moto hasn't implemented that function yet (it's currently just a stub function).
  4. I started implementing describe_spot_price_history() in moto, and got it to work and return some mock data, but then hit further problems in the test case where the instance wasn't starting.
  5. I spent a ton of time messing with moto trying to figure out why the instance wasn't starting, but no dice. At this point I'm done banging my head against the wall trying to get the moto stuff to work.

I've rebased this PR on master so if you want to pull it in without the test case, great. If not, no big deal, I'll just include this PR's patch in our project's buildbot repo, and apply it manually. I'd recommend you do pull it though since it's a fairly low-risk patch and fixes a real issue.

@Conan-Kudo
Contributor

@tonyhutter Do you have your work on moto where the problem exists available somewhere? Maybe someone here can help with that.

@tonyhutter
Contributor Author

tonyhutter and others added 2 commits May 13, 2021 20:49
I frequently see this error when starting up buildbot:

    An error occurred (InvalidSpotInstanceRequestID.NotFound) when
    calling the DescribeSpotInstanceRequests operation: The spot
    instance request ID 'sir-abcd1234' does not exist

After the error, I'll see "zombie" instances running with no tags in
AWS.  This is caused by EC2LatentWorker._wait_for_request()
calling describe_spot_instance_requests() before the request is ready.
I noticed it can sometimes take a second or so for the request to show
up.

This patch waits for describe_spot_instance_requests() to return
successfully.
@p12tic
Member

p12tic commented May 13, 2021

@tonyhutter I've hacked the missing support into moto as part of the Buildbot tests. As part of that I had to change the exception handling to the following:

except ClientError as e:
    if 'InvalidSpotInstanceRequestID.NotFound' in str(e):
    <...>

That is, we catch botocore.client.ClientError instead of self.ec2.meta.errors.InvalidSpotInstanceRequestID.NotFound.

Could you please check if this still works with real ec2? If yes, then this PR is ready for merge, thanks a lot for the work you put in.

@tonyhutter
Contributor Author

@p12tic thanks, I'll give that a test

@tonyhutter
Contributor Author

@p12tic I just tested and verified that your code works. Do you want me to make the change in my PR to:

            except ClientError as e:
                if 'InvalidSpotInstanceRequestID.NotFound' in str(e):
                    requests = None
                else:
                    raise

... or just leave my PR as-is?

@p12tic
Member

p12tic commented May 15, 2021

@tonyhutter Actually I've already pushed this change to your branch, I just needed a confirmation that it works and will merge the PR soon. Thanks a lot!

@p12tic p12tic merged commit 1c4f045 into buildbot:master May 15, 2021
@tonyhutter
Contributor Author

@p12tic thanks 👍


6 participants