Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(custom-resources): handle Inactive lambda functions #20922

Closed
wants to merge 18 commits into from

Conversation

comcalvi
Copy link
Contributor

@comcalvi comcalvi commented Jun 29, 2022

all lambda functions can become inactive eventually. This will result in invocations failing. This PR adds logic to retry the invocation up to a maximum of ten minutes and some change. This number was chosen because a lambda only runs for 15 minutes, and 10 seemed like a good number.

Closes #20123.


All Submissions:

Adding new Unconventional Dependencies:

  • This PR adds new unconventional dependencies following the process described here

New Features

  • Have you added the new feature to an integration test?
    • Did you use yarn integ to deploy the infrastructure and generate the snapshot (i.e. yarn integ without --dry-run)?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@gitpod-io
Copy link

gitpod-io bot commented Jun 29, 2022

@github-actions github-actions bot added the p2 label Jun 29, 2022
@comcalvi comcalvi added pr-linter/exempt-test The PR linter will not require test changes pr-linter/exempt-integ-test The PR linter will not require integ test changes and removed p2 labels Jun 29, 2022
@comcalvi comcalvi marked this pull request as draft June 29, 2022 20:59
@aws-cdk-automation aws-cdk-automation requested a review from a team June 29, 2022 20:59
@mergify mergify bot added the contribution/core This is a PR that came from AWS. label Jun 29, 2022

log('user function is still being initialized by Lambda, retrying with delay of: ', newDelay);

return setTimeout(invokeUserFunction, newDelay, functionArnEnv, payload, newRetryOptions);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not as readable as a for loop with a sleep.

function sleep(ms: number): Promise<void> {
  return new Promise<void>(ok => window.setTimeout(ok, ms));
}

You can call await sleep(10_000) on that.

Copy link
Contributor

@rix0rrr rix0rrr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still changes requested, getting it out of my queue

@github-actions github-actions bot added bug This issue is a bug. effort/small Small work item – less than a day of effort p1 labels Jul 19, 2022
@comcalvi comcalvi removed pr-linter/exempt-test The PR linter will not require test changes pr-linter/exempt-integ-test The PR linter will not require integ test changes labels Aug 3, 2022
@comcalvi comcalvi marked this pull request as ready for review August 3, 2022 23:42
@@ -13,6 +13,9 @@ export = {
[consts.FRAMEWORK_ON_TIMEOUT_HANDLER_NAME]: onTimeout,
};

const BASE_SLEEP = 10_000;
const MAX_TOTAL_SLEEP = 620_000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I apparently already forgot about our conversation on the duplicate PR because I was about to be like, why this number? 😂

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rereading that, however, made me realize that the ordering on these checks might be somewhat off. If the lambda. is called while it is in inactive or pending state, the invocation will simply fail so a new invocation will need to be called instead of giving time for the current invocation to pass. Basically, I think that invokeUserFunction may need to have the following workflow:

Within a while loop `await invokeFunction` -> `await getFunctionResponse` -> 
    if `Configuration. LastUpdateStatus.failure && Configuration.State === inactive or pending` -> retry
    else if -> `Configuration. LastUpdateStatus.failure` i.e. it's status is active or failed -> get and throw error error (break)
    else -> return payload (break)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above could be a do-while loop so that you don't have to use breaks but obviously the actual implementation is up to you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, regarding this number, do we know in general how long it usually takes a function to return to active? That data might be good in determining this number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, we do not at this time. I should have lambdas that will become inactive in the next week or so (someone from lambda told me offline that this is ~30 days), so I can check once I have an inactive function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original assessment was only partially right. I was under the impression that the Lamba function failed, not errored, but upon further digging, I see that it throws an exception. See outbound.ts for my revised suggestion here.

});

if (!(getFunctionResponse.Configuration?.State === 'Inactive' || getFunctionResponse.Configuration?.State === 'Pending')) {
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this just break the while loop and not the if statement here? If so, this will still continue on to log that the error was thrown and then throw the error even though the lambda has now succeeded (presumably).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh good catch, you're right

// parse function name from arn
// arn:${Partition}:lambda:${Region}:${Account}:function:${FunctionName}
const arn = functionArn.split(':');
const functionName = arn[arn.length - 1];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a circumstance where this lambda might have a version added to the end of it or are our custom resources using unqualified ARNs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user function is under the user's control, so they can create versions as they please. However, getFunction() will happily accept a name with a :version-string at the end as well, so we're safe in either case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue here is that if they do provide us with one that has a version at the end, arn[arn.length - 1] will pick up the version number instead of the function name, since it would be arn:${Partition}:lambda:${Region}:${Account}:function:${FunctionName}:${Version}

while (totalSleep <= MAX_TOTAL_SLEEP) {
// if the user's lambda has become Inactive, we must retry the invocation until Lambda finishes provisioning resources for it.
const getFunctionResponse = await getFunction({
FunctionName: functionName,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we can just provide the ARN here instead of trying to parse per GetFunction Documentation.

Copy link
Contributor Author

@comcalvi comcalvi Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can, but we need the name later anyway for the logs (line 154 in this file with these changes), and imo passing the name is cleaner. If you prefer though we can change this to the ARN

// parse function name from arn
// arn:${Partition}:lambda:${Region}:${Account}:function:${FunctionName}
const arn = functionArn.split(':');
const functionName = arn[arn.length - 1];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue here is that if they do provide us with one that has a version at the end, arn[arn.length - 1] will pick up the version number instead of the function name, since it would be arn:${Partition}:lambda:${Region}:${Account}:function:${FunctionName}:${Version}

@@ -13,6 +13,9 @@ export = {
[consts.FRAMEWORK_ON_TIMEOUT_HANDLER_NAME]: onTimeout,
};

const BASE_SLEEP = 10_000;
const MAX_TOTAL_SLEEP = 620_000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original assessment was only partially right. I was under the impression that the Lamba function failed, not errored, but upon further digging, I see that it throws an exception. See outbound.ts for my revised suggestion here.

Comment on lines +49 to +56
async function defaultGetFunction(req: AWS.Lambda.GetFunctionRequest): Promise<AWS.Lambda.GetFunctionResponse> {
if (!lambda) {
lambda = new AWS.Lambda(awsSdkConfig);
}

return lambda.getFunction(req).promise();
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, according to the documentation, we can run into a situation where new lamba functions also haven't finished provisioning before we try to invoke them so I wonder if the right path forward here is to replace `defaultInvokeFunction with the following (roughly, this also contains notes):

async function defaultInvokeFunction(req: AWS.Lambda.InvocationRequest): Promise<AWS.Lambda.InvocationResponse> {
  if (!lambda) {
    lambda = new AWS.Lambda(awsSdkConfig);
  }

  let status = await defaultGetFunction(req);
  if (status.Configuration?.State !== 'Failed') {
    try {
      // If it is inactive, pending, or failed it will throw instead of returning
      return lambda.invoke(req).promise();
    } catch (e) {
      // We only care about inactive or pending for retries
      while (status.Configuration?.State === 'Inactive' || status.Configuration?.State === 'Pending') {
        // Add wait of some sort in here if we want
        // Since we don't know how long the reprovisioning of resources will take, we shouldn't add in a max time.
        status = await defaultGetFunction(req);
      }
    }
  }
  // Now that it's either active or failed, return the terminal response, error or not
  return lambda.invoke(req).promise();
}

This would mean that we don't want to export defaultGetFunction as getFunction since the scope of its use is only in this file. This edit would require no change (I think) to the logic in invokeUserFunction() in framework.ts. Well, except that there's still the issue with the ARN that I commented on. Maybe we just never take in qualified ARNs, though, so maybe that isn't an issue. On the other hand, you could export getFunction and use that to make sure we're getting the correct function name. It's part of Configuration.

What do you think?

@Naumel
Copy link
Contributor

Naumel commented Sep 12, 2022

@Mergifyio update

@mergify
Copy link
Contributor

mergify bot commented Sep 12, 2022

update

✅ Branch has been successfully updated

@aws-cdk-automation
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: e8f3eac
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@TheRealAmazonKendra
Copy link
Contributor

Closing this as the work has been continued in another PR.

@gshpychka
Copy link
Contributor

@TheRealAmazonKendra which one?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. contribution/core This is a PR that came from AWS. effort/small Small work item – less than a day of effort p1
Projects
None yet
6 participants