
Concurrent DynamoDB Client Creating Excessive KMS API Calls #2129

Closed
jshlbrd opened this issue May 15, 2023 · 11 comments
Assignees
RanVaknin
Labels
bug This issue is a bug. p2 This is a standard priority issue service-api This issue is due to a problem in a service API, not the SDK implementation.

Comments

@jshlbrd

jshlbrd commented May 15, 2023

Describe the bug

When the DynamoDB client is used concurrently with a customer-managed key it creates a Decrypt API call per goroutine / thread instead of creating only one API call for the client. More detail is described here: #2128.

Expected Behavior

A single DynamoDB client used across any number of goroutines results in 1 Decrypt API call up to every 5 minutes (based on the documentation here).

Current Behavior

The DynamoDB client triggers N Decrypt API calls up to every 5 minutes for a single client, where N scales with the number of goroutines. For example, if the client is used in 25 goroutines, then close to 25 Decrypt calls are made.

Reproduction Steps

#2128 (comment)
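
For context, a minimal sketch of the shape of the repro described in #2128 (the table name and key below match the values that appear in the CloudTrail and request logs later in this thread; the goroutine count and error handling are illustrative):

package main

import (
	"context"
	"sync"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

func main() {
	ctx := context.TODO()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}

	// A single client shared by every goroutine.
	svc := dynamodb.NewFromConfig(cfg)

	var wg sync.WaitGroup
	for i := 0; i < 25; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Illustrative table and key; the exact repro is in #2128.
			_, err := svc.GetItem(ctx, &dynamodb.GetItemInput{
				TableName: aws.String("aws_sdk_go_v2_2128_concurrency"),
				Key: map[string]types.AttributeValue{
					"pk": &types.AttributeValueMemberS{Value: "b"},
				},
			})
			if err != nil {
				panic(err)
			}
		}()
	}
	wg.Wait()
}

Each goroutine issues one GetItem through the same *dynamodb.Client; the excess Decrypt calls show up on the service side in CloudTrail, not in the SDK's own traffic.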

Possible Solution

No response

Additional Information/Context

No response

AWS Go SDK V2 Module Versions Used

github.com/aws/aws-sdk-go-v2 v1.18.0
github.com/aws/aws-sdk-go-v2/config v1.18.25
github.com/aws/aws-sdk-go-v2/credentials v1.13.24
github.com/aws/aws-sdk-go-v2/feature/dynamodb/attributevalue v1.10.25
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.13.3
github.com/aws/aws-sdk-go-v2/internal/configsources v1.1.33
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.4.27
github.com/aws/aws-sdk-go-v2/internal/ini v1.3.34
github.com/aws/aws-sdk-go-v2/service/dynamodb v1.19.7
github.com/aws/aws-sdk-go-v2/service/dynamodbstreams v1.14.11
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.9.11
github.com/aws/aws-sdk-go-v2/service/internal/endpoint-discovery v1.7.27
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.9.27
github.com/aws/aws-sdk-go-v2/service/sso v1.12.10
github.com/aws/aws-sdk-go-v2/service/ssooidc v1.14.10
github.com/aws/aws-sdk-go-v2/service/sts v1.19.0
github.com/aws/aws-sdk-go-v2/service/xray v1.16.11

Compiler and Version used

go1.19.6 linux/arm64

Operating System and version

Debian 11

@jshlbrd jshlbrd added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels May 15, 2023
@RanVaknin
Contributor

RanVaknin commented May 15, 2023

Hi @jshlbrd ,

I see you opened a discussion and also an issue. I will close the discussion in favor of tracking it here.

I'm a bit confused by the problem you are describing. When creating a table in DynamoDB, you can specify which key you want to use for encryption at rest (whether it's AWS managed or a CMK). The encryption happens on the service side, as described in this doc:

When you access an encrypted table, DynamoDB decrypts the table data transparently. You don't have to change any code or applications to use or manage encrypted tables. DynamoDB continues to deliver the same single-digit millisecond latency that you have come to expect, and all DynamoDB queries work seamlessly on your encrypted data.

In other words, the SDK is not aware of (and does not need to care about) encryption at rest for DynamoDB.

To make sure I'm not wrong, I provisioned a DynamoDB table with a CMK per your description:

$ aws dynamodb describe-table --table-name aws_sdk_go_v2_2128

{
    "Table": {
        "AttributeDefinitions": [
            {
                "AttributeName": "pk",
                "AttributeType": "S"
            }
        ],
        "TableName": "aws_sdk_go_v2_2128",
        "KeySchema": [
            {
                "AttributeName": "pk",
                "KeyType": "HASH"
            }
        ],
        "TableStatus": "ACTIVE",
        "CreationDateTime": "2023-05-15T15:46:02.519000-07:00",
        "ProvisionedThroughput": {
            "NumberOfDecreasesToday": 0,
            "ReadCapacityUnits": 1,
            "WriteCapacityUnits": 1
        },
        "TableSizeBytes": 0,
        "ItemCount": 0,
        "TableArn": "arn:aws:dynamodb:us-east-1:REDACTED:table/aws_sdk_go_v2_2128",
        "TableId": "REDACTED",
        "SSEDescription": {
            "Status": "ENABLED",
            "SSEType": "KMS",
            "KMSMasterKeyArn": "arn:aws:kms:us-east-1:REDACTED"
        },
        "TableClassSummary": {
            "TableClass": "STANDARD"
        },
        "DeletionProtectionEnabled": false
    }
}

Using the code you provided, I'm not seeing any network calls to KMS.
You can enable client logs on your config like so:

	ctx := context.TODO()
	cfg, err := config.LoadDefaultConfig(ctx, config.WithClientLogMode(aws.LogResponseWithBody|aws.LogRequestWithBody))
	if err != nil {
		panic(err)
	}

You will see 25 network calls to the DynamoDB service, but no calls made to KMS.

Can you provide some more information about how you noticed the difference? From playing around with CloudTrail I wasn't able to identify any Decrypt calls made to KMS. It would be helpful if you could point me to where you are seeing the Decrypt calls being made so I can get a better understanding of the issue at hand.

Thanks,
Ran~

@RanVaknin RanVaknin self-assigned this May 15, 2023
@RanVaknin RanVaknin added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. p3 This is a minor priority issue and removed needs-triage This issue or PR still needs to be triaged. labels May 15, 2023
@jshlbrd
Author

jshlbrd commented May 16, 2023

Hey @RanVaknin, thanks for taking a look at this! The API calls are not made by the client; they're made by the DynamoDB service -- but the number of KMS calls made by DynamoDB appears to be a direct result of using concurrency with the Go client.

Referencing this documentation may help clarify (this is from the section about the GetItem API call):

DynamoDB caches the plaintext table keys for each principal in memory. If DynamoDB gets a request for the cached table key after five minutes of inactivity, it sends a new request to AWS KMS to decrypt the table key.

Here's an example CloudTrail event that shows the behavior I'm describing (removed and redacted private information):

{
  "eventVersion": "1.08",
  "userIdentity": {
    "type": "AssumedRole",
    "invokedBy": "dynamodb.amazonaws.com"
  },
  "eventTime": "2023-05-16T00:19:11Z",
  "eventSource": "kms.amazonaws.com",
  "eventName": "Decrypt",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "dynamodb.amazonaws.com",
  "userAgent": "dynamodb.amazonaws.com",
  "requestParameters": {
    "encryptionContext": {
      "aws:dynamodb:tableName": "aws_sdk_go_v2_2128_concurrency",
      "aws:dynamodb:subscriberId": `AWS ACCOUNT ID`
    },
    "encryptionAlgorithm": "SYMMETRIC_DEFAULT"
  },
  "responseElements": null,
  "requestID": `REQUEST ID`,
  "eventID": `EVENT ID`,
  "readOnly": true,
  "resources": [
    {
      "accountId": `AWS ACCOUNT ID`,
      "type": "AWS::KMS::Key",
      "ARN": `KMS Key ARN`
    }
  ],
  "eventType": "AwsApiCall",
  "managementEvent": true,
  "recipientAccountId": `AWS ACCOUNT ID`,
  "eventCategory": "Management"
}

I see approximately 25 events like the above for the concurrent client. One thing worth noting, I'm using role assumption and each Decrypt call has a unique access key ID. This is a guess, but is it possible that DynamoDB is treating each goroutine as its own client due to how authentication is handled when using a concurrent client?

@RanVaknin
Contributor

RanVaknin commented May 16, 2023

Hi @jshlbrd ,

Thanks for the additional info.

One thing worth noting, I'm using role assumption and each Decrypt call has a unique access key ID

When you say "unique access key ID", are you referring to the accountId in the CloudTrail event?

How are you assuming the role? Are you reading it from env variables?
I don't have a lot of confidence in this theory, but maybe using an explicit assume role call would solve this issue:

	ctx := context.TODO()
	cfg, err := config.LoadDefaultConfig(ctx, config.WithClientLogMode(aws.LogResponseWithBody|aws.LogRequestWithBody))
	if err != nil {
		panic(err)
	}

	client := sts.NewFromConfig(cfg)
	appCreds := stscreds.NewAssumeRoleProvider(client, "arn:aws:iam::REDACTED:role/role_name")

	cfgCopy := cfg.Copy()
	cfgCopy.Credentials = aws.NewCredentialsCache(appCreds)

	svc := dynamodb.NewFromConfig(cfgCopy)
	
	// ... more code here
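
For completeness, the snippet above assumes roughly these imports (a sketch; the stscreds provider lives in the credentials/stscreds module of aws-sdk-go-v2):

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/credentials/stscreds"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/sts"
)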

When testing this code with request logging enabled, I can examine the request sent by each goroutine, and I notice in the Authorization header that all of the access keys are the same.

It should look something like this:

SDK 2023/05/16 11:40:33 DEBUG Request
POST / HTTP/1.1
Host: dynamodb.us-east-1.amazonaws.com
User-Agent: aws-sdk-go-v2/1.18.0 os/macos lang/go/1.19.1 md/GOOS/darwin md/GOARCH/arm64 api/dynamodb/1.19.7
Content-Length: 69
Accept-Encoding: identity
Amz-Sdk-Invocation-Id: REDACTED
Amz-Sdk-Request: attempt=1; max=3
Authorization: AWS4-HMAC-SHA256 Credential={{Secret Access ID}}/20230516/us-east-1/dynamodb/aws4_request, SignedHeaders=accept-encoding;amz-sdk-invocation-id;amz-sdk-request;content-length;content-type;host;x-amz-date;x-amz-security-token;x-amz-target, Signature=REDACTED
Content-Type: application/x-amz-json-1.0
X-Amz-Date: 20230516T184033Z
X-Amz-Security-Token: REDACTED
X-Amz-Target: DynamoDB_20120810.GetItem

{"Key":{"pk":{"S":"b"}},"TableName":"aws_sdk_go_v2_2128_concurrency"}

This should behave the same as the implicit assume role you are probably doing by injecting env variables, but I thought we could try it anyway.
You should be able to see that all 25 requests contain the same "Credential=" value, pointing to the access key ID produced by the assume-role call.


Another theory I can think of is that because you are sending all of these requests at once, the DynamoDB service is not able to cache effectively, resulting in multiple Decrypt calls. One way I would try to work around it is to send a single GetItem request synchronously, and then send the remaining calls via goroutines.
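
Something like this, as a minimal sketch (reusing svc and ctx from the earlier snippets; whether the warm-up call actually influences the service-side cache is just a guess):

	// Warm-up: one synchronous GetItem before fanning out.
	input := &dynamodb.GetItemInput{
		TableName: aws.String("aws_sdk_go_v2_2128_concurrency"),
		Key: map[string]types.AttributeValue{
			"pk": &types.AttributeValueMemberS{Value: "b"},
		},
	}
	if _, err := svc.GetItem(ctx, input); err != nil {
		panic(err)
	}

	// Then send the remaining calls concurrently.
	var wg sync.WaitGroup
	for i := 0; i < 24; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, err := svc.GetItem(ctx, input); err != nil {
				panic(err)
			}
		}()
	}
	wg.Wait()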


At any rate, this seems like a service-side limitation. If none of these methods work for you, we will have to upstream this to the DynamoDB team internally. I can do this on your behalf, or, if you file a support ticket, you'll get assigned a dedicated person who can help route this request to the service team - they usually have a few more tools than we do to help expedite things. If you choose the latter, just make sure to have them CC me on the ticket using my alias, rvaknin.

Thanks again,
Ran~

@jshlbrd
Author

jshlbrd commented May 16, 2023

Thanks @RanVaknin!

When you say "unique access key ID", are you referring to the accountId in the CloudTrail event?

Not quite; it's the accessKeyId seen in the CloudTrail events, which is unique for each call that originates from the Go client. For example:

{
  "userIdentity": {
    "type": "AssumedRole",
    "principalId": "REDACTED",
    "arn": "REDACTED",
    "accountId": "REDACTED",
    "accessKeyId": "REDACTED",
    "sessionContext": {
      "sessionIssuer": {
        "type": "Role",
        "principalId": "REDACTED",
        "arn": "REDACTED",
        "accountId": "REDACTED",
        "userName": "REDACTED"
      }
    },
    "invokedBy": "dynamodb.amazonaws.com"
  }
}

Based on your suggestions I confirmed the following:

  • Using environment variables for role assumption vs explicitly assuming role doesn't change the behavior
  • Executing a synchronous GetItem call before goroutines doesn't change the behavior
    • I added a 10 second sleep after the synchronous call to let the DynamoDB service cache "catch up" and that made no difference

If you can file a request with the DynamoDB team then that would help. My feeling is that, for some reason, the DynamoDB service does not recognize that the Go client is a single requester when it is used concurrently; it may be treating each goroutine as a unique client, which causes the key caching feature to fail.

@RanVaknin
Contributor

RanVaknin commented May 16, 2023

Hi @jshlbrd,

Thanks for trying to implement the suggestions.

Can you please provide me with 2 separate x-amzn-requestid header values you get back from 2 of your concurrent requests? (you need to enable response logging to see that)

The service team will need that to identify your requests and investigate.

Thanks!
Ran~

@jshlbrd
Author

jshlbrd commented May 16, 2023

@RanVaknin Here are the request IDs from the 25 concurrent requests in my latest run:

X-Amzn-Requestid: 1S8R9EJTU8JUT0TL1L00EFD6TBVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: HONNCUGOU343ONRA4RNU0B1J8VVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 0TUF3HN7UBEK3KNT9I26QLI6I3VV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: HAB7EABTT0VLBQ3V6QKPBO47NBVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 6V1NKOOU0K5NCPLJLR90E3AO1RVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 1472LU9G44L1Q8I45LUOVLTUQBVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 2U23SATL73L7LD4MUESDK75L77VV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: F5FNLEE9TKGIN889KOH00F172RVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 17M54JRIVU56V2G8THG290L6LBVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: OF04FB8S2BRN2UDB6H78IH4D07VV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: ITM8CCEB3KNN0D7N2OBGRSMVPVVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: NP4HUF0ARUMDL2I3HQVDTDJEHFVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 2JJUGCIIAUQ61B43FNCCPEQGPRVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 4CH4KM9SVQON1SMDS65C0UE4RNVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: EUP73Q4RP7ISVC0E318T3DE8KJVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: LBFVN786V0SLO8BF1DAKMCGM27VV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: AIM29AF8V7FMC3RUFEHOAJ6E4RVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 316JO58K4BAKQQNE6FC8QREU2NVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 8I0AOB0GMD0AETAJ1TLQLCCHFJVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 6JKJ0I5HN1CVCKVPNRMP384B0RVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 8N4S80T1SCD4KTRA69IL6Q1Q73VV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: 5Q89B5NMN7SOABCK5NCUK27VUFVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: R86UE8VVV8J4FKCUGK629IHHK3VV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: II3N7D8FKBQH9K52UNF34VG2VVVV4KQNSO5AEMVJF66Q9ASUAAJG
X-Amzn-Requestid: O1PNFDO6GN186A3FCTNLURRPHBVV4KQNSO5AEMVJF66Q9ASUAAJG

@RanVaknin
Contributor

Hi @jshlbrd ,

Thanks for that. Just to triple-check: when you enable request logging, do you see the same principal / AWS access key ID sent on each request? The doc you linked specifies that the principal is what influences the caching mechanism.

Principal should look something like this: ASIA6ZK6KVPCPHI5DVVD

In the log I provided earlier:

SDK 2023/05/16 11:40:33 DEBUG Request
POST / HTTP/1.1
...
...
...
Authorization: AWS4-HMAC-SHA256 Credential={{ PRINCIPAL }}/20230516/us-east-1/dynamodb/aws4_request, 
...
...
...

Is the principal sent here the same with every request? (In my repro code it is the same.) Because if it is not, that's what is throwing DynamoDB off.
If it is the same, then the problem is most likely with the service, as I previously mentioned.

Thanks again for your prompt responses.

All the best,
Ran~

@jshlbrd
Author

jshlbrd commented May 16, 2023

@RanVaknin That's correct, the principal is the same in every request. Thank you!

@RanVaknin
Contributor

Hi @jshlbrd,

I created an internal ticket; now we have to wait for a response. My suggestion would be to use the iterative approach for now so you can save on billing costs.

Additionally, if you see this is taking too long, feel free to reach out to AWS Support; they might be able to expedite this faster than I can.

I will keep you updated when I learn anything new.
Thanks,
Ran~

P88754518

@RanVaknin RanVaknin added service-api This issue is due to a problem in a service API, not the SDK implementation. p2 This is a standard priority issue and removed response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. p3 This is a minor priority issue labels May 16, 2023
@RanVaknin
Contributor

Hi @jshlbrd ,

I finally heard back from the DynamoDB team:

I looked up the 25 request IDs, and those 25 requests were handled by 17 unique hosts, some of them handling more than 1 request.

When you kick off many concurrent requests to DDB, you're using HTTP/1.1, which means that each TCP connection can have 1 ongoing request at a time (usually true for HTTP/1.1 with some exceptions).

In this case it would open many connections to different hosts, each of which may not have the appropriate KMS material in its cache. This would cause each unique host to make its own request to KMS, which in turn causes the increased charges.

This is further explained by this little nuance (in bold) from the doc you shared:

Depending on the data being encrypted, DynamoDB identifies the right data encryption key to encrypt the data. To avoid having DynamoDB call KMS for every DynamoDB operation, the table key is cached for each principal in memory.

Since each goroutine may establish a new connection to a different host, in-memory caching occurs per host.

To avoid the excess Decrypt calls, I'd stick with invoking those operations sequentially rather than concurrently.
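
As a rough sketch of that sequential approach (again reusing svc and ctx, and the same illustrative table and key as above):

	// One GetItem at a time; with keep-alive the client typically
	// reuses a single connection rather than opening one per call.
	for i := 0; i < 25; i++ {
		_, err := svc.GetItem(ctx, &dynamodb.GetItemInput{
			TableName: aws.String("aws_sdk_go_v2_2128_concurrency"),
			Key: map[string]types.AttributeValue{
				"pk": &types.AttributeValueMemberS{Value: "b"},
			},
		})
		if err != nil {
			panic(err)
		}
	}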

Thanks again for your patience.

All the best,
Ran~

@RanVaknin RanVaknin closed this as not planned on Apr 25, 2024

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
