Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve detection of resource attributes on ECS #4574

Merged

Conversation

felixscheinost
Copy link
Contributor

This improves the detection of resource attributes on ECS by fetching ECS metadata from ECS_CONTAINER_METADATA_URI or ECS_CONTAINER_METADATA_URI_V4.

Previously only CONTAINER_NAME and CONTAINER_ID id were set.

Now we set:

  • CONTAINER_ID
  • CONTAINER_NAME
  • AWS_ECS_CONTAINER_ARN
  • CONTAINER_IMAGE_NAME
  • CONTAINER_IMAGE_TAG
  • aws.ecs.container.image.id
  • AWS_LOG_GROUP_ARNS
  • AWS_LOG_GROUP_NAMES
  • AWS_LOG_STREAM_NAMES
  • AWS_ECS_TASK_ARN
  • AWS_ECS_TASK_FAMILY
  • AWS_ECS_TASK_REVISION

Especially AWS_LOG_GROUP_ARNS is important so that connection of traces to logs works OOTB on X-Ray.

@felixscheinost felixscheinost requested a review from a user June 30, 2022 16:14
@felixscheinost felixscheinost force-pushed the feature/improve-ecs-resource branch 2 times, most recently from e9ebf1e to 46fbf18 Compare June 30, 2022 16:21
@codecov
Copy link

codecov bot commented Jun 30, 2022

Codecov Report

Base: 90.04% // Head: 90.60% // Increases project coverage by +0.56% 🎉

Coverage data is based on head (1fc24b4) compared to base (0860b38).
Patch has no changes to coverable lines.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4574      +/-   ##
============================================
+ Coverage     90.04%   90.60%   +0.56%     
+ Complexity     5057     4906     -151     
============================================
  Files           582      567      -15     
  Lines         15580    14728     -852     
  Branches       1495     1414      -81     
============================================
- Hits          14029    13345     -684     
+ Misses         1096      959     -137     
+ Partials        455      424      -31     
Impacted Files Coverage Δ
...java/io/opentelemetry/sdk/trace/data/SpanData.java 75.00% <0.00%> (-25.00%) ⬇️
...esting/context/SettableContextStorageProvider.java 37.93% <0.00%> (-24.14%) ⬇️
...a/io/opentelemetry/sdk/trace/internal/JcTools.java 28.57% <0.00%> (-4.77%) ⬇️
...ernal/aggregator/ExplicitBucketHistogramUtils.java 97.29% <0.00%> (-2.71%) ⬇️
.../opentelemetry/sdk/internal/ComponentRegistry.java 90.90% <0.00%> (-2.43%) ⬇️
...o/opentelemetry/api/internal/ReadOnlyArrayMap.java 91.11% <0.00%> (-2.23%) ⬇️
...opentelemetry/opentracingshim/SpanBuilderShim.java 77.89% <0.00%> (-0.68%) ⬇️
...n/java/io/opentelemetry/sdk/logs/LogProcessor.java 85.71% <0.00%> (ø)
.../java/io/opentelemetry/sdk/logs/SdkLogEmitter.java 100.00% <0.00%> (ø)
...va/io/opentelemetry/sdk/logs/NoopLogProcessor.java 100.00% <0.00%> (ø)
... and 160 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Contributor

@anuraaga anuraaga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This generally looks good, @willarmiros can you also give it a check?


AttributesBuilder attrBuilders = Attributes.builder();
static void parseUrl(SimpleHttpClient httpClient, String url, AttributesBuilder attrBuilders) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static void parseUrl(SimpleHttpClient httpClient, String url, AttributesBuilder attrBuilders) {
static void fetchMetadata(SimpleHttpClient httpClient, String url, AttributesBuilder attrBuilders) {

This improves the detection of resource attributes on ECS by fetching ECS metadata from `ECS_CONTAINER_METADATA_URI` or `ECS_CONTAINER_METADATA_URI_V4`.

Previously only `CONTAINER_NAME` and `CONTAINER_ID` id were set.

Now we set:

- CONTAINER_ID
- CONTAINER_NAME
- AWS_ECS_CONTAINER_ARN
- CONTAINER_IMAGE_NAME
- CONTAINER_IMAGE_TAG
- aws.ecs.container.image.id
- AWS_LOG_GROUP_ARNS
- AWS_LOG_GROUP_NAMES
- AWS_LOG_STREAM_NAMES
- AWS_ECS_TASK_ARN
- AWS_ECS_TASK_FAMILY
- AWS_ECS_TASK_REVISION

Especially AWS_LOG_GROUP_ARNS is important so that connection of traces to logs works OOTB on X-Ray.
@willarmiros
Copy link

Thanks so much for this improvement! I will have someone from AWS take a look at this shortly

@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jul 20, 2022
@felixscheinost
Copy link
Contributor Author

Bump: Please don't close this PR automatically.

@github-actions github-actions bot removed the Stale label Jul 27, 2022
@jkwatson
Copy link
Contributor

Thanks so much for this improvement! I will have someone from AWS take a look at this shortly

@willarmiros any update here?

@willarmiros
Copy link

Apologies for the delay here, we will be having someone review this shortly!

@mmanciop
Copy link
Contributor

mmanciop commented Aug 9, 2022

@felixscheinost i missed out on this PR and worked on #4670. Your PR is effectively a superset of mine, with the exception of AWS_ECS_LAUNCHTYPE and AWS_LOGS_STREAM_ARNS and a couple small bugs in the generation of the AWS Log Group ARN. I prepared a fix for your PR in https://github.com/lumigo-io/opentelemetry-java/tree/pr-for-felix. @willarmiros is going to have a look at these PRs in the next few days. Would you be interested in pulling into your PR my changes? :-)

@felixscheinost
Copy link
Contributor Author

felixscheinost commented Aug 12, 2022

@mmanciop Thanks for rebasing your changes on my PR!

Regarding the change in the log group ARN. That dosn't seem right to me. Have you tested this on AWS?

I just changed this line on my branch, to test this change in isolation, and deployed to our test evironment.

// from
attributesBuilder.put(ResourceAttributes.AWS_LOG_GROUP_ARNS, "arn:aws:logs:" + region + ":" + account + ":log-group:" + logGroupName);
// to
attributesBuilder.put(ResourceAttributes.AWS_LOG_GROUP_ARNS, "arn:aws:logs:" + region + ":" + account + ":log-group:/aws" + logGroupName + ":*");

With this change in place I can no longer see the logs for a trace in CloudWatch.

If I view a trace, for which I previously could view logs, and click on "Resources" for "Cloudwatch logs" I see the following:

[
  {
    "log_group": "*",
    "arn": "arn:aws:logs:eu-central-1:<account>:log-group:/aws<the name of my log group>:*"
  }
]

I also tried adding another separting / between /aws and logGroupName but that also doesn't work. In CloudWatch I then see:

[
  {
    "log_group": "*",
    "arn": "arn:aws:logs:eu-central-1:<account>:log-group:/aws/<the name of my log group>:*"
  }
]

But that is also wrong as my log group is just called <the name of my log group> and not /aws/<the name of my log group>.

@mmanciop
Copy link
Contributor

@mmanciop Thanks for rebasing your changes on my PR!

Regarding the change in the log group ARN. That dosn't seem right to me. Have you tested this on AWS?

I just changed this line on my branch, to test this change in isolation, and deployed to our test evironment.

// from
attributesBuilder.put(ResourceAttributes.AWS_LOG_GROUP_ARNS, "arn:aws:logs:" + region + ":" + account + ":log-group:" + logGroupName);
// to
attributesBuilder.put(ResourceAttributes.AWS_LOG_GROUP_ARNS, "arn:aws:logs:" + region + ":" + account + ":log-group:/aws" + logGroupName + ":*");

With this change in place I can no longer see the logs for a trace in CloudWatch.

If I view a trace, for which I previously could view logs, and click on "Resources" for "Cloudwatch logs" I see the following:

[
  {
    "log_group": "*",
    "arn": "arn:aws:logs:eu-central-1:<account>:log-group:/aws<the name of my log group>:*"
  }
]

I also tried adding another separting / between /aws and logGroupName but that also doesn't work. In CloudWatch I then see:

[
  {
    "log_group": "*",
    "arn": "arn:aws:logs:eu-central-1:<account>:log-group:/aws/<the name of my log group>:*"
  }
]

But that is also wrong as my log group is just called <the name of my log group> and not /aws/<the name of my log group>.

I compared the ARNs with the output of aws logs describe-log-groups and aws logs describe-log-streams. At first, when I started investigating the change, I was as surprised as you. @willarmiros do you concur with these ARN values?

@felixscheinost
Copy link
Contributor Author

felixscheinost commented Aug 12, 2022

@mmanciop When I do describe-log-groups on my log groups I get an ARN without /aws.

Only when the log group name contains /aws, like for lambdas, does the ARN also contain /aws.

@mmanciop
Copy link
Contributor

@mmanciop When I do describe-log-groups on my log groups I get an ARN without /aws.

Only when the log group name contains /aws, like for lambdas, does the ARN also contain /aws.

I tried it on ECS, but this bears more validation. Brb :-)

@mmanciop
Copy link
Contributor

So, what I see in the ECS console when deploying a simple CDK + Amazon ECS project is this:

awslogs-group | /aws/batch/job
-- | --
awslogs-region | eu-central-1
awslogs-stream-prefix | CrawlerJobDefinition4D5-55fb222b188f18e

The log-group is called /aws/batch/job in CloudWatch and the ARN is arn:aws:logs:eu-central-1:170906415945:log-group:/aws/batch/job:*.

The AWS CLI reports:

% aws logs describe-log-groups | jq '.logGroups[] | select(.logGroupName == "/aws/batch/job")'
{
  "logGroupName": "/aws/batch/job",
  "creationTime": 1658054662260,
  "metricFilterCount": 0,
  "arn": "arn:aws:logs:eu-central-1:170906415945:log-group:/aws/batch/job:*",
  "storedBytes": 141110571
}

So, yes, we should not add programmatically /aws. The :* at the end of the ARN for the log-group, however, seem pretty legit. Do you agree @felixscheinost ?

@mmanciop
Copy link
Contributor

@felixscheinost I pushed a commit to remove the spurious /aws prefix from the ARNs. Thanks for catching that :-)

@felixscheinost
Copy link
Contributor Author

The :* at the end of the ARN for the log-group, however, seem pretty legit.

Yeah, I think so. It seems like it does work without it but both the CLI and aws.amazon.com display log groups with that trailing :*.

Okay, I will manually test the other changes from your commit after the weekend and if everything looks fine (which I assume it will), pull them into this PR.

Is it okay for you if I squash your two commits into one and push it on top of this branch?

@mmanciop
Copy link
Contributor

Is it okay for you if I squash your two commits into one and push it on top of this branch?

Yup, I think in the end it is best to squash the entire PR on merge anyhow

@felixscheinost
Copy link
Contributor Author

Hi @mmanciop, unfortunately even the trailing :* seems to break showing the logs for a trace.

I cherry picked all your changes, deployed and tried showing the logs for a trace but didn't see any. As sometimes the logs take a while to show up I waited and tried again but still no logs.

Then I removed the trailing :* from the log group ARNs, deployed again and now I can see the logs again.

The output for the trace under "Resources" -> "CloudWatch logs" is:

[
  {
    "log_group": "*",
    "arn": "arn:aws:logs:eu-central-1:<account number>:log-group:<log group name>:*"
  }
]

vs this:

[
  {
    "log_group": "<log group name>",
    "arn": "arn:aws:logs:eu-central-1:<account number>:log-group:<log group name>"
  }
]

So it seems CloudWatch is very peculiar about how the log group ARNs are formatted. Maybe someone from AWS can verify this.

@mmanciop
Copy link
Contributor

mmanciop commented Aug 17, 2022

I cherry picked all your changes, deployed and tried showing the logs for a trace but didn't see any. As sometimes the logs take a while to show up I waited and tried again but still no logs.

Then I removed the trailing :* from the log group ARNs, deployed again and now I can see the logs again.

@felixscheinost how exactly are you using these ARNs? With which tool? Because the issue could be there too.

Besides, the spec clearly states with the example that the :* must be there: https://opentelemetry.io/docs/reference/specification/resource/semantic_conventions/cloud_provider/aws/logs/

@felixscheinost
Copy link
Contributor Author

I am using CloudWatch traces on the official AWS web console.

@felixscheinost
Copy link
Contributor Author

Besides, the spec clearly states with the example that the :* must be there: https://opentelemetry.io/docs/reference/specification/resource/semantic_conventions/cloud_provider/aws/logs/

These are merely "examples".

This links also references the official AWS documentation, which states:

Both of the following are used. The second one, with the * at the end, is what is returned by the describe-log-groups CLI command and the DescribeLogGroups API.
arn:aws:logs:region:account-id:log-group:log_group_name
arn:aws:logs:region:account-id:log-group:log_group_name:*

I think in the end what works is what's correct. 3rd party tools which parse aws.log.group.arns can be adapted. AWS CloudWatch can hardly be adapted.

@mmanciop
Copy link
Contributor

mmanciop commented Aug 17, 2022

I suppose @willarmiros, @Aneurysm9 should be making the call and, if :* has to go, update the examples in the spec. However, considering that in CloudWatch actually is showing even in the console ARNs with the :*, I think it is a bug that X-Ray should fix.

Screenshot 2022-08-17 at 13 20

@willarmiros
Copy link

willarmiros commented Aug 19, 2022

Sorry for being late here: so from my reading there were 2 issues:

  1. /aws was being pre-pended to log groups in the resource detector, but that was fixed to no longer do that (which is the correct fix)
  2. There is a question of whether to include the trailing :* in the log group ARN when recording it on the trace. It seems like the expected behavior is "it doesn't matter" because both such ARN formats are valid, however in practice only the ARN without the trailing :* works. I have just found the bug on our end here: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/8a821bb7829a23924edd2e23b22c6751277ab2ea/exporter/awsxrayexporter/internal/translator/aws.go#L272-L279

I will submit a fix for the bug so that both ARN formats work. The spec should stay as-is because the examples are still valid formats, just our parsing code is wrong. However, for now it would be best if the resource detectors could record the ARNs without the trailing :* so that everything still works while we wait for the fix to land.

assertThat(attributes)
.containsOnly(
entry(ResourceAttributes.CLOUD_PROVIDER, "aws"),
entry(ResourceAttributes.CLOUD_PLATFORM, "aws_ecs"),
entry(ResourceAttributes.CONTAINER_NAME, InetAddress.getLocalHost().getHostName()));
entry(ResourceAttributes.CONTAINER_NAME, "ecs-curltest-24-curl-cca48e8dcadd97805600"),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change "24" to "26" for consistency.

{
"DockerId": "ea32192c8553fbff06c9340478a2ff089b2bb5646fb718b4ee206641c9086d66",
"Name": "curl",
"DockerName": "ecs-curltest-24-curl-cca48e8dcadd97805600",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change "24" to "26" for consistency.

"com.amazonaws.ecs.container-name": "curl",
"com.amazonaws.ecs.task-arn": "arn:aws:ecs:us-west-2:111122223333:task/default/8f03e41243824aea923aca126495f665",
"com.amazonaws.ecs.task-definition-family": "curltest",
"com.amazonaws.ecs.task-definition-version": "24"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be "26" to match ecs-task-metadata-v4.json

Copy link

@jj22ee jj22ee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!
Just requested a small change to keep the task-definition-version number consistent in the V4 files.

@willarmiros
Copy link

Hi @mmanciop - I have opened the PR to address the :* issue on the X-Ray exporter side, and @jj22ee has left a few comments but approved. Would you mind addressing the comments so we can then merge?

@mmanciop
Copy link
Contributor

@willarmiros I cannot push to @felixscheinost's branch (I have to push rights for his fork, which he used to open this PR), but I prepared the commit on https://github.com/lumigo-io/opentelemetry-java/tree/pr-for-felix .

felixscheinost and others added 2 commits August 30, 2022 09:42
`CONTAINER_NAME` and `AWS_ECS_TASK_REVISION` should match
This commit adds implementations for the `aws.ecs.launchtype`
and `aws.logs.stream.arns` attributes, as well as fixing
the generation of log group ARNs.
@felixscheinost
Copy link
Contributor Author

I addressed @jj22ee's comments and cherry-picked @mmanciop's additions.

@felixscheinost
Copy link
Contributor Author

@willarmiros

However, for now it would be best if the resource detectors could record the ARNs without the trailing :* so that everything still works while we wait for the fix to land.

I pushed another commit which removes the trailing :* for now.

Both with and without trailing `:*` are valid formats but there is a bug in the OpenTelementry collector which can’t handle the trailing `:*` (for now) (see open-telemetry/opentelemetry-collector-contrib#13702)

So remove addition of the trailing `:*` for now.
@jkwatson
Copy link
Contributor

jkwatson commented Sep 5, 2022

@willarmiros when you think this is ready, please approve and we can move forward. Thanks!

Copy link
Contributor

@jkwatson jkwatson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants