Incorrect containerID detection in apm-java-agent on GKE Autopilot cluster #3510

Closed
b2ronn opened this issue Jan 30, 2024 · 15 comments
Labels
agent-java community Issues and PRs created by the community

Comments

@b2ronn

b2ronn commented Jan 30, 2024

Description:
The apm-java-agent seems to incorrectly determine the containerID when the application is running on a GKE Autopilot cluster. This leads to discrepancies between the actual containerID (and other related identifiers like pod name/pod UID) and those collected via elastic-agent (https://www.elastic.co/blog/elastic-observe-gke-autopilot-clusters).

Steps to Reproduce:
Deploy an application utilizing the apm-java-agent on a GKE Autopilot cluster.
Check /proc/self/cgroup and /proc/self/mountinfo.

root@petclinic-ecs-674d54744b-7vbpf:/# cat /proc/self/cgroup
0::/

root@petclinic-ecs-674d54744b-7vbpf:/# cat /proc/self/mountinfo
7253 5454 0:767 / / rw,relatime master:1694 - overlay overlay rw,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.gcfs/snapshotter/snapshots/723/fs:/var/lib/containerd/io.containerd.snapshotter.v1.gcfs/snapshotter/snapshots/722/fs:/var/lib/containerd/io.containerd.snapshotter.v1.gcfs/snapshotter/snapshots/721/fs:/var/lib/containerd/io.containerd.snapshotter.v1.gcfs/snapshotter/snapshots/720/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.gcfs/snapshotter/snapshots/741/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.gcfs/snapshotter/snapshots/741/work
7254 7253 0:769 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
7255 7253 0:777 / /dev rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
7256 7255 0:778 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=666
7257 7255 0:703 / /dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw
7258 7253 0:763 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs ro
7259 7258 0:25 / /sys/fs/cgroup ro,nosuid,nodev,noexec,relatime - cgroup2 cgroup rw
7260 7253 8:1 /var/lib/kubelet/pods/2273539b-7ec7-4467-a054-2d3281d70f50/etc-hosts /etc/hosts rw,relatime - ext4 /dev/sda1 rw,commit=30
7261 7255 8:1 /var/lib/kubelet/pods/2273539b-7ec7-4467-a054-2d3281d70f50/containers/petclinic-ecs/371ec6a4 /dev/termination-log rw,relatime - ext4 /dev/sda1 rw,commit=30
7262 7253 8:1 /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/1e4ea4f01f1ccb5d8cd20fc0b818c3e308c99df78efdaa6de2c283f013033a0d/hostname /etc/hostname rw,nosuid,nodev,relatime - ext4 /dev/sda1 rw,commit=30
7263 7253 8:1 /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/1e4ea4f01f1ccb5d8cd20fc0b818c3e308c99df78efdaa6de2c283f013033a0d/resolv.conf /etc/resolv.conf rw,nosuid,nodev,relatime - ext4 /dev/sda1 rw,commit=30
7264 7255 0:613 / /dev/shm rw,nosuid,nodev,noexec,relatime - tmpfs shm rw,size=65536k
7265 7253 8:1 /var/lib/kubelet/pods/2273539b-7ec7-4467-a054-2d3281d70f50/volumes/kubernetes.io~empty-dir/elastic-apm-agent /elastic/apm/agent rw,relatime - ext4 /dev/sda1 rw,commit=30
7266 7253 0:612 / /run/secrets/kubernetes.io/serviceaccount ro,relatime - tmpfs tmpfs rw,size=2097152k
5455 7254 0:769 /bus /proc/bus ro,nosuid,nodev,noexec,relatime - proc proc rw
5456 7254 0:769 /fs /proc/fs ro,nosuid,nodev,noexec,relatime - proc proc rw
5457 7254 0:769 /irq /proc/irq ro,nosuid,nodev,noexec,relatime - proc proc rw
5458 7254 0:769 /sys /proc/sys ro,nosuid,nodev,noexec,relatime - proc proc rw
5459 7254 0:769 /sysrq-trigger /proc/sysrq-trigger ro,nosuid,nodev,noexec,relatime - proc proc rw
5460 7254 0:779 / /proc/acpi ro,relatime - tmpfs tmpfs ro
5461 7254 0:777 /null /proc/kcore rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
5462 7254 0:777 /null /proc/keys rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
5463 7254 0:777 /null /proc/timer_list rw,nosuid - tmpfs tmpfs rw,size=65536k,mode=755
5464 7254 0:780 / /proc/scsi ro,relatime - tmpfs tmpfs ro
5465 7258 0:781 / /sys/firmware ro,relatime - tmpfs tmpfs ro
root@petclinic-ecs-674d54744b-7vbpf:/#

and the pod status:

status:
  initContainerStatuses:
    - name: elastic-java-agent
      state:
        terminated:
          exitCode: 0
          reason: Completed
          startedAt: '2024-01-30T15:31:38Z'
          finishedAt: '2024-01-30T15:31:38Z'
          containerID: >-
            containerd://47948ded77336c7b2bf16cf5976834ddc8b42bde7adec06a9b700b78dc917f2e
      lastState: {}
      ready: true
      restartCount: 0
      image: docker.elastic.co/observability/apm-agent-java:latest
      imageID: >-
        docker.elastic.co/observability/apm-agent-java@sha256:fbe7c86ef814626ba52b5efed4a7101d12eef86ad05b9108b39c23f32eadd6d6
      containerID: >-
        containerd://47948ded77336c7b2bf16cf5976834ddc8b42bde7adec06a9b700b78dc917f2e
  containerStatuses:
    - name: petclinic-ecs
      state:
        running:
          startedAt: '2024-01-30T15:31:40Z'
      lastState: {}
      ready: true
      restartCount: 0
      image: ghcr.io/pavolloffay/spring-petclinic:latest
      imageID: >-
        ghcr.io/pavolloffay/spring-petclinic@sha256:83954b8b893bc010071ffc82db60262dd4b8d1b410f29174abf0926e7c27de4e
      containerID: >-
        containerd://c9e786202e41b88d91a5ef1997cf6a0751aa3c4b3d3fcbcc4cc7e7418402b787
      started: true
  qosClass: Guaranteed

Compare the obtained containerID with the one reported in the status.

Example Deployment:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: petclinic-ecs
spec:
  replicas: 1
  selector:
    matchLabels:
      name: petclinic-ecs
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: petclinic-ecs
    spec:
      volumes:
        - name: elastic-apm-agent
          emptyDir: {}
      initContainers:
        - name: elastic-java-agent
          image: docker.elastic.co/observability/apm-agent-java:latest
          command:
            - cp
            - '-v'
            - /usr/agent/elastic-apm-agent.jar
            - /elastic/apm/agent
          volumeMounts:
            - name: elastic-apm-agent
              mountPath: /elastic/apm/agent
      containers:
        - name: petclinic-ecs
          image: ghcr.io/pavolloffay/spring-petclinic:latest
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          env:
            - name: KUBERNETES_POD_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.podIP
            - name: KUBERNETES_POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: KUBERNETES_POD_UID
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.uid
            - name: KUBERNETES_NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: KUBERNETES_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: ELASTIC_APM_SERVICE_NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.uid
            - name: ELASTIC_APM_SERVER_URL
              value: https://APMSERVER
            - name: ELASTIC_APM_SERVICE_NAME
              value: petclinic-ecs
            - name: ELASTIC_APM_APPLICATION_PACKAGES
              value: org.springframework.samples.petclinic
            - name: ELASTIC_APM_ENVIRONMENT
              value: dev
            - name: ELASTIC_APM_LOG_ECS_REFORMATTING
              value: OVERRIDE
            - name: JAVA_TOOL_OPTIONS
              value: '-javaagent:/elastic/apm/agent/elastic-apm-agent.jar'
            - name: ELASTIC_APM_GLOBAL_LABELS
              value: some=labels
          volumeMounts:
            - name: elastic-apm-agent
              mountPath: /elastic/apm/agent
```

PS: It would also be good if the agent could determine the pod name (for example from /etc/hostname), pod UID, and node name from inside the container, without environment variables.

github-actions bot added the agent-java, community, and triage labels Jan 30, 2024
@SylvainJuge
Member

Thanks for reporting this @b2ronn, we made a few changes in 1.44.0 that should have improved that, but maybe there is a regression.

As the docker image ID of the agent is fbe7c86ef814626ba52b5efed4a7101d12eef86ad05b9108b39c23f32eadd6d6, this matches the latest version which means the agent version is 1.46.0.

Can you please provide the following information:

  • What are the values currently reported by the agent for container.id, kubernetes.pod.uid and host.hostname stored in Elasticsearch documents?
  • What is the content of the /etc/hostname file?

Here, if I'm not mistaken, you would have expected the agent to report container.id = c9e786202e41b88d91a5ef1997cf6a0751aa3c4b3d3fcbcc4cc7e7418402b787 to be consistent with the pod status.
The probable issue is that the agent only parses the content of /proc/self/cgroup and /proc/self/mountinfo, and neither of them contains this ID, so we can't guess the correct container ID.
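
To illustrate why parsing cannot recover the expected ID, here is a minimal sketch of a hex-ID heuristic applied to two of the mountinfo lines from the report. It is not the agent's actual implementation; the `HEX_ID` pattern and the `candidate_container_ids` helper are hypothetical names chosen for this example.

```python
import re

# Hypothetical heuristic: treat any standalone run of 64 hex characters
# as a candidate container ID. The real agent logic may differ.
HEX_ID = re.compile(r"(?<![0-9a-f])[0-9a-f]{64}(?![0-9a-f])")


def candidate_container_ids(mountinfo_text):
    """Return the distinct 64-hex-character strings, in order of appearance."""
    seen = []
    for match in HEX_ID.finditer(mountinfo_text):
        if match.group(0) not in seen:
            seen.append(match.group(0))
    return seen


# Two lines copied from the /proc/self/mountinfo output in the report.
SAMPLE = (
    "7260 7253 8:1 /var/lib/kubelet/pods/2273539b-7ec7-4467-a054-2d3281d70f50/etc-hosts"
    " /etc/hosts rw,relatime - ext4 /dev/sda1 rw,commit=30\n"
    "7262 7253 8:1 /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/"
    "1e4ea4f01f1ccb5d8cd20fc0b818c3e308c99df78efdaa6de2c283f013033a0d/hostname"
    " /etc/hostname rw,nosuid,nodev,relatime - ext4 /dev/sda1 rw,commit=30\n"
)

# The ID reported by the pod status for the petclinic-ecs container.
EXPECTED = "c9e786202e41b88d91a5ef1997cf6a0751aa3c4b3d3fcbcc4cc7e7418402b787"

ids = candidate_container_ids(SAMPLE)
print(ids)              # only the ID under .../sandboxes/ is found
print(EXPECTED in ids)  # the ID from the pod status is absent
```

The only candidate such a heuristic can find in this data is the one under the `sandboxes` directory, which is not the ID listed in the pod status.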

@b2ronn
Author

b2ronn commented Jan 31, 2024

Yes, expecting to see container.id=c9e786202e41b88d91a5ef1997cf6a0751aa3c4b3d3fcbcc4cc7e7418402b787.
Additionally, in /proc/self/mountinfo there is pod.uid (in my example it is 2273539b-7ec7-4467-a054-2d3281d70f50) and container.name (petclinic-ecs).
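
As a rough sketch of that idea, the pod UID and container name could be pulled out of the kubelet mount paths like this. The `/var/lib/kubelet/pods/<uid>/containers/<name>/...` layout is taken from the output above and is an implementation detail rather than a stable interface; `POD_PATH` and `pod_info_from_mountinfo` are hypothetical names.

```python
import re

# Assumed path layout, based only on the mountinfo output in this issue.
POD_PATH = re.compile(
    r"/var/lib/kubelet/pods/"
    r"(?P<uid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
    r"(?:/containers/(?P<container>[^/\s]+)/)?"
)


def pod_info_from_mountinfo(text):
    """Return (pod_uid, container_name); either may be None if not found."""
    uid = None
    container = None
    for match in POD_PATH.finditer(text):
        uid = uid or match.group("uid")
        container = container or match.group("container")
    return uid, container


# Two lines copied from the /proc/self/mountinfo output in the report.
SAMPLE = (
    "7260 7253 8:1 /var/lib/kubelet/pods/2273539b-7ec7-4467-a054-2d3281d70f50/etc-hosts"
    " /etc/hosts rw,relatime - ext4 /dev/sda1 rw,commit=30\n"
    "7261 7255 8:1 /var/lib/kubelet/pods/2273539b-7ec7-4467-a054-2d3281d70f50/containers/"
    "petclinic-ecs/371ec6a4 /dev/termination-log rw,relatime - ext4 /dev/sda1 rw,commit=30\n"
)

print(pod_info_from_mountinfo(SAMPLE))
```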

pod.name (if not passed through environment variables) can probably be obtained from:

#cat /etc/hostname
petclinic-ecs-674d54744b-7vbpf

and from the Elasticsearch document:

cloud.instance.name: gk3-epm-iass-elastic-europe-we-pool-4-440a58a0-o5aa
host.name: gk3-epm-iass-elastic-europe-we-pool-4-440a58a0-o5aa
kubernetes.node.name: gk3-epm-iass-elastic-europe-we-pool-4-440a58a0-o5aa
kubernetes.pod.name: petclinic-ecs-674d54744b-7vbpf
kubernetes.pod.uid: 2273539b-7ec7-4467-a054-2d3281d70f50
container.id: 1e4ea4f01f1ccb5d8cd20fc0b818c3e308c99df78efdaa6de2c283f013033a0d

However, Kubernetes data is present because I set them as environment variables.
If KUBERNETES_POD_IP/KUBERNETES_POD_NAME/KUBERNETES_POD_UID/KUBERNETES_NODE_NAME/KUBERNETES_NAMESPACE are not specified, only cloud data will be present in the document.

P.S. All of this was tested on a GKE Autopilot cluster.

@SylvainJuge
Member

The underlying problem here is that there is strictly no stable and standard way to get the container/pod ID from the container itself, and as a consequence we have to rely on implementation details like parsing /proc/self/cgroup or /proc/self/mountinfo and applying heuristics. There are numerous bug reports about this, like kubernetes/kubernetes#50309 and containerd/containerd#8185, that better describe the problem.

For Google (at least with GCP, not sure if it also applies to GKE), there is the metadata endpoint (doc) that can be called with curl http://metadata.google.internal/computeMetadata/v1/?recursive=true, but it does not provide any information about the container ID.
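
For reference, a minimal Python sketch of the same call. The URL is the one quoted above; the `Metadata-Flavor: Google` header is not mentioned in this thread, but the metadata server expects it on requests. The function names are hypothetical, and `fetch_metadata` only works from inside a Google Cloud workload.

```python
import urllib.request

METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/?recursive=true"


def build_metadata_request(url=METADATA_URL):
    """Build the request without sending it."""
    # The metadata server expects this header on every request.
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def fetch_metadata(timeout=2.0):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(build_metadata_request(), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


request = build_metadata_request()
print(request.full_url)
```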

For Kubernetes, the downward API with environment variables can be used to provide some information, and our agents should already support these variables if they are set, which is very probably what you are doing here. The major downside is that it requires explicit changes to the k8s configuration.
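
A small sketch of what consuming those variables looks like. The variable names come from the example Deployment in this issue and the field names from the Elasticsearch document quoted earlier; the mapping and the `kubernetes_metadata` helper are illustrative assumptions, not the agent's code.

```python
import os

# Variable names as set in the example Deployment; the mapping to document
# fields is assumed from the Elasticsearch document quoted in this thread.
ENV_TO_FIELD = {
    "KUBERNETES_POD_NAME": "kubernetes.pod.name",
    "KUBERNETES_POD_UID": "kubernetes.pod.uid",
    "KUBERNETES_NODE_NAME": "kubernetes.node.name",
}


def kubernetes_metadata(environ=None):
    """Return only the fields whose environment variable is set and non-empty."""
    environ = os.environ if environ is None else environ
    return {
        field: environ[name]
        for name, field in ENV_TO_FIELD.items()
        if environ.get(name)
    }


example_env = {
    "KUBERNETES_POD_NAME": "petclinic-ecs-674d54744b-7vbpf",
    "KUBERNETES_POD_UID": "2273539b-7ec7-4467-a054-2d3281d70f50",
}
print(kubernetes_metadata(example_env))
```

When none of the variables are set, the result is empty, which matches the observation that only cloud data is present in that case.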

So I think here we need to answer the following questions:

  • What is the impact on your side? The ID we capture for container.id should at least be stable, but using it for correlation won't be possible (for example if you have logs ingested with the other container ID).
  • What does the 1e4ea4f01f1ccb5d8cd20fc0b818c3e308c99df78efdaa6de2c283f013033a0d ID that we capture actually correspond to?
  • Would it be better here to not capture a container ID rather than a potentially invalid one?

@b2ronn
Author

b2ronn commented Feb 1, 2024

  1. Yes. At the moment, having the correct containerID is necessary for correlating APM and metrics/logs from these containers.

For example, consider the Kibana APM interface under Services > ANY_SERVICE > Infrastructure > Containers, which we can't modify. For my example it currently performs the query "container.id: 1e4ea4f01f1ccb5d8cd20fc0b818c3e308c99df78efdaa6de2c283f013033a0d". To work around the missing correct containerID, it would need to use "container.id: 1e4ea4f01f1ccb5d8cd20fc0b818c3e308c99df78efdaa6de2c283f013033a0d or kubernetes.pod.name: petclinic-metr-6fcd78f487-2p2wr" instead, because we can reliably pass the Kubernetes pod.name/pod.uid through an environment variable.

  2. It looks like a sandbox container.

  3. Since the containerID is only needed for correlation with other data, an incorrect value is of no more use to us than a missing one, so it doesn't matter which of the two the agent reports.

@trainings

@SylvainJuge what do you think about the workaround suggested above? I think it can help at least in some cases.

@jackshirazi
Contributor

What workaround exactly are you suggesting here?

@trainings

Read the last message from b2ronn, about how to modify group by filter.

@jackshirazi
Contributor

Changing the Kibana query? That sounds like something you could build into your own custom dashboard. But I'll pass this suggestion on to the UI folks and see if there is appetite.

@trainings

Of course, a lot can be done on a custom dashboard.
But if there is a built-in tool that promises correlation, then why not make it work? )

@jackshirazi
Contributor

It's not that helpful because the containerID exists, so the k8s pod part of the or clause doesn't get executed. The ID is wrong, but it exists.

@b2ronn
Author

b2ronn commented Feb 13, 2024

Since the containerID is incorrect, the filter ends up empty.
However, when querying "container.id: 1e4ea4f01f1ccb5d8cd20fc0b818c3e308c99df78efdaa6de2c283f013033a0d or kubernetes.pod.name: petclinic-metr-6fcd78f487-2p2wr", it will attempt to find the container with the incorrect ID and additionally output all containers belonging to kubernetes.pod.name: "petclinic-metr-6fcd78f487-2p2wr".
I am referring to the Observability > APM > Services > SOME_SERVICE > Infrastructure > Containers interface.
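
A minimal sketch of how such a filter string could be assembled, assuming the query syntax used in this thread; `containers_filter` is a hypothetical helper, not part of Kibana or the agent.

```python
def containers_filter(container_id=None, pod_name=None):
    """Join the available identifiers with "or" so either one can match."""
    clauses = []
    if container_id:
        clauses.append(f"container.id: {container_id}")
    if pod_name:
        clauses.append(f"kubernetes.pod.name: {pod_name}")
    return " or ".join(clauses)


query = containers_filter(
    container_id="1e4ea4f01f1ccb5d8cd20fc0b818c3e308c99df78efdaa6de2c283f013033a0d",
    pod_name="petclinic-metr-6fcd78f487-2p2wr",
)
print(query)
```

With both identifiers present, documents matching the pod name are returned even when the container ID matches nothing.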

@trainings

What do you think gentlemen?
Looks like this workaround can help.

@jackshirazi
Contributor

It will be considered in the Kibana enhancement requests.

@trainings

Could you please provide a link to the Kibana ticket?

@jackshirazi
Contributor

Closing as this can't be fixed in the agent. However, the enhancement request elastic/kibana#178209 has been opened for the suggestions to improve the UI.
