Wrong Pod name in argo get command result from CLI #9906

Closed
2 of 3 tasks
xubofei1983 opened this issue Oct 25, 2022 · 12 comments · Fixed by #9995
Labels
area/cli (The `argo` CLI), P2 (Important: all bugs with >=3 thumbs up that aren't P0 or P1, plus any other bugs deemed important), type/bug, type/regression (Regression from previous behavior, a specific type of bug)

Comments

@xubofei1983
Contributor

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

Run the example workflow https://github.com/argoproj/argo-workflows/blob/master/examples/retry-on-error.yaml

The Pod names in the result of argo get are wrong:

argo get retry-on-error-v2pk2 -n workflow
Name: retry-on-error-v2pk2
...

STEP TEMPLATE PODNAME DURATION MESSAGE
✖ retry-on-error-v2pk2 error-container No more retries left
├─⚠ retry-on-error-v2pk2(0) error-container retry-on-error-v2pk2-error-container-2869263017 26s Error (exit code 1): failed to put file: 404 Not Found
├─✖ retry-on-error-v2pk2(1) error-container retry-on-error-v2pk2-error-container-2427568992 4s Error (exit code 3)
└─✖ retry-on-error-v2pk2(2) error-container retry-on-error-v2pk2-error-container-816476283 4s Error (exit code 4)

kubectl get pods -n workflow
NAME READY STATUS RESTARTS AGE
retry-on-error-v2pk2-error-container-1195955417 0/2 Completed 0 6m17s
retry-on-error-v2pk2-error-container-1800096796 0/2 Error 0 5m41s
retry-on-error-v2pk2-error-container-3410203767 0/2 Error 0 5m31s

The UI shows the correct Pod name:
NAME
retry-on-error-v2pk2(0)
ID
retry-on-error-v2pk2-1195955417
POD NAME
retry-on-error-v2pk2-error-container-1195955417
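
For context, the ID and POD NAME above share the same numeric suffix, which suggests both are derived from one 32-bit hash of the node name. A minimal Go sketch of that v2 naming scheme, assuming an FNV-1a hash (the helper below is illustrative, not the controller's actual code):

package main

import (
	"fmt"
	"hash/fnv"
)

// fnvSuffix is an illustrative helper: a 32-bit FNV-1a hash of the node name,
// which appears to be the shared numeric suffix of the node ID and the pod name.
func fnvSuffix(nodeName string) uint32 {
	h := fnv.New32a()
	_, _ = h.Write([]byte(nodeName))
	return h.Sum32()
}

func main() {
	wf, tmpl, node := "retry-on-error-v2pk2", "error-container", "retry-on-error-v2pk2(0)"
	suffix := fnvSuffix(node)
	fmt.Printf("node ID:  %s-%d\n", wf, suffix)          // should match the ID shown above, if the assumption holds
	fmt.Printf("pod name: %s-%s-%d\n", wf, tmpl, suffix) // should match the POD NAME shown above
}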

Version

v3.4.1

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-on-error-
spec:
  entrypoint: error-container
  templates:
  - name: error-container
    retryStrategy:
      limit: "2"
      retryPolicy: "Always"   # Retry on errors AND failures. Also available: "OnFailure" (default), "OnError", and "OnTransientError" (retry only on transient errors such as i/o or TLS handshake timeout. Available after v3.0.0-rc2)
    container:
      image: python
      command: ["python", "-c"]
      # fail with a 80% probability
      args: ["import random; import sys; exit_code = random.choice(range(0, 5)); sys.exit(exit_code)"]

Logs from the workflow controller

Not related

Logs from in your workflow's wait container

Not related

@814HiManny

814HiManny commented Oct 31, 2022

I am having the same issue here. The pod names shown by argo get don't correspond to the actual pod names in the cluster.

@sarabala1979
Member

@JPZ13 @rohankmr414 Can you take a look?

@sarabala1979 added the type/regression and P2 labels on Oct 31, 2022
@ognjen-it

I am having the same issue here 👍🏻

@JPZ13
Member

JPZ13 commented Nov 3, 2022

I'm OOO this week @sarabala1979. How's your capacity @rohankmr414 or @isubasinghe?

@mweibel
Contributor

mweibel commented Nov 4, 2022

I have the same issue. This has been happening since version 3.4.0 (I unfortunately only upgraded this week, directly to 3.4.3, but traced it back to 3.4.0).

The issue only seems to happen when a retry strategy is set; the hello-world.yaml example does not suffer from it.

» k get po
NAME                                             READY   STATUS    RESTARTS   AGE
retry-on-error-khzpg-error-container-550301540   2/2     Running   0          4s

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  annotations:
    workflows.argoproj.io/pod-name-format: v2
  creationTimestamp: "2022-11-04T12:08:12Z"
  generateName: retry-on-error-
  generation: 2
  labels:
    workflows.argoproj.io/phase: Running
  name: retry-on-error-khzpg
  namespace: default
  resourceVersion: "16229"
  uid: 1d4e6dc4-e2be-475c-9c32-f3aaaef1cdf1
spec:
  arguments: {}
  entrypoint: error-container
  templates:
  - container:
      args:
      - import random; import sys; exit_code = random.choice(range(0, 5)); sys.exit(exit_code)
      command:
      - python
      - -c
      image: python
      name: ""
      resources: {}
    inputs: {}
    metadata: {}
    name: error-container
    outputs: {}
    retryStrategy:
      limit: "2"
      retryPolicy: Always
status:
  artifactGCStatus:
    notSpecified: true
  artifactRepositoryRef:
    artifactRepository: {}
    default: true
  finishedAt: null
  nodes:
    retry-on-error-khzpg:
      children:
      - retry-on-error-khzpg-550301540
      displayName: retry-on-error-khzpg
      finishedAt: null
      id: retry-on-error-khzpg
      name: retry-on-error-khzpg
      phase: Running
      progress: 0/1
      startedAt: "2022-11-04T12:08:12Z"
      templateName: error-container
      templateScope: local/retry-on-error-khzpg
      type: Retry
    retry-on-error-khzpg-550301540:
      displayName: retry-on-error-khzpg(0)
      finishedAt: null
      id: retry-on-error-khzpg-550301540
      name: retry-on-error-khzpg(0)
      phase: Pending
      progress: 0/1
      startedAt: "2022-11-04T12:08:12Z"
      templateName: error-container
      templateScope: local/retry-on-error-khzpg
      type: Pod
  phase: Running
  progress: 0/1
  startedAt: "2022-11-04T12:08:12Z"

It might be related to #6712 and #8748, but I'm not sure why it only happens for retry-enabled workflows.

FWIW: this is pretty important for us, since we gather data based on the status of workflows and we can't match them to pods right now.

@mweibel
Contributor

mweibel commented Nov 4, 2022

I believe the retry strategy is relevant because of
https://github.com/argoproj/argo-workflows/blob/master/workflow/controller/operator.go#L1730

And I believe the nodeID in the status gets calculated incorrectly here:
https://github.com/argoproj/argo-workflows/blob/master/workflow/controller/operator.go#L2266

Since I'm not sure what the proper course of action is to fix this, I won't create a PR for it.
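
To illustrate the suspected mismatch, here is a sketch (not the actual controller or CLI code): if the controller and the CLI hash different strings when building the v2 pod name, both produce a well-formed name but with different numeric suffixes, which is exactly the symptom reported above. Both hash inputs below are hypothetical.

package main

import (
	"fmt"
	"hash/fnv"
)

// suffix returns a 32-bit FNV-1a hash of its input, the assumed source of the
// numeric part of a v2 pod name.
func suffix(s string) uint32 {
	h := fnv.New32a()
	_, _ = h.Write([]byte(s))
	return h.Sum32()
}

func main() {
	wf, tmpl := "retry-on-error-khzpg", "error-container"
	controllerInput := "retry-on-error-khzpg(0)"       // hypothetical string hashed by the controller
	cliInput := "retry-on-error-khzpg-error-container" // hypothetical string hashed by the CLI
	fmt.Printf("pod created as:  %s-%s-%d\n", wf, tmpl, suffix(controllerInput))
	fmt.Printf("argo get prints: %s-%s-%d\n", wf, tmpl, suffix(cliInput))
}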

@isubasinghe
Member

isubasinghe commented Nov 4, 2022

@JPZ13 I should be able to handle it first thing Monday. @sarabala1979 feel free to assign me if that timeline is okay with you.

@isubasinghe
Member

isubasinghe commented Nov 7, 2022

I believe commit cc9d14c introduces the bug, or rather makes it appear (it could just be the canary in the coal mine); I checked this with git bisect. I am working on a fix.
I'm not yet sure whether this is a side effect of something else; I will update as I make progress.

This is interesting because the JSON output from argo get is correct. But I don't think it is just a formatting issue; something funky is going on. I say this because if I submit a workflow on :latest and then check out a previous commit, it still displays the output incorrectly.

@mweibel
Contributor

mweibel commented Nov 25, 2022

@isubasinghe @terrytangyuan unfortunately there is still a bug with the workflow status.

example workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: nodename-
spec:
  arguments: {}
  entrypoint: render
  templates:
    - inputs: {}
      metadata: {}
      name: render
      steps:
        - - arguments:
              parameters:
                - name: frames
                  value: '{{item.frames}}'
            name: run-blender
            template: blender
            withItems:
              - frames: 1
    - container:
        image: argoproj/argosay:v2
        command: ["/bin/sh", "-c"]
        args:
          - /argosay echo 0/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s && /argosay echo 50/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s
        name: ""
      inputs:
        parameters:
          - name: frames
      name: blender
      retryStrategy:
        limit: 2
        retryPolicy: Always

yields the following status:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  annotations:
    workflows.argoproj.io/pod-name-format: v2
  creationTimestamp: "2022-11-25T11:33:41Z"
  generateName: nodename-
  generation: 3
  labels:
    workflows.argoproj.io/phase: Running
  name: nodename-bvd45
  namespace: argo
  resourceVersion: "15649"
  uid: ea233eef-210d-4394-a238-ef847b104458
spec:
  activeDeadlineSeconds: 300
  arguments: {}
  entrypoint: render
  podSpecPatch: |
    terminationGracePeriodSeconds: 3
  templates:
  - inputs: {}
    metadata: {}
    name: render
    outputs: {}
    steps:
    - - arguments:
          parameters:
          - name: frames
            value: '{{item.frames}}'
        name: run-blender
        template: blender
        withItems:
        - frames: 1
  - container:
      args:
      - /argosay echo 0/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s && /argosay
        echo 50/100 $ARGO_PROGRESS_FILE && /argosay sleep 10s
      command:
      - /bin/sh
      - -c
      image: argoproj/argosay:v2
      name: ""
      resources: {}
    inputs:
      parameters:
      - name: frames
    metadata: {}
    name: blender
    outputs: {}
    retryStrategy:
      limit: 2
      retryPolicy: Always
status:
  artifactGCStatus:
    notSpecified: true
  artifactRepositoryRef:
    artifactRepository:
      archiveLogs: true
      s3:
        accessKeySecret:
          key: accesskey
          name: my-minio-cred
        bucket: my-bucket
        endpoint: minio:9000
        insecure: true
        secretKeySecret:
          key: secretkey
          name: my-minio-cred
    configMap: artifact-repositories
    key: default-v1
    namespace: argo
  conditions:
  - status: "False"
    type: PodRunning
  finishedAt: null
  nodes:
    nodename-bvd45:
      children:
      - nodename-bvd45-701773242
      displayName: nodename-bvd45
      finishedAt: null
      id: nodename-bvd45
      name: nodename-bvd45
      phase: Running
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateName: render
      templateScope: local/nodename-bvd45
      type: Steps
    nodename-bvd45-701773242:
      boundaryID: nodename-bvd45
      children:
      - nodename-bvd45-3728066428
      displayName: '[0]'
      finishedAt: null
      id: nodename-bvd45-701773242
      name: nodename-bvd45[0]
      phase: Running
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateScope: local/nodename-bvd45
      type: StepGroup
    nodename-bvd45-3728066428:
      boundaryID: nodename-bvd45
      children:
      - nodename-bvd45-3928099255
      displayName: run-blender(0:frames:1)
      finishedAt: null
      id: nodename-bvd45-3728066428
      inputs:
        parameters:
        - name: frames
          value: "1"
      name: nodename-bvd45[0].run-blender(0:frames:1)
      phase: Running
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateName: blender
      templateScope: local/nodename-bvd45
      type: Retry
    nodename-bvd45-3928099255:
      boundaryID: nodename-bvd45
      displayName: run-blender(0:frames:1)(0)
      finishedAt: null
      hostNodeName: k3d-argowf-server-0
      id: nodename-bvd45-3928099255
      inputs:
        parameters:
        - name: frames
          value: "1"
      message: PodInitializing
      name: nodename-bvd45[0].run-blender(0:frames:1)(0)
      phase: Pending
      progress: 0/1
      startedAt: "2022-11-25T11:33:41Z"
      templateName: blender
      templateScope: local/nodename-bvd45
      type: Pod
  phase: Running
  progress: 0/1
  startedAt: "2022-11-25T11:33:41Z"

The pod is named nodename-bvd45-blender-3928099255, but the node ID does not include the template name.
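
A quick cross-check, as a sketch that assumes the node-ID suffix is a 32-bit FNV-1a hash of the node name (which the status above suggests but does not confirm): the v2 pod name would be the workflow name plus the template name plus that hash, while the node ID omits the template name.

package main

import (
	"fmt"
	"hash/fnv"
)

func main() {
	// Node name taken from the status above; the FNV-1a assumption is mine.
	nodeName := "nodename-bvd45[0].run-blender(0:frames:1)(0)"
	h := fnv.New32a()
	_, _ = h.Write([]byte(nodeName))
	fmt.Printf("node ID:  nodename-bvd45-%d\n", h.Sum32())         // compare with nodename-bvd45-3928099255 in the status
	fmt.Printf("pod name: nodename-bvd45-blender-%d\n", h.Sum32()) // compare with the actual pod name
}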

Can you please reopen or should I create a new issue?

@isubasinghe
Member

isubasinghe commented Nov 25, 2022

@mweibel could you please tell me what the desired pod name should be?
The PR only addressed the case where a wrong number was generated when pretty printing.
Is this the YAML output of argo get?

I strongly suspect this is a controller/operator issue, different from the issue initially created, which was formatting-based.

If so, this issue is distinct from the original one; would it be better to create a new issue to keep them atomic?

@mweibel
Contributor

mweibel commented Nov 25, 2022

Yeah, I suspect the issue at hand is that the Argo workflow status doesn't contain the right node IDs, which is why the CLI can't resolve the correct pod names. I'll create a new issue with the details.

@mweibel
Contributor

mweibel commented Nov 25, 2022

See #10107
