Broken "message" on last transition in case of failure #2029

Closed
afrittoli opened this issue Feb 10, 2020 · 14 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.


@afrittoli
Member

Expected Behavior

The message should point to the first step that failed and explain how to get its logs.

Actual Behavior

The message points to the last step that failed.
Additionally, the suggested command to get the logs includes an extra newline that breaks copy/paste.

When a taskrun has output resources, steps are appended to process those. If a step during the task fails, all the appended resource steps are marked as failed, e.g.:

    state:
      terminated:
        containerID: docker://35d721aed3233f6217a63f97661afdd8d344b33718744ab374992c19c6cf8487
        exitCode: 1
        finishedAt: "2020-02-10T13:16:57Z"
        message: |
          {"level":"info","ts":1581340399.0176625,"logger":"fallback-logger","caller":"logging/config.go:69","msg":"Fetch GitHub commit ID from kodata failed: open /var/run/ko/HEAD: no such file or directory"}
          2020/02/10 13:16:57 Skipping step because a previous step failed
        reason: Error
        startedAt: "2020-02-10T13:13:19Z"

Steps to Reproduce the Problem

1. https://dashboard.dogfooding.tekton.dev/#/namespaces/default/pipelineruns/triggers-release-nightly-sk54v

Additional Info
One option here is to fix the logic to get the first failed step instead of the last - however I think the current logic relies on the fact that only one step may have failed, which - at least for now - seems reasonable. I wonder if we should leave the steps added by resources as "unknown" instead of marking them as "failed" when they are skipped.
@afrittoli
Member Author

/kind bug
/priority urgent

@tekton-robot tekton-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 10, 2020
@vdemeester vdemeester added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Feb 10, 2020
@bobcatfish bobcatfish added this to Needs triage in Tekton Pipelines Feb 26, 2020
@GregDritschler
Contributor

I have been looking into this issue. There are two problems reported.

  1. The PipelineRun "for logs run..." message does not point to the first failing step.
  2. The message has an extra newline.

For the first problem, pkg/pod/status.go does sort the container statuses by finish time (State.Terminated.FinishedAt) to try to find the first failure. However, this field has a resolution of seconds. In my experiments the steps that are skipped usually finish in the same second as the step that failed, which defeats the intention of the sort. By the way, this problem can be observed without using output resources: any user steps that follow the failed step get the skipped message too.
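To make the timestamp tie concrete, here is a minimal, self-contained sketch of the problem (illustrative only, not the actual code in pkg/pod/status.go):

```go
package main

import (
	"fmt"
	"sort"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A failed step and a step skipped right after it: both containers
	// terminate within the same wall-clock second.
	t := metav1.NewTime(time.Date(2020, 2, 10, 13, 16, 57, 0, time.UTC))
	statuses := []corev1.ContainerStatus{
		{Name: "step-skipped", State: corev1.ContainerState{
			Terminated: &corev1.ContainerStateTerminated{ExitCode: 1, FinishedAt: t}}},
		{Name: "step-failed", State: corev1.ContainerState{
			Terminated: &corev1.ContainerStateTerminated{ExitCode: 1, FinishedAt: t}}},
	}
	// Sorting by FinishedAt (second resolution) cannot break the tie, so the
	// arbitrary input order survives and the wrong container can be picked
	// as the "first" failure.
	sort.SliceStable(statuses, func(i, j int) bool {
		return statuses[i].State.Terminated.FinishedAt.Time.Before(
			statuses[j].State.Terminated.FinishedAt.Time)
	})
	fmt.Println("reported first failure:", statuses[0].Name) // step-skipped
}
```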

My initial thought to fix this would be to store a higher-precision finish time in the pod's termination message and copy that to the termination status, imitating what happens with the start time but with a higher-precision timestamp. One side effect is that the higher-precision timestamp would show up externally in the finishedAt field. It might be desirable to change startedAt to have the same precision so the formats look alike.
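For reference, a minimal sketch of the precision difference between what finishedAt shows today and what a higher-precision timestamp would carry:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ts := time.Date(2020, 2, 10, 13, 16, 57, 123456789, time.UTC)
	// metav1.Time serializes with second resolution, which is what shows up
	// in finishedAt today:
	fmt.Println(ts.Format(time.RFC3339)) // 2020-02-10T13:16:57Z
	// A termination-message timestamp could carry full precision instead:
	fmt.Println(ts.Format(time.RFC3339Nano)) // 2020-02-10T13:16:57.123456789Z
}
```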

I have not been able to reproduce the second problem. I see that the pod status has the "skipping" message you've shown above; however, I don't see any effect on the PipelineRun's "for logs run" message, i.e. I don't see a newline in it. Can you show an example of what you are seeing?

@afrittoli
Member Author

afrittoli commented Mar 13, 2020

@GregDritschler Thank you for looking into this!
This is the full spec of the taskRun (with status) where I originally saw the issue: https://gist.github.com/25a4fb15cb0e52a28a2c3f8dbbf67878

The problem there is that the step-image-digest-exporter-bhdb6 is marked as the cause of the failure of the taskrun, while in reality a step in the task failed, and step-image-digest-exporter-bhdb6 was never even executed.

        - lastTransitionTime: "2020-02-10T13:16:59Z"
          message: '"step-image-digest-exporter-bhdb6" exited with code 1 (image:
            "docker-pullable://gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/imagedigestexporter@sha256:1cf3f27f3ff7c73782d8a65853e8fc7f0d4aafc6443893e0150bdbe614a9169d");
            for logs run: kubectl -n default logs triggers-release-nightly-sk54v-publish-images-9fpzt-pod-8wcvh
            -c step-image-digest-exporter-bhdb6'
          reason: Failed
          status: "False"
          type: Succeeded

I've been trying to reproduce this against Tekton v0.10.1, and I've not been able to yet. I used a taskrun directly, while in the original case the taskrun was triggered as part of a pipeline, but I cannot imagine it would make a difference. This is the taskrun I used as a reproducer:

  apiVersion: tekton.dev/v1alpha1
  kind: TaskRun
  metadata:
    generateName: a-task-to-reproduce-tekton-pipelines-issue-2029-
  spec:
    taskSpec:
      outputs:
        resources:
        - name: pr
          type: pullRequest
        - name: ci
          type: image
        - name: notification
          type: cloudEvent
      steps:
      - name: this-step-does-something-it-sleeps-30-seconds
        image: busybox
        script: |
          #!/bin/sh
          set -ex
          echo "Wait for a bit"
          sleep 30
      - name: fail
        image: busybox
        script: |
          #!/bin/sh
          set -ex
          echo "This fails"
          false
      - name: never-runs
        image: busybox
        script: |
          #!/bin/sh
          set -ex
          echo "I never get to run"
    outputs:
      resources:
      - name: pr
        resourceSpec:
          type: pullRequest
          params:
          - name: url
            value: https://github.com/tektoncd/pipeline/pull/100
      - name: ci
        resourceSpec:
            type: image
            params:
            - name: url
              value: fake-registry/test/fake-image
      - name: notification
        resourceSpec:
            type: cloudEvent
            params:
            - name: targetURI
              value: http://el-pipeline-release-post-processing.default.svc.cluster.local:8080

What I still see, though, is the issue with the message to access the logs being broken by a newline:

  conditions:
  - lastTransitionTime: "2020-03-13T11:20:41Z"
    message: '"step-fail" exited with code 1 (image: "docker.io/library/busybox@sha256:b26cd013274a657b86e706210ddd5cc1f82f50155791199d29b9e86e935ce135");
      for logs run: kubectl -n default logs a-task-to-reproduce-tekton-pipelines-issue-2029-lpdpk-pod-scpn9
      -c step-fail'
    reason: Failed
    status: "False"
    type: Succeeded

For this to happen, the pod name generated by the taskrun must be long enough: you can either use a long taskrun name or run the taskrun within a pipelinerun, so that names are concatenated and become very long.

I will make another attempt to reproduce this, and if it fails I will close it.
I think we could split out the second issue into a dedicated one - it's not terrible, but it forces one to copy/paste twice to get the logs of a step, which I find rather inconvenient.

@afrittoli
Member Author

Moved the second part of the issue to #2221

@afrittoli
Member Author

afrittoli commented Mar 13, 2020

I tried using a pipeline, but I could not reproduce the issue that way either.

@afrittoli
Member Author

I have been looking into this issue. There are two problems reported.

1. The PipelineRun "for logs run..." message does not point to the first failing step.

2. The message has an extra newline.

For the first problem, pkg/pod/status.go does sort the container statuses by finish time (State.Terminated.FinishedAt) to try to find the first failure. However, this field has a resolution of seconds. In my experiments the steps that are skipped usually finish in the same second as the step that failed, which defeats the intention of the sort. By the way, this problem can be observed without using output resources: any user steps that follow the failed step get the skipped message too.

This is an interesting observation, thank you.

I believe that any step after the failed one, whether or not they belong to pipeline resources, should be marked as skipped and not failed, so there should not be a "first failure", only "the failure".
In case multiple failures could happen, we should have a better mechanism in place to hint at all the steps that failed in the error message.
We could still add the ability to handle multiple failures there. @vdemeester @bobcatfish thoughts?

@afrittoli afrittoli removed the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Mar 13, 2020
@vdemeester vdemeester moved this from Needs triage to High priority in Tekton Pipelines Mar 16, 2020
@afrittoli
Member Author

@GregDritschler thanks for your analysis on this, I realize now that I've not been able to reproduce this because of the timing in my pipeline. The issue does exist, and I believe it's worth solving, even if pipeline resources are not part of the beta.

The sorting by status was introduced as a solution to #1905; however, as you pointed out, the time resolution may not be enough to identify the correct failed step when relying on time alone.

So I think the problem is that we do not encode enough information in the steps - at pod level - to distinguish a failed one from a skipped one. The pod status for a skipped step looks like this:

                "containerID": "containerd://54195e79b3e02d67ce14cb21ed04513ce0694f7072da2f7d1fee6f0bf41793a7",
                "image": "sha256:fd5c69b7e36d96dcd25ed34c523062c5d0c47fc62bc84e49dde0146425492404",
                "imageID": "gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/pullrequest-init@sha256:3783254c379b286dd0987674160d3e19a95be1ccf0985788d6dcc0f159199095",
                "lastState": {},
                "name": "step-pr-source-pr-cxmlh",
                "ready": false,
                "restartCount": 0,
                "state": {
                    "terminated": {
                        "containerID": "containerd://54195e79b3e02d67ce14cb21ed04513ce0694f7072da2f7d1fee6f0bf41793a7",
                        "exitCode": 1,
                        "finishedAt": "2020-03-13T12:04:36Z",
                        "reason": "Error",
                        "startedAt": "2020-03-13T12:04:00Z"
                    }
                }

Which has no indication whatsoever that the step was skipped.
That information is encoded in the corresponding taskrun, in the sense that only steps that were executed are tracked in the taskrun's status.steps.

I think the solution to this should involve finding the only step that actually failed. As of today it is not possible for more than one step to fail: even if multiple containers are marked as failed, only one failed because of an actual step execution failure.

A couple of ways we could achieve that:

  • We could use the Reason field in the pod to provide more context, e.g. instead of Error we could say Step skipped.
  • We could change the method signature to pass the TaskRun status, and filter the pod containers to include only those that have a matching step in TaskRun.status.steps (see the sketch after this list).
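A rough sketch of the second option (the helper name and wiring are hypothetical, not an existing API):

```go
package pod

import (
	corev1 "k8s.io/api/core/v1"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1alpha1"
)

// filterToExecutedSteps is a hypothetical helper: it keeps only the container
// statuses whose container has a matching entry in TaskRun.status.steps, i.e.
// steps that actually executed, so skipped resource steps are ignored when
// picking the failure to report.
func filterToExecutedSteps(statuses []corev1.ContainerStatus, trs v1alpha1.TaskRunStatus) []corev1.ContainerStatus {
	executed := map[string]bool{}
	for _, step := range trs.Steps {
		executed[step.ContainerName] = true
	}
	var out []corev1.ContainerStatus
	for _, cs := range statuses {
		if executed[cs.Name] {
			out = append(out, cs)
		}
	}
	return out
}
```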

@GregDritschler
Contributor

We could use the Reason field in the pod to provide more context, e.g. instead of Error we could say Step skipped.
How? The Reason field in the pod status is set by Kubernetes. How would you alter it?

We could change the method signature to pass the TaskRun status, and filter the pod containers to include only those that have a matching step in TaskRun.status.steps.
What method signature?

The TaskRun step status is built from the K8s container status. The only way I've seen to alter what's in there is via what was done to adjust the start time via the task results. It's ugly though.

@afrittoli
Member Author

We could use the Reason field in the pod to provide more context, e.g. instead of Error we could say Step skipped.
How? The Reason field in the pod status is set by Kubernetes. How would you alter it?

Good point. I guess we could extend the entrypoint to allow some logic that writes to the termination log, but that might be a bit too much to solve this issue.

We could change the method signature to pass the TaskRun status, and filter the pod containers to include only those that have a matching step in TaskRun.status.steps.
What method signature?

I mean the signature of [getFailureMessage](https://github.com/tektoncd/pipeline/blob/4e4884fb82de205569305e762b44741ebf2c77d2/pkg/pod/status.go#L278). If getFailureMessage was passed the taskRun spec by its caller, it would be able to use it to determine which of the failed steps was the first one to be started, and thus the first one to fail (since they run sequentially). updateCompletedTaskRun only gets the task status, but it could be extended to include the spec.
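For illustration, the extended signature might look like this (a hypothetical sketch; the real getFailureMessage takes only the pod, and nil checks for taskRef-based runs are omitted):

```go
package pod

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1alpha1"
)

// Hypothetical variant of getFailureMessage that also receives the TaskRun:
// walking the steps in spec order (which is execution order, since steps run
// sequentially) finds the first step that started and failed.
func getFailureMessage(pod *corev1.Pod, tr *v1alpha1.TaskRun) string {
	for _, step := range tr.Spec.TaskSpec.Steps { // spec order == execution order
		for _, cs := range pod.Status.ContainerStatuses {
			term := cs.State.Terminated
			if cs.Name == "step-"+step.Name && term != nil && term.ExitCode != 0 {
				return fmt.Sprintf("%q exited with code %d; for logs run: kubectl -n %s logs %s -c %s",
					cs.Name, term.ExitCode, pod.Namespace, pod.Name, cs.Name)
			}
		}
	}
	return "build failed for unspecified reasons"
}
```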

The TaskRun step status is built from the K8s container status. The only way I've seen to alter what's in there is via what was done to adjust the start time via the task results. It's ugly though.

@GregDritschler
Contributor

Well, status.go is already sorting the taskrun step status according to the task spec. The problem with this is that the internally-generated steps fall to the bottom of the list. So if the code were changed to use the taskrun step status to find the first failing step, it could still misidentify the first failing step when an internally-generated step failed.
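To illustrate the point, a simplified sketch of spec-order sorting (not the actual status.go code):

```go
package pod

import (
	"math"
	"sort"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1alpha1"
)

// Illustrative spec-order sort: step states whose name is not in the task
// spec (internally-generated steps, e.g. resource steps) get MaxInt32 and
// sink to the bottom, so a failing internally-generated step would always
// look "last" no matter when it actually ran.
func sortBySpecOrder(specStepNames []string, states []v1alpha1.StepState) {
	order := map[string]int{}
	for i, name := range specStepNames {
		order[name] = i
	}
	rank := func(s v1alpha1.StepState) int {
		if i, ok := order[s.Name]; ok {
			return i
		}
		return math.MaxInt32
	}
	sort.SliceStable(states, func(i, j int) bool {
		return rank(states[i]) < rank(states[j])
	})
}
```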

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Aug 13, 2020
@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Tekton Pipelines automation moved this from High priority to Closed Aug 13, 2020