[bug] v4 hangs uploading from mac runners #527

Open
camscale opened this issue Feb 22, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@camscale

camscale commented Feb 22, 2024

What happened?

When I upload artifacts at the end of a build on a Mac runner (macos-13-xl-arm64), about 1 time in 3 the upload stalls partway through and never completes. The job is eventually cancelled by GHA and my entire workflow is marked cancelled. I cannot restart that job because it is not marked as a failure.

What did you expect to happen?

The artifacts upload to completion and the job finishes without error.

How can we reproduce it?

It is not easily reproducible. The workflows are private and cannot be shared here. I will open a support ticket to share more.

Anything else we need to know?

I have updated multiple workflows from v3 to v4, and all jobs in the workflow use v4. I have 6 Linux build jobs that all upload artifacts at the end. None of them have failed. For the Mac builds, two jobs do architecture-specific builds and upload their artifacts. These have never failed. The final Mac job takes the architecture-specific binaries and produces universal binaries. It is this job that has failed about 1 time in 3 when uploading the artifacts at the end. This failing job is almost always the last of the jobs to run. It necessarily runs after the two previous Mac jobs, and the 6 Linux jobs are quicker than the Mac jobs and complete before them.

The failing artifacts total about 1.4G, but on at least one run the upload hung after logging the first 8MiB chunk.

All uploads for all the above-described jobs have:

compression-level: 0
retention-days: 1

The path: setting for the Linux jobs is a single directory. The path: setting for the working Mac jobs is a single directory and a single exclude pattern. The path: setting for the failing job is a single directory and three exclude patterns. Before the chunks are uploaded, the reported count of files to be uploaded is correct.

The overwrite: setting is false for all jobs.

Output from one of the failed/stalled runs:

Run actions/upload-artifact@v4
  with:
    name: release-mac
    compression-level: 0
    retention-days: 1
    path: build/artifacts
  !**/*-unsigned*
  !**/*-arm64*
  !**/*-amd64*
  
    if-no-files-found: warn
    overwrite: false
  env:
    GOMODCACHE: /tmp/gomodcache
    GOCACHE: /tmp/gocache
    NODE_VERSION: 18.19.1
    pythonLocation: /Users/runner/hostedtoolcache/Python/3.11.7/arm64
    PKG_CONFIG_PATH: /Users/runner/hostedtoolcache/Python/3.11.7/arm64/lib/pkgconfig
    Python_ROOT_DIR: /Users/runner/hostedtoolcache/Python/3.11.7/arm64
    Python2_ROOT_DIR: /Users/runner/hostedtoolcache/Python/3.11.7/arm64
    Python3_ROOT_DIR: /Users/runner/hostedtoolcache/Python/3.11.7/arm64
With the provided path, there will be 12 files uploaded
Artifact name is valid!
Root directory input is valid!
Beginning upload of artifact content to blob storage
Uploaded bytes 8388608

That is the end of the output for the step.

What version of the action are you using?

v4.3.1

What are your runner environments?

macos

Are you on GitHub Enterprise Server? If so, what version?

No response

camscale added the bug label Feb 22, 2024
camscale added a commit to gravitational/teleport that referenced this issue Mar 4, 2024
Add a workflow to try to reproduce the failure we see in teleport.e with
upload-artifact on macOS, reported in
actions/upload-artifact#527.

Reproducing this in a public repo with minimal steps will make it easier
to diagnose this issue.
@camscale
Author

camscale commented Mar 4, 2024

Using the following workflow has reproduced the issue multiple times for me:

name: Reproduce Mac upload-artifact failure
on:
  push:
    branches:
      - camh/repro-mac-upload-artifact-issue

jobs:
  test-upload-artifact:
    runs-on: macos-13-xl-arm64
    steps:
      - name: Create files (500MiB)
        run: |
          dd if=/dev/urandom of=artifact bs=1M count=500
          dd if=/dev/zero of=artifact-unsigned bs=1M count=1
      - name: upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: urandom
          compression-level: 0
          retention-days: 1
          path: |
            artifact
            !*-unsigned*

Please note that I was testing this on the branch camh/repro-mac-upload-artifact-issue, hence the workflow push trigger. You will obviously need to change this to whatever branch name you are using, or perhaps merge a workflow_dispatch triggered version and manually trigger it each time.
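As a rough sketch of the manually triggered variant mentioned above, the `push` trigger can be swapped for `workflow_dispatch`. Everything else is copied unchanged from the workflow above; only the "(manual)" suffix in the workflow name is an addition for illustration. Note that a `workflow_dispatch` workflow has to exist on the repository's default branch before it can be triggered by hand.

```yaml
name: Reproduce Mac upload-artifact failure (manual)
on:
  # Manual trigger instead of a push to a specific branch.
  workflow_dispatch:

jobs:
  test-upload-artifact:
    runs-on: macos-13-xl-arm64
    steps:
      - name: Create files (500MiB)
        run: |
          dd if=/dev/urandom of=artifact bs=1M count=500
          dd if=/dev/zero of=artifact-unsigned bs=1M count=1
      - name: upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: urandom
          compression-level: 0
          retention-days: 1
          path: |
            artifact
            !*-unsigned*
```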

The purpose of the artifact-unsigned file is to reproduce something close to my actual workflow that is failing. I do not know whether the excluded path is significant, but that is the pattern I am using, so I kept this reproduction scenario close to it.

I have run this workflow with a 500MiB test file twice, and twice it has failed (currently it is stalled on the second run and I am waiting for it to time out). I did a number of runs with a 100MiB file and saw one or two failures with about 10 successes. A larger file seems to trigger the failure more readily.
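To compare the two file sizes side by side, the same job could be run as a matrix. This is only a sketch and was not part of the runs described above: the matrix key `size_mib`, the workflow name, and the per-size artifact names are hypothetical, and `fail-fast: false` is set on the assumption that one stalled job should not cancel the other.

```yaml
name: Reproduce Mac upload-artifact failure (size matrix)
on:
  workflow_dispatch:

jobs:
  test-upload-artifact:
    runs-on: macos-13-xl-arm64
    strategy:
      # Let each size run to completion even if the other one fails.
      fail-fast: false
      matrix:
        # Hypothetical key: test file size in MiB (the two sizes tried above).
        size_mib: [100, 500]
    steps:
      - name: Create files (${{ matrix.size_mib }}MiB)
        run: |
          dd if=/dev/urandom of=artifact bs=1M count=${{ matrix.size_mib }}
          dd if=/dev/zero of=artifact-unsigned bs=1M count=1
      - name: upload artifact
        uses: actions/upload-artifact@v4
        with:
          # v4 needs a distinct artifact name per job within a run.
          name: urandom-${{ matrix.size_mib }}
          compression-level: 0
          retention-days: 1
          path: |
            artifact
            !*-unsigned*
```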

@JoHuang

JoHuang commented Mar 6, 2024

I have the same issue here.

@ligaz

ligaz commented Mar 8, 2024

@camscale Does this happen if you use v3 of the action? Have you tried using a different image (macos-14) or an alternative runner like FlyCI?

@JoHuang

JoHuang commented Mar 8, 2024

FYI, I encountered this issue twice: once on macos-14 and once on FlyCI.

@camscale
Author

@ligaz I never encountered this problem with v3 of the action. Only after upgrading to v4 did I start to see this issue. I have not tried the macos-14 runners.

@sa1g0n1337

I'm experiencing the same issue on an Ubuntu runner. I only encounter this problem on v4.

@camscale
Author

I have just tried the macos-14 arm64 runners (macos-14-xlarge) and the issue is present there too.

@camscale
Author

Interestingly, I just noticed that one of my runs stalled for 13 minutes, then resumed and ran to completion. I don't think I've seen that before. I've had a hang sit there for two hours before I cancelled it.

Wed, 13 Mar 2024 19:44:53 GMT Uploaded bytes 301989888
Wed, 13 Mar 2024 19:58:06 GMT Uploaded bytes 310378496
Wed, 13 Mar 2024 19:58:06 GMT Uploaded bytes 311799972
Wed, 13 Mar 2024 19:58:06 GMT Finished uploading artifact content to blob storage!
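Because a stall can last anywhere from minutes to hours, one way to bound it is a step-level `timeout-minutes` on the upload step. The fragment below is a sketch, not a fix for the underlying hang, and the 15-minute limit is an arbitrary assumption that would need tuning to the normal upload time. This thread does not establish how a timed-out step is reported.

```yaml
      - name: upload artifact
        uses: actions/upload-artifact@v4
        # Assumed limit: stop the step if the upload has not finished in 15 minutes.
        timeout-minutes: 15
        with:
          name: urandom
          compression-level: 0
          retention-days: 1
          path: |
            artifact
            !*-unsigned*
```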

@kinke

kinke commented Mar 30, 2024

I'm seeing sporadic upload failures lately too, on macOS runners only so far, and only for jobs uploading larger artifacts (around 400 MB). The weird thing is that the job shows up as failed, but the upload step is still running and doesn't show any error at all.

E.g., here's a currently running workflow, with 2 upload failures for both macOS arm64 jobs (running on macos-14 runners, but IIRC, macos-12 failed too in the past): https://github.com/ldc-developers/llvm-project/actions/runs/8492339227/job/23265222880


@danra

danra commented May 4, 2024

Got the same error with a self-hosted mac runner, consistently exactly 5 minutes after the step started.
Reverting to v3 resolved the issue.

@korsour

korsour commented May 5, 2024

Same issue here with v4. I don't want to go back to v3, because it is 5 times slower. Is a fix expected anytime soon?
