Runner-Scale-Set in Kubernetes mode fails when writing to /home/runner/_work #2890

Closed
7 tasks done
bobertrublik opened this issue Sep 12, 2023 · 25 comments
Labels
bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode question Further information is requested

Comments

@bobertrublik

bobertrublik commented Sep 12, 2023

Checks

Controller Version

0.5.0

Helm Chart Version

0.5.0

CertManager Version

1.12.1

Deployment Method

Helm

cert-manager installation

Yes, it's also used in production.

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is critical and you therefore need priority support.)
  • I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • I've migrated to the workflow job webhook event (if you're using webhook-driven scaling)

Resource Definitions

# Source: gha-runner-scale-set/templates/autoscalingrunnerset.yaml
apiVersion: actions.github.com/v1alpha1
kind: AutoscalingRunnerSet
metadata:
  name: github-runners
  namespace: github-runners
  labels:
    app.kubernetes.io/component: "autoscaling-runner-set"
    helm.sh/chart: gha-rs-0.5.0
    app.kubernetes.io/name: gha-rs
    app.kubernetes.io/instance: github-runners
    app.kubernetes.io/version: "0.5.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: gha-rs
    actions.github.com/scale-set-name: github-runners
    actions.github.com/scale-set-namespace: github-runners
  annotations:
    actions.github.com/cleanup-github-secret-name: github-runners-gha-rs-github-secret
    actions.github.com/cleanup-manager-role-binding: github-runners-gha-rs-manager
    actions.github.com/cleanup-manager-role-name: github-runners-gha-rs-manager
    actions.github.com/cleanup-kubernetes-mode-role-binding-name: github-runners-gha-rs-kube-mode
    actions.github.com/cleanup-kubernetes-mode-role-name: github-runners-gha-rs-kube-mode
    actions.github.com/cleanup-kubernetes-mode-service-account-name: github-runners-gha-rs-kube-mode
spec:
  githubConfigUrl: https://github.com/privaterepo
  githubConfigSecret: github-runners-gha-rs-github-secret
  runnerGroup: runners
  maxRunners: 3
  minRunners: 1

  template:
    spec:
      securityContext: 
        fsGroup: 1001
      serviceAccountName: github-runners-gha-rs-kube-mode
      containers:
      - name: runner
        
        command: 
          - /home/runner/run.sh
        image: 
          ghcr.io/actions/actions-runner:latest
        env:
          - 
            name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - 
            name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - 
            name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true"
        volumeMounts:
          - 
            mountPath: /home/runner/_work
            name: work
      
      volumes:
      
      - 
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes:
              - ReadWriteOnce
              resources:
                requests:
                  storage: 4Gi
              storageClassName: zrs-delete
        name: work

To Reproduce

I deployed the gha-runner-scale-set-controller in the standard Helm chart configuration and gha-runner-scale-set chart with the following values:

githubConfigUrl: "https://github.com/privaterepo"
minRunners: 1
maxRunners: 3
runnerGroup: "runners"
containerMode:
  type: "kubernetes"
template:
  spec:
    securityContext:
      fsGroup: 1001
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
      env:
        - name: ACTIONS_RUNNER_CONTAINER_HOOKS
          value: /home/runner/k8s/index.js
        - name: ACTIONS_RUNNER_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          value: "true"
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: [ "ReadWriteOnce" ]
              storageClassName: "zrs-delete"
              resources:
                requests:
                  storage: 4Gi
controllerServiceAccount:
  namespace: github-arc
  name: github-arc

The storage is provisioned by Azure StandardSSD_ZRS.

Describe the bug

When I run a workflow on a self-hosted runner it always fails at the actions/checkout@v3 action with this error:

Run '/home/runner/k8s/index.js'
shell: /home/runner/externals/node16/bin/node {0}
node:internal/fs/utils:347
throw err;
^

Error: EACCES: permission denied, open '/__w/_temp/_runner_file_commands/save_state_62e77b56-723c-4eef-bfa1-55e26a98e636'
at Object.openSync (node:fs:590:3)
at Object.writeFileSync (node:fs:2202:35)
at Object.appendFileSync (node:fs:2264:6)
at Object.issueFileCommand (/__w/_actions/actions/checkout/v3/dist/index.js:2950:8)
at Object.saveState (/__w/_actions/actions/checkout/v3/dist/index.js:2867:31)
at Object.8647 (/__w/_actions/actions/checkout/v3/dist/index.js:2326:10)
at __nccwpck_require__ (/__w/_actions/actions/checkout/v3/dist/index.js:18256:43)
at Object.2565 (/__w/_actions/actions/checkout/v3/dist/index.js:146:34)
at __nccwpck_require__ (/__w/_actions/actions/checkout/v3/dist/index.js:18256:43)
at Object.9210 (/__w/_actions/actions/checkout/v3/dist/index.js:1141:36) {
errno: -13,
syscall: 'open',
code: 'EACCES',
path: '/__w/_temp/_runner_file_commands/save_state_62e77b56-723c-4eef-bfa1-55e26a98e636'
}
Error: Error: failed to run script step: command terminated with non-zero exit code: error executing command [sh -e /__w/_temp/341dc910-5143-11ee-90c0-098a53d3ac15.sh], exit code 1
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.

Looking inside the pod I see that the owner of _work is user root.

4.0K drwxrwsr-x 3 root runner 4.0K Sep 12 08:06 _work

Describe the expected behavior

The checkout action should have no issues checking out a repository by writing to /home/runner/_work/ inside a runner pod.

I found this issue in the runner repository which proposes to set user ownership to the runner user. I'm not sure how to do that and why it's necessary with a rather standard deployment of the runner scale set. I already configured fsGroup as per troubleshooting docs.

According to this comment I'm not supposed to set containerMode when configuring the template section. However, this disables the kube mode role, rolebinding, and serviceaccount in the chart, creates the noPermissionServiceAccount, and the runner doesn't work at all.

Whole Controller Logs

https://gist.github.com/bobertrublik/4ee34181ceda6da120bd91fd8f68754c

Whole Runner Pod Logs

https://gist.github.com/bobertrublik/d770a62c64679db5b9eab5644f0cfebc

@bobertrublik bobertrublik added bug Something isn't working needs triage Requires review from the maintainers labels Sep 12, 2023
@github-actions
Contributor

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@nikola-jokic
Member

Hey @bobertrublik,

I think that you are missing the kubernetesModeWorkVolumeClaim
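For context, that field sits under containerMode, roughly like this (a sketch reusing the storage class and size from your posted values; adjust as needed):

containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "zrs-delete"
    resources:
      requests:
        storage: 4Gi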

@bobertrublik
Author

According to the values.yaml file containerMode should be empty if I make any customizations and configure the template.

Just to make sure, I added your suggestion, but the resulting AutoScalingRunnerSet resources were identical.

@nikola-jokic
Member

Oh, that is right, but the containerMode.type is not empty and is therefore expanded.
And please correct me if I'm wrong, but the customization you are using is scoped to the requests and the storage class? If that is the case, you should be able to safely use kubernetesModeWorkVolumeClaim with containerMode.type: kubernetes.

The idea behind the customization is that if there are requirements that are difficult to expand properly, for example dind with custom volume mounts, containerMode should be left commented out and the dind container should be provided as a side-car.
In this case, it seems to me that the customization you are targeting is supported by the helm chart.

@bobertrublik
Author

I applied your suggestion, which made better use of the Helm chart template but doesn't seem to have had any immediate effect.

values.yaml

githubConfigUrl: "https://github.com/privaterepo"
minRunners: 1
maxRunners: 3
runnerGroup: "runners"
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "zrs-delete"
    resources:
      requests:
        storage: 4Gi
template:
  spec:
    securityContext:
      fsGroup: 1001
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
      env:
        - name: ACTIONS_RUNNER_CONTAINER_HOOKS
          value: /home/runner/k8s/index.js
        - name: ACTIONS_RUNNER_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          value: "true"
      volumeMounts:
        - name: work
          mountPath: /home/runner/_work
controllerServiceAccount:
  namespace: github-arc
  name: github-arc

autoscalingrunnerset.yaml

apiVersion: actions.github.com/v1alpha1
kind: AutoscalingRunnerSet
metadata:
  name: github-runners
  namespace: github-runners
  labels:
    app.kubernetes.io/component: "autoscaling-runner-set"
    helm.sh/chart: gha-rs-0.5.0
    app.kubernetes.io/name: gha-rs
    app.kubernetes.io/instance: github-runners
    app.kubernetes.io/version: "0.5.0"
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: gha-rs
    actions.github.com/scale-set-name: github-runners
    actions.github.com/scale-set-namespace: github-runners
  annotations:
    actions.github.com/cleanup-github-secret-name: github-runners-gha-rs-github-secret
    actions.github.com/cleanup-manager-role-binding: github-runners-gha-rs-manager
    actions.github.com/cleanup-manager-role-name: github-runners-gha-rs-manager
    actions.github.com/cleanup-kubernetes-mode-role-binding-name: github-runners-gha-rs-kube-mode
    actions.github.com/cleanup-kubernetes-mode-role-name: github-runners-gha-rs-kube-mode
    actions.github.com/cleanup-kubernetes-mode-service-account-name: github-runners-gha-rs-kube-mode
spec:
  githubConfigUrl: https://github.com/privaterepo
  githubConfigSecret: github-runners-gha-rs-github-secret
  runnerGroup: runners
  maxRunners: 3
  minRunners: 1

  template:
    spec:
      securityContext: 
        fsGroup: 1001
      serviceAccountName: github-runners-gha-rs-kube-mode
      containers:
      - name: runner
        
        command: 
          - /home/runner/run.sh
        image: 
          ghcr.io/actions/actions-runner:latest
        env:
          - 
            name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - 
            name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - 
            name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true"
        volumeMounts:
          - 
            mountPath: /home/runner/_work
            name: work
      
      volumes:
      
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes:
              - ReadWriteOnce
              resources:
                requests:
                  storage: 4Gi
              storageClassName: zrs-delete

@AkshaySuryawanshi8

AkshaySuryawanshi8 commented Sep 22, 2023

@bobertrublik Thank you for raising the issue. I'm facing the same issue too, as perhaps many others are.

@nikola-jokic Could you please help us understand where we are going wrong? Also, if the above information is not sufficient, please let us know what output you need to analyze the issue.

Thank you!!

@Ravio1i

Ravio1i commented Sep 29, 2023

We have a policy which automatically disallows the user "root" and overrides it with a user with ID 1234.

We were able to get around this for the Kubernetes runner pod itself with:

template:
  spec:
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
      env:
      - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
        value: "true"
      securityContext:
        runAsUser: 1001
        runAsGroup: 123

However, when using e.g. a container workflow where another pod is spawned:

jobs:
  arc-runner-job:
    strategy:
      fail-fast: false
      matrix:
        job: [1, 2, 3]
    runs-on: ${{ inputs.arc_name }}
    container: ubuntu
    services:
      redis:
        image: redis
        ports:
          - 6379:6379
    steps:
      - uses: actions/checkout@v3 
      - run: echo "Hello World!"

The checkout step fails:

##[debug]Evaluating condition for step: 'Post Run actions/checkout@v3'
##[debug]Evaluating: always()
##[debug]Evaluating always:
##[debug]=> true
##[debug]Result: true
##[debug]Starting: Post Run actions/checkout@v3
##[debug]Loading inputs
##[debug]Evaluating: github.repository
##[debug]Evaluating Index:
##[debug]..Evaluating github:
##[debug]..=> Object
##[debug]..Evaluating String:
##[debug]..=> 'repository'
##[debug]=> 'GithubRunnerTest/arc-testing-workflows'
##[debug]Result: 'GithubRunnerTest/arc-testing-workflows'
##[debug]Evaluating: github.token
##[debug]Evaluating Index:
##[debug]..Evaluating github:
##[debug]..=> Object
##[debug]..Evaluating String:
##[debug]..=> 'token'
##[debug]=> '***'
##[debug]Result: '***'
##[debug]Loading env
Post job cleanup.
##[debug]Running JavaScript Action with default external tool: node16
Run '/home/runner/k8s/index.js'
##[debug]/home/runner/externals/node16/bin/node /home/runner/k8s/index.js
node:internal/fs/utils:347
    throw err;
    ^

Error: EACCES: permission denied, open '/__w/_temp/_runner_file_commands/save_state_da0afb29-9bfa-4830-b674-26018b714d93'
    at Object.openSync (node:fs:590:3)
    at Object.writeFileSync (node:fs:2202:35)
    at Object.appendFileSync (node:fs:2264:6)
    at Object.issueFileCommand (/__w/_actions/actions/checkout/v3/dist/index.js:2950:8)
    at Object.saveState (/__w/_actions/actions/checkout/v3/dist/index.js:2867:31)
    at Object.8647 (/__w/_actions/actions/checkout/v3/dist/index.js:2326:10)
    at __nccwpck_require__ (/__w/_actions/actions/checkout/v3/dist/index.js:18256:43)
    at Object.2565 (/__w/_actions/actions/checkout/v3/dist/index.js:146:34)
    at __nccwpck_require__ (/__w/_actions/actions/checkout/v3/dist/index.js:18256:43)
    at Object.9210 (/__w/_actions/actions/checkout/v3/dist/index.js:1141:36) {
  errno: -13,
  syscall: 'open',
  code: 'EACCES',
  path: '/__w/_temp/_runner_file_commands/save_state_da0afb29-9bfa-4830-b674-26018b714d93'
}
##[debug]{"message":"command terminated with non-zero exit code: error executing command [sh -e /__w/_temp/68de3ca0-5ec4-11ee-9966-bdeda596caf8.sh], exit code 1","details":{"causes":[{"reason":"ExitCode","message":"1"}]}}
Error: Error: failed to run script step: command terminated with non-zero exit code: error executing command [sh -e /__w/_temp/68de3ca0-5ec4-11ee-9966-bdeda596caf8.sh], exit code 1
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.
##[debug]System.Exception: Executing the custom container implementation failed. Please contact your self hosted runner administrator.
##[debug] ---> System.Exception: The hook script at '/home/runner/k8s/index.js' running command 'RunScriptStep' did not execute successfully
##[debug]   at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.ExecuteHookScript[T](IExecutionContext context, HookInput input, ActionRunStage stage, String prependPath)
##[debug]   --- End of inner exception stack trace ---
##[debug]   at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.ExecuteHookScript[T](IExecutionContext context, HookInput input, ActionRunStage stage, String prependPath)
##[debug]   at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.RunScriptStepAsync(IExecutionContext context, ContainerInfo container, String workingDirectory, String entryPoint, String entryPointArgs, IDictionary`2 environmentVariables, String prependPath)
##[debug]   at GitHub.Runner.Worker.Handlers.ContainerStepHost.ExecuteAsync(IExecutionContext context, String workingDirectory, String fileName, String arguments, IDictionary`2 environment, Boolean requireExitCodeZero, Encoding outputEncoding, Boolean killProcessOnCancel, Boolean inheritConsoleHandler, String standardInInput, CancellationToken cancellationToken)
##[debug]   at GitHub.Runner.Worker.Handlers.NodeScriptActionHandler.RunAsync(ActionRunStage stage)
##[debug]   at GitHub.Runner.Worker.ActionRunner.RunAsync()
##[debug]   at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)
##[debug]Finishing: Post Run actions/checkout@v3

The question is: can I somehow control which user the pods created in Kubernetes mode should run as?

@nikola-jokic
Member

Hey @bobertrublik,

Try not to specify volumes. The field under containerMode.kubernetesModeWorkVolumeClaim already expands the required volume so it should be applied. Can you please try running it with something similar to this:

githubConfigUrl: "https://github.com/privaterepo"
minRunners: 1
maxRunners: 3
runnerGroup: "runners"
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "zrs-delete"
    resources:
      requests:
        storage: 4Gi
template:
  spec:
    securityContext:
      fsGroup: 1001
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]

controllerServiceAccount:
  namespace: github-arc
  name: github-arc

@nikola-jokic
Member

Hey @Ravio1i,

This is a slightly more difficult problem. One possible way that you can overcome this issue is by using container hook 0.4.0. We introduced a hook extension that takes a template and modifies the default pod spec created by the hook. You can specify securityContext there, and it will be applied to the job container.
This ADR aims to document the way the hook extension works.
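For reference, a minimal extension file could look like the sketch below (the values are placeholders; the format that eventually worked is shown later in this thread). The runner then needs the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE env pointing at this file.

# pod-template.yml -- hedged sketch of a hook extension
metadata:
  labels:
    labeled-by: "extension"
spec:
  securityContext:
    runAsUser: 1001   # runner UID; adjust to your policy
    runAsGroup: 123   # runner GID
  restartPolicy: Never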

@bobertrublik
Author

Hey @bobertrublik,

Try not to specify volumes. The field under containerMode.kubernetesModeWorkVolumeClaim already expands the required volume so it should be applied. Can you please try running it with something similar to this:

Your suggestion produces the exact same AutoscalingRunnerSet manifest as my configuration, so I doubt the error lies in the Helm templates.

@nikola-jokic
Member

Can you try to run an init container that applies the correct permissions to all files under /home/runner? The runner image we provide uses UID 1001 and GID 123. Or maybe use fsGroup with the value 123?

@bobertrublik
Author

Using this config to set an initContainer:

githubConfigUrl: "https://github.com/private"
minRunners: 1
maxRunners: 3
runnerGroup: "runners"
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "zrs-delete"
    resources:
      requests:
        storage: 4Gi
template:
  spec:
    initContainers:
    - name: kube-init
      image: ghcr.io/actions/actions-runner:latest
      command: ["sudo", "chown", "-R", "1001:123", "/home/runner/_work"]
      volumeMounts:
      - name: work
        mountPath: /home/runner/_work
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
      env:
        - name: ACTIONS_RUNNER_CONTAINER_HOOKS
          value: /home/runner/k8s/index.js
        - name: ACTIONS_RUNNER_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
          value: "true"
controllerServiceAccount:
  namespace: github-arc
  name: github-arc

I get this error in the checkout step:

Run actions/checkout@v3
Run '/home/runner/k8s/index.js'
node:internal/fs/utils:347
throw err;
^

Error: EACCES: permission denied, open '/__w/_temp/_runner_file_commands/save_state_671739e2-e014-4fe0-b6a8-95b37d5f68c8'
at Object.openSync (node:fs:590:3)
at Object.writeFileSync (node:fs:2202:35)
at Object.appendFileSync (node:fs:2264:6)
at Object.issueFileCommand (/__w/_actions/actions/checkout/v3/dist/index.js:2950:8)
at Object.saveState (/__w/_actions/actions/checkout/v3/dist/index.js:2867:31)
at Object.8647 (/__w/_actions/actions/checkout/v3/dist/index.js:2326:10)
at __nccwpck_require__ (/__w/_actions/actions/checkout/v3/dist/index.js:18256:43)
at Object.2565 (/__w/_actions/actions/checkout/v3/dist/index.js:146:34)
at __nccwpck_require__ (/__w/_actions/actions/checkout/v3/dist/index.js:18256:43)
at Object.9210 (/__w/_actions/actions/checkout/v3/dist/index.js:1141:36) {
errno: -13,
syscall: 'open',
code: 'EACCES',
path: '/__w/_temp/_runner_file_commands/save_state_671739e2-e014-4fe0-b6a8-95b37d5f68c8'
}
Error: Error: failed to run script step: command terminated with non-zero exit code: error executing command [sh -e /__w/_temp/54914b40-6430-11ee-80ad-b19cd917a8c1.sh], exit code 1
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.

and this error multiple times in the runner pod logs with varying PIDs.

[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] Starting process:
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] File name: '/usr/bin/tar'
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] Arguments: '-xzf "/home/runner/_work/_actions/_temp_8aa72703-530e-47f6-97d6-7e49bcf4c61a/463beffc-1f09-436f-8543-e6fdfcbd4611.tar.gz"'
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] Working directory: '/home/runner/_work/_actions/_temp_8aa72703-530e-47f6-97d6-7e49bcf4c61a/_staging'
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] Require exit code zero: 'False'
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] Encoding web name: ; code page: ''
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] Force kill process on cancellation: 'False'
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] Redirected STDIN: 'False'
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] Persist current code page: 'False'
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] Keep redirected STDIN open: 'False'
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] High priority process: 'False'
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] Failed to update oom_score_adj for PID: 63.
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] System.UnauthorizedAccessException: Access to the path '/proc/63/oom_score_adj' is denied.
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] ---> System.IO.IOException: Permission denied
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] --- End of inner exception stack trace ---
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] at System.IO.RandomAccess.WriteAtOffset(SafeFileHandle handle, ReadOnlySpan`1 buffer, Int64 fileOffset)
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] at System.IO.Strategies.OSFileStreamStrategy.Write(ReadOnlySpan`1 buffer)
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] at System.IO.Strategies.BufferedFileStreamStrategy.Flush(Boolean flushToDisk)
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] at System.IO.Strategies.BufferedFileStreamStrategy.Dispose(Boolean disposing)
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] at System.IO.StreamWriter.CloseStreamFromDispose(Boolean disposing)
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] at System.IO.StreamWriter.Dispose(Boolean disposing)
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] at System.IO.File.WriteAllText(String path, String contents)
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] at GitHub.Runner.Sdk.ProcessInvoker.WriteProcessOomScoreAdj(Int32 processId, Int32 oomScoreAdj)
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] Process started with process id 63, waiting for process exit.
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[WORKER 2023-10-06 10:08:24Z INFO ProcessInvokerWrapper] Finished process 63 with exit code 0, and elapsed time 00:00:00.0298782.

When setting fsGroup: 123 I observe the same error messages.

@nikola-jokic
Member

I am failing to reproduce the issue. Can you please switch the storage class and see if the issue persists?
Also, one more thing: you may have issues down the line with ports specified as 6379:6379.
The first 6379 is going to be the hostPort, so with 3 pods running the job you may see failures there as well.
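For example, declaring only the container port for the service (a sketch; keep the host:container form only if you actually need a fixed host port) avoids that collision:

    services:
      redis:
        image: redis
        ports:
          - 6379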

@Ravio1i

Ravio1i commented Oct 6, 2023

This is a slightly more difficult problem. One possible way that you can overcome this issue is by using container hook 0.4.0. We introduced a hook extension that takes a template and modifies the default pod spec created by the hook. You can specify securityContext there, and it will be applied to the job container. This ADR aims to document the way the hook extension works.

Thank you very much for the insights. I think we are almost succeeding at it.

We are using 0.6.1, which according to the release notes utilizes container hook version 0.4.0.

I've created a pod template manifest, because it sounded from here like it uses a PodTemplate.

apiVersion: v1
kind: PodTemplate
metadata:
  name: runner-pod-template
  labels:
    app: runner-pod-template
spec:
  securityContext:
    runAsUser: 1001
    runAsGroup: 123

I've created a simple Dockerfile to make the pod template available to the runner.

FROM ghcr.io/actions/actions-runner:latest

COPY pod-template.yml /home/runner/pod-template.yml

RUN sudo chown -R runner:runner /home/runner/pod-template.yml

As suggested here, I've used ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE in my template

template:
  spec:
    containers:
    - name: runner
      image: <MYPRIVATEREGISTRY>/actions-runner:latest
      command: ["/home/runner/run.sh"]
      env:
      - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
        value: "true"
      - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
        value: "/home/runner/pod-template.yml"
      securityContext:
        runAsUser: 1001
        runAsGroup: 123
    imagePullSecrets:
    - name: regcred

However, it still faces the same issue: it is not using user 1001. I'm assuming it's not yet picking up the template. Any tips are highly appreciated :)

@nikola-jokic
Member

Hey @Ravio1i,

This one comes from the runner itself. We have released hook 0.4.0, but the latest version of the runner image does not contain it yet. I'm sorry that I did not explain it well enough in the release notes.
The runner PR is already merged, but the image will be published with the next release. Until then, can you please add the newest hook version in your Dockerfile?

@Ravio1i

Ravio1i commented Oct 6, 2023

I see, thank you for the hint! Is there some way to see within the docker image itself which hook version is used? E.g. I'm imagining something like a metadata file which says RUNNER_CONTAINER_HOOKS_VERSION=0.4.0

I've extended my Dockerfile with

ARG RUNNER_CONTAINER_HOOKS_VERSION=0.4.0
RUN sudo rm -rf ./k8s
RUN curl -f -L -o runner-container-hooks.zip https://github.com/actions/runner-container-hooks/releases/download/v${RUNNER_CONTAINER_HOOKS_VERSION}/actions-runner-hooks-k8s-${RUNNER_CONTAINER_HOOKS_VERSION}.zip \
    && unzip ./runner-container-hooks.zip -d ./k8s \
    && rm runner-container-hooks.zip

I had also forgotten the template part before spec, so I added it here:

apiVersion: v1
kind: PodTemplate
metadata:
  name: runner-pod-template
  labels:
    app: runner-pod-template
template:
  spec:
    securityContext:
      runAsUser: 1001
      runAsGroup: 123

However, no luck. I guess I'm still missing something! It's still not using the container template, although the pod-template and the new k8s/index.js are in place:

[screenshot]

@nikola-jokic
Member

It was okay before; you can see an example template here. ☺️
Basically, the way it works is that you need to provide an extension file somehow. I'm guessing from the screenshot that pod-template.yaml is that file. Then you need to expose the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE env specifying the path to the extension, and it should work.

@Ravio1i

Ravio1i commented Oct 7, 2023

Well I got it working with:

metadata:
  annotations:
    annotated-by: "extension"
  labels:
    labeled-by: "extension"
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
  restartPolicy: Never

However, one downside I still need to work around is the user management of containers. When I'm using an image without the actual user (e.g. debian:bullseye, which only has the root user), I'm running into a gzip error, presumably because something similar to this happens, so I may need to tweak around with that:

gzip: stdin: not in gzip format
/bin/tar: Child returned status 1
/bin/tar: Error is not recoverable: exiting now
Error: The process '/bin/tar' failed with exit code 2

While testing I found this

Docker actions must be run by the default Docker user (root).

metadata:
  annotations:
    annotated-by: "extension"
  labels:
    labeled-by: "extension"
spec:
  securityContext:
    runAsUser: 0
    runAsGroup: 0
  restartPolicy: Never

However, due to my Kubernetes hardening measures:

Warning  SyncError  5s (x13 over 27s)  pod-syncer  Error syncing to physical cluster: admission webhook "validation.gatekeeper.sh" denied the request: [psp-pods-allowed-user-ranges] Container job is attempting to run as disallowed user 0. Allowed runAsUser: {"rule": "MustRunAsNonRoot"}
[psp-pods-allowed-user-ranges] Container redis is attempting to run as disallowed user 0. Allowed runAsUser: {"rule": "MustRunAsNonRoot"}

So yet again, I have to find a way forward. Does the container extension support init containers which can create users?


I do have an additional question for the template: which wildcard vars are there, like $job which is used in your example?

- name: $job # overwrites the job container

@Ravio1i

Ravio1i commented Oct 9, 2023

Maybe related to the original issue, I tried to resolve my error from #2890 (comment) by setting fsGroup: 123.

However, when setting fsGroup: 123 in the runner scale set spec

template:
  spec:
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]
      securityContext:
        runAsUser: 1001
        runAsGroup: 123
        fsGroup: 123

it disappears once applied, as if the runner-set chart is not using it.

[screenshot]

@nikola-jokic
Member

Hey @Ravio1i,

Docker actions must be run by the default Docker user (root).

This limitation only applies when you are using the runner itself, not when you are using it with the hook. But also, building docker images is not supported by the container hook. It can be modified to use kaniko for example, but it is not officially supported. As long as you are using an already built image, I would assume that you shouldn't have any issues running as any user you'd like.

The hook extension should support init containers. The $job selector lets the hook know that the container spec should be applied to the job container itself. If the name is not $job, the container will run next to the job container (like a side-car).

The security context within the helm chart applies to the runner container itself; it is not passed to the hook. To deliver the hook extension, you need a file on the runner with the extension spec, and then you need to set the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE env to point to that file.
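For illustration, a hedged sketch of an extension combining these pieces (the side-car name, image, and command below are assumptions, not part of the documented API):

metadata:
  annotations:
    annotated-by: "extension"
spec:
  securityContext:
    runAsUser: 1001
    runAsGroup: 123
  containers:
    - name: $job          # merged into the job container itself
      securityContext:
        runAsUser: 1001
    - name: side-car      # any other name is added next to the job container
      image: busybox      # placeholder image and command
      command: ["sleep", "infinity"]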

@nikola-jokic nikola-jokic added question Further information is requested gha-runner-scale-set Related to the gha-runner-scale-set mode and removed needs triage Requires review from the maintainers labels Oct 9, 2023
@bobertrublik
Author

Hello,

thanks to the comments by @Ravio1i I noticed that my workflow was also being run in a container under a different user than root. I'm sorry for overlooking this and taking your time. Once I re-created the image with the root user, the workflow ran successfully.

Some additional context:

Thank you for your help @nikola-jokic.

@nikola-jokic
Member

Thank you for the confirmation, @bobertrublik.

I will close this issue now since it is unrelated to the ARC itself ☺️

@omri-shilton

omri-shilton commented Oct 18, 2023

@nikola-jokic I'm also receiving these errors. My container does indeed run as a non-root user, but I'm failing even before the checkout step. I'll even go as far as to say there is no way it even pulled the image, because it's in a private repo and I haven't given it the pull secret.
I'm getting this error:

System.UnauthorizedAccessException: Access to the path '/home/runner/_work/_tool' is denied.
 ---> System.IO.IOException: Permission denied
   --- End of inner exception stack trace ---
   at System.IO.FileSystem.CreateDirectory(String fullPath)
   at System.IO.Directory.CreateDirectory(String path)
   at GitHub.Runner.Worker.JobRunner.RunAsync(AgentJobRequestMessage message, CancellationToken jobRequestCancellationToken)
   at GitHub.Runner.Worker.JobRunner.RunAsync(AgentJobRequestMessage message, CancellationToken jobRequestCancellationToken)
   at GitHub.Runner.Worker.Worker.RunAsync(String pipeIn, String pipeOut)
   at GitHub.Runner.Worker.Program.MainAsync(IHostContext context, String[] args)

@knkarthik
Contributor

knkarthik commented Oct 18, 2023

Update: I must say adding the below initContainer to template.spec seems to have fixed the permission error.

    initContainers:
        - name: kube-init
          image: ghcr.io/actions/actions-runner:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              sudo chown -R 1001:123 /home/runner/_work
          volumeMounts:
            - name: work
              mountPath: /home/runner/_work

I have the same problem as @omri-shilton with the below values. I'm using 0.6.1 and the AWS EBS CSI driver for storage. I'm trying to run my workflow on the runner (i.e. without job.container in my workflow file).

 githubConfigUrl: "https://github.com/foobar"
 githubConfigSecret: github-token
 runnerScaleSetName: "general"
 template:
   spec:
     containers:
       - name: runner
         image: ghcr.io/actions/actions-runner:latest
         command: ["/home/runner/run.sh"]
         env:
           - name: ACTIONS_RUNNER_CONTAINER_HOOKS
             value: /home/runner/k8s/index.js
           - name: ACTIONS_RUNNER_POD_NAME
             valueFrom:
               fieldRef:
                 fieldPath: metadata.name
           - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
             value: "false"
         volumeMounts:
           - name: work
             mountPath: /home/runner/_work
     volumes:
       - name: work
         ephemeral:
           volumeClaimTemplate:
             spec:
               accessModes: ["ReadWriteOnce"]
               storageClassName: "ebs-csi-sc-noreclaim"
               resources:
                 requests:
                   storage: 4Gi
     nodeSelector:
       general: "true"
     affinity:
       podAntiAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
           - labelSelector:
               matchExpressions:
                 - key: actions.github.com/scale-set-name
                   operator: In
                   values:
                     - general
             namespaces:
               - actions
             topologyKey: kubernetes.io/hostname
     tolerations:
       - key: scaleToZero
         operator: Exists
         effect: NoSchedule
 controllerServiceAccount:
   namespace: actions
   name: gha-runner-scale-set-controller-gha-rs-controller

@knkarthik
Contributor

knkarthik commented Oct 20, 2023

Can you try to run an init container that applies the correct permissions to all files under /home/runner? The runner image we provide uses UID 1001 and GID 123. Or maybe use fsGroup with the value 123?

I'm wondering why this is not included as an init container in the chart. @nikola-jokic
