
Migrating Dask GPU CI from Jenkins to GitHub Actions #348

Open · charlesbluca opened this issue Oct 5, 2023 · 5 comments

@charlesbluca (Member) commented Oct 5, 2023

Currently, several of the Dask projects run a subset of their tests on GPUs, through a Jenkins-based infrastructure put together by the ops team at RAPIDS (relevant [archived] docs).

The RAPIDS projects on GitHub are now migrating from this Jenkins setup to GitHub Actions (docs outlining this setup) for a variety of reasons; the one most relevant to Dask developers is increased visibility and control over the files defining the GPU build matrix: public GHA workflow files instead of private Groovy files accessible only to members of RAPIDS. Once this migration is complete, RAPIDS intends to decommission all of its Jenkins-based testing infrastructure.

This migration would require several changes at both the Dask org and developer level to get things working smoothly, so I’m hoping to use this issue as a place to discuss those potential changes and to coordinate the effort to complete the migration.

Requirements

In short, to get things running we would need to:

  1. Have admins of the dask and dask-contrib organizations install the GitHub applications used to manage GPU CI:
    • Install the nvidia-runners GitHub application for each org we intend to migrate, and enable it on all repos we intend to run GPU tests on; this will handle dispatching of self-hosted GPU runners
    • Install the copy-pr-bot GitHub application for each org we intend to migrate, and enable it on all repos we intend to run GPU tests on; this will automate the process of copying/deleting PR source code to/from the upstream repo
  2. Add a configuration file for copy-pr-bot to each repo we intend to run GPU tests on; this outlines the trusted NVIDIA users and admins for that repo:
# .github/copy-pr-bot.yaml

enabled: true
additional_trustees:
  - nvidia_employee_gh_username  # non-NVIDIA employees on this list are ignored
additional_vetters:
  - nvidia_employee_gh_username  # non-NVIDIA employees on this list are ignored

Once this is done, we should be able to run jobs on RAPIDS self-hosted runners by adding a workflow file looking something like this:

name: Test Self Hosted Runners

on:
  push:
    branches:
      - "pull-request/[0-9]+"

jobs:
  build:
    runs-on: "linux-amd64-gpu-t4-latest-1"
    container:
      # must run in a container
      image: nvidia/cuda:11.8.0-base-ubuntu22.04
      env:
        # must set this container env variable
        NVIDIA_VISIBLE_DEVICES: ${{ env.NVIDIA_VISIBLE_DEVICES }}
    steps:
      - name: hello
        run: |
          echo "hello"
          nvidia-smi

And from there we can work on migrating the existing setup files in each repo over to GHA workflows.
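
For context, a migrated test workflow might end up looking roughly like the sketch below. This is only a hedged illustration, not the actual setup: the container image, dependency install, and gpu pytest marker are placeholders standing in for whatever each repo's current Jenkins setup files do.

name: GPU CI

on:
  push:
    branches:
      - "pull-request/[0-9]+"

jobs:
  gpu-test:
    runs-on: "linux-amd64-gpu-t4-latest-1"
    container:
      # placeholder image; in practice this would be the GPU CI image built for the repo
      image: ghcr.io/example/dask-gpu-ci:latest
      env:
        NVIDIA_VISIBLE_DEVICES: ${{ env.NVIDIA_VISIBLE_DEVICES }}
    steps:
      - uses: actions/checkout@v4
      - name: Install package and test dependencies
        run: python -m pip install -e ".[test]"
      - name: Run GPU-marked tests
        # assumes the repo marks its GPU tests with a "gpu" pytest marker
        run: python -m pytest -v -m gpu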

Differences between Jenkins & GHA

Here’s a quick outline of some of the key differences between the Jenkins and GHA setups that will be relevant to developers:

  • Instead of running tests through an external CI platform that reports back to GitHub, tests would be run on self-hosted runners, allowing some degree of customization of the CPU/GPU instances being used, as outlined here
  • Pull requests approved to run on GPU would automatically be pushed to a branch on the upstream repo of the form pull-request/<PR_NUMBER> as part of the process of triggering a self-hosted runner to run the tests; this branch would be automatically deleted once the PR is closed or merged
    • These pull-request/* branches can be ignored in local git operations (assuming the upstream remote is named upstream; negative refspecs require Git 2.29 or newer) by running the following:
git config \
  --global \
  --add "remote.upstream.fetch" \
  '^refs/heads/pull-request/*'
  • The formerly internal GPU CI “allowlist” and “adminlist” would now be handled on a per-repo basis through the copy-pr-bot concepts of trusted users and vetters, respectively
  • Commits without GPG/SSH signatures would be considered untrusted; PRs from trusted users that contain such commits would require a vetter to leave an /ok to test comment for approval
  • The rerun tests bot command would no longer exist; it would be replaced by the standard methods of re-running GHA workflow runs (see the quick illustration after this list)
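
As a quick illustration (not part of the official setup), a failed GPU workflow run could be re-run either from the “Re-run jobs” button in the Actions UI or with a reasonably recent GitHub CLI; the PR number below is made up:

# list recent runs for a copied PR branch (1234 is a hypothetical PR number)
gh run list --branch "pull-request/1234" --limit 5

# re-run only the failed jobs from a given run ID
gh run rerun <RUN_ID> --failed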

Migrating GPU image builds to GHA

An additional piece of Dask’s GPU CI infrastructure that runs on Jenkins and would need to be migrated to GHA is the automated builds of the GPU containers used for each repo; some benefits I think would come with this migration are:

  • Running these builds through GHA would give us much more flexibility in where the setup files are hosted, so they could be moved from the relatively unknown dask-build-environment repo to somewhere more visible to Dask maintainers
  • Running the scheduled builds through GHA would allow those watching the repos hosting the builds to receive notifications when they fail, making it much more obvious when things break

The main blocker here would be deciding where these workflow files should live; my first thoughts would be dask-docker, a standalone repo, or the respective repos the images are intended for.
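
Wherever they end up, a scheduled image build could look roughly like the sketch below; the schedule, Dockerfile path, image tag, and registry secrets are all placeholders, not the actual configuration.

name: Build GPU CI image

on:
  schedule:
    - cron: "0 0 * * *"  # placeholder: nightly rebuild
  workflow_dispatch:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          # hypothetical registry credentials stored as repo secrets
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          file: ./Dockerfile  # placeholder path to the GPU image's Dockerfile
          push: true
          tags: example/dask-gpu-ci:latest  # placeholder tag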

Thoughts?

I’m interested in soliciting opinions from Dask maintainers on what parts of this migration seem appealing/unappealing. My first impressions of some specific pain points are:

  • This setup would require new branches to get pushed to the upstream repo frequently, which would contribute some noise to local git operations if the branches aren’t manually ignored
  • Non-Dask org members or Dask org members without GPG/SSH signed commits would need to have their PRs approved on a per-commit basis by vetters on the relevant repo; depending on who the vetters are, this could put pressure on a small group to constantly watch new PRs and consequently disincentivize running GPU tests on all PRs

But it would be great to hear from others on how we can make this process relatively smooth.

@GenevieveBuckley (Collaborator)

I'm generally positive about this.

  • If the RAPIDS group say this is better for a number of reasons, that's good enough for me
  • I never really liked the Jenkins interface anyway. We would sometimes get GPU CI failures, and when I clicked over to the workflow results on Jenkins I'd struggle to find useful information about why the test actually failed (maybe the info is there, just hidden in a weird place? I don't know; usually at that point I'd ask John to figure it out)

The major disadvantage here is:

Non-Dask org members or Dask org members without GPG/SSH signed commits would need to have their PRs approved on a per-commit basis

This is deeply, deeply annoying. I'd be ok if it was once per PR, but every commit is not practical. And yes, it would disincentivize me running GPU tests. I don't think it's an edge case either - I know dask-image has very low activity, but I'd guess almost all of the PRs it does get are from non-dask org members.

This setup would require new branches to get pushed to the upstream repo frequently, which would contribute some noise to local git operations if the branches aren’t manually ignored

Usually I use the GitHub CLI to check out specific pull request branches for review (gh pr checkout $PR_NUMBER). So I don't think I'd care about the extra pull request branches being created, especially if they do eventually get deleted automatically.

But it would be great to hear from others on how we can make this process relatively smooth.

To make the process smooth, I think it would be helpful to split the info you have above into (1) what needs to happen at the Dask org level, and by who, (2) what needs to happen for each individual repository by the maintainers, and (3) what can only be done by NVIDIA employees. It's a good overview, but I'm a little confused about what I would need to do and when, since it seems some steps depend on others too.

@jacobtomlinson (Member)

This is deeply, deeply annoying. I'd be ok if it was once per PR, but every commit is not practical. And yes, it would disincentivize me running GPU tests. I don't think it's an edge case either - I know dask-image has very low activity, but I'd guess almost all of the PRs it does get are from non-dask org members.

I agree this is frustrating, but it's actually a pretty low bar. Signing your commits with your SSH keys is easy to set up. Your commits get that little verified tag too.
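
For anyone who hasn't done this before, SSH commit signing (supported since Git 2.34) boils down to a few config settings; the key path below is just an example:

# tell git to sign with an SSH key instead of GPG
git config --global gpg.format ssh
# use an existing SSH public key as the signing key (example path)
git config --global user.signingkey ~/.ssh/id_ed25519.pub
# sign all commits by default
git config --global commit.gpgsign true

The same public key also has to be added to your GitHub account as a signing key for the commits to show up as verified.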

@GenevieveBuckley (Collaborator)

I agree this is frustrating, but it's actually a pretty low bar. Signing your commits with your SSH keys is easy to set up. Your commits get that little verified tag too.

Maybe I misunderstood. Charles says this above:

Non-Dask org members or Dask org members without GPG/SSH signed commits would need to have their PRs approved on a per-commit basis

So I thought that meant:

  1. Non-Dask org members - always need to have their PR commits approved
  2. Dask org members without SSH signed commits - always need to have their PR commits approved
  3. Dask org members with SSH signed commits - do not need to have anything approved, everything is good here.

Is that incorrect?

It's category 1 I'm thinking about. But perhaps newcomers might not actually feel frustrated that they have to wait on us, and maybe it's totally fine for the GPU CI to run and pass only once at the end of the PR process.

@charlesbluca (Member, Author)

Maybe I misunderstood.

Nope, your understanding of the system is correct

This is deeply, deeply annoying. I'd be ok if it was once per PR, but every commit is not practical. And yes, it would disincentivize me running GPU tests. I don't think it's an edge case either - I know dask-image has very low activity, but I'd guess almost all of the PRs it does get are from non-dask org members.

I (along with other Dask-RAPIDS developers) share this sentiment, though my concern lies less with confusion/frustration from potential newcomers (who I assume would largely ignore the automated messages) and more with the likelihood of GPU-breaking code getting merged because GPU tests aren't run as often. For now, I've generally accepted these limitations as a necessity to address security concerns around the migration, with the hope that we can loosen these requirements later on, but I ultimately can't make any solid promises there 😕

To make the process smooth, I think it would be helpful to split the info you have above into (1) what needs to happen at the Dask org level, and by who, (2) what needs to happen for each individual repository by the maintainers, and (3) what can only be done by NVIDIA employees

Good point, will give that section a second pass to make things clearer

@jacobtomlinson (Member)

I think for a first-time contributor it wouldn't be surprising to see that some CI approval is needed. I guess it may be more frustrating for a return contributor who is not a Dask-org member.

Maybe we should be more liberal and add all contributors to the org when their PR is merged? I expect we could automate that with a GitHub Action. We give out membership pretty openly anyway.
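
A rough sketch of what that automation could look like, assuming an org-admin token stored as a secret (ORG_ADMIN_TOKEN here is hypothetical; the default GITHUB_TOKEN can't manage org membership), using the REST endpoint for org memberships:

name: Invite contributor on merge

on:
  pull_request_target:
    types: [closed]

jobs:
  invite:
    if: github.event.pull_request.merged == true
    runs-on: ubuntu-latest
    steps:
      - name: Add the PR author to the org
        env:
          # hypothetical secret: a token with admin:org scope
          GH_TOKEN: ${{ secrets.ORG_ADMIN_TOKEN }}
        run: |
          # PUT /orgs/{org}/memberships/{username} invites the user if they aren't already a member
          gh api --method PUT "orgs/${{ github.repository_owner }}/memberships/${{ github.event.pull_request.user.login }}"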
