Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: override -vm-size to run the tests on GPU machines #3431

Open
wants to merge 30 commits into
base: master
Choose a base branch
from

Conversation

kzys
Copy link
Member

@kzys kzys commented Apr 5, 2024

Change Summary

What and Why:

GPU machines have some quirks (e.g. use Cloud Hypervisor instead of Firecracker). While most of the tests don't need GPU, running these tests would increase our coverage.

How:

Related to:


Documentation

  • Fresh Produce
  • In superfly/docs, or asked for help from docs team
  • n/a

@kzys kzys changed the title fix Run Preflight tests on GPU machines Apr 5, 2024
@@ -180,6 +181,10 @@ type testingTWrapper interface {

// Fly runs a flyctl the result
func (f *FlyctlTestEnv) Fly(flyctlCmd string, vals ...interface{}) *FlyctlResult {
if strings.HasPrefix(flyctlCmd, "machine run ") || strings.HasPrefix(flyctlCmd, "launch ") {
flyctlCmd += " --vm-gpu-kind a100-sxm4-80gb --vm-cpu-kind performance --vm-memory 8192 "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few notes:

  • a100-sxm4-80gb are scarse
  • a100-pcie-40gb or l40s availability are more common but only on iad and ord regions
  • --vm-size a100-40gb or --vm-size l40s are useful shortcuts

@kzys kzys force-pushed the gpu-ad-hoc-tests branch 4 times, most recently from 779566e to 582c3e2 Compare April 20, 2024 00:13
@kzys kzys changed the title Run Preflight tests on GPU machines ci: override -vm-size to run the tests on GPU machines Apr 20, 2024
@kzys kzys marked this pull request as ready for review April 20, 2024 00:17
GPU machines have some quirks (e.g. use Cloud Hypervisor instead of
Firecracker). While most of the tests don't need GPU, running these
tests would increase our coverage.
It is helpful to test specific regions, including GPU testing which
we need to avoid some regions.
These tests don't work on GPU machines. While we will investigate
the issue further, we want to run other tests on GPU machines to
prevent regressions.
if f.VMSize != "" {
t.Skip()
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO all these skips are a big red flag this PR is going in the wrong direction

What if instead of this matrix based on vmsize that requires mangling the flyctl command and skipping things like this, we do:

  • Add a few more integration tests that use GPU machines
  • Add a process group with a [[vm]] section to current fly launch examples that also uses GPU machines
  • And longer but worth it, add another Fly organization only for cloudhypervisor testing, note I am not saying GPU testing here, but CH so the same set of tests don't require GPUs to run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants