-
Notifications
You must be signed in to change notification settings - Fork 218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ci: override -vm-size to run the tests on GPU machines #3431
base: master
Are you sure you want to change the base?
Conversation
test/preflight/testlib/test_env.go
Outdated
@@ -180,6 +181,10 @@ type testingTWrapper interface { | |||
|
|||
// Fly runs a flyctl the result | |||
func (f *FlyctlTestEnv) Fly(flyctlCmd string, vals ...interface{}) *FlyctlResult { | |||
if strings.HasPrefix(flyctlCmd, "machine run ") || strings.HasPrefix(flyctlCmd, "launch ") { | |||
flyctlCmd += " --vm-gpu-kind a100-sxm4-80gb --vm-cpu-kind performance --vm-memory 8192 " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few notes:
a100-sxm4-80gb
are scarsea100-pcie-40gb
orl40s
availability are more common but only oniad
andord
regions--vm-size a100-40gb
or--vm-size l40s
are useful shortcuts
779566e
to
582c3e2
Compare
GPU machines have some quirks (e.g. use Cloud Hypervisor instead of Firecracker). While most of the tests don't need GPU, running these tests would increase our coverage.
It is helpful to test specific regions, including GPU testing which we need to avoid some regions.
These tests don't work on GPU machines. While we will investigate the issue further, we want to run other tests on GPU machines to prevent regressions.
if f.VMSize != "" { | ||
t.Skip() | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO all these skips are a big red flag this PR is going in the wrong direction
What if instead of this matrix based on vmsize that requires mangling the flyctl command and skipping things like this, we do:
- Add a few more integration tests that use GPU machines
- Add a process group with a
[[vm]]
section to currentfly launch
examples that also uses GPU machines - And longer but worth it, add another Fly organization only for cloudhypervisor testing, note I am not saying GPU testing here, but CH so the same set of tests don't require GPUs to run.
…ne size has a GPU.
…rs was a good idea. By which I mean no thanks. No thanks at all.
Change Summary
What and Why:
GPU machines have some quirks (e.g. use Cloud Hypervisor instead of Firecracker). While most of the tests don't need GPU, running these tests would increase our coverage.
How:
Related to:
Documentation