
[ci-kubernetes-e2e-gce-device-plugin-gpu] NVIDIA K80 end of support #32242

Open · dims opened this issue Mar 12, 2024 · 21 comments

Labels: kind/cleanup (Categorizes issue or PR as related to cleaning up code, process, or technical debt.) · sig/node (Categorizes an issue or PR as relevant to SIG Node.)

Comments

@dims (Member) commented Mar 12, 2024

The job ci-kubernetes-e2e-gce-device-plugin-gpu will start failing in GCP in May: https://cloud.google.com/compute/docs/eol/k80-eol

Before then, we are hoping sig-node folks can help figure out a transition to T4 GPUs.

/assign sig-node

Context: following up from #32241
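
A rough sketch of what that transition might look like, assuming the job still provisions node GPUs through the cluster/gce NODE_ACCELERATORS environment variable (the variable name, value format, and zone below are assumptions, not taken from the current job config):

```bash
# Hypothetical change for the GCE e2e cluster bring-up (values are assumptions).

# Today (K80, reaching end of support):
#   NODE_ACCELERATORS="type=nvidia-tesla-k80,count=2"

# Proposed (T4):
export NODE_ACCELERATORS="type=nvidia-tesla-t4,count=2"

# T4s are not offered in every zone, so the job may also need to pin a zone
# that has them (this zone is purely illustrative):
export KUBE_GCE_ZONE="us-central1-a"
```

In a prow job these would presumably land as env entries in the job spec rather than shell exports.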

dims added the kind/cleanup label Mar 12, 2024
@k8s-ci-robot (Contributor):

@dims: GitHub didn't allow me to assign the following users: sig-node.

Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

The job ci-kubernetes-e2e-gce-device-plugin-gpu will start failing in GCP in May: https://cloud.google.com/compute/docs/eol/k80-eol

Before then, we are hoping sig-node folks can help figure out a transition to T4 GPUs.

/assign sig-node

Context: following up from #32241

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the needs-sig label Mar 12, 2024
@dims (Member, Author) commented Mar 12, 2024

/sig node

k8s-ci-robot added the sig/node label and removed the needs-sig label Mar 12, 2024
@ameukam (Member) commented Mar 12, 2024

This will likely affect the v1.31 CI Signal for GPU scheduling.

@dims (Member, Author) commented Mar 12, 2024

We do have a backup CI job we can promote to informing - https://testgrid.k8s.io/amazon-ec2#ci-kubernetes-e2e-ec2-device-plugin-gpu&width=20

@BenTheElder (Member):

Have we brought this up to SIG node directly? (not sure how often they check this issue tracker)

@ameukam (Member) commented Mar 22, 2024

cc @bobbypage

@kannon92 (Contributor):

I'll add this to our sig-node-ci agenda.

@SergeyKanzhelev (Member):

/assign

@endocrimes (Member):

I see there was a previous attempt to migrate GPU stuff that was reverted, but without much context (#32147) - it's too long ago to (easily?) find any logs that would indicate why.

Anyone got a TL;DR? (and then I can take this)

@BenTheElder (Member):

xref: kubernetes/kubernetes#124950

This is now hard-failing to bring up the clusters as the grace period has expired.

We might as well change the config to use some other GPU and see what that failure looks like with current logs?

@BenTheElder (Member):

#32635

@BenTheElder (Member):

Note that while the job is "green" after #32635, we are no longer running the [Feature:GPUDevicePlugin] "run Nvidia GPU Device Plugin tests" test, and the Windows test is skipped, so ... no real tests are run.

https://testgrid.k8s.io/sig-release-master-blocking#gce-device-plugin-gpu-master&show-stale-tests=&width=5
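
One quick way to confirm what actually ran, assuming a run's artifacts have been downloaded locally and follow the usual e2e layout (the file names below are conventional and are assumptions, not verified for this job):

```bash
# Count occurrences of the GPU test name in the build log; 0 means it never ran.
grep -c 'GPUDevicePlugin' build-log.txt

# List any junit entries that mention GPU (junit file name is an assumption).
grep -o 'name="[^"]*GPU[^"]*"' artifacts/junit_01.xml
```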

@BenTheElder (Member):

The job's ARTIFACTS are weird ... docker logs but not containerd ... and nothing for the GPU driver install / no pod logs.

That's probably the next thing to fix; the worker node artifacts are lacking details that would be helpful.

I'm guessing, though, that we don't actually get the GPU driver installed and so the test doesn't run, and unlike the Windows GPU test it doesn't get marked skipped; it just doesn't report.
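
For a manually reproduced cluster, a few quick checks could confirm whether the driver install and device plugin ever exposed GPUs; the daemonset names below are the ones typically used by the GCE/COS GPU installer and should be treated as assumptions:

```bash
# Do any nodes advertise nvidia.com/gpu as an allocatable resource?
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

# Are the driver installer and device plugin daemonsets healthy? (names assumed)
kubectl -n kube-system get ds nvidia-driver-installer nvidia-gpu-device-plugin

# Their pod logs are also what we'd want the job to dump as artifacts.
kubectl -n kube-system logs ds/nvidia-driver-installer --all-containers --tail=200
kubectl -n kube-system logs ds/nvidia-gpu-device-plugin --tail=200
```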

@endocrimes (Member):

ci-kubernetes-node-kubelet-serial-containerd is reliably passing now, and no tests seem to have been dropped compared to the last healthy K80 runs (I'm not actually sure those test selectors result in a GPU test any more; we might be able to drop GPUs from that matrix 🤔).

@aojea (Member) commented May 21, 2024

/cc

@upodroid (Member):

Last time I looked at this (2 months ago), installing NVIDIA drivers on cos-109 for the T4 GPU was not working.

kubernetes/kubernetes#123814
kubernetes/kubernetes#123600

Long thread with more details: https://kubernetes.slack.com/archives/CCK68P2Q2/p1708914356010229

@BenTheElder (Member):

I think the next step is getting the GPU jobs to dump pod / containerd logs, so we can actually see what is happening.

@BenTheElder (Member):

This job still uses kubernetes_e2e.py, so there's that ... the kubelet-serial-containerd job does as well. These should really be using kubetest2 and skipping the ancient, deprecated scenarios/*, but ...

Back on this topic: the jobs running containerd with this old tooling have some additional arguments that shouldn't be necessary on current runners. Rather than migrate, I'm just going to add the logging args quickly and move on for now; enough other things to deal with :-)
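
A minimal sketch of the kind of logging tweak meant here, assuming the job still goes through the cluster log-dump tooling and that it honors LOG_DUMP_SYSTEMD_SERVICES (both are assumptions, not verified against the current job config):

```bash
# Hypothetical: ask the log-dump step to also collect containerd's journal
# from the nodes, in addition to the default services it already grabs.
export LOG_DUMP_SYSTEMD_SERVICES="containerd"
```

In practice this would presumably be an env entry in the prow job config rather than a shell export.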

@BenTheElder (Member):

#32640 adds containerd logs; we probably also need the device plugin pod logs.

@BenTheElder (Member):

... or the GPU driver install
