
pass through hostdevice by pcie path #11815

Open
aep opened this issue Apr 29, 2024 · 14 comments

Comments

@aep

aep commented Apr 29, 2024

Currently a PCIe hostdevice is selected by vendor/device ID.
This is fine if you only have one per host, but as soon as you have more than one it falls apart.
For example, two VMs with two NVMe devices passed through to them will suddenly swap storage devices at random.

We actually have NVMe-only nodes with up to 48 NVMes per host, as well as Mellanox cards with 32 VFs each.

I would suggest something like:

configuration:
  permittedHostDevices:
    pciHostDevices:
      - pciPathSelector: "0000:0a:01"
        resourceName: "nvme/first"
      - pciPathSelector: "0000:0a:02"
        resourceName: "nvme/second"

It seems likely that this wasn't implemented because it makes very little sense for GPUs, which are fungible, and the hosts might not all have the same PCI layout. However, storage and L3 network devices are non-fungible and really need to be mapped to specific VMs.
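For comparison, this is roughly what the existing vendor/device-ID based selection looks like in the KubeVirt CR today, as far as I understand the current API (the IDs and resource names below are only illustrative): every device on the node that matches the vendor:device pair lands under the same resource name, which is exactly why identical devices get handed out at random.

configuration:
  permittedHostDevices:
    pciHostDevices:
      - pciVendorSelector: "15B3:101E"        # all matching VFs share one resource name
        resourceName: "mellanox.com/cx6_vf"
      - pciVendorSelector: "8086:0A54"        # all matching NVMe drives share one resource name
        resourceName: "devices.example.org/nvme"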

@alicefr
Member

alicefr commented Apr 29, 2024

/cc @vladikr @victortoso

@aep
Author

aep commented Apr 29, 2024

I attempted to work around it with a hook sidecar.
I can add the XML to libvirt, but that doesn't help because it's running inside a namespace, so it's missing information about the host device.

vfio 0000:01:00.4: failed to open /dev/vfio/43: No such file or directory

I think there needs to be another hook for modifying the pod, but I'm not even sure what to add to the pod.
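For reference, a rough sketch of how the hook sidecar gets attached (the VMI name and sidecar image are placeholders): the sidecar can rewrite the domain XML via the onDefineDomain hook, but it cannot change the virt-launcher pod itself, which is where the /dev/vfio node and VFIO group would have to come from.

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: example-vmi
  annotations:
    # the sidecar receives the generated domain XML and may return a modified version
    hooks.kubevirt.io/hookSidecars: '[{"image": "registry.example.org/pci-xml-sidecar:latest"}]'
spec:
  domain:
    devices: {}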

@aep
Author

aep commented May 1, 2024

I implemented my own device plugin now that passes /dev/vfio/123 into the pod and sets the PCI_RESOURCE_ env to the PCIe address.
Apparently I'm missing something else though:

unsupported configuration: host doesn't support passthrough of host PCI devices

The same host works fine with virsh, so this is probably yet another thing that needs to be passed into the pod.

Related issues have been closed without resolution, so I can't figure out what the underlying issue is:

#5811
#3035

@aep
Author

aep commented May 1, 2024

Finally figured it out. The plugin is here:

https://github.com/kraudcloud/vf-device-plugin

But I don't think this can be upstreamed. In order to distinguish devices by path, I had to create a k8s resource for each one of them. The request to the plugin already contains a pick; the plugin has no influence on which device is chosen.

This seems like a pretty bad design in k8s itself. There is also no way to pass any annotations from the pod to the device plugin, so you can't even do any preparations on behalf of the pod.

It could probably be upstreamed by making it less specific to ethernet, i.e. literally creating a resource per path, as I originally posted, but I'm not sure that's really all that useful. For storage we will instead create yet another plugin that creates a k8s resource for each chassis bay, instead of hardcoding all the PCIe paths.
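A rough sketch of how this ends up wired together, assuming one resource per PCI path (the resource names below are made up for illustration, not what the plugin actually advertises): the external device plugin advertises the per-path resources, the KubeVirt CR permits them with externalResourceProvider: true so KubeVirt doesn't try to serve them with its own device plugin, and the VMI then requests a specific path.

# KubeVirt CR fragment: permit the externally provided per-path resources
configuration:
  permittedHostDevices:
    pciHostDevices:
      - pciVendorSelector: "15B3:101E"
        resourceName: "example.org/vf-0000-0a-01"      # hypothetical per-path resource name
        externalResourceProvider: true
---
# VMI fragment: request exactly that device
spec:
  domain:
    devices:
      hostDevices:
        - name: vf0
          deviceName: "example.org/vf-0000-0a-01"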

@alicefr
Member

alicefr commented May 2, 2024

@aep yes, this is a limitation that we can overcome with DRA. We have a research project that will investigate the integration between DRA and KubeVirt: kubevirt/community#254. However, this won't happen soon.

You could take a look at the Akri project; they should have solved the same problem. Unfortunately, I'm not very familiar with the project, and it isn't directly compatible with KubeVirt because of environment variables. AFAIU, they also create a single resource name per device in order to identify each device and avoid this random assignment.

@aep
Author

aep commented May 2, 2024

Thanks for the input.

Do you think I should prepare this for upstreaming into KubeVirt (pciPathSelector, one resource per PCIe path in the config), or would it be rejected anyway because DRA is the better long-term solution?

@alicefr
Member

alicefr commented May 2, 2024

Thanks for the input.

Do you think I should prepare this for upstreaming into KubeVirt (pciPathSelector, one resource per PCIe path in the config), or would it be rejected anyway because DRA is the better long-term solution?

Hard to say. DRA doesn't directly depend on us. It might still make sense to add it.
@vladikr WDYT?

@alicefr
Member

alicefr commented May 2, 2024

@aep if you can, you could attend the community meeting on Wednesday and present the problem. I think it will be the fastest way to get feedback.

@victortoso
Member

For example, two VMs with two NVMe devices passed through to them will suddenly swap storage devices at random.

...

in order to distinguish devices by path, I had to create a k8s resource for each one of them. The request to the plugin already contains a pick; the plugin has no influence on which device is chosen.

That's my understanding too. Each device has an ID and the kubelet's device manager requests devices by ID. If you have two NVMe devices under the same resource name, we can't guarantee which one is requested.

The solution using device plugins is really about more specific selectors, but that ends up being a worse experience for the user, as the admin will need to populate them (e.g. by path or some other unique metadata).

To my knowledge, this should be solved by DRA but that might take some time to be adopted in KubeVirt.

do you think I should prepare this for upstreaming into KubeVirt (pciPathSelector, one resource per PCIe path in the config)

IMHO, yes. An optional selector that would solve your problem and not affect current use cases should be considered.

@EdDev
Member

EdDev commented May 6, 2024

However, storage and L3 network devices are non-fungible and really need to be mapped to specific VMs.

Network devices are handled correctly when Multus (CNI) is used.
This is how SR-IOV and vDPA VFs are handled and correctly mapped to the relevant interface in the domain.

I do not know enough to comment regarding storage.
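For context, this is roughly how the SR-IOV path keeps the mapping deterministic (the network and resource names are illustrative): the NetworkAttachmentDefinition points at a device-plugin resource, and the VMI interface is tied to that network, so the allocated VF ends up on the matching interface in the domain.

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net1
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/sriov_net_A
spec:
  config: '{ "cniVersion": "0.3.1", "name": "sriov-net1", "type": "sriov" }'
---
# VMI fragment
spec:
  domain:
    devices:
      interfaces:
        - name: net1
          sriov: {}
  networks:
    - name: net1
      multus:
        networkName: sriov-net1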

@alicefr
Member

alicefr commented May 6, 2024

For storage we are completely missing this mapping, and it is a general problem for all PCI passthrough devices. But it is particularly relevant for storage, since the devices definitely have state and data :)

@dgsardina

dgsardina commented May 21, 2024

Network devices are handled correctly when Multus (CNI) is used. This is how SR-IOV and vDPA VFs are handled and correctly mapped to the relevant interface in the domain.

I believe network devices are NOT handled by Multus when trying to do PCIe passthrough in a KubeVirt VM ("type": "host-device" in the NetworkAttachmentDefinition), especially because there is no frontend choice that allows it.

In my particular case, my network devices do not support SR-IOV, so I am able to pass them to the VM with pciHostDevices, but every deployment gets the NICs in a different order, which makes it unusable.
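To make the problem concrete, this is roughly what that setup looks like (the vendor:device ID and resource name are illustrative): both identical NICs sit under one resource name, so which physical port backs which guest device is not deterministic.

# KubeVirt CR fragment
configuration:
  permittedHostDevices:
    pciHostDevices:
      - pciVendorSelector: "8086:10FB"
        resourceName: "devices.example.org/x520"
---
# VMI fragment
spec:
  domain:
    devices:
      hostDevices:
        - name: nic0
          deviceName: "devices.example.org/x520"
        - name: nic1
          deviceName: "devices.example.org/x520"   # no way to say which physical NIC backs nic0 vs nic1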

@aep
Author

aep commented May 21, 2024

Unfortunately I couldn't really figure out how the community meeting works; the sound was incredibly bad.

Anyway, I still think the easiest solution is to just have a PCIe path selector in the config. It's clunky, but it will probably get most users around the issue until the proper solution lands in k8s.

@EdDev
Member

EdDev commented May 22, 2024

Network devices are handled correctly when Multus (CNI) is used. This is how SR-IOV and vDPA VFs are handled and correctly mapped to the relevant interface in the domain.

I believe network devices are NOT handled by Multus when trying to do PCIe passthrough in a KubeVirt VM ("type": "host-device" in the NetworkAttachmentDefinition), especially because there is no frontend choice that allows it.

In my particular case, my network devices do not support SR-IOV, so I am able to pass them to the VM with pciHostDevices, but every deployment gets the NICs in a different order, which makes it unusable.

Well, if you try to work with network devices without KubeVirt knowing they are network devices, then indeed Multus (or any other available mechanism) is not involved.

The solution to this case is most likely to create a custom network binding plugin [1] for yourself.
Very similar work has been done with vDPA, and the latest SR-IOV binding uses the same data now.

Beyond creating the binding plugin, you will also need a device plugin and a CNI to reflect the data through Multus.

[1] https://github.com/kubevirt/kubevirt/blob/main/docs/network/network-binding-plugin.md
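A rough sketch of what that wiring might look like, based on my reading of the linked doc (the plugin name, image, and network names are placeholders, and the exact field names should be checked against [1]): the binding plugin is registered in the KubeVirt CR and then referenced from the VMI interface.

# KubeVirt CR fragment: register the binding plugin
configuration:
  network:
    binding:
      my-hostdev-binding:
        sidecarImage: registry.example.org/my-hostdev-binding:latest
        networkAttachmentDefinition: default/my-hostdev-binding-nad
---
# VMI fragment: use the binding on an interface backed by a Multus network
spec:
  domain:
    devices:
      interfaces:
        - name: net1
          binding:
            name: my-hostdev-binding
  networks:
    - name: net1
      multus:
        networkName: my-hostdevice-net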
