
pass through hostdevice by pcie path #11815

Open
aep opened this issue Apr 29, 2024 · 14 comments

Comments

@aep

aep commented Apr 29, 2024

Currently a PCIe hostdevice is selected by vendor/device ID.
This is fine if you only have one per host, but as soon as you have more than one it falls apart.
For example, two VMs with two NVMe devices passed through to them will suddenly swap storage devices at random.

We actually have NVMe-only nodes with up to 48 NVMes per host, as well as Mellanox cards with 32 VFs each.

I would suggest something like:

configuration:
  permittedHostDevices:
    pciHostDevices:
      - pciPathSelector: "0000:0a:01"
        resourceName: "nvme/first"
      - pciPathSelector: "0000:0a:02"
        resourceName: "nvme/second"

It seems likely that this wasn't implemented because it makes very little sense for GPUs, which are fungible, and the hosts might not all have the same PCI layout. However, storage and L3 network devices are non-fungible and really need to be mapped to specific VMs.
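For comparison, this is roughly what the existing vendor/device-ID based selection looks like in the KubeVirt CR today, as far as I understand the current API (the IDs and resource names below are only illustrative): every device on the node that matches the vendor:device pair lands under the same resource name, which is exactly why identical devices get handed out at random.

configuration:
  permittedHostDevices:
    pciHostDevices:
      - pciVendorSelector: "15B3:101E"        # all matching VFs share one resource name
        resourceName: "mellanox.com/cx6_vf"
      - pciVendorSelector: "8086:0A54"        # all matching NVMe drives share one resource name
        resourceName: "devices.example.org/nvme"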

@alicefr
Member

alicefr commented Apr 29, 2024

/cc @vladikr @victortoso

@aep
Author

aep commented Apr 29, 2024

I attempted to work around it with a hook sidecar.
I can add the XML to libvirt, but that doesn't help because it's running inside a namespace, so it's missing information about the host device.

vfio 0000:01:00.4: failed to open /dev/vfio/43: No such file or directory

I think there needs to be another hook for modifying the pod, but I'm not even sure what to add to the pod.
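For reference, a rough sketch of how the hook sidecar gets attached (the VMI name and sidecar image are placeholders): the sidecar can rewrite the domain XML via the onDefineDomain hook, but it cannot change the virt-launcher pod itself, which is where the /dev/vfio node and VFIO group would have to come from.

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: example-vmi
  annotations:
    # the sidecar receives the generated domain XML and may return a modified version
    hooks.kubevirt.io/hookSidecars: '[{"image": "registry.example.org/pci-xml-sidecar:latest"}]'
spec:
  domain:
    devices: {}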

@aep
Author

aep commented May 1, 2024

I implemented my own device plugin now that passes /dev/vfio/123 into the pod and sets the PCI_RESOURCE_ env to the PCIe address.
Apparently I'm missing something else though:

unsupported configuration: host doesn't support passthrough of host PCI devices

The same host works fine with virsh, so this is probably yet another thing that needs to be passed into the pod.

Related issues have been closed without resolution, so I can't figure out what the underlying issue is:

#5811
#3035

@aep
Author

aep commented May 1, 2024

Finally figured it out. The plugin is here:

https://github.com/kraudcloud/vf-device-plugin

But I don't think this can be upstreamed. In order to distinguish devices by path, I had to create a k8s resource for each one of them. The request to the plugin already contains a pick; the plugin has no influence on which device is chosen.

This seems like a pretty bad design in k8s itself. There is also no way to pass any annotations from the pod to the device plugin, so you can't even do any preparations on behalf of the pod.

It could probably be upstreamed by making it less specific to ethernet, i.e. literally creating a resource per path, as I originally posted, but I'm not sure that's really all that useful. For storage we will instead create yet another plugin that creates a k8s resource for each chassis bay, instead of hardcoding all the PCIe paths.
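A rough sketch of how this ends up wired together, assuming one resource per PCI path (the resource names below are made up for illustration, not what the plugin actually advertises): the external device plugin advertises the per-path resources, the KubeVirt CR permits them with externalResourceProvider: true so KubeVirt doesn't try to serve them with its own device plugin, and the VMI then requests a specific path.

# KubeVirt CR fragment: permit the externally provided per-path resources
configuration:
  permittedHostDevices:
    pciHostDevices:
      - pciVendorSelector: "15B3:101E"
        resourceName: "example.org/vf-0000-0a-01"      # hypothetical per-path resource name
        externalResourceProvider: true
---
# VMI fragment: request exactly that device
spec:
  domain:
    devices:
      hostDevices:
        - name: vf0
          deviceName: "example.org/vf-0000-0a-01"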

@alicefr
Member

alicefr commented May 2, 2024

@aep yes, this is a limitation that we can overcome with DRA. We have a research project that will investigate the integration between DRA and KubeVirt: kubevirt/community#254. However, this won't happen soon.

You could take a look at the Akri project; they should have solved the same problem. Unfortunately, I'm not very familiar with the project, and it isn't directly compatible with KubeVirt because of environment variables. AFAIU, they also create a single resource name per device in order to identify each device and avoid this random assignment.

@aep
Author

aep commented May 2, 2024

Thanks for the input.

Do you think I should prepare this for upstreaming into KubeVirt (pciPathSelector, one resource per PCIe path in the config), or would it be rejected anyway because DRA is the better long-term solution?

@alicefr
Member

alicefr commented May 2, 2024

Thanks for the input.

Do you think I should prepare this for upstreaming into KubeVirt (pciPathSelector, one resource per PCIe path in the config), or would it be rejected anyway because DRA is the better long-term solution?

Hard to say. DRA doesn't directly depend on us. It might still make sense to add it.
@vladikr WDYT?

@alicefr
Member

alicefr commented May 2, 2024

@aep if you can, you could attend the community meeting on Wednesday and present the problem. I think it will be the fastest way to get feedback.

@victortoso
Member

For example, two VMs with two NVMe devices passed through to them will suddenly swap storage devices at random.

...

in order to distinguish devices by path, I had to create a k8s resource for each one of them. The request to the plugin already contains a pick; the plugin has no influence on which device is chosen.

That's my understanding too. Each device has an ID and the kubelet's device manager requests devices by ID. If you have two NVMe devices under the same resource name, we can't guarantee which one is requested.

The solution using device plugins is really about more specific selectors, but that ends up being a worse experience for the user, as the admin will need to populate them (e.g. by path or some other unique metadata).

To my knowledge, this should be solved by DRA but that might take some time to be adopted in KubeVirt.

do you think I should prepare this for upstreaming into KubeVirt (pciPathSelector, one resource per PCIe path in the config)

IMHO, yes. An optional selector that would solve your problem and not affect current use cases should be considered.

@EdDev
Member

EdDev commented May 6, 2024

However, storage and L3 network devices are non-fungible and really need to be mapped to specific VMs.

Network devices are handled correctly when Multus (CNI) is used.
This is how SR-IOV and vDPA VFs are handled and correctly mapped to the relevant interface in the domain.

I do not know enough to comment regarding storage.
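For context, this is roughly how the SR-IOV path keeps the mapping deterministic (the network and resource names are illustrative): the NetworkAttachmentDefinition points at a device-plugin resource, and the VMI interface is tied to that network, so the allocated VF ends up on the matching interface in the domain.

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: sriov-net1
  annotations:
    k8s.v1.cni.cncf.io/resourceName: intel.com/sriov_net_A
spec:
  config: '{ "cniVersion": "0.3.1", "name": "sriov-net1", "type": "sriov" }'
---
# VMI fragment
spec:
  domain:
    devices:
      interfaces:
        - name: net1
          sriov: {}
  networks:
    - name: net1
      multus:
        networkName: sriov-net1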

@alicefr
Member

alicefr commented May 6, 2024

For storage we are completely missing this mapping, and it is a general problem for all PCI passthrough devices. But it is particularly relevant for storage, since the devices definitely have state and data :)

@dgsardina

dgsardina commented May 21, 2024

Network devices are handled correctly when Multus (CNI) is used. This is how SR-IOV and vDPA VFs are handled and correctly mapped to the relevant interface in the domain.

I believe network devices are NOT handled by Multus when trying to do PCIe passthrough in a KubeVirt VM ("type": "host-device" in the NetworkAttachmentDefinition), especially because there is no frontend choice that allows it.

In my particular case, my network devices do not support SR-IOV, so I am able to pass them to the VM with pciHostDevices, but every deployment gets the NICs in a different order, which makes it unusable.
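To make the problem concrete, this is roughly what that setup looks like (the vendor:device ID and resource name are illustrative): both identical NICs sit under one resource name, so which physical port backs which guest device is not deterministic.

# KubeVirt CR fragment
configuration:
  permittedHostDevices:
    pciHostDevices:
      - pciVendorSelector: "8086:10FB"
        resourceName: "devices.example.org/x520"
---
# VMI fragment
spec:
  domain:
    devices:
      hostDevices:
        - name: nic0
          deviceName: "devices.example.org/x520"
        - name: nic1
          deviceName: "devices.example.org/x520"   # no way to say which physical NIC backs nic0 vs nic1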

@aep
Author

aep commented May 21, 2024

Unfortunately I couldn't really figure out how the community meeting works; the sound was incredibly bad.

Anyway, I still think the easiest solution is to just have a PCIe path selector in the config. It's clunky, but it will probably get most users around the issue until the proper solution lands in k8s.

@EdDev
Member

EdDev commented May 22, 2024

Network devices are handled correctly when Multus (CNI) is used. This is how SR-IOV and vDPA VFs are handled and correctly mapped to the relevant interface in the domain.

I believe network devices are NOT handled by Multus when trying to do PCIe passthrough in a KubeVirt VM ("type": "host-device" in the NetworkAttachmentDefinition), especially because there is no frontend choice that allows it.

In my particular case, my network devices do not support SR-IOV, so I am able to pass them to the VM with pciHostDevices, but every deployment gets the NICs in a different order, which makes it unusable.

Well, if you try to work with network devices without KubeVirt knowing they are network devices, then indeed Multus (or any other available mechanism) is not involved.

The solution to this case is most likely to create a custom network binding plugin [1] for yourself.
Very similar work has been done with vDPA, and the latest SR-IOV binding uses the same data now.

Beyond creating the binding plugin, you will also need a device plugin and a CNI to reflect the data through Multus.

[1] https://github.com/kubevirt/kubevirt/blob/main/docs/network/network-binding-plugin.md
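A rough sketch of what that wiring might look like, based on my reading of the linked doc (the plugin name, image, and network names are placeholders, and the exact field names should be checked against [1]): the binding plugin is registered in the KubeVirt CR and then referenced from the VMI interface.

# KubeVirt CR fragment: register the binding plugin
configuration:
  network:
    binding:
      my-hostdev-binding:
        sidecarImage: registry.example.org/my-hostdev-binding:latest
        networkAttachmentDefinition: default/my-hostdev-binding-nad
---
# VMI fragment: use the binding on an interface backed by a Multus network
spec:
  domain:
    devices:
      interfaces:
        - name: net1
          binding:
            name: my-hostdev-binding
  networks:
    - name: net1
      multus:
        networkName: my-hostdevice-net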
