
vCPU exposed as threads pinned to non-thread sibling pCPUs on hosts without SMT when using dedicatedCpuPlacement #11749

Open
lyarwood opened this issue Apr 18, 2024 · 3 comments

lyarwood commented Apr 18, 2024

/cc @vladikr

What happened:

This is likely open to interpretation around the goals of dedicatedCpuPlacement, but at present, on hosts without SMT enabled, vCPUs exposed as threads to the guest OS are pinned to non-sibling pCPUs. Users might be surprised by the performance of workloads running across such threads given what they requested.
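
For reference, whether SMT is active on a host can be read directly from sysfs (a standard Linux path; it reports 1 when SMT is enabled):

$ cat /sys/devices/system/cpu/smt/active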

What you expected to happen:

The VirtualMachineInstance should not be scheduled until an SMT-enabled compute node is present in the environment.

How to reproduce it (as minimally and precisely as possible):

Uses kubevirt/kubevirtci#1171

$ export \
 KUBEVIRT_PROVIDER=k8s-1.28 \
 KUBEVIRT_MEMORY_SIZE=$((16 * 1024))M \
 KUBEVIRT_HUGEPAGES_2M=$((4 * 1024)) \
 KUBEVIRT_DEPLOY_CDI=true \
 KUBEVIRT_CPU_MANAGER_POLICY=static \
 KUBEVIRT_NUM_NUMA_NODES=2 \
 KUBEVIRT_NUM_VCPU=16 \
 KUBEVIRTCI_CONTAINER_SUFFIX=latest \
 KUBEVIRTCI_GOCLI_CONTAINER=quay.io/kubevirtci/gocli:latest
$ cd kubevirtci
$ make cluster-up
$ cd ../kubevirt
$ rsync -av ../kubevirtci/cluster-up/ ./cluster-up/ && rsync -av ../kubevirtci/_ci-configs/ ./_ci-configs/
$ make cluster-sync
[..]
$ ./cluster-up/kubectl.sh patch kv/kubevirt -n kubevirt --type merge -p '{"spec":{"configuration":{"developerConfiguration":{"featureGates": ["CPUManager","NUMA"]}}}}'
[..]
$ ./cluster-up/kubectl.sh apply -f -<<EOF
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: dedicated-threads
spec:
  domain:
    cpu:
      threads: 2
      dedicatedCpuPlacement: true
    devices:
      disks:
        - disk:
            bus: virtio
          name: containerdisk
        - disk:
            bus: virtio
          name: cloudinitdisk
    resources:
      requests:
        memory: 2Gi
  volumes:
    - containerDisk:
        image: quay.io/containerdisks/fedora:39
      name: containerdisk
    - cloudInitNoCloud:
        userData: |
          #!/bin/sh
          mkdir -p  /home/fedora/.ssh
          curl https://github.com/lyarwood.keys > /home/fedora/.ssh/authorized_keys
          chown -R fedora: /home/fedora/.ssh
      name: cloudinitdisk
EOF

$ ./cluster-up/virtctl.sh ssh -lfedora dedicated-threads lscpu
[..]
Thread(s) per core:  2
Core(s) per socket:  1
Socket(s):           1
[..]

$ ./cluster-up/kubectl.sh exec pods/virt-launcher-dedicated-threads-6lwcm -- virsh vcpuinfo 1
selecting podman as container runtime
VCPU:           0
CPU:            1
State:          running
CPU time:       8.0s
CPU Affinity:   -y--------------

VCPU:           1
CPU:            2
State:          running
CPU time:       4.3s
CPU Affinity:   --y-------------

$ ./cluster-up/ssh.sh node01 lscpu
[..]
Thread(s) per core:  1
Core(s) per socket:  16
Socket(s):           1
[..]
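
To confirm that the pCPUs the vCPUs were pinned to (1 and 2 above) are not thread siblings, the standard sysfs topology files can be checked on the node; on a host without SMT each sibling list contains only the CPU itself:

$ ./cluster-up/ssh.sh node01 cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list
$ ./cluster-up/ssh.sh node01 cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list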

Additional context:
N/A

Environment:

  • KubeVirt version (use virtctl version): N/A
  • Kubernetes version (use kubectl version): N/A
  • VM or VMI specifications: N/A
  • Cloud provider or hardware configuration: N/A
  • OS (e.g. from /etc/os-release): N/A
  • Kernel (e.g. uname -a): N/A
  • Install tools: N/A
  • Others: N/A
lyarwood commented:

Just for context, guestMappingPassthrough doesn't check for this case either:

$ ./cluster-up/kubectl.sh apply -f -<<EOF
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: dedicated-threads-with-numa
spec:
  domain:
    cpu:
      threads: 2
      dedicatedCpuPlacement: true
      numa:
        guestMappingPassthrough: {}
    memory:
      guest: 1Gi
      hugepages:
        pageSize: "2Mi"
    devices:
      disks:
        - disk:
            bus: virtio
          name: containerdisk
        - disk:
            bus: virtio
          name: cloudinitdisk
    resources:
      requests:
        memory: 2Gi
  volumes:
    - containerDisk:
        image: quay.io/containerdisks/fedora:39
      name: containerdisk
    - cloudInitNoCloud:
        userData: |
          #!/bin/sh
          mkdir -p  /home/fedora/.ssh
          curl https://github.com/lyarwood.keys > /home/fedora/.ssh/authorized_keys
          chown -R fedora: /home/fedora/.ssh
      name: cloudinitdisk
EOF
[..]
$ ./cluster-up/kubectl.sh get vmis
NAME                          AGE   PHASE     IP               NODENAME   READY
dedicated-threads-with-numa   12s   Running   10.244.196.149   node01     True
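
The resulting pinning can be inspected the same way as before (the virt-launcher pod name below is illustrative; substitute the actual pod suffix):

$ ./cluster-up/kubectl.sh exec pods/virt-launcher-dedicated-threads-with-numa-xxxxx -- virsh vcpupin 1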


vladikr commented Apr 19, 2024

Thanks @lyarwood.

Indeed, this is open to interpretation. So far, the general intent was to provide dedicated CPUs to the guest; however, the topology assignment was on a best-effort basis. The only exception was guestMappingPassthrough, where we are strict about NUMA node assignment.

We can of course follow up and correct the behavior by further enhancing the dedicatedCPUs API.
I would start by reporting whether SMT is enabled on the nodes.

lyarwood commented:

> We can of course follow up and correct the behavior by further enhancing the dedicatedCPUs API.
> I would start by reporting whether SMT is enabled on the nodes.

ACK, thanks for confirming this is a valid thing to fix. It should be easy enough to label a node based on the value in /sys/devices/system/cpu/smt/active and to use that label when scheduling later if threads is greater than 1. I'll try to find some time to work on this in the coming weeks.
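
A minimal sketch of that idea, assuming a hypothetical label key kubevirt.io/smt-enabled (not an existing KubeVirt label; in practice virt-handler would publish this rather than a one-off script):

# Run on each node: read the SMT state from sysfs and label the node accordingly.
# /sys/devices/system/cpu/smt/active reads 1 when SMT is enabled.
# Assumes the Kubernetes node name matches the hostname.
smt=$(cat /sys/devices/system/cpu/smt/active 2>/dev/null || echo 0)
if [ "$smt" = "1" ]; then state=true; else state=false; fi
kubectl label node "$(hostname)" kubevirt.io/smt-enabled="$state" --overwrite

The scheduling side would then translate threads greater than 1 combined with dedicatedCpuPlacement into a node selector on that label.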

/assign
