diff --git a/docs/design/README.md b/docs/design/README.md
index adcffd0196..a5d9dc712f 100644
--- a/docs/design/README.md
+++ b/docs/design/README.md
@@ -13,6 +13,7 @@ Kata Containers design documents:
 - [Design for Kata Containers `Lazyload` ability with `nydus`](kata-nydus-design.md)
 - [Design for direct-assigned volume](direct-blk-device-assignment.md)
 - [Design for core-scheduling](core-scheduling.md)
+- [Virtualization Reference Architecture](kata-vra.md)
 ---
 
 - [Design proposals](proposals)
diff --git a/docs/design/kata-vra.md b/docs/design/kata-vra.md
new file mode 100644
index 0000000000..ba53c33712
--- /dev/null
+++ b/docs/design/kata-vra.md
@@ -0,0 +1,434 @@
# Virtualization Reference Architecture

## _Subject to Change | © 2022 by NVIDIA Corporation. All rights reserved. | For test and development only_

Before digging deeper into the virtualization reference architecture, let's
first look at the various GPUDirect use cases in the following table. We
distinguish between two top-level use cases: (1) the devices are passed
through, and (2) the devices are virtualized, where a VM is assigned a virtual
function (VF) rather than the physical function (PF). A combination of PF and
VF would also be possible.

| Device #1 (passthrough)     | Device #2 (passthrough)     | P2P Compatibility and Mode                  |
| --------------------------- | --------------------------- | ------------------------------------------- |
| GPU PF                      | GPU PF                      | GPUDirect P2P                               |
| GPU PF                      | NIC PF                      | GPUDirect RDMA                              |
| MIG-slice                   | MIG-slice                   | _No GPUDirect P2P_                          |
| MIG-slice                   | NIC PF                      | GPUDirect RDMA                              |
| **Device #1 (virtualized)** | **Device #2 (virtualized)** | **P2P Compatibility and Mode**              |
| Time-slice vGPU VF          | Time-slice vGPU VF          | _No GPUDirect P2P but NVLink P2P available_ |
| Time-slice vGPU VF          | NIC VF                      | GPUDirect RDMA                              |
| MIG-slice vGPU              | MIG-slice vGPU              | _No GPUDirect P2P_                          |
| MIG-slice vGPU              | NIC VF                      | GPUDirect RDMA                              |

In a virtualized environment, several distinct features may prevent
peer-to-peer (P2P) communication between two endpoints in a PCI Express
topology. The IOMMU translates IO virtual addresses (IOVA) to physical
addresses (PA). Each device behind an IOMMU has its own IOVA memory space;
usually no two devices share the same IOVA memory space, but it is up to the
hypervisor or OS how it chooses to map devices to IOVA spaces. Any PCI Express
DMA transaction will use IOVAs, which the IOMMU must translate. By default, all
the traffic is routed to the root complex and not issued directly to the peer
device.

An IOMMU can be used to isolate and protect devices even if virtualization is
not used; since devices can only access memory regions that are mapped for
them, a DMA from one device to another is not possible. DPDK uses the IOMMU for
better isolation between devices; another benefit is that the IOVA space can be
presented as contiguous memory even if the PA space is heavily scattered.

In the case of virtualization, the IOMMU is responsible for isolating devices
and memory between VMs for safe device assignment without compromising the host
and other guest OSes. Without an IOMMU, any device can access the entire system
and perform DMA transactions _anywhere_.

The second feature is ACS (Access Control Services), which controls which
devices are allowed to communicate with one another and thus avoids improper
routing of packets, irrespective of whether the IOMMU is enabled or not.

When the IOMMU is enabled, ACS is normally configured to force all PCI Express
DMA through the root complex so the IOMMU can translate it, which hurts
performance between peers through higher latency and reduced bandwidth.

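Both mechanisms can be inspected on a Linux host. The following is a minimal
sketch, assuming the IOMMU is enabled; the BDF used for the ACS query is only
an example of a PCI Express downstream port and has to be replaced with one
from the actual topology:

```sh
# List all IOMMU groups and the devices contained in each group.
for group in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${group##*/}:"
    ls "${group}/devices"
done

# Show the ACS capability and control bits of a PCI Express downstream port
# (d9:00.0 is an example BDF; pick a port from your own topology).
sudo lspci -vvv -s d9:00.0 | grep -i acs
```
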
A way to avoid the performance hit is to enable Address Translation Services
(ATS). ATS-capable endpoints can prefetch IOVA -> PA translations from the
IOMMU and then perform DMA transactions directly to another endpoint.
Hypervisors enable this by enabling ATS in such endpoints, configuring ACS to
allow Direct Translated P2P, and configuring the IOMMU to allow Address
Translation requests.

Another important factor is that the NVIDIA driver stack uses the PCI Express
topology of the system it is running on to determine whether the hardware is
capable of supporting P2P. The driver stack qualifies specific chipsets and PCI
Express switches for use with GPUDirect P2P. In virtual environments, the PCI
Express topology is flattened and obfuscated to present a uniform environment
to the software inside the VM, which breaks the GPUDirect P2P use case.

On a bare metal machine, the driver stack groups GPUs into cliques that can
perform GPUDirect P2P communication, excluding peer mappings where P2P
communication is not possible, most prominently if GPUs are attached to
different CPU sockets.

A CPU and its local memory bank are referred to as a NUMA node. In a two-socket
server, each of the CPUs has a local memory bank, for a total of two NUMA
nodes. Some servers provide the ability to configure additional NUMA nodes per
CPU, which means a CPU socket can have two NUMA nodes (some servers support
four NUMA nodes per socket) with local memory banks and L3 NUMA domains for
improved performance.

One of the current solutions is for the hypervisor to provide additional
topology information that the driver stack can pick up to enable GPUDirect P2P
between GPUs, even if the virtualized environment does not directly expose it.
The PCI Express virtual P2P approval capability structure in the PCI
configuration space is entirely emulated by the hypervisor for passthrough GPU
devices.

A clique ID is provided, where GPUs with the same clique ID belong to a group
of GPUs capable of P2P communication.

On vSphere, Azure, and other CSPs, the hypervisor lays down a `topologies.xml`
which NCCL can pick up to deduce the right P2P level[^1]. NCCL leverages
InfiniBand (IB) and/or Unified Communication X (UCX) for communication, and
GPUDirect P2P and GPUDirect RDMA should just work in this case. The only caveat
is that software or applications that do not use the XML file to deduce the
topology will fail to enable GPUDirect
([`nccl-p2p-level`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-p2p-level)).

## Hypervisor PCI Express Topology

To enable every part of the accelerator stack, we propose a virtualized
reference architecture that enables GPUDirect P2P and GPUDirect RDMA for any
hypervisor. The idea is split into two parts to enable the right PCI Express
topology. The first part builds upon extending the PCI Express virtual P2P
approval capability structure to every device that wants to do P2P in some way
and groups devices by clique ID. The other part involves replicating a subset
of the host topology so that applications running in the VM do not need to read
additional information and can enable the P2P capability like in the bare-metal
use case described above. The driver stack can then automatically deduce
whether the topology presented in the VM is capable of P2P communication.

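Inside a VM, one quick way to see what the driver stack has deduced is the
topology matrix reported by `nvidia-smi`. The NCCL override below is a
debugging aid only, and `./all_reduce_perf` is a placeholder for an NCCL test
binary:

```sh
# Print the topology matrix (connection type between each GPU/NIC pair)
# as seen by the NVIDIA driver inside the VM.
nvidia-smi topo -m

# For debugging only: force NCCL's P2P level (see the nccl-p2p-level link above).
NCCL_P2P_LEVEL=SYS ./all_reduce_perf
```
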
We will work with the following host topology in the sections below. It is a
system with two converged DPUs, each having an `A100X` GPU and two `ConnectX-6`
network ports connected to the downstream ports of a PCI Express switch.

```sh
+-00.0-[d8-df]----00.0-[d9-df]--+-00.0-[da-db]--+-00.0  Mellanox Tech MT42822 BlueField-2 integrated ConnectX-6 Dx network
|                               |               +-00.1  Mellanox Tech MT42822 BlueField-2 integrated ConnectX-6 Dx network
|                               |               \-00.2  Mellanox Tech MT42822 BlueField-2 SoC Management Interface
|                               \-01.0-[dc-df]----00.0-[dd-df]----08.0-[de-df]----00.0  NVIDIA Corporation GA100 [A100X]

+-00.0-[3b-42]----00.0-[3c-42]--+-00.0-[3d-3e]--+-00.0  Mellanox Tech MT42822 BlueField-2 integrated ConnectX-6 Dx network
|                               |               +-00.1  Mellanox Tech MT42822 BlueField-2 integrated ConnectX-6 Dx network
|                               |               \-00.2  Mellanox Tech MT42822 BlueField-2 SoC Management Interface
|                               \-01.0-[3f-42]----00.0-[40-42]----08.0-[41-42]----00.0  NVIDIA Corporation GA100 [A100X]
```

The path through each DPU's internal PCI Express switch, connecting the A100X
to its ConnectX-6 ports, is the optimal and preferred path for efficient P2P
communication.

## PCI Express Virtual P2P Approval Capability

Most of the time, the PCI Express topology is flattened and obfuscated to
ensure easy migration of the VM image between different physical hardware
topologies. In Kata, we can configure the hypervisor to use PCI Express root
ports to hotplug the VFIO devices one is passing through. A user can select how
many PCI Express root ports to allocate, depending on how many devices are
passed through. A recent addition to Kata will detect the number of PCI Express
devices that need hotplugging and bail out if the number of root ports is
insufficient. Kata does not automatically increase the number of root ports; we
want the user to be in full control of the topology.

```toml
# /etc/kata-containers/configuration.toml

# VFIO devices are hotplugged on a bridge by default.
# Enable hot-plugging on the root bus. This may be required for devices with
# a large PCI BAR, as this is a current limitation with hot-plugging on
# a bridge.
# Default "bridge-port"
hotplug_vfio = "root-port"

# Before hot plugging a PCIe device, you need to add a pcie_root_port device.
# Use this parameter when using some large PCI BAR devices, such as an NVIDIA GPU.
# The value means the number of pcie_root_port.
# This value is valid when hotplug_vfio_on_root_bus is true and machine_type is "q35".
# Default 0
pcie_root_port = 8
```

VFIO devices are hotplugged on a PCIe-PCI bridge by default. Hotplug of PCI
Express devices is only supported on PCI Express root or downstream ports. With
this configuration set, if we start up a Kata container, we can inspect our
topology and see the allocated PCI Express root ports and the hotplugged
devices.

```sh
$ lspci -tv
-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
           +-01.0  Red Hat, Inc. Virtio console
           +-02.0  Red Hat, Inc. Virtio SCSI
           +-03.0  Red Hat, Inc. Virtio RNG
           +-04.0-[01]----00.0  Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6
           +-05.0-[02]----00.0  Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6
           +-06.0-[03]----00.0  NVIDIA Corporation Device 20b8
           +-07.0-[04]----00.0  NVIDIA Corporation Device 20b8
           +-08.0-[05]--
           +-09.0-[06]--
           +-0a.0-[07]--
           +-0b.0-[08]--
           +-0c.0  Red Hat, Inc. Virtio socket
           +-0d.0  Red Hat, Inc. Virtio file system
           +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller
           +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller
           \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller
```

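For illustration, launching such a container could look like the following
sketch. The runtime name and the VFIO group numbers are assumptions that depend
on the local installation and on which devices were bound to `vfio-pci`:

```sh
# VFIO groups available on the host (group numbers below are examples).
ls -l /dev/vfio/

# Run a Kata container and pass four VFIO groups (two GPUs, two NICs); with the
# configuration above, each device is hotplugged into a PCI Express root port.
docker run --rm --runtime=kata-runtime \
    --device /dev/vfio/66 --device /dev/vfio/67 \
    --device /dev/vfio/70 --device /dev/vfio/71 \
    ubuntu:22.04 ls /sys/bus/pci/devices
```
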
Devices with huge BARs (Base Address Registers), like the GPU, need the PCI
Express root port configured properly, with enough memory reserved for mapping
the BARs. We have added a heuristic to Kata that deduces the right settings, so
the BARs can be mapped correctly. This functionality comes from
[`nvidia/go-nvlib`](https://gitlab.com/nvidia/cloud-native/go-nvlib), which is
part of Kata now.

```sh
$ sudo dmesg | grep BAR
[ 0.179960] pci 0000:00:04.0: BAR 7: assigned [io 0x1000-0x1fff]
[ 0.179962] pci 0000:00:05.0: BAR 7: assigned [io 0x2000-0x2fff]
[ 0.179963] pci 0000:00:06.0: BAR 7: assigned [io 0x3000-0x3fff]
[ 0.179964] pci 0000:00:07.0: BAR 7: assigned [io 0x4000-0x4fff]
[ 0.179966] pci 0000:00:08.0: BAR 7: assigned [io 0x5000-0x5fff]
[ 0.179967] pci 0000:00:09.0: BAR 7: assigned [io 0x6000-0x6fff]
[ 0.179968] pci 0000:00:0a.0: BAR 7: assigned [io 0x7000-0x7fff]
[ 0.179969] pci 0000:00:0b.0: BAR 7: assigned [io 0x8000-0x8fff]
[ 2.115912] pci 0000:01:00.0: BAR 0: assigned [mem 0x13000000000-0x13001ffffff 64bit pref]
[ 2.116203] pci 0000:01:00.0: BAR 2: assigned [mem 0x13002000000-0x130027fffff 64bit pref]
[ 2.683132] pci 0000:02:00.0: BAR 0: assigned [mem 0x12000000000-0x12001ffffff 64bit pref]
[ 2.683419] pci 0000:02:00.0: BAR 2: assigned [mem 0x12002000000-0x120027fffff 64bit pref]
[ 2.959155] pci 0000:03:00.0: BAR 1: assigned [mem 0x11000000000-0x117ffffffff 64bit pref]
[ 2.959345] pci 0000:03:00.0: BAR 3: assigned [mem 0x11800000000-0x11801ffffff 64bit pref]
[ 2.959523] pci 0000:03:00.0: BAR 0: assigned [mem 0xf9000000-0xf9ffffff]
[ 2.966119] pci 0000:04:00.0: BAR 1: assigned [mem 0x10000000000-0x107ffffffff 64bit pref]
[ 2.966295] pci 0000:04:00.0: BAR 3: assigned [mem 0x10800000000-0x10801ffffff 64bit pref]
[ 2.966472] pci 0000:04:00.0: BAR 0: assigned [mem 0xf7000000-0xf7ffffff]
```

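The same large BARs can be inspected on the host before the device is passed
through; a small sketch, with `41:00.0` being the A100X from the host topology
above:

```sh
# On the host: list the GPU's BARs. The large 64-bit prefetchable region is
# what the heuristic sizes the root port's memory window for.
sudo lspci -vv -s 41:00.0 | grep -i region
```
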
The NVIDIA driver stack in this case would refuse to do P2P communication,
since (1) the topology is not what it expects, and (2) we do not have a
qualified chipset. Since our P2P devices are not connected to a PCI Express
switch port, we need to provide additional information to support the P2P
functionality. One way of providing such meta information would be to annotate
the container; most of the settings in Kata's configuration file can be
overridden via annotations, but this limits the flexibility, and users would
need to update all the containers they want to run with Kata. The goal is to
make such things as transparent as possible, so we also introduced
[CDI](https://github.com/container-orchestrated-devices/container-device-interface)
(Container Device Interface) to Kata. CDI is a
[specification](https://github.com/container-orchestrated-devices/container-device-interface/blob/master/SPEC.md)
for container runtimes to support third-party devices.

As described above, we can provide a clique ID for the devices that belong
together and are capable of doing P2P. This information is provided to the
hypervisor, which will set things up in the VM accordingly. Suppose the user
wants to do GPUDirect RDMA with the first GPU and the NIC that reside on the
same DPU; one could provide a specification telling the hypervisor that they
belong to the same clique.

```yaml
# /etc/cdi/nvidia.yaml
cdiVersion: 0.4.0
kind: nvidia.com/gpu
devices:
- name: gpu0
  annotations:
    bdf: "41:00.0"
    clique-id: "0"
  containerEdits:
    deviceNodes:
    - path: "/dev/vfio/71"

# /etc/cdi/mellanox.yaml
cdiVersion: 0.4.0
kind: mellanox.com/nic
devices:
- name: nic0
  annotations:
    bdf: "3d:00.0"
    clique-id: "0"
    attach-pci: "true"
  containerEdits:
    deviceNodes:
    - path: "/dev/vfio/66"
```

Since this setting is bound to the device and not the container, we do not need
to alter the container; we just allocate the right resources and GPUDirect RDMA
will be set up correctly. Rather than exposing them separately, an idea would
be to expose a GPUDirect RDMA device via NFD (Node Feature Discovery) that
combines both of them; this way, we could make sure that the right pair is
allocated and used. More on Kubernetes deployment in the next section.

The GPU driver stack is leveraging the PCI Express virtual P2P approval
capability, but the NIC stack does not use it yet. One of the action items is
to enable MOFED to read the P2P approval capability and enable the ATS and ACS
settings described above.

This way, we could enable GPUDirect P2P and GPUDirect RDMA on any topology
presented to the VM application. It is the responsibility of the administrator
or infrastructure engineer to provide the right information, either via
annotations or a CDI specification.

## Host Topology Replication

The other way to represent the PCI Express topology in the VM is to replicate a
subset of the host topology needed to support the P2P use case inside the VM.
Similar to the configuration for the root ports, we can easily configure the
usage of PCI Express switch ports to hotplug the devices.

```toml
# /etc/kata-containers/configuration.toml

# VFIO devices are hotplugged on a bridge by default.
# Enable hot plugging on a switch port. This may be required for devices with
# a large PCI BAR, as this is a current limitation with hot plugging on
# a bridge.
# Default "bridge-port"
hotplug_vfio = "switch-port"

# Before hot plugging a PCIe device, you need to add a pcie_switch_port device.
# Use this parameter when using some large PCI BAR devices, such as an NVIDIA GPU.
# The value means the number of pcie_switch_port.
# This value is valid when hotplug_vfio_on_root_bus is true and machine_type is "q35".
# Default 0
pcie_switch_port = 8
```

Each device that is passed through is attached to a PCI Express downstream port
as illustrated below. We can even replicate the host's two-DPU topology with
added metadata through CDI. Most of the time, a container only needs one pair
of GPU and NIC for GPUDirect RDMA; this is more of a showcase of what we can do
with the power of Kata and CDI. One could even think of adding groups of
devices that support P2P, even from different CPU sockets or NUMA nodes, into
one container: in the listing below, the first group belongs to NUMA node 0 and
the second group to NUMA node 1. Since they are grouped correctly, P2P would be
enabled naturally inside a group, i.e., within a clique ID.

```sh
$ lspci -tv
-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
           +-01.0  Red Hat, Inc. Virtio console
           +-02.0  Red Hat, Inc. Virtio SCSI
           +-03.0  Red Hat, Inc. Virtio RNG
           +-04.0-[01-04]----00.0-[02-04]--+-00.0-[03]----00.0  NVIDIA Corporation Device 20b8
           |                               \-01.0-[04]----00.0  Mellanox Tech MT42822 BlueField-2 integrated ConnectX-6 Dx
           +-05.0-[05-08]----00.0-[06-08]--+-00.0-[07]----00.0  Mellanox Tech MT42822 BlueField-2 integrated ConnectX-6 Dx
           |                               \-01.0-[08]----00.0  NVIDIA Corporation Device 20b8
           +-06.0  Red Hat, Inc. Virtio socket
           +-07.0  Red Hat, Inc. Virtio file system
           +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller
           +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode]
           \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller
```

The configuration of using either the root port or the switch port can be
applied on a per-container or per-Pod basis, meaning we can switch PCI Express
topologies on each run of an application.

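As a sketch of how that per-Pod switch could look: Kata allows overriding
configuration options through Pod annotations of the form
`io.katacontainers.config.<section>.<option>`, provided the option is allowed
via `enable_annotations` in the runtime configuration. The Pod name, image, and
runtime class below are placeholders, and whether `hotplug_vfio` is exposed on
the annotation allow-list depends on the deployment:

```sh
# Create a Pod that asks Kata for switch-port hotplug instead of root-port.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpudirect-demo                  # placeholder name
  annotations:
    # Assumes hotplug_vfio is permitted via enable_annotations.
    io.katacontainers.config.hypervisor.hotplug_vfio: "switch-port"
spec:
  runtimeClassName: kata                # placeholder runtime class
  containers:
  - name: workload
    image: ubuntu:22.04                 # placeholder image
    command: ["sleep", "infinity"]
EOF
```
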
## Hypervisor Resource Limits

Every hypervisor has resource limits in terms of how many PCI Express root
ports, switch ports, or bridge ports can be created, especially with devices
that need to reserve a 4K IO range per the PCI specification. Each instance of
a root or switch port consumes 4K of a very limited IO capacity; 64K is the
maximum.

Simple math (64K / 4K = 16) brings us to the conclusion that we can have a
maximum of 16 PCI Express root ports or 16 PCI Express switch ports in QEMU if
devices with IO BARs are used in the PCI Express hierarchy.

Additionally, one can have 32 slots on the PCI root bus and a maximum of 256
slots in the complete PCI(e) topology.

Per default, QEMU will attach a multi-function device in the last slot on the
PCI root bus:

```sh
 +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller
 +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode]
 \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller
```

Kata will additionally add `virtio-xxx-pci` devices consuming five slots, plus
a PCIe-PCI bridge (one slot) and a DRAM controller (one slot), meaning per
default we already have eight slots used. This leaves us 24 slots for adding
other devices to the root bus.

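Both limits can be checked from inside the guest; a small sketch using the
root-port topology shown earlier, where `00:04.0` is one of the allocated root
ports:

```sh
# Count the occupied slots on the PCI root bus (bus 00).
lspci -s 00: | cut -d' ' -f1 | cut -d'.' -f1 | sort -u | wc -l

# Each PCI Express root port reserves a 4K IO window, matching the
# "BAR 7: assigned [io ...]" lines in the earlier dmesg output.
sudo lspci -vv -s 00:04.0 | grep "I/O behind bridge"
```
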
The problem that arises here comes from a customer use case with recent RTX
GPUs and Kata. The user wanted to pass through eight of these GPUs into one
container and ran into issues. The problem is that those cards often consist of
four individual device nodes: the GPU, an audio device, and two USB controller
devices (some cards have a USB-C output).

These devices are grouped into one IOMMU group. Since one needs to pass through
the complete IOMMU group into the VM, we would need to allocate 32 PCI Express
root ports or 32 PCI Express switch ports (8 GPUs x 4 devices), which is
technically impossible due to the resource limits outlined above. Since all the
devices appear as PCI Express devices, we need to hotplug them into a root or
switch port.

The solution to this problem is leveraging CDI. For each device, we add the
information whether it is going to be hotplugged as a PCI Express or a PCI
device, which results in either using a PCI Express root/switch port or an
ordinary PCI bridge. PCI bridges are not affected by the limited IO range. This
way, the GPU is attached as a PCI Express device to a root/switch port and the
other three PCI devices to a PCI bridge, leaving enough resources to create the
needed PCI Express root/switch ports. For example, we're going to attach the
GPUs to a PCI Express root port and the NICs to a PCI bridge.

```yaml
# /etc/cdi/mellanox.yaml
cdiVersion: 0.4.0
kind: mellanox.com/nic
devices:
- name: nic0
  annotations:
    bdf: "3d:00.0"
    clique-id: "0"
    attach-pci: "true"
  containerEdits:
    deviceNodes:
    - path: "/dev/vfio/66"
- name: nic1
  annotations:
    bdf: "3d:00.1"
    clique-id: "1"
    attach-pci: "true"
  containerEdits:
    deviceNodes:
    - path: "/dev/vfio/67"
```

The configuration is set to use eight root ports for the GPUs and to attach the
NICs to a PCI bridge, which is connected to a PCI Express-to-PCI bridge; this
is the preferred way of introducing a PCI topology into a PCI Express machine.

```sh
$ lspci -tv
-[0000:00]-+-00.0  Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
           +-01.0  Red Hat, Inc. Virtio console
           +-02.0  Red Hat, Inc. Virtio SCSI
           +-03.0  Red Hat, Inc. Virtio RNG
           +-04.0-[01]----00.0  NVIDIA Corporation Device 20b8
           +-05.0-[02]----00.0  NVIDIA Corporation Device 20b8
           +-06.0-[03]--
           +-07.0-[04]--
           +-08.0-[05]--
           +-09.0-[06]--
           +-0a.0-[07]--
           +-0b.0-[08]--
           +-0c.0-[09-0a]----00.0-[0a]--+-00.0  Mellanox Tech MT42822 BlueField-2 ConnectX-6
           |                            \-01.0  Mellanox Tech MT42822 BlueField-2 ConnectX-6
           +-0d.0  Red Hat, Inc. Virtio socket
           +-0e.0  Red Hat, Inc. Virtio file system
           +-1f.0  Intel Corporation 82801IB (ICH9) LPC Interface Controller
           +-1f.2  Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller
           \-1f.3  Intel Corporation 82801I (ICH9 Family) SMBus Controller
```

The PCI devices only consume slots, of which we have 256 in the complete PCI(e)
topology, leaving the scarce IO resources for the PCI Express devices that need
them.
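
A final sanity check inside the guest confirms the split: the GPUs sit behind
PCI Express root ports, while both NIC functions share the bus behind the PCI
bridge and therefore consume slots instead of IO windows (bus numbers taken
from the listing above):

```sh
# GPUs: one per PCI Express root port (buses 01 and 02 in the listing above).
lspci -s 01: && lspci -s 02:

# NICs: both functions share bus 0a behind the PCI(e)-to-PCI bridge.
lspci -s 0a:
```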