diff --git a/docs/.nav.yml b/docs/.nav.yml
index 15dab995cc..7dc1b12238 100644
--- a/docs/.nav.yml
+++ b/docs/.nav.yml
@@ -15,7 +15,7 @@ nav:
   - Use Cases:
     - NVIDIA GPU Passthrough: use-cases/NVIDIA-GPU-passthrough-and-Kata-QEMU.md
     - NVIDIA vGPU: use-cases/NVIDIA-GPU-passthrough-and-Kata.md
-    - Intel Discrete GPU: use-cases/Intel-Discrete-GPU-passthrough-and-Kata.md
+    - Intel QAT: use-cases/using-Intel-QAT-and-kata.md
   - Contributing:
     - Documentation: doc-contributing.md
   - Misc:
diff --git a/docs/how-to/how-to-load-kernel-modules-with-kata.md b/docs/how-to/how-to-load-kernel-modules-with-kata.md
index 24a3546012..978078916d 100644
--- a/docs/how-to/how-to-load-kernel-modules-with-kata.md
+++ b/docs/how-to/how-to-load-kernel-modules-with-kata.md
@@ -1,73 +1,90 @@
-# Loading kernel modules
+# Loading kernel modules in Kata Containers
-A new feature for loading kernel modules was introduced in Kata Containers 1.9.
-The list of kernel modules and their parameters can be provided using the
-configuration file or OCI annotations. The [Kata runtime][1] gives that
-information to the [Kata Agent][2] through gRPC when the sandbox is created.
-The [Kata Agent][2] will insert the kernel modules using `modprobe(8)`, hence
-modules dependencies are resolved automatically.
+This document describes how to load kernel modules inside the Kata Containers guest VM.
-The sandbox will not be started when:
+## Overview
- * A kernel module is specified and the `modprobe(8)` command is not installed in
-   the guest or it fails loading the module.
- * The module is not available in the guest or it doesn't meet the guest kernel
-   requirements, like architecture and version.
+The kernel modules feature allows you to load specific kernel modules into the guest VM kernel when a sandbox is created. This is useful when your containerized applications require specific kernel functionality that is not built into the guest kernel.
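+To make the entry format concrete, the sketch below (Python, illustrative only — this is not Kata code, and the helper name is hypothetical) shows how a configured `kernel_modules` entry such as `e1000e InterruptThrottleRate=3000,3000,3000 EEE=1` splits into a module name plus parameters for a `modprobe(8)` invocation inside the guest:

```python
import shlex

def modprobe_argv(entry: str) -> list[str]:
    """Build a modprobe argv for one kernel_modules entry.

    Per the documented format, the first field is the module name and
    any remaining space-separated fields are module parameters.
    (Hypothetical helper, shown only to illustrate the entry format.)
    """
    fields = shlex.split(entry)
    if not fields:
        raise ValueError("empty kernel_modules entry")
    return ["modprobe", *fields]

print(modprobe_argv("e1000e InterruptThrottleRate=3000,3000,3000 EEE=1"))
# → ['modprobe', 'e1000e', 'InterruptThrottleRate=3000,3000,3000', 'EEE=1']
```

+Because `modprobe(8)` is used rather than `insmod(8)`, module dependencies are resolved by the guest automatically.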
-In the following sections are documented the different ways that exist for -loading kernel modules in Kata Containers. +**How it works:** + +1. You specify kernel modules and their parameters via configuration file or OCI annotations +2. The Kata runtime passes this information to the Kata Agent through agent RPC during sandbox creation (gRPC in runtime-go, ttrpc in runtime-rs) +3. The Kata Agent loads the modules using `modprobe(8)`, which automatically resolves module dependencies + +**Failure conditions:** + +The sandbox will fail to start if: + +- A kernel module is specified but `modprobe(8)` is not installed in the guest, or it fails to load the module +- The module is not available in the guest or doesn't meet guest kernel requirements (architecture, version, etc.) + +## Configuration Methods - [Using Kata Configuration file](#using-kata-configuration-file) - [Using annotations](#using-annotations) -# Using Kata Configuration file +## Using Kata Configuration file -``` -NOTE: Use this method, only if you need to pass the kernel modules to all -containers. Please use annotations described below to set per pod annotations. -``` +> **Note**: Use this method when you need the kernel modules loaded for all containers. For per-pod configuration, use annotations instead. -The list of kernel modules and parameters can be set in the `kernel_modules` -option as a coma separated list, where each entry in the list specifies a kernel -module and its parameters. Each list element comprises one or more space separated -fields. The first field specifies the module name and subsequent fields specify -individual parameters for the module. +The `kernel_modules` option accepts a list of kernel modules with their parameters. Each list element specifies a module name followed by space-separated parameters. -The following example specifies two modules to load: `e1000e` and `i915`. 
Two parameters -are specified for the `e1000` module: `InterruptThrottleRate` (which takes an array -of integer values) and `EEE` (which requires a single integer value). +### Configuration Format + +**For runtime-go** (`configuration-qemu.toml`, etc.): ```toml -kernel_modules=["e1000e InterruptThrottleRate=3000,3000,3000 EEE=1", "i915"] +[agent.kata] +kernel_modules = ["e1000e InterruptThrottleRate=3000,3000,3000 EEE=1", "i915"] ``` -Not all the container managers allow users provide custom annotations, hence -this is the only way that Kata Containers provide for loading modules when -custom annotations are not supported. +**For runtime-rs** (`configuration-qemu-runtime-rs.toml`, etc.): -There are some limitations with this approach: +```toml +[agent.kata] +kernel_modules = ["e1000e InterruptThrottleRate=3000,3000,3000 EEE=1", "i915"] +``` -* Write access to the Kata configuration file is required. -* The configuration file must be updated when a new container is created, - otherwise the same list of modules is used, even if they are not needed in the - container. +### Example -# Using annotations +The following example loads two modules: -As was mentioned above, not all containers need the same modules, therefore using -the configuration file for specifying the list of kernel modules per [POD][3] can -be a pain. -Unlike the configuration file, [annotations](how-to-set-sandbox-config-kata.md) -provide a way to specify custom configurations per POD. +- `e1000e` with parameters `InterruptThrottleRate=3000,3000,3000` and `EEE=1` +- `i915` with no parameters -The list of kernel modules and parameters can be set using the annotation -`io.katacontainers.config.agent.kernel_modules` as a semicolon separated -list, where the first word of each element is considered as the module name and -the rest as its parameters. 
+```toml +kernel_modules = ["e1000e InterruptThrottleRate=3000,3000,3000 EEE=1", "i915"] +``` -In the following example two PODs are created, but the kernel modules `e1000e` -and `i915` are inserted only in the POD `pod1`. +### Limitations +- Write access to the Kata configuration file is required +- All containers will use the same module list, even if some containers don't need them +- Configuration changes require service restart to take effect + +## Using annotations + +Annotations provide a way to specify kernel modules per pod, which is more flexible than the configuration file approach. + +### Annotation Key + +``` +io.katacontainers.config.agent.kernel_modules +``` + +### Format + +The annotation value uses **semicolon (`;`)** as the separator between modules. Each module specification consists of: + +- Module name (first word) +- Parameters (subsequent words, space-separated) + +Example: `"e1000e EEE=1; i915 enable_ppgtt=0"` + +### Kubernetes Example + +The following example creates two pods, where only `pod1` will have the kernel modules `e1000e` and `i915` loaded: ```yaml apiVersion: v1 @@ -104,6 +121,53 @@ spec: > **Note**: To pass annotations to Kata containers, [CRI-O must be configured correctly](how-to-set-sandbox-config-kata.md#cri-o-configuration) -[1]: ../../src/runtime -[2]: ../../src/agent -[3]: https://kubernetes.io/docs/concepts/workloads/pods/pod/ +## Technical Details + +### Data Flow + +``` + Configuration File / Annotation + │ + ▼ + SandboxConfig.AgentConfig.KernelModules + │ + ▼ + Converted to gRPC KernelModule messages + │ + ▼ + CreateSandboxRequest sent to Agent + │ + ▼ + Agent executes modprobe in guest VM +``` + +### Implementation in Runtimes + +**runtime-go:** + +- Config parsing: `src/runtime/pkg/katautils/config.go` +- Annotation handling: `src/runtime/pkg/oci/utils.go` (`addAgentConfigOverrides()`) +- Module parsing: `src/runtime/virtcontainers/kata_agent.go` (`setupKernelModules()`) + +**runtime-rs:** + +- Config structure: 
`src/libs/kata-types/src/config/agent.rs`
- Annotation handling: `src/libs/kata-types/src/annotations/mod.rs` (`update_config_by_annotation()`)
- Module parsing: `src/runtime-rs/crates/agent/src/types.rs` (`KernelModule::set_kernel_modules()`)

## Debugging

To verify kernel modules are loaded in the guest VM:

```bash
# Inside the container, run:
lsmod | grep <module_name>

# Or check modprobe output in the guest VM journal
```

If module loading fails, check that:

1. The module is available in the guest kernel modules directory (`/lib/modules/$(uname -r)`)
2. The module dependencies are satisfied
3. The guest kernel version matches the module requirements
diff --git a/docs/use-cases/GPU-passthrough-and-Kata.md b/docs/use-cases/GPU-passthrough-and-Kata.md
index 40b5297eea..8fdbfad6f4 100644
--- a/docs/use-cases/GPU-passthrough-and-Kata.md
+++ b/docs/use-cases/GPU-passthrough-and-Kata.md
@@ -2,5 +2,5 @@
 Kata Containers supports passing certain GPUs from the host into the container.
 Select the GPU vendor for detailed information:
-- [Intel Discrete GPUs](Intel-Discrete-GPU-passthrough-and-Kata.md)/[Intel Integrated GPUs](Intel-GPU-passthrough-and-Kata.md)
 - [NVIDIA GPUs](NVIDIA-GPU-passthrough-and-Kata.md) and [Enabling NVIDIA GPU workloads using GPU passthrough with Kata Containers](NVIDIA-GPU-passthrough-and-Kata-QEMU.md)
+- Placeholder: other GPU vendors (e.g., AMD, Intel)
diff --git a/docs/use-cases/Intel-Discrete-GPU-passthrough-and-Kata.md b/docs/use-cases/Intel-Discrete-GPU-passthrough-and-Kata.md
deleted file mode 100644
index 73ccf613c5..0000000000
--- a/docs/use-cases/Intel-Discrete-GPU-passthrough-and-Kata.md
+++ /dev/null
@@ -1,274 +0,0 @@
-# Using Intel Discrete GPU device with Kata Containers
-
-This guide covers the use case for passing Intel Discrete GPUs to Kata.
-These include the Intel® Data Center GPU Max Series and Intel® Data Center GPU Flex Series.
-For integrated GPUs please refer to [Integrate-Intel-GPUs-with-Kata](Intel-GPU-passthrough-and-Kata.md) - -> **Note:** These instructions are for a system that has an x86_64 CPU. - -An Intel Discrete GPU can be passed to a Kata Container using GPU passthrough, -or SR-IOV passthrough. - -In Intel GPU pass-through mode, an entire physical GPU is directly assigned to one VM. -In this mode of operation, the GPU is accessed exclusively by the Intel driver running in -the VM to which it is assigned. The GPU is not shared among VMs. - -With SR-IOV mode, it is possible to pass a Virtual GPU instance to a virtual machine. -With this, multiple Virtual GPU instances can be carved out of a single physical GPU -and be passed to different VMs, allowing the GPU to be shared. - -| Technology | Description | -|-|-| -| GPU passthrough | Physical GPU assigned to a single VM | -| SR-IOV passthrough | Physical GPU shared by multiple VMs | - -## Hardware Requirements - -Intel GPUs Recommended for Virtualization: - -- Intel® Data Center GPU Max Series (`Ponte Vecchio`) -- Intel® Data Center GPU Flex Series (`Arctic Sound-M`) -- Intel® Data Center GPU Arc Series - -The following steps outline the workflow for using an Intel Graphics device with Kata Containers. - -## Host BIOS requirements - -Hardware such as Intel Max and Flex series require larger PCI BARs. - -For large BAR devices, MMIO mapping above the 4GB address space should be enabled in the PCI configuration of the BIOS. - -Some hardware vendors use a different name in the BIOS, such as: - -- Above 4GB Decoding -- Memory Hole for PCI MMIO -- Memory Mapped I/O above 4GB - -## Host Kernel Requirements - -For device passthrough to work with the Max and Flex Series, an out of tree kernel driver is required. 
- -For Ubuntu 22.04 server, follow these instructions to install the out of tree GPU driver: -```bash -$ sudo apt update -$ sudo apt install -y gpg-agent wget -$ wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \ - sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg -$ source /etc/os-release -$ echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu ${VERSION_CODENAME}/lts/2350 unified" | \ - sudo tee /etc/apt/sources.list.d/intel-gpu-${VERSION_CODENAME}.list -$ sudo apt update -$ sudo apt install -y linux-headers-"$(uname -r)" flex bison intel-fw-gpu intel-i915-dkms xpu-smi -$ sudo reboot -``` -For support on other distributions, please refer to [DGPU-docs](https://dgpu-docs.intel.com/driver/installation.html) - -You can also install the driver from source which is maintained at [intel-gpu-i915-backports](https://github.com/intel-gpu/intel-gpu-i915-backports) -Detailed instructions for reference can be found at: https://github.com/intel-gpu/intel-gpu-i915-backports/blob/backport/main/docs/README_ubuntu.md. 
- -Below are the steps for installing the driver from source on an Ubuntu 22.04 LTS system: -```bash -$ export I915_BRANCH="backport/main" -$ git clone -b ${I915_BRANCH} --depth 1 https://github.com/intel-gpu/intel-gpu-i915-backports.git -$ cd intel-gpu-i915-backports/ -$ sudo apt install -y dkms make debhelper devscripts build-essential flex bison mawk -$ sudo apt install -y linux-headers-"$(uname -r)" linux-image-unsigned-"$(uname -r)" -$ make i915dkmsdeb-pkg -``` -The above make command will create Debian package in parent folder: `intel-i915-dkms_..deb` -Install the package as: -```bash -$ sudo dpkg -i intel-i915-dkms_..deb -$ sudo reboot -``` - -Additionally, verify that the following kernel configs are enabled for your host kernel: -``` -CONFIG_VFIO -CONFIG_VFIO_IOMMU_TYPE1 -CONFIG_VFIO_PCI -``` - -## Host kernel command line - -Your host kernel needs to be booted with `intel_iommu=on` and `i915.enable_iaf=0` on the kernel command -line. - -1. Run the following to change the kernel command line using grub: -```bash -$ sudo vim /etc/default/grub -``` - -2. At the end of the GRUB_CMDLINE_LINUX_DEFAULT append the below line: - -`intel_iommu=on iommu=pt i915.max_vfs=63 i915.enable_iaf=0` - -3. Update grub as per OS distribution: - -For Ubuntu: -```bash -$ sudo update-grub -``` - -For CentOS/RHEL: -```bash -$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg -``` - -4. Reboot the system -```bash -$ sudo reboot -``` - -## Install and configure Kata Containers - -To use this feature, you need Kata version 1.3.0 or above. -Follow the [Kata Containers setup instructions](../install/README.md) -to install the latest version of Kata. - -To use large BARs devices (for example, NVIDIA Tesla P100), you need Kata version 1.11.0 or above. - -In order to pass a GPU to a Kata Container, you need to enable the `hotplug_vfio_on_root_bus` -configuration in the Kata `configuration.toml` file as shown below. 
- -```bash -$ sudo sed -i -e 's/^# *\(hotplug_vfio_on_root_bus\).*=.*$/\1 = true/g' /usr/share/defaults/kata-containers/configuration.toml -``` - -Make sure you are using the `q35` machine type by verifying `machine_type = "q35"` is -set in the `configuration.toml`. Make sure `pcie_root_port` is set to a positive value. - -After making the above changes, configuration in the `configuration.toml` should look like this: -``` -machine_type = "q35" - -hotplug_vfio_on_root_bus = true -pcie_root_port = 1 -``` - -## GPU passthrough with Kata Containers - -Use the following steps to pass an Intel discrete GPU with Kata: - -1. Find the Bus-Device-Function (BDF) for GPU device: - - ``` - $ sudo lspci -nn -D | grep Display - ``` - - Run the previous command to determine the BDF for the GPU device on host.
- From the previous output, PCI address `0000:29:00.0` is assigned to the hardware GPU device.
- We choose this BDF to use it later to unbind the GPU device from the host for the purpose of demonstration.
- -2. Find the IOMMU group for the GPU device: - - ```bash - $ BDF="0000:29:00.0" - $ readlink -e /sys/bus/pci/devices/$BDF/iommu_group - /sys/kernel/iommu_groups/27 - ``` - - The previous output shows that the GPU belongs to IOMMU group 27. - -3. Bind the GPU to the `vfio-pci` device driver: - - ```bash - $ BDF="0000:29:00.0" - $ DEV="/sys/bus/pci/devices/$BDF" - $ echo "vfio-pci" | sudo tee "$DEV"/driver_override - $ echo $BDF | sudo tee "$DEV"/driver/unbind - $ echo "$BDF" | sudo tee "/sys/bus/pci/drivers_probe" - ``` - - After you run the previous commands, the GPU is bound to `vfio-pci` driver.
- A new directory with the IOMMU group number is created under `/dev/vfio`: - - ```bash - $ ls -l /dev/vfio - total 0 - crw------- 1 root root 241, 0 May 18 15:38 27 - crw-rw-rw- 1 root root 10, 196 May 18 15:37 vfio - ``` - - Later, to return the device to the standard driver, we simply clear the - `driver_override` and re-probe the device, ex: - - ```bash - $ echo | sudo tee "$DEV/preferred_driver" - $ echo $BDF | sudo tee $DEV/driver/unbind - $ echo $BDF | sudo tee /sys/bus/pci/drivers_probe - ``` - -5. Start a Kata container with GPU device: - - ```bash - $ sudo ctr --debug run --runtime "io.containerd.kata.v2" --device "/dev/vfio/27" --rm -t "docker.io/library/archlinux:latest" arch uname -r - - ``` - - Run `lspci` within the container to verify the GPU device is seen in the list of - the PCI devices. Note the vendor-device id of the GPU ("8086:0bd5") in the `lspci` output. - -## SR-IOV mode for Intel Discrete GPUs - -Use the following steps to pass an Intel Graphics device in SR-IOV mode to a Kata Container: - -1. Find the BDF for GPU device: - - ```sh - $ sudo lspci -nn -D | grep Display - 0000:29:00.0 Display controller [0380]: Intel Corporation Ponte Vecchio 1T [8086:0bd5] (rev 2f) - 0000:3a:00.0 Display controller [0380]: Intel Corporation Ponte Vecchio 1T [8086:0bd5] (rev 2f) - 0000:9a:00.0 Display controller [0380]: Intel Corporation Ponte Vecchio 1T [8086:0bd5] (rev 2f) - 0000:ca:00.0 Display controller [0380]: Intel Corporation Ponte Vecchio 1T [8086:0bd5] (rev 2f) - ``` - - Run the previous command to find out the BDF for the GPU device on host. - We choose the GPU with PCI address "0000:3a:00.0" to assign a GPU SR-IOV interface. - -2. 
Carve out SR-IOV slice for the GPU: - - List our total possible SR-IOV virtual interfaces for the GPU: - - ```bash - $ BDF="0000:3a:00.0" - $ cat "/sys/bus/pci/devices/$BDF/sriov_totalvfs" - 63 - ``` - - Create SR-IOV interfaces for the GPU: - ```sh - $ echo 4 | sudo tee /sys/bus/pci/devices/$BDF/sriov_numvfs - 4 - $ sudo lspci | grep Display - 29:00.0 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f) - 3a:00.0 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f) - 3a:00.1 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f) - 3a:00.2 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f) - 3a:00.3 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f) - 3a:00.4 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f) - 9a:00.0 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f) - ca:00.0 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f) - ``` - The above output shows the SR-IOV interfaces created for the GPU. - -3. Find the IOMMU group for the GPU SR-IOV interface(VGPU): - - ```bash - $ BDF="0000:3a:00:1" - $ readlink -e "/sys/bus/pci/devices/$BDF/iommu_group" - /sys/kernel/iommu_groups/437 - $ ls -l /dev/vfio - total 0 - crw------- 1 root root 241, 0 May 18 11:30 437 - crw-rw-rw- 1 root root 10, 196 May 18 11:29 vfio - ``` - - Now you can use the device node `/dev/vfio/437` in docker command line to pass - the VGPU to a Kata Container. - -4. 
Start a Kata Containers container with GPU device enabled: - - ```bash - $ sudo ctr --debug run --runtime "io.containerd.kata.v2" --device /dev/vfio/437 --rm -t "docker.io/library/archlinux:latest" arch uname -r - ``` diff --git a/docs/use-cases/Intel-GPU-passthrough-and-Kata.md b/docs/use-cases/Intel-GPU-passthrough-and-Kata.md deleted file mode 100644 index ce18ead804..0000000000 --- a/docs/use-cases/Intel-GPU-passthrough-and-Kata.md +++ /dev/null @@ -1,287 +0,0 @@ -# Using Intel GPU device with Kata Containers - -An Intel Graphics device can be passed to a Kata Containers container using GPU -passthrough (Intel GVT-d) as well as GPU mediated passthrough (Intel GVT-g). - -Intel GVT-d (one VM to one physical GPU) also named as Intel-Graphics-Device -passthrough feature is one flavor of graphics virtualization approach. -This flavor allows direct assignment of an entire GPU to a single user, -passing the native driver capabilities through the hypervisor without any limitations. - -Intel GVT-g (multiple VMs to one physical GPU) is a full GPU virtualization solution -with mediated pass-through.
-A virtual GPU instance is maintained for each VM, with part of performance critical -resources, directly assigned. The ability to run a native graphics driver inside a -VM without hypervisor intervention in performance critical paths, achieves a good -balance among performance, feature, and sharing capability. - -| Technology | Description | Behaviour | Detail | -|-|-|-|-| -| Intel GVT-d | GPU passthrough | Physical GPU assigned to a single VM | Direct GPU assignment to VM without limitation | -| Intel GVT-g | GPU sharing | Physical GPU shared by multiple VMs | Mediated passthrough | - -## Hardware Requirements - - - For client platforms, 5th generation Intel® Core Processor Graphics or higher are required. - - For server platforms, E3_v4 or higher Xeon Processor Graphics are required. - -The following steps outline the workflow for using an Intel Graphics device with Kata. - -## Host Kernel Requirements - -The following configurations need to be enabled on your host kernel: - -``` -CONFIG_VFIO_IOMMU_TYPE1=m -CONFIG_VFIO=m -CONFIG_VFIO_PCI=m -CONFIG_VFIO_MDEV=m -CONFIG_VFIO_MDEV_DEVICE=m -CONFIG_DRM_I915_GVT=m -CONFIG_DRM_I915_GVT_KVMGT=m -``` - -Your host kernel needs to be booted with `intel_iommu=on` on the kernel command -line. - -## Install and configure Kata Containers - -To use this feature, you need Kata version 1.3.0 or above. -Follow the [Kata Containers setup instructions](../install/README.md) -to install the latest version of Kata. - -In order to pass a GPU to a Kata Container, you need to enable the `hotplug_vfio_on_root_bus` -configuration in the Kata `configuration.toml` file as shown below. - -``` -$ sudo sed -i -e 's/^# *\(hotplug_vfio_on_root_bus\).*=.*$/\1 = true/g' /usr/share/defaults/kata-containers/configuration.toml -``` - -Make sure you are using the `q35` machine type by verifying `machine_type = "q35"` is -set in the `configuration.toml`. Make sure `pcie_root_port` is set to a positive value. 
- -## Build Kata Containers kernel with GPU support - -The default guest kernel installed with Kata Containers does not provide GPU support. -To use an Intel GPU with Kata Containers, you need to build a kernel with the necessary -GPU support. - -The following i915 kernel config options need to be enabled: -``` -CONFIG_DRM=y -CONFIG_DRM_I915=y -CONFIG_DRM_I915_USERPTR=y -``` - -Build the Kata Containers kernel with the previous config options, using the instructions -described in [Building Kata Containers kernel](../../tools/packaging/kernel). -For further details on building and installing guest kernels, see [the developer guide](../Developer-Guide.md#install-guest-kernel-images). - -There is an easy way to build a guest kernel that supports Intel GPU: -``` -## Build guest kernel with ../../tools/packaging/kernel - -# Prepare (download guest kernel source, generate .config) -$ ./build-kernel.sh -g intel -f setup - -# Build guest kernel -$ ./build-kernel.sh -g intel build - -# Install guest kernel -$ sudo -E ./build-kernel.sh -g intel install -/usr/share/kata-containers/vmlinux-intel-gpu.container -> vmlinux-5.4.15-70-intel-gpu -/usr/share/kata-containers/vmlinuz-intel-gpu.container -> vmlinuz-5.4.15-70-intel-gpu -``` - -Before using the new guest kernel, please update the `kernel` parameters in `configuration.toml`. -``` -kernel = "/usr/share/kata-containers/vmlinuz-intel-gpu.container" -``` - -## GVT-d with Kata Containers - -Use the following steps to pass an Intel Graphics device in GVT-d mode with Kata: - -1. Find the Bus-Device-Function (BDF) for GPU device: - - ``` - $ sudo lspci -nn -D | grep Graphics - 0000:00:02.0 VGA compatible controller [0300]: Intel Corporation Broadwell-U Integrated Graphics [8086:1616] (rev 09) - ``` - - Run the previous command to determine the BDF for the GPU device on host.
- From the previous output, PCI address `0000:00:02.0` is assigned to the hardware GPU device.
- This BDF is used later to unbind the GPU device from the host.
- "8086 1616" is the device ID of the hardware GPU device. It is used later to - rebind the GPU device to `vfio-pci` driver. - -2. Find the IOMMU group for the GPU device: - - ``` - $ BDF="0000:00:02.0" - $ readlink -e /sys/bus/pci/devices/$BDF/iommu_group - /sys/kernel/iommu_groups/1 - ``` - - The previous output shows that the GPU belongs to IOMMU group 1. - -3. Unbind the GPU: - - ``` - $ echo $BDF | sudo tee /sys/bus/pci/devices/$BDF/driver/unbind - ``` - -4. Bind the GPU to the `vfio-pci` device driver: - - ``` - $ sudo modprobe vfio-pci - $ echo 8086 1616 | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id - $ echo $BDF | sudo tee --append /sys/bus/pci/drivers/vfio-pci/bind - ``` - - After you run the previous commands, the GPU is bound to `vfio-pci` driver.
- A new directory with the IOMMU group number is created under `/dev/vfio`: - - ``` - $ ls -l /dev/vfio - total 0 - crw------- 1 root root 241, 0 May 18 15:38 1 - crw-rw-rw- 1 root root 10, 196 May 18 15:37 vfio - ``` - -5. Start a Kata container with GPU device: - - ``` - $ sudo docker run -it --runtime=kata-runtime --rm --device /dev/vfio/1 -v /dev:/dev debian /bin/bash - ``` - - Run `lspci` within the container to verify the GPU device is seen in the list of - the PCI devices. Note the vendor-device id of the GPU ("8086:1616") in the `lspci` output. - - ``` - $ lspci -nn -D - 0000:00:00.0 Class [0600]: Device [8086:1237] (rev 02) - 0000:00:01.0 Class [0601]: Device [8086:7000] - 0000:00:01.1 Class [0101]: Device [8086:7010] - 0000:00:01.3 Class [0680]: Device [8086:7113] (rev 03) - 0000:00:02.0 Class [0604]: Device [1b36:0001] - 0000:00:03.0 Class [0780]: Device [1af4:1003] - 0000:00:04.0 Class [0100]: Device [1af4:1004] - 0000:00:05.0 Class [0002]: Device [1af4:1009] - 0000:00:06.0 Class [0200]: Device [1af4:1000] - 0000:00:0f.0 Class [0300]: Device [8086:1616] (rev 09) - ``` - - Additionally, you can access the device node for the graphics device: - - ``` - $ ls /dev/dri - card0 renderD128 - ``` - -## GVT-g with Kata Containers - -For GVT-g, you append `i915.enable_gvt=1` in addition to `intel_iommu=on` -on your host kernel command line and then reboot your host. - -Use the following steps to pass an Intel Graphics device in GVT-g mode to a Kata Container: - -1. Find the BDF for GPU device: - - ``` - $ sudo lspci -nn -D | grep Graphics - 0000:00:02.0 VGA compatible controller [0300]: Intel Corporation Broadwell-U Integrated Graphics [8086:1616] (rev 09) - ``` - - Run the previous command to find out the BDF for the GPU device on host. - The previous output shows PCI address "0000:00:02.0" is assigned to the GPU device. - -2. 
Choose the MDEV (Mediated Device) type for VGPU (Virtual GPU): - - For background on `mdev` types, please follow this [kernel documentation](https://github.com/torvalds/linux/blob/master/Documentation/driver-api/vfio-mediated-device.rst). - - * List out the `mdev` types for the VGPU: - - ``` - $ BDF="0000:00:02.0" - - $ ls /sys/devices/pci0000:00/$BDF/mdev_supported_types - i915-GVTg_V4_1 i915-GVTg_V4_2 i915-GVTg_V4_4 i915-GVTg_V4_8 - ``` - - * Inspect the `mdev` types and choose one that fits your requirement: - - ``` - $ cd /sys/devices/pci0000:00/0000:00:02.0/mdev_supported_types/i915-GVTg_V4_8 && ls - available_instances create description device_api devices - - $ cat description - low_gm_size: 64MB - high_gm_size: 384MB - fence: 4 - resolution: 1024x768 - weight: 2 - - $ cat available_instances - 7 - ``` - - The output of file `description` represents the GPU resources that are - assigned to the VGPU with specified MDEV type.The output of file `available_instances` - represents the remaining amount of VGPUs you can create with specified MDEV type. - -3. Create a VGPU: - - * Generate a UUID: - - ``` - $ gpu_uuid=$(uuid) - ``` - - * Write the UUID to the `create` file under the chosen `mdev` type: - - ``` - $ echo $(gpu_uuid) | sudo tee /sys/devices/pci0000:00/0000:00:02.0/mdev_supported_types/i915-GVTg_V4_8/create - ``` - -4. Find the IOMMU group for the VGPU: - - ``` - $ ls -la /sys/devices/pci0000:00/0000:00:02.0/mdev_supported_types/i915-GVTg_V4_8/devices/${gpu_uuid}/iommu_group - lrwxrwxrwx 1 root root 0 May 18 14:35 devices/bbc4aafe-5807-11e8-a43e-03533cceae7d/iommu_group -> ../../../../kernel/iommu_groups/0 - - $ ls -l /dev/vfio - total 0 - crw------- 1 root root 241, 0 May 18 11:30 0 - crw-rw-rw- 1 root root 10, 196 May 18 11:29 vfio - ``` - - The IOMMU group "0" is created from the previous output.
- Now you can use the device node `/dev/vfio/0` in docker command line to pass - the VGPU to a Kata Container. - -5. Start Kata container with GPU device enabled: - - ``` - $ sudo docker run -it --runtime=kata-runtime --rm --device /dev/vfio/0 -v /dev:/dev debian /bin/bash - $ lspci -nn -D - 0000:00:00.0 Class [0600]: Device [8086:1237] (rev 02) - 0000:00:01.0 Class [0601]: Device [8086:7000] - 0000:00:01.1 Class [0101]: Device [8086:7010] - 0000:00:01.3 Class [0680]: Device [8086:7113] (rev 03) - 0000:00:02.0 Class [0604]: Device [1b36:0001] - 0000:00:03.0 Class [0780]: Device [1af4:1003] - 0000:00:04.0 Class [0100]: Device [1af4:1004] - 0000:00:05.0 Class [0002]: Device [1af4:1009] - 0000:00:06.0 Class [0200]: Device [1af4:1000] - 0000:00:0f.0 Class [0300]: Device [8086:1616] (rev 09) - ``` - - BDF "0000:00:0f.0" is assigned to the VGPU device. - - Additionally, you can access the device node for the graphics device: - - ``` - $ ls /dev/dri - card0 renderD128 - ``` diff --git a/docs/use-cases/using-SPDK-vhostuser-and-kata.md b/docs/use-cases/using-SPDK-vhostuser-and-kata.md index ae75930aeb..da07b71196 100644 --- a/docs/use-cases/using-SPDK-vhostuser-and-kata.md +++ b/docs/use-cases/using-SPDK-vhostuser-and-kata.md @@ -1,19 +1,15 @@ # Setup to run SPDK vhost-user devices with Kata Containers -> **Note:** This guide only applies to QEMU, since the vhost-user storage -> device is only available for QEMU now. The enablement work on other -> hypervisors is still ongoing. +> **Note:** This guide applies to both **runtime-rs with Dragonball** and **QEMU** hypervisors. For runtime-rs, the procedure is simplified as there is no need to manually create device nodes. ## SPDK vhost-user Target Overview -The Storage Performance Development Kit (SPDK) provides a set of tools and -libraries for writing high performance, scalable, user-mode storage applications. 
+The Storage Performance Development Kit (SPDK) provides a set of tools and libraries for writing high performance, scalable, user-mode storage applications. virtio, vhost and vhost-user: -- virtio is an efficient way to transport data for virtual environments and -guests. It is most commonly used in QEMU VMs, where the VM itself exposes a -virtual PCI device and the guest OS communicates with it using a specific virtio -PCI driver. Its diagram is: + +- virtio is an efficient way to transport data for virtual environments and guests. It is most commonly used in QEMU VMs, where the VM itself exposes a virtual PCI device and the guest OS communicates with it using a specific virtio PCI driver. Its diagram is: + ``` +---------+------+--------+----------+--+ | +------+-------------------+ | @@ -42,6 +38,7 @@ uses the same virtio queue layout as virtio to allow vhost devices to be mapped directly to virtio devices. The initial vhost implementation is a part of the Linux kernel and uses an ioctl interface to communicate with userspace applications. Its diagram is: + ``` +---------+------+--------+----------+--+ | +------+-------------------+ | @@ -65,9 +62,8 @@ applications. Its diagram is: +---------------------------------------+ ``` -- vhost-user implements the control plane through Unix domain socket to establish -virtio queue sharing with a user space process on the same host. SPDK exposes -vhost devices via the vhost-user protocol. Its diagram is: +- vhost-user implements the control plane through Unix domain socket to establish virtio queue sharing with a user space process on the same host. SPDK exposes vhost devices via the vhost-user protocol. Its diagram is: + ``` +----------------+------+--+----------+-+ | +------+-------------+ | @@ -95,169 +91,159 @@ vhost devices via the vhost-user protocol. Its diagram is: +---------------------------------------+ ``` -SPDK vhost is a vhost-user slave server. 
It exposes Unix domain sockets and
-allows external applications to connect. It is capable of exposing virtualized
-storage devices to QEMU instances or other arbitrary processes.
+SPDK vhost is a vhost-user slave server. It exposes Unix domain sockets and allows external applications to connect. It is capable of exposing virtualized storage devices to QEMU instances or other arbitrary processes.

-Currently, the SPDK vhost-user target can exposes these types of virtualized
-devices:
+Currently, the SPDK vhost-user target can expose several types of virtualized devices, but the most commonly used one in Kata Containers is the block device, which is supported by both runtime-rs with Dragonball and QEMU hypervisors:

- `vhost-user-blk`
-- `vhost-user-scsi`
-- `vhost-user-nvme` (deprecated from SPDK 21.07 release)
+
+This exposes a device that appears as an ordinary block device in the guest. It is suitable for workloads that require high performance and low latency, such as databases or other I/O-intensive applications.

For more information, visit [SPDK](https://spdk.io) and [SPDK vhost-user target](https://spdk.io/doc/vhost.html).

-## Install and setup SPDK vhost-user target
+## Prerequisites

-### Get source code and build SPDK
+- A Kubernetes cluster with Kata Containers enabled (runtime-rs with Dragonball or QEMU)
+- SPDK built and `spdk_tgt` available
+- For Kubernetes CSI integration: `csi-kata-directvolume` deployed

-Following the SPDK [getting started guide](https://spdk.io/doc/getting_started.html).
+## Method 1: Using CSI Driver (Recommended for Kubernetes)

-### Run SPDK vhost-user target
+This is the recommended method for Kubernetes environments, leveraging the `csi-kata-directvolume` CSI driver.

-First, run the SPDK `setup.sh` script to setup some hugepages for the SPDK vhost
-target application. We recommend you use a minimum of 4GiB, enough for the SPDK
-vhost target and the virtual machine.
-This will allocate 4096MiB (4GiB) of hugepages, and avoid binding PCI devices:
+
+### 1. Start SPDK Service
+
+```sh
+$ export SPDK_DEVEL=
+$ export VHU_UDS_PATH=/var/lib/spdk/vhost
+
+# Reset and allocate hugepages
+$ cd $SPDK_DEVEL
+$ sudo ./scripts/setup.sh reset
+$ sudo sysctl -w vm.nr_hugepages=2048
+$ sudo HUGEMEM=4096 ./scripts/setup.sh
+
+# Start SPDK vhost target
+$ sudo mkdir -p $VHU_UDS_PATH
+$ sudo $SPDK_DEVEL/build/bin/spdk_tgt -S $VHU_UDS_PATH -s 1024 -m 0x3 &
+```
+
+> **Notes:**
+>
+> - `-s 1024`: size of the hugepage memory pool in MB.
+> - `-m 0x3`: CPU mask specifying which cores SPDK will use.
+> - If the `vfio-pci` driver is supported, use `DRIVER_OVERRIDE=vfio-pci` with `setup.sh`.
+
+### 2. Deploy CSI Driver and Kubernetes Resources
+
+Deploy the CSI driver following the [deployment guide](../../src/tools/csi-kata-directvolume/docs/deploy-csi-kata-directvol.md).
+
+Create StorageClass, PVC, and Pod:
+
+```sh
+$ cd kata-containers/src/tools/csi-kata-directvolume/examples/pod-with-spdkvol
+$ kubectl apply -f csi-storageclass.yaml
+$ kubectl apply -f csi-pvc.yaml
+$ kubectl apply -f csi-app.yaml
+```
+
+This creates:
+
+- Storage Class `spdk-test-adapted` with `volumetype=spdkvol`
+- PVC `kata-spdk-directvolume-pvc`
+- Pod `spdk-pod-test`
+
+### 3. Verify the Volume
+
+Check the mounted block device inside the pod:
+
+```sh
+$ kubectl exec -it spdk-pod-test -- /bin/sh
+
+$ lsblk
+NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
+vda 254:0 0 256M 1 disk
+└─vda1 254:1 0 253M 1 part
+vdb 254:16 0 2G 0 disk /data
+
+$ echo "hello spdk" > /data/test.txt
+$ cat /data/test.txt
+hello spdk
+```
+
+The SPDK-backed volume `/dev/vdb` is mounted to `/data` inside the container.
+
+### 4. Cleanup
+
+```sh
+$ kubectl delete -f csi-app.yaml
+$ kubectl delete -f csi-pvc.yaml
+$ kubectl delete -f csi-storageclass.yaml
+```
+
+## Method 2: Using kata-ctl direct-volume (For Manual Setup)
+
+This method is suitable for manual testing or non-Kubernetes environments using containerd.
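Before walking through the manual steps, it can help to confirm the host has what this method relies on. The sketch below is illustrative only and not part of the official flow; it assumes a Linux host and that `kata-ctl` should already be on `PATH`:

```shell
# Illustrative pre-flight check (assumptions: Linux host, kata-ctl expected on PATH):
# hugepage accounting must be visible, and kata-ctl must be installed.
grep -E 'HugePages_(Total|Free)' /proc/meminfo
if command -v kata-ctl >/dev/null 2>&1; then
    echo "kata-ctl found: $(command -v kata-ctl)"
else
    echo "kata-ctl missing - install the Kata tools first"
fi
```

If `HugePages_Free` stays at zero after the `setup.sh` step below, the SPDK target will fail to allocate its memory pool.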
+ +### 1. Start SPDK vhost target and Create Block Device ```bash -$ sudo HUGEMEM=4096 PCI_WHITELIST="none" scripts/setup.sh +$ export SPDK_DEVEL= +$ export VHU_UDS_PATH=/tmp/vhu-targets +$ export RAW_DISKS= # e.g., export RAW_DISKS=/tmp/rawdisks + +# Reset and setup hugepages +$ sudo ${SPDK_DEVEL}/scripts/setup.sh reset +$ sudo sysctl -w vm.nr_hugepages=2048 +$ sudo HUGEMEM=4096 DRIVER_OVERRIDE=vfio-pci ${SPDK_DEVEL}/scripts/setup.sh + +# Start SPDK vhost target +$ sudo ${SPDK_DEVEL}/build/bin/spdk_tgt -S $VHU_UDS_PATH -s 1024 -m 0x3 & ``` -Then, take directory `/var/run/kata-containers/vhost-user` as Kata's vhost-user -device directory. Make subdirectories for vhost-user sockets and device nodes: +Create a vhost controller: ```bash -$ sudo mkdir -p /var/run/kata-containers/vhost-user/ -$ sudo mkdir -p /var/run/kata-containers/vhost-user/block/ -$ sudo mkdir -p /var/run/kata-containers/vhost-user/block/sockets/ -$ sudo mkdir -p /var/run/kata-containers/vhost-user/block/devices/ +# Create raw disk +$ mkdir -p "${RAW_DISKS}" # ensure the directory exists +$ sudo dd if=/dev/zero of=${RAW_DISKS}/rawdisk01.20g bs=1M count=20480 + +# Create AIO bdev +$ sudo ${SPDK_DEVEL}/scripts/rpc.py bdev_aio_create ${RAW_DISKS}/rawdisk01.20g vhu-rawdisk01.20g 512 + +# Create vhost-user-blk controller +$ sudo ${SPDK_DEVEL}/scripts/rpc.py vhost_create_blk_controller vhost-blk-rawdisk01.sock vhu-rawdisk01.20g ``` -For more details, see section [Host setup for vhost-user devices](#host-setup-for-vhost-user-devices). +A vhost controller `vhost-blk-rawdisk01.sock` is created under `$VHU_UDS_PATH/`. -Next, start the SPDK vhost target application. The following command will start -vhost on the first CPU core with all future socket files placed in -`/var/run/kata-containers/vhost-user/block/sockets/`: +### 2. Configure Direct Volume with kata-ctl + +For runtime-rs with Dragonball, there is no need to manually create device nodes. 
Use `kata-ctl direct-volume add`:

```bash
-$ sudo app/spdk_tgt/spdk_tgt -S /var/run/kata-containers/vhost-user/block/sockets/ &
+# Add direct volume
+$ sudo kata-ctl direct-volume add /kubelet/kata-test-vol-001/volume001 "{\"device\": \"${VHU_UDS_PATH}/vhost-blk-rawdisk01.sock\", \"volume_type\":\"spdkvol\", \"fs_type\": \"ext4\", \"metadata\":{}, \"options\": []}"
```

-To list all available vhost options run the following command:
+The volume info is stored at `/run/kata-containers/shared/direct-volumes/` with an encoded path.
+
+### 3. Run a Kata Container

```bash
-$ app/spdk_tgt/spdk_tgt -h
+# For runtime-rs with Dragonball
+$ IMAGE=docker.io/library/ubuntu:latest
+$ sudo ctr run -t --rm --runtime io.containerd.kata.v2 \
+  --mount type=spdkvol,src=/kubelet/kata-test-vol-001/volume001,dst=/disk001,options=rbind:rw \
+  "$IMAGE" kata-spdk-vol-test /bin/bash
```

-Create an experimental `vhost-user-blk` device based on memory directly:
+Inside the container, the SPDK volume will be available at `/disk001`.

-- The following RPC will create a 64MB memory block device named `Malloc0`
-with 4096-byte block size:
+## Additional Resources

-```bash
-$ sudo scripts/rpc.py bdev_malloc_create 64 4096 -b Malloc0
-```
+- [How to run Kata Containers with Kinds of Block Volumes](../how-to/how-to-run-kata-containers-with-kinds-of-Block-Volumes.md)
+- [CSI Direct Volume Driver README](../../src/tools/csi-kata-directvolume/README.md)
+- [SPDK Usage Guide for CSI](../../src/tools/csi-kata-directvolume/docs/spdk-usage.md)
+- [Direct Block Device Assignment Design](../design/direct-blk-device-assignment.md)

-- The following RPC will create a `vhost-user-blk` device exposing `Malloc0`
-block device.
The device will be accessible via -`/var/run/kata-containers/vhost-user/block/sockets/vhostblk0`: - -```bash -$ sudo scripts/rpc.py vhost_create_blk_controller vhostblk0 Malloc0 -``` - -## Host setup for vhost-user devices - -Considering the OCI specification and characteristics of vhost-user device, -Kata has chosen to use Linux reserved the block major range `240-254` -to map each vhost-user block type to a major. Also a specific directory is -used for vhost-user devices. - -The base directory for vhost-user device is a configurable value, -with the default being `/var/run/kata-containers/vhost-user`. It can be -configured by parameter `vhost_user_store_path` in [Kata TOML configuration file](../../src/runtime/README.md#configuration). - -Currently, the vhost-user storage device is not enabled by default, so -the user should enable it explicitly inside the Kata TOML configuration -file by setting `enable_vhost_user_store = true`. Since SPDK vhost-user target -requires hugepages, hugepages should also be enabled inside the Kata TOML -configuration file by setting `enable_hugepages = true`. -Here is the conclusion of parameter setting for vhost-user storage device: - -```toml -enable_hugepages = true -enable_vhost_user_store = true -vhost_user_store_path = "" -``` - -> **Note:** These parameters are under `[hypervisor.qemu]` section in Kata -> TOML configuration file. If they are absent, users should still add them -> under `[hypervisor.qemu]` section. - - -For the subdirectories of `vhost_user_store_path`: -- `block` is used for block device; -- `block/sockets` is where we expect UNIX domain sockets for vhost-user -block devices to live; -- `block/devices` is where simulated block device nodes for vhost-user -block devices are created. - -For example, if using the default directory `/var/run/kata-containers/vhost-user`, -UNIX domain sockets for vhost-user block device are under `/var/run/kata-containers/vhost-user/block/sockets/`. 
-Device nodes for vhost-user block device are under `/var/run/kata-containers/vhost-user/block/devices/`. - -Currently, Kata has chosen major number 241 to map to `vhost-user-blk` devices. -For `vhost-user-blk` device named `vhostblk0`, a UNIX domain socket is already -created by SPDK vhost target, and a block device node with major `241` and -minor `0` should be created for it, in order to be recognized by Kata runtime: - -```bash -$ sudo mknod /var/run/kata-containers/vhost-user/block/devices/vhostblk0 b 241 0 -``` - -## Launch a Kata container with SPDK vhost-user block device - -To use `vhost-user-blk` device, use `ctr` to pass a host `vhost-user-blk` -device to the container. In your `config.json`, you should use `devices` -to pass a host device to the container. - -For example (only `vhost-user-blk` listed): - -```json -{ - "linux": { - "devices": [ - { - "path": "/dev/vda", - "type": "b", - "major": 241, - "minor": 0, - "fileMode": 420, - "uid": 0, - "gid": 0 - } - ] - } -} -``` - -With `rootfs` provisioned under `bundle` directory, you can run your SPDK container: - -```bash -$ sudo ctr run -d --runtime io.containerd.run.kata.v2 --config bundle/config.json spdk_container -``` - -Example of performing I/O operations on the `vhost-user-blk` device inside -container: - -``` -$ sudo ctr t exec --exec-id 1 -t spdk_container sh -/ # ls -l /dev/vda -brw-r--r-- 1 root root 254, 0 Jan 20 03:54 /dev/vda -/ # dd if=/dev/vda of=/tmp/ddtest bs=4k count=20 -20+0 records in -20+0 records out -81920 bytes (80.0KB) copied, 0.002996 seconds, 26.1MB/s -```
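As a footnote to the `kata-ctl direct-volume add` step above: the metadata directory under `/run/kata-containers/shared/direct-volumes/` is derived by encoding the published volume path. A minimal sketch of locating it, assuming a URL-safe base64 encoding of the path (verify the exact encoding against your Kata version):

```shell
# Hypothetical helper, not an official interface: derive where direct-volume
# metadata is expected to live, assuming the volume path is URL-safe
# base64-encoded to form the directory name.
VOLUME_PATH=/kubelet/kata-test-vol-001/volume001
ENCODED=$(printf '%s' "$VOLUME_PATH" | base64 | tr -d '\n' | tr '+/' '-_')
echo "/run/kata-containers/shared/direct-volumes/${ENCODED}"
```

Inspecting that directory after `kata-ctl direct-volume add` is a quick way to confirm the volume was registered before launching the container.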