From d6308ffb8c4d156ccb9182a0c3147aa5f8b641dd Mon Sep 17 00:00:00 2001 From: Alex Lyn Date: Mon, 23 Mar 2026 14:55:04 +0800 Subject: [PATCH 1/3] docs: Update SPDK vhost-user guide with CSI driver - Add support for runtime-rs with Dragonball - Add CSI driver integration method for Kubernetes - Add kata-ctl direct-volume method for manual setup - Preserve SPDK vhost-user Target Overview principles - Fix minor typo (can exposes -> can expose) Signed-off-by: Alex Lyn --- .../using-SPDK-vhostuser-and-kata.md | 282 +++++++++--------- 1 file changed, 134 insertions(+), 148 deletions(-) diff --git a/docs/use-cases/using-SPDK-vhostuser-and-kata.md b/docs/use-cases/using-SPDK-vhostuser-and-kata.md index ae75930aeb..da07b71196 100644 --- a/docs/use-cases/using-SPDK-vhostuser-and-kata.md +++ b/docs/use-cases/using-SPDK-vhostuser-and-kata.md @@ -1,19 +1,15 @@ # Setup to run SPDK vhost-user devices with Kata Containers -> **Note:** This guide only applies to QEMU, since the vhost-user storage -> device is only available for QEMU now. The enablement work on other -> hypervisors is still ongoing. +> **Note:** This guide applies to both **runtime-rs with Dragonball** and **QEMU** hypervisors. For runtime-rs, the procedure is simplified as there is no need to manually create device nodes. ## SPDK vhost-user Target Overview -The Storage Performance Development Kit (SPDK) provides a set of tools and -libraries for writing high performance, scalable, user-mode storage applications. +The Storage Performance Development Kit (SPDK) provides a set of tools and libraries for writing high performance, scalable, user-mode storage applications. virtio, vhost and vhost-user: -- virtio is an efficient way to transport data for virtual environments and -guests. It is most commonly used in QEMU VMs, where the VM itself exposes a -virtual PCI device and the guest OS communicates with it using a specific virtio -PCI driver. 
Its diagram is: + +- virtio is an efficient way to transport data for virtual environments and guests. It is most commonly used in QEMU VMs, where the VM itself exposes a virtual PCI device and the guest OS communicates with it using a specific virtio PCI driver. Its diagram is: + ``` +---------+------+--------+----------+--+ | +------+-------------------+ | @@ -42,6 +38,7 @@ uses the same virtio queue layout as virtio to allow vhost devices to be mapped directly to virtio devices. The initial vhost implementation is a part of the Linux kernel and uses an ioctl interface to communicate with userspace applications. Its diagram is: + ``` +---------+------+--------+----------+--+ | +------+-------------------+ | @@ -65,9 +62,8 @@ applications. Its diagram is: +---------------------------------------+ ``` -- vhost-user implements the control plane through Unix domain socket to establish -virtio queue sharing with a user space process on the same host. SPDK exposes -vhost devices via the vhost-user protocol. Its diagram is: +- vhost-user implements the control plane through Unix domain socket to establish virtio queue sharing with a user space process on the same host. SPDK exposes vhost devices via the vhost-user protocol. Its diagram is: + ``` +----------------+------+--+----------+-+ | +------+-------------+ | @@ -95,169 +91,159 @@ vhost devices via the vhost-user protocol. Its diagram is: +---------------------------------------+ ``` -SPDK vhost is a vhost-user slave server. It exposes Unix domain sockets and -allows external applications to connect. It is capable of exposing virtualized -storage devices to QEMU instances or other arbitrary processes. +SPDK vhost is a vhost-user slave server. It exposes Unix domain sockets and allows external applications to connect. It is capable of exposing virtualized storage devices to QEMU instances or other arbitrary processes. 
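Since the SPDK vhost target serves its virtio queues out of hugepage-backed shared memory, a common first troubleshooting step before starting it is confirming that hugepages are actually allocated. A minimal sketch (the `check_hugepages` helper and its meminfo-path argument are illustrative, not part of SPDK or Kata):

```shell
#!/bin/sh
# check_hugepages MIN_PAGES [MEMINFO_PATH]
# Succeeds when HugePages_Total in a meminfo-format file is at least MIN_PAGES.
check_hugepages() {
    min="$1"
    meminfo="${2:-/proc/meminfo}"
    [ -r "$meminfo" ] || return 1
    total=$(awk '/^HugePages_Total:/ {print $2}' "$meminfo")
    [ -n "$total" ] && [ "$total" -ge "$min" ]
}

# Example: the SPDK setup steps in this guide allocate 2048 2MiB pages (4GiB).
if check_hugepages 2048; then
    echo "enough hugepages allocated for the SPDK vhost target"
else
    echo "allocate hugepages first, e.g.: sudo sysctl -w vm.nr_hugepages=2048" >&2
fi
```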
-Currently, the SPDK vhost-user target can exposes these types of virtualized -devices: +Currently, the SPDK vhost-user target can expose several types of virtualized devices, but the most commonly used one in Kata Containers is the block device, which is supported by both runtime-rs with Dragonball and QEMU hypervisors: - `vhost-user-blk` -- `vhost-user-scsi` -- `vhost-user-nvme` (deprecated from SPDK 21.07 release) + +A block device that can be used as a regular block device in the guest. It is suitable for workloads that require high performance and low latency, such as databases or high I/O applications. For more information, visit [SPDK](https://spdk.io) and [SPDK vhost-user target](https://spdk.io/doc/vhost.html). -## Install and setup SPDK vhost-user target +## Prerequisites -### Get source code and build SPDK +- A Kubernetes cluster with Kata Containers enabled (runtime-rs with Dragonball or QEMU) +- SPDK built and `spdk_tgt` available +- For Kubernetes CSI integration: `csi-kata-directvolume` deployed -Following the SPDK [getting started guide](https://spdk.io/doc/getting_started.html). +## Method 1: Using CSI Driver (Recommended for Kubernetes) -### Run SPDK vhost-user target +This is the recommended method for Kubernetes environments, leveraging the `csi-kata-directvolume` CSI driver. -First, run the SPDK `setup.sh` script to setup some hugepages for the SPDK vhost -target application. We recommend you use a minimum of 4GiB, enough for the SPDK -vhost target and the virtual machine. -This will allocate 4096MiB (4GiB) of hugepages, and avoid binding PCI devices: +### 1. 
Start SPDK Service
+
+```sh
+$ export SPDK_DEVEL=<path-to-spdk>
+$ export VHU_UDS_PATH=/var/lib/spdk/vhost
+
+# Reset and allocate hugepages
+$ cd $SPDK_DEVEL
+$ sudo ./scripts/setup.sh reset
+$ sudo sysctl -w vm.nr_hugepages=2048
+$ sudo HUGEMEM=4096 ./scripts/setup.sh
+
+# Start SPDK vhost target
+$ sudo mkdir -p $VHU_UDS_PATH
+$ sudo $SPDK_DEVEL/build/bin/spdk_tgt -S $VHU_UDS_PATH -s 1024 -m 0x3 &
+```
+
+> **Notes:**
+>
+> - `-s 1024`: size of the hugepage memory pool in MB.
+> - `-m 0x3`: CPU mask specifying which cores SPDK will use.
+> - If the `vfio-pci` driver is supported, use `DRIVER_OVERRIDE=vfio-pci` with `setup.sh`.
+
+### 2. Deploy CSI Driver and Kubernetes Resources
+
+Deploy the CSI driver following the [deployment guide](../../src/tools/csi-kata-directvolume/docs/deploy-csi-kata-directvol.md).
+
+Create StorageClass, PVC, and Pod:
+
+```sh
+$ cd kata-containers/src/tools/csi-kata-directvolume/examples/pod-with-spdkvol
+$ kubectl apply -f csi-storageclass.yaml
+$ kubectl apply -f csi-pvc.yaml
+$ kubectl apply -f csi-app.yaml
+```
+
+This creates:
+
+- Storage Class `spdk-test-adapted` with `volumetype=spdkvol`
+- PVC `kata-spdk-directvolume-pvc`
+- Pod `spdk-pod-test`
+
+### 3. Verify the Volume
+
+Check the mounted block device inside the pod:
+
+```sh
+$ kubectl exec -it spdk-pod-test -- /bin/sh
+
+$ lsblk
+NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
+vda    254:0    0  256M  1 disk
+└─vda1 254:1    0  253M  1 part
+vdb    254:16   0    2G  0 disk /data
+
+$ echo "hello spdk" > /data/test.txt
+$ cat /data/test.txt
+hello spdk
+```
+
+The SPDK-backed volume `/dev/vdb` is mounted to `/data` inside the container.
+
+### 4. Cleanup
+
+```sh
+$ kubectl delete -f csi-app.yaml
+$ kubectl delete -f csi-pvc.yaml
+$ kubectl delete -f csi-storageclass.yaml
+```
+
+## Method 2: Using kata-ctl direct-volume (For Manual Setup)
+
+This method is suitable for manual testing or non-Kubernetes environments using containerd.
+
+### 1. 
Start SPDK vhost target and Create Block Device

```bash
$ export SPDK_DEVEL=<path-to-spdk>
$ export VHU_UDS_PATH=/tmp/vhu-targets
$ export RAW_DISKS=<path-to-raw-disks> # e.g., export RAW_DISKS=/tmp/rawdisks

# Reset and setup hugepages
$ sudo ${SPDK_DEVEL}/scripts/setup.sh reset
$ sudo sysctl -w vm.nr_hugepages=2048
$ sudo HUGEMEM=4096 DRIVER_OVERRIDE=vfio-pci ${SPDK_DEVEL}/scripts/setup.sh

# Start SPDK vhost target
$ sudo ${SPDK_DEVEL}/build/bin/spdk_tgt -S $VHU_UDS_PATH -s 1024 -m 0x3 &
```

Create a vhost controller:

```bash
# Create raw disk
$ mkdir -p "${RAW_DISKS}" # ensure the directory exists
$ sudo dd if=/dev/zero of=${RAW_DISKS}/rawdisk01.20g bs=1M count=20480

# Create AIO bdev
$ sudo ${SPDK_DEVEL}/scripts/rpc.py bdev_aio_create ${RAW_DISKS}/rawdisk01.20g vhu-rawdisk01.20g 512

# Create vhost-user-blk controller
$ sudo ${SPDK_DEVEL}/scripts/rpc.py vhost_create_blk_controller vhost-blk-rawdisk01.sock vhu-rawdisk01.20g
```

A vhost controller `vhost-blk-rawdisk01.sock` is created under `$VHU_UDS_PATH/`.

### 2. Configure Direct Volume with kata-ctl

For runtime-rs with Dragonball, there is no need to manually create device nodes. 
Use `kata-ctl direct-volume add`:

```bash
# Add direct volume
$ sudo kata-ctl direct-volume add /kubelet/kata-test-vol-001/volume001 "{\"device\": \"${VHU_UDS_PATH}/vhost-blk-rawdisk01.sock\", \"volume_type\":\"spdkvol\", \"fs_type\": \"ext4\", \"metadata\":{}, \"options\": []}"
```

The volume info is stored at `/run/kata-containers/shared/direct-volumes/` under an encoded path.

### 3. Run a Kata Container

```bash
# For runtime-rs with Dragonball
$ IMAGE=docker.io/library/ubuntu:latest
$ sudo ctr run -t --rm --runtime io.containerd.kata.v2 \
    --mount type=spdkvol,src=/kubelet/kata-test-vol-001/volume001,dst=/disk001,options=rbind:rw \
    "$IMAGE" kata-spdk-vol-test /bin/bash
```

Inside the container, the SPDK volume will be available at `/disk001`.

## Additional Resources

- [How to run Kata Containers with Kinds of Block Volumes](../how-to/how-to-run-kata-containers-with-kinds-of-Block-Volumes.md)
- [CSI Direct Volume Driver README](../../src/tools/csi-kata-directvolume/README.md)
- [SPDK Usage Guide for CSI](../../src/tools/csi-kata-directvolume/docs/spdk-usage.md)
- [Direct Block Device Assignment Design](../design/direct-blk-device-assignment.md)

-- The following RPC will create a `vhost-user-blk` device exposing `Malloc0`
-block device. 
The device will be accessible via -`/var/run/kata-containers/vhost-user/block/sockets/vhostblk0`: - -```bash -$ sudo scripts/rpc.py vhost_create_blk_controller vhostblk0 Malloc0 -``` - -## Host setup for vhost-user devices - -Considering the OCI specification and characteristics of vhost-user device, -Kata has chosen to use Linux reserved the block major range `240-254` -to map each vhost-user block type to a major. Also a specific directory is -used for vhost-user devices. - -The base directory for vhost-user device is a configurable value, -with the default being `/var/run/kata-containers/vhost-user`. It can be -configured by parameter `vhost_user_store_path` in [Kata TOML configuration file](../../src/runtime/README.md#configuration). - -Currently, the vhost-user storage device is not enabled by default, so -the user should enable it explicitly inside the Kata TOML configuration -file by setting `enable_vhost_user_store = true`. Since SPDK vhost-user target -requires hugepages, hugepages should also be enabled inside the Kata TOML -configuration file by setting `enable_hugepages = true`. -Here is the conclusion of parameter setting for vhost-user storage device: - -```toml -enable_hugepages = true -enable_vhost_user_store = true -vhost_user_store_path = "" -``` - -> **Note:** These parameters are under `[hypervisor.qemu]` section in Kata -> TOML configuration file. If they are absent, users should still add them -> under `[hypervisor.qemu]` section. - - -For the subdirectories of `vhost_user_store_path`: -- `block` is used for block device; -- `block/sockets` is where we expect UNIX domain sockets for vhost-user -block devices to live; -- `block/devices` is where simulated block device nodes for vhost-user -block devices are created. - -For example, if using the default directory `/var/run/kata-containers/vhost-user`, -UNIX domain sockets for vhost-user block device are under `/var/run/kata-containers/vhost-user/block/sockets/`. 
-Device nodes for vhost-user block device are under `/var/run/kata-containers/vhost-user/block/devices/`. - -Currently, Kata has chosen major number 241 to map to `vhost-user-blk` devices. -For `vhost-user-blk` device named `vhostblk0`, a UNIX domain socket is already -created by SPDK vhost target, and a block device node with major `241` and -minor `0` should be created for it, in order to be recognized by Kata runtime: - -```bash -$ sudo mknod /var/run/kata-containers/vhost-user/block/devices/vhostblk0 b 241 0 -``` - -## Launch a Kata container with SPDK vhost-user block device - -To use `vhost-user-blk` device, use `ctr` to pass a host `vhost-user-blk` -device to the container. In your `config.json`, you should use `devices` -to pass a host device to the container. - -For example (only `vhost-user-blk` listed): - -```json -{ - "linux": { - "devices": [ - { - "path": "/dev/vda", - "type": "b", - "major": 241, - "minor": 0, - "fileMode": 420, - "uid": 0, - "gid": 0 - } - ] - } -} -``` - -With `rootfs` provisioned under `bundle` directory, you can run your SPDK container: - -```bash -$ sudo ctr run -d --runtime io.containerd.run.kata.v2 --config bundle/config.json spdk_container -``` - -Example of performing I/O operations on the `vhost-user-blk` device inside -container: - -``` -$ sudo ctr t exec --exec-id 1 -t spdk_container sh -/ # ls -l /dev/vda -brw-r--r-- 1 root root 254, 0 Jan 20 03:54 /dev/vda -/ # dd if=/dev/vda of=/tmp/ddtest bs=4k count=20 -20+0 records in -20+0 records out -81920 bytes (80.0KB) copied, 0.002996 seconds, 26.1MB/s -``` From 59609463e0fcd8cdf7b1b05ad1b7730646dc55bb Mon Sep 17 00:00:00 2001 From: Alex Lyn Date: Mon, 23 Mar 2026 16:43:08 +0800 Subject: [PATCH 2/3] docs: Update kernel modules loading document - Restructure document with clearer sections and better readability - Add configuration format examples for both runtimes - Add technical details including data flow and implementation references - Add debugging section for 
troubleshooting Signed-off-by: Alex Lyn --- .../how-to-load-kernel-modules-with-kata.md | 166 ++++++++++++------ 1 file changed, 115 insertions(+), 51 deletions(-) diff --git a/docs/how-to/how-to-load-kernel-modules-with-kata.md b/docs/how-to/how-to-load-kernel-modules-with-kata.md index 24a3546012..978078916d 100644 --- a/docs/how-to/how-to-load-kernel-modules-with-kata.md +++ b/docs/how-to/how-to-load-kernel-modules-with-kata.md @@ -1,73 +1,90 @@ -# Loading kernel modules +# Loading kernel modules in Kata Containers -A new feature for loading kernel modules was introduced in Kata Containers 1.9. -The list of kernel modules and their parameters can be provided using the -configuration file or OCI annotations. The [Kata runtime][1] gives that -information to the [Kata Agent][2] through gRPC when the sandbox is created. -The [Kata Agent][2] will insert the kernel modules using `modprobe(8)`, hence -modules dependencies are resolved automatically. +This document describes how to load kernel modules inside Kata Containers guest VM. -The sandbox will not be started when: +## Overview - * A kernel module is specified and the `modprobe(8)` command is not installed in - the guest or it fails loading the module. - * The module is not available in the guest or it doesn't meet the guest kernel - requirements, like architecture and version. +The kernel modules feature allows you to load specific kernel modules into the guest VM kernel when a sandbox is created. This is useful when your containerized applications require specific kernel functionality that is not built into the guest kernel. -In the following sections are documented the different ways that exist for -loading kernel modules in Kata Containers. +**How it works:** + +1. You specify kernel modules and their parameters via configuration file or OCI annotations +2. The Kata runtime passes this information to the Kata Agent through agent RPC during sandbox creation (gRPC in runtime-go, ttrpc in runtime-rs) +3. 
The Kata Agent loads the modules using `modprobe(8)`, which automatically resolves module dependencies + +**Failure conditions:** + +The sandbox will fail to start if: + +- A kernel module is specified but `modprobe(8)` is not installed in the guest, or it fails to load the module +- The module is not available in the guest or doesn't meet guest kernel requirements (architecture, version, etc.) + +## Configuration Methods - [Using Kata Configuration file](#using-kata-configuration-file) - [Using annotations](#using-annotations) -# Using Kata Configuration file +## Using Kata Configuration file -``` -NOTE: Use this method, only if you need to pass the kernel modules to all -containers. Please use annotations described below to set per pod annotations. -``` +> **Note**: Use this method when you need the kernel modules loaded for all containers. For per-pod configuration, use annotations instead. -The list of kernel modules and parameters can be set in the `kernel_modules` -option as a coma separated list, where each entry in the list specifies a kernel -module and its parameters. Each list element comprises one or more space separated -fields. The first field specifies the module name and subsequent fields specify -individual parameters for the module. +The `kernel_modules` option accepts a list of kernel modules with their parameters. Each list element specifies a module name followed by space-separated parameters. -The following example specifies two modules to load: `e1000e` and `i915`. Two parameters -are specified for the `e1000` module: `InterruptThrottleRate` (which takes an array -of integer values) and `EEE` (which requires a single integer value). 
+### Configuration Format + +**For runtime-go** (`configuration-qemu.toml`, etc.): ```toml -kernel_modules=["e1000e InterruptThrottleRate=3000,3000,3000 EEE=1", "i915"] +[agent.kata] +kernel_modules = ["e1000e InterruptThrottleRate=3000,3000,3000 EEE=1", "i915"] ``` -Not all the container managers allow users provide custom annotations, hence -this is the only way that Kata Containers provide for loading modules when -custom annotations are not supported. +**For runtime-rs** (`configuration-qemu-runtime-rs.toml`, etc.): -There are some limitations with this approach: +```toml +[agent.kata] +kernel_modules = ["e1000e InterruptThrottleRate=3000,3000,3000 EEE=1", "i915"] +``` -* Write access to the Kata configuration file is required. -* The configuration file must be updated when a new container is created, - otherwise the same list of modules is used, even if they are not needed in the - container. +### Example -# Using annotations +The following example loads two modules: -As was mentioned above, not all containers need the same modules, therefore using -the configuration file for specifying the list of kernel modules per [POD][3] can -be a pain. -Unlike the configuration file, [annotations](how-to-set-sandbox-config-kata.md) -provide a way to specify custom configurations per POD. +- `e1000e` with parameters `InterruptThrottleRate=3000,3000,3000` and `EEE=1` +- `i915` with no parameters -The list of kernel modules and parameters can be set using the annotation -`io.katacontainers.config.agent.kernel_modules` as a semicolon separated -list, where the first word of each element is considered as the module name and -the rest as its parameters. +```toml +kernel_modules = ["e1000e InterruptThrottleRate=3000,3000,3000 EEE=1", "i915"] +``` -In the following example two PODs are created, but the kernel modules `e1000e` -and `i915` are inserted only in the POD `pod1`. 
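The entry format described above (module name first, space-separated parameters after) can be sketched in shell. This mirrors, purely for illustration, the `modprobe(8)` invocation the agent effectively runs in the guest; the `entry_to_modprobe` helper is hypothetical, not Kata Agent code:

```shell
#!/bin/sh
# Split one kernel_modules entry into the modprobe command the guest
# would effectively run: first word = module name, rest = parameters.
entry_to_modprobe() {
    # Intentional word-splitting of the entry string.
    set -- $1
    [ $# -ge 1 ] || return 1
    mod="$1"
    shift
    if [ $# -gt 0 ]; then
        echo "modprobe $mod $*"
    else
        echo "modprobe $mod"
    fi
}

entry_to_modprobe "e1000e InterruptThrottleRate=3000,3000,3000 EEE=1"
# → modprobe e1000e InterruptThrottleRate=3000,3000,3000 EEE=1
entry_to_modprobe "i915"
# → modprobe i915
```

Because the agent uses `modprobe(8)`, module dependencies are resolved automatically; nothing beyond the entry string is needed from the user.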
+### Limitations +- Write access to the Kata configuration file is required +- All containers will use the same module list, even if some containers don't need them +- Configuration changes require service restart to take effect + +## Using annotations + +Annotations provide a way to specify kernel modules per pod, which is more flexible than the configuration file approach. + +### Annotation Key + +``` +io.katacontainers.config.agent.kernel_modules +``` + +### Format + +The annotation value uses **semicolon (`;`)** as the separator between modules. Each module specification consists of: + +- Module name (first word) +- Parameters (subsequent words, space-separated) + +Example: `"e1000e EEE=1; i915 enable_ppgtt=0"` + +### Kubernetes Example + +The following example creates two pods, where only `pod1` will have the kernel modules `e1000e` and `i915` loaded: ```yaml apiVersion: v1 @@ -104,6 +121,53 @@ spec: > **Note**: To pass annotations to Kata containers, [CRI-O must be configured correctly](how-to-set-sandbox-config-kata.md#cri-o-configuration) -[1]: ../../src/runtime -[2]: ../../src/agent -[3]: https://kubernetes.io/docs/concepts/workloads/pods/pod/ +## Technical Details + +### Data Flow + +``` + Configuration File / Annotation + │ + ▼ + SandboxConfig.AgentConfig.KernelModules + │ + ▼ + Converted to gRPC KernelModule messages + │ + ▼ + CreateSandboxRequest sent to Agent + │ + ▼ + Agent executes modprobe in guest VM +``` + +### Implementation in Runtimes + +**runtime-go:** + +- Config parsing: `src/runtime/pkg/katautils/config.go` +- Annotation handling: `src/runtime/pkg/oci/utils.go` (`addAgentConfigOverrides()`) +- Module parsing: `src/runtime/virtcontainers/kata_agent.go` (`setupKernelModules()`) + +**runtime-rs:** + +- Config structure: `src/libs/kata-types/src/config/agent.rs` +- Annotation handling: `src/libs/kata-types/src/annotations/mod.rs` (`update_config_by_annotation()`) +- Module parsing: `src/runtime-rs/crates/agent/src/types.rs` 
(`KernelModule::set_kernel_modules()`)
+
+## Debugging
+
+To verify kernel modules are loaded in the guest VM:
+
+```bash
+# Inside the container, run:
+lsmod | grep <module_name>
+
+# Or check modprobe output in the guest VM journal
+```
+
+If module loading fails, check:
+
+1. Module is available in the guest kernel modules directory (`/lib/modules/$(uname -r)`)
+2. Module dependencies are satisfied
+3. Guest kernel version matches module requirements

From 978f40d631d25b18f258baacfd4db497dacdedf3 Mon Sep 17 00:00:00 2001
From: Alex Lyn
Date: Mon, 20 Apr 2026 15:31:24 +0800
Subject: [PATCH 3/3] docs: Remove obsolete guides and update documentation index

This commit prunes the documentation tree by removing files that are
either no longer relevant to the current architecture or have been
superseded by newer guides. Specifically, it removes
Intel-Discrete-GPU-passthrough-and-Kata.md and
Intel-GPU-passthrough-and-Kata.md, and updates the
using-Intel-QAT-and-kata.md entry in docs/.nav.yml.

Refining the documentation helps ensure that new contributors find
accurate and up-to-date information.

Signed-off-by: Alex Lyn --- docs/.nav.yml | 2 +- docs/use-cases/GPU-passthrough-and-Kata.md | 2 +- ...Intel-Discrete-GPU-passthrough-and-Kata.md | 274 ----------------- .../Intel-GPU-passthrough-and-Kata.md | 287 ------------------ 4 files changed, 2 insertions(+), 563 deletions(-) delete mode 100644 docs/use-cases/Intel-Discrete-GPU-passthrough-and-Kata.md delete mode 100644 docs/use-cases/Intel-GPU-passthrough-and-Kata.md diff --git a/docs/.nav.yml b/docs/.nav.yml index 15dab995cc..7dc1b12238 100644 --- a/docs/.nav.yml +++ b/docs/.nav.yml @@ -15,7 +15,7 @@ nav: - Use Cases: - NVIDIA GPU Passthrough: use-cases/NVIDIA-GPU-passthrough-and-Kata-QEMU.md - NVIDIA vGPU: use-cases/NVIDIA-GPU-passthrough-and-Kata.md - - Intel Discrete GPU: use-cases/Intel-Discrete-GPU-passthrough-and-Kata.md + - Intel QAT: use-cases/using-Intel-QAT-and-kata.md - Contributing: - Documentation: doc-contributing.md - Misc: diff --git a/docs/use-cases/GPU-passthrough-and-Kata.md b/docs/use-cases/GPU-passthrough-and-Kata.md index 40b5297eea..8fdbfad6f4 100644 --- a/docs/use-cases/GPU-passthrough-and-Kata.md +++ b/docs/use-cases/GPU-passthrough-and-Kata.md @@ -2,5 +2,5 @@ Kata Containers supports passing certain GPUs from the host into the container. 
Select the GPU vendor for detailed information: -- [Intel Discrete GPUs](Intel-Discrete-GPU-passthrough-and-Kata.md)/[Intel Integrated GPUs](Intel-GPU-passthrough-and-Kata.md) - [NVIDIA GPUs](NVIDIA-GPU-passthrough-and-Kata.md) and [Enabling NVIDIA GPU workloads using GPU passthrough with Kata Containers](NVIDIA-GPU-passthrough-and-Kata-QEMU.md) +- PLACE HOLDER: for other GPU vendors (e.g., AMD, Intel) diff --git a/docs/use-cases/Intel-Discrete-GPU-passthrough-and-Kata.md b/docs/use-cases/Intel-Discrete-GPU-passthrough-and-Kata.md deleted file mode 100644 index 73ccf613c5..0000000000 --- a/docs/use-cases/Intel-Discrete-GPU-passthrough-and-Kata.md +++ /dev/null @@ -1,274 +0,0 @@ -# Using Intel Discrete GPU device with Kata Containers - -This guide covers the use case for passing Intel Discrete GPUs to Kata. -These include the Intel® Data Center GPU Max Series and Intel® Data Center GPU Flex Series. -For integrated GPUs please refer to [Integrate-Intel-GPUs-with-Kata](Intel-GPU-passthrough-and-Kata.md) - -> **Note:** These instructions are for a system that has an x86_64 CPU. - -An Intel Discrete GPU can be passed to a Kata Container using GPU passthrough, -or SR-IOV passthrough. - -In Intel GPU pass-through mode, an entire physical GPU is directly assigned to one VM. -In this mode of operation, the GPU is accessed exclusively by the Intel driver running in -the VM to which it is assigned. The GPU is not shared among VMs. - -With SR-IOV mode, it is possible to pass a Virtual GPU instance to a virtual machine. -With this, multiple Virtual GPU instances can be carved out of a single physical GPU -and be passed to different VMs, allowing the GPU to be shared. 
- -| Technology | Description | -|-|-| -| GPU passthrough | Physical GPU assigned to a single VM | -| SR-IOV passthrough | Physical GPU shared by multiple VMs | - -## Hardware Requirements - -Intel GPUs Recommended for Virtualization: - -- Intel® Data Center GPU Max Series (`Ponte Vecchio`) -- Intel® Data Center GPU Flex Series (`Arctic Sound-M`) -- Intel® Data Center GPU Arc Series - -The following steps outline the workflow for using an Intel Graphics device with Kata Containers. - -## Host BIOS requirements - -Hardware such as Intel Max and Flex series require larger PCI BARs. - -For large BAR devices, MMIO mapping above the 4GB address space should be enabled in the PCI configuration of the BIOS. - -Some hardware vendors use a different name in the BIOS, such as: - -- Above 4GB Decoding -- Memory Hole for PCI MMIO -- Memory Mapped I/O above 4GB - -## Host Kernel Requirements - -For device passthrough to work with the Max and Flex Series, an out of tree kernel driver is required. - -For Ubuntu 22.04 server, follow these instructions to install the out of tree GPU driver: -```bash -$ sudo apt update -$ sudo apt install -y gpg-agent wget -$ wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \ - sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg -$ source /etc/os-release -$ echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu ${VERSION_CODENAME}/lts/2350 unified" | \ - sudo tee /etc/apt/sources.list.d/intel-gpu-${VERSION_CODENAME}.list -$ sudo apt update -$ sudo apt install -y linux-headers-"$(uname -r)" flex bison intel-fw-gpu intel-i915-dkms xpu-smi -$ sudo reboot -``` -For support on other distributions, please refer to [DGPU-docs](https://dgpu-docs.intel.com/driver/installation.html) - -You can also install the driver from source which is maintained at [intel-gpu-i915-backports](https://github.com/intel-gpu/intel-gpu-i915-backports) -Detailed instructions for reference 
can be found at: https://github.com/intel-gpu/intel-gpu-i915-backports/blob/backport/main/docs/README_ubuntu.md. - -Below are the steps for installing the driver from source on an Ubuntu 22.04 LTS system: -```bash -$ export I915_BRANCH="backport/main" -$ git clone -b ${I915_BRANCH} --depth 1 https://github.com/intel-gpu/intel-gpu-i915-backports.git -$ cd intel-gpu-i915-backports/ -$ sudo apt install -y dkms make debhelper devscripts build-essential flex bison mawk -$ sudo apt install -y linux-headers-"$(uname -r)" linux-image-unsigned-"$(uname -r)" -$ make i915dkmsdeb-pkg -``` -The above make command will create Debian package in parent folder: `intel-i915-dkms_..deb` -Install the package as: -```bash -$ sudo dpkg -i intel-i915-dkms_..deb -$ sudo reboot -``` - -Additionally, verify that the following kernel configs are enabled for your host kernel: -``` -CONFIG_VFIO -CONFIG_VFIO_IOMMU_TYPE1 -CONFIG_VFIO_PCI -``` - -## Host kernel command line - -Your host kernel needs to be booted with `intel_iommu=on` and `i915.enable_iaf=0` on the kernel command -line. - -1. Run the following to change the kernel command line using grub: -```bash -$ sudo vim /etc/default/grub -``` - -2. At the end of the GRUB_CMDLINE_LINUX_DEFAULT append the below line: - -`intel_iommu=on iommu=pt i915.max_vfs=63 i915.enable_iaf=0` - -3. Update grub as per OS distribution: - -For Ubuntu: -```bash -$ sudo update-grub -``` - -For CentOS/RHEL: -```bash -$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg -``` - -4. Reboot the system -```bash -$ sudo reboot -``` - -## Install and configure Kata Containers - -To use this feature, you need Kata version 1.3.0 or above. -Follow the [Kata Containers setup instructions](../install/README.md) -to install the latest version of Kata. - -To use large BARs devices (for example, NVIDIA Tesla P100), you need Kata version 1.11.0 or above. 
- -In order to pass a GPU to a Kata Container, you need to enable the `hotplug_vfio_on_root_bus` -configuration in the Kata `configuration.toml` file as shown below. - -```bash -$ sudo sed -i -e 's/^# *\(hotplug_vfio_on_root_bus\).*=.*$/\1 = true/g' /usr/share/defaults/kata-containers/configuration.toml -``` - -Make sure you are using the `q35` machine type by verifying `machine_type = "q35"` is -set in the `configuration.toml`. Make sure `pcie_root_port` is set to a positive value. - -After making the above changes, configuration in the `configuration.toml` should look like this: -``` -machine_type = "q35" - -hotplug_vfio_on_root_bus = true -pcie_root_port = 1 -``` - -## GPU passthrough with Kata Containers - -Use the following steps to pass an Intel discrete GPU with Kata: - -1. Find the Bus-Device-Function (BDF) for GPU device: - - ``` - $ sudo lspci -nn -D | grep Display - ``` - - Run the previous command to determine the BDF for the GPU device on host.
- From the previous output, PCI address `0000:29:00.0` is assigned to the hardware GPU device.
- We choose this BDF to use it later to unbind the GPU device from the host for the purpose of demonstration.
- -2. Find the IOMMU group for the GPU device: - - ```bash - $ BDF="0000:29:00.0" - $ readlink -e /sys/bus/pci/devices/$BDF/iommu_group - /sys/kernel/iommu_groups/27 - ``` - - The previous output shows that the GPU belongs to IOMMU group 27. - -3. Bind the GPU to the `vfio-pci` device driver: - - ```bash - $ BDF="0000:29:00.0" - $ DEV="/sys/bus/pci/devices/$BDF" - $ echo "vfio-pci" | sudo tee "$DEV"/driver_override - $ echo $BDF | sudo tee "$DEV"/driver/unbind - $ echo "$BDF" | sudo tee "/sys/bus/pci/drivers_probe" - ``` - - After you run the previous commands, the GPU is bound to `vfio-pci` driver.
-   A new directory with the IOMMU group number is created under `/dev/vfio`:
-
-   ```bash
-   $ ls -l /dev/vfio
-   total 0
-   crw------- 1 root root 241, 0 May 18 15:38 27
-   crw-rw-rw- 1 root root 10, 196 May 18 15:37 vfio
-   ```
-
-   Later, to return the device to the standard driver, simply clear the
-   `driver_override` and re-probe the device, for example:
-
-   ```bash
-   $ echo | sudo tee "$DEV/driver_override"
-   $ echo $BDF | sudo tee $DEV/driver/unbind
-   $ echo $BDF | sudo tee /sys/bus/pci/drivers_probe
-   ```
-
-4. Start a Kata container with the GPU device:
-
-   ```bash
-   $ sudo ctr --debug run --runtime "io.containerd.kata.v2" --device "/dev/vfio/27" --rm -t "docker.io/library/archlinux:latest" arch uname -r
-   ```
-
-   Run `lspci` within the container to verify the GPU device appears in the list of
-   PCI devices. Note the vendor:device ID of the GPU (`8086:0bd5`) in the `lspci` output.
-
-## SR-IOV mode for Intel Discrete GPUs
-
-Use the following steps to pass an Intel Graphics device in SR-IOV mode to a Kata Container:
-
-1. Find the BDF for the GPU device:
-
-   ```sh
-   $ sudo lspci -nn -D | grep Display
-   0000:29:00.0 Display controller [0380]: Intel Corporation Ponte Vecchio 1T [8086:0bd5] (rev 2f)
-   0000:3a:00.0 Display controller [0380]: Intel Corporation Ponte Vecchio 1T [8086:0bd5] (rev 2f)
-   0000:9a:00.0 Display controller [0380]: Intel Corporation Ponte Vecchio 1T [8086:0bd5] (rev 2f)
-   0000:ca:00.0 Display controller [0380]: Intel Corporation Ponte Vecchio 1T [8086:0bd5] (rev 2f)
-   ```
-
-   Run the previous command to find the BDF for the GPU device on the host.
-   We choose the GPU with PCI address `0000:3a:00.0` for creating SR-IOV interfaces.
-
-2. 
Carve out an SR-IOV slice for the GPU:
-
-   List the total possible SR-IOV virtual interfaces for the GPU:
-
-   ```bash
-   $ BDF="0000:3a:00.0"
-   $ cat "/sys/bus/pci/devices/$BDF/sriov_totalvfs"
-   63
-   ```
-
-   Create SR-IOV interfaces for the GPU:
-
-   ```sh
-   $ echo 4 | sudo tee /sys/bus/pci/devices/$BDF/sriov_numvfs
-   4
-   $ sudo lspci | grep Display
-   29:00.0 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f)
-   3a:00.0 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f)
-   3a:00.1 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f)
-   3a:00.2 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f)
-   3a:00.3 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f)
-   3a:00.4 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f)
-   9a:00.0 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f)
-   ca:00.0 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f)
-   ```
-
-   The above output shows the SR-IOV interfaces created for the GPU.
-
-3. Find the IOMMU group for the GPU SR-IOV interface (VGPU):
-
-   ```bash
-   $ BDF="0000:3a:00.1"
-   $ readlink -e "/sys/bus/pci/devices/$BDF/iommu_group"
-   /sys/kernel/iommu_groups/437
-   $ ls -l /dev/vfio
-   total 0
-   crw------- 1 root root 241, 0 May 18 11:30 437
-   crw-rw-rw- 1 root root 10, 196 May 18 11:29 vfio
-   ```
-
-   Now you can use the device node `/dev/vfio/437` in the `ctr` command line to pass
-   the VGPU to a Kata Container.
-
-4. 
Start a Kata Containers container with GPU device enabled: - - ```bash - $ sudo ctr --debug run --runtime "io.containerd.kata.v2" --device /dev/vfio/437 --rm -t "docker.io/library/archlinux:latest" arch uname -r - ``` diff --git a/docs/use-cases/Intel-GPU-passthrough-and-Kata.md b/docs/use-cases/Intel-GPU-passthrough-and-Kata.md deleted file mode 100644 index ce18ead804..0000000000 --- a/docs/use-cases/Intel-GPU-passthrough-and-Kata.md +++ /dev/null @@ -1,287 +0,0 @@ -# Using Intel GPU device with Kata Containers - -An Intel Graphics device can be passed to a Kata Containers container using GPU -passthrough (Intel GVT-d) as well as GPU mediated passthrough (Intel GVT-g). - -Intel GVT-d (one VM to one physical GPU) also named as Intel-Graphics-Device -passthrough feature is one flavor of graphics virtualization approach. -This flavor allows direct assignment of an entire GPU to a single user, -passing the native driver capabilities through the hypervisor without any limitations. - -Intel GVT-g (multiple VMs to one physical GPU) is a full GPU virtualization solution -with mediated pass-through.
-A virtual GPU instance is maintained for each VM, with part of performance critical -resources, directly assigned. The ability to run a native graphics driver inside a -VM without hypervisor intervention in performance critical paths, achieves a good -balance among performance, feature, and sharing capability. - -| Technology | Description | Behaviour | Detail | -|-|-|-|-| -| Intel GVT-d | GPU passthrough | Physical GPU assigned to a single VM | Direct GPU assignment to VM without limitation | -| Intel GVT-g | GPU sharing | Physical GPU shared by multiple VMs | Mediated passthrough | - -## Hardware Requirements - - - For client platforms, 5th generation Intel® Core Processor Graphics or higher are required. - - For server platforms, E3_v4 or higher Xeon Processor Graphics are required. - -The following steps outline the workflow for using an Intel Graphics device with Kata. - -## Host Kernel Requirements - -The following configurations need to be enabled on your host kernel: - -``` -CONFIG_VFIO_IOMMU_TYPE1=m -CONFIG_VFIO=m -CONFIG_VFIO_PCI=m -CONFIG_VFIO_MDEV=m -CONFIG_VFIO_MDEV_DEVICE=m -CONFIG_DRM_I915_GVT=m -CONFIG_DRM_I915_GVT_KVMGT=m -``` - -Your host kernel needs to be booted with `intel_iommu=on` on the kernel command -line. - -## Install and configure Kata Containers - -To use this feature, you need Kata version 1.3.0 or above. -Follow the [Kata Containers setup instructions](../install/README.md) -to install the latest version of Kata. - -In order to pass a GPU to a Kata Container, you need to enable the `hotplug_vfio_on_root_bus` -configuration in the Kata `configuration.toml` file as shown below. - -``` -$ sudo sed -i -e 's/^# *\(hotplug_vfio_on_root_bus\).*=.*$/\1 = true/g' /usr/share/defaults/kata-containers/configuration.toml -``` - -Make sure you are using the `q35` machine type by verifying `machine_type = "q35"` is -set in the `configuration.toml`. Make sure `pcie_root_port` is set to a positive value. 
- -## Build Kata Containers kernel with GPU support - -The default guest kernel installed with Kata Containers does not provide GPU support. -To use an Intel GPU with Kata Containers, you need to build a kernel with the necessary -GPU support. - -The following i915 kernel config options need to be enabled: -``` -CONFIG_DRM=y -CONFIG_DRM_I915=y -CONFIG_DRM_I915_USERPTR=y -``` - -Build the Kata Containers kernel with the previous config options, using the instructions -described in [Building Kata Containers kernel](../../tools/packaging/kernel). -For further details on building and installing guest kernels, see [the developer guide](../Developer-Guide.md#install-guest-kernel-images). - -There is an easy way to build a guest kernel that supports Intel GPU: -``` -## Build guest kernel with ../../tools/packaging/kernel - -# Prepare (download guest kernel source, generate .config) -$ ./build-kernel.sh -g intel -f setup - -# Build guest kernel -$ ./build-kernel.sh -g intel build - -# Install guest kernel -$ sudo -E ./build-kernel.sh -g intel install -/usr/share/kata-containers/vmlinux-intel-gpu.container -> vmlinux-5.4.15-70-intel-gpu -/usr/share/kata-containers/vmlinuz-intel-gpu.container -> vmlinuz-5.4.15-70-intel-gpu -``` - -Before using the new guest kernel, please update the `kernel` parameters in `configuration.toml`. -``` -kernel = "/usr/share/kata-containers/vmlinuz-intel-gpu.container" -``` - -## GVT-d with Kata Containers - -Use the following steps to pass an Intel Graphics device in GVT-d mode with Kata: - -1. Find the Bus-Device-Function (BDF) for GPU device: - - ``` - $ sudo lspci -nn -D | grep Graphics - 0000:00:02.0 VGA compatible controller [0300]: Intel Corporation Broadwell-U Integrated Graphics [8086:1616] (rev 09) - ``` - - Run the previous command to determine the BDF for the GPU device on host.
- From the previous output, PCI address `0000:00:02.0` is assigned to the hardware GPU device.
- This BDF is used later to unbind the GPU device from the host.
- "8086 1616" is the device ID of the hardware GPU device. It is used later to - rebind the GPU device to `vfio-pci` driver. - -2. Find the IOMMU group for the GPU device: - - ``` - $ BDF="0000:00:02.0" - $ readlink -e /sys/bus/pci/devices/$BDF/iommu_group - /sys/kernel/iommu_groups/1 - ``` - - The previous output shows that the GPU belongs to IOMMU group 1. - -3. Unbind the GPU: - - ``` - $ echo $BDF | sudo tee /sys/bus/pci/devices/$BDF/driver/unbind - ``` - -4. Bind the GPU to the `vfio-pci` device driver: - - ``` - $ sudo modprobe vfio-pci - $ echo 8086 1616 | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id - $ echo $BDF | sudo tee --append /sys/bus/pci/drivers/vfio-pci/bind - ``` - - After you run the previous commands, the GPU is bound to `vfio-pci` driver.
- A new directory with the IOMMU group number is created under `/dev/vfio`: - - ``` - $ ls -l /dev/vfio - total 0 - crw------- 1 root root 241, 0 May 18 15:38 1 - crw-rw-rw- 1 root root 10, 196 May 18 15:37 vfio - ``` - -5. Start a Kata container with GPU device: - - ``` - $ sudo docker run -it --runtime=kata-runtime --rm --device /dev/vfio/1 -v /dev:/dev debian /bin/bash - ``` - - Run `lspci` within the container to verify the GPU device is seen in the list of - the PCI devices. Note the vendor-device id of the GPU ("8086:1616") in the `lspci` output. - - ``` - $ lspci -nn -D - 0000:00:00.0 Class [0600]: Device [8086:1237] (rev 02) - 0000:00:01.0 Class [0601]: Device [8086:7000] - 0000:00:01.1 Class [0101]: Device [8086:7010] - 0000:00:01.3 Class [0680]: Device [8086:7113] (rev 03) - 0000:00:02.0 Class [0604]: Device [1b36:0001] - 0000:00:03.0 Class [0780]: Device [1af4:1003] - 0000:00:04.0 Class [0100]: Device [1af4:1004] - 0000:00:05.0 Class [0002]: Device [1af4:1009] - 0000:00:06.0 Class [0200]: Device [1af4:1000] - 0000:00:0f.0 Class [0300]: Device [8086:1616] (rev 09) - ``` - - Additionally, you can access the device node for the graphics device: - - ``` - $ ls /dev/dri - card0 renderD128 - ``` - -## GVT-g with Kata Containers - -For GVT-g, you append `i915.enable_gvt=1` in addition to `intel_iommu=on` -on your host kernel command line and then reboot your host. - -Use the following steps to pass an Intel Graphics device in GVT-g mode to a Kata Container: - -1. Find the BDF for GPU device: - - ``` - $ sudo lspci -nn -D | grep Graphics - 0000:00:02.0 VGA compatible controller [0300]: Intel Corporation Broadwell-U Integrated Graphics [8086:1616] (rev 09) - ``` - - Run the previous command to find out the BDF for the GPU device on host. - The previous output shows PCI address "0000:00:02.0" is assigned to the GPU device. - -2. 
Choose the MDEV (Mediated Device) type for VGPU (Virtual GPU): - - For background on `mdev` types, please follow this [kernel documentation](https://github.com/torvalds/linux/blob/master/Documentation/driver-api/vfio-mediated-device.rst). - - * List out the `mdev` types for the VGPU: - - ``` - $ BDF="0000:00:02.0" - - $ ls /sys/devices/pci0000:00/$BDF/mdev_supported_types - i915-GVTg_V4_1 i915-GVTg_V4_2 i915-GVTg_V4_4 i915-GVTg_V4_8 - ``` - - * Inspect the `mdev` types and choose one that fits your requirement: - - ``` - $ cd /sys/devices/pci0000:00/0000:00:02.0/mdev_supported_types/i915-GVTg_V4_8 && ls - available_instances create description device_api devices - - $ cat description - low_gm_size: 64MB - high_gm_size: 384MB - fence: 4 - resolution: 1024x768 - weight: 2 - - $ cat available_instances - 7 - ``` - - The output of file `description` represents the GPU resources that are - assigned to the VGPU with specified MDEV type.The output of file `available_instances` - represents the remaining amount of VGPUs you can create with specified MDEV type. - -3. Create a VGPU: - - * Generate a UUID: - - ``` - $ gpu_uuid=$(uuid) - ``` - - * Write the UUID to the `create` file under the chosen `mdev` type: - - ``` - $ echo $(gpu_uuid) | sudo tee /sys/devices/pci0000:00/0000:00:02.0/mdev_supported_types/i915-GVTg_V4_8/create - ``` - -4. Find the IOMMU group for the VGPU: - - ``` - $ ls -la /sys/devices/pci0000:00/0000:00:02.0/mdev_supported_types/i915-GVTg_V4_8/devices/${gpu_uuid}/iommu_group - lrwxrwxrwx 1 root root 0 May 18 14:35 devices/bbc4aafe-5807-11e8-a43e-03533cceae7d/iommu_group -> ../../../../kernel/iommu_groups/0 - - $ ls -l /dev/vfio - total 0 - crw------- 1 root root 241, 0 May 18 11:30 0 - crw-rw-rw- 1 root root 10, 196 May 18 11:29 vfio - ``` - - The IOMMU group "0" is created from the previous output.
- Now you can use the device node `/dev/vfio/0` in docker command line to pass - the VGPU to a Kata Container. - -5. Start Kata container with GPU device enabled: - - ``` - $ sudo docker run -it --runtime=kata-runtime --rm --device /dev/vfio/0 -v /dev:/dev debian /bin/bash - $ lspci -nn -D - 0000:00:00.0 Class [0600]: Device [8086:1237] (rev 02) - 0000:00:01.0 Class [0601]: Device [8086:7000] - 0000:00:01.1 Class [0101]: Device [8086:7010] - 0000:00:01.3 Class [0680]: Device [8086:7113] (rev 03) - 0000:00:02.0 Class [0604]: Device [1b36:0001] - 0000:00:03.0 Class [0780]: Device [1af4:1003] - 0000:00:04.0 Class [0100]: Device [1af4:1004] - 0000:00:05.0 Class [0002]: Device [1af4:1009] - 0000:00:06.0 Class [0200]: Device [1af4:1000] - 0000:00:0f.0 Class [0300]: Device [8086:1616] (rev 09) - ``` - - BDF "0000:00:0f.0" is assigned to the VGPU device. - - Additionally, you can access the device node for the graphics device: - - ``` - $ ls /dev/dri - card0 renderD128 - ```