mirror of
https://github.com/kata-containers/kata-containers.git
synced 2026-04-26 10:32:28 +00:00
Merge pull request #12702 from Apokleos/update-docs2
docs: Update docs of kata-containers
@@ -15,7 +15,7 @@ nav:
  - Use Cases:
      - NVIDIA GPU Passthrough: use-cases/NVIDIA-GPU-passthrough-and-Kata-QEMU.md
      - NVIDIA vGPU: use-cases/NVIDIA-GPU-passthrough-and-Kata.md
      - Intel Discrete GPU: use-cases/Intel-Discrete-GPU-passthrough-and-Kata.md
      - Intel QAT: use-cases/using-Intel-QAT-and-kata.md
  - Contributing:
      - Documentation: doc-contributing.md
  - Misc:

@@ -1,73 +1,90 @@

# Loading kernel modules in Kata Containers

This document describes how to load kernel modules inside the Kata Containers guest VM.

## Overview

The kernel modules feature allows you to load specific kernel modules into the guest VM kernel when a sandbox is created. This is useful when your containerized applications require specific kernel functionality that is not built into the guest kernel.

**How it works:**

1. You specify kernel modules and their parameters via the configuration file or OCI annotations
2. The Kata runtime passes this information to the Kata Agent through agent RPC during sandbox creation (gRPC in runtime-go, ttrpc in runtime-rs)
3. The Kata Agent loads the modules using `modprobe(8)`, which automatically resolves module dependencies

**Failure conditions:**

The sandbox will fail to start if:

- A kernel module is specified but `modprobe(8)` is not installed in the guest, or it fails to load the module
- The module is not available in the guest or doesn't meet guest kernel requirements (architecture, version, etc.)

## Configuration Methods

- [Using Kata Configuration file](#using-kata-configuration-file)
- [Using annotations](#using-annotations)

## Using Kata Configuration file

> **Note**: Use this method only when you need the kernel modules loaded for all containers. For per-pod configuration, use the annotations described below instead.

The `kernel_modules` option accepts a list of kernel modules with their parameters. Each list element specifies a module name followed by its space-separated parameters.

### Configuration Format

**For runtime-go** (`configuration-qemu.toml`, etc.):

```toml
[agent.kata]
kernel_modules = ["e1000e InterruptThrottleRate=3000,3000,3000 EEE=1", "i915"]
```

**For runtime-rs** (`configuration-qemu-runtime-rs.toml`, etc.):

```toml
[agent.kata]
kernel_modules = ["e1000e InterruptThrottleRate=3000,3000,3000 EEE=1", "i915"]
```
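
If you script this change, a simple substitution over the configuration file works. The following is an illustrative sketch run against a scratch copy (the real target would be your `configuration.toml`; the pre-existing empty `kernel_modules = []` line is an assumption):

```shell
# Demonstrated on a temporary file so it is safe to run anywhere.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
[agent.kata]
kernel_modules = []
EOF

# Replace the empty list with the desired modules (GNU sed in-place edit).
sed -i 's/^kernel_modules = \[\]$/kernel_modules = ["e1000e EEE=1", "i915"]/' "$cfg"
line=$(grep '^kernel_modules' "$cfg")
echo "$line"
rm -f "$cfg"
```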

### Example

The following example loads two modules:

- `e1000e` with parameters `InterruptThrottleRate=3000,3000,3000` and `EEE=1`
- `i915` with no parameters

```toml
kernel_modules = ["e1000e InterruptThrottleRate=3000,3000,3000 EEE=1", "i915"]
```

### Limitations

- Write access to the Kata configuration file is required
- All containers will use the same module list, even if some containers don't need the modules
- Configuration changes require a service restart to take effect

## Using annotations

Annotations provide a way to specify kernel modules per pod, which is more flexible than the configuration file approach.

### Annotation Key

```
io.katacontainers.config.agent.kernel_modules
```

### Format

The annotation value uses a **semicolon (`;`)** as the separator between modules. Each module specification consists of:

- Module name (first word)
- Parameters (subsequent words, space-separated)

Example: `"e1000e EEE=1; i915 enable_ppgtt=0"`
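
As a rough sketch of how this separator scheme splits into module names and parameters (illustrative plain shell only; the actual parsing is done inside the Kata runtime and agent):

```shell
# Split a sample annotation value: entries are separated by ';', the first
# word of each entry is the module name, the rest are its parameters.
annotation="e1000e EEE=1; i915 enable_ppgtt=0"

IFS=';' read -ra entries <<< "$annotation"
for entry in "${entries[@]}"; do
    entry="${entry#"${entry%%[![:space:]]*}"}"   # trim leading whitespace
    name="${entry%% *}"                          # first word: module name
    params="${entry#"$name"}"                    # remainder: parameters
    echo "module=$name params=${params# }"
done
```

This prints one `module=... params=...` line per entry.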

### Kubernetes Example

The following example creates two pods, where only `pod1` will have the kernel modules `e1000e` and `i915` loaded:

```yaml
apiVersion: v1
@@ -104,6 +121,53 @@ spec:
```

> **Note**: To pass annotations to Kata containers, [CRI-O must be configured correctly](how-to-set-sandbox-config-kata.md#cri-o-configuration)

## Technical Details

### Data Flow

```
Configuration File / Annotation
              │
              ▼
SandboxConfig.AgentConfig.KernelModules
              │
              ▼
Converted to gRPC KernelModule messages
              │
              ▼
CreateSandboxRequest sent to Agent
              │
              ▼
Agent executes modprobe in guest VM
```

### Implementation in Runtimes

**runtime-go:**

- Config parsing: `src/runtime/pkg/katautils/config.go`
- Annotation handling: `src/runtime/pkg/oci/utils.go` (`addAgentConfigOverrides()`)
- Module parsing: `src/runtime/virtcontainers/kata_agent.go` (`setupKernelModules()`)

**runtime-rs:**

- Config structure: `src/libs/kata-types/src/config/agent.rs`
- Annotation handling: `src/libs/kata-types/src/annotations/mod.rs` (`update_config_by_annotation()`)
- Module parsing: `src/runtime-rs/crates/agent/src/types.rs` (`KernelModule::set_kernel_modules()`)

## Debugging

To verify that kernel modules are loaded in the guest VM:

```bash
# Inside the container, run:
lsmod | grep <module_name>

# Or check the modprobe output in the guest VM journal
```

If module loading fails, check that:

1. The module is available in the guest kernel modules directory (`/lib/modules/$(uname -r)`)
2. Module dependencies are satisfied
3. The guest kernel version matches the module requirements

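
For a reproducible illustration of the `lsmod` check, the following runs against sample output (on a real guest, pipe `lsmod` directly instead of the heredoc):

```shell
# Sample lsmod output; replace with `lsmod_output=$(lsmod)` on a real guest.
lsmod_output=$(cat <<'EOF'
Module                  Size  Used by
i915                 2949120  0
e1000e                286720  0
EOF
)
module="i915"
# Match the module name exactly in the first column, skipping the header line.
if echo "$lsmod_output" | awk 'NR>1 {print $1}' | grep -qx "$module"; then
    loaded=1; echo "$module: loaded"
else
    loaded=0; echo "$module: not loaded"
fi
```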
@@ -2,5 +2,5 @@

Kata Containers supports passing certain GPUs from the host into the container. Select the GPU vendor for detailed information:

- [Intel Discrete GPUs](Intel-Discrete-GPU-passthrough-and-Kata.md) / [Intel Integrated GPUs](Intel-GPU-passthrough-and-Kata.md)
- [NVIDIA GPUs](NVIDIA-GPU-passthrough-and-Kata.md) and [Enabling NVIDIA GPU workloads using GPU passthrough with Kata Containers](NVIDIA-GPU-passthrough-and-Kata-QEMU.md)

@@ -1,274 +0,0 @@

# Using Intel Discrete GPU device with Kata Containers

This guide covers the use case for passing Intel Discrete GPUs to Kata.
These include the Intel® Data Center GPU Max Series and Intel® Data Center GPU Flex Series.
For integrated GPUs, please refer to [Integrate-Intel-GPUs-with-Kata](Intel-GPU-passthrough-and-Kata.md).

> **Note:** These instructions are for a system that has an x86_64 CPU.

An Intel Discrete GPU can be passed to a Kata Container using GPU passthrough
or SR-IOV passthrough.

In Intel GPU pass-through mode, an entire physical GPU is directly assigned to one VM.
In this mode of operation, the GPU is accessed exclusively by the Intel driver running in
the VM to which it is assigned. The GPU is not shared among VMs.

With SR-IOV mode, it is possible to pass a Virtual GPU instance to a virtual machine.
Multiple Virtual GPU instances can be carved out of a single physical GPU
and passed to different VMs, allowing the GPU to be shared.

| Technology | Description |
|-|-|
| GPU passthrough | Physical GPU assigned to a single VM |
| SR-IOV passthrough | Physical GPU shared by multiple VMs |

## Hardware Requirements

Intel GPUs recommended for virtualization:

- Intel® Data Center GPU Max Series (`Ponte Vecchio`)
- Intel® Data Center GPU Flex Series (`Arctic Sound-M`)
- Intel® Data Center GPU Arc Series

The following steps outline the workflow for using an Intel Graphics device with Kata Containers.

## Host BIOS requirements

Hardware such as the Intel Max and Flex series requires larger PCI BARs.

For large BAR devices, MMIO mapping above the 4GB address space should be enabled in the PCI configuration of the BIOS.

Some hardware vendors use a different name for this setting in the BIOS, such as:

- Above 4GB Decoding
- Memory Hole for PCI MMIO
- Memory Mapped I/O above 4GB

## Host Kernel Requirements

For device passthrough to work with the Max and Flex Series, an out-of-tree kernel driver is required.

For Ubuntu 22.04 server, follow these instructions to install the out-of-tree GPU driver:

```bash
$ sudo apt update
$ sudo apt install -y gpg-agent wget
$ wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
  sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
$ source /etc/os-release
$ echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu ${VERSION_CODENAME}/lts/2350 unified" | \
  sudo tee /etc/apt/sources.list.d/intel-gpu-${VERSION_CODENAME}.list
$ sudo apt update
$ sudo apt install -y linux-headers-"$(uname -r)" flex bison intel-fw-gpu intel-i915-dkms xpu-smi
$ sudo reboot
```

For support on other distributions, please refer to [DGPU-docs](https://dgpu-docs.intel.com/driver/installation.html).

You can also install the driver from source, which is maintained at [intel-gpu-i915-backports](https://github.com/intel-gpu/intel-gpu-i915-backports).
Detailed instructions for reference can be found at: https://github.com/intel-gpu/intel-gpu-i915-backports/blob/backport/main/docs/README_ubuntu.md.

Below are the steps for installing the driver from source on an Ubuntu 22.04 LTS system:

```bash
$ export I915_BRANCH="backport/main"
$ git clone -b ${I915_BRANCH} --depth 1 https://github.com/intel-gpu/intel-gpu-i915-backports.git
$ cd intel-gpu-i915-backports/
$ sudo apt install -y dkms make debhelper devscripts build-essential flex bison mawk
$ sudo apt install -y linux-headers-"$(uname -r)" linux-image-unsigned-"$(uname -r)"
$ make i915dkmsdeb-pkg
```

The above `make` command will create a Debian package in the parent folder: `intel-i915-dkms_<release version>.<kernel-version>.deb`.
Install the package as follows:

```bash
$ sudo dpkg -i intel-i915-dkms_<release version>.<kernel-version>.deb
$ sudo reboot
```

Additionally, verify that the following kernel configs are enabled for your host kernel:

```
CONFIG_VFIO
CONFIG_VFIO_IOMMU_TYPE1
CONFIG_VFIO_PCI
```
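
A quick way to check these options is to grep the kernel config. The sketch below runs against a sample file so it is reproducible; on a real host, point `kconfig` at `/boot/config-$(uname -r)` (the location varies by distribution):

```shell
# Sample kernel config; on a real host use kconfig="/boot/config-$(uname -r)".
kconfig=$(mktemp)
cat > "$kconfig" <<'EOF'
CONFIG_VFIO=m
CONFIG_VFIO_IOMMU_TYPE1=m
CONFIG_VFIO_PCI=m
EOF

missing=0
for opt in CONFIG_VFIO CONFIG_VFIO_IOMMU_TYPE1 CONFIG_VFIO_PCI; do
    # Accept both built-in (=y) and module (=m) configurations.
    if grep -qE "^${opt}=(y|m)" "$kconfig"; then
        echo "${opt}: enabled"
    else
        echo "${opt}: missing"
        missing=1
    fi
done
rm -f "$kconfig"
```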

## Host kernel command line

Your host kernel needs to be booted with `intel_iommu=on` and `i915.enable_iaf=0` on the kernel command line.

1. Change the kernel command line using GRUB:

   ```bash
   $ sudo vim /etc/default/grub
   ```

2. Append the following to the `GRUB_CMDLINE_LINUX_DEFAULT` entry:

   `intel_iommu=on iommu=pt i915.max_vfs=63 i915.enable_iaf=0`

3. Update GRUB as per your OS distribution:

   For Ubuntu:
   ```bash
   $ sudo update-grub
   ```

   For CentOS/RHEL:
   ```bash
   $ sudo grub2-mkconfig -o /boot/grub2/grub.cfg
   ```

4. Reboot the system:
   ```bash
   $ sudo reboot
   ```
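
After the reboot you can confirm the parameters took effect. The sketch below checks a sample command line string for reproducibility; on a real host, use the contents of `/proc/cmdline` instead:

```shell
# Sample kernel command line; on a real host use cmdline=$(cat /proc/cmdline).
cmdline="BOOT_IMAGE=/vmlinuz root=/dev/sda2 intel_iommu=on iommu=pt i915.max_vfs=63 i915.enable_iaf=0"

all_present=1
for param in intel_iommu=on i915.enable_iaf=0; do
    # Pad with spaces so we only match whole, space-delimited parameters.
    case " $cmdline " in
        *" $param "*) echo "$param: present" ;;
        *)            echo "$param: missing"; all_present=0 ;;
    esac
done
```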

## Install and configure Kata Containers

To use this feature, you need Kata version 1.3.0 or above.
Follow the [Kata Containers setup instructions](../install/README.md)
to install the latest version of Kata.

To use large BAR devices (for example, NVIDIA Tesla P100), you need Kata version 1.11.0 or above.

In order to pass a GPU to a Kata Container, you need to enable the `hotplug_vfio_on_root_bus`
configuration in the Kata `configuration.toml` file as shown below:

```bash
$ sudo sed -i -e 's/^# *\(hotplug_vfio_on_root_bus\).*=.*$/\1 = true/g' /usr/share/defaults/kata-containers/configuration.toml
```

Make sure you are using the `q35` machine type by verifying that `machine_type = "q35"` is
set in the `configuration.toml`. Also make sure `pcie_root_port` is set to a positive value.

After making the above changes, the configuration in `configuration.toml` should look like this:

```toml
machine_type = "q35"

hotplug_vfio_on_root_bus = true
pcie_root_port = 1
```

## GPU passthrough with Kata Containers

Use the following steps to pass an Intel discrete GPU with Kata:

1. Find the Bus-Device-Function (BDF) for the GPU device:

   ```bash
   $ sudo lspci -nn -D | grep Display
   ```

   Run the previous command to determine the BDF for the GPU device on the host.<br/>
   From the output, PCI address `0000:29:00.0` is assigned to the hardware GPU device.<br/>
   We choose this BDF and use it later to unbind the GPU device from the host for the purpose of demonstration.<br/>

2. Find the IOMMU group for the GPU device:

   ```bash
   $ BDF="0000:29:00.0"
   $ readlink -e /sys/bus/pci/devices/$BDF/iommu_group
   /sys/kernel/iommu_groups/27
   ```

   The previous output shows that the GPU belongs to IOMMU group 27.
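
The group number is the last component of the `readlink` output, and the matching device node is `/dev/vfio/<group>`. A small sketch of that derivation, using the sample path shown above rather than a live sysfs query:

```shell
# Derive the VFIO device node from an IOMMU group path (sample value).
group_path="/sys/kernel/iommu_groups/27"
group="${group_path##*/}"        # strip everything up to the last '/'
vfio_node="/dev/vfio/$group"
echo "$vfio_node"
```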

3. Bind the GPU to the `vfio-pci` device driver:

   ```bash
   $ BDF="0000:29:00.0"
   $ DEV="/sys/bus/pci/devices/$BDF"
   $ echo "vfio-pci" | sudo tee "$DEV"/driver_override
   $ echo $BDF | sudo tee "$DEV"/driver/unbind
   $ echo "$BDF" | sudo tee "/sys/bus/pci/drivers_probe"
   ```

   After you run the previous commands, the GPU is bound to the `vfio-pci` driver.<br/>
   A new directory with the IOMMU group number is created under `/dev/vfio`:

   ```bash
   $ ls -l /dev/vfio
   total 0
   crw------- 1 root root 241, 0 May 18 15:38 27
   crw-rw-rw- 1 root root 10, 196 May 18 15:37 vfio
   ```

   Later, to return the device to the standard driver, simply clear the
   `driver_override` and re-probe the device, e.g.:

   ```bash
   $ echo | sudo tee "$DEV/driver_override"
   $ echo $BDF | sudo tee $DEV/driver/unbind
   $ echo $BDF | sudo tee /sys/bus/pci/drivers_probe
   ```

4. Start a Kata container with the GPU device:

   ```bash
   $ sudo ctr --debug run --runtime "io.containerd.kata.v2" --device "/dev/vfio/27" --rm -t "docker.io/library/archlinux:latest" arch uname -r
   ```

   Run `lspci` within the container to verify the GPU device is seen in the list of
   PCI devices. Note the vendor-device id of the GPU ("8086:0bd5") in the `lspci` output.

## SR-IOV mode for Intel Discrete GPUs

Use the following steps to pass an Intel Graphics device in SR-IOV mode to a Kata Container:

1. Find the BDF for the GPU device:

   ```sh
   $ sudo lspci -nn -D | grep Display
   0000:29:00.0 Display controller [0380]: Intel Corporation Ponte Vecchio 1T [8086:0bd5] (rev 2f)
   0000:3a:00.0 Display controller [0380]: Intel Corporation Ponte Vecchio 1T [8086:0bd5] (rev 2f)
   0000:9a:00.0 Display controller [0380]: Intel Corporation Ponte Vecchio 1T [8086:0bd5] (rev 2f)
   0000:ca:00.0 Display controller [0380]: Intel Corporation Ponte Vecchio 1T [8086:0bd5] (rev 2f)
   ```

   Run the previous command to find the BDF for the GPU device on the host.
   We choose the GPU with PCI address "0000:3a:00.0" to assign a GPU SR-IOV interface.

2. Carve out an SR-IOV slice for the GPU:

   List the total possible SR-IOV virtual interfaces for the GPU:

   ```bash
   $ BDF="0000:3a:00.0"
   $ cat "/sys/bus/pci/devices/$BDF/sriov_totalvfs"
   63
   ```

   Create SR-IOV interfaces for the GPU:

   ```sh
   $ echo 4 | sudo tee /sys/bus/pci/devices/$BDF/sriov_numvfs
   4
   $ sudo lspci | grep Display
   29:00.0 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f)
   3a:00.0 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f)
   3a:00.1 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f)
   3a:00.2 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f)
   3a:00.3 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f)
   3a:00.4 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f)
   9a:00.0 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f)
   ca:00.0 Display controller: Intel Corporation Ponte Vecchio 1T (rev 2f)
   ```

   The above output shows the SR-IOV interfaces created for the GPU.
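
When choosing how many VFs to create, the request must not exceed `sriov_totalvfs`. A minimal sanity check, using the sample values shown above rather than live sysfs reads:

```shell
# Sample values: on a real host, read total_vfs from sriov_totalvfs.
total_vfs=63
want_vfs=4
if [ "$want_vfs" -le "$total_vfs" ]; then
    echo "ok to create $want_vfs VFs"
else
    echo "error: device supports at most $total_vfs VFs"
fi
```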

3. Find the IOMMU group for the GPU SR-IOV interface (VGPU):

   ```bash
   $ BDF="0000:3a:00.1"
   $ readlink -e "/sys/bus/pci/devices/$BDF/iommu_group"
   /sys/kernel/iommu_groups/437
   $ ls -l /dev/vfio
   total 0
   crw------- 1 root root 241, 0 May 18 11:30 437
   crw-rw-rw- 1 root root 10, 196 May 18 11:29 vfio
   ```

   Now you can use the device node `/dev/vfio/437` on the `ctr` command line to pass
   the VGPU to a Kata Container.

4. Start a Kata Containers container with the GPU device enabled:

   ```bash
   $ sudo ctr --debug run --runtime "io.containerd.kata.v2" --device /dev/vfio/437 --rm -t "docker.io/library/archlinux:latest" arch uname -r
   ```

@@ -1,287 +0,0 @@

# Using Intel GPU device with Kata Containers

An Intel Graphics device can be passed to a Kata Containers container using GPU
passthrough (Intel GVT-d) as well as GPU mediated passthrough (Intel GVT-g).

Intel GVT-d (one VM to one physical GPU), also known as Intel Graphics Device
passthrough, is one flavor of the graphics virtualization approach.
It allows direct assignment of an entire GPU to a single user,
passing the native driver capabilities through the hypervisor without any limitations.

Intel GVT-g (multiple VMs to one physical GPU) is a full GPU virtualization solution
with mediated pass-through.<br/>
A virtual GPU instance is maintained for each VM, with part of the performance-critical
resources directly assigned. The ability to run a native graphics driver inside a
VM, without hypervisor intervention in performance-critical paths, achieves a good
balance among performance, features, and sharing capability.

| Technology | Description | Behaviour | Detail |
|-|-|-|-|
| Intel GVT-d | GPU passthrough | Physical GPU assigned to a single VM | Direct GPU assignment to VM without limitation |
| Intel GVT-g | GPU sharing | Physical GPU shared by multiple VMs | Mediated passthrough |

## Hardware Requirements

- For client platforms, 5th generation Intel® Core Processor Graphics or higher is required.
- For server platforms, E3_v4 or higher Xeon Processor Graphics is required.

The following steps outline the workflow for using an Intel Graphics device with Kata.

## Host Kernel Requirements

The following configurations need to be enabled on your host kernel:

```
CONFIG_VFIO_IOMMU_TYPE1=m
CONFIG_VFIO=m
CONFIG_VFIO_PCI=m
CONFIG_VFIO_MDEV=m
CONFIG_VFIO_MDEV_DEVICE=m
CONFIG_DRM_I915_GVT=m
CONFIG_DRM_I915_GVT_KVMGT=m
```

Your host kernel needs to be booted with `intel_iommu=on` on the kernel command line.

## Install and configure Kata Containers

To use this feature, you need Kata version 1.3.0 or above.
Follow the [Kata Containers setup instructions](../install/README.md)
to install the latest version of Kata.

In order to pass a GPU to a Kata Container, you need to enable the `hotplug_vfio_on_root_bus`
configuration in the Kata `configuration.toml` file as shown below:

```bash
$ sudo sed -i -e 's/^# *\(hotplug_vfio_on_root_bus\).*=.*$/\1 = true/g' /usr/share/defaults/kata-containers/configuration.toml
```

Make sure you are using the `q35` machine type by verifying that `machine_type = "q35"` is
set in the `configuration.toml`. Also make sure `pcie_root_port` is set to a positive value.

## Build Kata Containers kernel with GPU support

The default guest kernel installed with Kata Containers does not provide GPU support.
To use an Intel GPU with Kata Containers, you need to build a kernel with the necessary
GPU support.

The following i915 kernel config options need to be enabled:

```
CONFIG_DRM=y
CONFIG_DRM_I915=y
CONFIG_DRM_I915_USERPTR=y
```

Build the Kata Containers kernel with the previous config options, using the instructions
described in [Building Kata Containers kernel](../../tools/packaging/kernel).
For further details on building and installing guest kernels, see [the developer guide](../Developer-Guide.md#install-guest-kernel-images).

There is an easy way to build a guest kernel that supports Intel GPU:

```bash
## Build guest kernel with ../../tools/packaging/kernel

# Prepare (download guest kernel source, generate .config)
$ ./build-kernel.sh -g intel -f setup

# Build guest kernel
$ ./build-kernel.sh -g intel build

# Install guest kernel
$ sudo -E ./build-kernel.sh -g intel install
/usr/share/kata-containers/vmlinux-intel-gpu.container -> vmlinux-5.4.15-70-intel-gpu
/usr/share/kata-containers/vmlinuz-intel-gpu.container -> vmlinuz-5.4.15-70-intel-gpu
```

Before using the new guest kernel, please update the `kernel` parameter in `configuration.toml`:

```toml
kernel = "/usr/share/kata-containers/vmlinuz-intel-gpu.container"
```

## GVT-d with Kata Containers

Use the following steps to pass an Intel Graphics device in GVT-d mode with Kata:

1. Find the Bus-Device-Function (BDF) for the GPU device:

   ```bash
   $ sudo lspci -nn -D | grep Graphics
   0000:00:02.0 VGA compatible controller [0300]: Intel Corporation Broadwell-U Integrated Graphics [8086:1616] (rev 09)
   ```

   Run the previous command to determine the BDF for the GPU device on the host.<br/>
   From the previous output, PCI address `0000:00:02.0` is assigned to the hardware GPU device.<br/>
   This BDF is used later to unbind the GPU device from the host.<br/>
   "8086 1616" is the device ID of the hardware GPU device. It is used later to
   rebind the GPU device to the `vfio-pci` driver.

2. Find the IOMMU group for the GPU device:

   ```bash
   $ BDF="0000:00:02.0"
   $ readlink -e /sys/bus/pci/devices/$BDF/iommu_group
   /sys/kernel/iommu_groups/1
   ```

   The previous output shows that the GPU belongs to IOMMU group 1.

3. Unbind the GPU:

   ```bash
   $ echo $BDF | sudo tee /sys/bus/pci/devices/$BDF/driver/unbind
   ```

4. Bind the GPU to the `vfio-pci` device driver:

   ```bash
   $ sudo modprobe vfio-pci
   $ echo 8086 1616 | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
   $ echo $BDF | sudo tee --append /sys/bus/pci/drivers/vfio-pci/bind
   ```

   After you run the previous commands, the GPU is bound to the `vfio-pci` driver.<br/>
   A new directory with the IOMMU group number is created under `/dev/vfio`:

   ```bash
   $ ls -l /dev/vfio
   total 0
   crw------- 1 root root 241, 0 May 18 15:38 1
   crw-rw-rw- 1 root root 10, 196 May 18 15:37 vfio
   ```

5. Start a Kata container with the GPU device:

   ```bash
   $ sudo docker run -it --runtime=kata-runtime --rm --device /dev/vfio/1 -v /dev:/dev debian /bin/bash
   ```

   Run `lspci` within the container to verify the GPU device is seen in the list of
   PCI devices. Note the vendor-device id of the GPU ("8086:1616") in the `lspci` output.

   ```bash
   $ lspci -nn -D
   0000:00:00.0 Class [0600]: Device [8086:1237] (rev 02)
   0000:00:01.0 Class [0601]: Device [8086:7000]
   0000:00:01.1 Class [0101]: Device [8086:7010]
   0000:00:01.3 Class [0680]: Device [8086:7113] (rev 03)
   0000:00:02.0 Class [0604]: Device [1b36:0001]
   0000:00:03.0 Class [0780]: Device [1af4:1003]
   0000:00:04.0 Class [0100]: Device [1af4:1004]
   0000:00:05.0 Class [0002]: Device [1af4:1009]
   0000:00:06.0 Class [0200]: Device [1af4:1000]
   0000:00:0f.0 Class [0300]: Device [8086:1616] (rev 09)
   ```

   Additionally, you can access the device node for the graphics device:

   ```bash
   $ ls /dev/dri
   card0 renderD128
   ```

## GVT-g with Kata Containers

For GVT-g, append `i915.enable_gvt=1` in addition to `intel_iommu=on`
on your host kernel command line and then reboot your host.

Use the following steps to pass an Intel Graphics device in GVT-g mode to a Kata Container:

1. Find the BDF for the GPU device:

   ```bash
   $ sudo lspci -nn -D | grep Graphics
   0000:00:02.0 VGA compatible controller [0300]: Intel Corporation Broadwell-U Integrated Graphics [8086:1616] (rev 09)
   ```

   Run the previous command to find the BDF for the GPU device on the host.
   The previous output shows PCI address "0000:00:02.0" is assigned to the GPU device.

2. Choose the MDEV (Mediated Device) type for the VGPU (Virtual GPU):

   For background on `mdev` types, please follow this [kernel documentation](https://github.com/torvalds/linux/blob/master/Documentation/driver-api/vfio-mediated-device.rst).

   * List the `mdev` types for the VGPU:

     ```bash
     $ BDF="0000:00:02.0"

     $ ls /sys/devices/pci0000:00/$BDF/mdev_supported_types
     i915-GVTg_V4_1 i915-GVTg_V4_2 i915-GVTg_V4_4 i915-GVTg_V4_8
     ```

   * Inspect the `mdev` types and choose one that fits your requirements:

     ```bash
     $ cd /sys/devices/pci0000:00/0000:00:02.0/mdev_supported_types/i915-GVTg_V4_8 && ls
     available_instances create description device_api devices

     $ cat description
     low_gm_size: 64MB
     high_gm_size: 384MB
     fence: 4
     resolution: 1024x768
     weight: 2

     $ cat available_instances
     7
     ```

   The output of the `description` file represents the GPU resources that are
   assigned to a VGPU of the specified MDEV type. The output of the `available_instances`
   file represents the remaining number of VGPUs you can create with the specified MDEV type.
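
Before creating a VGPU it is worth checking that capacity remains. A minimal sketch using the sample `available_instances` value shown above (on a real host, read the value from the chosen mdev type's `available_instances` file):

```shell
# Sample value: on a real host, available=$(cat available_instances).
available=7
if [ "$available" -gt 0 ]; then
    echo "capacity left for this mdev type: $available"
else
    echo "no capacity left for this mdev type"
fi
```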
|
||||
|
||||
3. Create a VGPU:
|
||||
|
||||
* Generate a UUID:
|
||||
|
||||
```
|
||||
$ gpu_uuid=$(uuid)
|
||||
```
|
||||
|
||||
* Write the UUID to the `create` file under the chosen `mdev` type:
|
||||
|
||||
```
|
||||
$ echo $(gpu_uuid) | sudo tee /sys/devices/pci0000:00/0000:00:02.0/mdev_supported_types/i915-GVTg_V4_8/create
|
||||
```
|
||||
|
||||
4. Find the IOMMU group for the VGPU:

```
$ ls -la /sys/devices/pci0000:00/0000:00:02.0/mdev_supported_types/i915-GVTg_V4_8/devices/${gpu_uuid}/iommu_group
lrwxrwxrwx 1 root root 0 May 18 14:35 devices/bbc4aafe-5807-11e8-a43e-03533cceae7d/iommu_group -> ../../../../kernel/iommu_groups/0

$ ls -l /dev/vfio
total 0
crw------- 1 root root 241,   0 May 18 11:30 0
crw-rw-rw- 1 root root  10, 196 May 18 11:29 vfio
```

The previous output shows that IOMMU group "0" was created for the VGPU.
Now you can use the device node `/dev/vfio/0` on the docker command line to pass
the VGPU to a Kata Container.

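Rather than eyeballing the `ls -la` output, a script can resolve the group number directly from the `iommu_group` symlink. A minimal sketch (the demo layout below is fabricated; on a real host pass the device's sysfs directory):

```shell
# Resolve the IOMMU group number from a device's iommu_group symlink.
iommu_group_of() {
    # The symlink points at .../kernel/iommu_groups/<N>; basename yields <N>.
    basename "$(readlink -f "$1/iommu_group")"
}

# Demo against a throwaway layout mimicking sysfs:
mkdir -p /tmp/fake-sysfs/kernel/iommu_groups/0 /tmp/fake-sysfs/dev
rm -f /tmp/fake-sysfs/dev/iommu_group
ln -s ../kernel/iommu_groups/0 /tmp/fake-sysfs/dev/iommu_group
iommu_group_of /tmp/fake-sysfs/dev   # → 0
```

The resulting number maps to the `/dev/vfio/<N>` node passed to the container.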
5. Start a Kata Container with the GPU device enabled:

```
$ sudo docker run -it --runtime=kata-runtime --rm --device /dev/vfio/0 -v /dev:/dev debian /bin/bash
$ lspci -nn -D
0000:00:00.0 Class [0600]: Device [8086:1237] (rev 02)
0000:00:01.0 Class [0601]: Device [8086:7000]
0000:00:01.1 Class [0101]: Device [8086:7010]
0000:00:01.3 Class [0680]: Device [8086:7113] (rev 03)
0000:00:02.0 Class [0604]: Device [1b36:0001]
0000:00:03.0 Class [0780]: Device [1af4:1003]
0000:00:04.0 Class [0100]: Device [1af4:1004]
0000:00:05.0 Class [0002]: Device [1af4:1009]
0000:00:06.0 Class [0200]: Device [1af4:1000]
0000:00:0f.0 Class [0300]: Device [8086:1616] (rev 09)
```

BDF "0000:00:0f.0" is assigned to the VGPU device.

Additionally, you can access the device node for the graphics device:

```
$ ls /dev/dri
card0  renderD128
```

# Setup to run SPDK vhost-user devices with Kata Containers

> **Note:** This guide applies to both **runtime-rs with Dragonball** and **QEMU** hypervisors. For runtime-rs, the procedure is simplified as there is no need to manually create device nodes.

## SPDK vhost-user Target Overview

The Storage Performance Development Kit (SPDK) provides a set of tools and libraries for writing high performance, scalable, user-mode storage applications.

virtio, vhost and vhost-user:

- virtio is an efficient way to transport data for virtual environments and guests. It is most commonly used in QEMU VMs, where the VM itself exposes a virtual PCI device and the guest OS communicates with it using a specific virtio PCI driver.

- vhost uses the same virtio queue layout as virtio to allow vhost devices to be mapped directly to virtio devices. The initial vhost implementation is a part of the Linux kernel and uses an ioctl interface to communicate with userspace applications.

- vhost-user implements the control plane through Unix domain socket to establish virtio queue sharing with a user space process on the same host. SPDK exposes vhost devices via the vhost-user protocol.

SPDK vhost is a vhost-user slave server. It exposes Unix domain sockets and allows external applications to connect. It is capable of exposing virtualized storage devices to QEMU instances or other arbitrary processes.

Currently, the SPDK vhost-user target can expose several types of virtualized devices, but the most commonly used one in Kata Containers is the block device, which is supported by both runtime-rs with Dragonball and QEMU hypervisors:

- `vhost-user-blk`
- `vhost-user-scsi`
- `vhost-user-nvme` (deprecated since the SPDK 21.07 release)

A `vhost-user-blk` device appears as a regular block device in the guest. It is suitable for workloads that require high performance and low latency, such as databases or high I/O applications.

For more information, visit [SPDK](https://spdk.io) and [SPDK vhost-user target](https://spdk.io/doc/vhost.html).

## Prerequisites

- A Kubernetes cluster with Kata Containers enabled (runtime-rs with Dragonball or QEMU)
- SPDK built and `spdk_tgt` available; to build SPDK, follow the SPDK [getting started guide](https://spdk.io/doc/getting_started.html)
- For Kubernetes CSI integration: `csi-kata-directvolume` deployed

## Method 1: Using CSI Driver (Recommended for Kubernetes)

This is the recommended method for Kubernetes environments, leveraging the `csi-kata-directvolume` CSI driver.

### 1. Start SPDK Service

```sh
$ export SPDK_DEVEL=<path-to-your-spdk>
$ export VHU_UDS_PATH=/var/lib/spdk/vhost

# Reset and allocate hugepages
$ cd $SPDK_DEVEL
$ sudo ./scripts/setup.sh reset
$ sudo sysctl -w vm.nr_hugepages=2048
$ sudo HUGEMEM=4096 ./scripts/setup.sh

# Start SPDK vhost target
$ sudo mkdir -p $VHU_UDS_PATH
$ sudo $SPDK_DEVEL/build/bin/spdk_tgt -S $VHU_UDS_PATH -s 1024 -m 0x3 &
```

> **Notes:**
> - `-s 1024`: size of the hugepage memory pool in MB.
> - `-m 0x3`: CPU mask specifying which cores SPDK will use.
> - If the `vfio-pci` driver is supported, use `DRIVER_OVERRIDE=vfio-pci` with `setup.sh`.

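The `HUGEMEM` value and `vm.nr_hugepages` used above are two views of the same allocation. Assuming the common 2 MiB hugepage size (the default on x86_64; other sizes change the arithmetic), the page count follows directly:

```shell
# HUGEMEM is expressed in MiB. With 2 MiB hugepages, the matching
# vm.nr_hugepages value is simply HUGEMEM divided by the page size in MiB.
HUGEMEM=4096          # MiB, as passed to setup.sh
HUGEPAGE_SIZE_MB=2    # assumption: default x86_64 hugepage size
echo $(( HUGEMEM / HUGEPAGE_SIZE_MB ))   # → 2048
```

This is why `sysctl -w vm.nr_hugepages=2048` pairs with `HUGEMEM=4096` in the commands above.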
### 2. Deploy CSI Driver and Kubernetes Resources

Deploy the CSI driver following the [deployment guide](../../src/tools/csi-kata-directvolume/docs/deploy-csi-kata-directvol.md).

Create the StorageClass, PVC, and Pod:

```sh
$ cd kata-containers/src/tools/csi-kata-directvolume/examples/pod-with-spdkvol
$ kubectl apply -f csi-storageclass.yaml
$ kubectl apply -f csi-pvc.yaml
$ kubectl apply -f csi-app.yaml
```

This creates:

- StorageClass `spdk-test-adapted` with `volumetype=spdkvol`
- PVC `kata-spdk-directvolume-pvc`
- Pod `spdk-pod-test`

### 3. Verify the Volume

Check the mounted block device inside the pod:

```sh
$ kubectl exec -it spdk-pod-test -- /bin/sh

$ lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
vda    254:0    0  256M  1 disk
└─vda1 254:1    0  253M  1 part
vdb    254:16   0    2G  0 disk /data

$ echo "hello spdk" > /data/test.txt
$ cat /data/test.txt
hello spdk
```

The SPDK-backed volume `/dev/vdb` is mounted at `/data` inside the container.

### 4. Cleanup

```sh
$ kubectl delete -f csi-app.yaml
$ kubectl delete -f csi-pvc.yaml
$ kubectl delete -f csi-storageclass.yaml
```

## Method 2: Using kata-ctl direct-volume (For Manual Setup)

This method is suitable for manual testing or non-Kubernetes environments using containerd.

### 1. Start SPDK vhost target and Create Block Device

```bash
$ export SPDK_DEVEL=<path-to-your-spdk>
$ export VHU_UDS_PATH=/tmp/vhu-targets
$ export RAW_DISKS=<your-rawdisk-path>  # e.g., export RAW_DISKS=/tmp/rawdisks

# Reset and set up hugepages
$ sudo ${SPDK_DEVEL}/scripts/setup.sh reset
$ sudo sysctl -w vm.nr_hugepages=2048
$ sudo HUGEMEM=4096 DRIVER_OVERRIDE=vfio-pci ${SPDK_DEVEL}/scripts/setup.sh

# Start SPDK vhost target
$ sudo ${SPDK_DEVEL}/build/bin/spdk_tgt -S $VHU_UDS_PATH -s 1024 -m 0x3 &
```

Create a raw disk, an AIO bdev on top of it, and a vhost controller:

```bash
# Create a raw disk
$ mkdir -p "${RAW_DISKS}"  # ensure the directory exists
$ sudo dd if=/dev/zero of=${RAW_DISKS}/rawdisk01.20g bs=1M count=20480

# Create an AIO bdev
$ sudo ${SPDK_DEVEL}/scripts/rpc.py bdev_aio_create ${RAW_DISKS}/rawdisk01.20g vhu-rawdisk01.20g 512

# Create a vhost-user-blk controller
$ sudo ${SPDK_DEVEL}/scripts/rpc.py vhost_create_blk_controller vhost-blk-rawdisk01.sock vhu-rawdisk01.20g
```

A vhost controller `vhost-blk-rawdisk01.sock` is created under `$VHU_UDS_PATH/`.

### 2. Configure Direct Volume with kata-ctl

For runtime-rs with Dragonball, there is no need to manually create device nodes. Use `kata-ctl direct-volume add`:

```bash
# Add a direct volume
$ sudo kata-ctl direct-volume add /kubelet/kata-test-vol-001/volume001 "{\"device\": \"${VHU_UDS_PATH}/vhost-blk-rawdisk01.sock\", \"volume_type\":\"spdkvol\", \"fs_type\": \"ext4\", \"metadata\":{}, \"options\": []}"
```

The volume information is stored at `/run/kata-containers/shared/direct-volumes/` with an encoded path.

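As a sketch of how a mount path can be flattened into a single directory name, the snippet below assumes URL-safe base64 encoding (the helper name is hypothetical; verify the exact scheme against your Kata version before relying on it):

```shell
# Hypothetical helper: flatten a volume path into one directory-safe token
# (assumption: URL-safe base64 without newlines).
encode_volume_path() {
    printf '%s' "$1" | base64 | tr '+/' '-_' | tr -d '\n'
}

VOL=/kubelet/kata-test-vol-001/volume001
ENC=$(encode_volume_path "$VOL")
echo "$ENC"

# Round-trip check: decoding restores the original path.
printf '%s' "$ENC" | tr -- '-_' '+/' | base64 -d
```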
### 3. Run a Kata Container
```bash
# For runtime-rs with Dragonball
$ IMAGE=docker.io/library/ubuntu:latest
$ sudo ctr run -t --rm --runtime io.containerd.kata.v2 \
    --mount type=spdkvol,src=/kubelet/kata-test-vol-001/volume001,dst=/disk001,options=rbind:rw \
    "$IMAGE" kata-spdk-vol-test /bin/bash
```

Inside the container, the SPDK volume will be available at `/disk001`.

## Additional Resources

- [How to run Kata Containers with Kinds of Block Volumes](../how-to/how-to-run-kata-containers-with-kinds-of-Block-Volumes.md)
- [CSI Direct Volume Driver README](../../src/tools/csi-kata-directvolume/README.md)
- [SPDK Usage Guide for CSI](../../src/tools/csi-kata-directvolume/docs/spdk-usage.md)
- [Direct Block Device Assignment Design](../design/direct-blk-device-assignment.md)

## Host setup for vhost-user devices

Considering the OCI specification and the characteristics of vhost-user devices, Kata has chosen to use the Linux reserved block major range `240-254` to map each vhost-user block type to a major number. A specific directory is also used for vhost-user devices.

The base directory for vhost-user devices is configurable, with the default being `/var/run/kata-containers/vhost-user`. It can be configured via the `vhost_user_store_path` parameter in the [Kata TOML configuration file](../../src/runtime/README.md#configuration).

Currently, the vhost-user storage device is not enabled by default, so the user should enable it explicitly in the Kata TOML configuration file by setting `enable_vhost_user_store = true`. Since the SPDK vhost-user target requires hugepages, hugepages should also be enabled by setting `enable_hugepages = true`. In summary, the settings for a vhost-user storage device are:

```toml
enable_hugepages = true
enable_vhost_user_store = true
vhost_user_store_path = "<Path of the base directory for vhost-user device>"
```

> **Note:** These parameters are under the `[hypervisor.qemu]` section in the Kata
> TOML configuration file. If they are absent, users should still add them
> under the `[hypervisor.qemu]` section.

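A quick, illustrative way to confirm both switches are enabled is to grep the configuration file. The path below is a stand-in written for the demo; the real file is typically the packaged `configuration.toml` for your installation:

```shell
# Write a minimal stand-in config, then count the enabled switches.
CFG=/tmp/configuration.toml
cat > "$CFG" <<'EOF'
[hypervisor.qemu]
enable_hugepages = true
enable_vhost_user_store = true
vhost_user_store_path = "/var/run/kata-containers/vhost-user"
EOF

# Both settings present and true:
grep -cE '^enable_(hugepages|vhost_user_store) = true' "$CFG"   # → 2
```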
For the subdirectories of `vhost_user_store_path`:

- `block` is used for block devices;
- `block/sockets` is where we expect UNIX domain sockets for vhost-user block devices to live;
- `block/devices` is where simulated block device nodes for vhost-user block devices are created.

For example, if using the default directory `/var/run/kata-containers/vhost-user`, UNIX domain sockets for vhost-user block devices are under `/var/run/kata-containers/vhost-user/block/sockets/`, and device nodes are under `/var/run/kata-containers/vhost-user/block/devices/`.

Currently, Kata has chosen major number `241` to map to `vhost-user-blk` devices.
For a `vhost-user-blk` device named `vhostblk0`, a UNIX domain socket is already
created by the SPDK vhost target, and a block device node with major `241` and
minor `0` should be created for it, in order to be recognized by the Kata runtime:

```bash
$ sudo mknod /var/run/kata-containers/vhost-user/block/devices/vhostblk0 b 241 0
```
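When several sockets exist, the `mknod` step can be generated mechanically. The helper below is hypothetical (not part of Kata) and only prints the commands as a dry run, so the minor-number assignment (one per socket, incrementing from 0, which is an illustration rather than a mandated mapping) can be reviewed before running anything as root:

```shell
# Hypothetical dry-run helper: one mknod command per vhost-user socket,
# major 241, incrementing minors.
plan_device_nodes() {
    store="$1"; minor=0
    for sock in "$store"/block/sockets/*; do
        [ -e "$sock" ] || continue
        echo "mknod $store/block/devices/$(basename "$sock") b 241 $minor"
        minor=$((minor + 1))
    done
}

# Demo against a throwaway store layout:
mkdir -p /tmp/vhu-store/block/sockets /tmp/vhu-store/block/devices
: > /tmp/vhu-store/block/sockets/vhostblk0
plan_device_nodes /tmp/vhu-store
# → mknod /tmp/vhu-store/block/devices/vhostblk0 b 241 0
```

Pipe the reviewed output through `sudo sh` (with the real store path) to create the nodes.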
## Launch a Kata container with SPDK vhost-user block device

To use a `vhost-user-blk` device, use `ctr` to pass it to the container. In your
`config.json`, use the `devices` field to pass the host device to the container.

For example (only the `vhost-user-blk` entry is listed):

```json
{
  "linux": {
    "devices": [
      {
        "path": "/dev/vda",
        "type": "b",
        "major": 241,
        "minor": 0,
        "fileMode": 420,
        "uid": 0,
        "gid": 0
      }
    ]
  }
}
```

With a `rootfs` provisioned under the `bundle` directory, you can run your SPDK container:

```bash
$ sudo ctr run -d --runtime io.containerd.kata.v2 --config bundle/config.json spdk_container
```

Example of performing I/O operations on the `vhost-user-blk` device inside the container:

```
$ sudo ctr t exec --exec-id 1 -t spdk_container sh
/ # ls -l /dev/vda
brw-r--r--    1 root     root      254,   0 Jan 20 03:54 /dev/vda
/ # dd if=/dev/vda of=/tmp/ddtest bs=4k count=20
20+0 records in
20+0 records out
81920 bytes (80.0KB) copied, 0.002996 seconds, 26.1MB/s
```