mirror of
				https://github.com/kata-containers/kata-containers.git
				synced 2025-10-25 14:23:11 +00:00 
			
		
		
		
	https://lists.gnu.org/archive/html/info-gnu/2022-09/msg00001.html Signed-off-by: Balint Tobik <btobik@redhat.com>
		
			
				
	
	
		
# Setup to use SR-IOV with Kata Containers and Docker*

Single Root I/O Virtualization (SR-IOV) enables splitting a physical device into
virtual functions (VFs). Virtual functions enable direct passthrough to virtual
machines or containers. For Kata Containers, we enabled a Container Network
Model (CNM) plugin. Additionally, we made the necessary changes in the
runtime to detect virtual functions in a container's network namespace to use
SR-IOV for network-based devices.

## Install the SR-IOV Docker\* plugin

To create a network with associated VFs, which can be passed to
Kata Containers, you must install an SR-IOV Docker plugin. The
created network is based on a physical function (PF) device. The network can
serve up to `n` containers, where `n` is the number of VFs associated with the
PF.

To install the plugin, follow the [plugin installation instructions](https://github.com/clearcontainers/sriov).

## Host setup for SR-IOV

To set up your host for SR-IOV, the following must be true:

- The host system must support Intel VT-d.
- Your device (NIC) must support SR-IOV.
- The host kernel must have Input-Output Memory Management Unit (IOMMU)
  and Virtual Function I/O (VFIO) support.
- `CONFIG_VFIO_NOIOMMU` must be disabled in the host kernel
  configuration. If it is enabled, you must rebuild your host system's kernel
  with `CONFIG_VFIO_NOIOMMU` disabled.
- Optionally, you might need to add a PCI override for your Network Interface
  Controller (NIC). The section [Checking your NIC for SR-IOV](#checking-your-nic-for-sr-iov)
  describes how to assess if you need to make NIC changes and how to make
  the necessary changes.

In addition, you need to enable the NIC driver in your guest kernel config
(e.g. `mlx5` for a Mellanox NIC). All the modules need to be compiled as
built-in instead of loadable.

### Checking your NIC for SR-IOV

The following is an example of how to use `lspci` to check if your NIC supports
SR-IOV.

```
$ lspci | grep -i -F ethernet
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 03)
...
$ # sudo is required below to read the card capabilities
$ sudo lspci -s 01:00.0 -v | grep SR-IOV
        Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
```

If your card does not report this capability, then it does not support SR-IOV.

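As a complementary check, the kernel exposes a `sriov_totalvfs` attribute in `sysfs` for SR-IOV capable physical functions. The following is a minimal sketch, not part of the original procedure: the function name `sriov_total_vfs` is hypothetical, and the `sysfs` root is passed as a parameter (normally `/sys`) so the function can be tried against a scratch directory.

```shell
# Print the maximum number of VFs a PCI device advertises, or fail if the
# device exposes no sriov_totalvfs attribute.
# $1: sysfs root (normally /sys), $2: full PCI address, e.g. 0000:01:00.0
sriov_total_vfs() {
    sysfs_root=$1
    bdf=$2
    attr="$sysfs_root/bus/pci/devices/$bdf/sriov_totalvfs"
    if [ -r "$attr" ]; then
        cat "$attr"
    else
        echo "no SR-IOV capability reported for $bdf" >&2
        return 1
    fi
}
```

For the card in the example above, this would be called as `sriov_total_vfs /sys 0000:01:00.0`.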
### IOMMU Groups and PCIe Access Control Services

Run the following command to see how the IOMMU groups are set up on your
host system:

```
$ find /sys/kernel/iommu_groups/ -type l
```

The command's output shows whether or not your NIC is set up
appropriately with respect to PCIe Access Control Services (ACS).
If the IOMMU groups are set up properly, the PCI device for each ACS-enabled
NIC port should be in its own IOMMU group. If the PCI bridge is within the same
IOMMU group as your NIC, it indicates that either your device does not support
ACS or your device does not properly expose this capability.

If you do not see any output when running the previous
command, then you likely need to update your host's kernel configuration.

For more details, see the blog post, "[IOMMU Groups, inside and out](http://vfio.blogspot.com/2014/08/iommu-groups-inside-and-out.html)".

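The raw `find` output can be hard to scan on hosts with many devices. The sketch below regroups it as one `group N: <PCI address>` line per device, which makes it easier to spot a NIC port that shares a group with a bridge. The function name `list_iommu_groups` is hypothetical, and the groups directory is a parameter (normally `/sys/kernel/iommu_groups`) so the function can be exercised on a test directory.

```shell
# Print "group <N>: <PCI address>" for every device link found under the
# given IOMMU groups directory (layout: <root>/<N>/devices/<PCI address>).
list_iommu_groups() {
    groups_root=$1
    for dev in "$groups_root"/*/devices/*; do
        # Skip the unexpanded glob when no groups exist.
        [ -e "$dev" ] || [ -L "$dev" ] || continue
        group=${dev%/devices/*}   # strip "/devices/<PCI address>"
        group=${group##*/}        # keep only the group number
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done
}
```

For example, `list_iommu_groups /sys/kernel/iommu_groups | grep 01:00` shows only the lines for the NIC ports used in this document.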
### Update the host kernel

Depending on your host kernel configuration, you might have to rebuild the
kernel. If the following conditions are true, you do not need to rebuild
your kernel:

- `CONFIG_VFIO_IOMMU_TYPE1`, `CONFIG_VFIO`, and `CONFIG_VFIO_PCI` are set in
  the kernel configuration. Your kernel is built with VFIO support when
  these options are set.
- `CONFIG_VFIO_NOIOMMU` is disabled in the host kernel configuration.

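The two conditions above can be checked in one pass. The following is a sketch under the assumption that your distribution ships the running kernel's configuration as a plain-text file such as `/boot/config-$(uname -r)`; the function name `check_vfio_config` and its messages are hypothetical.

```shell
# Report whether a kernel config file satisfies the VFIO conditions.
# $1: path to a kernel config file, e.g. /boot/config-$(uname -r)
check_vfio_config() {
    config=$1
    rc=0
    # Each VFIO option must be built in (=y) or built as a module (=m).
    for opt in CONFIG_VFIO CONFIG_VFIO_PCI CONFIG_VFIO_IOMMU_TYPE1; do
        if ! grep -q "^${opt}=[ym]" "$config"; then
            echo "missing: $opt"
            rc=1
        fi
    done
    # The no-IOMMU mode must not be enabled.
    if grep -q '^CONFIG_VFIO_NOIOMMU=y' "$config"; then
        echo "must be disabled: CONFIG_VFIO_NOIOMMU"
        rc=1
    fi
    [ "$rc" -eq 0 ] && echo "no rebuild needed"
    return "$rc"
}
```

Running `check_vfio_config "/boot/config-$(uname -r)"` prints `no rebuild needed` when both conditions hold, and otherwise lists the options that need attention.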
See the following steps one through three if you need to rebuild the kernel.

The following steps, which are based on the Ubuntu 16.04 distribution, update
the SR-IOV host system. If you use a different distribution, make
appropriate adjustments to the commands.

Before building a new kernel, keep in mind:

- You need to be *very clear* about the security and maintenance implications
  of creating a new **host kernel**.
- Mistakes in installing new kernels and updating the bootloader could make
  your system unbootable.
- We advise you to ensure you have a recent (and tested) full system backup
  before proceeding.

1. Grab kernel sources:

   ```
   $ sudo apt-get install linux-source-<linux-version>
   $ sudo apt-get install linux-headers-<linux-version>
   $ cd /usr/src/linux-source-<linux-version>/
   $ sudo tar -xvf linux-source-<linux-version>.tar.bz2
   $ cd linux-source-<linux-version>
   $ sudo apt-get install libssl-dev
   ```

2. Examine and update the `config` file if necessary:

   ```
   $ sudo cp /boot/config-4.8.0-36-generic .config
   $ # verify the resulting .config does not have NOIOMMU set, i.e. `CONFIG_VFIO_NOIOMMU` is not set
   $ grep -q "^CONFIG_VFIO_NOIOMMU" .config || echo ok
   $ # verify `CONFIG_VFIO_IOMMU_TYPE1`, `CONFIG_VFIO=m` and `CONFIG_VFIO_PCI=m` are set as well
   $ for opt in CONFIG_VFIO_IOMMU_TYPE1 CONFIG_VFIO CONFIG_VFIO_PCI
     do
      grep "^${opt}=" .config
     done
   $ sudo make olddefconfig
   ```

   You might want to modify the kernel `Makefile` to add a unique identifier
   to the `EXTRAVERSION` variable prior to running `make`. Setting
   `EXTRAVERSION` causes the `uname -r` command to indicate that a customized
   kernel is installed and running.

3. Build and install the kernel:

   ```
   $ make -j <number_of_cpus>
   $ make modules
   $ sudo make modules_install
   $ sudo make install
   ```

4. Edit grub to enable `intel_iommu`:

   Edit `/etc/default/grub` and add `intel_iommu=on` to the kernel command line:

   ```
   $ sudo sed -i -e 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 intel_iommu=on"/g' /etc/default/grub
   $ sudo update-grub
   ```

5. Reboot the system and verify:

   Host system should be ready now. Reboot the system.

   ```
   $ sudo reboot
   ```

   To verify the kernel version and the kernel command line, take a look at
   `/proc/version` and `/proc/cmdline`.

6. Verify Intel VT-d is initialized:

   To check if Intel VT-d initialized correctly, look for the following
   line in the `dmesg` output:

   ```
   DMAR: Intel(R) Virtualization Technology for Directed I/O
   ```

   Older kernels use a different prefix (e.g. PCI-DMA):

   ```
   PCI-DMA: Intel(R) Virtualization Technology for Directed I/O
   ```

7. Add the `vfio-pci` module:

   ```
   $ sudo modprobe vfio-pci
   ```

8. Add PCI quirk for SR-IOV NIC if necessary:

   ```
   $ find /sys/kernel/iommu_groups/ -type l
   ```

   Use the previous command to verify that your NIC appears in its own IOMMU
   group and no other devices appear in the same group. In the rare case where
   your PCI NIC does not appear in its own group, it is likely that the NIC
   does not support ACS or you built and ran an old kernel. Depending on your
   NIC and whether it enforces isolation, you might resolve this by adding a
   `pcie_acs_override=` option to your kernel command line and rebooting.
   See [PCIE-ACS-override-option](https://lkml.org/lkml/2013/5/30/513) for
   detailed information about this option.

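After the reboot, checking that `intel_iommu=on` actually reached the kernel command line can be scripted. The sketch below compares whole whitespace-separated tokens, so that a longer parameter containing the same text does not count as a match. The function name `cmdline_has` is hypothetical, and the optional second argument exists only so the check can be tried on a sample file instead of `/proc/cmdline`.

```shell
# Succeed if the exact parameter appears as a token on the kernel command line.
# $1: parameter to look for, e.g. intel_iommu=on
# $2: file to read (optional, defaults to /proc/cmdline)
cmdline_has() {
    param=$1
    file=${2:-/proc/cmdline}
    # Put each token on its own line, then require a full-line fixed-string match.
    tr -s ' \t' '\n\n' < "$file" | grep -qxF -- "$param"
}
```

A typical use is `cmdline_has intel_iommu=on && echo "IOMMU requested on the command line"`.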
## Set up the SR-IOV Device

All the steps in prior sections need to be performed just once to prepare the
SR-IOV host system. The following procedure sets up your SR-IOV device and
needs to be done on every system boot. Setup includes loading a device driver,
finding out how many virtual functions (VFs) you can create, and creating those
virtual functions. Once you create VFs, you cannot increase or decrease the
number of VFs without first setting the number back to zero. Based on this, it
is expected that you set the number of VFs for a physical device just once.

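The reset-to-zero constraint can be wrapped in a small helper. The following is a sketch only: the function name `set_numvfs` is hypothetical, it takes the device's `sysfs` directory as an argument, and on a real host it has to run with root privileges because `sriov_numvfs` is writable only by root.

```shell
# Set the number of VFs for a physical function, resetting to zero first
# when a different non-zero count is already configured.
# $1: device directory, e.g. /sys/bus/pci/devices/0000:01:00.0
# $2: desired number of VFs
set_numvfs() {
    dev_dir=$1
    want=$2
    total=$(cat "$dev_dir/sriov_totalvfs") || return 1
    if [ "$want" -gt "$total" ]; then
        echo "requested $want VFs, but the device supports at most $total" >&2
        return 1
    fi
    current=$(cat "$dev_dir/sriov_numvfs") || return 1
    # Nothing to do when the requested count is already in place.
    [ "$current" -eq "$want" ] && return 0
    # The count cannot be changed directly from one non-zero value to another.
    if [ "$current" -ne 0 ]; then
        echo 0 > "$dev_dir/sriov_numvfs" || return 1
    fi
    echo "$want" > "$dev_dir/sriov_numvfs"
}
```

From a root shell, `set_numvfs /sys/bus/pci/devices/0000:01:00.0 4` would request four VFs on the first port of the example card.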
1. Add `vfio-pci` device driver:

   ```
   $ sudo modprobe vfio-pci
   ```

   `vfio-pci` is a driver used to reserve a VF PCI device.

2. Find the NICs of interest:

   ```
   $ lspci | grep Ethernet
   00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-LM (rev 04)
   01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
   01:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
   ```

   The previous example finds the PCI details for the NICs in question.
   In our case, 01:00.0 and 01:00.1 are the two ports on the X540-AT2 card
   that we will use. You can use the `lshw` command to get further details on
   the controller and verify it supports SR-IOV.

3. Check how many VFs you can create:

   ```
   $ cat /sys/bus/pci/devices/0000\:01\:00.0/sriov_totalvfs
   63
   $ cat /sys/bus/pci/devices/0000\:01\:00.1/sriov_totalvfs
   63
   ```

   The previous commands show how many VFs you can create. The `sriov_totalvfs`
   file under `sysfs` for a PCI device specifies the total number of VFs that you
   can create.

4. Create the VFs:

   ```
   $ echo 1 | sudo tee /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
   $ echo 1 | sudo tee /sys/bus/pci/devices/0000\:01\:00.1/sriov_numvfs
   ```

   Create virtual functions by writing the desired count to `sriov_numvfs`.
   This example creates one VF per physical device. Note that creating only one
   VF defeats the purpose of SR-IOV, and is done here for simplicity.

5. Verify the VFs were added to the host:

   ```
   $ sudo lspci | grep Ethernet | grep Virtual
   02:10.0 Ethernet controller: Intel Corporation X540 Ethernet Controller Virtual Function (rev 01)
   02:10.1 Ethernet controller: Intel Corporation X540 Ethernet Controller Virtual Function (rev 01)
   ```

6. Assign a MAC address to each VF:

   ```
   $ sudo ip link set <pf> vf <vfidx> mac <fake MAC address>
   ```

   Depending on the NIC being used, you might need to explicitly set the MAC
   address for the VF device. Setting the MAC address guarantees that the
   address is consistent on the host and when passed to the guest. Verify that
   a MAC address is assigned to the VF using the command `ip link show dev <vf>`.

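When picking the `<fake MAC address>` in step 6, one option is to derive a locally administered unicast address from the port and VF indexes, so the same VF always gets the same address across reboots. The sketch below is an illustration only: the function name `vf_mac` and the `02:00:00:00:<port>:<vf>` layout are assumptions, not something the SR-IOV plugin requires.

```shell
# Print a locally administered unicast MAC address for a given port and VF.
# The first octet 02 has the "locally administered" bit set and the
# multicast bit clear.
# $1: port (PF) index, 0-255; $2: VF index, 0-255
vf_mac() {
    printf '02:00:00:00:%02x:%02x\n' "$1" "$2"
}
```

With the interface name used later in this document, the first VF could then be configured with `sudo ip link set enp1s0f0 vf 0 mac "$(vf_mac 0 0)"`.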
## Example: Launch a Kata Containers container using SR-IOV

The following example launches a Kata Containers container using SR-IOV:

1. Build and start SR-IOV plugin:

   To install the SR-IOV plugin, follow the [SR-IOV plugin installation instructions](https://github.com/clearcontainers/sriov).

2. Create the Docker network:

   ```
   $ sudo docker network create -d sriov --internal --opt pf_iface=enp1s0f0 --opt vlanid=100 --subnet=192.168.0.0/24 vfnet

   E0505 09:35:40.550129    2541 plugin.go:297] Numvfs and Totalvfs are not same on the PF - Initialize numvfs to totalvfs
   ee2e5a594f9e4d3796eda972f3b46e52342aea04cbae8e5eac9b2dd6ff37b067
   ```

   The previous command creates the required SR-IOV Docker network with the
   given subnet, `vlanid`, and physical interface.

3. Start containers and test their connectivity:

   ```
   $ sudo docker run --runtime=kata-runtime --net=vfnet --cap-add SYS_ADMIN --ip=192.168.0.10 -it alpine
   ```

   The previous example starts a container making use of SR-IOV.
   If two machines with SR-IOV enabled NICs are connected back-to-back and each
   has a network with matching `vlanid` created, use the following two commands
   to test the connectivity:

   Machine 1:

   ```
   sriov-1:~$ sudo docker run --runtime=kata-runtime --net=vfnet --cap-add SYS_ADMIN --ip=192.168.0.10 -it mcastelino/iperf bash -c "mount -t ramfs -o size=20M ramfs /tmp; iperf3 -s"
   ```

   Machine 2:

   ```
   sriov-2:~$ sudo docker run --runtime=kata-runtime --net=vfnet --cap-add SYS_ADMIN --ip=192.168.0.11 -it mcastelino/iperf bash -c "mount -t ramfs -o size=20M ramfs /tmp; iperf3 -c 192.168.0.10"
   ```