# Setup to use SR-IOV with Kata Containers and Docker\*
Single Root I/O Virtualization (SR-IOV) enables splitting a physical device into
virtual functions (VFs). Virtual functions enable direct passthrough to virtual
machines or containers. For Kata Containers, we enabled a Container Network
Model (CNM) plugin. Additionally, we made the necessary changes in the
runtime to detect virtual functions in a container's network namespace to use
SR-IOV for network-based devices.
## Install the SR-IOV Docker\* plugin
To create a network with associated VFs, which can be passed to
Kata Containers, you must install an SR-IOV Docker plugin. The
created network is based on a physical function (PF) device. The network can
support up to `n` containers, where `n` is the number of VFs associated with
the PF.
To install the plugin, follow the [plugin installation instructions](https://github.com/clearcontainers/sriov).
## Host setup for SR-IOV
To set up your host for SR-IOV, the following must be true:
- The host system must support Intel VT-d.
- Your device (NIC) must support SR-IOV.
- The host kernel must have Input-Output Memory Management Unit (IOMMU)
and Virtual Function I/O (VFIO) support.
- `CONFIG_VFIO_NOIOMMU` must be disabled in the host kernel
configuration. You must rebuild your host system's kernel in
order to disable `CONFIG_VFIO_NOIOMMU` in the kernel configuration.
- Optionally, you might need to add a PCI override for your Network Interface
Controller (NIC). The section
[Checking your NIC for SR-IOV](#checking-your-nic-for-sr-iov) describes how to
assess whether you need to make NIC changes and how to make the necessary
changes.

In addition, you need to enable the NIC driver in your guest kernel
configuration (for example, `mlx5` for a Mellanox NIC).
All the modules need to be compiled as built-in instead of loadable.
### Checking your NIC for SR-IOV
The following is an example of how to use `lspci` to check if your NIC supports
SR-IOV.
```
$ lspci | grep -i -F ethernet
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 03)
...
$ #sudo required below to read the card capabilities
$ sudo lspci -s 01:00.0 -v | grep SR-IOV
Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
```
If your card does not report this capability, then it does not support SR-IOV.
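
As a complement to `lspci`, the kernel exposes a `sriov_totalvfs` file in
`sysfs` for physical functions that support SR-IOV. The following is a minimal
sketch, not part of the original procedure, that lists the network interfaces
whose PCI device exposes this file. The function name `list_sriov_capable` and
its optional root-directory argument are hypothetical; the argument exists only
so the logic can be tried against a mock directory.

```shell
# Print "<interface> <max VFs>" for every interface exposing sriov_totalvfs.
# The optional first argument (default /sys/class/net) is an illustrative
# assumption that allows running against a mock directory tree.
list_sriov_capable() {
    net_root="${1:-/sys/class/net}"
    for iface in "$net_root"/*; do
        totalvfs_file="$iface/device/sriov_totalvfs"
        # Interfaces without this file are not SR-IOV physical functions.
        [ -r "$totalvfs_file" ] || continue
        printf '%s %s\n' "$(basename "$iface")" "$(cat "$totalvfs_file")"
    done
}
```

Calling `list_sriov_capable` with no arguments inspects the running host.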
### IOMMU Groups and PCIe Access Control Services
Run the following command to see how the IOMMU groups are set up on your
host system:
```
$ find /sys/kernel/iommu_groups/ -type l
```
The command's output shows whether your NIC is set up
appropriately with respect to PCIe Access Control Services (ACS).
If the IOMMU groups are set up properly, the PCI device for each ACS-enabled
NIC port should be in its own IOMMU group. If the PCI bridge is within the same
IOMMU group as your NIC, either your device does not support ACS
or your device does not properly advertise this capability.
If you do not see any output when running the previous
command, then you likely need to update your host's kernel configuration.
For more details, see the blog post, "[IOMMU Groups, inside and out](http://vfio.blogspot.com/2014/08/iommu-groups-inside-and-out.html)".
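
The raw `find` output can be hard to read when many groups exist. As a hedged
sketch, the same information can be printed grouped by IOMMU group. The helper
name `list_iommu_groups` and its optional root argument are hypothetical and
are not part of the original instructions.

```shell
# Print each IOMMU group followed by the PCI addresses of its member devices.
# The optional root argument (default /sys/kernel/iommu_groups) is an
# illustrative assumption for testing against a mock tree.
list_iommu_groups() {
    groups_root="${1:-/sys/kernel/iommu_groups}"
    for group in "$groups_root"/*; do
        [ -d "$group/devices" ] || continue
        echo "IOMMU group $(basename "$group"):"
        for dev in "$group"/devices/*; do
            # Skip the unexpanded glob when a group has no devices.
            [ -e "$dev" ] || continue
            echo "  $(basename "$dev")"
        done
    done
}
```

A NIC port that shares its group with a PCI bridge or another device shows up
under the same `IOMMU group` heading as that device.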
### Update the host kernel
Depending on your host kernel configuration, you might have to rebuild the
kernel. If the following conditions are true, you do not need to rebuild
your kernel:
- `CONFIG_VFIO_IOMMU_TYPE1`, `CONFIG_VFIO`, and `CONFIG_VFIO_PCI` are set in
the kernel configuration. Your kernel is built with VFIO support when
these options are set.
- `CONFIG_VFIO_NOIOMMU` is disabled in the host kernel configuration.

If you need to rebuild the kernel, follow steps one through three below.
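
The two conditions above can be checked in one pass. The following is a
minimal sketch under the assumption that your distribution stores the running
kernel's configuration at `/boot/config-$(uname -r)`; the helper name
`check_vfio_config` is hypothetical.

```shell
# Return 0 if the given kernel config needs no rebuild for SR-IOV use:
# the three VFIO options are set (=y or =m) and CONFIG_VFIO_NOIOMMU is not.
# check_vfio_config is a hypothetical helper name.
check_vfio_config() {
    config_file="$1"
    for opt in CONFIG_VFIO_IOMMU_TYPE1 CONFIG_VFIO CONFIG_VFIO_PCI; do
        if ! grep -q "^${opt}=[ym]" "$config_file"; then
            echo "missing: $opt"
            return 1
        fi
    done
    if grep -q "^CONFIG_VFIO_NOIOMMU=y" "$config_file"; then
        echo "CONFIG_VFIO_NOIOMMU is enabled"
        return 1
    fi
    echo "ok"
}
# Example: check_vfio_config "/boot/config-$(uname -r)"
```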
The following steps, which are based on the Ubuntu 16.04 distribution, update
the SR-IOV host system. If you use a different distribution, make
appropriate adjustments to the commands.
Before building a new kernel, keep in mind:
- You need to be *very clear* about the security and maintenance implications
of creating a new **host kernel**.
- Mistakes in installing new kernels and updating the bootloader could make
your system unbootable.
- We advise you to ensure you have a recent (and tested) full system backup
before proceeding.
1. Grab kernel sources:
```
$ sudo apt-get install linux-source-<linux-version>
$ sudo apt-get install linux-headers-<linux-version>
$ cd /usr/src/linux-source-<linux-version>/
$ sudo tar -xvf linux-source-<linux-version>.tar.bz2
$ cd linux-source-<linux-version>
$ sudo apt-get install libssl-dev
```
2. Examine and update the `config` file if necessary:
```
$ sudo cp /boot/config-$(uname -r) .config
$ # verify the resulting .config does not have NOIOMMU set, i.e. `CONFIG_VFIO_NOIOMMU` is not set
$ grep -q "^CONFIG_VFIO_NOIOMMU" .config || echo ok
$ # verify `CONFIG_VFIO_IOMMU_TYPE1`, `CONFIG_VFIO` and `CONFIG_VFIO_PCI` are set as well
$ for opt in CONFIG_VFIO_IOMMU_TYPE1 CONFIG_VFIO CONFIG_VFIO_PCI
do
    grep "^${opt}=" .config
done
$ sudo make olddefconfig
```
You might want to modify the kernel `Makefile` to add a unique identifier
to the `EXTRAVERSION` variable before running `make`. Setting the `EXTRAVERSION`
variable causes the `uname -r` command to indicate that a customized kernel is
installed and running.
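
One possible way to set the identifier without opening an editor is sketched
below. The helper name `set_extraversion` and the `-sriov` suffix are
hypothetical; choose any suffix that does not contain a `/` character, since
the sketch passes it straight to `sed`.

```shell
# Set EXTRAVERSION in a kernel Makefile to the given suffix.
# set_extraversion and the "-sriov" suffix used below are hypothetical.
set_extraversion() {
    makefile="$1"
    suffix="$2"
    tmp_file=$(mktemp)
    # Rewrite only the line that assigns EXTRAVERSION.
    sed "s/^EXTRAVERSION[[:space:]]*=.*/EXTRAVERSION = ${suffix}/" \
        "$makefile" > "$tmp_file" || { rm -f "$tmp_file"; return 1; }
    # Copy the result back so the original file keeps its permissions.
    cat "$tmp_file" > "$makefile"
    rm -f "$tmp_file"
}
# Example, run as root from the kernel source directory:
#   set_extraversion Makefile -sriov
```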
3. Build and install the kernel:
```
$ make -j <number_of_cpus>
$ make modules
$ sudo make modules_install
$ sudo make install
```
4. Edit grub to enable `intel_iommu`:
```
$ # edit /etc/default/grub and add intel_iommu=on to the kernel command line:
$ sudo sed -i -e 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 intel_iommu=on"/g' /etc/default/grub
$ sudo update-grub
```
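Running a `sed` substitution more than once appends the parameter repeatedly.
The following sketch shows one way to make the edit idempotent. The helper
name `add_intel_iommu` is hypothetical, and the sketch assumes the file
contains a double-quoted `GRUB_CMDLINE_LINUX="..."` line, as on Ubuntu.

```shell
# Append intel_iommu=on to GRUB_CMDLINE_LINUX unless it is already present.
# add_intel_iommu is a hypothetical helper; pass the grub defaults file.
add_intel_iommu() {
    grub_file="$1"
    # Nothing to do when the parameter is already on the line.
    if grep -q '^GRUB_CMDLINE_LINUX=.*intel_iommu=on' "$grub_file"; then
        return 0
    fi
    tmp_file=$(mktemp)
    sed 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 intel_iommu=on"/' \
        "$grub_file" > "$tmp_file" || { rm -f "$tmp_file"; return 1; }
    cat "$tmp_file" > "$grub_file"
    rm -f "$tmp_file"
}
# Example, run as root, followed by update-grub:
#   add_intel_iommu /etc/default/grub
```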
5. Reboot the system and verify:
The host system should now be ready. Reboot the system:
```
$ sudo reboot
```
To verify the kernel version and the kernel command line, take a look at
`/proc/version` and `/proc/cmdline`.
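
The command-line check can also be scripted. This is a small sketch; the helper
name `iommu_on_cmdline` and its optional file argument are hypothetical.

```shell
# Return 0 if the kernel command line contains intel_iommu=on.
# The optional file argument (default /proc/cmdline) is an assumption that
# allows testing against a sample file.
iommu_on_cmdline() {
    cmdline_file="${1:-/proc/cmdline}"
    # Split on spaces and match the parameter as a whole token.
    tr ' ' '\n' < "$cmdline_file" | grep -qx 'intel_iommu=on'
}
# Example: iommu_on_cmdline && echo "intel_iommu=on is active"
```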
6. Verify Intel VT-d is initialized:
To check if Intel VT-d initialized correctly, look for the following
line in the `dmesg` output:
```
DMAR: Intel(R) Virtualization Technology for Directed I/O
```
Older kernels use a different prefix (e.g. PCI-DMA):
```
PCI-DMA: Intel(R) Virtualization Technology for Directed I/O
```
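Both message variants can be matched with a single pattern. The following
sketch reads kernel log text from standard input; the helper name
`vtd_initialized` is hypothetical.

```shell
# Read kernel log text on stdin; return 0 if it reports VT-d initialization
# with either the DMAR prefix or the older PCI-DMA prefix.
vtd_initialized() {
    grep -Eq '(DMAR|PCI-DMA): Intel\(R\) Virtualization Technology for Directed I/O'
}
# Example: sudo dmesg | vtd_initialized && echo "VT-d initialized"
```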
7. Add the `vfio-pci` module:
```
$ sudo modprobe vfio-pci
```
8. Add PCI quirk for SR-IOV NIC if necessary:
```
$ find /sys/kernel/iommu_groups/ -type l
```
The previous command verifies that your NIC appears in its own IOMMU group
and no other devices appear in the same group. In the rare case where your
PCI NIC does not appear in its own group, it is likely that the NIC does
not support ACS or you built and ran an old kernel. Depending on your NIC
and whether it enforces isolation, you might resolve this by adding a
`pcie_acs_override=` option to your kernel command line and rebooting.
See [PCIE-ACS-override-option](https://lkml.org/lkml/2013/5/30/513) for
detailed information about this option.
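
To check a single NIC port rather than scanning the whole `find` output, you
can count the members of its IOMMU group through the `iommu_group` link in
`sysfs`. This is a sketch only; the helper name `nic_is_isolated` and its
optional root argument are hypothetical.

```shell
# Return 0 if the PCI device is the only member of its IOMMU group.
# Arguments: PCI address (e.g. 0000:01:00.0) and an optional sysfs devices
# root (default /sys/bus/pci/devices; the override is an assumption for tests).
nic_is_isolated() {
    pci_addr="$1"
    devices_root="${2:-/sys/bus/pci/devices}"
    group_devices="$devices_root/$pci_addr/iommu_group/devices"
    if [ ! -d "$group_devices" ]; then
        echo "no IOMMU group found for $pci_addr" >&2
        return 2
    fi
    # Count the entries in the group; exactly one means the NIC is alone.
    count=$(ls "$group_devices" | wc -l | tr -d ' ')
    [ "$count" -eq 1 ]
}
```

For example, `nic_is_isolated 0000:01:00.0 || echo "NIC shares its IOMMU group"`
reports a port that might need the override described above.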
## Set up the SR-IOV Device
All the steps in the prior sections need to be performed only once to prepare
the SR-IOV host system. The following procedure, in contrast, sets up a
physical device's virtual functions and must be repeated on every system boot.
Setup includes loading a device driver, finding out how many
virtual functions (VFs) you can create, and creating those virtual functions.
Once you create VFs, you cannot increase or decrease the number of VFs without
first setting the number back to zero. For this reason, you are expected to
set the number of VFs for a physical device just once.
1. Add `vfio-pci` device driver:
```
$ sudo modprobe vfio-pci
```
`vfio-pci` is a driver used to reserve a VF PCI device.
2. Find the NICs of interest:
```
$ lspci | grep Ethernet
00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-LM (rev 04)
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)
```
The previous example finds the PCI details for the NICs in question.
In our case, 01:00.0 and 01:00.1 are the two ports on our X540-AT2 card
that we will use. You can use the `lshw` command to get further details on the
controller and verify it supports SR-IOV.
3. Check how many VFs you can create:
```
$ cat /sys/bus/pci/devices/0000\:01\:00.0/sriov_totalvfs
63
$ cat /sys/bus/pci/devices/0000\:01\:00.1/sriov_totalvfs
63
```
The previous commands show how many VFs you can create. The `sriov_totalvfs`
file under `sysfs` for a PCI device specifies the total number of VFs that you
can create.
4. Create the VFs:
```
$ echo 1 | sudo tee /sys/bus/pci/devices/0000\:01\:00.0/sriov_numvfs
$ echo 1 | sudo tee /sys/bus/pci/devices/0000\:01\:00.1/sriov_numvfs
```
Create virtual functions by writing the desired count to `sriov_numvfs`.
This example creates one VF per physical device. Note that creating only one
VF defeats the purpose of SR-IOV; it is done here for simplicity.
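
Because the VF count must go back to zero before it can change, a helper that
always resets first can be convenient. The following is a minimal sketch; the
name `set_numvfs` is hypothetical, and writing to the real `sysfs` files
requires root.

```shell
# Set the number of VFs for a physical function, resetting to zero first.
# Argument 1: PF sysfs directory, e.g. /sys/bus/pci/devices/0000:01:00.0
# Argument 2: desired number of VFs
# set_numvfs is a hypothetical helper; writing to real sysfs requires root.
set_numvfs() {
    pf_dir="$1"
    wanted="$2"
    total=$(cat "$pf_dir/sriov_totalvfs") || return 1
    if [ "$wanted" -gt "$total" ]; then
        echo "requested $wanted VFs, but the device supports at most $total" >&2
        return 1
    fi
    # The count cannot change directly from one non-zero value to another.
    echo 0 > "$pf_dir/sriov_numvfs"
    echo "$wanted" > "$pf_dir/sriov_numvfs"
}
# Example, run as root: set_numvfs /sys/bus/pci/devices/0000:01:00.0 1
```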
5. Verify the VFs were added to the host:
```
$ sudo lspci | grep Ethernet | grep Virtual
02:10.0 Ethernet controller: Intel Corporation X540 Ethernet Controller Virtual Function (rev 01)
02:10.1 Ethernet controller: Intel Corporation X540 Ethernet Controller Virtual Function (rev 01)
```
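The kernel also records each VF as a `virtfn<N>` symbolic link inside the PF's
`sysfs` directory, which makes it possible to list the VFs of one specific PF.
This is a sketch; the helper name `list_vfs` is hypothetical.

```shell
# Print the PCI address of every VF that belongs to the given PF.
# Each VF appears as a virtfn<N> symlink inside the PF's sysfs directory.
list_vfs() {
    pf_dir="$1"
    for link in "$pf_dir"/virtfn*; do
        [ -L "$link" ] || continue
        # The last path component of the link target is the VF's PCI address.
        basename "$(readlink "$link")"
    done
}
# Example: list_vfs /sys/bus/pci/devices/0000:01:00.0
```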
6. Assign a MAC address to each VF:
```
$ sudo ip link set <pf> vf <vfidx> mac <fake MAC address>
```
Depending on the NIC being used, you might need to explicitly set the MAC
address for the VF device. Setting the MAC address guarantees that the
address is consistent on the host and when passed to the guest. Verify that a
MAC address is assigned to the VF using the command `ip link show dev <vf>`.
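
If you have no address plan for the VFs, one option is to generate a random
locally administered unicast address, which starts with the octet `02`. The
following is a sketch only; the helper name `gen_mac` is hypothetical, and you
should make sure the generated address is unique on your network.

```shell
# Print a random locally administered, unicast MAC address (first octet 02).
# gen_mac is a hypothetical helper name.
gen_mac() {
    # Read 5 random bytes as hex pairs and join them with colons.
    tail_octets=$(od -An -N5 -tx1 /dev/urandom | tr -s ' ' ':' | sed 's/^://; s/:$//')
    printf '02:%s\n' "$tail_octets"
}
```

The result can be passed to the command above, for example
`sudo ip link set <pf> vf <vfidx> mac "$(gen_mac)"`.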
## Example: Launch a Kata Containers container using SR-IOV
The following example launches a Kata Containers container using SR-IOV:
1. Build and start SR-IOV plugin:
To install the SR-IOV plugin, follow the [SR-IOV plugin installation instructions](https://github.com/clearcontainers/sriov).
2. Create the Docker network:
```
$ sudo docker network create -d sriov --internal --opt pf_iface=enp1s0f0 --opt vlanid=100 --subnet=192.168.0.0/24 vfnet
E0505 09:35:40.550129 2541 plugin.go:297] Numvfs and Totalvfs are not same on the PF - Initialize numvfs to totalvfs
ee2e5a594f9e4d3796eda972f3b46e52342aea04cbae8e5eac9b2dd6ff37b067
```
The previous command creates the SR-IOV Docker network with the given subnet,
`vlanid`, and physical interface.
3. Start containers and test their connectivity:
```
$ sudo docker run --runtime=kata-runtime --net=vfnet --cap-add SYS_ADMIN --ip=192.168.0.10 -it alpine
```
The previous example starts a container making use of SR-IOV.
If two machines with SR-IOV enabled NICs are connected back-to-back and each
has a network with matching `vlanid` created, use the following two commands
to test the connectivity:
Machine 1:
```
sriov-1:~$ sudo docker run --runtime=kata-runtime --net=vfnet --cap-add SYS_ADMIN --ip=192.168.0.10 -it mcastelino/iperf bash -c "mount -t ramfs -o size=20M ramfs /tmp; iperf3 -s"
```
Machine 2:
```
sriov-2:~$ sudo docker run --runtime=kata-runtime --net=vfnet --cap-add SYS_ADMIN --ip=192.168.0.11 -it mcastelino/iperf bash -c "mount -t ramfs -o size=20M ramfs /tmp; iperf3 -c 192.168.0.10"
```