kata-containers/src/runtime/virtcontainers

What is it?

virtcontainers is a Go library that can be used to build hardware-virtualized container runtimes.

Background

The few existing VM-based container runtimes (Clear Containers, runV, rkt's KVM stage 1) all share the same hardware virtualization semantics but use different code bases to implement them. The goal of virtcontainers is to factor this code into a common Go library.

Ideally, VM-based container runtime implementations would become translation layers from the runtime specification they implement (e.g. the OCI runtime-spec or the Kubernetes CRI) to the virtcontainers API.

virtcontainers was used as a foundational package for the Kata Containers runtime implementation, formerly the Clear Containers runtime implementation.

Out of scope

Implementing a container runtime is out of scope for this project. Any tools or executables in this repository are only provided for demonstration or testing purposes.

virtcontainers and Kubernetes CRI

virtcontainers's API is loosely inspired by the Kubernetes CRI because we believe it provides the right level of abstractions for containerized sandboxes. However, despite the API similarities between the two projects, the goal of virtcontainers is not to build a CRI implementation, but instead to provide a generic, runtime-specification agnostic, hardware-virtualized containers library that other projects could leverage to implement CRI themselves.

Design

Sandboxes

The virtcontainers execution unit is a sandbox, i.e. virtcontainers users start sandboxes where containers will be running.

virtcontainers creates a sandbox by starting a virtual machine and setting the sandbox up within that environment. Starting a sandbox means launching all containers within the VM sandbox runtime environment.

Hypervisors

The virtcontainers package relies on hypervisors to start and stop the virtual machines where sandboxes will be running. A hypervisor is defined by a Hypervisor interface implementation, and the default implementation is the QEMU one.
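
To illustrate the idea of hiding a VMM behind an interface, here is a minimal sketch. It is not the actual virtcontainers Hypervisor interface, which has many more methods; the names `Hypervisor`, `fakeHypervisor` and `runSandbox` below are assumptions made only for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypervisor is a hypothetical, heavily reduced interface for this sketch.
// The real virtcontainers interface is larger and has different signatures.
type Hypervisor interface {
	StartVM() error
	StopVM() error
}

// fakeHypervisor is an in-memory stand-in that only tracks a running flag.
type fakeHypervisor struct {
	running bool
}

func (f *fakeHypervisor) StartVM() error {
	if f.running {
		return errors.New("VM already running")
	}
	f.running = true
	return nil
}

func (f *fakeHypervisor) StopVM() error {
	if !f.running {
		return errors.New("VM not running")
	}
	f.running = false
	return nil
}

// runSandbox starts the VM, runs the workload, and stops the VM afterwards,
// whatever the workload returned.
func runSandbox(h Hypervisor, workload func() error) error {
	if err := h.StartVM(); err != nil {
		return fmt.Errorf("start VM: %w", err)
	}
	defer h.StopVM()
	return workload()
}

func main() {
	h := &fakeHypervisor{}
	err := runSandbox(h, func() error {
		fmt.Println("sandbox running:", h.running)
		return nil
	})
	fmt.Println("error:", err, "running after:", h.running)
}
```

Because `runSandbox` only depends on the interface, a different VMM backend could be swapped in without touching the caller, which is the property the Hypervisor abstraction is meant to provide.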

Update cloud-hypervisor client code

See docs

Agents

During the lifecycle of a container, the runtime running on the host needs to interact with the virtual machine guest OS in order to start new commands to be executed as part of a given container workload, set new networking routes or interfaces, fetch a container's standard output or standard error, and so on. There are many existing and potential solutions to that problem, and virtcontainers abstracts them through the Agent interface.

API

The high-level virtcontainers API consists of the Sandbox API and the Container API. For further details, see the API documentation.

Networking

virtcontainers supports the 2 major container networking models: the Container Network Model (CNM) and the Container Network Interface (CNI).

Typically the former is the Docker default networking model while the latter is used on Kubernetes deployments.

CNM

High-level CNM Diagram

CNM lifecycle

  1. RequestPool

  2. CreateNetwork

  3. RequestAddress

  4. CreateEndPoint

  5. CreateContainer

  6. Create config.json

  7. Create PID and network namespace

  8. ProcessExternalKey

  9. JoinEndPoint

  10. LaunchContainer

  11. Launch

  12. Run container

Detailed CNM Diagram

Runtime network setup with CNM

  1. Read config.json

  2. Create the network namespace

  3. Call the prestart hook (from inside the netns)

  4. Scan the network interfaces inside the netns and get the name of the interface created by the prestart hook

  5. Create a bridge and a TAP device, and link them together with the network interface created previously

  6. Start the VM inside the netns and start the container

Drawbacks of CNM

There are three drawbacks to using CNM instead of CNI:

  • The way we call into it is not very explicit: we have to re-exec the dockerd binary so that it can accept parameters and execute the prestart hook related to network setup.
  • The network namespace is designated implicitly: instead of explicitly giving the netns to dockerd, we give it the PID of our runtime so that it can find the netns from this PID. This means we have to make sure we are in the right netns while calling the hook, otherwise the VETH pair will be created in the wrong netns.
  • The hook returns no results: we have to scan the network interfaces to discover which one has been created inside the netns. This adds latency because it forces us to scan the network in the CreateSandbox path, which is critical for starting the VM as quickly as possible.

Storage

See Kata Containers Architecture.

Devices

Support has been added to pass VFIO assigned devices on the docker command line with --device. Support for passing other devices including block devices with --device has not been added yet. PCI and AP (IBM Z Crypto Express cards) devices can be passed.

How to pass a device using VFIO-PCI passthrough

  1. Requirements

An IOMMU group represents the smallest set of devices for which the IOMMU has visibility and which is isolated from other groups. VFIO uses this information to enforce safe ownership of devices for userspace.

You will need Intel VT-d capable hardware. Check that the IOMMU is enabled in your host kernel by verifying that CONFIG_VFIO_NOIOMMU is not set in the kernel configuration. If it is set, you will need to rebuild your kernel.

The following kernel configuration options need to be enabled:

CONFIG_VFIO_IOMMU_TYPE1=m 
CONFIG_VFIO=m
CONFIG_VFIO_PCI=m

In addition, you need to pass intel_iommu=on on the kernel command line.

  2. Identify the BDF (Bus-Device-Function) of the PCI device to be assigned.
$ lspci -D | grep -e Ethernet -e Network
0000:01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)

$ BDF=0000:01:00.0
  3. Find vendor and device id.
$ lspci -n -s $BDF
01:00.0 0200: 8086:1528 (rev 01)
  4. Find IOMMU group.
$ readlink /sys/bus/pci/devices/$BDF/iommu_group
../../../../kernel/iommu_groups/16
  5. Unbind the device from host driver.
$ echo $BDF | sudo tee /sys/bus/pci/devices/$BDF/driver/unbind
  6. Bind the device to vfio-pci.
$ sudo modprobe vfio-pci
$ echo 8086 1528 | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
$ echo $BDF | sudo tee --append /sys/bus/pci/drivers/vfio-pci/bind
  7. Check /dev/vfio
$ ls /dev/vfio
16 vfio
  8. Start a Clear Containers container passing the VFIO group on the docker command line.
docker run -it --device=/dev/vfio/16 centos/tools bash
  9. Running lspci within the container should show the device among the PCI devices. The driver for the device needs to be present within the Clear Containers kernel. If the driver is missing, you can add it to your custom container kernel using the osbuilder tooling.

How to pass a device using VFIO-AP passthrough

IBM Z mainframes (s390x) use the AP (Adjunct Processor) bus for their Crypto Express hardware security modules. Such devices can be passed over VFIO, which is also supported in Kata. Passthrough is configured per adapter and domain, i.e. a passable VFIO device has one or multiple adapter-domain combinations.

  1. You must follow the kernel documentation for preparing VFIO-AP passthrough. In short, your host kernel should have the following options enabled or available as modules (if they are modules, load them accordingly, e.g. with modprobe). If one is missing, you will have to update your kernel, e.g. by recompiling it.
CONFIG_VFIO_AP
CONFIG_VFIO_IOMMU_TYPE1
CONFIG_VFIO
CONFIG_VFIO_MDEV
CONFIG_VFIO_MDEV_DEVICE
CONFIG_S390_AP_IOMMU
  2. Set the AP adapter(s) and domain(s) you want to pass in /sys/bus/ap/apmask and /sys/bus/ap/aqmask by writing their negative numbers. Assuming you want to pass 06.0032, you would run
$ echo -0x6 | sudo tee /sys/bus/ap/apmask > /dev/null
$ echo -0x32 | sudo tee /sys/bus/ap/aqmask > /dev/null
  3. Create one or multiple mediated devices, one per container you want to pass to. You must write a UUID for the device to /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/create. You can use uuidgen for generating the UUID, e.g.
$ uuidgen | sudo tee /sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/create
a297db4a-f4c2-11e6-90f6-d3b88d6c9525
  4. Set the AP adapter(s) and domain(s) you want to pass per device by writing their numbers to /sys/devices/vfio_ap/matrix/${UUID}/assign_adapter and assign_domain in the same directory. For the UUID from step 3, that would be
$ echo 0x6 | sudo tee /sys/devices/vfio_ap/matrix/a297db4a-f4c2-11e6-90f6-d3b88d6c9525/assign_adapter > /dev/null
$ echo 0x32 | sudo tee /sys/devices/vfio_ap/matrix/a297db4a-f4c2-11e6-90f6-d3b88d6c9525/assign_domain > /dev/null
  5. Find the IOMMU group of the mediated device by following the link from /sys/devices/vfio_ap/matrix/${UUID}/iommu_group. There should be a corresponding VFIO device in /dev/vfio.
$ readlink /sys/devices/vfio_ap/matrix/a297db4a-f4c2-11e6-90f6-d3b88d6c9525/iommu_group
../../../../kernel/iommu_groups/0
$ ls /dev/vfio
0 vfio
  6. This device can now be passed. To verify the cards are there, you can use lszcrypt from s390-tools (s390-tools in Alpine, Debian, and Ubuntu, s390utils in Fedora). With lszcrypt, you can see the cards after the configuration time has passed.
$ sudo docker run -it --device /dev/vfio/0 ubuntu
$ lszcrypt
CARD.DOMAIN TYPE  MODE        STATUS  REQUESTS
----------------------------------------------
06          CEX7C CCA-Coproc  online         1
06.0032     CEX7C CCA-Coproc  online         1

Developers

For information on how to build, develop and test virtcontainers, see the developer documentation.

Persistent storage plugin support

See the persistent storage plugin documentation.

Experimental features

See the experimental features documentation.