diff --git a/design/arch-images/api-to-construct.png b/design/arch-images/api-to-construct.png new file mode 100644 index 0000000000..d9d4daea82 Binary files /dev/null and b/design/arch-images/api-to-construct.png differ diff --git a/design/arch-images/construct-to-vm-concept.png b/design/arch-images/construct-to-vm-concept.png new file mode 100644 index 0000000000..c1f0d585ce Binary files /dev/null and b/design/arch-images/construct-to-vm-concept.png differ diff --git a/design/arch-images/vm-concept-to-tech.png b/design/arch-images/vm-concept-to-tech.png new file mode 100644 index 0000000000..b9ca1c2076 Binary files /dev/null and b/design/arch-images/vm-concept-to-tech.png differ diff --git a/design/architecture.md b/design/architecture.md index 0c5fcdee35..46e1441a67 100644 --- a/design/architecture.md +++ b/design/architecture.md @@ -1,11 +1,13 @@ # Kata Containers Architecture + * [Overview](#overview) -* [Hypervisor](#hypervisor) - * [Assets](#assets) +* [Virtualization](#virtualization) +* [Guest assets](#guest-assets) * [Guest kernel](#guest-kernel) - * [Root filesystem image](#root-filesystem-image) - * [Initrd image](#initrd-image) + * [Guest Image](#guest-image) + * [Root filesystem image](#root-filesystem-image) + * [Initrd image](#initrd-image) * [Agent](#agent) * [Runtime](#runtime) * [Configuration](#configuration) @@ -101,59 +103,17 @@ configured, `virtio-scsi` will be used. In all other cases a 9pfs VIRTIO mount p will be used. `kata-agent` uses this mount point as the root filesystem for the container processes. -## Hypervisor +## Virtualization -Kata Containers is designed to support multiple virtual machine monitors (VMMs) and hypervisors. 
+How Kata Containers maps container concepts to virtual machine technologies, and how this is realized in the multiple +hypervisors and VMMs that Kata supports, is described in the [virtualization documentation](./virtualization.md). -As of the 1.9 release, Kata Containers supports [QEMU](http://www.qemu-project.org/)/[KVM](http://www.linux-kvm.org/page/Main_Page), -[Firecracker](https://github.com/firecracker-microvm/firecracker)/KVM, as well as the [ACRN hypervisor](https://projectacrn.org/). - -### QEMU/KVM - -Depending on the host architecture, Kata Containers supports various machine types, -for example `pc` and `q35` on x86 systems, `virt` on ARM systems and `pseries` on IBM Power systems. The default Kata Containers -machine type is `pc`. The machine type and its [`Machine accelerators`](#machine-accelerators) can -be changed by editing the runtime [`configuration`](#configuration) file. - -The following QEMU features are used in Kata Containers to manage resource constraints, improve -boot time and reduce memory footprint: - -- Machine accelerators. -- Hot plug devices. - -Each feature is documented below. - -#### Machine accelerators - -Machine accelerators are architecture specific and can be used to improve the performance -and enable specific features of the machine types. The following machine accelerators -are used in Kata Containers: - -- NVDIMM: This machine accelerator is x86 specific and only supported by `pc` and -`q35` machine types. `nvdimm` is used to provide the root filesystem as a persistent -memory device to the Virtual Machine. - -#### Hot plug devices - -The Kata Containers VM starts with a minimum amount of resources, allowing for faster boot time and a reduction in memory footprint. As the container launch progresses, devices are hotplugged to the VM. For example, when a CPU constraint is specified which includes additional CPUs, they can be hot added. 
Kata Containers has support for hot-adding the following devices: -- Virtio block -- Virtio SCSI -- VFIO -- CPU - -### Firecracker/KVM - -As of the 1.5 release of Kata Containers, Firecracker VMM is supported. Because of its limited -device support, Firecracker does not support filesystem sharing (good for security and footprint!) As a result, -only block-based storage drivers are supported. Similarly, Firecracker does not support updating -container resources after boot (there is not any device hotplug support), nor does it support VFIO. - -### Assets +## Guest assets The hypervisor will launch a virtual machine which includes a minimal guest kernel and a guest image. -#### Guest kernel +### Guest kernel The guest kernel is passed to the hypervisor and used to boot the virtual machine. The default kernel provided in Kata Containers is highly optimized for @@ -161,11 +121,11 @@ kernel boot time and minimal memory footprint, providing only those services required by a container workload. This is based on a very current upstream Linux kernel. -#### Guest image +### Guest image Kata Containers supports both an `initrd` and `rootfs` based minimal guest image. -##### Root filesystem image +#### Root filesystem image The default packaged root filesystem image, sometimes referred to as the "mini O/S", is a highly optimized container bootstrap system based on [Clear Linux](https://clearlinux.org/). It provides an extremely minimal environment and @@ -187,7 +147,7 @@ For example, when `docker run -ti ubuntu date` is run: new context, first setting the root filesystem to the expected Ubuntu\* root filesystem. -##### Initrd image +#### Initrd image A compressed `cpio(1)` archive, created from a rootfs which is loaded into memory and used as part of the Linux startup process. During startup, the kernel unpacks it into a special instance of a `tmpfs` that becomes the initial root filesystem. 
diff --git a/design/virtualization.md b/design/virtualization.md new file mode 100644 index 0000000000..b019993bee --- /dev/null +++ b/design/virtualization.md @@ -0,0 +1,129 @@ +# Virtualization in Kata Containers + +- [Virtualization in Kata Containers](#virtualization-in-kata-containers) + - [Mapping container concepts to virtual machine technologies](#mapping-container-concepts-to-virtual-machine-technologies) + - [Kata Containers Hypervisor and VMM support](#kata-containers-hypervisor-and-vmm-support) + - [QEMU/KVM](#qemukvm) + - [Machine accelerators](#machine-accelerators) + - [Hotplug devices](#hotplug-devices) + - [Firecracker/KVM](#firecrackerkvm) + - [Cloud Hypervisor/KVM](#cloud-hypervisorkvm) + - [Summary](#summary) + + +In Kata Containers, a second layer of isolation is created on top of those provided by traditional namespace-containers. The +hardware virtualization interface is the basis of this additional layer. Kata will launch a lightweight virtual machine, +and use the guest’s Linux kernel to create a container workload, or workloads in the case of multi-container pods. In Kubernetes +and in the Kata implementation, the sandbox is implemented at the pod level. In Kata, this sandbox is created using a virtual machine. + +This document describes how Kata Containers maps container technologies to virtual machine technologies, and how this is realized in +the multiple hypervisors and virtual machine monitors that Kata supports. + +## Mapping container concepts to virtual machine technologies + +A typical deployment of Kata Containers will be in Kubernetes by way of a Container Runtime Interface (CRI) implementation. On every node, +Kubelet will interact with a CRI implementor (such as containerd or CRI-O), which will in turn interface with Kata Containers (an OCI based runtime). 
+ +The CRI API, as defined at the [Kubernetes CRI-API repo](https://github.com/kubernetes/cri-api/), implies a few constructs that must be supported by the +CRI implementation, and ultimately by Kata Containers. In order to support the full [API](https://github.com/kubernetes/cri-api/blob/a6f63f369f6d50e9d0886f2eda63d585fbd1ab6a/pkg/apis/runtime/v1alpha2/api.proto#L34-L110) with the CRI implementor, Kata must provide the following constructs: + +![API to construct](./arch-images/api-to-construct.png) + +These constructs can then be further mapped to the devices necessary for interfacing with the virtual machine: + +![construct to VM concept](./arch-images/construct-to-vm-concept.png) + +Ultimately, these concepts map to specific para-virtualized devices or virtualization technologies. + +![VM concept to underlying technology](./arch-images/vm-concept-to-tech.png) + +Each hypervisor or VMM differs in whether and how it handles each of these. + +## Kata Containers Hypervisor and VMM support + +Kata Containers is designed to support multiple virtual machine monitors (VMMs) and hypervisors. +Kata Containers supports: + - [ACRN hypervisor](https://projectacrn.org/) + - [Cloud Hypervisor](https://github.com/cloud-hypervisor/cloud-hypervisor)/[KVM](https://www.linux-kvm.org/page/Main_Page) + - [Firecracker](https://github.com/firecracker-microvm/firecracker)/KVM + - [QEMU](http://www.qemu-project.org/)/KVM + +Which configuration to use will depend on the end user's requirements. Details of each solution and a summary are provided below. + +### QEMU/KVM + +Kata Containers with QEMU has complete compatibility with Kubernetes. + +Depending on the host architecture, Kata Containers supports various machine types, +for example `pc` and `q35` on x86 systems, `virt` on ARM systems and `pseries` on IBM Power systems. The default Kata Containers +machine type is `pc`. 
The machine type and its [`Machine accelerators`](#machine-accelerators) can +be changed by editing the runtime [`configuration`](./architecture.md#configuration) file. + +Devices and features used: +- virtio VSOCK or virtio serial +- virtio block or virtio SCSI +- virtio net +- virtio fs or virtio 9p (recommended: virtio fs) +- VFIO +- hotplug +- machine accelerators + +Machine accelerators and hotplug are used in Kata Containers to manage resource constraints, improve boot time and reduce memory footprint. These are documented below. + +#### Machine accelerators + +Machine accelerators are architecture specific and can be used to improve the performance +and enable specific features of the machine types. The following machine accelerators +are used in Kata Containers: + +- NVDIMM: This machine accelerator is x86 specific and only supported by `pc` and +`q35` machine types. `nvdimm` is used to provide the root filesystem as a persistent +memory device to the Virtual Machine. + +#### Hotplug devices + +The Kata Containers VM starts with a minimum amount of resources, allowing for faster boot time and a reduction in memory footprint. As the container launch progresses, +devices are hotplugged to the VM. For example, when a CPU constraint is specified which includes additional CPUs, they can be hot added. Kata Containers has support +for hot-adding the following devices: +- Virtio block +- Virtio SCSI +- VFIO +- CPU + +### Firecracker/KVM + +Firecracker, built on many Rust crates that are within [rust-VMM](https://github.com/rust-vmm), has a very limited device model, providing a lighter +footprint and attack surface, focusing on function-as-a-service like use cases. As a result, Kata Containers with Firecracker VMM supports a subset of the CRI API. +Firecracker does not support file-system sharing, and as a result only block-based storage drivers are supported. Firecracker does not support device +hotplug nor does it support VFIO. 
As a result, Kata Containers with Firecracker VMM does not support updating container resources after boot, nor +does it support device passthrough. + +Devices used: +- virtio VSOCK +- virtio block +- virtio net + +### Cloud Hypervisor/KVM + +Cloud Hypervisor, based on [rust-VMM](https://github.com/rust-vmm), is designed to have a lighter footprint and attack surface. For Kata Containers, +relative to Firecracker, the Cloud Hypervisor configuration provides better compatibility at the expense of exposing additional devices: file system +sharing and direct device assignment. As of the 1.10 release of Kata Containers, Cloud Hypervisor does not support device hotplug, and as a result +does not support updating container resources after boot, or utilizing block based volumes. While Cloud Hypervisor does support VFIO, Kata is still adding +this support. As of 1.10, Kata does not support block based volumes or direct device assignment. See [Cloud Hypervisor device support documentation](https://github.com/cloud-hypervisor/cloud-hypervisor/blob/master/docs/device_model.md) +for more details on Cloud Hypervisor. + +Devices used: +- virtio VSOCK +- virtio block +- virtio net +- virtio fs + +### Summary + +| Solution | release introduced | brief summary | +|-|-|-| +| QEMU | 1.0 | upstream QEMU, with support for hotplug and filesystem sharing | +| NEMU | 1.4 | Deprecated, removed as of 1.10 release. Slimmed down fork of QEMU, with experimental support of virtio-fs | +| Firecracker | 1.5 | upstream Firecracker, rust-VMM based, no VFIO, no FS sharing, no memory/CPU hotplug | +| QEMU-virtio-fs | 1.7 | upstream QEMU with support for virtio-fs. Will be removed once virtio-fs lands in upstream QEMU | +| Cloud Hypervisor | 1.10 | rust-VMM based, includes VFIO and FS sharing through virtio-fs, no hotplug |
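
Reviewer note: the trade-offs described in `virtualization.md` can be read as a small capability matrix. The following is a minimal sketch, not part of Kata Containers; the names `VmmSupport`, `SUPPORT`, and `candidates` are hypothetical, and the values are taken only from the prose in this change (for Cloud Hypervisor, the Kata-level state as of 1.10, where VFIO is not yet wired up). ACRN is omitted because this change does not describe its device support.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmmSupport:
    """Kata-level feature support for one VMM (illustrative only)."""
    fs_sharing: bool  # virtio fs or virtio 9p
    hotplug: bool     # device/CPU hotplug after boot
    vfio: bool        # direct device assignment


# Values transcribed from the prose in virtualization.md; treat as assumptions.
SUPPORT = {
    "qemu": VmmSupport(fs_sharing=True, hotplug=True, vfio=True),
    "firecracker": VmmSupport(fs_sharing=False, hotplug=False, vfio=False),
    # Cloud Hypervisor itself supports VFIO, but Kata does not as of 1.10.
    "cloud-hypervisor": VmmSupport(fs_sharing=True, hotplug=False, vfio=False),
}


def candidates(need_fs_sharing=False, need_hotplug=False, need_vfio=False):
    """Return the VMM names (sorted) that satisfy every requested feature."""
    return sorted(
        name
        for name, s in SUPPORT.items()
        if (s.fs_sharing or not need_fs_sharing)
        and (s.hotplug or not need_hotplug)
        and (s.vfio or not need_vfio)
    )
```

For example, `candidates(need_hotplug=True)` leaves only QEMU, which matches the statement that QEMU is the configuration with complete Kubernetes compatibility.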