Merge pull request #65 from egernst/architecture-docs
add initial kata architecture docs
BIN
arch-images/CNI_diagram.png
Normal file
After Width: | Height: | Size: 23 KiB |
BIN
arch-images/CNM_detailed_diagram.png
Normal file
After Width: | Height: | Size: 23 KiB |
BIN
arch-images/CNM_overall_diagram.png
Normal file
After Width: | Height: | Size: 25 KiB |
BIN
arch-images/DAX.png
Normal file
After Width: | Height: | Size: 21 KiB |
BIN
arch-images/docker-kata.png
Normal file
After Width: | Height: | Size: 46 KiB |
BIN
arch-images/kata-crio-uml.png
Normal file
After Width: | Height: | Size: 509 KiB |
174
arch-images/kata-crio-uml.txt
Normal file
@@ -0,0 +1,174 @@
|
||||
Title: Kata Flow
|
||||
participant CRI
|
||||
participant CRIO
|
||||
participant Kata Runtime
|
||||
participant virtcontainers
|
||||
participant hypervisor
|
||||
participant agent
|
||||
participant shim-pod
|
||||
participant shim-ctr
|
||||
participant proxy
|
||||
|
||||
# Run the sandbox
|
||||
CRI->CRIO: RunPodSandbox()
|
||||
CRIO->Kata Runtime: create
|
||||
Kata Runtime->virtcontainers: CreateSandbox()
|
||||
Note left of virtcontainers: Sandbox\nReady
|
||||
virtcontainers->virtcontainers: createNetwork()
|
||||
virtcontainers->virtcontainers: Execute PreStart Hooks
|
||||
virtcontainers->+hypervisor: Start VM (inside the netns)
|
||||
hypervisor-->-virtcontainers: VM started
|
||||
virtcontainers->proxy: Start Proxy
|
||||
proxy->hypervisor: Connect the VM
|
||||
virtcontainers->+agent: CreateSandbox()
|
||||
agent-->-virtcontainers: Sandbox Created
|
||||
virtcontainers->+agent: CreateContainer()
|
||||
agent-->-virtcontainers: Container Created
|
||||
virtcontainers->shim-pod: Start Shim
|
||||
shim-pod->agent: ReadStdout() (blocking call)
|
||||
shim-pod->agent: ReadStderr() (blocking call)
|
||||
shim-pod->agent: WaitProcess() (blocking call)
|
||||
Note left of virtcontainers: Container-pod\nReady
|
||||
virtcontainers-->Kata Runtime: End of CreateSandbox()
|
||||
Kata Runtime-->CRIO: End of create
|
||||
CRIO->Kata Runtime: start
|
||||
Kata Runtime->virtcontainers: StartSandbox()
|
||||
Note left of virtcontainers: Sandbox\nRunning
|
||||
virtcontainers->+agent: StartContainer()
|
||||
agent-->-virtcontainers: Container Started
|
||||
Note left of virtcontainers: Container-pod\nRunning
|
||||
virtcontainers->virtcontainers: Execute PostStart Hooks
|
||||
virtcontainers-->Kata Runtime: End of StartSandbox()
|
||||
Kata Runtime-->CRIO: End of start
|
||||
CRIO-->CRI: End of RunPodSandbox()
|
||||
|
||||
# Create the container
|
||||
CRI->CRIO: CreateContainer()
|
||||
CRIO->Kata Runtime: create
|
||||
Kata Runtime->virtcontainers: CreateContainer()
|
||||
virtcontainers->+agent: CreateContainer()
|
||||
agent-->-virtcontainers: Container Created
|
||||
virtcontainers->shim-ctr: Start Shim
|
||||
shim-ctr->agent: ReadStdout() (blocking call)
|
||||
shim-ctr->agent: ReadStderr() (blocking call)
|
||||
shim-ctr->agent: WaitProcess() (blocking call)
|
||||
Note left of virtcontainers: Container-ctr\nReady
|
||||
virtcontainers-->Kata Runtime: End of CreateContainer()
|
||||
Kata Runtime-->CRIO: End of create
|
||||
CRIO-->CRI: End of CreateContainer()
|
||||
|
||||
# Start the container
|
||||
CRI->CRIO: StartContainer()
|
||||
CRIO->Kata Runtime: start
|
||||
Kata Runtime->virtcontainers: StartContainer()
|
||||
virtcontainers->+agent: StartContainer()
|
||||
agent-->-virtcontainers: Container Started
|
||||
Note left of virtcontainers: Container-ctr\nRunning
|
||||
virtcontainers-->Kata Runtime: End of StartContainer()
|
||||
Kata Runtime-->CRIO: End of start
|
||||
CRIO-->CRI: End of StartContainer()
|
||||
|
||||
# Stop the container
|
||||
CRI->CRIO: StopContainer()
|
||||
CRIO->Kata Runtime: kill
|
||||
Kata Runtime->virtcontainers: KillContainer()
|
||||
virtcontainers->+agent: SignalProcess()
|
||||
alt SIGTERM OR SIGKILL
|
||||
agent-->shim-ctr: WaitProcess() returns
|
||||
end
|
||||
agent-->-virtcontainers: Process Signalled
|
||||
virtcontainers-->Kata Runtime: End of KillContainer()
|
||||
alt SIGTERM OR SIGKILL
|
||||
Kata Runtime->virtcontainers: StopContainer()
|
||||
virtcontainers->+shim-ctr: waitForShim()
|
||||
alt Timeout exceeded
|
||||
virtcontainers->+agent: SignalProcess(SIGKILL)
|
||||
agent-->shim-ctr: WaitProcess() returns
|
||||
agent-->-virtcontainers: Process Signalled by SIGKILL
|
||||
virtcontainers->shim-ctr: waitForShim()
|
||||
end
|
||||
shim-ctr-->-virtcontainers: Shim terminated
|
||||
virtcontainers->+agent: SignalProcess(SIGKILL)
|
||||
agent-->-virtcontainers: Process Signalled by SIGKILL
|
||||
virtcontainers->+agent: RemoveContainer()
|
||||
agent-->-virtcontainers: Container Removed
|
||||
Note left of virtcontainers: Container-ctr\nStopped
|
||||
virtcontainers-->Kata Runtime: End of StopContainer()
|
||||
end
|
||||
Kata Runtime-->CRIO: End of kill
|
||||
CRIO-->CRI: End of StopContainer()
|
||||
|
||||
# Remove the container
|
||||
CRI->CRIO: RemoveContainer()
|
||||
CRIO->Kata Runtime: delete
|
||||
Kata Runtime->virtcontainers: DeleteContainer()
|
||||
virtcontainers->virtcontainers: Delete container resources
|
||||
virtcontainers-->Kata Runtime: End of DeleteContainer()
|
||||
Kata Runtime-->CRIO: End of delete
|
||||
CRIO-->CRI: End of RemoveContainer()
|
||||
|
||||
# Stop the sandbox
|
||||
CRI->CRIO: StopPodSandbox()
|
||||
CRIO->Kata Runtime: kill
|
||||
Kata Runtime->virtcontainers: KillContainer()
|
||||
virtcontainers->+agent: SignalProcess()
|
||||
alt SIGTERM OR SIGKILL
|
||||
agent-->shim-pod: WaitProcess() returns
|
||||
end
|
||||
agent-->-virtcontainers: Process Signalled
|
||||
virtcontainers-->Kata Runtime: End of KillContainer()
|
||||
alt SIGTERM OR SIGKILL
|
||||
Kata Runtime->virtcontainers: StopSandbox()
|
||||
loop for each container
|
||||
alt Container-ctr
|
||||
virtcontainers->+shim-ctr: waitForShim()
|
||||
alt Timeout exceeded
|
||||
virtcontainers->+agent: SignalProcess(SIGKILL)
|
||||
agent-->shim-ctr: WaitProcess() returns
|
||||
agent-->-virtcontainers: Process Signalled by SIGKILL
|
||||
virtcontainers->shim-ctr: waitForShim()
|
||||
end
|
||||
shim-ctr-->-virtcontainers: Shim terminated
|
||||
virtcontainers->+agent: SignalProcess(SIGKILL)
|
||||
agent-->-virtcontainers: Process Signalled by SIGKILL
|
||||
virtcontainers->+agent: RemoveContainer()
|
||||
agent-->-virtcontainers: Container Removed
|
||||
Note left of virtcontainers: Container-ctr\nStopped
|
||||
else Container-pod
|
||||
virtcontainers->+shim-pod: waitForShim()
|
||||
alt Timeout exceeded
|
||||
virtcontainers->+agent: SignalProcess(SIGKILL)
|
||||
agent-->shim-pod: WaitProcess() returns
|
||||
agent-->-virtcontainers: Process Signalled by SIGKILL
|
||||
virtcontainers->shim-pod: waitForShim()
|
||||
end
|
||||
shim-pod-->-virtcontainers: Shim terminated
|
||||
virtcontainers->+agent: SignalProcess(SIGKILL)
|
||||
agent-->-virtcontainers: Process Signalled by SIGKILL
|
||||
virtcontainers->+agent: RemoveContainer()
|
||||
agent-->-virtcontainers: Container Removed
|
||||
Note left of virtcontainers: Container-pod\nStopped
|
||||
end
|
||||
end
|
||||
virtcontainers->+agent: DestroySandbox()
|
||||
agent-->-virtcontainers: Sandbox Destroyed
|
||||
virtcontainers->hypervisor: Stop VM
|
||||
Note left of virtcontainers: Sandbox\nStopped
|
||||
virtcontainers->virtcontainers: removeNetwork()
|
||||
virtcontainers->virtcontainers: Execute PostStop Hooks
|
||||
virtcontainers-->Kata Runtime: End of StopSandbox()
|
||||
end
|
||||
Kata Runtime-->CRIO: End of kill
|
||||
CRIO-->CRI: End of StopPodSandbox()
|
||||
|
||||
# Remove the sandbox
|
||||
CRI->CRIO: RemovePodSandbox()
|
||||
CRIO->Kata Runtime: delete
|
||||
Kata Runtime->virtcontainers: DeleteSandbox()
|
||||
loop for each container
|
||||
virtcontainers->virtcontainers: Delete container resources
|
||||
end
|
||||
virtcontainers->virtcontainers: Delete sandbox resources
|
||||
virtcontainers-->Kata Runtime: End of DeleteSandbox()
|
||||
Kata Runtime-->CRIO: End of delete
|
||||
CRIO-->CRI: End of RemovePodSandbox()
|
27
arch-images/kata-oci-create.svg
Normal file
After Width: | Height: | Size: 14 KiB |
31
arch-images/kata-oci-create.txt
Normal file
@@ -0,0 +1,31 @@
|
||||
Title: Kata Flow
|
||||
participant Docker
|
||||
participant Kata Runtime
|
||||
participant virtcontainers
|
||||
participant hypervisor
|
||||
participant agent
|
||||
participant shim-pod
|
||||
participant shim-ctr
|
||||
participant proxy
|
||||
|
||||
#Docker Create!
|
||||
Docker->Kata Runtime: create
|
||||
Kata Runtime->virtcontainers: CreateSandbox()
|
||||
Note left of virtcontainers: Sandbox\nReady
|
||||
virtcontainers->virtcontainers: createNetwork()
|
||||
virtcontainers->virtcontainers: Execute PreStart Hooks
|
||||
virtcontainers->+hypervisor: Start VM (inside the netns)
|
||||
hypervisor-->-virtcontainers: VM started
|
||||
virtcontainers->proxy: Start Proxy
|
||||
proxy->hypervisor: Connect the VM
|
||||
virtcontainers->+agent: CreateSandbox()
|
||||
agent-->-virtcontainers: Sandbox Created
|
||||
virtcontainers->+agent: CreateContainer()
|
||||
agent-->-virtcontainers: Container Created
|
||||
virtcontainers->shim-pod: Start Shim
|
||||
shim->agent: ReadStdout() (blocking call)
|
||||
shim->agent: ReadStderr() (blocking call)
|
||||
shim->agent: WaitProcess() (blocking call)
|
||||
Note left of virtcontainers: Container\nReady
|
||||
virtcontainers-->Kata Runtime: End of CreateSandbox()
|
||||
Kata Runtime-->Docker: End of create
|
11
arch-images/kata-oci-exec.svg
Normal file
After Width: | Height: | Size: 7.8 KiB |
20
arch-images/kata-oci-exec.txt
Normal file
@@ -0,0 +1,20 @@
|
||||
Title: Docker Exec
|
||||
participant Docker
|
||||
participant kata-runtime
|
||||
participant virtcontainers
|
||||
participant shim
|
||||
participant hypervisor
|
||||
participant agent
|
||||
participant proxy
|
||||
|
||||
#Docker Exec
|
||||
Docker->kata-runtime: exec
|
||||
kata-runtime->virtcontainers: EnterContainer()
|
||||
virtcontainers->agent: exec
|
||||
agent->virtcontainers: Process started in the container
|
||||
virtcontainers->shim: start shim
|
||||
shim->agent: ReadStdout()
|
||||
shim->agent: ReadStderr()
|
||||
shim->agent: WaitProcess()
|
||||
virtcontainers->kata-runtime: End of EnterContainer()
|
||||
kata-runtime-->Docker: End of exec
|
9
arch-images/kata-oci-start.svg
Normal file
After Width: | Height: | Size: 7.3 KiB |
20
arch-images/kata-oci-start.txt
Normal file
@@ -0,0 +1,20 @@
|
||||
Title: Docker Start
|
||||
participant Docker
|
||||
participant Kata Runtime
|
||||
participant virtcontainers
|
||||
participant hypervisor
|
||||
participant agent
|
||||
participant shim-pod
|
||||
participant shim-ctr
|
||||
participant proxy
|
||||
|
||||
#Docker Start
|
||||
Docker->Kata Runtime: start
|
||||
Kata Runtime->virtcontainers: StartSandbox()
|
||||
Note left of virtcontainers: Sandbox\nRunning
|
||||
virtcontainers->+agent: StartContainer()
|
||||
agent-->-virtcontainers: Container Started
|
||||
Note left of virtcontainers: Container-pod\nRunning
|
||||
virtcontainers->virtcontainers: Execute PostStart Hooks
|
||||
virtcontainers-->Kata Runtime: End of StartSandbox()
|
||||
Kata Runtime-->Docker: End of start
|
BIN
arch-images/network.png
Normal file
After Width: | Height: | Size: 24 KiB |
BIN
arch-images/qemu.png
Normal file
After Width: | Height: | Size: 24 KiB |
687
architecture.md
Normal file
@@ -0,0 +1,687 @@
|
||||
# Kata Containers Architecture
|
||||
|
||||
* [Overview](#overview)
|
||||
* [Hypervisor](#hypervisor)
|
||||
* [Assets](#assets)
|
||||
* [Guest kernel](#guest-kernel)
|
||||
* [Root filesystem image](#root-filesystem-image)
|
||||
* [Agent](#agent)
|
||||
* [Runtime](#runtime)
|
||||
* [Configuration](#configuration)
|
||||
* [Significant commands](#significant-commands)
|
||||
* [create](#create)
|
||||
* [start](#start)
|
||||
* [exec](#exec)
|
||||
* [kill](#kill)
|
||||
* [delete](#delete)
|
||||
* [Proxy](#proxy)
|
||||
* [Shim](#shim)
|
||||
* [Networking](#networking)
|
||||
* [Storage](#storage)
|
||||
* [Kubernetes Support](#kubernetes-support)
|
||||
* [Problem Statement](#problem-statemem)
|
||||
* [OCI Annotations](#oci-annotations)
|
||||
* [Generalization](#generalization)
|
||||
* [Mixing VM based and namespace based runtimes](#mixing-vm-based-and-namespace-based-runtimes)
|
||||
* [Appendices](#appendices)
|
||||
* [DAX](#dax)
|
||||
* [Previous Releases](#previous-releases)
|
||||
* [Resources](#resources)
|
||||
|
||||
## Overview
|
||||
|
||||
This is an architectural overview of Kata Containers, based on the 1.0.0 release.
|
||||
|
||||
The two primary deliverables of the Kata Containers project are a container runtime
|
||||
and a CRI friendly library API.
|
||||
|
||||
The [Kata Containers runtime (kata-runtime)](https://github.com/kata-containers/runtime)
|
||||
is compatible with the [OCI](https://github.com/opencontainers) [runtime specification](https://github.com/opencontainers/runtime-spec)
|
||||
and therefore works seamlessly with the
|
||||
[Docker\* Engine](https://www.docker.com/products/docker-engine) pluggable runtime
|
||||
architecture. It also supports the [Kubernetes\* Container Runtime Interface (CRI)](https://github.com/kubernetes/kubernetes/tree/master/pkg/kubelet/apis/cri/v1alpha1/runtime)
|
||||
through the [CRI-O\*](https://github.com/kubernetes-incubator/cri-o) and
|
||||
[CRI-containerd\*](https://github.com/containerd/cri) implementation. In other words, you can transparently
|
||||
select between the [default Docker and CRI shim runtime (runc)](https://github.com/opencontainers/runc)
|
||||
and `kata-runtime`.
|
||||
|
||||

|
||||
|
||||
`kata-runtime` creates a QEMU\*/KVM virtual machine for each container or pod
|
||||
the Docker engine or Kubernetes' `kubelet` creates.
|
||||
|
||||
The container process is then spawned by
|
||||
[agent](https://github.com/kata-containers/agent), an agent process running
|
||||
as a daemon inside the virtual machine. kata-agent runs a gRPC server in
|
||||
the guest using a virtio serial interface which QEMU exposes as a serial
|
||||
device on the host. kata-runtime uses a gRPC protocol to communicate with
|
||||
the agent. This protocol allows the runtime to send container management
|
||||
commands to the agent. The protocol is also used to pass I/O streams (stdout,
|
||||
stderr, stdin) between the guest and the Docker Engine.
|
||||
|
||||
For any given container, both the init process and all potentially executed
|
||||
commands within that container, together with their related I/O streams, need
|
||||
to go through the virtio serial interface exported by QEMU. A [Kata Containers
|
||||
proxy (`kata-proxy`)](https://github.com/kata-containers/proxy) instance is
|
||||
launched for each virtual machine to handle multiplexing and demultiplexing
|
||||
those commands and streams.
|
||||
|
||||
On the host, each container process's removal is handled by a reaper in the higher
|
||||
layers of the container stack. In the case of Docker it is handled by `containerd-shim`.
|
||||
In the case of CRI-O it is handled by `conmon`. For clarity, for the remainder
|
||||
of this document the term "container process reaper" will be used to refer to
|
||||
either reaper. As Kata Containers processes run inside their own virtual machines,
|
||||
the container process reaper cannot monitor, control
|
||||
or reap them. `kata-runtime` fixes that issue by creating an [additional shim process
|
||||
(`kata-shim`)](https://github.com/kata-containers/shim) between the container process
|
||||
reaper and `kata-proxy`. A `kata-shim` instance will both forward signals and `stdin`
|
||||
streams to the container process on the guest and pass the container `stdout`
|
||||
and `stderr` streams back up the stack to the CRI shim or Docker via the container process
|
||||
reaper. `kata-runtime` creates a `kata-shim` daemon for each container and for each
|
||||
OCI command received to run within an already running container (example, `docker
|
||||
exec`).
|
||||
|
||||
The container workload, that is, the actual OCI bundle rootfs, is exported from the
|
||||
host to the virtual machine. In the case where a block-based graph driver is
|
||||
configured, virtio-scsi will be used. In all other cases a 9pfs virtio mount point
|
||||
will be used. `kata-agent` uses this mount point as the root filesystem for the
|
||||
container processes.
|
||||
|
||||
## Hypervisor
|
||||
|
||||
Kata Containers is designed to support multiple hypervisors. For the 1.0 release,
|
||||
Kata Containers uses just [QEMU](http://www.qemu-project.org/)/[KVM](http://www.linux-kvm.org/page/Main_Page)
|
||||
to create virtual machines where containers will run:
|
||||
|
||||

|
||||
|
||||
### QEMU/KVM
|
||||
|
||||
Depending of the host architecture, Kata Containers support various machine types,
|
||||
for example `pc` and `q35` on x86 systems and `virt` on ARM systems. Kata Containers'
|
||||
default machine type is `pc`. The default machine type and its [`Machine accelerators`](#Machine-accelerators) can
|
||||
be changed by editing the runtime [`configuration`](#Configuration) file.
|
||||
|
||||
The follow QEMU features are used in Kata Containers to manage resource constraints, improve
|
||||
boot time and reduce memory footprint:
|
||||
|
||||
- Machine accelerators.
|
||||
- Hot plug devices.
|
||||
|
||||
Each feature is documented below.
|
||||
|
||||
#### Machine accelerators
|
||||
|
||||
Machine accelerators are architecture specific and can be used to improve the performance
|
||||
and enable specific features of the machine types. The following machine accelerators
|
||||
are used in Kata Containers:
|
||||
|
||||
- nvdimm: This machine accelerator is x86 specific and only supported by `pc` and
|
||||
`q35` machine types. `nvdimm` is used to provide the root filesystem as a persistent
|
||||
memory device to the Virtual Machine.
|
||||
|
||||
Although Kata Containers can run with any recent QEMU release, Kata Containers
|
||||
boot time, memory footprint and 9p IO are significantly optimized by using a specific
|
||||
QEMU version called [`qemu-lite`](https://github.com/kata-containers/qemu/tree/qemu-lite-2.11.0) and
|
||||
custom machine accelerators that are not available in the upstream version of QEMU.
|
||||
These custom machine accelerators are described below.
|
||||
|
||||
- nofw: this machine accelerator is x86 specific and only supported by `pc` and `q35`
|
||||
machine types. `nofw` is used to boot an ELF format kernel by skipping the BIOS/firmware
|
||||
in the guest. This custom machine accelerator improves boot time significantly.
|
||||
- static-prt: this machine accelerator is x86 specific and only supported by `pc`
|
||||
and `q35` machine types. `static-prt` is used to reduce the interpretation burden
|
||||
for guest ACPI component.
|
||||
|
||||
#### Hot plug devices
|
||||
|
||||
The Kata Containers VM starts with a minimum amount of resources, allowing for faster boot time and a reduction in memory footprint. As the container launch progresses, devices are hotplugged to the VM. For example, when a CPU constraint is specified which includes additional CPUs, they can be hot added. Kata Containers has support for hot-adding the following devices:
|
||||
- Virtio block
|
||||
- Virtio SCSI
|
||||
- VFIO
|
||||
- CPU
|
||||
|
||||
### Assets
|
||||
|
||||
The hypervisor will launch a virtual machine which includes a minimal guest kernel
|
||||
and a guest image.
|
||||
|
||||
#### Guest kernel
|
||||
|
||||
The guest kernel is passed to the hypervisor and used to boot the virtual
|
||||
machine. The default kernel provided in Kata Containers is highly optimized for
|
||||
kernel boot time and minimal memory footprint, providing only those services
|
||||
required by a container workload. This is based on a very current upstream Linux
|
||||
kernel.
|
||||
|
||||
#### Guest image
|
||||
|
||||
Kata Containers supports both an `initrd` and `rootfs` based minimal guest image.
|
||||
|
||||
##### Root filesystem image
|
||||
|
||||
The default root filesystem image, sometimes referred to as the "mini O/S", is a
|
||||
highly optimized container bootstrap system based on [Clear Linux](https://clearlinux.org/). It provides an extremely minimal environment and
|
||||
has a highly optimized boot path.
|
||||
|
||||
The only services running in the context of the mini O/S are the init daemon
|
||||
(`systemd`) and the [Agent](#agent). The real workload the user wishes to run
|
||||
is created using libcontainer, creating a container in the same manner that is done
|
||||
by runc.
|
||||
|
||||
For example, when `docker run -ti ubuntu date` is run:
|
||||
|
||||
- The hypervisor will boot the mini-OS image using the guest kernel.
|
||||
- `systemd`, running inside the mini-OS context, will launch the `kata-agent` in
|
||||
the same context.
|
||||
- The agent will create a new confined context to run the specified command in
|
||||
(`date` in this example).
|
||||
- The agent will then execute the command (`date` in this example) inside this
|
||||
new context, first setting the root filesystem to the expected Ubuntu\* root
|
||||
filesystem.
|
||||
|
||||
##### Initrd image
|
||||
|
||||
placeholder
|
||||
|
||||
## Agent
|
||||
|
||||
[`kata-agent`](https://github.com/kata-containers/agent) is a process running in the
|
||||
guest as a supervisor for managing containers and processes running within
|
||||
those containers.
|
||||
|
||||
The `kata-agent` execution unit is the sandbox. A `kata-agent` sandbox is a container sandbox defined by a set of namespaces (NS, UTS, IPC and PID). `kata-runtime` can
|
||||
run several containers per VM to support container engines that require multiple
|
||||
containers running inside a pod. In the case of docker, `kata-runtime` creates a
|
||||
single container per pod.
|
||||
|
||||
`kata-agent` communicates with the other Kata components over gRPC.
|
||||
It also runs a [`yamux`](https://github.com/hashicorp/yamux) server on the same gRPC URL.
|
||||
|
||||
The `kata-agent` makes use of [`libcontainer`](https://github.com/opencontainers/runc/tree/master/libcontainer)
|
||||
to manage the lifecycle of the container. This way the `kata-agent` reuses most
|
||||
of the code used by [`runc`](https://github.com/opencontainers/runc).
|
||||
|
||||
### Agend gRPC protocol
|
||||
|
||||
placeholder
|
||||
|
||||
## Runtime
|
||||
|
||||
`kata-runtime` is an OCI compatible container runtime and is responsible for handling
|
||||
all commands specified by
|
||||
[the OCI runtime specification](https://github.com/opencontainers/runtime-spec)
|
||||
and launching `kata-shim` instances.
|
||||
|
||||
`kata-runtime` heavily utilizes the
|
||||
[virtcontainers project](https://github.com/containers/virtcontainers), which
|
||||
provides a generic, runtime-specification agnostic, hardware-virtualized containers
|
||||
library.
|
||||
|
||||
### Configuration
|
||||
|
||||
The runtime uses a TOML format configuration file called `configuration.toml`. By
|
||||
default this file is installed in the `/usr/share/defaults/kata-containers`
|
||||
directory and contains various settings such as the paths to the hypervisor,
|
||||
the guest kernel and the mini-OS image.
|
||||
|
||||
Most users will not need to modify the configuration file.
|
||||
|
||||
The file is well commented and provides a few "knobs" that can be used to modify
|
||||
the behavior of the runtime.
|
||||
|
||||
The configuration file is also used to enable runtime debug output (see
|
||||
some url to documentation on how to enable debug).
|
||||
|
||||
### Significant OCI commands
|
||||
|
||||
Here we describe how `kata-runtime` handles the most important OCI commands.
|
||||
|
||||
#### [`create`](https://github.com/kata-containers/runtime/blob/master/cli/create.go)
|
||||
|
||||
When handling the OCI `create` command, `kata-runtime` goes through the following steps:
|
||||
|
||||
1. Create the network namespace where we will spawn VM and shims processes.
|
||||
2. Call into the pre-start hooks. One of them should be responsible for creating
|
||||
the `veth` network pair between the host network namespace and the network namespace
|
||||
freshly created.
|
||||
3. Scan the network from the new network namespace, and create a MACVTAP connection
|
||||
between the `veth` interface and a `tap` interface into the VM.
|
||||
4. Start the VM inside the network namespace by providing the `tap` interface
|
||||
previously created.
|
||||
5. Wait for the VM to be ready.
|
||||
6. Start `kata-proxy`, which will connect to the created VM. The `kata-proxy` process
|
||||
will take care of proxying all communications with the VM. Kata has a single proxy
|
||||
per VM.
|
||||
7. Communicate with `kata-agent` (through the proxy) to configure the sandbox
|
||||
inside the VM.
|
||||
8. Communicate with `kata-agent` to create the container, relying on the OCI
|
||||
configuration file `config.json` initially provided to `kata-runtime`. This
|
||||
spawns the container process inside the VM, leveraging the `libcontainer` package.
|
||||
9. Start `kata-shim`, which will connect to the gRPC server socket provided by the `kata-proxy`. `kata-shim` will spawn a few Go routines to parallelize blocking calls `ReadStdout()` , `ReadStderr()` and `WaitProcess()`. Both `ReadStdout()` and `ReadStderr()` are run through infinite loops since `kata-shim` wants the output of those until the container process terminates. `WaitProcess()` is a unique call which returns the exit code of the container process when it terminates inside the VM. Note that `kata-shim` is started inside the network namespace, to allow upper layers to determine which network namespace has been created and by checking the `kata-shim` process. It also creates a new PID namespace by entering into it. This ensures that all `kata-shim` processes belonging to the same container will get killed when the `kata-shim` representing the container process terminates.
|
||||
|
||||
At this point the container process is running inside of the VM, and it is represented
|
||||
on the host system by the `kata-shim` process.
|
||||
|
||||

|
||||
|
||||
|
||||
|
||||
#### [`start`](https://github.com/kata-containers/runtime/blob/master/cli/start.go)
|
||||
|
||||
With traditional containers, `start` launches a container process in its own set of namespaces. With Kata Containers, the main task of `kata-runtime` is to ask [`kata-agent`](#agent) to start the container workload inside the virtual machine. `kata-runtime` will run through the following steps:
|
||||
|
||||
1. Communicate with `kata-agent` (through the proxy) to start the container workload
|
||||
inside the VM. If, for example, the command to execute inside of the container is `top`,
|
||||
the `kata-shim`'s `ReadStdOut()` will start returning text output for top, and
|
||||
`WaitProcess()` will continue to block as long as the `top` process runs.
|
||||
2. Call into the post-start hooks. Usually, this is a no-op since nothing is provided
|
||||
(this needs clarification)
|
||||
|
||||

|
||||
|
||||
#### [`exec`](https://github.com/kata-containers/runtime/blob/master/cli/exec.go)
|
||||
|
||||
OCI `exec` allows you to run an additional command within an already running
|
||||
container. In Kata Containers, this is handled as follows:
|
||||
|
||||
1. A request is sent to the `kata agent` (through the proxy) to start a new process
|
||||
inside an existing container running within the VM.
|
||||
2. A new `kata-shim` is created within the same network and PID namespaces as the
|
||||
original `kata-shim` representing the container process. This new `kata-shim` is
|
||||
used for the new exec process.
|
||||
|
||||
Now the `exec`'ed process is running within the VM, sharing `uts`, `pid`, `mnt` and `ipc` namespaces with the container process.
|
||||
|
||||

|
||||
|
||||
#### [`kill`](https://github.com/kata-containers/runtime/blob/master/cli/kill.go)
|
||||
|
||||
When sending the OCI `kill` command, the container runtime should send a
|
||||
[UNIX signal](https://en.wikipedia.org/wiki/Unix_signal) to the container process.
|
||||
A `kill` sending a termination signal such as `SIGKILL` or `SIGTERM` is expected
|
||||
to terminate the container process. In the context of a traditional container,
|
||||
this means stopping the container. For `kata-runtime`, this translates to stopping
|
||||
the container and the VM associated with it.
|
||||
|
||||
1. Send a request to kill the container process to the `kata-agent` (through the proxy).
|
||||
else needs to be done.
|
||||
2. Wait for `kata-shim` process to exit.
|
||||
3. Force kill the container process if `kata-shim` process didn't return after a
|
||||
timeout. This is done by communicating with `kata-agent` (connecting the proxy),
|
||||
sending `SIGKILL` signal to the container process inside the VM.
|
||||
4. Wait for `kata-shim` process to exit, and return an error if we reach the
|
||||
timeout again.
|
||||
5. Communicate with `kata-agent` (through the proxy) to remove the container
|
||||
configuration from the VM.
|
||||
6. Communicate with `kata-agent` (through the proxy) to destroy the sandbox
|
||||
configuration from the VM.
|
||||
7. Stop the VM.
|
||||
8. Remove all network configurations inside the network namespace and delete the
|
||||
namespace.
|
||||
9. Execute post-stop hooks.
|
||||
|
||||
If `kill` was invoked with a non-termination signal, this simply signals the container process. Otherwise, everything has been torn down, and the VM has been removed.
|
||||
|
||||
#### [`delete`](https://github.com/kata-containers/runtime/blob/master/cli/delete.go)
|
||||
|
||||
`delete` removes all internal resources related to a container. A running container
|
||||
cannot be deleted unless the OCI runtime is explicitly being asked to, by using
|
||||
`--force` flag.
|
||||
|
||||
If the sandbox is not stopped, but the particular container process returned on
|
||||
its own already, the `kata-runtime` will first go through most of the steps a `kill`
|
||||
would go through for a termination signal. After this (or simply this if the sandboxIDwas already stopped), then `kata-runtime` will: If the sandbox was already stoppedfollowed by:
|
||||
|
||||
1. Remove container resources. Every file kept under `/var/{lib,run}/virtcontainers/sandboxes/<sandboxID>/<containerID>`.
|
||||
2. Remove sandbox resources. Every file kept under `/var/{lib,run}/virtcontainers/sandboxes/<sandboxID>`.
|
||||
|
||||
At this point, everything related to the container should have been removed from the host system, and no related process should be running.
|
||||
|
||||
#### [`state`](https://github.com/kata-containers/runtime/blob/master/cli/state.go)
|
||||
|
||||
`state` returns the status of the container. For `kata-runtime`, this means being
|
||||
able to detect if the container is still running by looking at the state of `kata-shim`
|
||||
process representing this container process.
|
||||
|
||||
1. Ask the container status by checking information stored on disk. (clarification needed)
|
||||
2. Check `kata-shim` process representing the container.
|
||||
3. In case the container status on disk was supposed to be `ready` or `running`,
|
||||
and the `kata-shim` process no longer exists, this involves the detection of a
|
||||
stopped container. This means that before returning the container status,
|
||||
the container has to be properly stopped. Here are the steps involved in this detection:
|
||||
1. Wait for `kata-shim` process to exit.
|
||||
2. Force kill the container process if `kata-shim` process didn't return after a timeout. This is done by communicating with `kata-agent` (connecting the proxy), sending `SIGKILL` signal to the container process inside the VM.
|
||||
3. Wait for `kata-shim` process to exit, and return an error if we reach the timeout again.
|
||||
4. Communicate with `kata-agent` (connecting the proxy) to remove the container configuration from the VM.
|
||||
4. Return container status.
|
||||
|
||||
## Proxy
|
||||
|
||||
Communication with the VM can be achieved by either `virtio-serial` or, if the host
|
||||
kernel is newer than v4.8, a virtual socket, `vsock` can be used. The default is `virtio-serial`.
|
||||
|
||||
The VM will likely be running multiple container processes. In the event `virtio-serial`
|
||||
is used, the I/O streams associated with each process needs to be multiplexed and demultiplexed on the host. On systems with `vsock` support, this component becomes optional.
|
||||
|
||||
`kata-proxy` is a process offering access to the VM [`kata-agent`](https://github.com/kata-containers/agent)
|
||||
to multiple `kata-shim` and `kata-runtime` clients associated with the VM. Its
|
||||
main role is to route the I/O streams and signals between each `kata-shim`
|
||||
instance and the `kata-agent`.
|
||||
`kata-proxy` connects to `kata-agent` on a unix domain socket that `kata-runtime` provides
|
||||
while spawning `kata-proxy`.
|
||||
`kata-proxy` uses [`yamux`](https://github.com/hashicorp/yamux) to multiplex gRPC
|
||||
requests on its connection to the `kata-agent`.
|
||||
|
||||
When proxy type is configured as "proxyBuiltIn", we do not spawn a separate
|
||||
process to proxy grpc connections. Instead a built-in yamux grpc dialer is used to connect
|
||||
directly to `kata-agent`. This is used by CRI container runtime server `frakti` which
|
||||
calls directly into `kata-runtime`.
|
||||
|
||||
## Shim
|
||||
|
||||
A container process reaper, such as Docker's `containerd-shim` or CRI-O's `conmon`,
|
||||
is designed around the assumption that it can monitor and reap the actual container
|
||||
process. As the container process reaper runs on the host, it cannot directly
|
||||
monitor a process running within a virtual machine. At most it can see the QEMU
|
||||
process, but that is not enough. With Kata Containers, `kata-shim` acts as the
|
||||
container process that the container process reaper can monitor. Therefore
|
||||
`kata-shim` needs to handle all container I/O streams (`stdout`, `stdin` and `stderr`)
|
||||
and forward all signals the container process reaper decides to send to the container
|
||||
process.
|
||||
|
||||
`kata-shim` has an implicit knowledge about which VM agent will handle those streams
|
||||
and signals and thus acts as an encapsulation layer between the container process
|
||||
reaper and the `kata-agent`. `kata-shim`:
|
||||
|
||||
- Connects to `kata-proxy` on a unix domain socket. The socket url is passed from
|
||||
`kata-runtime` to `kata-shim` when the former spawns the latter along with a
|
||||
`containerID` and `execID`. The `containerID` and `execID` are used to identify
|
||||
the true container process that the shim process will be shadowing or representing.
|
||||
- Forwards the standard input stream from the container process reaper into
|
||||
`kata-proxy` using grpc `WriteStdin` gRPC API.
|
||||
- Reads the standard output/error from the container process to the
|
||||
- Forwards signals it receives from the container process reaper to `kata-proxy`
|
||||
using `SignalProcessRequest` API.
|
||||
- Monitors terminal changes and forwards them to `kata-proxy` using grpc `TtyWinResize`
|
||||
API.
|
||||
|
||||
|
||||
## Networking
|
||||
|
||||
Containers will typically live in their own, possibly shared, networking namespace.
|
||||
At some point in a container lifecycle, container engines will set up that namespace
|
||||
to add the container to a network which is isolated from the host network, but
|
||||
which is shared between containers
|
||||
|
||||
In order to do so, container engines will usually add one end of a `virtual ethernet
|
||||
(veth)` pair into the container networking namespace. The other end of the `veth`
|
||||
pair is added to the container network.
|
||||
|
||||
This is a very namespace-centric approach as many hypervisors (in particular QEMU)
|
||||
cannot handle `veth` interfaces. Typically, `TAP` interfaces are created for VM
|
||||
connectivity.
|
||||
|
||||
To overcome incompatibility between typical container engines expectations
|
||||
and virtual machines, `kata-runtime` networking transparently connects `veth`
|
||||
interfaces with `TAP` ones using MACVTAP:
|
||||
|
||||

|
||||
|
||||
Kata Containers supports both
|
||||
[CNM](https://github.com/docker/libnetwork/blob/master/docs/design.md#the-container-network-model)
|
||||
and [CNI](https://github.com/containernetworking/cni) for networking management.
|
||||
|
||||
### CNM
|
||||
|
||||

|
||||
|
||||
__CNM lifecycle__
|
||||
|
||||
1. RequestPool
|
||||
|
||||
2. CreateNetwork
|
||||
|
||||
3. RequestAddress
|
||||
|
||||
4. CreateEndPoint
|
||||
|
||||
5. CreateContainer
|
||||
|
||||
6. Create `config.json`
|
||||
|
||||
7. Create PID and network namespace
|
||||
|
||||
8. ProcessExternalKey
|
||||
|
||||
9. JoinEndPoint
|
||||
|
||||
10. LaunchContainer
|
||||
|
||||
11. Launch
|
||||
|
||||
12. Run container
|
||||
|
||||

|
||||
|
||||
__Runtime network setup with CNM__
|
||||
|
||||
1. Read `config.json`
|
||||
|
||||
2. Create the network namespace
|
||||
|
||||
3. Call the prestart hook (from inside the netns)
|
||||
|
||||
4. Scan network interfaces inside netns and get the name of the interface
|
||||
created by prestart hook
|
||||
|
||||
5. Create bridge, TAP, and link all together with network interface previously
|
||||
created
|
||||
|
||||
### CNI
|
||||
|
||||

|
||||
|
||||
__Runtime network setup with CNI__
|
||||
|
||||
1. Create the network namespace.
|
||||
|
||||
2. Get CNI plugin information.
|
||||
|
||||
3. Start the plugin (providing previously created network namespace) to add a network
|
||||
described into `/etc/cni/net.d/ directory`. At that time, the CNI plugin will
|
||||
create the `cni0` network interface and a veth pair between the host and the created
|
||||
netns. It links `cni0` to the veth pair before to exit.
|
||||
|
||||
4. Create network bridge, TAP, and link all together with network interface previously
|
||||
created.
|
||||
|
||||
5. Start VM inside the netns and start the container.
|
||||
|
||||
## Storage
|
||||
Container workloads are shared with the virtualized environment through [9pfs](https://www.kernel.org/doc/Documentation/filesystems/9p.txt).
|
||||
The devicemapper storage driver is a special case. The driver uses dedicated block
|
||||
devices rather than formatted filesystems, and operates at the block level rather
|
||||
than the file level. This knowledge is used to directly use the underlying block
|
||||
device instead of the overlay file system for the container root file system. The
|
||||
block device maps to the top read-write layer for the overlay. This approach gives
|
||||
much better I/O performance compared to using 9pfs to share the container file system.
|
||||
|
||||
The approach above does introduce a limitation in terms of dynamic file copy
|
||||
in/out of the container using the `docker cp` operations. The copy operation from
|
||||
host to container accesses the mounted file system on the host-side. This is
|
||||
not expected to work and may lead to inconsistencies as the block device will
|
||||
be simultaneously written to from two different mounts. The copy operation from
|
||||
container to host will work, provided the user calls `sync(1)` from within the
|
||||
container prior to the copy to make sure any outstanding cached data is written
|
||||
to the block device.
|
||||
|
||||
```
|
||||
docker cp [OPTIONS] CONTAINER:SRC_PATH HOST:DEST_PATH
|
||||
docker cp [OPTIONS] HOST:SRC_PATH CONTAINER:DEST_PATH
|
||||
```
|
||||
|
||||
Kata Containers has the ability to hotplug and remove block devices, which makes it
|
||||
possible to use block devices for containers started after the VM has been launched.
|
||||
|
||||
Users can check to see if the container uses the devicemapper block device as its
|
||||
rootfs by calling `mount(8)` within the counter. If the devicemapper block device
|
||||
is used, `/` will be mounted on `/dev/vda`. Users can disable direct mounting
|
||||
of the underlying block device through the runtime configuration.
|
||||
|
||||
## Kubernetes support
|
||||
|
||||
[Kubernetes\*](https://github.com/kubernetes/kubernetes/) is a popular open source
|
||||
container orchestration engine. In Kubernetes, a set of containers sharing resources
|
||||
such as networking, storage, mount, PID, etc. is called a
|
||||
[Pod](https://kubernetes.io/docs/user-guide/pods/).
|
||||
A node can have multiple pods, but at a minimum, a node within a Kubernetes cluster
|
||||
only needs to run a container runtime and a container agent (called a
|
||||
[kubelet](https://kubernetes.io/docs/admin/kubelet/)).
|
||||
|
||||
A Kubernetes cluster runs a control plane where a scheduler (typically running on a
|
||||
dedicated master node) calls into a compute kubelet. This kubelet instance is
|
||||
responsible for managing the lifecycle of pods within the nodes and eventually relies
|
||||
on a container runtime to handle execution. The kubelet architecture decouples
|
||||
lifecycle management from container execution through the dedicated
|
||||
[`gRPC`](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/cri/v1alpha1/runtime/api.proto)
|
||||
based [Container Runtime Interface (CRI)](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/container-runtime-interface-v1.md).
|
||||
|
||||
In other words, a kubelet is a CRI client and expects a CRI implementation to
|
||||
handle the server side of the interface.
|
||||
[CRI-O\*](https://github.com/kubernetes-incubator/cri-o) and [CRI-containerd\*](https://github.com/containerd/cri) are CRI implementations that rely on [OCI](https://github.com/opencontainers/runtime-spec)
|
||||
compatible runtimes for managing container instances.
|
||||
|
||||
Kata Containers is an officially supported CRI-O and CRI-containerd runtime. It is OCI compatible and therefore aligns with each projects' architecture and requirements.
|
||||
However, due to the fact that Kubernetes execution units are sets of containers (also
|
||||
known as pods) rather than single containers, the Kata Containers runtime needs to
|
||||
get extra information to seamlessly integrate with Kubernetes.
|
||||
|
||||
### Problem statement
|
||||
|
||||
The Kubernetes\* execution unit is a pod that has specifications detailing constraints
|
||||
such as namespaces, groups, hardware resources, security contents, *etc* shared by all
|
||||
the containers within that pod.
|
||||
By default the kubelet will send a container creation request to its CRI runtime for
|
||||
each pod and container creation. Without additional metadata from the CRI runtime,
|
||||
the Kata Containers runtime will thus create one virtual machine for each pod and for
|
||||
each containers within a pod. However the task of providing the Kubernetes pod semantics
|
||||
when creating one virtual machine for each container within the same pod is complex given
|
||||
the resources of these virtual machines (such as networking or PID) need to be shared.
|
||||
|
||||
The challenge with Kata Containers when working as a Kubernetes\* runtime is thus to know
|
||||
when to create a full virtual machine (for pods) and when to create a new container inside
|
||||
a previously created virtual machine. In both cases it will get called with very similar
|
||||
arguments, so it needs the help of the Kubernetes CRI runtime to be able to distinguish a
|
||||
pod creation request from a container one.
|
||||
|
||||
### CRI-O
|
||||
|
||||
#### OCI annotations
|
||||
|
||||
In order for the Kata Containers runtime (or any virtual machine based OCI compatible
|
||||
runtime) to be able to understand if it needs to create a full virtual machine or if it
|
||||
has to create a new container inside an existing pod's virtual machine, CRI-O adds
|
||||
specific annotations to the OCI configuration file (`config.json`) which is passed to
|
||||
the OCI compatible runtime.
|
||||
|
||||
Before calling its runtime, CRI-O will always add a `io.kubernetes.cri-o.ContainerType`
|
||||
annotation to the `config.json` configuration file it produces from the kubelet CRI
|
||||
request. The `io.kubernetes.cri-o.ContainerType` annotation can either be set to `sandbox`
|
||||
or `container`. Kata Containers will then use this annotation to decide if it needs to
|
||||
respectively create a virtual machine or a container inside a virtual machine associated
|
||||
with a Kubernetes pod:
|
||||
|
||||
```Go
|
||||
containerType, err := ociSpec.ContainerType()
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
switch containerType {
|
||||
case vc.PodSandbox:
|
||||
process, err = createPod(ociSpec, runtimeConfig, containerID, bundlePath, console, disableOutput)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
case vc.PodContainer:
|
||||
process, err = createContainer(ociSpec, containerID, bundlePath, console, disableOutput)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
|
||||
```
|
||||
|
||||
### Mixing VM based and namespace based runtimes
|
||||
|
||||
One interesting evolution of the CRI-O support for `kata-runtime` is the ability
|
||||
to run virtual machine based pods alongside namespace ones. With CRI-O and Kata
|
||||
Containers, one can introduce the concept of workload trust inside a Kubernetes
|
||||
cluster.
|
||||
|
||||
A cluster operator can now tag (through Kubernetes annotations) container workloads
|
||||
as `trusted` or `untrusted`. The former labels known to be safe workloads while
|
||||
the latter describes potentially malicious or misbehaving workloads that need the
|
||||
highest degree of isolation. In a software development context, an example of a `trusted` workload would be a containerized continuous integration engine whereas all
|
||||
developers applications would be `untrusted` by default. Developers workloads can
|
||||
be buggy, unstable or even include malicious code and thus from a security perspective
|
||||
it makes sense to tag them as `untrusted`. A CRI-O and Kata Containers based
|
||||
Kubernetes cluster handles this use case transparently as long as the deployed
|
||||
containers are properly tagged. All `untrusted` containers will be handled by kata Containers and thus run in a hardware virtualized secure sandbox while `runc`, for
|
||||
example, could handle the `trusted` ones.
|
||||
|
||||
CRI-O's default behavior is to trust all pods, except when they're annotated with
|
||||
`io.kubernetes.cri-o.TrustedSandbox` set to `false`. The default CRI-O trust level
|
||||
is set through its `configuration.toml` configuration file. Generally speaking,
|
||||
the CRI-O runtime selection between its trusted runtime (typically `runc`) and its untrusted one (`kata-runtime`) is a function of the pod `Privileged` setting, the `io.kubernetes.cri-o.TrustedSandbox` annotation value, and the default CRI-O trust
|
||||
level. When a pod is `Privileged`, the runtime will always be `runc`. However, when
|
||||
a pod is **not** `Privileged` the runtime selection is done as follows:
|
||||
|
||||
| | `io.kubernetes.cri-o.TrustedSandbox` not set | `io.kubernetes.cri-o.TrustedSandbox` = `true` | `io.kubernetes.cri-o.TrustedSandbox` = `false` |
|
||||
| :--- | :---: | :---: | :---: |
|
||||
| Default CRI-O trust level: `trusted` | runc | runc | Kata Containers |
|
||||
| Default CRI-O trust level: `untrusted` | Kata Containers | Kata Containers | Kata Containers |
|
||||
|
||||
|
||||
### CRI-containerd
|
||||
|
||||
placeholder
|
||||
|
||||
#### Mixing VM based and namespace based runtimes
|
||||
|
||||
placeholder
|
||||
|
||||
# Appendices
|
||||
|
||||
## DAX
|
||||
|
||||
Kata Containers utilizes the Linux kernel DAX [(Direct Access filesystem)](https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/dax.txt)
|
||||
feature to efficiently map some host-side files into the guest VM space.
|
||||
In particular, Kata Containers uses the QEMU nvdimm feature to provide a
|
||||
memory-mapped virtual device that can be used to DAX map the virtual machine's
|
||||
root filesystem into the guest memory address space.
|
||||
|
||||
Mapping files using DAX provides a number of benefits over more traditional VM
|
||||
file and device mapping mechanisms:
|
||||
|
||||
- Mapping as a direct access devices allows the guest to directly access
|
||||
the host memory pages (such as via eXicute In Place (XIP)), bypassing the guest
|
||||
page cache. This provides both time and space optimizations.
|
||||
- Mapping as a direct access device inside the VM allows pages from the
|
||||
host to be demand loaded using page faults, rather than having to make requests
|
||||
via a virtualized device (causing expensive VM exits/hypercalls), thus providing
|
||||
a speed optimization.
|
||||
- Utilizing `MAP_SHARED` shared memory on the host allows the host to efficiently
|
||||
share pages.
|
||||
|
||||
Kata Containers uses the following steps to set up the DAX mappings:
|
||||
1. QEMU is configured with an nvdimm memory device, with a memory file
|
||||
backend to map in the host-side file into the virtual nvdimm space.
|
||||
2. The guest kernel command line mounts this nvdimm device with the DAX
|
||||
feature enabled, allowing direct page mapping and access, thus bypassing the
|
||||
guest page cache.
|
||||
|
||||

|
||||
|
||||
Information on the use of nvdimm via QEMU is available in the [QEMU source code](http://git.qemu-project.org/?p=qemu.git;a=blob;f=docs/nvdimm.txt;hb=HEAD)
|