Merge pull request #3287 from jodh-intel/docs-split-arch-doc

Split architecture doc into separate files

Commit 2ebae2d279
```diff
@@ -41,7 +41,7 @@ Documents that help to understand and contribute to Kata Containers.
 
 ### Design and Implementations
 
-* [Kata Containers Architecture](design/architecture.md): Architectural overview of Kata Containers
+* [Kata Containers Architecture](design/architecture): Architectural overview of Kata Containers
 * [Kata Containers E2E Flow](design/end-to-end-flow.md): The entire end-to-end flow of Kata Containers
 * [Kata Containers design](./design/README.md): More Kata Containers design documents
 * [Kata Containers threat model](./threat-model/threat-model.md): Kata Containers threat model
```
```diff
@@ -114,7 +114,7 @@ with containerd.
 > kernel or image.
 
 If you are using custom
-[guest assets](design/architecture.md#guest-assets),
+[guest assets](design/architecture/README.md#guest-assets),
 you must upgrade them to work with Kata Containers 2.x since Kata
 Containers 1.x assets will **not** work.
```
```diff
@@ -2,7 +2,7 @@
 
 Kata Containers design documents:
 
-- [Kata Containers architecture](architecture.md)
+- [Kata Containers architecture](architecture)
 - [API Design of Kata Containers](kata-api-design.md)
 - [Design requirements for Kata Containers](kata-design-requirements.md)
 - [VSocks](VSocks.md)
```
@@ -1,864 +0,0 @@
# Kata Containers Architecture

## Overview

Kata Containers is an open source community working to build a secure
container [runtime](#runtime) with lightweight virtual machines (VM's)
that feel and perform like standard Linux containers, but provide
stronger [workload](#workload) isolation using hardware
[virtualization](#virtualization) technology as a second layer of
defence.

Kata Containers runs on [multiple architectures](../../src/runtime/README.md#platform-support)
and supports [multiple hypervisors](../hypervisors.md).

This document is a summary of the Kata Containers architecture.

## Virtualization

For details on how Kata Containers maps container concepts to VM
technologies, and how this is realized in the multiple hypervisors and
VMMs that Kata supports, see the
[virtualization documentation](./virtualization.md).

## Compatibility

The [Kata Containers runtime](../../src/runtime) is compatible with
the [OCI](https://github.com/opencontainers)
[runtime specification](https://github.com/opencontainers/runtime-spec)
and therefore works seamlessly with the
[Kubernetes Container Runtime Interface (CRI)](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-node/container-runtime-interface.md)
through the [CRI-O](https://github.com/kubernetes-incubator/cri-o)
and [containerd](https://github.com/containerd/containerd)
implementations.

Kata Containers provides a ["shimv2"](#shim-v2-architecture) compatible runtime.

## Shim v2 architecture

The Kata Containers runtime is shim v2 ("shimv2") compatible. This
section explains what this means.

### History

In the old [Kata 1.x architecture](https://github.com/kata-containers/documentation/blob/master/design/architecture.md),
the Kata [runtime](#runtime) was an executable called `kata-runtime`.
The container manager called this executable multiple times when
creating each container. Each time the runtime was called, a different
OCI command-line verb was provided. This architecture was simple, but
not well suited to creating VM based containers due to the issue of
handling state between calls. Additionally, the architecture suffered
from performance issues related to continually having to spawn new
instances of the runtime binary, and
[Kata shim](https://github.com/kata-containers/shim) and
[Kata proxy](https://github.com/kata-containers/proxy) processes for systems
that did not provide VSOCK.

### An improved architecture

The
[containerd runtime shimv2 architecture](https://github.com/containerd/containerd/tree/main/runtime/v2)
or _shim API_ architecture resolves the issues with the old
architecture by defining a set of shimv2 APIs that a compatible
runtime implementation must supply. Rather than calling the runtime
binary multiple times for each new container, the shimv2 architecture
runs a single instance of the runtime binary (for any number of
containers). This improves performance and resolves the state handling
issue.

The shimv2 API is similar to the
[OCI runtime](https://github.com/opencontainers/runtime-spec)
API in terms of the way the container lifecycle is split into
different verbs. Rather than calling the runtime multiple times, the
container manager creates a socket and passes it to the shimv2
runtime. The socket is a bi-directional communication channel that
uses a gRPC based protocol to allow the container manager to send API
calls to the runtime, which returns the result to the container
manager using the same channel.

The shimv2 architecture allows running several containers per VM to
support container engines that require multiple containers running
inside a pod.

With the new architecture [Kubernetes](#kubernetes-support) can
launch both Pod and OCI compatible containers with a single
[runtime](#runtime) shim per Pod, rather than `2N+1` shims. No stand
alone `kata-proxy` process is required, even if VSOCK is not
available.

### Architectural comparison

| Kata version | Kata Runtime process calls | Kata shim processes | Kata proxy processes (if no VSOCK) |
|-|-|-|-|
| 1.x | multiple per container | 1 per container connection | 1 |
| 2.x | 1 per VM (hosting any number of containers) | 0 | 0 |

> **Notes:**
>
> - A single VM can host one or more containers.
>
> - The "Kata shim processes" column refers to the old
>   [Kata shim](https://github.com/kata-containers/shim) (`kata-shim` binary),
>   *not* the new shimv2 runtime instance (`containerd-shim-kata-v2` binary).

The diagram below shows how the original architecture was simplified
with the advent of shimv2.



## Root filesystem

This document uses the term _rootfs_ to refer to a root filesystem
which is mounted as the top-level directory ("`/`") and often referred
to as _slash_.

It is important to understand this term since the overall system uses
multiple different rootfs's (as explained in the
[Environments](#environments) section).
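Each of these environments has its own, distinct rootfs, and the current one can be shown directly. A minimal check, assuming a Linux system with the util-linux `findmnt` utility (not part of the Kata tooling):

```shell
# Show the filesystem currently mounted as "/" (the rootfs).
# Run in the host, the guest VM, or a container, this reports a
# different mount in each environment.
findmnt -n -o TARGET,FSTYPE /
```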

## Example command

The following containerd command creates a container. It is referred
to throughout this document to help explain various points:

```bash
$ sudo ctr run --runtime "io.containerd.kata.v2" --rm -t "quay.io/libpod/ubuntu:latest" foo sh
```

This command requests that containerd:

- Create a container (`ctr run`).
- Use the Kata [shimv2](#shim-v2-architecture) runtime (`--runtime "io.containerd.kata.v2"`).
- Delete the container when it [exits](#workload-exit) (`--rm`).
- Attach the container to the user's terminal (`-t`).
- Use the Ubuntu Linux [container image](#container-image)
  to create the container [rootfs](#root-filesystem) that will become
  the [container environment](#environments)
  (`quay.io/libpod/ubuntu:latest`).
- Create the container with the name "`foo`".
- Run the `sh(1)` command in the Ubuntu rootfs based container
  environment.

The command specified here is referred to as the [workload](#workload).

> **Note:**
>
> For the purposes of this document and to keep explanations
> simpler, we assume the user is running this command in the
> [host environment](#environments).

## Container image

In the [example command](#example-command) the user has specified the
type of container they wish to run via the container image name:
`ubuntu`. This image name corresponds to a _container image_ that can
be used to create a container with an Ubuntu Linux environment. Hence,
in our [example](#example-command), the `sh(1)` command will be run
inside a container which has an Ubuntu rootfs.

> **Note:**
>
> The term _container image_ is confusing since the image in question
> is **not** a container: it is simply a set of files (_an image_)
> that can be used to _create_ a container. The term _container
> template_ would be more accurate but the term _container image_ is
> commonly used so this document uses the standard term.

For the purposes of this document, the most important part of the
[example command line](#example-command) is the container image the
user has requested. Normally, the container manager will _pull_
(download) a container image from a remote site and store a copy
locally. This local container image is used by the container manager
to create an [OCI bundle](#oci-bundle) which will form the environment
the container will run in. After creating the OCI bundle, the
container manager launches a [runtime](#runtime) which will create the
container using the provided OCI bundle.

## OCI bundle

To understand what follows, it is important to know at a high level
how an OCI ([Open Containers Initiative](https://opencontainers.org)) compatible container is created.

An OCI compatible container is created by taking a
[container image](#container-image) and converting the embedded rootfs
into an
[OCI rootfs bundle](https://github.com/opencontainers/runtime-spec/blob/main/bundle.md),
or more simply, an _OCI bundle_.

An OCI bundle is a `tar(1)` archive normally created by a container
manager which is passed to an OCI [runtime](#runtime) which converts
it into a full container rootfs. The bundle contains two assets:

- A container image [rootfs](#root-filesystem)

  This is simply a directory of files that will be used to represent
  the rootfs for the container.

  For the [example command](#example-command), the directory will
  contain the files necessary to create a minimal Ubuntu root
  filesystem.

- An [OCI configuration file](https://github.com/opencontainers/runtime-spec/blob/main/config.md)

  This is a JSON file called `config.json`.

  The container manager will create this file so that:

  - The `root.path` value is set to the full path of the specified
    container rootfs.

    In [the example](#example-command) this value will be `ubuntu`.

  - The `process.args` array specifies the list of commands the user
    wishes to run. This is known as the [workload](#workload).

    In [the example](#example-command) the workload is `sh(1)`.
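These two fields are easy to see in a hand-built bundle. The sketch below is illustrative only: a real `config.json` generated by a container manager contains many more fields, and the use of `mktemp(1)` and `jq(1)` here is an assumption, not part of the Kata tooling:

```shell
# Build a toy OCI bundle directory (illustrative, not a runnable bundle).
bundle=$(mktemp -d)
mkdir -p "$bundle/rootfs"

cat > "$bundle/config.json" <<'EOF'
{
  "ociVersion": "1.0.2",
  "root": { "path": "rootfs" },
  "process": { "args": [ "sh" ] }
}
EOF

# The two fields discussed above:
jq -r '.root.path' "$bundle/config.json"       # prints: rootfs
jq -r '.process.args[0]' "$bundle/config.json" # prints: sh
```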

## Workload

The workload is the command the user requested to run in the
container and is specified in the [OCI bundle](#oci-bundle)'s
configuration file.

In our [example](#example-command), the workload is the `sh(1)` command.

### Workload root filesystem

For details of how the [runtime](#runtime) makes the
[container image](#container-image) chosen by the user available to
the workload process, see the
[Container creation](#container-creation) and [storage](#storage) sections.

Note that the workload is isolated from the [guest VM](#environments) environment by its
surrounding [container environment](#environments). The guest VM
environment in which the container runs is also isolated from the _outer_
[host environment](#environments) where the container manager runs.

## System overview

### Environments

The following terminology is used to describe the different
environments (or contexts) various processes run in. It is necessary
to study this table closely to make sense of what follows:

| Type | Name | Virtualized | Containerized | rootfs | Rootfs device type | Mount type | Description |
|-|-|-|-|-|-|-|-|
| Host | Host | no `[1]` | no | Host specific | Host specific | Host specific | The environment provided by a standard, physical non virtualized system. |
| VM root | Guest VM | yes | no | rootfs inside the [guest image](#guest-image) | Hypervisor specific `[2]` | `ext4` | The first (or top) level VM environment created on a host system. |
| VM container root | Container | yes | yes | rootfs type requested by user ([`ubuntu` in the example](#example-command)) | `kataShared` | [virtio FS](#virtio-fs) | The first (or top) level container environment created inside the VM. Based on the [OCI bundle](#oci-bundle). |

**Key:**

- `[1]`: For simplicity, this document assumes the host environment
  runs on physical hardware.

- `[2]`: See the [DAX](#dax) section.

> **Notes:**
>
> - The word "root" is used to mean _top level_ here in a similar
>   manner to the term [rootfs](#root-filesystem).
>
> - The "first level" prefix used above is important since it implies
>   that it is possible to create multi level systems. However, they do
>   not form part of a standard Kata Containers environment so will not
>   be considered in this document.

The reasons for containerizing the [workload](#workload) inside the VM
are:

- Isolates the workload entirely from the VM environment.
- Provides better isolation between containers in a [pod](#kubernetes-support).
- Allows the workload to be managed and monitored through its cgroup
  confinement.

### Container creation

The steps below show at a high level how a Kata Containers container is
created using the containerd container manager:

1. The user requests the creation of a container by running a command
   like the [example command](#example-command).
1. The container manager daemon runs a single instance of the Kata
   [runtime](#runtime).
1. The Kata runtime loads its [configuration file](#configuration).
1. The container manager calls a set of shimv2 API functions on the runtime.
1. The Kata runtime launches the configured [hypervisor](#hypervisor).
1. The hypervisor creates and starts (_boots_) a VM using the
   [guest assets](#guest-assets):

   - The hypervisor [DAX](#dax) shares the [guest image](#guest-image)
     into the VM to become the VM [rootfs](#root-filesystem) (mounted on a `/dev/pmem*` device),
     which is known as the [VM root environment](#environments).
   - The hypervisor mounts the [OCI bundle](#oci-bundle), using [virtio FS](#virtio-fs),
     into a container specific directory inside the VM's rootfs.

     This container specific directory will become the
     [container rootfs](#environments), known as the
     [container environment](#environments).

1. The [agent](#agent) is started as part of the VM boot.

1. The runtime calls the agent's `CreateSandbox` API to request the
   agent create a container:

   1. The agent creates a [container environment](#environments)
      in the container specific directory that contains the [container rootfs](#environments).

      The container environment hosts the [workload](#workload) in the
      [container rootfs](#environments) directory.

   1. The agent spawns the workload inside the container environment.

   > **Notes:**
   >
   > - The container environment created by the agent is equivalent to
   >   a container environment created by the
   >   [`runc`](https://github.com/opencontainers/runc) OCI runtime;
   >   Linux cgroups and namespaces are created inside the VM by the
   >   [guest kernel](#guest-kernel) to isolate the workload from the
   >   VM environment the container is created in. See the
   >   [Environments](#environments) section for an explanation of why
   >   this is done.
   >
   > - See the [guest image](#guest-image) section for details of
   >   exactly how the agent is started.

1. The container manager returns control of the container to the
   user running the `ctr` command.

   > **Note:**
   >
   > At this point, the container is running and:
   >
   > - The [workload](#workload) process ([`sh(1)` in the example](#example-command))
   >   is running in the [container environment](#environments).
   > - The user is now able to interact with the workload
   >   (using the [`ctr` command in the example](#example-command)).
   > - The [agent](#agent), running inside the VM, is monitoring the
   >   [workload](#workload) process.
   > - The [runtime](#runtime) is waiting for the agent's `WaitProcess` API
   >   call to complete.

Further details of these steps are provided in the sections below.

### Container shutdown

There are two possible ways for the container environment to be
terminated:

- When the [workload](#workload) exits.

  This is the standard, or _graceful_, shutdown method.

- When the container manager forces the container to be deleted.

#### Workload exit

The [agent](#agent) will detect when the [workload](#workload) process
exits, capture its exit status (see `wait(2)`) and return that value
to the [runtime](#runtime) by specifying it as the response to the
`WaitProcess` agent API call made by the [runtime](#runtime).

The runtime then passes the value back to the container manager via the
`Wait` [shimv2 API](#shim-v2-architecture) call.
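The net effect is that the workload's exit status travels from the agent, through the runtime, back to the container manager unchanged. The `wait(2)` mechanics the agent relies on can be sketched with an ordinary local child process:

```shell
# Reap a child and capture its exit status, as the agent does for the
# workload process (via wait(2)) before relaying it over WaitProcess.
sh -c 'exit 42' &
wait "$!"
status=$?
echo "$status"   # prints: 42
```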

Once the workload has fully exited, the VM is no longer needed and the
runtime cleans up the environment (which includes terminating the
[hypervisor](#hypervisor) process).

> **Note:**
>
> When [agent tracing is enabled](../tracing.md#agent-shutdown-behaviour),
> the shutdown behaviour is different.

#### Container manager requested shutdown

If the container manager requests the container be deleted, the
[runtime](#runtime) will signal the agent by sending it a
`DestroySandbox` [ttRPC API](../../src/agent/protocols/protos/agent.proto) request.

## Guest assets

Kata Containers creates a VM in which to run one or more containers. It
does this by launching a [hypervisor](#hypervisor) to create the VM.
The hypervisor needs two assets for this task: a Linux kernel and a
small root filesystem image to boot the VM.

### Guest kernel

The [guest kernel](../../tools/packaging/kernel)
is passed to the hypervisor and used to boot the VM.
The default kernel provided in Kata Containers is highly optimized for
kernel boot time and minimal memory footprint, providing only those
services required by a container workload. It is based on the latest
Linux LTS (Long Term Support) [kernel](https://www.kernel.org).

### Guest image

The hypervisor uses an image file which provides a minimal root
filesystem used by the guest kernel to boot the VM and host the Kata
Container. Kata Containers supports both initrd and rootfs based
minimal guest images. The [default packages](../install/) provide both
an image and an initrd, both of which are created using the
[`osbuilder`](../../tools/osbuilder) tool.

> **Notes:**
>
> - Although initrd and rootfs based images are supported, not all
>   [hypervisors](#hypervisor) support both types of image.
>
> - The guest image is *unrelated* to the image used in a container
>   workload.
>
>   For example, if a user creates a container that runs a shell in a
>   BusyBox image, they will run that shell in a BusyBox environment.
>   However, the guest image running inside the VM that is used to
>   *host* that BusyBox image could be running Clear Linux, Ubuntu,
>   Fedora, or potentially any other distribution.
>
>   The `osbuilder` tool provides
>   [configurations for various common Linux distributions](../../tools/osbuilder/rootfs-builder)
>   which can be built into either initrd or rootfs guest images.
>
> - If you are using a [packaged version of Kata
>   Containers](../install), you can see image details by running the
>   [`kata-collect-data.sh`](../../src/runtime/data/kata-collect-data.sh.in)
>   script as `root` and looking at the "Image details" section of the
>   output.

#### Root filesystem image

The default packaged rootfs image, sometimes referred to as the _mini
O/S_, is a highly optimized container bootstrap system.

If this image type is [configured](#configuration), when the user runs
the [example command](#example-command):

- The [runtime](#runtime) will launch the configured [hypervisor](#hypervisor).
- The hypervisor will boot the mini-OS image using the [guest kernel](#guest-kernel).
- The kernel will start the init daemon as PID 1 (`systemd`) inside the VM root environment.
- `systemd`, running inside the mini-OS context, will launch the [agent](#agent)
  in the root context of the VM.
- The agent will create a new container environment, setting its root
  filesystem to that requested by the user (Ubuntu in [the example](#example-command)).
- The agent will then execute the command (`sh(1)` in [the example](#example-command))
  inside the new container.

The table below summarises the default mini O/S showing the
environments that are created, the services running in those
environments (for all platforms) and the root filesystem used by
each service:

| Process | Environment | systemd service? | rootfs | User accessible | Notes |
|-|-|-|-|-|-|
| systemd | VM root | n/a | [VM guest image](#guest-image) | [debug console][debug-console] | The init daemon, running as PID 1 |
| [Agent](#agent) | VM root | yes | [VM guest image](#guest-image) | [debug console][debug-console] | Runs as a systemd service |
| `chronyd` | VM root | yes | [VM guest image](#guest-image) | [debug console][debug-console] | Used to synchronise the time with the host |
| container workload (`sh(1)` in [the example](#example-command)) | VM container | no | User specified (Ubuntu in [the example](#example-command)) | [exec command](#exec-command) | Managed by the agent |

See also the [process overview](#process-overview).

> **Notes:**
>
> - The "User accessible" column shows how an administrator can access
>   the environment.
>
> - The container workload is running inside a full container
>   environment which itself is running within a VM environment.
>
> - See the [configuration files for the `osbuilder` tool](../../tools/osbuilder/rootfs-builder)
>   for details of the default distribution for platforms other than
>   Intel x86_64.

#### Initrd image

The initrd image is a compressed `cpio(1)` archive, created from a
rootfs which is loaded into memory and used as part of the Linux
startup process. During startup, the kernel unpacks it into a special
instance of a `tmpfs` mount that becomes the initial root filesystem.
|
|
||||||
|
|
||||||
If this image type is [configured](#configuration), when the user runs
|
|
||||||
the [example command](#example-command):
|
|
||||||
|
|
||||||
- The [runtime](#runtime) will launch the configured [hypervisor](#hypervisor).
|
|
||||||
- The hypervisor will boot the mini-OS image using the [guest kernel](#guest-kernel).
|
|
||||||
- The kernel will start the init daemon as PID 1 (the [agent](#agent))
|
|
||||||
inside the VM root environment.
|
|
||||||
- The [agent](#agent) will create a new container environment, setting its root
|
|
||||||
filesystem to that requested by the user (`ubuntu` in
|
|
||||||
[the example](#example-command)).
|
|
||||||
- The agent will then execute the command (`sh(1)` in [the example](#example-command))
|
|
||||||
inside the new container.
|
|
||||||
|
|
||||||
The table below summarises the default mini O/S showing the environments that are created,
|
|
||||||
the processes running in those environments (for all platforms) and
|
|
||||||
the root filesystem used by each service:
|
|
||||||
|
|
||||||
| Process | Environment | rootfs | User accessible | Notes |
|
|
||||||
|-|-|-|-|-|
|
|
||||||
| [Agent](#agent) | VM root | [VM guest image](#guest-image) | [debug console][debug-console] | Runs as the init daemon (PID 1) |
|
|
||||||
| container workload | VM container | User specified (Ubuntu in this example) | [exec command](#exec-command) | Managed by the agent |
|
|
||||||
|
|
||||||
> **Notes:**
|
|
||||||
>
|
|
||||||
> - The "User accessible" column shows how an administrator can access
|
|
||||||
> the environment.
|
|
||||||
>
|
|
||||||
> - It is possible to use a standard init daemon such as systemd with
|
|
||||||
> an initrd image if this is desirable.
|
|
||||||
|
|
||||||
See also the [process overview](#process-overview).
|
|
||||||
|
|
||||||
#### Image summary
|
|
||||||
|
|
||||||
| Image type | Default distro | Init daemon | Reason | Notes |
|
|
||||||
|-|-|-|-|-|
|
|
||||||
| [image](#root-filesystem-image) | [Clear Linux](https://clearlinux.org) (for x86_64 systems)| systemd | Minimal and highly optimized | systemd offers flexibility |
|
|
||||||
| [initrd](#initrd-image) | [Alpine Linux](https://alpinelinux.org) | Kata [agent](#agent) (as no systemd support) | Security hardened and tiny C library |
|
|
||||||
|
|
||||||
See also:
|
|
||||||
|
|
||||||
- The [osbuilder](../../tools/osbuilder) tool
|
|
||||||
|
|
||||||
This is used to build all default image types.
|
|
||||||
|
|
||||||
- The [versions database](../../versions.yaml)
|
|
||||||
|
|
||||||
The `default-image-name` and `default-initrd-name` options specify
|
|
||||||
the default distributions for each image type.
|
|
||||||
|
|
||||||
## Hypervisor
|
|
||||||
|
|
||||||
The [hypervisor](../hypervisors.md) specified in the
|
|
||||||
[configuration file](#configuration) creates a VM to host the
|
|
||||||
[agent](#agent) and the [workload](#workload) inside the
|
|
||||||
[container environment](#environments).
|
|
||||||
|
|
||||||
> **Note:**
|
|
||||||
>
|
|
||||||
> The hypervisor process runs inside an environment slightly different
|
|
||||||
> to the host environment:
|
|
||||||
>
|
|
||||||
> - It is run in a different cgroup environment to the host.
|
|
||||||
> - It is given a separate network namespace from the host.
|
|
||||||
> - If the [OCI configuration specifies a SELinux label](https://github.com/opencontainers/runtime-spec/blob/main/config.md#linux-process),
|
|
||||||
> the hypervisor process will run with that label (*not* the workload running inside the hypervisor's VM).
|
|
||||||
|
|
||||||
## Agent
|
|
||||||
|
|
||||||
The Kata Containers agent ([`kata-agent`](../../src/agent)), written
|
|
||||||
in the [Rust programming language](https://www.rust-lang.org), is a
|
|
||||||
long running process that runs inside the VM. It acts as the
|
|
||||||
supervisor for managing the containers and the [workload](#workload)
|
|
||||||
running within those containers. Only a single agent process is run
|
|
||||||
for each VM created.
|
|
||||||
|
|
||||||
### Agent communications protocol

The agent communicates with the other Kata components (primarily the
[runtime](#runtime)) using a
[`ttRPC`](https://github.com/containerd/ttrpc-rust) based
[protocol](../../src/agent/protocols/protos).

> **Note:**
>
> If you wish to learn more about this protocol, a practical way to do
> so is to experiment with the
> [agent control tool](#agent-control-tool) on a test system.
> This tool is for test and development purposes only and can send
> arbitrary ttRPC agent API commands to the [agent](#agent).
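As an illustration, an experiment with the agent control tool might look like the following sketch. The VSOCK CID/port and flag spellings are assumptions for illustration only (check the tool's `--help` output); the command is printed rather than executed so no live agent is needed:

```shell
# Hypothetical invocation of the agent control tool (kata-agent-ctl).
# The vsock address and flag names are assumptions; verify against the
# tool's own help output. Printed, not run, so this works anywhere.
cmd='kata-agent-ctl -l debug connect --server-address "vsock://3:1024" --cmd Check'
echo "$cmd"
```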
## Runtime

The Kata Containers runtime (the
[`containerd-shim-kata-v2`](../../src/runtime/cmd/containerd-shim-kata-v2)
binary) is a [shimv2](#shim-v2-architecture) compatible runtime.

> **Note:**
>
> The Kata Containers runtime is sometimes referred to as the Kata
> _shim_. Both terms are correct since the `containerd-shim-kata-v2`
> is a container runtime, and that runtime implements the containerd
> shim v2 API.

The runtime makes heavy use of the [`virtcontainers`
package](../../src/runtime/virtcontainers), which provides a generic,
runtime-specification agnostic, hardware-virtualized containers
library.

The runtime is responsible for starting the [hypervisor](#hypervisor)
and its VM, and for communicating with the [agent](#agent) using a
[ttRPC based protocol](#agent-communications-protocol) over a VSOCK
socket that provides a communications link between the VM and the
host.

This protocol allows the runtime to send container management commands
to the agent. The protocol is also used to carry the standard I/O
streams (`stdout`, `stderr`, `stdin`) between the containers and
container managers (such as CRI-O or containerd).
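For orientation, registering the shimv2 runtime with containerd is typically a matter of a configuration fragment along these lines (a sketch following the layout of containerd's CRI plugin configuration; treat the exact section names as assumptions to verify against your containerd version):

```toml
# Fragment of /etc/containerd/config.toml (illustrative):
# containerd resolves runtime_type "io.containerd.kata.v2" to the
# containerd-shim-kata-v2 binary on its PATH.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
  runtime_type = "io.containerd.kata.v2"
```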
## Utility program

The `kata-runtime` binary is a utility program that provides
administrative commands to manipulate and query a Kata Containers
installation.

> **Note:**
>
> In Kata 1.x, this program also acted as the main
> [runtime](#runtime), but this is no longer required due to the
> improved shimv2 architecture.

### exec command

The `exec` command allows an administrator or developer to enter the
[VM root environment](#environments), which is not accessible to the container
[workload](#workload).

See [the developer guide](../Developer-Guide.md#connect-to-debug-console) for further details.
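A few representative invocations are sketched below. The sandbox ID is a placeholder, and the commands are only printed (not executed) so the sketch runs without a Kata installation; consult `kata-runtime --help` for the authoritative subcommand list:

```shell
# Representative kata-runtime invocations (printed, not executed).
# The sandbox ID is a placeholder for illustration.
sandbox_id="<sandbox-id>"
for c in "kata-runtime check" \
         "kata-runtime env" \
         "kata-runtime exec $sandbox_id"; do
  echo "$c"
done
```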
### Configuration

See the [configuration file details](../../src/runtime/README.md#configuration).

The configuration file is also used to enable runtime [debug output](../Developer-Guide.md#enable-full-debug).
## Process overview

The table below shows an example of the main processes running in the
different [environments](#environments) when a Kata Container is
created with containerd using our [example command](#example-command):

| Description | Host | VM root environment | VM container environment |
|-|-|-|-|
| Container manager | `containerd` | | |
| Kata Containers | [runtime](#runtime), [`virtiofsd`](#virtio-fs), [hypervisor](#hypervisor) | [agent](#agent) | |
| User [workload](#workload) | | | [`ubuntu sh`](#example-command) |
## Networking

Containers will typically live in their own, possibly shared, networking namespace.
At some point in a container lifecycle, container engines will set up that namespace
to add the container to a network which is isolated from the host network, but
which is shared between containers.

In order to do so, container engines will usually add one end of a virtual
ethernet (`veth`) pair into the container networking namespace. The other end of
the `veth` pair is added to the host networking namespace.

This is a very namespace-centric approach as many hypervisors or VM
Managers (VMMs) such as `virt-manager` cannot handle `veth`
interfaces. Typically, `TAP` interfaces are created for VM
connectivity.

To overcome incompatibility between typical container engine expectations
and virtual machines, Kata Containers networking transparently connects `veth`
interfaces with `TAP` ones using Traffic Control:

![Kata Containers networking](arch-images/network.png)

With a TC filter in place, a redirection is created between the container network and the
virtual machine. As an example, the CNI may create a device, `eth0`, in the container's network
namespace, which is a VETH device. Kata Containers will create a tap device for the VM, `tap0_kata`,
and set up a TC redirection filter to mirror traffic from `eth0`'s ingress to `tap0_kata`'s egress,
and a second to mirror traffic from `tap0_kata`'s ingress to `eth0`'s egress.
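Conceptually, the two redirection filters resemble the following `tc(8)` commands. This is a hedged sketch: the exact qdisc and filter arguments Kata generates internally may differ, and the commands are printed rather than executed since installing filters requires root and real devices:

```shell
# Sketch of the TC mirred redirection between the veth and tap devices.
# Printed (not run): installing filters needs root privileges and the
# exact arguments Kata generates internally may differ.
veth=eth0
tap=tap0_kata
for devs in "$veth $tap" "$tap $veth"; do
  set -- $devs   # unquoted on purpose: split "src dst" into $1 and $2
  echo "tc qdisc add dev $1 ingress"
  echo "tc filter add dev $1 parent ffff: protocol all u32 match u8 0 0 action mirred egress redirect dev $2"
done
```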
Kata Containers maintains support for MACVTAP, which was an earlier implementation used in Kata. TC-filter
is the default because it allows for simpler configuration, better CNI plugin compatibility, and performance
on par with MACVTAP.

Kata Containers has deprecated support for bridge due to its lower performance relative to TC-filter and MACVTAP.
Kata Containers supports both
[CNM](https://github.com/docker/libnetwork/blob/master/docs/design.md#the-container-network-model)
and [CNI](https://github.com/containernetworking/cni) for networking management.
### Network Hotplug

Kata Containers has developed a set of network sub-commands and APIs to add, list and
remove a guest network endpoint and to manipulate the guest route table.

The following diagram illustrates the Kata Containers network hotplug workflow.

![Network Hotplug](arch-images/kata-containers-network-hotplug.png)
## Storage

### virtio SCSI

If a block-based graph driver is [configured](#configuration),
`virtio-scsi` is used to _share_ the workload image (such as
`busybox:latest`) into the container's environment inside the VM.

### virtio FS

If a block-based graph driver is _not_ [configured](#configuration), a
[`virtio-fs`](https://virtio-fs.gitlab.io) (`VIRTIO`) overlay
filesystem mount point is used to _share_ the workload image instead. The
[agent](#agent) uses this mount point as the root filesystem for the
container processes.

For virtio-fs, the [runtime](#runtime) starts one `virtiofsd` daemon
(that runs in the host context) for each VM created.
### Devicemapper

The
[devicemapper `snapshotter`](https://github.com/containerd/containerd/tree/master/snapshots/devmapper)
is a special case. The `snapshotter` uses dedicated block devices
rather than formatted filesystems, and operates at the block level
rather than the file level. This allows the underlying block device to
be used directly for the container root file system instead of the
overlay file system. The block device maps to the top
read-write layer for the overlay. This approach gives much better I/O
performance compared to using `virtio-fs` to share the container file
system.
#### Hot plug and unplug

Kata Containers has the ability to hot plug and hot unplug
block devices. This makes it possible to use block devices for
containers started after the VM has been launched.

Users can check whether the container uses the `devicemapper` block
device as its rootfs by calling `mount(8)` within the container. If
the `devicemapper` block device is used, the root filesystem (`/`)
will be mounted from `/dev/vda`. Users can disable direct mounting of
the underlying block device through the runtime
[configuration](#configuration).
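The check described above can be scripted as in the sketch below. The sample `mount` line is hard-coded so the sketch runs anywhere; inside a real container you would parse the live `mount` output instead:

```shell
# Check whether / is mounted from the devicemapper-backed /dev/vda.
# The sample line stands in for real `mount` output inside a container.
sample='/dev/vda on / type ext4 (rw,relatime)'
case "$sample" in
  "/dev/vda on / "*) rootfs_dev=/dev/vda ;;
  *)                 rootfs_dev=other ;;
esac
echo "rootfs device: $rootfs_dev"
```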
## Kubernetes support

[Kubernetes](https://github.com/kubernetes/kubernetes/), or K8s, is a popular open source
container orchestration engine. In Kubernetes, a set of containers sharing resources
such as networking, storage, mount and PID namespaces is called a
[pod](https://kubernetes.io/docs/user-guide/pods/).

A node can have multiple pods, but at a minimum, a node within a Kubernetes cluster
only needs to run a container runtime and a container agent (called a
[Kubelet](https://kubernetes.io/docs/admin/kubelet/)).

Kata Containers represents a Kubelet pod as a VM.
A Kubernetes cluster runs a control plane where a scheduler (typically
running on a dedicated master node) calls into a compute Kubelet. This
Kubelet instance is responsible for managing the lifecycle of pods
within the nodes and eventually relies on a container runtime to
handle execution. The Kubelet architecture decouples lifecycle
management from container execution through a dedicated gRPC based
[Container Runtime Interface (CRI)](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/container-runtime-interface-v1.md).

In other words, a Kubelet is a CRI client and expects a CRI
implementation to handle the server side of the interface.
[CRI-O](https://github.com/kubernetes-incubator/cri-o) and
[containerd](https://github.com/containerd/containerd/) are CRI
implementations that rely on
[OCI](https://github.com/opencontainers/runtime-spec) compatible
runtimes for managing container instances.

Kata Containers is an officially supported CRI-O and containerd
runtime. Refer to the following guides on how to set up Kata
Containers with Kubernetes:

- [How to use Kata Containers and containerd](../how-to/containerd-kata.md)
- [Run Kata Containers with Kubernetes](../how-to/run-kata-with-k8s.md)
#### OCI annotations

In order for the Kata Containers [runtime](#runtime) (or any VM based OCI compatible
runtime) to understand if it needs to create a full VM or if it
has to create a new container inside an existing pod's VM, CRI-O adds
specific annotations to the OCI configuration file (`config.json`) which is passed to
the OCI compatible runtime.

Before calling its runtime, CRI-O will always add a `io.kubernetes.cri-o.ContainerType`
annotation to the `config.json` configuration file it produces from the Kubelet CRI
request. The `io.kubernetes.cri-o.ContainerType` annotation can either be set to `sandbox`
or `container`. Kata Containers will then use this annotation to decide if it needs to
respectively create a virtual machine or a container inside a virtual machine associated
with a Kubernetes pod:

| Annotation value | Kata VM created? | Kata container created? |
|-|-|-|
| `sandbox` | yes | yes (inside new VM) |
| `container`| no | yes (in existing VM) |
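The decision in the table can be sketched as a small shell function. This is a toy model of the runtime's logic, not the actual Go implementation:

```shell
# Toy model of how the runtime interprets the CRI-O ContainerType annotation.
decide() {
  case "$1" in
    sandbox)   echo "create new VM, then create container inside it" ;;
    container) echo "create container inside the pod's existing VM" ;;
    *)         echo "unknown ContainerType: $1" ;;
  esac
}
decide sandbox
decide container
```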
#### Mixing VM based and namespace based runtimes

> **Note:** Since Kubernetes 1.12, the [`Kubernetes RuntimeClass`](https://kubernetes.io/docs/concepts/containers/runtime-class/)
> has been supported and the user can specify the runtime without the non-standardized annotations.

With `RuntimeClass`, users can define Kata Containers as a
`RuntimeClass` and then explicitly specify that a pod must be created
as a Kata Containers pod. For details, please refer to [How to use
Kata Containers and containerd](../../docs/how-to/containerd-kata.md).
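For orientation, a minimal `RuntimeClass` definition and a pod that selects it look roughly like the following sketch. The `kata` handler name must match the runtime name configured in the container runtime (see the containerd guide linked above), so treat it as an assumption for your cluster:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata          # must match the runtime name configured in containerd
---
apiVersion: v1
kind: Pod
metadata:
  name: kata-demo
spec:
  runtimeClassName: kata
  containers:
    - name: demo
      image: ubuntu
```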
## Tracing

The [tracing document](../tracing.md) provides details on the tracing
architecture.
# Appendices

## DAX

Kata Containers utilizes the Linux kernel DAX
[(Direct Access filesystem)](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/dax.rst?h=v5.14)
feature to efficiently map the [guest image](#guest-image) in the
[host environment](#environments) into the
[guest VM environment](#environments) to become the VM's
[rootfs](#root-filesystem).

If the [configured](#configuration) [hypervisor](#hypervisor) is set
to either QEMU or Cloud Hypervisor, DAX is used with the feature shown
in the table below:

| Hypervisor | Feature used | rootfs device type |
|-|-|-|
| Cloud Hypervisor (CH) | `dax` `FsConfig` configuration option | PMEM (emulated Persistent Memory device) |
| QEMU | NVDIMM memory device with a memory file backend | NVDIMM (emulated Non-Volatile Dual In-line Memory Module device) |

The features in the table above are equivalent in that they provide a memory-mapped
virtual device which is used to DAX map the VM's
[rootfs](#root-filesystem) into the [VM guest](#environments) memory
address space.

The VM is then booted, specifying the `root=` kernel parameter to make
the [guest kernel](#guest-kernel) use the appropriate emulated device
as its rootfs.
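As an illustrative sketch, the relevant part of the guest kernel command line for the QEMU/NVDIMM case might look like the fragment below. The device name and mount flags are generated by the runtime, so treat these exact values as assumptions:

```
root=/dev/pmem0p1 rootflags=dax,data=ordered,errors=remount-ro rootfstype=ext4
```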
### DAX advantages

Mapping files using [DAX](#dax) provides a number of benefits over
more traditional VM file and device mapping mechanisms:

- Mapping as a direct access device allows the guest to directly
  access the host memory pages (such as via Execute In Place (XIP)),
  bypassing the [guest kernel](#guest-kernel)'s page cache. This
  zero copy provides both time and space optimizations.

- Mapping as a direct access device inside the VM allows pages from the
  host to be demand loaded using page faults, rather than having to make requests
  via a virtualized device (causing expensive VM exits/hypercalls), thus providing
  a speed optimization.

- Utilizing `mmap(2)`'s `MAP_SHARED` shared memory option on the host
  allows the host to efficiently share pages.

![DAX](arch-images/DAX.png)
For further details of the use of NVDIMM with QEMU, see the [QEMU
project documentation](https://www.qemu.org).

## Agent control tool

The [agent control tool](../../src/tools/agent-ctl) is a test and
development tool that can be used to learn more about a Kata
Containers system.

## Terminology

See the [project glossary](../../Glossary.md).

[debug-console]: ../Developer-Guide.md#connect-to-debug-console
`docs/design/architecture/README.md` (new file, 479 lines):

@@ -0,0 +1,479 @@
# Kata Containers Architecture

## Overview

Kata Containers is an open source community working to build a secure
container [runtime](#runtime) with lightweight virtual machines (VMs)
that feel and perform like standard Linux containers, but provide
stronger [workload](#workload) isolation using hardware
[virtualization](#virtualization) technology as a second layer of
defence.
Kata Containers runs on [multiple architectures](../../../src/runtime/README.md#platform-support)
and supports [multiple hypervisors](../../hypervisors.md).

This document is a summary of the Kata Containers architecture.
## Background knowledge

This document assumes the reader understands a number of concepts
related to containers and file systems. The
[background](background.md) document explains these concepts.
## Example command

This document makes use of a particular [example
command](example-command.md) throughout the text to illustrate certain
concepts.
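For readers who do not want to open the linked page, the example command is of roughly the following shape, a sketch using `ctr`: the image name and container ID are placeholders, and the command is printed rather than executed so no containerd installation is needed:

```shell
# Sketch of the kind of command the document refers to; the image and
# container ID are placeholders, and the command is printed, not run.
image="docker.io/library/ubuntu:latest"
echo "ctr run --runtime io.containerd.kata.v2 --rm -t $image kata-example sh"
```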
## Virtualization

For details on how Kata Containers maps container concepts to VM
technologies, and how this is realized in the multiple hypervisors and
VMMs that Kata supports, see the
[virtualization documentation](../virtualization.md).
## Compatibility

The [Kata Containers runtime](../../../src/runtime) is compatible with
the [OCI](https://github.com/opencontainers)
[runtime specification](https://github.com/opencontainers/runtime-spec)
and therefore works seamlessly with the
[Kubernetes Container Runtime Interface (CRI)](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-node/container-runtime-interface.md)
through the [CRI-O](https://github.com/kubernetes-incubator/cri-o)
and [containerd](https://github.com/containerd/containerd)
implementations.

Kata Containers provides a ["shimv2"](#shim-v2-architecture) compatible runtime.
## Shim v2 architecture

The Kata Containers runtime is shim v2 ("shimv2") compatible. This
section explains what this means.

> **Note:**
>
> For a comparison with the Kata 1.x architecture, see
> [the architectural history document](history.md).

The
[containerd runtime shimv2 architecture](https://github.com/containerd/containerd/tree/main/runtime/v2)
or _shim API_ architecture resolves the issues with the old
architecture by defining a set of shimv2 APIs that a compatible
runtime implementation must supply. Rather than calling the runtime
binary multiple times for each new container, the shimv2 architecture
runs a single instance of the runtime binary (for any number of
containers). This improves performance and resolves the state handling
issue.

The shimv2 API is similar to the
[OCI runtime](https://github.com/opencontainers/runtime-spec)
API in terms of the way the container lifecycle is split into
different verbs. Rather than calling the runtime multiple times, the
container manager creates a socket and passes it to the shimv2
runtime. The socket is a bi-directional communication channel that
uses a gRPC based protocol to allow the container manager to send API
calls to the runtime, which returns the result to the container
manager using the same channel.

The shimv2 architecture allows running several containers per VM to
support container engines that require multiple containers running
inside a pod.

With the new architecture, [Kubernetes](kubernetes.md) can
launch both Pod and OCI compatible containers with a single
[runtime](#runtime) shim per Pod, rather than `2N+1` shims. No
standalone `kata-proxy` process is required, even if VSOCK is not
available.
## Workload

The workload is the command the user requested to run in the
container and is specified in the [OCI bundle](background.md#oci-bundle)'s
configuration file.

In our [example](example-command.md), the workload is the `sh(1)` command.
### Workload root filesystem

For details of how the [runtime](#runtime) makes the
[container image](background.md#container-image) chosen by the user available to
the workload process, see the
[Container creation](#container-creation) and [storage](#storage) sections.

Note that the workload is isolated from the [guest VM](#environments) environment by its
surrounding [container environment](#environments). The guest VM
environment in which the container runs is also isolated from the _outer_
[host environment](#environments) where the container manager runs.
## System overview

### Environments

The following terminology is used to describe the different
environments (or contexts) various processes run in. It is necessary
to study this table closely to make sense of what follows:

| Type | Name | Virtualized | Containerized | rootfs | Rootfs device type | Mount type | Description |
|-|-|-|-|-|-|-|-|
| Host | Host | no `[1]` | no | Host specific | Host specific | Host specific | The environment provided by a standard, physical non virtualized system. |
| VM root | Guest VM | yes | no | rootfs inside the [guest image](guest-assets.md#guest-image) | Hypervisor specific `[2]` | `ext4` | The first (or top) level VM environment created on a host system. |
| VM container root | Container | yes | yes | rootfs type requested by user ([`ubuntu` in the example](example-command.md)) | `kataShared` | [virtio FS](storage.md#virtio-fs) | The first (or top) level container environment created inside the VM. Based on the [OCI bundle](background.md#oci-bundle). |
**Key:**

- `[1]`: For simplicity, this document assumes the host environment
  runs on physical hardware.

- `[2]`: See the [DAX](#dax) section.

> **Notes:**
>
> - The word "root" is used to mean _top level_ here in a similar
>   manner to the term [rootfs](background.md#root-filesystem).
>
> - The "first level" prefix used above is important since it implies
>   that it is possible to create multi level systems. However, they do
>   not form part of a standard Kata Containers environment so will not
>   be considered in this document.
The reasons for containerizing the [workload](#workload) inside the VM
are:

- Isolates the workload entirely from the VM environment.
- Provides better isolation between containers in a [pod](kubernetes.md).
- Allows the workload to be managed and monitored through its cgroup
  confinement.
### Container creation

The steps below show at a high level how a Kata Containers container is
created using the containerd container manager:

1. The user requests the creation of a container by running a command
   like the [example command](example-command.md).
1. The container manager daemon runs a single instance of the Kata
   [runtime](#runtime).
1. The Kata runtime loads its [configuration file](#configuration).
1. The container manager calls a set of shimv2 API functions on the runtime.
1. The Kata runtime launches the configured [hypervisor](#hypervisor).
1. The hypervisor creates and starts (_boots_) a VM using the
   [guest assets](guest-assets.md#guest-assets):

   - The hypervisor [DAX](#dax) shares the
     [guest image](guest-assets.md#guest-image)
     into the VM to become the VM [rootfs](background.md#root-filesystem) (mounted on a `/dev/pmem*` device),
     which is known as the [VM root environment](#environments).
   - The hypervisor mounts the [OCI bundle](background.md#oci-bundle), using [virtio FS](storage.md#virtio-fs),
     into a container specific directory inside the VM's rootfs.

     This container specific directory will become the
     [container rootfs](#environments), known as the
     [container environment](#environments).

1. The [agent](#agent) is started as part of the VM boot.

1. The runtime calls the agent's `CreateSandbox` API to request that the
   agent create a container:

   1. The agent creates a [container environment](#environments)
      in the container specific directory that contains the [container rootfs](#environments).

      The container environment hosts the [workload](#workload) in the
      [container rootfs](#environments) directory.

   1. The agent spawns the workload inside the container environment.

   > **Notes:**
   >
   > - The container environment created by the agent is equivalent to
   >   a container environment created by the
   >   [`runc`](https://github.com/opencontainers/runc) OCI runtime;
   >   Linux cgroups and namespaces are created inside the VM by the
   >   [guest kernel](guest-assets.md#guest-kernel) to isolate the
   >   workload from the VM environment the container is created in.
   >   See the [Environments](#environments) section for an
   >   explanation of why this is done.
   >
   > - See the [guest image](guest-assets.md#guest-image) section for
   >   details of exactly how the agent is started.

1. The container manager returns control of the container to the
   user running the `ctr` command.

> **Note:**
>
> At this point, the container is running and:
>
> - The [workload](#workload) process ([`sh(1)` in the example](example-command.md))
>   is running in the [container environment](#environments).
> - The user is now able to interact with the workload
>   (using the [`ctr` command in the example](example-command.md)).
> - The [agent](#agent), running inside the VM, is monitoring the
>   [workload](#workload) process.
> - The [runtime](#runtime) is waiting for the agent's `WaitProcess` API
>   call to complete.
Further details of these steps are provided in the sections below.
### Container shutdown

There are two possible ways for the container environment to be
terminated:

- When the [workload](#workload) exits.

  This is the standard, or _graceful_, shutdown method.

- When the container manager forces the container to be deleted.

#### Workload exit

The [agent](#agent) will detect when the [workload](#workload) process
exits, capture its exit status (see `wait(2)`) and return that value
to the [runtime](#runtime) by specifying it as the response to the
`WaitProcess` agent API call made by the [runtime](#runtime).

The runtime then passes the value back to the container manager via the
`Wait` [shimv2 API](#shim-v2-architecture) call.

Once the workload has fully exited, the VM is no longer needed and the
runtime cleans up the environment (which includes terminating the
[hypervisor](#hypervisor) process).

> **Note:**
>
> When [agent tracing is enabled](../../tracing.md#agent-shutdown-behaviour),
> the shutdown behaviour is different.
#### Container manager requested shutdown

If the container manager requests the container be deleted, the
[runtime](#runtime) will signal the agent by sending it a
`DestroySandbox` [ttRPC API](../../../src/agent/protocols/protos/agent.proto) request.
## Guest assets

The guest assets comprise a guest image and a guest kernel that are
used by the [hypervisor](#hypervisor).

See the [guest assets](guest-assets.md) document for further
information.
## Hypervisor

The [hypervisor](../../hypervisors.md) specified in the
[configuration file](#configuration) creates a VM to host the
[agent](#agent) and the [workload](#workload) inside the
[container environment](#environments).

> **Note:**
>
> The hypervisor process runs inside an environment slightly different
> to the host environment:
>
> - It is run in a different cgroup environment to the host.
> - It is given a separate network namespace from the host.
> - If the [OCI configuration specifies a SELinux label](https://github.com/opencontainers/runtime-spec/blob/main/config.md#linux-process),
>   the hypervisor process will run with that label (*not* the workload running inside the hypervisor's VM).
## Agent

The Kata Containers agent ([`kata-agent`](../../../src/agent)), written
in the [Rust programming language](https://www.rust-lang.org), is a
long running process that runs inside the VM. It acts as the
supervisor for managing the containers and the [workload](#workload)
running within those containers. Only a single agent process is run
for each VM created.
### Agent communications protocol

The agent communicates with the other Kata components (primarily the
[runtime](#runtime)) using a
[`ttRPC`](https://github.com/containerd/ttrpc-rust) based
[protocol](../../../src/agent/protocols/protos).

> **Note:**
>
> If you wish to learn more about this protocol, a practical way to do
> so is to experiment with the
> [agent control tool](#agent-control-tool) on a test system.
> This tool is for test and development purposes only and can send
> arbitrary ttRPC agent API commands to the [agent](#agent).
## Runtime
|
||||||
|
|
||||||
|
The Kata Containers runtime (the [`containerd-shim-kata-v2`](../../../src/runtime/cmd/containerd-shim-kata-v2
|
||||||
|
) binary) is a [shimv2](#shim-v2-architecture) compatible runtime.
|
||||||
|
|
||||||
|
> **Note:**
|
||||||
|
>
|
||||||
|
> The Kata Containers runtime is sometimes referred to as the Kata
|
||||||
|
> _shim_. Both terms are correct since the `containerd-shim-kata-v2`
|
||||||
|
> is a container runtime, and that runtime implements the containerd
|
||||||
|
> shim v2 API.
|
||||||
|
|
||||||
|
The runtime makes heavy use of the [`virtcontainers`
|
||||||
|
package](../../../src/runtime/virtcontainers), which provides a generic,
|
||||||
|
runtime-specification agnostic, hardware-virtualized containers
|
||||||
|
library.
|
||||||
|
|
||||||
|
The runtime is responsible for starting the [hypervisor](#hypervisor)
|
||||||
|
and its VM, and communicating with the [agent](#agent) using a
|
||||||
|
[ttRPC based protocol](#agent-communications-protocol) over a VSOCK
|
||||||
|
socket that provides a communications link between the VM and the
|
||||||
|
host.
|
||||||
|
|
||||||
|
This protocol allows the runtime to send container management commands
|
||||||
|
to the agent. The protocol is also used to carry the standard I/O
|
||||||
|
streams (`stdout`, `stderr`, `stdin`) between the containers and
|
||||||
|
container managers (such as CRI-O or containerd).
|
||||||
|
|
||||||
|
## Utility program
|
||||||
|
|
||||||
|
The `kata-runtime` binary is a utility program that provides
|
||||||
|
administrative commands to manipulate and query a Kata Containers
|
||||||
|
installation.
|
||||||
|
|
||||||
|
> **Note:**
|
||||||
|
>
|
||||||
|
> In Kata 1.x, this program also acted as the main
|
||||||
|
> [runtime](#runtime), but this is no longer required due to the
|
||||||
|
> improved shimv2 architecture.
|
||||||
|
|
||||||
|
### exec command
|
||||||
|
|
||||||
|
The `exec` command allows an administrator or developer to enter the
|
||||||
|
[VM root environment](#environments), which is not accessible by the container
|
||||||
|
[workload](#workload).
|
||||||
|
|
||||||
|
See [the developer guide](../../Developer-Guide.md#connect-to-debug-console) for further details.
|
||||||
|
|
||||||
|
### Configuration
|
||||||
|
|
||||||
|
See the [configuration file details](../../../src/runtime/README.md#configuration).
|
||||||
|
|
||||||
|
The configuration file is also used to enable runtime [debug output](../../Developer-Guide.md#enable-full-debug).
|
||||||
|
|
||||||
|
## Process overview
|
||||||
|
|
||||||
|
The table below shows an example of the main processes running in the
|
||||||
|
different [environments](#environments) when a Kata Container is
|
||||||
|
created with containerd using our [example command](example-command.md):
|
||||||
|
|
||||||
|
| Description | Host | VM root environment | VM container environment |
|
||||||
|
|-|-|-|-|
|
||||||
|
| Container manager | `containerd` | | |
|
||||||
|
| Kata Containers | [runtime](#runtime), [`virtiofsd`](storage.md#virtio-fs), [hypervisor](#hypervisor) | [agent](#agent) | |
|
||||||
|
| User [workload](#workload) | | | [`ubuntu sh`](example-command.md) |
|
||||||
|
|
||||||
|
## Networking
|
||||||
|
|
||||||
|
See the [networking document](networking.md).
|
||||||
|
|
||||||
|
## Storage
|
||||||
|
|
||||||
|
See the [storage document](storage.md).
|
||||||
|
|
||||||
|
## Kubernetes support
|
||||||
|
|
||||||
|
See the [Kubernetes document](kubernetes.md).
|
||||||
|
|
||||||
|
### OCI annotations
|
||||||
|
|
||||||
|
For the Kata Containers [runtime](#runtime) (or any VM based OCI compatible
runtime) to know whether it needs to create a full VM or a new container
inside an existing pod's VM, CRI-O adds specific annotations to the OCI
configuration file (`config.json`) that it passes to the OCI compatible
runtime.
|
||||||
|
|
||||||
|
Before calling its runtime, CRI-O will always add an `io.kubernetes.cri-o.ContainerType`
|
||||||
|
annotation to the `config.json` configuration file it produces from the Kubelet CRI
|
||||||
|
request. The `io.kubernetes.cri-o.ContainerType` annotation can either be set to `sandbox`
|
||||||
|
or `container`. Kata Containers then uses this annotation to decide
whether to create a new virtual machine or a container inside an
existing virtual machine associated with a Kubernetes pod:
|
||||||
|
|
||||||
|
| Annotation value | Kata VM created? | Kata container created? |
|
||||||
|
|-|-|-|
|
||||||
|
| `sandbox` | yes | yes (inside new VM) |
|
||||||
|
| `container`| no | yes (in existing VM) |
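
To make this concrete, the following sketch shows what such an annotated `config.json` might contain (the JSON is a hypothetical minimal excerpt, not a complete OCI configuration):

```bash
# Illustrative excerpt of the annotation CRI-O adds to a pod sandbox's
# OCI configuration (hypothetical minimal example, not a full config.json).
cat > config.json <<'EOF'
{
  "ociVersion": "1.0.2",
  "annotations": {
    "io.kubernetes.cri-o.ContainerType": "sandbox"
  }
}
EOF

# A VM based runtime inspects the annotation value to decide whether to
# create a new VM ("sandbox") or a container inside an existing VM
# ("container").
grep -o '"io.kubernetes.cri-o.ContainerType": "[a-z]*"' config.json
```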
|
||||||
|
|
||||||
|
### Mixing VM based and namespace based runtimes
|
||||||
|
|
||||||
|
> **Note:** Since Kubernetes 1.12, the [`Kubernetes RuntimeClass`](https://kubernetes.io/docs/concepts/containers/runtime-class/)
|
||||||
|
> has been supported and the user can specify runtime without the non-standardized annotations.
|
||||||
|
|
||||||
|
With `RuntimeClass`, users can define Kata Containers as a
|
||||||
|
`RuntimeClass` and then explicitly specify that a pod must be created
|
||||||
|
as a Kata Containers pod. For details, please refer to [How to use
|
||||||
|
Kata Containers and containerd](../../../docs/how-to/containerd-kata.md).
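
As a sketch, a `RuntimeClass` definition and a pod that selects it might look like the following (the class name and handler are illustrative; the `handler` value must match the runtime name configured in containerd or CRI-O):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
---
apiVersion: v1
kind: Pod
metadata:
  name: kata-pod
spec:
  runtimeClassName: kata
  containers:
  - name: app
    image: quay.io/libpod/ubuntu:latest
```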
|
||||||
|
|
||||||
|
## Tracing
|
||||||
|
|
||||||
|
The [tracing document](../../tracing.md) provides details on the tracing
|
||||||
|
architecture.
|
||||||
|
|
||||||
|
# Appendices
|
||||||
|
|
||||||
|
## DAX
|
||||||
|
|
||||||
|
Kata Containers utilizes the Linux kernel DAX
|
||||||
|
[(Direct Access filesystem)](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/dax.rst?h=v5.14)
|
||||||
|
feature to efficiently map the [guest image](guest-assets.md#guest-image) in the
|
||||||
|
[host environment](#environments) into the
|
||||||
|
[guest VM environment](#environments) to become the VM's
|
||||||
|
[rootfs](background.md#root-filesystem).
|
||||||
|
|
||||||
|
If the [configured](#configuration) [hypervisor](#hypervisor) is set
|
||||||
|
to either QEMU or Cloud Hypervisor, DAX is used with the feature shown
|
||||||
|
in the table below:
|
||||||
|
|
||||||
|
| Hypervisor | Feature used | rootfs device type |
|
||||||
|
|-|-|-|
|
||||||
|
| Cloud Hypervisor (CH) | `dax` `FsConfig` configuration option | PMEM (emulated Persistent Memory device) |
|
||||||
|
| QEMU | NVDIMM memory device with a memory file backend | NVDIMM (emulated Non-Volatile Dual In-line Memory Module device) |
|
||||||
|
|
||||||
|
The features in the table above are equivalent in that they provide a memory-mapped
|
||||||
|
virtual device which is used to DAX map the VM's
|
||||||
|
[rootfs](background.md#root-filesystem) into the [VM guest](#environments) memory
|
||||||
|
address space.
|
||||||
|
|
||||||
|
The VM is then booted, specifying the `root=` kernel parameter to make
|
||||||
|
the [guest kernel](guest-assets.md#guest-kernel) use the appropriate emulated device
|
||||||
|
as its rootfs.
|
||||||
|
|
||||||
|
### DAX advantages
|
||||||
|
|
||||||
|
Mapping files using [DAX](#dax) provides a number of benefits over
|
||||||
|
more traditional VM file and device mapping mechanisms:
|
||||||
|
|
||||||
|
- Mapping as a direct access device allows the guest to directly
|
||||||
|
access the host memory pages (such as via Execute In Place (XIP)),
|
||||||
|
bypassing the [guest kernel](guest-assets.md#guest-kernel)'s page cache. This
|
||||||
|
zero copy provides both time and space optimizations.
|
||||||
|
|
||||||
|
- Mapping as a direct access device inside the VM allows pages from the
|
||||||
|
host to be demand loaded using page faults, rather than having to make requests
|
||||||
|
via a virtualized device (causing expensive VM exits/hypercalls), thus providing
|
||||||
|
a speed optimization.
|
||||||
|
|
||||||
|
- Utilizing `mmap(2)`'s `MAP_SHARED` shared memory option on the host
|
||||||
|
allows the host to efficiently share pages.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
For further details of the use of NVDIMM with QEMU, see the [QEMU
|
||||||
|
project documentation](https://www.qemu.org).
|
||||||
|
|
||||||
|
## Agent control tool
|
||||||
|
|
||||||
|
The [agent control tool](../../../src/tools/agent-ctl) is a test and
|
||||||
|
development tool that can be used to learn more about a Kata
|
||||||
|
Containers system.
|
||||||
|
|
||||||
|
## Terminology
|
||||||
|
|
||||||
|
See the [project glossary](../../../Glossary.md).
|
||||||
|
|
||||||
|
[debug-console]: ../../Developer-Guide.md#connect-to-debug-console
|
81
docs/design/architecture/background.md
Normal file
@ -0,0 +1,81 @@
|
|||||||
|
# Kata Containers architecture background knowledge
|
||||||
|
|
||||||
|
The following sections explain some of the background concepts
|
||||||
|
required to understand the [architecture document](README.md).
|
||||||
|
|
||||||
|
## Root filesystem
|
||||||
|
|
||||||
|
This document uses the term _rootfs_ to refer to a root filesystem
|
||||||
|
which is mounted as the top-level directory ("`/`") and often referred
|
||||||
|
to as _slash_.
|
||||||
|
|
||||||
|
It is important to understand this term since the overall system uses
|
||||||
|
multiple different rootfses (as explained in the
[Environments](README.md#environments) section).
|
||||||
|
|
||||||
|
## Container image
|
||||||
|
|
||||||
|
In the [example command](example-command.md) the user has specified the
|
||||||
|
type of container they wish to run via the container image name:
|
||||||
|
`ubuntu`. This image name corresponds to a _container image_ that can
|
||||||
|
be used to create a container with an Ubuntu Linux environment. Hence,
|
||||||
|
in our [example](example-command.md), the `sh(1)` command will be run
|
||||||
|
inside a container which has an Ubuntu rootfs.
|
||||||
|
|
||||||
|
> **Note:**
|
||||||
|
>
|
||||||
|
> The term _container image_ is confusing since the image in question
|
||||||
|
> is **not** a container: it is simply a set of files (_an image_)
|
||||||
|
> that can be used to _create_ a container. The term _container
|
||||||
|
> template_ would be more accurate but the term _container image_ is
|
||||||
|
> commonly used so this document uses the standard term.
|
||||||
|
|
||||||
|
For the purposes of this document, the most important part of the
|
||||||
|
[example command line](example-command.md) is the container image the
|
||||||
|
user has requested. Normally, the container manager will _pull_
|
||||||
|
(download) a container image from a remote site and store a copy
|
||||||
|
locally. This local container image is used by the container manager
|
||||||
|
to create an [OCI bundle](#oci-bundle) which will form the environment
|
||||||
|
the container will run in. After creating the OCI bundle, the
|
||||||
|
container manager launches a [runtime](README.md#runtime) which will create the
|
||||||
|
container using the provided OCI bundle.
|
||||||
|
|
||||||
|
## OCI bundle
|
||||||
|
|
||||||
|
To understand what follows, it is important to know at a high level
|
||||||
|
how an OCI ([Open Containers Initiative](https://opencontainers.org)) compatible container is created.
|
||||||
|
|
||||||
|
An OCI compatible container is created by taking a
|
||||||
|
[container image](#container-image) and converting the embedded rootfs
|
||||||
|
into an
|
||||||
|
[OCI rootfs bundle](https://github.com/opencontainers/runtime-spec/blob/main/bundle.md),
|
||||||
|
or more simply, an _OCI bundle_.
|
||||||
|
|
||||||
|
An OCI bundle is a `tar(1)` archive normally created by a container
|
||||||
|
manager, which is passed to an OCI [runtime](README.md#runtime) that converts
|
||||||
|
it into a full container rootfs. The bundle contains two assets:
|
||||||
|
|
||||||
|
- A container image [rootfs](#root-filesystem)
|
||||||
|
|
||||||
|
This is simply a directory of files that will be used to represent
|
||||||
|
the rootfs for the container.
|
||||||
|
|
||||||
|
For the [example command](example-command.md), the directory will
|
||||||
|
contain the files necessary to create a minimal Ubuntu root
|
||||||
|
filesystem.
|
||||||
|
|
||||||
|
- An [OCI configuration file](https://github.com/opencontainers/runtime-spec/blob/main/config.md)
|
||||||
|
|
||||||
|
This is a JSON file called `config.json`.
|
||||||
|
|
||||||
|
The container manager will create this file so that:
|
||||||
|
|
||||||
|
- The `root.path` value is set to the full path of the specified
|
||||||
|
container rootfs.
|
||||||
|
|
||||||
|
In [the example](example-command.md) this value will be `ubuntu`.
|
||||||
|
|
||||||
|
- The `process.args` array specifies the list of commands the user
|
||||||
|
wishes to run. This is known as the [workload](README.md#workload).
|
||||||
|
|
||||||
|
In [the example](example-command.md) the workload is `sh(1)`.
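
The bundle structure described above can be sketched as follows (the paths and values follow the example; the `config.json` is a simplified illustration, not a complete OCI configuration):

```bash
# Create a minimal OCI bundle layout: a rootfs directory plus a
# config.json whose root.path points at it and whose process.args
# holds the workload command.
mkdir -p bundle/rootfs

cat > bundle/config.json <<'EOF'
{
  "ociVersion": "1.0.2",
  "root": { "path": "rootfs" },
  "process": { "args": [ "sh" ] }
}
EOF

ls bundle
```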
|
30
docs/design/architecture/example-command.md
Normal file
@ -0,0 +1,30 @@
|
|||||||
|
# Example command
|
||||||
|
|
||||||
|
The following containerd command creates a container. It is referred
|
||||||
|
to throughout the architecture document to help explain various points:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
$ sudo ctr run --runtime "io.containerd.kata.v2" --rm -t "quay.io/libpod/ubuntu:latest" foo sh
|
||||||
|
```
|
||||||
|
|
||||||
|
This command requests that containerd:
|
||||||
|
|
||||||
|
- Create a container (`ctr run`).
|
||||||
|
- Use the Kata [shimv2](README.md#shim-v2-architecture) runtime (`--runtime "io.containerd.kata.v2"`).
|
||||||
|
- Delete the container when it [exits](README.md#workload-exit) (`--rm`).
|
||||||
|
- Attach the container to the user's terminal (`-t`).
|
||||||
|
- Use the Ubuntu Linux [container image](background.md#container-image)
|
||||||
|
to create the container [rootfs](background.md#root-filesystem) that will become
|
||||||
|
the [container environment](README.md#environments)
|
||||||
|
(`quay.io/libpod/ubuntu:latest`).
|
||||||
|
- Create the container with the name "`foo`".
|
||||||
|
- Run the `sh(1)` command in the Ubuntu rootfs based container
|
||||||
|
environment.
|
||||||
|
|
||||||
|
The command specified here is referred to as the [workload](README.md#workload).
|
||||||
|
|
||||||
|
> **Note:**
|
||||||
|
>
|
||||||
|
> For the purposes of this document and to keep explanations
|
||||||
|
> simpler, we assume the user is running this command in the
|
||||||
|
> [host environment](README.md#environments).
|
150
docs/design/architecture/guest-assets.md
Normal file
@ -0,0 +1,150 @@
|
|||||||
|
# Guest assets
|
||||||
|
|
||||||
|
Kata Containers creates a VM in which to run one or more containers.
|
||||||
|
It does this by launching a [hypervisor](README.md#hypervisor) to
|
||||||
|
create the VM. The hypervisor needs two assets for this task: a Linux
|
||||||
|
kernel and a small root filesystem image to boot the VM.
|
||||||
|
|
||||||
|
## Guest kernel
|
||||||
|
|
||||||
|
The [guest kernel](../../../tools/packaging/kernel)
|
||||||
|
is passed to the hypervisor and used to boot the VM.
|
||||||
|
The default kernel provided in Kata Containers is highly optimized for
|
||||||
|
kernel boot time and minimal memory footprint, providing only those
|
||||||
|
services required by a container workload. It is based on the latest
|
||||||
|
Linux LTS (Long Term Support) [kernel](https://www.kernel.org).
|
||||||
|
|
||||||
|
## Guest image
|
||||||
|
|
||||||
|
The hypervisor uses an image file which provides a minimal root
|
||||||
|
filesystem used by the guest kernel to boot the VM and host the Kata
|
||||||
|
Container. Kata Containers supports both initrd and rootfs based
|
||||||
|
minimal guest images. The [default packages](../../install/) provide both
|
||||||
|
an image and an initrd, each created using the
|
||||||
|
[`osbuilder`](../../../tools/osbuilder) tool.
|
||||||
|
|
||||||
|
> **Notes:**
|
||||||
|
>
|
||||||
|
> - Although initrd and rootfs based images are supported, not all
|
||||||
|
> [hypervisors](README.md#hypervisor) support both types of image.
|
||||||
|
>
|
||||||
|
> - The guest image is *unrelated* to the image used in a container
|
||||||
|
> workload.
|
||||||
|
>
|
||||||
|
> For example, if a user creates a container that runs a shell in a
|
||||||
|
> BusyBox image, they will run that shell in a BusyBox environment.
|
||||||
|
> However, the guest image running inside the VM that is used to
|
||||||
|
> *host* that BusyBox image could be running Clear Linux, Ubuntu,
|
||||||
|
> Fedora, or potentially any other distribution.
|
||||||
|
>
|
||||||
|
> The `osbuilder` tool provides
|
||||||
|
> [configurations for various common Linux distributions](../../../tools/osbuilder/rootfs-builder)
|
||||||
|
> which can be built into either initrd or rootfs guest images.
|
||||||
|
>
|
||||||
|
> - If you are using a [packaged version of Kata
|
||||||
|
> Containers](../../install), you can see image details by running the
|
||||||
|
> [`kata-collect-data.sh`](../../../src/runtime/data/kata-collect-data.sh.in)
|
||||||
|
> script as `root` and looking at the "Image details" section of the
|
||||||
|
> output.
|
||||||
|
|
||||||
|
### Root filesystem image
|
||||||
|
|
||||||
|
The default packaged rootfs image, sometimes referred to as the _mini
|
||||||
|
O/S_, is a highly optimized container bootstrap system.
|
||||||
|
|
||||||
|
If this image type is [configured](README.md#configuration), when the
|
||||||
|
user runs the [example command](example-command.md):
|
||||||
|
|
||||||
|
- The [runtime](README.md#runtime) will launch the configured [hypervisor](README.md#hypervisor).
|
||||||
|
- The hypervisor will boot the mini-OS image using the [guest kernel](#guest-kernel).
|
||||||
|
- The kernel will start the init daemon as PID 1 (`systemd`) inside the VM root environment.
|
||||||
|
- `systemd`, running inside the mini-OS context, will launch the [agent](README.md#agent)
|
||||||
|
in the root context of the VM.
|
||||||
|
- The agent will create a new container environment, setting its root
|
||||||
|
filesystem to that requested by the user (Ubuntu in [the example](example-command.md)).
|
||||||
|
- The agent will then execute the command (`sh(1)` in [the example](example-command.md))
|
||||||
|
inside the new container.
|
||||||
|
|
||||||
|
The table below summarises the default mini O/S showing the
|
||||||
|
environments that are created, the services running in those
|
||||||
|
environments (for all platforms) and the root filesystem used by
|
||||||
|
each service:
|
||||||
|
|
||||||
|
| Process | Environment | systemd service? | rootfs | User accessible | Notes |
|
||||||
|
|-|-|-|-|-|-|
|
||||||
|
| systemd | VM root | n/a | [VM guest image](#guest-image)| [debug console][debug-console] | The init daemon, running as PID 1 |
|
||||||
|
| [Agent](README.md#agent) | VM root | yes | [VM guest image](#guest-image)| [debug console][debug-console] | Runs as a systemd service |
|
||||||
|
| `chronyd` | VM root | yes | [VM guest image](#guest-image)| [debug console][debug-console] | Used to synchronise the time with the host |
|
||||||
|
| container workload (`sh(1)` in [the example](example-command.md)) | VM container | no | User specified (Ubuntu in [the example](example-command.md)) | [exec command](README.md#exec-command) | Managed by the agent |
|
||||||
|
|
||||||
|
See also the [process overview](README.md#process-overview).
|
||||||
|
|
||||||
|
> **Notes:**
|
||||||
|
>
|
||||||
|
> - The "User accessible" column shows how an administrator can access
|
||||||
|
> the environment.
|
||||||
|
>
|
||||||
|
> - The container workload is running inside a full container
|
||||||
|
> environment which itself is running within a VM environment.
|
||||||
|
>
|
||||||
|
> - See the [configuration files for the `osbuilder` tool](../../../tools/osbuilder/rootfs-builder)
|
||||||
|
> for details of the default distribution for platforms other than
|
||||||
|
> Intel x86_64.
|
||||||
|
|
||||||
|
### Initrd image
|
||||||
|
|
||||||
|
The initrd image is a compressed `cpio(1)` archive, created from a
|
||||||
|
rootfs which is loaded into memory and used as part of the Linux
|
||||||
|
startup process. During startup, the kernel unpacks it into a special
|
||||||
|
instance of a `tmpfs` mount that becomes the initial root filesystem.
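
The packaging format can be sketched with standard tools (a simplified illustration; real Kata initrd images are produced by the `osbuilder` tool):

```bash
# Build a tiny cpio "newc" archive from a rootfs directory and compress
# it, mirroring the structure of an initrd image.
mkdir -p rootfs/bin
printf 'demo' > rootfs/bin/marker

( cd rootfs && find . | cpio -o -H newc ) | gzip > initrd.img.gz
```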
|
||||||
|
|
||||||
|
If this image type is [configured](README.md#configuration), when the user runs
|
||||||
|
the [example command](example-command.md):
|
||||||
|
|
||||||
|
- The [runtime](README.md#runtime) will launch the configured [hypervisor](README.md#hypervisor).
|
||||||
|
- The hypervisor will boot the initrd image using the [guest kernel](#guest-kernel).
|
||||||
|
- The kernel will start the init daemon as PID 1 (the
|
||||||
|
[agent](README.md#agent))
|
||||||
|
inside the VM root environment.
|
||||||
|
- The [agent](README.md#agent) will create a new container environment, setting its root
|
||||||
|
filesystem to that requested by the user (`ubuntu` in
|
||||||
|
[the example](example-command.md)).
|
||||||
|
- The agent will then execute the command (`sh(1)` in [the example](example-command.md))
|
||||||
|
inside the new container.
|
||||||
|
|
||||||
|
The table below summarises the default initrd image showing the environments that are created,
|
||||||
|
the processes running in those environments (for all platforms) and
|
||||||
|
the root filesystem used by each service:
|
||||||
|
|
||||||
|
| Process | Environment | rootfs | User accessible | Notes |
|
||||||
|
|-|-|-|-|-|
|
||||||
|
| [Agent](README.md#agent) | VM root | [VM guest image](#guest-image) | [debug console][debug-console] | Runs as the init daemon (PID 1) |
|
||||||
|
| container workload | VM container | User specified (Ubuntu in this example) | [exec command](README.md#exec-command) | Managed by the agent |
|
||||||
|
|
||||||
|
> **Notes:**
|
||||||
|
>
|
||||||
|
> - The "User accessible" column shows how an administrator can access
|
||||||
|
> the environment.
|
||||||
|
>
|
||||||
|
> - It is possible to use a standard init daemon such as systemd with
|
||||||
|
> an initrd image if this is desirable.
|
||||||
|
|
||||||
|
See also the [process overview](README.md#process-overview).
|
||||||
|
|
||||||
|
### Image summary
|
||||||
|
|
||||||
|
| Image type | Default distro | Init daemon | Reason | Notes |
|
||||||
|
|-|-|-|-|-|
|
||||||
|
| [image](#root-filesystem-image) | [Clear Linux](https://clearlinux.org) (for x86_64 systems) | systemd | Minimal and highly optimized | systemd offers flexibility |
|
||||||
|
| [initrd](#initrd-image) | [Alpine Linux](https://alpinelinux.org) | Kata [agent](README.md#agent) (as no systemd support) | Security hardened and tiny C library | |
|
||||||
|
|
||||||
|
See also:
|
||||||
|
|
||||||
|
- The [osbuilder](../../../tools/osbuilder) tool
|
||||||
|
|
||||||
|
This is used to build all default image types.
|
||||||
|
|
||||||
|
- The [versions database](../../../versions.yaml)
|
||||||
|
|
||||||
|
The `default-image-name` and `default-initrd-name` options specify
|
||||||
|
the default distributions for each image type.
|
41
docs/design/architecture/history.md
Normal file
@ -0,0 +1,41 @@
|
|||||||
|
# History
|
||||||
|
|
||||||
|
## Kata 1.x architecture
|
||||||
|
|
||||||
|
In the old [Kata 1.x architecture](https://github.com/kata-containers/documentation/blob/master/design/architecture.md),
|
||||||
|
the Kata [runtime](README.md#runtime) was an executable called `kata-runtime`.
|
||||||
|
The container manager called this executable multiple times when
|
||||||
|
creating each container. Each time the runtime was called, a different
|
||||||
|
OCI command-line verb was provided. This architecture was simple, but
|
||||||
|
not well suited to creating VM based containers due to the issue of
|
||||||
|
handling state between calls. Additionally, the architecture suffered
|
||||||
|
from performance issues related to continually having to spawn new
|
||||||
|
instances of the runtime binary, and
|
||||||
|
[Kata shim](https://github.com/kata-containers/shim) and
|
||||||
|
[Kata proxy](https://github.com/kata-containers/proxy) processes for systems
|
||||||
|
that did not provide VSOCK.
|
||||||
|
|
||||||
|
## Kata 2.x architecture
|
||||||
|
|
||||||
|
See the ["shimv2"](README.md#shim-v2-architecture) section of the
|
||||||
|
architecture document.
|
||||||
|
|
||||||
|
## Architectural comparison
|
||||||
|
|
||||||
|
| Kata version | Kata Runtime process calls | Kata shim processes | Kata proxy processes (if no VSOCK) |
|
||||||
|
|-|-|-|-|
|
||||||
|
| 1.x | multiple per container | 1 per container connection | 1 |
|
||||||
|
| 2.x | 1 per VM (hosting any number of containers) | 0 | 0 |
|
||||||
|
|
||||||
|
> **Notes:**
|
||||||
|
>
|
||||||
|
> - A single VM can host one or more containers.
|
||||||
|
>
|
||||||
|
> - The "Kata shim processes" column refers to the old
|
||||||
|
> [Kata shim](https://github.com/kata-containers/shim) (`kata-shim` binary),
|
||||||
|
> *not* the new shimv2 runtime instance (`containerd-shim-kata-v2` binary).
|
||||||
|
|
||||||
|
The diagram below shows how the original architecture was simplified
|
||||||
|
with the advent of shimv2.
|
||||||
|
|
||||||
|

|
35
docs/design/architecture/kubernetes.md
Normal file
@ -0,0 +1,35 @@
|
|||||||
|
# Kubernetes support
|
||||||
|
|
||||||
|
[Kubernetes](https://github.com/kubernetes/kubernetes/), or K8s, is a popular open source
|
||||||
|
container orchestration engine. In Kubernetes, a set of containers sharing resources
|
||||||
|
such as networking, storage, mount, PID, etc. is called a
|
||||||
|
[pod](https://kubernetes.io/docs/user-guide/pods/).
|
||||||
|
|
||||||
|
A node can have multiple pods, but at a minimum, a node within a Kubernetes cluster
|
||||||
|
only needs to run a container runtime and a container agent (called a
|
||||||
|
[Kubelet](https://kubernetes.io/docs/admin/kubelet/)).
|
||||||
|
|
||||||
|
Kata Containers represents a Kubernetes pod as a VM.
|
||||||
|
|
||||||
|
A Kubernetes cluster runs a control plane where a scheduler (typically
|
||||||
|
running on a dedicated master node) calls into a compute Kubelet. This
|
||||||
|
Kubelet instance is responsible for managing the lifecycle of pods
|
||||||
|
within the nodes and eventually relies on a container runtime to
|
||||||
|
handle execution. The Kubelet architecture decouples lifecycle
|
||||||
|
management from container execution through a dedicated gRPC based
|
||||||
|
[Container Runtime Interface (CRI)](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/container-runtime-interface-v1.md).
|
||||||
|
|
||||||
|
In other words, a Kubelet is a CRI client and expects a CRI
|
||||||
|
implementation to handle the server side of the interface.
|
||||||
|
[CRI-O](https://github.com/kubernetes-incubator/cri-o) and
|
||||||
|
[containerd](https://github.com/containerd/containerd/) are CRI
|
||||||
|
implementations that rely on
|
||||||
|
[OCI](https://github.com/opencontainers/runtime-spec) compatible
|
||||||
|
runtimes for managing container instances.
|
||||||
|
|
||||||
|
Kata Containers is an officially supported CRI-O and containerd
|
||||||
|
runtime. Refer to the following guides on how to set up Kata
|
||||||
|
Containers with Kubernetes:
|
||||||
|
|
||||||
|
- [How to use Kata Containers and containerd](../../how-to/containerd-kata.md)
|
||||||
|
- [Run Kata Containers with Kubernetes](../../how-to/run-kata-with-k8s.md)
|
48
docs/design/architecture/networking.md
Normal file
@ -0,0 +1,48 @@
|
|||||||
|
# Networking
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
Containers will typically live in their own, possibly shared, networking namespace.
|
||||||
|
At some point in a container lifecycle, container engines will set up that namespace
|
||||||
|
to add the container to a network which is isolated from the host network, but
|
||||||
|
which is shared between containers.
|
||||||
|
|
||||||
|
In order to do so, container engines will usually add one end of a virtual
|
||||||
|
ethernet (`veth`) pair into the container networking namespace. The other end of
|
||||||
|
the `veth` pair is added to the host networking namespace.
|
||||||
|
|
||||||
|
This is a very namespace-centric approach as many hypervisors or VM
|
||||||
|
Managers (VMMs) such as `virt-manager` cannot handle `veth`
|
||||||
|
interfaces. Typically, `TAP` interfaces are created for VM
|
||||||
|
connectivity.
|
||||||
|
|
||||||
|
To overcome this incompatibility between typical container engine expectations
|
||||||
|
and virtual machines, Kata Containers networking transparently connects `veth`
|
||||||
|
interfaces with `TAP` ones using Traffic Control:
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
With a TC filter in place, a redirection is created between the container network and the
|
||||||
|
virtual machine. As an example, the CNI may create a device, `eth0`, in the container's network
|
||||||
|
namespace, which is a `veth` device. Kata Containers will create a TAP device for the VM, `tap0_kata`,
|
||||||
|
and set up a TC redirection filter to mirror traffic from `eth0`'s ingress to `tap0_kata`'s egress,
|
||||||
|
and a second to mirror traffic from `tap0_kata`'s ingress to `eth0`'s egress.
|
||||||
|
|
||||||
|
Kata Containers maintains support for MACVTAP, which was an earlier implementation used in Kata. TC-filter
|
||||||
|
is the default because it allows for simpler configuration, better CNI plugin compatibility, and performance
|
||||||
|
on par with MACVTAP.
|
||||||
|
|
||||||
|
Kata Containers has deprecated support for bridged networking due to its poor performance relative to TC-filter and MACVTAP.
|
||||||
|
|
||||||
|
Kata Containers supports both
|
||||||
|
[CNM](https://github.com/docker/libnetwork/blob/master/docs/design.md#the-container-network-model)
|
||||||
|
and [CNI](https://github.com/containernetworking/cni) for networking management.
|
||||||
|
|
||||||
|
## Network Hotplug
|
||||||
|
|
||||||
|
Kata Containers has developed a set of network sub-commands and APIs to add, list and
|
||||||
|
remove a guest network endpoint and to manipulate the guest route table.
|
||||||
|
|
||||||
|
The following diagram illustrates the Kata Containers network hotplug workflow.
|
||||||
|
|
||||||
|

|
44
docs/design/architecture/storage.md
Normal file
@ -0,0 +1,44 @@
|
|||||||
|
# Storage
|
||||||
|
|
||||||
|
## virtio SCSI
|
||||||
|
|
||||||
|
If a block-based graph driver is [configured](README.md#configuration),
|
||||||
|
`virtio-scsi` is used to _share_ the workload image (such as
|
||||||
|
`busybox:latest`) into the container's environment inside the VM.
|
||||||
|
|
||||||
|
## virtio FS
|
||||||
|
|
||||||
|
If a block-based graph driver is _not_ [configured](README.md#configuration), a
|
||||||
|
[`virtio-fs`](https://virtio-fs.gitlab.io) (`VIRTIO`) overlay
|
||||||
|
filesystem mount point is used to _share_ the workload image instead. The
|
||||||
|
[agent](README.md#agent) uses this mount point as the root filesystem for the
|
||||||
|
container processes.
|
||||||
|
|
||||||
|
For virtio-fs, the [runtime](README.md#runtime) starts one `virtiofsd` daemon
|
||||||
|
(that runs in the host context) for each VM created.
|
||||||
|
|
||||||
|
## Devicemapper
|
||||||
|
|
||||||
|
The
|
||||||
|
[devicemapper `snapshotter`](https://github.com/containerd/containerd/tree/master/snapshots/devmapper)
|
||||||
|
is a special case. The `snapshotter` uses dedicated block devices
|
||||||
|
rather than formatted filesystems, and operates at the block level
|
||||||
|
rather than the file level. This allows the underlying block device,
rather than the overlay filesystem, to be used directly as the
container root filesystem. The block device maps to the top
|
||||||
|
read-write layer for the overlay. This approach gives much better I/O
|
||||||
|
performance compared to using `virtio-fs` to share the container file
|
||||||
|
system.

### Hot plug and unplug

Kata Containers has the ability to hot plug and hot unplug block
devices. This makes it possible to use block devices for containers
started after the VM has been launched.

Users can check whether the container uses the `devicemapper` block
device as its rootfs by calling `mount(8)` within the container. If
the `devicemapper` block device is used, the root filesystem (`/`)
will be mounted from `/dev/vda`. Users can disable direct mounting of
the underlying block device through the runtime
[configuration](README.md#configuration).
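
The `mount(8)` check above can be scripted. The following sketch parses a
`/proc/mounts`-style line for the root mount; the `kataShared` tag shown for
the virtio-fs case is illustrative and may differ between releases:

```shell
# Print the source device of the root mount, given a /proc/mounts-style
# line. With direct block device mounting enabled the source is /dev/vda;
# with virtio-fs it is a filesystem share tag instead.
rootfs_device() {
    echo "$1" | awk '$2 == "/" { print $1 }'
}

rootfs_device "/dev/vda / ext4 rw,relatime 0 0"          # prints: /dev/vda
rootfs_device "kataShared / virtiofs rw,relatime 0 0"    # prints: kataShared
```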

@@ -41,7 +41,7 @@ Kata Containers with QEMU has complete compatibility with Kubernetes.
Depending on the host architecture, Kata Containers supports various machine types,
for example `pc` and `q35` on x86 systems, `virt` on ARM systems and `pseries` on IBM Power systems. The default Kata Containers
machine type is `pc`. The machine type and its [`Machine accelerators`](#machine-accelerators) can
be changed by editing the runtime [`configuration`](./architecture.md/#configuration) file.
be changed by editing the runtime [`configuration`](architecture/README.md#configuration) file.
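
For illustration, the machine type setting can be read straight out of the
TOML configuration file. The file path and the `q35` value below are made-up
examples; real installs typically keep the file under `/etc/kata-containers/`
or `/usr/share/defaults/kata-containers/`:

```shell
# Extract the machine_type value from a Kata runtime configuration file.
get_machine_type() {
    sed -n 's/^machine_type *= *"\(.*\)"/\1/p' "$1"
}

# Write a minimal, hypothetical config fragment to demonstrate the parse:
cat > /tmp/kata-config-example.toml <<'EOF'
[hypervisor.qemu]
machine_type = "q35"
EOF

get_machine_type /tmp/kata-config-example.toml    # prints: q35
```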

Devices and features used:
- virtio VSOCK or virtio serial
@@ -76,7 +76,7 @@ then a new configuration file can be [created](#configure-kata-containers)
and [configured][7].

[1]: https://docs.snapcraft.io/snaps/intro
[2]: ../docs/design/architecture.md#root-filesystem-image
[2]: ../docs/design/architecture/README.md#root-filesystem-image
[3]: https://docs.snapcraft.io/reference/confinement#classic
[4]: https://github.com/kata-containers/runtime#configuration
[5]: https://docs.docker.com/engine/reference/commandline/dockerd
@@ -6,14 +6,14 @@ The Kata agent is a long running process that runs inside the Virtual Machine
(VM) (also known as the "pod" or "sandbox").

The agent is packaged inside the Kata Containers
[guest image](../../docs/design/architecture.md#guest-image)
[guest image](../../docs/design/architecture/README.md#guest-image)
which is used to boot the VM. Once the runtime has launched the configured
[hypervisor](../../docs/hypervisors.md) to create a new VM, the agent is
started. From this point on, the agent is responsible for creating and
managing the life cycle of the containers inside the VM.

For further details, see the
[architecture document](../../docs/design/architecture.md).
[architecture document](../../docs/design/architecture).

## Audience

@@ -70,7 +70,7 @@ See the

## Architecture overview

See the [architecture overview](../../docs/design/architecture.md)
See the [architecture overview](../../docs/design/architecture)
for details on the Kata Containers design.

## Configuration

@@ -135,7 +135,7 @@ There are three drawbacks about using CNM instead of CNI

# Storage

See [Kata Containers Architecture](../../../docs/design/architecture.md#storage).
See [Kata Containers Architecture](../../../docs/design/architecture/README.md#storage).

# Devices
