mirror of
https://github.com/kata-containers/kata-containers.git
synced 2025-07-31 23:36:12 +00:00
Merge pull request #3287 from jodh-intel/docs-split-arch-doc
Split architecture doc into separate files
This commit is contained in:
commit
2ebae2d279
@ -41,7 +41,7 @@ Documents that help to understand and contribute to Kata Containers.
|
||||
|
||||
### Design and Implementations
|
||||
|
||||
* [Kata Containers Architecture](design/architecture.md): Architectural overview of Kata Containers
|
||||
* [Kata Containers Architecture](design/architecture): Architectural overview of Kata Containers
|
||||
* [Kata Containers E2E Flow](design/end-to-end-flow.md): The entire end-to-end flow of Kata Containers
|
||||
* [Kata Containers design](./design/README.md): More Kata Containers design documents
|
||||
* [Kata Containers threat model](./threat-model/threat-model.md): Kata Containers threat model
|
||||
|
@ -114,7 +114,7 @@ with containerd.
|
||||
> kernel or image.
|
||||
|
||||
If you are using custom
|
||||
[guest assets](design/architecture.md#guest-assets),
|
||||
[guest assets](design/architecture/README.md#guest-assets),
|
||||
you must upgrade them to work with Kata Containers 2.x since Kata
|
||||
Containers 1.x assets will **not** work.
|
||||
|
||||
|
@ -2,7 +2,7 @@
|
||||
|
||||
Kata Containers design documents:
|
||||
|
||||
- [Kata Containers architecture](architecture.md)
|
||||
- [Kata Containers architecture](architecture)
|
||||
- [API Design of Kata Containers](kata-api-design.md)
|
||||
- [Design requirements for Kata Containers](kata-design-requirements.md)
|
||||
- [VSocks](VSocks.md)
|
||||
|
@ -1,864 +0,0 @@
|
||||
# Kata Containers Architecture
|
||||
|
||||
## Overview
|
||||
|
||||
Kata Containers is an open source community working to build a secure
|
||||
container [runtime](#runtime) with lightweight virtual machines (VM's)
|
||||
that feel and perform like standard Linux containers, but provide
|
||||
stronger [workload](#workload) isolation using hardware
|
||||
[virtualization](#virtualization) technology as a second layer of
|
||||
defence.
|
||||
|
||||
Kata Containers runs on [multiple architectures](../../src/runtime/README.md#platform-support)
|
||||
and supports [multiple hypervisors](../hypervisors.md).
|
||||
|
||||
This document is a summary of the Kata Containers architecture.
|
||||
|
||||
## Virtualization
|
||||
|
||||
For details on how Kata Containers maps container concepts to VM
|
||||
technologies, and how this is realized in the multiple hypervisors and
|
||||
VMMs that Kata supports see the
|
||||
[virtualization documentation](./virtualization.md).
|
||||
|
||||
## Compatibility
|
||||
|
||||
The [Kata Containers runtime](../../src/runtime) is compatible with
|
||||
the [OCI](https://github.com/opencontainers)
|
||||
[runtime specification](https://github.com/opencontainers/runtime-spec)
|
||||
and therefore works seamlessly with the
|
||||
[Kubernetes Container Runtime Interface (CRI)](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-node/container-runtime-interface.md)
|
||||
through the [CRI-O](https://github.com/kubernetes-incubator/cri-o)
|
||||
and [containerd](https://github.com/containerd/containerd)
|
||||
implementations.
|
||||
|
||||
Kata Containers provides a ["shimv2"](#shim-v2-architecture) compatible runtime.
|
||||
|
||||
## Shim v2 architecture
|
||||
|
||||
The Kata Containers runtime is shim v2 ("shimv2") compatible. This
|
||||
section explains what this means.
|
||||
|
||||
### History
|
||||
|
||||
In the old [Kata 1.x architecture](https://github.com/kata-containers/documentation/blob/master/design/architecture.md),
|
||||
the Kata [runtime](#runtime) was an executable called `kata-runtime`.
|
||||
The container manager called this executable multiple times when
|
||||
creating each container. Each time the runtime was called a different
|
||||
OCI command-line verb was provided. This architecture was simple, but
|
||||
not well suited to creating VM based containers due to the issue of
|
||||
handling state between calls. Additionally, the architecture suffered
|
||||
from performance issues related to continually having to spawn new
|
||||
instances of the runtime binary, and
|
||||
[Kata shim](https://github.com/kata-containers/shim) and
|
||||
[Kata proxy](https://github.com/kata-containers/proxy) processes for systems
|
||||
that did not provide VSOCK.
|
||||
|
||||
### An improved architecture
|
||||
|
||||
The
|
||||
[containerd runtime shimv2 architecture](https://github.com/containerd/containerd/tree/main/runtime/v2)
|
||||
or _shim API_ architecture resolves the issues with the old
|
||||
architecture by defining a set of shimv2 APIs that a compatible
|
||||
runtime implementation must supply. Rather than calling the runtime
|
||||
binary multiple times for each new container, the shimv2 architecture
|
||||
runs a single instance of the runtime binary (for any number of
|
||||
containers). This improves performance and resolves the state handling
|
||||
issue.
|
||||
|
||||
The shimv2 API is similar to the
|
||||
[OCI runtime](https://github.com/opencontainers/runtime-spec)
|
||||
API in terms of the way the container lifecycle is split into
|
||||
different verbs. Rather than calling the runtime multiple times, the
|
||||
container manager creates a socket and passes it to the shimv2
|
||||
runtime. The socket is a bi-directional communication channel that
|
||||
uses a gRPC based protocol to allow the container manager to send API
|
||||
calls to the runtime, which returns the result to the container
|
||||
manager using the same channel.
|
||||
|
||||
The shimv2 architecture allows running several containers per VM to
|
||||
support container engines that require multiple containers running
|
||||
inside a pod.
|
||||
|
||||
With the new architecture [Kubernetes](#kubernetes-support) can
|
||||
launch both Pod and OCI compatible containers with a single
|
||||
[runtime](#runtime) shim per Pod, rather than `2N+1` shims. No stand
|
||||
alone `kata-proxy` process is required, even if VSOCK is not
|
||||
available.
|
||||
|
||||
### Architectural comparison
|
||||
|
||||
| Kata version | Kata Runtime process calls | Kata shim processes | Kata proxy processes (if no VSOCK) |
|
||||
|-|-|-|-|
|
||||
| 1.x | multiple per container | 1 per container connection | 1 |
|
||||
| 2.x | 1 per VM (hosting any number of containers) | 0 | 0 |
|
||||
|
||||
> **Notes:**
|
||||
>
|
||||
> - A single VM can host one or more containers.
|
||||
>
|
||||
> - The "Kata shim processes" column refers to the old
|
||||
> [Kata shim](https://github.com/kata-containers/shim) (`kata-shim` binary),
|
||||
> *not* the new shimv2 runtime instance (`containerd-shim-kata-v2` binary).
|
||||
|
||||
The diagram below shows how the original architecture was simplified
|
||||
with the advent of shimv2.
|
||||
|
||||

|
||||
|
||||
## Root filesystem
|
||||
|
||||
This document uses the term _rootfs_ to refer to a root filesystem
|
||||
which is mounted as the top-level directory ("`/`") and often referred
|
||||
to as _slash_.
|
||||
|
||||
It is important to understand this term since the overall system uses
|
||||
multiple different rootfs's (as explained in the
|
||||
[Environments](#environments) section.
|
||||
|
||||
## Example command
|
||||
|
||||
The following containerd command creates a container. It is referred
|
||||
to throughout this document to help explain various points:
|
||||
|
||||
```bash
|
||||
$ sudo ctr run --runtime "io.containerd.kata.v2" --rm -t "quay.io/libpod/ubuntu:latest" foo sh
|
||||
```
|
||||
|
||||
This command requests that containerd:
|
||||
|
||||
- Create a container (`ctr run`).
|
||||
- Use the Kata [shimv2](#shim-v2-architecture) runtime (`--runtime "io.containerd.kata.v2"`).
|
||||
- Delete the container when it [exits](#workload-exit) (`--rm`).
|
||||
- Attach the container to the user's terminal (`-t`).
|
||||
- Use the Ubuntu Linux [container image](#container-image)
|
||||
to create the container [rootfs](#root-filesystem) that will become
|
||||
the [container environment](#environments)
|
||||
(`quay.io/libpod/ubuntu:latest`).
|
||||
- Create the container with the name "`foo`".
|
||||
- Run the `sh(1)` command in the Ubuntu rootfs based container
|
||||
environment.
|
||||
|
||||
The command specified here is referred to as the [workload](#workload).
|
||||
|
||||
> **Note:**
|
||||
>
|
||||
> For the purposes of this document and to keep explanations
|
||||
> simpler, we assume the user is running this command in the
|
||||
> [host environment](#environments).
|
||||
|
||||
## Container image
|
||||
|
||||
In the [example command](#example-command) the user has specified the
|
||||
type of container they wish to run via the container image name:
|
||||
`ubuntu`. This image name corresponds to a _container image_ that can
|
||||
be used to create a container with an Ubuntu Linux environment. Hence,
|
||||
in our [example](#example-command), the `sh(1)` command will be run
|
||||
inside a container which has an Ubuntu rootfs.
|
||||
|
||||
> **Note:**
|
||||
>
|
||||
> The term _container image_ is confusing since the image in question
|
||||
> is **not** a container: it is simply a set of files (_an image_)
|
||||
> that can be used to _create_ a container. The term _container
|
||||
> template_ would be more accurate but the term _container image_ is
|
||||
> commonly used so this document uses the standard term.
|
||||
|
||||
For the purposes of this document, the most important part of the
|
||||
[example command line](#example-command) is the container image the
|
||||
user has requested. Normally, the container manager will _pull_
|
||||
(download) a container image from a remote site and store a copy
|
||||
locally. This local container image is used by the container manager
|
||||
to create an [OCI bundle](#oci-bundle) which will form the environment
|
||||
the container will run in. After creating the OCI bundle, the
|
||||
container manager launches a [runtime](#runtime) which will create the
|
||||
container using the provided OCI bundle.
|
||||
|
||||
## OCI bundle
|
||||
|
||||
To understand what follows, it is important to know at a high level
|
||||
how an OCI ([Open Containers Initiative](https://opencontainers.org)) compatible container is created.
|
||||
|
||||
An OCI compatible container is created by taking a
|
||||
[container image](#container-image) and converting the embedded rootfs
|
||||
into an
|
||||
[OCI rootfs bundle](https://github.com/opencontainers/runtime-spec/blob/main/bundle.md),
|
||||
or more simply, an _OCI bundle_.
|
||||
|
||||
An OCI bundle is a `tar(1)` archive normally created by a container
|
||||
manager which is passed to an OCI [runtime](#runtime) which converts
|
||||
it into a full container rootfs. The bundle contains two assets:
|
||||
|
||||
- A container image [rootfs](#root-filesystem)
|
||||
|
||||
This is simply a directory of files that will be used to represent
|
||||
the rootfs for the container.
|
||||
|
||||
For the [example command](#example-command), the directory will
|
||||
contain the files necessary to create a minimal Ubuntu root
|
||||
filesystem.
|
||||
|
||||
- An [OCI configuration file](https://github.com/opencontainers/runtime-spec/blob/main/config.md)
|
||||
|
||||
This is a JSON file called `config.json`.
|
||||
|
||||
The container manager will create this file so that:
|
||||
|
||||
- The `root.path` value is set to the full path of the specified
|
||||
container rootfs.
|
||||
|
||||
In [the example](#example-command) this value will be `ubuntu`.
|
||||
|
||||
- The `process.args` array specifies the list of commands the user
|
||||
wishes to run. This is known as the [workload](#workload).
|
||||
|
||||
In [the example](#example-command) the workload is `sh(1)`.
|
||||
|
||||
## Workload
|
||||
|
||||
The workload is the command the user requested to run in the
|
||||
container and is specified in the [OCI bundle](#oci-bundle)'s
|
||||
configuration file.
|
||||
|
||||
In our [example](#example-command), the workload is the `sh(1)` command.
|
||||
|
||||
### Workload root filesystem
|
||||
|
||||
For details of how the [runtime](#runtime) makes the
|
||||
[container image](#container-image) chosen by the user available to
|
||||
the workload process, see the
|
||||
[Container creation](#container-creation) and [storage](#storage) sections.
|
||||
|
||||
Note that the workload is isolated from the [guest VM](#environments) environment by its
|
||||
surrounding [container environment](#environments). The guest VM
|
||||
environment where the container runs in is also isolated from the _outer_
|
||||
[host environment](#environments) where the container manager runs.
|
||||
|
||||
## System overview
|
||||
|
||||
### Environments
|
||||
|
||||
The following terminology is used to describe the different or
|
||||
environments (or contexts) various processes run in. It is necessary
|
||||
to study this table closely to make sense of what follows:
|
||||
|
||||
| Type | Name | Virtualized | Containerized | rootfs | Rootfs device type | Mount type | Description |
|
||||
|-|-|-|-|-|-|-|-|
|
||||
| Host | Host | no `[1]` | no | Host specific | Host specific | Host specific | The environment provided by a standard, physical non virtualized system. |
|
||||
| VM root | Guest VM | yes | no | rootfs inside the [guest image](#guest-image) | Hypervisor specific `[2]` | `ext4` | The first (or top) level VM environment created on a host system. |
|
||||
| VM container root | Container | yes | yes | rootfs type requested by user ([`ubuntu` in the example](#example-command)) | `kataShared` | [virtio FS](#virtio-fs) | The first (or top) level container environment created inside the VM. Based on the [OCI bundle](#oci-bundle). |
|
||||
|
||||
**Key:**
|
||||
|
||||
- `[1]`: For simplicity, this document assumes the host environment
|
||||
runs on physical hardware.
|
||||
|
||||
- `[2]`: See the [DAX](#dax) section.
|
||||
|
||||
> **Notes:**
|
||||
>
|
||||
> - The word "root" is used to mean _top level_ here in a similar
|
||||
> manner to the term [rootfs](#root-filesystem).
|
||||
>
|
||||
> - The term "first level" prefix used above is important since it implies
|
||||
> that it is possible to create multi level systems. However, they do
|
||||
> not form part of a standard Kata Containers environment so will not
|
||||
> be considered in this document.
|
||||
|
||||
The reasons for containerizing the [workload](#workload) inside the VM
|
||||
are:
|
||||
|
||||
- Isolates the workload entirely from the VM environment.
|
||||
- Provides better isolation between containers in a [pod](#kubernetes-support).
|
||||
- Allows the workload to be managed and monitored through its cgroup
|
||||
confinement.
|
||||
|
||||
### Container creation
|
||||
|
||||
The steps below show at a high level how a Kata Containers container is
|
||||
created using the containerd container manager:
|
||||
|
||||
1. The user requests the creation of a container by running a command
|
||||
like the [example command](#example-command).
|
||||
1. The container manager daemon runs a single instance of the Kata
|
||||
[runtime](#runtime).
|
||||
1. The Kata runtime loads its [configuration file](#configuration).
|
||||
1. The container manager calls a set of shimv2 API functions on the runtime.
|
||||
1. The Kata runtime launches the configured [hypervisor](#hypervisor).
|
||||
1. The hypervisor creates and starts (_boots_) a VM using the
|
||||
[guest assets](#guest-assets):
|
||||
|
||||
- The hypervisor [DAX](#dax) shares the [guest image](#guest-image)
|
||||
into the VM to become the VM [rootfs](#root-filesystem) (mounted on a `/dev/pmem*` device),
|
||||
which is known as the [VM root environment](#environments).
|
||||
- The hypervisor mounts the [OCI bundle](#oci-bundle), using [virtio FS](#virtio-fs),
|
||||
into a container specific directory inside the VM's rootfs.
|
||||
|
||||
This container specific directory will become the
|
||||
[container rootfs](#environments), known as the
|
||||
[container environment](#environments).
|
||||
|
||||
1. The [agent](#agent) is started as part of the VM boot.
|
||||
|
||||
1. The runtime calls the agent's `CreateSandbox` API to request the
|
||||
agent create a container:
|
||||
|
||||
1. The agent creates a [container environment](#environments)
|
||||
in the container specific directory that contains the [container rootfs](#environments).
|
||||
|
||||
The container environment hosts the [workload](#workload) in the
|
||||
[container rootfs](#environments) directory.
|
||||
|
||||
1. The agent spawns the workload inside the container environment.
|
||||
|
||||
> **Notes:**
|
||||
>
|
||||
> - The container environment created by the agent is equivalent to
|
||||
> a container environment created by the
|
||||
> [`runc`](https://github.com/opencontainers/runc) OCI runtime;
|
||||
> Linux cgroups and namespaces are created inside the VM by the
|
||||
> [guest kernel](#guest-kernel) to isolate the workload from the
|
||||
> VM environment the container is created in. See the
|
||||
> [Environments](#environments) section for an explanation of why
|
||||
> this is done.
|
||||
>
|
||||
> - See the [guest image](#guest-image) section for details of
|
||||
> exactly how the agent is started.
|
||||
|
||||
1. The container manager returns control of the container to the
|
||||
user running the `ctr` command.
|
||||
|
||||
> **Note:**
|
||||
>
|
||||
> At this point, the container is running and:
|
||||
>
|
||||
> - The [workload](#workload) process ([`sh(1)` in the example](#example-command))
|
||||
> is running in the [container environment](#environments).
|
||||
> - The user is now able to interact with the workload
|
||||
> (using the [`ctr` command in the example](#example-command)).
|
||||
> - The [agent](#agent), running inside the VM is monitoring the
|
||||
> [workload](#workload) process.
|
||||
> - The [runtime](#runtime) is waiting for the agent's `WaitProcess` API
|
||||
> call to complete.
|
||||
|
||||
Further details of these steps are provided in the sections below.
|
||||
|
||||
### Container shutdown
|
||||
|
||||
There are two possible ways for the container environment to be
|
||||
terminated:
|
||||
|
||||
- When the [workload](#workload) exits.
|
||||
|
||||
This is the standard, or _graceful_ shutdown method.
|
||||
|
||||
- When the container manager forces the container to be deleted.
|
||||
|
||||
#### Workload exit
|
||||
|
||||
The [agent](#agent) will detect when the [workload](#workload) process
|
||||
exits, capture its exit status (see `wait(2)`) and return that value
|
||||
to the [runtime](#runtime) by specifying it as the response to the
|
||||
`WaitProcess` agent API call made by the [runtime](#runtime).
|
||||
|
||||
The runtime then passes the value back to the container manager by the
|
||||
`Wait` [shimv2 API](#shim-v2-architecture) call.
|
||||
|
||||
Once the workload has fully exited, the VM is no longer needed and the
|
||||
runtime cleans up the environment (which includes terminating the
|
||||
[hypervisor](#hypervisor) process).
|
||||
|
||||
> **Note:**
|
||||
>
|
||||
> When [agent tracing is enabled](../tracing.md#agent-shutdown-behaviour),
|
||||
> the shutdown behaviour is different.
|
||||
|
||||
#### Container manager requested shutdown
|
||||
|
||||
If the container manager requests the container be deleted, the
|
||||
[runtime](#runtime) will signal the agent by sending it a
|
||||
`DestroySandbox` [ttRPC API](../../src/agent/protocols/protos/agent.proto) request.
|
||||
|
||||
## Guest assets
|
||||
|
||||
Kata Containers creates a VM in which to run one or more containers. It
|
||||
does this by launching a [hypervisor](#hypervisor) to create the VM.
|
||||
The hypervisor needs two assets for this task: a Linux kernel and a
|
||||
small root filesystem image to boot the VM.
|
||||
|
||||
### Guest kernel
|
||||
|
||||
The [guest kernel](../../tools/packaging/kernel)
|
||||
is passed to the hypervisor and used to boot the VM.
|
||||
The default kernel provided in Kata Containers is highly optimized for
|
||||
kernel boot time and minimal memory footprint, providing only those
|
||||
services required by a container workload. It is based on the latest
|
||||
Linux LTS (Long Term Support) [kernel](https://www.kernel.org).
|
||||
|
||||
### Guest image
|
||||
|
||||
The hypervisor uses an image file which provides a minimal root
|
||||
filesystem used by the guest kernel to boot the VM and host the Kata
|
||||
Container. Kata Containers supports both initrd and rootfs based
|
||||
minimal guest images. The [default packages](../install/) provide both
|
||||
an image and an initrd, both of which are created using the
|
||||
[`osbuilder`](../../tools/osbuilder) tool.
|
||||
|
||||
> **Notes:**
|
||||
>
|
||||
> - Although initrd and rootfs based images are supported, not all
|
||||
> [hypervisors](#hypervisor) support both types of image.
|
||||
>
|
||||
> - The guest image is *unrelated* to the image used in a container
|
||||
> workload.
|
||||
>
|
||||
> For example, if a user creates a container that runs a shell in a
|
||||
> BusyBox image, they will run that shell in a BusyBox environment.
|
||||
> However, the guest image running inside the VM that is used to
|
||||
> *host* that BusyBox image could be running Clear Linux, Ubuntu,
|
||||
> Fedora or any other distribution potentially.
|
||||
>
|
||||
> The `osbuilder` tool provides
|
||||
> [configurations for various common Linux distributions](../../tools/osbuilder/rootfs-builder)
|
||||
> which can be built into either initrd or rootfs guest images.
|
||||
>
|
||||
> - If you are using a [packaged version of Kata
|
||||
> Containers](../install), you can see image details by running the
|
||||
> [`kata-collect-data.sh`](../../src/runtime/data/kata-collect-data.sh.in)
|
||||
> script as `root` and looking at the "Image details" section of the
|
||||
> output.
|
||||
|
||||
#### Root filesystem image
|
||||
|
||||
The default packaged rootfs image, sometimes referred to as the _mini
|
||||
O/S_, is a highly optimized container bootstrap system.
|
||||
|
||||
If this image type is [configured](#configuration), when the user runs
|
||||
the [example command](#example-command):
|
||||
|
||||
- The [runtime](#runtime) will launch the configured [hypervisor](#hypervisor).
|
||||
- The hypervisor will boot the mini-OS image using the [guest kernel](#guest-kernel).
|
||||
- The kernel will start the init daemon as PID 1 (`systemd`) inside the VM root environment.
|
||||
- `systemd`, running inside the mini-OS context, will launch the [agent](#agent)
|
||||
in the root context of the VM.
|
||||
- The agent will create a new container environment, setting its root
|
||||
filesystem to that requested by the user (Ubuntu in [the example](#example-command)).
|
||||
- The agent will then execute the command (`sh(1)` in [the example](#example-command))
|
||||
inside the new container.
|
||||
|
||||
The table below summarises the default mini O/S showing the
|
||||
environments that are created, the services running in those
|
||||
environments (for all platforms) and the root filesystem used by
|
||||
each service:
|
||||
|
||||
| Process | Environment | systemd service? | rootfs | User accessible | Notes |
|
||||
|-|-|-|-|-|-|
|
||||
| systemd | VM root | n/a | [VM guest image](#guest-image)| [debug console][debug-console] | The init daemon, running as PID 1 |
|
||||
| [Agent](#agent) | VM root | yes | [VM guest image](#guest-image)| [debug console][debug-console] | Runs as a systemd service |
|
||||
| `chronyd` | VM root | yes | [VM guest image](#guest-image)| [debug console][debug-console] | Used to synchronise the time with the host |
|
||||
| container workload (`sh(1)` in [the example](#example-command)) | VM container | no | User specified (Ubuntu in [the example](#example-command)) | [exec command](#exec-command) | Managed by the agent |
|
||||
|
||||
See also the [process overview](#process-overview).
|
||||
|
||||
> **Notes:**
|
||||
>
|
||||
> - The "User accessible" column shows how an administrator can access
|
||||
> the environment.
|
||||
>
|
||||
> - The container workload is running inside a full container
|
||||
> environment which itself is running within a VM environment.
|
||||
>
|
||||
> - See the [configuration files for the `osbuilder` tool](../../tools/osbuilder/rootfs-builder)
|
||||
> for details of the default distribution for platforms other than
|
||||
> Intel x86_64.
|
||||
|
||||
#### Initrd image
|
||||
|
||||
The initrd image is a compressed `cpio(1)` archive, created from a
|
||||
rootfs which is loaded into memory and used as part of the Linux
|
||||
startup process. During startup, the kernel unpacks it into a special
|
||||
instance of a `tmpfs` mount that becomes the initial root filesystem.
|
||||
|
||||
If this image type is [configured](#configuration), when the user runs
|
||||
the [example command](#example-command):
|
||||
|
||||
- The [runtime](#runtime) will launch the configured [hypervisor](#hypervisor).
|
||||
- The hypervisor will boot the mini-OS image using the [guest kernel](#guest-kernel).
|
||||
- The kernel will start the init daemon as PID 1 (the [agent](#agent))
|
||||
inside the VM root environment.
|
||||
- The [agent](#agent) will create a new container environment, setting its root
|
||||
filesystem to that requested by the user (`ubuntu` in
|
||||
[the example](#example-command)).
|
||||
- The agent will then execute the command (`sh(1)` in [the example](#example-command))
|
||||
inside the new container.
|
||||
|
||||
The table below summarises the default mini O/S showing the environments that are created,
|
||||
the processes running in those environments (for all platforms) and
|
||||
the root filesystem used by each service:
|
||||
|
||||
| Process | Environment | rootfs | User accessible | Notes |
|
||||
|-|-|-|-|-|
|
||||
| [Agent](#agent) | VM root | [VM guest image](#guest-image) | [debug console][debug-console] | Runs as the init daemon (PID 1) |
|
||||
| container workload | VM container | User specified (Ubuntu in this example) | [exec command](#exec-command) | Managed by the agent |
|
||||
|
||||
> **Notes:**
|
||||
>
|
||||
> - The "User accessible" column shows how an administrator can access
|
||||
> the environment.
|
||||
>
|
||||
> - It is possible to use a standard init daemon such as systemd with
|
||||
> an initrd image if this is desirable.
|
||||
|
||||
See also the [process overview](#process-overview).
|
||||
|
||||
#### Image summary
|
||||
|
||||
| Image type | Default distro | Init daemon | Reason | Notes |
|
||||
|-|-|-|-|-|
|
||||
| [image](#root-filesystem-image) | [Clear Linux](https://clearlinux.org) (for x86_64 systems)| systemd | Minimal and highly optimized | systemd offers flexibility |
|
||||
| [initrd](#initrd-image) | [Alpine Linux](https://alpinelinux.org) | Kata [agent](#agent) (as no systemd support) | Security hardened and tiny C library |
|
||||
|
||||
See also:
|
||||
|
||||
- The [osbuilder](../../tools/osbuilder) tool
|
||||
|
||||
This is used to build all default image types.
|
||||
|
||||
- The [versions database](../../versions.yaml)
|
||||
|
||||
The `default-image-name` and `default-initrd-name` options specify
|
||||
the default distributions for each image type.
|
||||
|
||||
## Hypervisor
|
||||
|
||||
The [hypervisor](../hypervisors.md) specified in the
|
||||
[configuration file](#configuration) creates a VM to host the
|
||||
[agent](#agent) and the [workload](#workload) inside the
|
||||
[container environment](#environments).
|
||||
|
||||
> **Note:**
|
||||
>
|
||||
> The hypervisor process runs inside an environment slightly different
|
||||
> to the host environment:
|
||||
>
|
||||
> - It is run in a different cgroup environment to the host.
|
||||
> - It is given a separate network namespace from the host.
|
||||
> - If the [OCI configuration specifies a SELinux label](https://github.com/opencontainers/runtime-spec/blob/main/config.md#linux-process),
|
||||
> the hypervisor process will run with that label (*not* the workload running inside the hypervisor's VM).
|
||||
|
||||
## Agent
|
||||
|
||||
The Kata Containers agent ([`kata-agent`](../../src/agent)), written
|
||||
in the [Rust programming language](https://www.rust-lang.org), is a
|
||||
long running process that runs inside the VM. It acts as the
|
||||
supervisor for managing the containers and the [workload](#workload)
|
||||
running within those containers. Only a single agent process is run
|
||||
for each VM created.
|
||||
|
||||
### Agent communications protocol
|
||||
|
||||
The agent communicates with the other Kata components (primarily the
|
||||
[runtime](#runtime)) using a
|
||||
[`ttRPC`](https://github.com/containerd/ttrpc-rust) based
|
||||
[protocol](../../src/agent/protocols/protos).
|
||||
|
||||
> **Note:**
|
||||
>
|
||||
> If you wish to learn more about this protocol, a practical way to do
|
||||
> so is to experiment with the
|
||||
> [agent control tool](#agent-control-tool) on a test system.
|
||||
> This tool is for test and development purposes only and can send
|
||||
> arbitrary ttRPC agent API commands to the [agent](#agent).
|
||||
|
||||
## Runtime
|
||||
|
||||
The Kata Containers runtime (the [`containerd-shim-kata-v2`](../../src/runtime/cmd/containerd-shim-kata-v2
|
||||
) binary) is a [shimv2](#shim-v2-architecture) compatible runtime.
|
||||
|
||||
> **Note:**
|
||||
>
|
||||
> The Kata Containers runtime is sometimes referred to as the Kata
|
||||
> _shim_. Both terms are correct since the `containerd-shim-kata-v2`
|
||||
> is a container runtime, and that runtime implements the containerd
|
||||
> shim v2 API.
|
||||
|
||||
The runtime makes heavy use of the [`virtcontainers`
|
||||
package](../../src/runtime/virtcontainers), which provides a generic,
|
||||
runtime-specification agnostic, hardware-virtualized containers
|
||||
library.
|
||||
|
||||
The runtime is responsible for starting the [hypervisor](#hypervisor)
|
||||
and it's VM, and communicating with the [agent](#agent) using a
|
||||
[ttRPC based protocol](#agent-communications-protocol) over a VSOCK
|
||||
socket that provides a communications link between the VM and the
|
||||
host.
|
||||
|
||||
This protocol allows the runtime to send container management commands
|
||||
to the agent. The protocol is also used to carry the standard I/O
|
||||
streams (`stdout`, `stderr`, `stdin`) between the containers and
|
||||
container managers (such as CRI-O or containerd).
|
||||
|
||||
## Utility program
|
||||
|
||||
The `kata-runtime` binary is a utility program that provides
|
||||
administrative commands to manipulate and query a Kata Containers
|
||||
installation.
|
||||
|
||||
> **Note:**
|
||||
>
|
||||
> In Kata 1.x, this program also acted as the main
|
||||
> [runtime](#runtime), but this is no longer required due to the
|
||||
> improved shimv2 architecture.
|
||||
|
||||
### exec command
|
||||
|
||||
The `exec` command allows an administrator or developer to enter the
|
||||
[VM root environment](#environments) which is not accessible by the container
|
||||
[workload](#workload).
|
||||
|
||||
See [the developer guide](../Developer-Guide.md#connect-to-debug-console) for further details.
|
||||
|
||||
### Configuration
|
||||
|
||||
See the [configuration file details](../../src/runtime/README.md#configuration).
|
||||
|
||||
The configuration file is also used to enable runtime [debug output](../Developer-Guide.md#enable-full-debug).
|
||||
|
||||
## Process overview
|
||||
|
||||
The table below shows an example of the main processes running in the
|
||||
different [environments](#environments) when a Kata Container is
|
||||
created with containerd using our [example command](#example-command):
|
||||
|
||||
| Description | Host | VM root environment | VM container environment |
|
||||
|-|-|-|-|
|
||||
| Container manager | `containerd` | |
|
||||
| Kata Containers | [runtime](#runtime), [`virtiofsd`](#virtio-fs), [hypervisor](#hypervisor) | [agent](#agent) |
|
||||
| User [workload](#workload) | | | [`ubuntu sh`](#example-command) |
|
||||
|
||||
## Networking
|
||||
|
||||
Containers will typically live in their own, possibly shared, networking namespace.
|
||||
At some point in a container lifecycle, container engines will set up that namespace
|
||||
to add the container to a network which is isolated from the host network, but
|
||||
which is shared between containers
|
||||
|
||||
In order to do so, container engines will usually add one end of a virtual
|
||||
ethernet (`veth`) pair into the container networking namespace. The other end of
|
||||
the `veth` pair is added to the host networking namespace.
|
||||
|
||||
This is a very namespace-centric approach as many hypervisors or VM
|
||||
Managers (VMMs) such as `virt-manager` cannot handle `veth`
|
||||
interfaces. Typically, `TAP` interfaces are created for VM
|
||||
connectivity.
|
||||
|
||||
To overcome incompatibility between typical container engines expectations
|
||||
and virtual machines, Kata Containers networking transparently connects `veth`
|
||||
interfaces with `TAP` ones using Traffic Control:
|
||||
|
||||

|
||||
|
||||
With a TC filter in place, a redirection is created between the container network and the
|
||||
virtual machine. As an example, the CNI may create a device, `eth0`, in the container's network
|
||||
namespace, which is a VETH device. Kata Containers will create a tap device for the VM, `tap0_kata`,
|
||||
and setup a TC redirection filter to mirror traffic from `eth0`'s ingress to `tap0_kata`'s egress,
|
||||
and a second to mirror traffic from `tap0_kata`'s ingress to `eth0`'s egress.
|
||||
|
||||
Kata Containers maintains support for MACVTAP, which was an earlier implementation used in Kata. TC-filter
|
||||
is the default because it allows for simpler configuration, better CNI plugin compatibility, and performance
|
||||
on par with MACVTAP.
|
||||
|
||||
Kata Containers has deprecated support for bridge due to lacking performance relative to TC-filter and MACVTAP.
|
||||
|
||||
Kata Containers supports both
|
||||
[CNM](https://github.com/docker/libnetwork/blob/master/docs/design.md#the-container-network-model)
|
||||
and [CNI](https://github.com/containernetworking/cni) for networking management.
|
||||
|
||||
### Network Hotplug
|
||||
|
||||
Kata Containers has developed a set of network sub-commands and APIs to add, list and
|
||||
remove a guest network endpoint and to manipulate the guest route table.
|
||||
|
||||
The following diagram illustrates the Kata Containers network hotplug workflow.
|
||||
|
||||

|
||||
|
||||
## Storage
|
||||
|
||||
### virtio SCSI
|
||||
|
||||
If a block-based graph driver is [configured](#configuration),
|
||||
`virtio-scsi` is used to _share_ the workload image (such as
|
||||
`busybox:latest`) into the container's environment inside the VM.
|
||||
|
||||
### virtio FS
|
||||
|
||||
If a block-based graph driver is _not_ [configured](#configuration), a
|
||||
[`virtio-fs`](https://virtio-fs.gitlab.io) (`VIRTIO`) overlay
|
||||
filesystem mount point is used to _share_ the workload image instead. The
|
||||
[agent](#agent) uses this mount point as the root filesystem for the
|
||||
container processes.
|
||||
|
||||
For virtio-fs, the [runtime](#runtime) starts one `virtiofsd` daemon
|
||||
(that runs in the host context) for each VM created.
|
||||
|
||||
### Devicemapper
|
||||
|
||||
The
|
||||
[devicemapper `snapshotter`](https://github.com/containerd/containerd/tree/master/snapshots/devmapper)
|
||||
is a special case. The `snapshotter` uses dedicated block devices
|
||||
rather than formatted filesystems, and operates at the block level
|
||||
rather than the file level. This knowledge is used to directly use the
|
||||
underlying block device instead of the overlay file system for the
|
||||
container root file system. The block device maps to the top
|
||||
read-write layer for the overlay. This approach gives much better I/O
|
||||
performance compared to using `virtio-fs` to share the container file
|
||||
system.
|
||||
|
||||
#### Hot plug and unplug
|
||||
|
||||
Kata Containers has the ability to hot plug add and hot plug remove
|
||||
block devices. This makes it possible to use block devices for
|
||||
containers started after the VM has been launched.
|
||||
|
||||
Users can check to see if the container uses the `devicemapper` block
|
||||
device as its rootfs by calling `mount(8)` within the container. If
|
||||
the `devicemapper` block device is used, the root filesystem (`/`)
|
||||
will be mounted from `/dev/vda`. Users can disable direct mounting of
|
||||
the underlying block device through the runtime
|
||||
[configuration](#configuration).
|
||||
|
||||
## Kubernetes support
|
||||
|
||||
[Kubernetes](https://github.com/kubernetes/kubernetes/), or K8s, is a popular open source
|
||||
container orchestration engine. In Kubernetes, a set of containers sharing resources
|
||||
such as networking, storage, mount, PID, etc. is called a
|
||||
[pod](https://kubernetes.io/docs/user-guide/pods/).
|
||||
|
||||
A node can have multiple pods, but at a minimum, a node within a Kubernetes cluster
|
||||
only needs to run a container runtime and a container agent (called a
|
||||
[Kubelet](https://kubernetes.io/docs/admin/kubelet/)).
|
||||
|
||||
Kata Containers represents a Kubelet pod as a VM.
|
||||
|
||||
A Kubernetes cluster runs a control plane where a scheduler (typically
|
||||
running on a dedicated master node) calls into a compute Kubelet. This
|
||||
Kubelet instance is responsible for managing the lifecycle of pods
|
||||
within the nodes and eventually relies on a container runtime to
|
||||
handle execution. The Kubelet architecture decouples lifecycle
|
||||
management from container execution through a dedicated gRPC based
|
||||
[Container Runtime Interface (CRI)](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/container-runtime-interface-v1.md).
|
||||
|
||||
In other words, a Kubelet is a CRI client and expects a CRI
|
||||
implementation to handle the server side of the interface.
|
||||
[CRI-O](https://github.com/kubernetes-incubator/cri-o) and
|
||||
[containerd](https://github.com/containerd/containerd/) are CRI
|
||||
implementations that rely on
|
||||
[OCI](https://github.com/opencontainers/runtime-spec) compatible
|
||||
runtimes for managing container instances.
|
||||
|
||||
Kata Containers is an officially supported CRI-O and containerd
|
||||
runtime. Refer to the following guides on how to set up Kata
|
||||
Containers with Kubernetes:
|
||||
|
||||
- [How to use Kata Containers and containerd](../how-to/containerd-kata.md)
|
||||
- [Run Kata Containers with Kubernetes](../how-to/run-kata-with-k8s.md)
|
||||
|
||||
#### OCI annotations
|
||||
|
||||
In order for the Kata Containers [runtime](#runtime) (or any VM based OCI compatible
|
||||
runtime) to be able to understand if it needs to create a full VM or if it
|
||||
has to create a new container inside an existing pod's VM, CRI-O adds
|
||||
specific annotations to the OCI configuration file (`config.json`) which is passed to
|
||||
the OCI compatible runtime.
|
||||
|
||||
Before calling its runtime, CRI-O will always add a `io.kubernetes.cri-o.ContainerType`
|
||||
annotation to the `config.json` configuration file it produces from the Kubelet CRI
|
||||
request. The `io.kubernetes.cri-o.ContainerType` annotation can either be set to `sandbox`
|
||||
or `container`. Kata Containers will then use this annotation to decide if it needs to
|
||||
respectively create a virtual machine or a container inside a virtual machine associated
|
||||
with a Kubernetes pod:
|
||||
|
||||
| Annotation value | Kata VM created? | Kata container created? |
|
||||
|-|-|-|
|
||||
| `sandbox` | yes | yes (inside new VM) |
|
||||
| `container`| no | yes (in existing VM) |
|
||||
|
||||
#### Mixing VM based and namespace based runtimes
|
||||
|
||||
> **Note:** Since Kubernetes 1.12, the [`Kubernetes RuntimeClass`](https://kubernetes.io/docs/concepts/containers/runtime-class/)
|
||||
> has been supported and the user can specify runtime without the non-standardized annotations.
|
||||
|
||||
With `RuntimeClass`, users can define Kata Containers as a
|
||||
`RuntimeClass` and then explicitly specify that a pod must be created
|
||||
as a Kata Containers pod. For details, please refer to [How to use
|
||||
Kata Containers and containerd](../../docs/how-to/containerd-kata.md).
|
||||
|
||||
## Tracing
|
||||
|
||||
The [tracing document](../tracing.md) provides details on the tracing
|
||||
architecture.
|
||||
|
||||
# Appendices
|
||||
|
||||
## DAX
|
||||
|
||||
Kata Containers utilizes the Linux kernel DAX
|
||||
[(Direct Access filesystem)](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/dax.rst?h=v5.14)
|
||||
feature to efficiently map the [guest image](#guest-image) in the
|
||||
[host environment](#environments) into the
|
||||
[guest VM environment](#environments) to become the VM's
|
||||
[rootfs](#root-filesystem).
|
||||
|
||||
If the [configured](#configuration) [hypervisor](#hypervisor) is set
|
||||
to either QEMU or Cloud Hypervisor, DAX is used with the feature shown
|
||||
in the table below:
|
||||
|
||||
| Hypervisor | Feature used | rootfs device type |
|
||||
|-|-|-|
|
||||
| Cloud Hypervisor (CH) | `dax` `FsConfig` configuration option | PMEM (emulated Persistent Memory device) |
|
||||
| QEMU | NVDIMM memory device with a memory file backend | NVDIMM (emulated Non-Volatile Dual In-line Memory Module device) |
|
||||
|
||||
The features in the table above are equivalent in that they provide a memory-mapped
|
||||
virtual device which is used to DAX map the VM's
|
||||
[rootfs](#root-filesystem) into the [VM guest](#environments) memory
|
||||
address space.
|
||||
|
||||
The VM is then booted, specifying the `root=` kernel parameter to make
|
||||
the [guest kernel](#guest-kernel) use the appropriate emulated device
|
||||
as its rootfs.
|
||||
|
||||
### DAX advantages
|
||||
|
||||
Mapping files using [DAX](#dax) provides a number of benefits over
|
||||
more traditional VM file and device mapping mechanisms:
|
||||
|
||||
- Mapping as a direct access device allows the guest to directly
|
||||
access the host memory pages (such as via Execute In Place (XIP)),
|
||||
bypassing the [guest kernel](#guest-kernel)'s page cache. This
|
||||
zero copy provides both time and space optimizations.
|
||||
|
||||
- Mapping as a direct access device inside the VM allows pages from the
|
||||
host to be demand loaded using page faults, rather than having to make requests
|
||||
via a virtualized device (causing expensive VM exits/hypercalls), thus providing
|
||||
a speed optimization.
|
||||
|
||||
- Utilizing `mmap(2)`'s `MAP_SHARED` shared memory option on the host
|
||||
allows the host to efficiently share pages.
|
||||
|
||||

|
||||
|
||||
For further details of the use of NVDIMM with QEMU, see the [QEMU
|
||||
project documentation](https://www.qemu.org).
|
||||
|
||||
## Agent control tool
|
||||
|
||||
The [agent control tool](../../src/tools/agent-ctl) is a test and
|
||||
development tool that can be used to learn more about a Kata
|
||||
Containers system.
|
||||
|
||||
## Terminology
|
||||
|
||||
See the [project glossary](../../Glossary.md).
|
||||
|
||||
[debug-console]: ../Developer-Guide.md#connect-to-debug-console
|
479
docs/design/architecture/README.md
Normal file
479
docs/design/architecture/README.md
Normal file
@ -0,0 +1,479 @@
|
||||
# Kata Containers Architecture
|
||||
|
||||
## Overview
|
||||
|
||||
Kata Containers is an open source community working to build a secure
|
||||
container [runtime](#runtime) with lightweight virtual machines (VM's)
|
||||
that feel and perform like standard Linux containers, but provide
|
||||
stronger [workload](#workload) isolation using hardware
|
||||
[virtualization](#virtualization) technology as a second layer of
|
||||
defence.
|
||||
|
||||
Kata Containers runs on [multiple architectures](../../../src/runtime/README.md#platform-support)
|
||||
and supports [multiple hypervisors](../../hypervisors.md).
|
||||
|
||||
This document is a summary of the Kata Containers architecture.
|
||||
|
||||
## Background knowledge
|
||||
|
||||
This document assumes the reader understands a number of concepts
|
||||
related to containers and file systems. The
|
||||
[background](background.md) document explains these concepts.
|
||||
|
||||
## Example command
|
||||
|
||||
This document makes use of a particular [example
|
||||
command](example-command.md) throughout the text to illustrate certain
|
||||
concepts.
|
||||
|
||||
## Virtualization
|
||||
|
||||
For details on how Kata Containers maps container concepts to VM
|
||||
technologies, and how this is realized in the multiple hypervisors and
|
||||
VMMs that Kata supports see the
|
||||
[virtualization documentation](../virtualization.md).
|
||||
|
||||
## Compatibility
|
||||
|
||||
The [Kata Containers runtime](../../../src/runtime) is compatible with
|
||||
the [OCI](https://github.com/opencontainers)
|
||||
[runtime specification](https://github.com/opencontainers/runtime-spec)
|
||||
and therefore works seamlessly with the
|
||||
[Kubernetes Container Runtime Interface (CRI)](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-node/container-runtime-interface.md)
|
||||
through the [CRI-O](https://github.com/kubernetes-incubator/cri-o)
|
||||
and [containerd](https://github.com/containerd/containerd)
|
||||
implementations.
|
||||
|
||||
Kata Containers provides a ["shimv2"](#shim-v2-architecture) compatible runtime.
|
||||
|
||||
## Shim v2 architecture
|
||||
|
||||
The Kata Containers runtime is shim v2 ("shimv2") compatible. This
|
||||
section explains what this means.
|
||||
|
||||
> **Note:**
|
||||
>
|
||||
> For a comparison with the Kata 1.x architecture, see
|
||||
> [the architectural history document](history.md).
|
||||
|
||||
The
|
||||
[containerd runtime shimv2 architecture](https://github.com/containerd/containerd/tree/main/runtime/v2)
|
||||
or _shim API_ architecture resolves the issues with the old
|
||||
architecture by defining a set of shimv2 APIs that a compatible
|
||||
runtime implementation must supply. Rather than calling the runtime
|
||||
binary multiple times for each new container, the shimv2 architecture
|
||||
runs a single instance of the runtime binary (for any number of
|
||||
containers). This improves performance and resolves the state handling
|
||||
issue.
|
||||
|
||||
The shimv2 API is similar to the
|
||||
[OCI runtime](https://github.com/opencontainers/runtime-spec)
|
||||
API in terms of the way the container lifecycle is split into
|
||||
different verbs. Rather than calling the runtime multiple times, the
|
||||
container manager creates a socket and passes it to the shimv2
|
||||
runtime. The socket is a bi-directional communication channel that
|
||||
uses a gRPC based protocol to allow the container manager to send API
|
||||
calls to the runtime, which returns the result to the container
|
||||
manager using the same channel.
|
||||
|
||||
The shimv2 architecture allows running several containers per VM to
|
||||
support container engines that require multiple containers running
|
||||
inside a pod.
|
||||
|
||||
With the new architecture [Kubernetes](kubernetes.md) can
|
||||
launch both Pod and OCI compatible containers with a single
|
||||
[runtime](#runtime) shim per Pod, rather than `2N+1` shims. No stand
|
||||
alone `kata-proxy` process is required, even if VSOCK is not
|
||||
available.
|
||||
|
||||
## Workload
|
||||
|
||||
The workload is the command the user requested to run in the
|
||||
container and is specified in the [OCI bundle](background.md#oci-bundle)'s
|
||||
configuration file.
|
||||
|
||||
In our [example](example-command.md), the workload is the `sh(1)` command.
|
||||
|
||||
### Workload root filesystem
|
||||
|
||||
For details of how the [runtime](#runtime) makes the
|
||||
[container image](background.md#container-image) chosen by the user available to
|
||||
the workload process, see the
|
||||
[Container creation](#container-creation) and [storage](#storage) sections.
|
||||
|
||||
Note that the workload is isolated from the [guest VM](#environments) environment by its
|
||||
surrounding [container environment](#environments). The guest VM
|
||||
environment where the container runs in is also isolated from the _outer_
|
||||
[host environment](#environments) where the container manager runs.
|
||||
|
||||
## System overview
|
||||
|
||||
### Environments
|
||||
|
||||
The following terminology is used to describe the different or
|
||||
environments (or contexts) various processes run in. It is necessary
|
||||
to study this table closely to make sense of what follows:
|
||||
|
||||
| Type | Name | Virtualized | Containerized | rootfs | Rootfs device type | Mount type | Description |
|
||||
|-|-|-|-|-|-|-|-|
|
||||
| Host | Host | no `[1]` | no | Host specific | Host specific | Host specific | The environment provided by a standard, physical non virtualized system. |
|
||||
| VM root | Guest VM | yes | no | rootfs inside the [guest image](guest-assets.md#guest-image) | Hypervisor specific `[2]` | `ext4` | The first (or top) level VM environment created on a host system. |
|
||||
| VM container root | Container | yes | yes | rootfs type requested by user ([`ubuntu` in the example](example-command.md)) | `kataShared` | [virtio FS](storage.md#virtio-fs) | The first (or top) level container environment created inside the VM. Based on the [OCI bundle](background.md#oci-bundle). |
|
||||
|
||||
**Key:**
|
||||
|
||||
- `[1]`: For simplicity, this document assumes the host environment
|
||||
runs on physical hardware.
|
||||
|
||||
- `[2]`: See the [DAX](#dax) section.
|
||||
|
||||
> **Notes:**
|
||||
>
|
||||
> - The word "root" is used to mean _top level_ here in a similar
|
||||
> manner to the term [rootfs](background.md#root-filesystem).
|
||||
>
|
||||
> - The term "first level" prefix used above is important since it implies
|
||||
> that it is possible to create multi level systems. However, they do
|
||||
> not form part of a standard Kata Containers environment so will not
|
||||
> be considered in this document.
|
||||
|
||||
The reasons for containerizing the [workload](#workload) inside the VM
|
||||
are:
|
||||
|
||||
- Isolates the workload entirely from the VM environment.
|
||||
- Provides better isolation between containers in a [pod](kubernetes.md).
|
||||
- Allows the workload to be managed and monitored through its cgroup
|
||||
confinement.
|
||||
|
||||
### Container creation
|
||||
|
||||
The steps below show at a high level how a Kata Containers container is
|
||||
created using the containerd container manager:
|
||||
|
||||
1. The user requests the creation of a container by running a command
|
||||
like the [example command](example-command.md).
|
||||
1. The container manager daemon runs a single instance of the Kata
|
||||
[runtime](#runtime).
|
||||
1. The Kata runtime loads its [configuration file](#configuration).
|
||||
1. The container manager calls a set of shimv2 API functions on the runtime.
|
||||
1. The Kata runtime launches the configured [hypervisor](#hypervisor).
|
||||
1. The hypervisor creates and starts (_boots_) a VM using the
|
||||
[guest assets](guest-assets.md#guest-assets):
|
||||
|
||||
- The hypervisor [DAX](#dax) shares the
|
||||
[guest image](guest-assets.md#guest-image)
|
||||
into the VM to become the VM [rootfs](background.md#root-filesystem) (mounted on a `/dev/pmem*` device),
|
||||
which is known as the [VM root environment](#environments).
|
||||
- The hypervisor mounts the [OCI bundle](background.md#oci-bundle), using [virtio FS](storage.md#virtio-fs),
|
||||
into a container specific directory inside the VM's rootfs.
|
||||
|
||||
This container specific directory will become the
|
||||
[container rootfs](#environments), known as the
|
||||
[container environment](#environments).
|
||||
|
||||
1. The [agent](#agent) is started as part of the VM boot.
|
||||
|
||||
1. The runtime calls the agent's `CreateSandbox` API to request the
|
||||
agent create a container:
|
||||
|
||||
1. The agent creates a [container environment](#environments)
|
||||
in the container specific directory that contains the [container rootfs](#environments).
|
||||
|
||||
The container environment hosts the [workload](#workload) in the
|
||||
[container rootfs](#environments) directory.
|
||||
|
||||
1. The agent spawns the workload inside the container environment.
|
||||
|
||||
> **Notes:**
|
||||
>
|
||||
> - The container environment created by the agent is equivalent to
|
||||
> a container environment created by the
|
||||
> [`runc`](https://github.com/opencontainers/runc) OCI runtime;
|
||||
> Linux cgroups and namespaces are created inside the VM by the
|
||||
> [guest kernel](guest-assets.md#guest-kernel) to isolate the
|
||||
> workload from the VM environment the container is created in.
|
||||
> See the [Environments](#environments) section for an
|
||||
> explanation of why this is done.
|
||||
>
|
||||
> - See the [guest image](guest-assets.md#guest-image) section for
|
||||
> details of exactly how the agent is started.
|
||||
|
||||
1. The container manager returns control of the container to the
|
||||
user running the `ctr` command.
|
||||
|
||||
> **Note:**
|
||||
>
|
||||
> At this point, the container is running and:
|
||||
>
|
||||
> - The [workload](#workload) process ([`sh(1)` in the example](example-command.md))
|
||||
> is running in the [container environment](#environments).
|
||||
> - The user is now able to interact with the workload
|
||||
> (using the [`ctr` command in the example](example-command.md)).
|
||||
> - The [agent](#agent), running inside the VM is monitoring the
|
||||
> [workload](#workload) process.
|
||||
> - The [runtime](#runtime) is waiting for the agent's `WaitProcess` API
|
||||
> call to complete.
|
||||
|
||||
Further details of these steps are provided in the sections below.
|
||||
|
||||
### Container shutdown
|
||||
|
||||
There are two possible ways for the container environment to be
|
||||
terminated:
|
||||
|
||||
- When the [workload](#workload) exits.
|
||||
|
||||
This is the standard, or _graceful_ shutdown method.
|
||||
|
||||
- When the container manager forces the container to be deleted.
|
||||
|
||||
#### Workload exit
|
||||
|
||||
The [agent](#agent) will detect when the [workload](#workload) process
|
||||
exits, capture its exit status (see `wait(2)`) and return that value
|
||||
to the [runtime](#runtime) by specifying it as the response to the
|
||||
`WaitProcess` agent API call made by the [runtime](#runtime).
|
||||
|
||||
The runtime then passes the value back to the container manager by the
|
||||
`Wait` [shimv2 API](#shim-v2-architecture) call.
|
||||
|
||||
Once the workload has fully exited, the VM is no longer needed and the
|
||||
runtime cleans up the environment (which includes terminating the
|
||||
[hypervisor](#hypervisor) process).
|
||||
|
||||
> **Note:**
|
||||
>
|
||||
> When [agent tracing is enabled](../../tracing.md#agent-shutdown-behaviour),
|
||||
> the shutdown behaviour is different.
|
||||
|
||||
#### Container manager requested shutdown
|
||||
|
||||
If the container manager requests the container be deleted, the
|
||||
[runtime](#runtime) will signal the agent by sending it a
|
||||
`DestroySandbox` [ttRPC API](../../../src/agent/protocols/protos/agent.proto) request.
|
||||
|
||||
## Guest assets
|
||||
|
||||
The guest assets comprise a guest image and a guest kernel that are
|
||||
used by the [hypervisor](#hypervisor).
|
||||
|
||||
See the [guest assets](guest-assets.md) document for further
|
||||
information.
|
||||
|
||||
## Hypervisor
|
||||
|
||||
The [hypervisor](../../hypervisors.md) specified in the
|
||||
[configuration file](#configuration) creates a VM to host the
|
||||
[agent](#agent) and the [workload](#workload) inside the
|
||||
[container environment](#environments).
|
||||
|
||||
> **Note:**
|
||||
>
|
||||
> The hypervisor process runs inside an environment slightly different
|
||||
> to the host environment:
|
||||
>
|
||||
> - It is run in a different cgroup environment to the host.
|
||||
> - It is given a separate network namespace from the host.
|
||||
> - If the [OCI configuration specifies a SELinux label](https://github.com/opencontainers/runtime-spec/blob/main/config.md#linux-process),
|
||||
> the hypervisor process will run with that label (*not* the workload running inside the hypervisor's VM).
|
||||
|
||||
## Agent
|
||||
|
||||
The Kata Containers agent ([`kata-agent`](../../../src/agent)), written
|
||||
in the [Rust programming language](https://www.rust-lang.org), is a
|
||||
long running process that runs inside the VM. It acts as the
|
||||
supervisor for managing the containers and the [workload](#workload)
|
||||
running within those containers. Only a single agent process is run
|
||||
for each VM created.
|
||||
|
||||
### Agent communications protocol
|
||||
|
||||
The agent communicates with the other Kata components (primarily the
|
||||
[runtime](#runtime)) using a
|
||||
[`ttRPC`](https://github.com/containerd/ttrpc-rust) based
|
||||
[protocol](../../../src/agent/protocols/protos).
|
||||
|
||||
> **Note:**
|
||||
>
|
||||
> If you wish to learn more about this protocol, a practical way to do
|
||||
> so is to experiment with the
|
||||
> [agent control tool](#agent-control-tool) on a test system.
|
||||
> This tool is for test and development purposes only and can send
|
||||
> arbitrary ttRPC agent API commands to the [agent](#agent).
|
||||
|
||||
## Runtime
|
||||
|
||||
The Kata Containers runtime (the [`containerd-shim-kata-v2`](../../../src/runtime/cmd/containerd-shim-kata-v2
|
||||
) binary) is a [shimv2](#shim-v2-architecture) compatible runtime.
|
||||
|
||||
> **Note:**
|
||||
>
|
||||
> The Kata Containers runtime is sometimes referred to as the Kata
|
||||
> _shim_. Both terms are correct since the `containerd-shim-kata-v2`
|
||||
> is a container runtime, and that runtime implements the containerd
|
||||
> shim v2 API.
|
||||
|
||||
The runtime makes heavy use of the [`virtcontainers`
|
||||
package](../../../src/runtime/virtcontainers), which provides a generic,
|
||||
runtime-specification agnostic, hardware-virtualized containers
|
||||
library.
|
||||
|
||||
The runtime is responsible for starting the [hypervisor](#hypervisor)
|
||||
and it's VM, and communicating with the [agent](#agent) using a
|
||||
[ttRPC based protocol](#agent-communications-protocol) over a VSOCK
|
||||
socket that provides a communications link between the VM and the
|
||||
host.
|
||||
|
||||
This protocol allows the runtime to send container management commands
|
||||
to the agent. The protocol is also used to carry the standard I/O
|
||||
streams (`stdout`, `stderr`, `stdin`) between the containers and
|
||||
container managers (such as CRI-O or containerd).
|
||||
|
||||
## Utility program
|
||||
|
||||
The `kata-runtime` binary is a utility program that provides
|
||||
administrative commands to manipulate and query a Kata Containers
|
||||
installation.
|
||||
|
||||
> **Note:**
|
||||
>
|
||||
> In Kata 1.x, this program also acted as the main
|
||||
> [runtime](#runtime), but this is no longer required due to the
|
||||
> improved shimv2 architecture.
|
||||
|
||||
### exec command
|
||||
|
||||
The `exec` command allows an administrator or developer to enter the
|
||||
[VM root environment](#environments) which is not accessible by the container
|
||||
[workload](#workload).
|
||||
|
||||
See [the developer guide](../../Developer-Guide.md#connect-to-debug-console) for further details.
|
||||
|
||||
### Configuration
|
||||
|
||||
See the [configuration file details](../../../src/runtime/README.md#configuration).
|
||||
|
||||
The configuration file is also used to enable runtime [debug output](../../Developer-Guide.md#enable-full-debug).
|
||||
|
||||
## Process overview
|
||||
|
||||
The table below shows an example of the main processes running in the
|
||||
different [environments](#environments) when a Kata Container is
|
||||
created with containerd using our [example command](example-command.md):
|
||||
|
||||
| Description | Host | VM root environment | VM container environment |
|
||||
|-|-|-|-|
|
||||
| Container manager | `containerd` | |
|
||||
| Kata Containers | [runtime](#runtime), [`virtiofsd`](storage.md#virtio-fs), [hypervisor](#hypervisor) | [agent](#agent) |
|
||||
| User [workload](#workload) | | | [`ubuntu sh`](example-command.md) |
|
||||
|
||||
## Networking
|
||||
|
||||
See the [networking document](networking.md).
|
||||
|
||||
## Storage
|
||||
|
||||
See the [storage document](storage.md).
|
||||
|
||||
## Kubernetes support
|
||||
|
||||
See the [Kubernetes document](kubernetes.md).
|
||||
|
||||
#### OCI annotations
|
||||
|
||||
In order for the Kata Containers [runtime](#runtime) (or any VM based OCI compatible
|
||||
runtime) to be able to understand if it needs to create a full VM or if it
|
||||
has to create a new container inside an existing pod's VM, CRI-O adds
|
||||
specific annotations to the OCI configuration file (`config.json`) which is passed to
|
||||
the OCI compatible runtime.
|
||||
|
||||
Before calling its runtime, CRI-O will always add a `io.kubernetes.cri-o.ContainerType`
|
||||
annotation to the `config.json` configuration file it produces from the Kubelet CRI
|
||||
request. The `io.kubernetes.cri-o.ContainerType` annotation can either be set to `sandbox`
|
||||
or `container`. Kata Containers will then use this annotation to decide if it needs to
|
||||
respectively create a virtual machine or a container inside a virtual machine associated
|
||||
with a Kubernetes pod:
|
||||
|
||||
| Annotation value | Kata VM created? | Kata container created? |
|
||||
|-|-|-|
|
||||
| `sandbox` | yes | yes (inside new VM) |
|
||||
| `container`| no | yes (in existing VM) |
|
||||
|
||||
#### Mixing VM based and namespace based runtimes
|
||||
|
||||
> **Note:** Since Kubernetes 1.12, the [`Kubernetes RuntimeClass`](https://kubernetes.io/docs/concepts/containers/runtime-class/)
|
||||
> has been supported and the user can specify runtime without the non-standardized annotations.
|
||||
|
||||
With `RuntimeClass`, users can define Kata Containers as a
|
||||
`RuntimeClass` and then explicitly specify that a pod must be created
|
||||
as a Kata Containers pod. For details, please refer to [How to use
|
||||
Kata Containers and containerd](../../../docs/how-to/containerd-kata.md).
|
||||
|
||||
## Tracing
|
||||
|
||||
The [tracing document](../../tracing.md) provides details on the tracing
|
||||
architecture.
|
||||
|
||||
# Appendices
|
||||
|
||||
## DAX
|
||||
|
||||
Kata Containers utilizes the Linux kernel DAX
|
||||
[(Direct Access filesystem)](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/dax.rst?h=v5.14)
|
||||
feature to efficiently map the [guest image](guest-assets.md#guest-image) in the
|
||||
[host environment](#environments) into the
|
||||
[guest VM environment](#environments) to become the VM's
|
||||
[rootfs](background.md#root-filesystem).
|
||||
|
||||
If the [configured](#configuration) [hypervisor](#hypervisor) is set
|
||||
to either QEMU or Cloud Hypervisor, DAX is used with the feature shown
|
||||
in the table below:
|
||||
|
||||
| Hypervisor | Feature used | rootfs device type |
|
||||
|-|-|-|
|
||||
| Cloud Hypervisor (CH) | `dax` `FsConfig` configuration option | PMEM (emulated Persistent Memory device) |
|
||||
| QEMU | NVDIMM memory device with a memory file backend | NVDIMM (emulated Non-Volatile Dual In-line Memory Module device) |
|
||||
|
||||
The features in the table above are equivalent in that they provide a memory-mapped
|
||||
virtual device which is used to DAX map the VM's
|
||||
[rootfs](background.md#root-filesystem) into the [VM guest](#environments) memory
|
||||
address space.
|
||||
|
||||
The VM is then booted, specifying the `root=` kernel parameter to make
|
||||
the [guest kernel](guest-assets.md#guest-kernel) use the appropriate emulated device
|
||||
as its rootfs.
|
||||
|
||||
### DAX advantages
|
||||
|
||||
Mapping files using [DAX](#dax) provides a number of benefits over
|
||||
more traditional VM file and device mapping mechanisms:
|
||||
|
||||
- Mapping as a direct access device allows the guest to directly
|
||||
access the host memory pages (such as via Execute In Place (XIP)),
|
||||
bypassing the [guest kernel](guest-assets.md#guest-kernel)'s page cache. This
|
||||
zero copy provides both time and space optimizations.
|
||||
|
||||
- Mapping as a direct access device inside the VM allows pages from the
|
||||
host to be demand loaded using page faults, rather than having to make requests
|
||||
via a virtualized device (causing expensive VM exits/hypercalls), thus providing
|
||||
a speed optimization.
|
||||
|
||||
- Utilizing `mmap(2)`'s `MAP_SHARED` shared memory option on the host
|
||||
allows the host to efficiently share pages.
|
||||
|
||||

|
||||
|
||||
For further details of the use of NVDIMM with QEMU, see the [QEMU
|
||||
project documentation](https://www.qemu.org).
|
||||
|
||||
## Agent control tool
|
||||
|
||||
The [agent control tool](../../../src/tools/agent-ctl) is a test and
|
||||
development tool that can be used to learn more about a Kata
|
||||
Containers system.
|
||||
|
||||
## Terminology
|
||||
|
||||
See the [project glossary](../../../Glossary.md).
|
||||
|
||||
[debug-console]: ../../Developer-Guide.md#connect-to-debug-console
|
81
docs/design/architecture/background.md
Normal file
81
docs/design/architecture/background.md
Normal file
@ -0,0 +1,81 @@
|
||||
# Kata Containers architecture background knowledge
|
||||
|
||||
The following sections explain some of the background concepts
|
||||
required to understand the [architecture document](README.md).
|
||||
|
||||
## Root filesystem
|
||||
|
||||
This document uses the term _rootfs_ to refer to a root filesystem
|
||||
which is mounted as the top-level directory ("`/`") and often referred
|
||||
to as _slash_.
|
||||
|
||||
It is important to understand this term since the overall system uses
|
||||
multiple different rootfs's (as explained in the
|
||||
[Environments](README.md#environments) section.
|
||||
|
||||
## Container image
|
||||
|
||||
In the [example command](example-command.md) the user has specified the
|
||||
type of container they wish to run via the container image name:
|
||||
`ubuntu`. This image name corresponds to a _container image_ that can
|
||||
be used to create a container with an Ubuntu Linux environment. Hence,
|
||||
in our [example](example-command.md), the `sh(1)` command will be run
|
||||
inside a container which has an Ubuntu rootfs.
|
||||
|
||||
> **Note:**
|
||||
>
|
||||
> The term _container image_ is confusing since the image in question
|
||||
> is **not** a container: it is simply a set of files (_an image_)
|
||||
> that can be used to _create_ a container. The term _container
|
||||
> template_ would be more accurate but the term _container image_ is
|
||||
> commonly used so this document uses the standard term.
|
||||
|
||||
For the purposes of this document, the most important part of the
|
||||
[example command line](example-command.md) is the container image the
|
||||
user has requested. Normally, the container manager will _pull_
|
||||
(download) a container image from a remote site and store a copy
|
||||
locally. This local container image is used by the container manager
|
||||
to create an [OCI bundle](#oci-bundle) which will form the environment
|
||||
the container will run in. After creating the OCI bundle, the
|
||||
container manager launches a [runtime](README.md#runtime) which will create the
|
||||
container using the provided OCI bundle.
|
||||
|
||||
## OCI bundle
|
||||
|
||||
To understand what follows, it is important to know at a high level
|
||||
how an OCI ([Open Containers Initiative](https://opencontainers.org)) compatible container is created.
|
||||
|
||||
An OCI compatible container is created by taking a
|
||||
[container image](#container-image) and converting the embedded rootfs
|
||||
into an
|
||||
[OCI rootfs bundle](https://github.com/opencontainers/runtime-spec/blob/main/bundle.md),
|
||||
or more simply, an _OCI bundle_.
|
||||
|
||||
An OCI bundle is a `tar(1)` archive normally created by a container
|
||||
manager which is passed to an OCI [runtime](README.md#runtime) which converts
|
||||
it into a full container rootfs. The bundle contains two assets:
|
||||
|
||||
- A container image [rootfs](#root-filesystem)
|
||||
|
||||
This is simply a directory of files that will be used to represent
|
||||
the rootfs for the container.
|
||||
|
||||
For the [example command](example-command.md), the directory will
|
||||
contain the files necessary to create a minimal Ubuntu root
|
||||
filesystem.
|
||||
|
||||
- An [OCI configuration file](https://github.com/opencontainers/runtime-spec/blob/main/config.md)
|
||||
|
||||
This is a JSON file called `config.json`.
|
||||
|
||||
The container manager will create this file so that:
|
||||
|
||||
- The `root.path` value is set to the full path of the specified
|
||||
container rootfs.
|
||||
|
||||
In [the example](example-command.md) this value will be `ubuntu`.
|
||||
|
||||
- The `process.args` array specifies the list of commands the user
|
||||
wishes to run. This is known as the [workload](README.md#workload).
|
||||
|
||||
In [the example](example-command.md) the workload is `sh(1)`.
|
30
docs/design/architecture/example-command.md
Normal file
30
docs/design/architecture/example-command.md
Normal file
@ -0,0 +1,30 @@
|
||||
# Example command
|
||||
|
||||
The following containerd command creates a container. It is referred
|
||||
to throughout the architecture document to help explain various points:
|
||||
|
||||
```bash
|
||||
$ sudo ctr run --runtime "io.containerd.kata.v2" --rm -t "quay.io/libpod/ubuntu:latest" foo sh
|
||||
```
|
||||
|
||||
This command requests that containerd:
|
||||
|
||||
- Create a container (`ctr run`).
|
||||
- Use the Kata [shimv2](README.md#shim-v2-architecture) runtime (`--runtime "io.containerd.kata.v2"`).
|
||||
- Delete the container when it [exits](README.md#workload-exit) (`--rm`).
|
||||
- Attach the container to the user's terminal (`-t`).
|
||||
- Use the Ubuntu Linux [container image](background.md#container-image)
|
||||
to create the container [rootfs](background.md#root-filesystem) that will become
|
||||
the [container environment](README.md#environments)
|
||||
(`quay.io/libpod/ubuntu:latest`).
|
||||
- Create the container with the name "`foo`".
|
||||
- Run the `sh(1)` command in the Ubuntu rootfs based container
|
||||
environment.
|
||||
|
||||
The command specified here is referred to as the [workload](README.md#workload).
|
||||
|
||||
> **Note:**
|
||||
>
|
||||
> For the purposes of this document and to keep explanations
|
||||
> simpler, we assume the user is running this command in the
|
||||
> [host environment](README.md#environments).
|
150
docs/design/architecture/guest-assets.md
Normal file
150
docs/design/architecture/guest-assets.md
Normal file
@ -0,0 +1,150 @@
|
||||
# Guest assets
|
||||
|
||||
Kata Containers creates a VM in which to run one or more containers.
|
||||
It does this by launching a [hypervisor](README.md#hypervisor) to
|
||||
create the VM. The hypervisor needs two assets for this task: a Linux
|
||||
kernel and a small root filesystem image to boot the VM.
|
||||
|
||||
## Guest kernel
|
||||
|
||||
The [guest kernel](../../../tools/packaging/kernel)
|
||||
is passed to the hypervisor and used to boot the VM.
|
||||
The default kernel provided in Kata Containers is highly optimized for
|
||||
kernel boot time and minimal memory footprint, providing only those
|
||||
services required by a container workload. It is based on the latest
|
||||
Linux LTS (Long Term Support) [kernel](https://www.kernel.org).
|
||||
|
||||
## Guest image
|
||||
|
||||
The hypervisor uses an image file which provides a minimal root
|
||||
filesystem used by the guest kernel to boot the VM and host the Kata
|
||||
Container. Kata Containers supports both initrd and rootfs based
|
||||
minimal guest images. The [default packages](../../install/) provide both
|
||||
an image and an initrd, both of which are created using the
|
||||
[`osbuilder`](../../../tools/osbuilder) tool.
|
||||
|
||||
> **Notes:**
|
||||
>
|
||||
> - Although initrd and rootfs based images are supported, not all
|
||||
> [hypervisors](README.md#hypervisor) support both types of image.
|
||||
>
|
||||
> - The guest image is *unrelated* to the image used in a container
|
||||
> workload.
|
||||
>
|
||||
> For example, if a user creates a container that runs a shell in a
|
||||
> BusyBox image, they will run that shell in a BusyBox environment.
|
||||
> However, the guest image running inside the VM that is used to
|
||||
> *host* that BusyBox image could be running Clear Linux, Ubuntu,
|
||||
> Fedora or any other distribution potentially.
|
||||
>
|
||||
> The `osbuilder` tool provides
|
||||
> [configurations for various common Linux distributions](../../../tools/osbuilder/rootfs-builder)
|
||||
> which can be built into either initrd or rootfs guest images.
|
||||
>
|
||||
> - If you are using a [packaged version of Kata
|
||||
> Containers](../../install), you can see image details by running the
|
||||
> [`kata-collect-data.sh`](../../../src/runtime/data/kata-collect-data.sh.in)
|
||||
> script as `root` and looking at the "Image details" section of the
|
||||
> output.
|
||||
|
||||
#### Root filesystem image
|
||||
|
||||
The default packaged rootfs image, sometimes referred to as the _mini
|
||||
O/S_, is a highly optimized container bootstrap system.
|
||||
|
||||
If this image type is [configured](README.md#configuration), when the
|
||||
user runs the [example command](example-command.md):
|
||||
|
||||
- The [runtime](README.md#runtime) will launch the configured [hypervisor](README.md#hypervisor).
|
||||
- The hypervisor will boot the mini-OS image using the [guest kernel](#guest-kernel).
|
||||
- The kernel will start the init daemon as PID 1 (`systemd`) inside the VM root environment.
|
||||
- `systemd`, running inside the mini-OS context, will launch the [agent](README.md#agent)
|
||||
in the root context of the VM.
|
||||
- The agent will create a new container environment, setting its root
|
||||
filesystem to that requested by the user (Ubuntu in [the example](example-command.md)).
|
||||
- The agent will then execute the command (`sh(1)` in [the example](example-command.md))
|
||||
inside the new container.
|
||||
|
||||
The table below summarises the default mini O/S showing the
|
||||
environments that are created, the services running in those
|
||||
environments (for all platforms) and the root filesystem used by
|
||||
each service:
|
||||
|
||||
| Process | Environment | systemd service? | rootfs | User accessible | Notes |
|
||||
|-|-|-|-|-|-|
|
||||
| systemd | VM root | n/a | [VM guest image](#guest-image)| [debug console][debug-console] | The init daemon, running as PID 1 |
|
||||
| [Agent](README.md#agent) | VM root | yes | [VM guest image](#guest-image)| [debug console][debug-console] | Runs as a systemd service |
|
||||
| `chronyd` | VM root | yes | [VM guest image](#guest-image)| [debug console][debug-console] | Used to synchronise the time with the host |
|
||||
| container workload (`sh(1)` in [the example](example-command.md)) | VM container | no | User specified (Ubuntu in [the example](example-command.md)) | [exec command](README.md#exec-command) | Managed by the agent |
|
||||
|
||||
See also the [process overview](README.md#process-overview).
|
||||
|
||||
> **Notes:**
|
||||
>
|
||||
> - The "User accessible" column shows how an administrator can access
|
||||
> the environment.
|
||||
>
|
||||
> - The container workload is running inside a full container
|
||||
> environment which itself is running within a VM environment.
|
||||
>
|
||||
> - See the [configuration files for the `osbuilder` tool](../../../tools/osbuilder/rootfs-builder)
|
||||
> for details of the default distribution for platforms other than
|
||||
> Intel x86_64.
|
||||
|
||||
#### Initrd image
|
||||
|
||||
The initrd image is a compressed `cpio(1)` archive, created from a
|
||||
rootfs which is loaded into memory and used as part of the Linux
|
||||
startup process. During startup, the kernel unpacks it into a special
|
||||
instance of a `tmpfs` mount that becomes the initial root filesystem.
|
||||
|
||||
If this image type is [configured](README.md#configuration), when the user runs
|
||||
the [example command](example-command.md):
|
||||
|
||||
- The [runtime](README.md#runtime) will launch the configured [hypervisor](README.md#hypervisor).
|
||||
- The hypervisor will boot the mini-OS image using the [guest kernel](#guest-kernel).
|
||||
- The kernel will start the init daemon as PID 1 (the
|
||||
[agent](README.md#agent))
|
||||
inside the VM root environment.
|
||||
- The [agent](README.md#agent) will create a new container environment, setting its root
|
||||
filesystem to that requested by the user (`ubuntu` in
|
||||
[the example](example-command.md)).
|
||||
- The agent will then execute the command (`sh(1)` in [the example](example-command.md))
|
||||
inside the new container.
|
||||
|
||||
The table below summarises the default mini O/S showing the environments that are created,
|
||||
the processes running in those environments (for all platforms) and
|
||||
the root filesystem used by each service:
|
||||
|
||||
| Process | Environment | rootfs | User accessible | Notes |
|
||||
|-|-|-|-|-|
|
||||
| [Agent](README.md#agent) | VM root | [VM guest image](#guest-image) | [debug console][debug-console] | Runs as the init daemon (PID 1) |
|
||||
| container workload | VM container | User specified (Ubuntu in this example) | [exec command](README.md#exec-command) | Managed by the agent |
|
||||
|
||||
> **Notes:**
|
||||
>
|
||||
> - The "User accessible" column shows how an administrator can access
|
||||
> the environment.
|
||||
>
|
||||
> - It is possible to use a standard init daemon such as systemd with
|
||||
> an initrd image if this is desirable.
|
||||
|
||||
See also the [process overview](README.md#process-overview).
|
||||
|
||||
#### Image summary
|
||||
|
||||
| Image type | Default distro | Init daemon | Reason | Notes |
|
||||
|-|-|-|-|-|
|
||||
| [image](background.md#root-filesystem-image) | [Clear Linux](https://clearlinux.org) (for x86_64 systems)| systemd | Minimal and highly optimized | systemd offers flexibility |
|
||||
| [initrd](#initrd-image) | [Alpine Linux](https://alpinelinux.org) | Kata [agent](README.md#agent) (as no systemd support) | Security hardened and tiny C library |
|
||||
|
||||
See also:
|
||||
|
||||
- The [osbuilder](../../../tools/osbuilder) tool
|
||||
|
||||
This is used to build all default image types.
|
||||
|
||||
- The [versions database](../../../versions.yaml)
|
||||
|
||||
The `default-image-name` and `default-initrd-name` options specify
|
||||
the default distributions for each image type.
|
41
docs/design/architecture/history.md
Normal file
41
docs/design/architecture/history.md
Normal file
@ -0,0 +1,41 @@
|
||||
# History
|
||||
|
||||
## Kata 1.x architecture
|
||||
|
||||
In the old [Kata 1.x architecture](https://github.com/kata-containers/documentation/blob/master/design/architecture.md),
|
||||
the Kata [runtime](README.md#runtime) was an executable called `kata-runtime`.
|
||||
The container manager called this executable multiple times when
|
||||
creating each container. Each time the runtime was called a different
|
||||
OCI command-line verb was provided. This architecture was simple, but
|
||||
not well suited to creating VM based containers due to the issue of
|
||||
handling state between calls. Additionally, the architecture suffered
|
||||
from performance issues related to continually having to spawn new
|
||||
instances of the runtime binary, and
|
||||
[Kata shim](https://github.com/kata-containers/shim) and
|
||||
[Kata proxy](https://github.com/kata-containers/proxy) processes for systems
|
||||
that did not provide VSOCK.
|
||||
|
||||
## Kata 2.x architecture
|
||||
|
||||
See the ["shimv2"](README.md#shim-v2-architecture) section of the
|
||||
architecture document.
|
||||
|
||||
## Architectural comparison
|
||||
|
||||
| Kata version | Kata Runtime process calls | Kata shim processes | Kata proxy processes (if no VSOCK) |
|
||||
|-|-|-|-|
|
||||
| 1.x | multiple per container | 1 per container connection | 1 |
|
||||
| 2.x | 1 per VM (hosting any number of containers) | 0 | 0 |
|
||||
|
||||
> **Notes:**
|
||||
>
|
||||
> - A single VM can host one or more containers.
|
||||
>
|
||||
> - The "Kata shim processes" column refers to the old
|
||||
> [Kata shim](https://github.com/kata-containers/shim) (`kata-shim` binary),
|
||||
> *not* the new shimv2 runtime instance (`containerd-shim-kata-v2` binary).
|
||||
|
||||
The diagram below shows how the original architecture was simplified
|
||||
with the advent of shimv2.
|
||||
|
||||

|
35
docs/design/architecture/kubernetes.md
Normal file
35
docs/design/architecture/kubernetes.md
Normal file
@ -0,0 +1,35 @@
|
||||
# Kubernetes support
|
||||
|
||||
[Kubernetes](https://github.com/kubernetes/kubernetes/), or K8s, is a popular open source
|
||||
container orchestration engine. In Kubernetes, a set of containers sharing resources
|
||||
such as networking, storage, mount, PID, etc. is called a
|
||||
[pod](https://kubernetes.io/docs/user-guide/pods/).
|
||||
|
||||
A node can have multiple pods, but at a minimum, a node within a Kubernetes cluster
|
||||
only needs to run a container runtime and a container agent (called a
|
||||
[Kubelet](https://kubernetes.io/docs/admin/kubelet/)).
|
||||
|
||||
Kata Containers represents a Kubelet pod as a VM.
|
||||
|
||||
A Kubernetes cluster runs a control plane where a scheduler (typically
|
||||
running on a dedicated master node) calls into a compute Kubelet. This
|
||||
Kubelet instance is responsible for managing the lifecycle of pods
|
||||
within the nodes and eventually relies on a container runtime to
|
||||
handle execution. The Kubelet architecture decouples lifecycle
|
||||
management from container execution through a dedicated gRPC based
|
||||
[Container Runtime Interface (CRI)](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/container-runtime-interface-v1.md).
|
||||
|
||||
In other words, a Kubelet is a CRI client and expects a CRI
|
||||
implementation to handle the server side of the interface.
|
||||
[CRI-O](https://github.com/kubernetes-incubator/cri-o) and
|
||||
[containerd](https://github.com/containerd/containerd/) are CRI
|
||||
implementations that rely on
|
||||
[OCI](https://github.com/opencontainers/runtime-spec) compatible
|
||||
runtimes for managing container instances.
|
||||
|
||||
Kata Containers is an officially supported CRI-O and containerd
|
||||
runtime. Refer to the following guides on how to set up Kata
|
||||
Containers with Kubernetes:
|
||||
|
||||
- [How to use Kata Containers and containerd](../../how-to/containerd-kata.md)
|
||||
- [Run Kata Containers with Kubernetes](../../how-to/run-kata-with-k8s.md)
|
48
docs/design/architecture/networking.md
Normal file
48
docs/design/architecture/networking.md
Normal file
@ -0,0 +1,48 @@
|
||||
# Networking
|
||||
|
||||
See the [networking document](networking.md).
|
||||
|
||||
Containers will typically live in their own, possibly shared, networking namespace.
|
||||
At some point in a container lifecycle, container engines will set up that namespace
|
||||
to add the container to a network which is isolated from the host network, but
|
||||
which is shared between containers
|
||||
|
||||
In order to do so, container engines will usually add one end of a virtual
|
||||
ethernet (`veth`) pair into the container networking namespace. The other end of
|
||||
the `veth` pair is added to the host networking namespace.
|
||||
|
||||
This is a very namespace-centric approach as many hypervisors or VM
|
||||
Managers (VMMs) such as `virt-manager` cannot handle `veth`
|
||||
interfaces. Typically, `TAP` interfaces are created for VM
|
||||
connectivity.
|
||||
|
||||
To overcome incompatibility between typical container engines expectations
|
||||
and virtual machines, Kata Containers networking transparently connects `veth`
|
||||
interfaces with `TAP` ones using Traffic Control:
|
||||
|
||||

|
||||
|
||||
With a TC filter in place, a redirection is created between the container network and the
|
||||
virtual machine. As an example, the CNI may create a device, `eth0`, in the container's network
|
||||
namespace, which is a VETH device. Kata Containers will create a tap device for the VM, `tap0_kata`,
|
||||
and setup a TC redirection filter to mirror traffic from `eth0`'s ingress to `tap0_kata`'s egress,
|
||||
and a second to mirror traffic from `tap0_kata`'s ingress to `eth0`'s egress.
|
||||
|
||||
Kata Containers maintains support for MACVTAP, which was an earlier implementation used in Kata. TC-filter
|
||||
is the default because it allows for simpler configuration, better CNI plugin compatibility, and performance
|
||||
on par with MACVTAP.
|
||||
|
||||
Kata Containers has deprecated support for bridge due to lacking performance relative to TC-filter and MACVTAP.
|
||||
|
||||
Kata Containers supports both
|
||||
[CNM](https://github.com/docker/libnetwork/blob/master/docs/design.md#the-container-network-model)
|
||||
and [CNI](https://github.com/containernetworking/cni) for networking management.
|
||||
|
||||
## Network Hotplug
|
||||
|
||||
Kata Containers has developed a set of network sub-commands and APIs to add, list and
|
||||
remove a guest network endpoint and to manipulate the guest route table.
|
||||
|
||||
The following diagram illustrates the Kata Containers network hotplug workflow.
|
||||
|
||||

|
44
docs/design/architecture/storage.md
Normal file
44
docs/design/architecture/storage.md
Normal file
@ -0,0 +1,44 @@
|
||||
# Storage
|
||||
|
||||
## virtio SCSI
|
||||
|
||||
If a block-based graph driver is [configured](README.md#configuration),
|
||||
`virtio-scsi` is used to _share_ the workload image (such as
|
||||
`busybox:latest`) into the container's environment inside the VM.
|
||||
|
||||
## virtio FS
|
||||
|
||||
If a block-based graph driver is _not_ [configured](README.md#configuration), a
|
||||
[`virtio-fs`](https://virtio-fs.gitlab.io) (`VIRTIO`) overlay
|
||||
filesystem mount point is used to _share_ the workload image instead. The
|
||||
[agent](README.md#agent) uses this mount point as the root filesystem for the
|
||||
container processes.
|
||||
|
||||
For virtio-fs, the [runtime](README.md#runtime) starts one `virtiofsd` daemon
|
||||
(that runs in the host context) for each VM created.
|
||||
|
||||
## Devicemapper
|
||||
|
||||
The
|
||||
[devicemapper `snapshotter`](https://github.com/containerd/containerd/tree/master/snapshots/devmapper)
|
||||
is a special case. The `snapshotter` uses dedicated block devices
|
||||
rather than formatted filesystems, and operates at the block level
|
||||
rather than the file level. This knowledge is used to directly use the
|
||||
underlying block device instead of the overlay file system for the
|
||||
container root file system. The block device maps to the top
|
||||
read-write layer for the overlay. This approach gives much better I/O
|
||||
performance compared to using `virtio-fs` to share the container file
|
||||
system.
|
||||
|
||||
#### Hot plug and unplug
|
||||
|
||||
Kata Containers has the ability to hot plug add and hot plug remove
|
||||
block devices. This makes it possible to use block devices for
|
||||
containers started after the VM has been launched.
|
||||
|
||||
Users can check to see if the container uses the `devicemapper` block
|
||||
device as its rootfs by calling `mount(8)` within the container. If
|
||||
the `devicemapper` block device is used, the root filesystem (`/`)
|
||||
will be mounted from `/dev/vda`. Users can disable direct mounting of
|
||||
the underlying block device through the runtime
|
||||
[configuration](README.md#configuration).
|
@ -41,7 +41,7 @@ Kata Containers with QEMU has complete compatibility with Kubernetes.
|
||||
Depending on the host architecture, Kata Containers supports various machine types,
|
||||
for example `pc` and `q35` on x86 systems, `virt` on ARM systems and `pseries` on IBM Power systems. The default Kata Containers
|
||||
machine type is `pc`. The machine type and its [`Machine accelerators`](#machine-accelerators) can
|
||||
be changed by editing the runtime [`configuration`](./architecture.md/#configuration) file.
|
||||
be changed by editing the runtime [`configuration`](architecture/README.md#configuration) file.
|
||||
|
||||
Devices and features used:
|
||||
- virtio VSOCK or virtio serial
|
||||
|
@ -76,7 +76,7 @@ then a new configuration file can be [created](#configure-kata-containers)
|
||||
and [configured][7].
|
||||
|
||||
[1]: https://docs.snapcraft.io/snaps/intro
|
||||
[2]: ../docs/design/architecture.md#root-filesystem-image
|
||||
[2]: ../docs/design/architecture/README.md#root-filesystem-image
|
||||
[3]: https://docs.snapcraft.io/reference/confinement#classic
|
||||
[4]: https://github.com/kata-containers/runtime#configuration
|
||||
[5]: https://docs.docker.com/engine/reference/commandline/dockerd
|
||||
|
@ -6,14 +6,14 @@ The Kata agent is a long running process that runs inside the Virtual Machine
|
||||
(VM) (also known as the "pod" or "sandbox").
|
||||
|
||||
The agent is packaged inside the Kata Containers
|
||||
[guest image](../../docs/design/architecture.md#guest-image)
|
||||
[guest image](../../docs/design/architecture/README.md#guest-image)
|
||||
which is used to boot the VM. Once the runtime has launched the configured
|
||||
[hypervisor](../../docs/hypervisors.md) to create a new VM, the agent is
|
||||
started. From this point on, the agent is responsible for creating and
|
||||
managing the life cycle of the containers inside the VM.
|
||||
|
||||
For further details, see the
|
||||
[architecture document](../../docs/design/architecture.md).
|
||||
[architecture document](../../docs/design/architecture).
|
||||
|
||||
## Audience
|
||||
|
||||
|
@ -70,7 +70,7 @@ See the
|
||||
|
||||
## Architecture overview
|
||||
|
||||
See the [architecture overview](../../docs/design/architecture.md)
|
||||
See the [architecture overview](../../docs/design/architecture)
|
||||
for details on the Kata Containers design.
|
||||
|
||||
## Configuration
|
||||
|
@ -135,7 +135,7 @@ There are three drawbacks about using CNM instead of CNI:
|
||||
|
||||
# Storage
|
||||
|
||||
See [Kata Containers Architecture](../../../docs/design/architecture.md#storage).
|
||||
See [Kata Containers Architecture](../../../docs/design/architecture/README.md#storage).
|
||||
|
||||
# Devices
|
||||
|
||||
|
Loading…
Reference in New Issue
Block a user