From 6f9efb4043f38a384d4f07c5e545a3377c0759ff Mon Sep 17 00:00:00 2001
From: "James O. D. Hunt"
Date: Wed, 15 Dec 2021 17:44:24 +0000
Subject: [PATCH 1/7] docs: Move arch doc to separate directory

Move the architecture document into a new `docs/design/architecture/`
directory in preparation for splitting it into more manageable pieces.

Signed-off-by: James O. D. Hunt
---
 docs/README.md                       |  2 +-
 docs/Upgrading.md                    |  2 +-
 docs/design/README.md                |  2 +-
 .../README.md}                       | 68 +++++++++----------
 docs/design/virtualization.md        |  2 +-
 snap/README.md                       |  2 +-
 src/agent/README.md                  |  4 +-
 src/runtime/README.md                |  2 +-
 src/runtime/virtcontainers/README.md |  2 +-
 9 files changed, 43 insertions(+), 43 deletions(-)
 rename docs/design/{architecture.md => architecture/README.md} (94%)

diff --git a/docs/README.md b/docs/README.md
index f5fd38eef7..7eb65a210c 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -41,7 +41,7 @@ Documents that help to understand and contribute to Kata Containers.
 
 ### Design and Implementations
 
-* [Kata Containers Architecture](design/architecture.md): Architectural overview of Kata Containers
+* [Kata Containers Architecture](design/architecture): Architectural overview of Kata Containers
 * [Kata Containers E2E Flow](design/end-to-end-flow.md): The entire end-to-end flow of Kata Containers
 * [Kata Containers design](./design/README.md): More Kata Containers design documents
 * [Kata Containers threat model](./threat-model/threat-model.md): Kata Containers threat model
diff --git a/docs/Upgrading.md b/docs/Upgrading.md
index ef633fe68f..0403b91bea 100644
--- a/docs/Upgrading.md
+++ b/docs/Upgrading.md
@@ -114,7 +114,7 @@ with containerd.
 > kernel or image.
 
 If you are using custom
-[guest assets](design/architecture.md#guest-assets),
+[guest assets](design/architecture/README.md#guest-assets),
 you must upgrade them to work with Kata Containers 2.x since Kata
 Containers 1.x assets will **not** work.
 
diff --git a/docs/design/README.md b/docs/design/README.md
index 1a334453ed..775f4d4c9f 100644
--- a/docs/design/README.md
+++ b/docs/design/README.md
@@ -2,7 +2,7 @@
 
 Kata Containers design documents:
 
-- [Kata Containers architecture](architecture.md)
+- [Kata Containers architecture](architecture)
 - [API Design of Kata Containers](kata-api-design.md)
 - [Design requirements for Kata Containers](kata-design-requirements.md)
 - [VSocks](VSocks.md)
diff --git a/docs/design/architecture.md b/docs/design/architecture/README.md
similarity index 94%
rename from docs/design/architecture.md
rename to docs/design/architecture/README.md
index 1281a2e367..296276a7b1 100644
--- a/docs/design/architecture.md
+++ b/docs/design/architecture/README.md
@@ -9,8 +9,8 @@ stronger [workload](#workload) isolation using hardware
 [virtualization](#virtualization) technology as a second layer of
 defence.
 
-Kata Containers runs on [multiple architectures](../../src/runtime/README.md#platform-support)
-and supports [multiple hypervisors](../hypervisors.md).
+Kata Containers runs on [multiple architectures](../../../src/runtime/README.md#platform-support)
+and supports [multiple hypervisors](../../hypervisors.md).
 
 This document is a summary of the Kata Containers architecture.
 
@@ -19,11 +19,11 @@ This document is a summary of the Kata Containers architecture.
 For details on how Kata Containers maps container concepts to VM
 technologies, and how this is realized in the multiple hypervisors and
 VMMs that Kata supports see the
-[virtualization documentation](./virtualization.md).
+[virtualization documentation](../virtualization.md).
 
 ## Compatibility
 
-The [Kata Containers runtime](../../src/runtime) is compatible with
+The [Kata Containers runtime](../../../src/runtime) is compatible with
 the [OCI](https://github.com/opencontainers)
 [runtime specification](https://github.com/opencontainers/runtime-spec)
 and therefore works seamlessly with the
@@ -104,7 +104,7 @@ available.
 The diagram below shows how the original architecture was simplified
 with the advent of shimv2.
 
-![Kubernetes integration with shimv2](arch-images/shimv2.svg)
+![Kubernetes integration with shimv2](../arch-images/shimv2.svg)
 
 ## Root filesystem
 
@@ -370,14 +370,14 @@ runtime cleans up the environment (which includes terminating the
 
 > **Note:**
 >
-> When [agent tracing is enabled](../tracing.md#agent-shutdown-behaviour),
+> When [agent tracing is enabled](../../tracing.md#agent-shutdown-behaviour),
 > the shutdown behaviour is different.
 
 #### Container manager requested shutdown
 
 If the container manager requests the container be deleted, the
 [runtime](#runtime) will signal the agent by sending it a
-`DestroySandbox` [ttRPC API](../../src/agent/protocols/protos/agent.proto) request.
+`DestroySandbox` [ttRPC API](../../../src/agent/protocols/protos/agent.proto) request.
 
 ## Guest assets
 
@@ -388,7 +388,7 @@ small root filesystem image to boot the VM.
 
 ### Guest kernel
 
-The [guest kernel](../../tools/packaging/kernel)
+The [guest kernel](../../../tools/packaging/kernel)
 is passed to the hypervisor and used to boot the VM.
 The default kernel provided in Kata Containers is highly optimized for
 kernel boot time and minimal memory footprint, providing only those
@@ -400,9 +400,9 @@ Linux LTS (Long Term Support) [kernel](https://www.kernel.org).
 The hypervisor uses an image file which provides a minimal root
 filesystem used by the guest kernel to boot the VM and host the Kata
 Container. Kata Containers supports both initrd and rootfs based
-minimal guest images. The [default packages](../install/) provide both
+minimal guest images. The [default packages](../../install/) provide both
 an image and an initrd, both of which are created using the
-[`osbuilder`](../../tools/osbuilder) tool.
+[`osbuilder`](../../../tools/osbuilder) tool.
 
 > **Notes:**
 >
@@ -419,12 +419,12 @@ an image and an initrd, both of which are created using the
 > Fedora or any other distribution potentially.
 >
 > The `osbuilder` tool provides
-> [configurations for various common Linux distributions](../../tools/osbuilder/rootfs-builder)
+> [configurations for various common Linux distributions](../../../tools/osbuilder/rootfs-builder)
 > which can be built into either initrd or rootfs guest images.
 >
 > - If you are using a [packaged version of Kata
-> Containers](../install), you can see image details by running the
-> [`kata-collect-data.sh`](../../src/runtime/data/kata-collect-data.sh.in)
+> Containers](../../install), you can see image details by running the
+> [`kata-collect-data.sh`](../../../src/runtime/data/kata-collect-data.sh.in)
 > script as `root` and looking at the "Image details" section of the
 > output.
 
@@ -468,7 +468,7 @@ See also the [process overview](#process-overview).
 > - The container workload is running inside a full container
 > environment which itself is running within a VM environment.
 >
-> - See the [configuration files for the `osbuilder` tool](../../tools/osbuilder/rootfs-builder)
+> - See the [configuration files for the `osbuilder` tool](../../../tools/osbuilder/rootfs-builder)
 > for details of the default distribution for platforms other than
 > Intel x86_64.
 
@@ -520,18 +520,18 @@ See also the [process overview](#process-overview).
 
 See also:
 
-- The [osbuilder](../../tools/osbuilder) tool
+- The [osbuilder](../../../tools/osbuilder) tool
 
 This is used to build all default image types.
 
-- The [versions database](../../versions.yaml)
+- The [versions database](../../../versions.yaml)
 
 The `default-image-name` and `default-initrd-name` options specify
 the default distributions for each image type.
 
 ## Hypervisor
 
-The [hypervisor](../hypervisors.md) specified in the
+The [hypervisor](../../hypervisors.md) specified in the
 [configuration file](#configuration) creates a VM to host the
 [agent](#agent) and the [workload](#workload) inside the
 [container environment](#environments).
@@ -548,7 +548,7 @@ The [hypervisor](../hypervisors.md) specified in the
 
 ## Agent
 
-The Kata Containers agent ([`kata-agent`](../../src/agent)), written
+The Kata Containers agent ([`kata-agent`](../../../src/agent)), written
 in the [Rust programming language](https://www.rust-lang.org), is a
 long running process that runs inside the VM. It acts as the
 supervisor for managing the containers and the [workload](#workload)
@@ -560,7 +560,7 @@ for each VM created.
 The agent communicates with the other Kata components (primarily the
 [runtime](#runtime)) using a
 [`ttRPC`](https://github.com/containerd/ttrpc-rust) based
-[protocol](../../src/agent/protocols/protos).
+[protocol](../../../src/agent/protocols/protos).
 
 > **Note:**
 >
@@ -572,7 +572,7 @@ The agent communicates with the other Kata components (primarily the
 
 ## Runtime
 
-The Kata Containers runtime (the [`containerd-shim-kata-v2`](../../src/runtime/cmd/containerd-shim-kata-v2
+The Kata Containers runtime (the [`containerd-shim-kata-v2`](../../../src/runtime/cmd/containerd-shim-kata-v2
 ) binary) is a [shimv2](#shim-v2-architecture) compatible runtime.
 
 > **Note:**
@@ -583,7 +583,7 @@ The Kata Containers runtime (the [`containerd-shim-kata-v2`](../../src/runtime/c
 > shim v2 API.
 
 The runtime makes heavy use of the [`virtcontainers`
-package](../../src/runtime/virtcontainers), which provides a generic,
+package](../../../src/runtime/virtcontainers), which provides a generic,
 runtime-specification agnostic, hardware-virtualized containers
 library.
 
@@ -616,13 +616,13 @@ The `exec` command allows an administrator or developer to enter the
 [VM root environment](#environments) which is not accessible by the
 container [workload](#workload).
 
-See [the developer guide](../Developer-Guide.md#connect-to-debug-console) for further details.
+See [the developer guide](../../Developer-Guide.md#connect-to-debug-console) for further details.
 
 ### Configuration
 
-See the [configuration file details](../../src/runtime/README.md#configuration).
+See the [configuration file details](../../../src/runtime/README.md#configuration).
 
-The configuration file is also used to enable runtime [debug output](../Developer-Guide.md#enable-full-debug).
+The configuration file is also used to enable runtime [debug output](../../Developer-Guide.md#enable-full-debug).
 
 ## Process overview
 
@@ -656,7 +656,7 @@ To overcome incompatibility between typical container engines expectations
 and virtual machines, Kata Containers networking transparently connects `veth`
 interfaces with `TAP` ones using Traffic Control:
 
-![Kata Containers networking](arch-images/network.png)
+![Kata Containers networking](../arch-images/network.png)
 
 With a TC filter in place, a redirection is created between the container network and the
 virtual machine. As an example, the CNI may create a device, `eth0`, in the container's network
@@ -681,7 +681,7 @@ remove a guest network endpoint and to manipulate the guest route table.
 
 The following diagram illustrates the Kata Containers network hotplug workflow.
 
-![Network Hotplug](arch-images/kata-containers-network-hotplug.png)
+![Network Hotplug](../arch-images/kata-containers-network-hotplug.png)
 
 ## Storage
 
@@ -761,8 +761,8 @@
 Kata Containers is an officially supported CRI-O and containerd runtime. Refer to
 the following guides on how to set up Kata Containers with Kubernetes:
 
-- [How to use Kata Containers and containerd](../how-to/containerd-kata.md)
-- [Run Kata Containers with Kubernetes](../how-to/run-kata-with-k8s.md)
+- [How to use Kata Containers and containerd](../../how-to/containerd-kata.md)
+- [Run Kata Containers with Kubernetes](../../how-to/run-kata-with-k8s.md)
 
 #### OCI annotations
 
@@ -792,11 +792,11 @@ with a Kubernetes pod:
 With `RuntimeClass`, users can define Kata Containers as a `RuntimeClass` and
 then explicitly specify that a pod must be created as a Kata Containers pod.
 For details, please refer to [How to use
-Kata Containers and containerd](../../docs/how-to/containerd-kata.md).
+Kata Containers and containerd](../../../docs/how-to/containerd-kata.md).
 
 ## Tracing
 
-The [tracing document](../tracing.md) provides details on the tracing
+The [tracing document](../../tracing.md) provides details on the tracing
 architecture.
 
 # Appendices
@@ -846,19 +846,19 @@ more traditional VM file and device mapping mechanisms:
 - Utilizing `mmap(2)`'s `MAP_SHARED` shared memory option on the host
 allows the host to efficiently share pages.
 
-![DAX](arch-images/DAX.png)
+![DAX](../arch-images/DAX.png)
 
 For further details of the use of NVDIMM with QEMU, see the [QEMU
 project documentation](https://www.qemu.org).
 
 ## Agent control tool
 
-The [agent control tool](../../src/tools/agent-ctl) is a test and
+The [agent control tool](../../../src/tools/agent-ctl) is a test and
 development tool that can be used to learn more about a Kata
 Containers system.
 
 ## Terminology
 
-See the [project glossary](../../Glossary.md).
+See the [project glossary](../../../Glossary.md).
 
-[debug-console]: ../Developer-Guide.md#connect-to-debug-console
+[debug-console]: ../../Developer-Guide.md#connect-to-debug-console
diff --git a/docs/design/virtualization.md b/docs/design/virtualization.md
index eab3d6602a..75ec62bf46 100644
--- a/docs/design/virtualization.md
+++ b/docs/design/virtualization.md
@@ -41,7 +41,7 @@ Kata Containers with QEMU has complete compatibility with Kubernetes.
 Depending on the host architecture, Kata Containers supports various machine types,
 for example `pc` and `q35` on x86 systems, `virt` on ARM systems and `pseries` on IBM Power systems. The default Kata Containers
 machine type is `pc`. The machine type and its [`Machine accelerators`](#machine-accelerators) can
-be changed by editing the runtime [`configuration`](./architecture.md/#configuration) file.
+be changed by editing the runtime [`configuration`](architecture/README.md#configuration) file.
 
 Devices and features used:
 - virtio VSOCK or virtio serial
diff --git a/snap/README.md b/snap/README.md
index 1114315bef..3729542ecd 100644
--- a/snap/README.md
+++ b/snap/README.md
@@ -76,7 +76,7 @@ then a new configuration file can be [created](#configure-kata-containers)
 and [configured][7].
 
 [1]: https://docs.snapcraft.io/snaps/intro
-[2]: ../docs/design/architecture.md#root-filesystem-image
+[2]: ../docs/design/architecture/README.md#root-filesystem-image
 [3]: https://docs.snapcraft.io/reference/confinement#classic
 [4]: https://github.com/kata-containers/runtime#configuration
 [5]: https://docs.docker.com/engine/reference/commandline/dockerd
diff --git a/src/agent/README.md b/src/agent/README.md
index 98ec59bbdf..24161cebed 100644
--- a/src/agent/README.md
+++ b/src/agent/README.md
@@ -6,14 +6,14 @@ The Kata agent is a long running process that runs inside the Virtual Machine
 (VM) (also known as the "pod" or "sandbox").
 
 The agent is packaged inside the Kata Containers
-[guest image](../../docs/design/architecture.md#guest-image)
+[guest image](../../docs/design/architecture/README.md#guest-image)
 which is used to boot the VM. Once the runtime has launched the configured
 [hypervisor](../../docs/hypervisors.md) to create a new VM, the agent is
 started. From this point on, the agent is responsible for creating and
 managing the life cycle of the containers inside the VM.
 
 For further details, see the
-[architecture document](../../docs/design/architecture.md).
+[architecture document](../../docs/design/architecture).
 
 ## Audience
 
diff --git a/src/runtime/README.md b/src/runtime/README.md
index 2a893681f1..c7de965445 100644
--- a/src/runtime/README.md
+++ b/src/runtime/README.md
@@ -70,7 +70,7 @@ See the
 
 ## Architecture overview
 
-See the [architecture overview](../../docs/design/architecture.md)
+See the [architecture overview](../../docs/design/architecture)
 for details on the Kata Containers design.
 
 ## Configuration
diff --git a/src/runtime/virtcontainers/README.md b/src/runtime/virtcontainers/README.md
index 97f01cfb91..bd090b65d6 100644
--- a/src/runtime/virtcontainers/README.md
+++ b/src/runtime/virtcontainers/README.md
@@ -135,7 +135,7 @@ There are three drawbacks about using CNM instead of CNI:
 
 # Storage
 
-See [Kata Containers Architecture](../../../docs/design/architecture.md#storage).
+See [Kata Containers Architecture](../../../docs/design/architecture/README.md#storage).
 
 # Devices
 
From 283d7d52c8e52261c84ebe3d211a83ef8ecea408 Mon Sep 17 00:00:00 2001
From: "James O. D. Hunt"
Date: Wed, 15 Dec 2021 18:03:55 +0000
Subject: [PATCH 2/7] docs: Split history out of arch doc

Move the historical details out of the architecture doc and into a
separate file.

Partially fixes: #3246.

Signed-off-by: James O. D. Hunt
---
 docs/design/architecture/README.md  | 40 +++-------------------------
 docs/design/architecture/history.md | 41 +++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+), 36 deletions(-)
 create mode 100644 docs/design/architecture/history.md

diff --git a/docs/design/architecture/README.md b/docs/design/architecture/README.md
index 296276a7b1..73d1ef459d 100644
--- a/docs/design/architecture/README.md
+++ b/docs/design/architecture/README.md
@@ -39,22 +39,10 @@ Kata Containers provides a ["shimv2"](#shim-v2-architecture) compatible runtime.
 The Kata Containers runtime is shim v2 ("shimv2") compatible. This
 section explains what this means.
 
-### History
-
-In the old [Kata 1.x architecture](https://github.com/kata-containers/documentation/blob/master/design/architecture.md),
-the Kata [runtime](#runtime) was an executable called `kata-runtime`.
-The container manager called this executable multiple times when
-creating each container. Each time the runtime was called a different
-OCI command-line verb was provided. This architecture was simple, but
-not well suited to creating VM based containers due to the issue of
-handling state between calls. Additionally, the architecture suffered
-from performance issues related to continually having to spawn new
-instances of the runtime binary, and
-[Kata shim](https://github.com/kata-containers/shim) and
-[Kata proxy](https://github.com/kata-containers/proxy) processes for systems
-that did not provide VSOCK.
-
-### An improved architecture
+> **Note:**
+>
+> For a comparison with the Kata 1.x architecture, see
+> [the architectural history document](history.md).
 
 The
 [containerd runtime shimv2 architecture](https://github.com/containerd/containerd/tree/main/runtime/v2)
@@ -86,26 +74,6 @@ launch both Pod and OCI compatible containers with a single
 alone `kata-proxy` process is required, even if VSOCK is not
 available.
 
-### Architectural comparison
-
-| Kata version | Kata Runtime process calls | Kata shim processes | Kata proxy processes (if no VSOCK) |
-|-|-|-|-|
-| 1.x | multiple per container | 1 per container connection | 1 |
-| 2.x | 1 per VM (hosting any number of containers) | 0 | 0 |
-
-> **Notes:**
->
-> - A single VM can host one or more containers.
->
-> - The "Kata shim processes" column refers to the old
-> [Kata shim](https://github.com/kata-containers/shim) (`kata-shim` binary),
-> *not* the new shimv2 runtime instance (`containerd-shim-kata-v2` binary).
-
-The diagram below shows how the original architecture was simplified
-with the advent of shimv2.
-
-![Kubernetes integration with shimv2](../arch-images/shimv2.svg)
-
 ## Root filesystem
 
 This document uses the term _rootfs_ to refer to a root filesystem
diff --git a/docs/design/architecture/history.md b/docs/design/architecture/history.md
new file mode 100644
index 0000000000..ca23396e5b
--- /dev/null
+++ b/docs/design/architecture/history.md
@@ -0,0 +1,41 @@
+# History
+
+## Kata 1.x architecture
+
+In the old [Kata 1.x architecture](https://github.com/kata-containers/documentation/blob/master/design/architecture.md),
+the Kata [runtime](README.md#runtime) was an executable called `kata-runtime`.
+The container manager called this executable multiple times when
+creating each container. Each time the runtime was called a different
+OCI command-line verb was provided. This architecture was simple, but
+not well suited to creating VM based containers due to the issue of
+handling state between calls. Additionally, the architecture suffered
+from performance issues related to continually having to spawn new
+instances of the runtime binary, and
+[Kata shim](https://github.com/kata-containers/shim) and
+[Kata proxy](https://github.com/kata-containers/proxy) processes for systems
+that did not provide VSOCK.
+
+## Kata 2.x architecture
+
+See the ["shimv2"](README.md#shim-v2-architecture) section of the
+architecture document.
+
+## Architectural comparison
+
+| Kata version | Kata Runtime process calls | Kata shim processes | Kata proxy processes (if no VSOCK) |
+|-|-|-|-|
+| 1.x | multiple per container | 1 per container connection | 1 |
+| 2.x | 1 per VM (hosting any number of containers) | 0 | 0 |
+
+> **Notes:**
+>
+> - A single VM can host one or more containers.
+>
+> - The "Kata shim processes" column refers to the old
+> [Kata shim](https://github.com/kata-containers/shim) (`kata-shim` binary),
+> *not* the new shimv2 runtime instance (`containerd-shim-kata-v2` binary).
+
+The diagram below shows how the original architecture was simplified
+with the advent of shimv2.
+
+![Kubernetes integration with shimv2](../arch-images/shimv2.svg)
From 7229b7a69dddc1df4b1cdb2dc99bd550df65ea65 Mon Sep 17 00:00:00 2001
From: "James O. D. Hunt"
Date: Thu, 16 Dec 2021 11:07:40 +0000
Subject: [PATCH 3/7] docs: Split background and example out of arch doc

Move the background and example command details out of the architecture
doc and into separate files.

Partially fixes: #3246.

Signed-off-by: James O. D. Hunt
---
 docs/design/architecture/README.md          | 164 ++++----------------
 docs/design/architecture/background.md      |  81 ++++++++++
 docs/design/architecture/example-command.md |  30 ++++
 3 files changed, 145 insertions(+), 130 deletions(-)
 create mode 100644 docs/design/architecture/background.md
 create mode 100644 docs/design/architecture/example-command.md

diff --git a/docs/design/architecture/README.md b/docs/design/architecture/README.md
index 73d1ef459d..5baaee56e4 100644
--- a/docs/design/architecture/README.md
+++ b/docs/design/architecture/README.md
@@ -14,6 +14,18 @@ and supports [multiple hypervisors](../../hypervisors.md).
 
 This document is a summary of the Kata Containers architecture.
 
+## Background knowledge
+
+This document assumes the reader understands a number of concepts
+related to containers and file systems. The
+[background](background.md) document explains these concepts.
+
+## Example command
+
+This document makes use of a particular [example
+command](example-command.md) throughout the text to illustrate certain
+concepts.
+
 ## Virtualization
 
 For details on how Kata Containers maps container concepts to VM
@@ -74,126 +86,18 @@ launch both Pod and OCI compatible containers with a single
 alone `kata-proxy` process is required, even if VSOCK is not
 available.
 
-## Root filesystem
-
-This document uses the term _rootfs_ to refer to a root filesystem
-which is mounted as the top-level directory ("`/`") and often referred
-to as _slash_.
-
-It is important to understand this term since the overall system uses
-multiple different rootfs's (as explained in the
-[Environments](#environments) section.
-
-## Example command
-
-The following containerd command creates a container. It is referred
-to throughout this document to help explain various points:
-
-```bash
-$ sudo ctr run --runtime "io.containerd.kata.v2" --rm -t "quay.io/libpod/ubuntu:latest" foo sh
-```
-
-This command requests that containerd:
-
-- Create a container (`ctr run`).
-- Use the Kata [shimv2](#shim-v2-architecture) runtime (`--runtime "io.containerd.kata.v2"`).
-- Delete the container when it [exits](#workload-exit) (`--rm`).
-- Attach the container to the user's terminal (`-t`).
-- Use the Ubuntu Linux [container image](#container-image)
- to create the container [rootfs](#root-filesystem) that will become
- the [container environment](#environments)
- (`quay.io/libpod/ubuntu:latest`).
-- Create the container with the name "`foo`".
-- Run the `sh(1)` command in the Ubuntu rootfs based container
- environment.
-
- The command specified here is referred to as the [workload](#workload).
-
-> **Note:**
->
-> For the purposes of this document and to keep explanations
-> simpler, we assume the user is running this command in the
-> [host environment](#environments).
-
-## Container image
-
-In the [example command](#example-command) the user has specified the
-type of container they wish to run via the container image name:
-`ubuntu`. This image name corresponds to a _container image_ that can
-be used to create a container with an Ubuntu Linux environment. Hence,
-in our [example](#example-command), the `sh(1)` command will be run
-inside a container which has an Ubuntu rootfs.
-
-> **Note:**
->
-> The term _container image_ is confusing since the image in question
-> is **not** a container: it is simply a set of files (_an image_)
-> that can be used to _create_ a container. The term _container
-> template_ would be more accurate but the term _container image_ is
-> commonly used so this document uses the standard term.
-
-For the purposes of this document, the most important part of the
-[example command line](#example-command) is the container image the
-user has requested. Normally, the container manager will _pull_
-(download) a container image from a remote site and store a copy
-locally. This local container image is used by the container manager
-to create an [OCI bundle](#oci-bundle) which will form the environment
-the container will run in. After creating the OCI bundle, the
-container manager launches a [runtime](#runtime) which will create the
-container using the provided OCI bundle.
-
-## OCI bundle
-
-To understand what follows, it is important to know at a high level
-how an OCI ([Open Containers Initiative](https://opencontainers.org)) compatible container is created.
-
-An OCI compatible container is created by taking a
-[container image](#container-image) and converting the embedded rootfs
-into an
-[OCI rootfs bundle](https://github.com/opencontainers/runtime-spec/blob/main/bundle.md),
-or more simply, an _OCI bundle_.
-
-An OCI bundle is a `tar(1)` archive normally created by a container
-manager which is passed to an OCI [runtime](#runtime) which converts
-it into a full container rootfs. The bundle contains two assets:
-
-- A container image [rootfs](#root-filesystem)
-
- This is simply a directory of files that will be used to represent
- the rootfs for the container.
-
- For the [example command](#example-command), the directory will
- contain the files necessary to create a minimal Ubuntu root
- filesystem.
-
-- An [OCI configuration file](https://github.com/opencontainers/runtime-spec/blob/main/config.md)
-
- This is a JSON file called `config.json`.
-
- The container manager will create this file so that:
-
- - The `root.path` value is set to the full path of the specified
- container rootfs.
-
- In [the example](#example-command) this value will be `ubuntu`.
-
- - The `process.args` array specifies the list of commands the user
- wishes to run. This is known as the [workload](#workload).
-
- In [the example](#example-command) the workload is `sh(1)`.
-
 ## Workload
 
 The workload is the command the user requested to run in the
-container and is specified in the [OCI bundle](#oci-bundle)'s
+container and is specified in the [OCI bundle](background.md#oci-bundle)'s
 configuration file.
 
-In our [example](#example-command), the workload is the `sh(1)` command.
+In our [example](example-command.md), the workload is the `sh(1)` command.
 
 ### Workload root filesystem
 
 For details of how the [runtime](#runtime) makes the
-[container image](#container-image) chosen by the user available to
+[container image](background.md#container-image) chosen by the user available to
 the workload process, see the
 [Container creation](#container-creation) and [storage](#storage)
 sections.
@@ -214,7 +118,7 @@ to study this table closely to make sense of what follows:
 |-|-|-|-|-|-|-|-|
 | Host | Host | no `[1]` | no | Host specific | Host specific | Host specific | The environment provided by a standard, physical non virtualized system. |
 | VM root | Guest VM | yes | no | rootfs inside the [guest image](#guest-image) | Hypervisor specific `[2]` | `ext4` | The first (or top) level VM environment created on a host system. |
-| VM container root | Container | yes | yes | rootfs type requested by user ([`ubuntu` in the example](#example-command)) | `kataShared` | [virtio FS](#virtio-fs) | The first (or top) level container environment created inside the VM. Based on the [OCI bundle](#oci-bundle). |
+| VM container root | Container | yes | yes | rootfs type requested by user ([`ubuntu` in the example](example-command.md)) | `kataShared` | [virtio FS](#virtio-fs) | The first (or top) level container environment created inside the VM. Based on the [OCI bundle](background.md#oci-bundle). |
 
 **Key:**
 
@@ -226,7 +130,7 @@ to study this table closely to make sense of what follows:
 > **Notes:**
 >
 > - The word "root" is used to mean _top level_ here in a similar
-> manner to the term [rootfs](#root-filesystem).
+> manner to the term [rootfs](background.md#root-filesystem).
 >
 > - The term "first level" prefix used above is important since it implies
 > that it is possible to create multi level systems. However, they do
@@ -247,7 +151,7 @@ The steps below show at a high level how a Kata Containers container is
 created using the containerd container manager:
 
 1. The user requests the creation of a container by running a command
- like the [example command](#example-command).
+ like the [example command](example-command.md).
 1. The container manager daemon runs a single instance of the Kata
 [runtime](#runtime).
 1. The Kata runtime loads its [configuration file](#configuration).
@@ -257,9 +161,9 @@ created using the containerd container manager:
 [guest assets](#guest-assets):
 
 - The hypervisor [DAX](#dax) shares the [guest image](#guest-image)
- into the VM to become the VM [rootfs](#root-filesystem) (mounted on a `/dev/pmem*` device),
+ into the VM to become the VM [rootfs](background.md#root-filesystem) (mounted on a `/dev/pmem*` device),
 which is known as the [VM root environment](#environments).
- - The hypervisor mounts the [OCI bundle](#oci-bundle), using [virtio FS](#virtio-fs),
+ - The hypervisor mounts the [OCI bundle](background.md#oci-bundle), using [virtio FS](#virtio-fs),
 into a container specific directory inside the VM's rootfs.
 
 This container specific directory will become the
@@ -300,10 +204,10 @@ created using the containerd container manager:
 >
 > At this point, the container is running and:
 >
-> - The [workload](#workload) process ([`sh(1)` in the example](#example-command))
+> - The [workload](#workload) process ([`sh(1)` in the example](example-command.md))
 > is running in the [container environment](#environments).
 > - The user is now able to interact with the workload
-> (using the [`ctr` command in the example](#example-command)).
+> (using the [`ctr` command in the example](example-command.md)).
 > - The [agent](#agent), running inside the VM is monitoring the
 > [workload](#workload) process.
 > - The [runtime](#runtime) is waiting for the agent's `WaitProcess` API
@@ -402,7 +306,7 @@ The default packaged rootfs image, sometimes referred to as the _mini
 O/S_, is a highly optimized container bootstrap system.
 
 If this image type is [configured](#configuration), when the user runs
-the [example command](#example-command):
+the [example command](example-command.md):
 
 - The [runtime](#runtime) will launch the configured [hypervisor](#hypervisor).
 - The hypervisor will boot the mini-OS image using the [guest kernel](#guest-kernel).
@@ -410,8 +314,8 @@ the [example command](#example-command):
 - `systemd`, running inside the mini-OS context, will launch the [agent](#agent)
 in the root context of the VM.
 - The agent will create a new container environment, setting its root
- filesystem to that requested by the user (Ubuntu in [the example](#example-command)).
-- The agent will then execute the command (`sh(1)` in [the example](#example-command))
+ filesystem to that requested by the user (Ubuntu in [the example](example-command.md)).
+- The agent will then execute the command (`sh(1)` in [the example](example-command.md))
 inside the new container.
The table below summarises the default mini O/S showing the @@ -424,7 +328,7 @@ each service: | systemd | VM root | n/a | [VM guest image](#guest-image)| [debug console][debug-console] | The init daemon, running as PID 1 | | [Agent](#agent) | VM root | yes | [VM guest image](#guest-image)| [debug console][debug-console] | Runs as a systemd service | | `chronyd` | VM root | yes | [VM guest image](#guest-image)| [debug console][debug-console] | Used to synchronise the time with the host | -| container workload (`sh(1)` in [the example](#example-command)) | VM container | no | User specified (Ubuntu in [the example](#example-command)) | [exec command](#exec-command) | Managed by the agent | +| container workload (`sh(1)` in [the example](example-command.md)) | VM container | no | User specified (Ubuntu in [the example](example-command.md)) | [exec command](#exec-command) | Managed by the agent | See also the [process overview](#process-overview). @@ -448,7 +352,7 @@ startup process. During startup, the kernel unpacks it into a special instance of a `tmpfs` mount that becomes the initial root filesystem. If this image type is [configured](#configuration), when the user runs -the [example command](#example-command): +the [example command](example-command.md): - The [runtime](#runtime) will launch the configured [hypervisor](#hypervisor). - The hypervisor will boot the mini-OS image using the [guest kernel](#guest-kernel). @@ -456,8 +360,8 @@ the [example command](#example-command): inside the VM root environment. - The [agent](#agent) will create a new container environment, setting its root filesystem to that requested by the user (`ubuntu` in - [the example](#example-command)). -- The agent will then execute the command (`sh(1)` in [the example](#example-command)) + [the example](example-command.md)). +- The agent will then execute the command (`sh(1)` in [the example](example-command.md)) inside the new container. 
The table below summarises the default mini O/S showing the environments that are created, @@ -483,7 +387,7 @@ See also the [process overview](#process-overview). | Image type | Default distro | Init daemon | Reason | Notes | |-|-|-|-|-| -| [image](#root-filesystem-image) | [Clear Linux](https://clearlinux.org) (for x86_64 systems)| systemd | Minimal and highly optimized | systemd offers flexibility | +| [image](background.md#root-filesystem-image) | [Clear Linux](https://clearlinux.org) (for x86_64 systems)| systemd | Minimal and highly optimized | systemd offers flexibility | | [initrd](#initrd-image) | [Alpine Linux](https://alpinelinux.org) | Kata [agent](#agent) (as no systemd support) | Security hardened and tiny C library | See also: @@ -596,13 +500,13 @@ The configuration file is also used to enable runtime [debug output](../../Devel The table below shows an example of the main processes running in the different [environments](#environments) when a Kata Container is -created with containerd using our [example command](#example-command): +created with containerd using our [example command](example-command.md): | Description | Host | VM root environment | VM container environment | |-|-|-|-| | Container manager | `containerd` | | | Kata Containers | [runtime](#runtime), [`virtiofsd`](#virtio-fs), [hypervisor](#hypervisor) | [agent](#agent) | -| User [workload](#workload) | | | [`ubuntu sh`](#example-command) | +| User [workload](#workload) | | | [`ubuntu sh`](example-command.md) | ## Networking @@ -776,7 +680,7 @@ Kata Containers utilizes the Linux kernel DAX feature to efficiently map the [guest image](#guest-image) in the [host environment](#environments) into the [guest VM environment](#environments) to become the VM's -[rootfs](#root-filesystem). +[rootfs](background.md#root-filesystem). 
If the [configured](#configuration) [hypervisor](#hypervisor) is set to either QEMU or Cloud Hypervisor, DAX is used with the feature shown @@ -789,7 +693,7 @@ in the table below: The features in the table above are equivalent in that they provide a memory-mapped virtual device which is used to DAX map the VM's -[rootfs](#root-filesystem) into the [VM guest](#environments) memory +[rootfs](background.md#root-filesystem) into the [VM guest](#environments) memory address space. The VM is then booted, specifying the `root=` kernel parameter to make diff --git a/docs/design/architecture/background.md b/docs/design/architecture/background.md new file mode 100644 index 0000000000..b052293b76 --- /dev/null +++ b/docs/design/architecture/background.md @@ -0,0 +1,81 @@ +# Kata Containers architecture background knowledge + +The following sections explain some of the background concepts +required to understand the [architecture document](README.md). + +## Root filesystem + +This document uses the term _rootfs_ to refer to a root filesystem +which is mounted as the top-level directory ("`/`") and often referred +to as _slash_. + +It is important to understand this term since the overall system uses +multiple different rootfs's (as explained in the +[Environments](README.md#environments) section). + +## Container image + +In the [example command](example-command.md) the user has specified the +type of container they wish to run via the container image name: +`ubuntu`. This image name corresponds to a _container image_ that can +be used to create a container with an Ubuntu Linux environment. Hence, +in our [example](example-command.md), the `sh(1)` command will be run +inside a container which has an Ubuntu rootfs. + +> **Note:** +> +> The term _container image_ is confusing since the image in question +> is **not** a container: it is simply a set of files (_an image_) +> that can be used to _create_ a container. 
The term _container +> template_ would be more accurate but the term _container image_ is +> commonly used so this document uses the standard term. + +For the purposes of this document, the most important part of the +[example command line](example-command.md) is the container image the +user has requested. Normally, the container manager will _pull_ +(download) a container image from a remote site and store a copy +locally. This local container image is used by the container manager +to create an [OCI bundle](#oci-bundle) which will form the environment +the container will run in. After creating the OCI bundle, the +container manager launches a [runtime](README.md#runtime) which will create the +container using the provided OCI bundle. + +## OCI bundle + +To understand what follows, it is important to know at a high level +how an OCI ([Open Containers Initiative](https://opencontainers.org)) compatible container is created. + +An OCI compatible container is created by taking a +[container image](#container-image) and converting the embedded rootfs +into an +[OCI rootfs bundle](https://github.com/opencontainers/runtime-spec/blob/main/bundle.md), +or more simply, an _OCI bundle_. + +An OCI bundle is a `tar(1)` archive normally created by a container +manager which is passed to an OCI [runtime](README.md#runtime) which converts +it into a full container rootfs. The bundle contains two assets: + +- A container image [rootfs](#root-filesystem) + + This is simply a directory of files that will be used to represent + the rootfs for the container. + + For the [example command](example-command.md), the directory will + contain the files necessary to create a minimal Ubuntu root + filesystem. + +- An [OCI configuration file](https://github.com/opencontainers/runtime-spec/blob/main/config.md) + + This is a JSON file called `config.json`. 
+ + The container manager will create this file so that: + + - The `root.path` value is set to the full path of the specified + container rootfs. + + In [the example](example-command.md) this value will be `ubuntu`. + + - The `process.args` array specifies the list of commands the user + wishes to run. This is known as the [workload](README.md#workload). + + In [the example](example-command.md) the workload is `sh(1)`. diff --git a/docs/design/architecture/example-command.md b/docs/design/architecture/example-command.md new file mode 100644 index 0000000000..559e5dfd0c --- /dev/null +++ b/docs/design/architecture/example-command.md @@ -0,0 +1,30 @@ +# Example command + +The following containerd command creates a container. It is referred +to throughout the architecture document to help explain various points: + +```bash +$ sudo ctr run --runtime "io.containerd.kata.v2" --rm -t "quay.io/libpod/ubuntu:latest" foo sh +``` + +This command requests that containerd: + +- Create a container (`ctr run`). +- Use the Kata [shimv2](README.md#shim-v2-architecture) runtime (`--runtime "io.containerd.kata.v2"`). +- Delete the container when it [exits](README.md#workload-exit) (`--rm`). +- Attach the container to the user's terminal (`-t`). +- Use the Ubuntu Linux [container image](background.md#container-image) + to create the container [rootfs](background.md#root-filesystem) that will become + the [container environment](README.md#environments) + (`quay.io/libpod/ubuntu:latest`). +- Create the container with the name "`foo`". +- Run the `sh(1)` command in the Ubuntu rootfs based container + environment. + + The command specified here is referred to as the [workload](README.md#workload). + +> **Note:** +> +> For the purposes of this document and to keep explanations +> simpler, we assume the user is running this command in the +> [host environment](README.md#environments). From 5df0cb642055bc4c9c485974e5dc67d5739792df Mon Sep 17 00:00:00 2001 From: "James O. D. 
Hunt" Date: Thu, 16 Dec 2021 12:19:10 +0000 Subject: [PATCH 4/7] docs: Split storage out of arch doc Move the storage details in the architecture doc to a separate file. Partially fixes: #3246. Signed-off-by: James O. D. Hunt --- docs/design/architecture/README.md | 49 +++-------------------------- docs/design/architecture/storage.md | 44 ++++++++++++++++++++++++++ 2 files changed, 48 insertions(+), 45 deletions(-) create mode 100644 docs/design/architecture/storage.md diff --git a/docs/design/architecture/README.md b/docs/design/architecture/README.md index 5baaee56e4..1fc2275b93 100644 --- a/docs/design/architecture/README.md +++ b/docs/design/architecture/README.md @@ -118,7 +118,7 @@ to study this table closely to make sense of what follows: |-|-|-|-|-|-|-|-| | Host | Host | no `[1]` | no | Host specific | Host specific | Host specific | The environment provided by a standard, physical non virtualized system. | | VM root | Guest VM | yes | no | rootfs inside the [guest image](#guest-image) | Hypervisor specific `[2]` | `ext4` | The first (or top) level VM environment created on a host system. | -| VM container root | Container | yes | yes | rootfs type requested by user ([`ubuntu` in the example](example-command.md)) | `kataShared` | [virtio FS](#virtio-fs) | The first (or top) level container environment created inside the VM. Based on the [OCI bundle](background.md#oci-bundle). | +| VM container root | Container | yes | yes | rootfs type requested by user ([`ubuntu` in the example](example-command.md)) | `kataShared` | [virtio FS](storage.md#virtio-fs) | The first (or top) level container environment created inside the VM. Based on the [OCI bundle](background.md#oci-bundle). 
| **Key:** @@ -163,7 +163,7 @@ created using the containerd container manager: - The hypervisor [DAX](#dax) shares the [guest image](#guest-image) into the VM to become the VM [rootfs](background.md#root-filesystem) (mounted on a `/dev/pmem*` device), which is known as the [VM root environment](#environments). - - The hypervisor mounts the [OCI bundle](background.md#oci-bundle), using [virtio FS](#virtio-fs), + - The hypervisor mounts the [OCI bundle](background.md#oci-bundle), using [virtio FS](storage.md#virtio-fs), into a container specific directory inside the VM's rootfs. This container specific directory will become the @@ -505,7 +505,7 @@ created with containerd using our [example command](example-command.md): | Description | Host | VM root environment | VM container environment | |-|-|-|-| | Container manager | `containerd` | | -| Kata Containers | [runtime](#runtime), [`virtiofsd`](#virtio-fs), [hypervisor](#hypervisor) | [agent](#agent) | +| Kata Containers | [runtime](#runtime), [`virtiofsd`](storage.md#virtio-fs), [hypervisor](#hypervisor) | [agent](#agent) | | User [workload](#workload) | | | [`ubuntu sh`](example-command.md) | ## Networking @@ -557,48 +557,7 @@ The following diagram illustrates the Kata Containers network hotplug workflow. ## Storage -### virtio SCSI - -If a block-based graph driver is [configured](#configuration), -`virtio-scsi` is used to _share_ the workload image (such as -`busybox:latest`) into the container's environment inside the VM. - -### virtio FS - -If a block-based graph driver is _not_ [configured](#configuration), a -[`virtio-fs`](https://virtio-fs.gitlab.io) (`VIRTIO`) overlay -filesystem mount point is used to _share_ the workload image instead. The -[agent](#agent) uses this mount point as the root filesystem for the -container processes. - -For virtio-fs, the [runtime](#runtime) starts one `virtiofsd` daemon -(that runs in the host context) for each VM created. 
- -### Devicemapper - -The -[devicemapper `snapshotter`](https://github.com/containerd/containerd/tree/master/snapshots/devmapper) -is a special case. The `snapshotter` uses dedicated block devices -rather than formatted filesystems, and operates at the block level -rather than the file level. This knowledge is used to directly use the -underlying block device instead of the overlay file system for the -container root file system. The block device maps to the top -read-write layer for the overlay. This approach gives much better I/O -performance compared to using `virtio-fs` to share the container file -system. - -#### Hot plug and unplug - -Kata Containers has the ability to hot plug add and hot plug remove -block devices. This makes it possible to use block devices for -containers started after the VM has been launched. - -Users can check to see if the container uses the `devicemapper` block -device as its rootfs by calling `mount(8)` within the container. If -the `devicemapper` block device is used, the root filesystem (`/`) -will be mounted from `/dev/vda`. Users can disable direct mounting of -the underlying block device through the runtime -[configuration](#configuration). +See the [storage document](storage.md). ## Kubernetes support diff --git a/docs/design/architecture/storage.md b/docs/design/architecture/storage.md new file mode 100644 index 0000000000..974f260c31 --- /dev/null +++ b/docs/design/architecture/storage.md @@ -0,0 +1,44 @@ +# Storage + +## virtio SCSI + +If a block-based graph driver is [configured](README.md#configuration), +`virtio-scsi` is used to _share_ the workload image (such as +`busybox:latest`) into the container's environment inside the VM. + +## virtio FS + +If a block-based graph driver is _not_ [configured](README.md#configuration), a +[`virtio-fs`](https://virtio-fs.gitlab.io) (`VIRTIO`) overlay +filesystem mount point is used to _share_ the workload image instead. 
The +[agent](README.md#agent) uses this mount point as the root filesystem for the +container processes. + +For virtio-fs, the [runtime](README.md#runtime) starts one `virtiofsd` daemon +(that runs in the host context) for each VM created. + +## Devicemapper + +The +[devicemapper `snapshotter`](https://github.com/containerd/containerd/tree/master/snapshots/devmapper) +is a special case. The `snapshotter` uses dedicated block devices +rather than formatted filesystems, and operates at the block level +rather than the file level. This knowledge is used to directly use the +underlying block device instead of the overlay file system for the +container root file system. The block device maps to the top +read-write layer for the overlay. This approach gives much better I/O +performance compared to using `virtio-fs` to share the container file +system. + +### Hot plug and unplug + +Kata Containers has the ability to hot plug add and hot plug remove +block devices. This makes it possible to use block devices for +containers started after the VM has been launched. + +Users can check to see if the container uses the `devicemapper` block +device as its rootfs by calling `mount(8)` within the container. If +the `devicemapper` block device is used, the root filesystem (`/`) +will be mounted from `/dev/vda`. Users can disable direct mounting of +the underlying block device through the runtime +[configuration](README.md#configuration). From 7ac619b24efc8137ab0b75de0840deb5adddec38 Mon Sep 17 00:00:00 2001 From: "James O. D. 
Hunt --- docs/design/architecture/README.md | 45 +----------------------- docs/design/architecture/networking.md | 48 ++++++++++++++++++++++++++ 2 files changed, 49 insertions(+), 44 deletions(-) create mode 100644 docs/design/architecture/networking.md diff --git a/docs/design/architecture/README.md b/docs/design/architecture/README.md index 1fc2275b93..89c877b4ef 100644 --- a/docs/design/architecture/README.md +++ b/docs/design/architecture/README.md @@ -510,50 +510,7 @@ created with containerd using our [example command](example-command.md): ## Networking -Containers will typically live in their own, possibly shared, networking namespace. -At some point in a container lifecycle, container engines will set up that namespace -to add the container to a network which is isolated from the host network, but -which is shared between containers - -In order to do so, container engines will usually add one end of a virtual -ethernet (`veth`) pair into the container networking namespace. The other end of -the `veth` pair is added to the host networking namespace. - -This is a very namespace-centric approach as many hypervisors or VM -Managers (VMMs) such as `virt-manager` cannot handle `veth` -interfaces. Typically, `TAP` interfaces are created for VM -connectivity. - -To overcome incompatibility between typical container engines expectations -and virtual machines, Kata Containers networking transparently connects `veth` -interfaces with `TAP` ones using Traffic Control: - -![Kata Containers networking](../arch-images/network.png) - -With a TC filter in place, a redirection is created between the container network and the -virtual machine. As an example, the CNI may create a device, `eth0`, in the container's network -namespace, which is a VETH device. 
Kata Containers will create a tap device for the VM, `tap0_kata`, -and setup a TC redirection filter to mirror traffic from `eth0`'s ingress to `tap0_kata`'s egress, -and a second to mirror traffic from `tap0_kata`'s ingress to `eth0`'s egress. - -Kata Containers maintains support for MACVTAP, which was an earlier implementation used in Kata. TC-filter -is the default because it allows for simpler configuration, better CNI plugin compatibility, and performance -on par with MACVTAP. - -Kata Containers has deprecated support for bridge due to lacking performance relative to TC-filter and MACVTAP. - -Kata Containers supports both -[CNM](https://github.com/docker/libnetwork/blob/master/docs/design.md#the-container-network-model) -and [CNI](https://github.com/containernetworking/cni) for networking management. - -### Network Hotplug - -Kata Containers has developed a set of network sub-commands and APIs to add, list and -remove a guest network endpoint and to manipulate the guest route table. - -The following diagram illustrates the Kata Containers network hotplug workflow. - -![Network Hotplug](../arch-images/kata-containers-network-hotplug.png) +See the [networking document](networking.md). ## Storage diff --git a/docs/design/architecture/networking.md b/docs/design/architecture/networking.md new file mode 100644 index 0000000000..80a6b7d27a --- /dev/null +++ b/docs/design/architecture/networking.md @@ -0,0 +1,48 @@ +# Networking + +See the [networking document](networking.md). + +Containers will typically live in their own, possibly shared, networking namespace. +At some point in a container lifecycle, container engines will set up that namespace +to add the container to a network which is isolated from the host network, but +which is shared between containers + +In order to do so, container engines will usually add one end of a virtual +ethernet (`veth`) pair into the container networking namespace. 
The other end of +the `veth` pair is added to the host networking namespace. + +This is a very namespace-centric approach as many hypervisors or VM +Managers (VMMs) such as `virt-manager` cannot handle `veth` +interfaces. Typically, `TAP` interfaces are created for VM +connectivity. + +To overcome incompatibility between typical container engines' expectations +and virtual machines, Kata Containers networking transparently connects `veth` +interfaces with `TAP` ones using Traffic Control: + +![Kata Containers networking](../arch-images/network.png) + +With a TC filter in place, a redirection is created between the container network and the +virtual machine. As an example, the CNI may create a device, `eth0`, in the container's network +namespace, which is a VETH device. Kata Containers will create a tap device for the VM, `tap0_kata`, +and set up a TC redirection filter to mirror traffic from `eth0`'s ingress to `tap0_kata`'s egress, +and a second to mirror traffic from `tap0_kata`'s ingress to `eth0`'s egress. + +Kata Containers maintains support for MACVTAP, which was an earlier implementation used in Kata. TC-filter +is the default because it allows for simpler configuration, better CNI plugin compatibility, and performance +on par with MACVTAP. + +Kata Containers has deprecated support for bridge due to its lower performance relative to TC-filter and MACVTAP. + +Kata Containers supports both +[CNM](https://github.com/docker/libnetwork/blob/master/docs/design.md#the-container-network-model) +and [CNI](https://github.com/containernetworking/cni) for networking management. + +## Network Hotplug + +Kata Containers has developed a set of network sub-commands and APIs to add, list and +remove a guest network endpoint and to manipulate the guest route table. + +The following diagram illustrates the Kata Containers network hotplug workflow. 
+ +![Network Hotplug](../arch-images/kata-containers-network-hotplug.png) From db411c23e83cf2fcf2a39d6169bfbe7c9c7178ee Mon Sep 17 00:00:00 2001 From: "James O. D. Hunt" Date: Thu, 16 Dec 2021 12:53:16 +0000 Subject: [PATCH 6/7] docs: Split k8s info out of arch doc Move the Kubernetes information out of the architecture doc and into a separate file. Partially fixes: #3246. Signed-off-by: James O. D. Hunt --- docs/design/architecture/README.md | 38 ++------------------------ docs/design/architecture/kubernetes.md | 35 ++++++++++++++++++++++++ 2 files changed, 38 insertions(+), 35 deletions(-) create mode 100644 docs/design/architecture/kubernetes.md diff --git a/docs/design/architecture/README.md b/docs/design/architecture/README.md index 89c877b4ef..c92750847a 100644 --- a/docs/design/architecture/README.md +++ b/docs/design/architecture/README.md @@ -80,7 +80,7 @@ The shimv2 architecture allows running several containers per VM to support container engines that require multiple containers running inside a pod. -With the new architecture [Kubernetes](#kubernetes-support) can +With the new architecture [Kubernetes](kubernetes.md) can launch both Pod and OCI compatible containers with a single [runtime](#runtime) shim per Pod, rather than `2N+1` shims. No stand alone `kata-proxy` process is required, even if VSOCK is not @@ -141,7 +141,7 @@ The reasons for containerizing the [workload](#workload) inside the VM are: - Isolates the workload entirely from the VM environment. -- Provides better isolation between containers in a [pod](#kubernetes-support). +- Provides better isolation between containers in a [pod](kubernetes.md). - Allows the workload to be managed and monitored through its cgroup confinement. @@ -518,39 +518,7 @@ See the [storage document](storage.md). ## Kubernetes support -[Kubernetes](https://github.com/kubernetes/kubernetes/), or K8s, is a popular open source -container orchestration engine. 
In Kubernetes, a set of containers sharing resources -such as networking, storage, mount, PID, etc. is called a -[pod](https://kubernetes.io/docs/user-guide/pods/). - -A node can have multiple pods, but at a minimum, a node within a Kubernetes cluster -only needs to run a container runtime and a container agent (called a -[Kubelet](https://kubernetes.io/docs/admin/kubelet/)). - -Kata Containers represents a Kubelet pod as a VM. - -A Kubernetes cluster runs a control plane where a scheduler (typically -running on a dedicated master node) calls into a compute Kubelet. This -Kubelet instance is responsible for managing the lifecycle of pods -within the nodes and eventually relies on a container runtime to -handle execution. The Kubelet architecture decouples lifecycle -management from container execution through a dedicated gRPC based -[Container Runtime Interface (CRI)](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/container-runtime-interface-v1.md). - -In other words, a Kubelet is a CRI client and expects a CRI -implementation to handle the server side of the interface. -[CRI-O](https://github.com/kubernetes-incubator/cri-o) and -[containerd](https://github.com/containerd/containerd/) are CRI -implementations that rely on -[OCI](https://github.com/opencontainers/runtime-spec) compatible -runtimes for managing container instances. - -Kata Containers is an officially supported CRI-O and containerd -runtime. Refer to the following guides on how to set up Kata -Containers with Kubernetes: - -- [How to use Kata Containers and containerd](../../how-to/containerd-kata.md) -- [Run Kata Containers with Kubernetes](../../how-to/run-kata-with-k8s.md) +See the [Kubernetes document](kubernetes.md). 
#### OCI annotations diff --git a/docs/design/architecture/kubernetes.md b/docs/design/architecture/kubernetes.md new file mode 100644 index 0000000000..be7377b39f --- /dev/null +++ b/docs/design/architecture/kubernetes.md @@ -0,0 +1,35 @@ +# Kubernetes support + +[Kubernetes](https://github.com/kubernetes/kubernetes/), or K8s, is a popular open source +container orchestration engine. In Kubernetes, a set of containers sharing resources +such as networking, storage, mount, PID, etc. is called a +[pod](https://kubernetes.io/docs/user-guide/pods/). + +A node can have multiple pods, but at a minimum, a node within a Kubernetes cluster +only needs to run a container runtime and a container agent (called a +[Kubelet](https://kubernetes.io/docs/admin/kubelet/)). + +Kata Containers represents a Kubelet pod as a VM. + +A Kubernetes cluster runs a control plane where a scheduler (typically +running on a dedicated master node) calls into a compute Kubelet. This +Kubelet instance is responsible for managing the lifecycle of pods +within the nodes and eventually relies on a container runtime to +handle execution. The Kubelet architecture decouples lifecycle +management from container execution through a dedicated gRPC based +[Container Runtime Interface (CRI)](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/container-runtime-interface-v1.md). + +In other words, a Kubelet is a CRI client and expects a CRI +implementation to handle the server side of the interface. +[CRI-O](https://github.com/kubernetes-incubator/cri-o) and +[containerd](https://github.com/containerd/containerd/) are CRI +implementations that rely on +[OCI](https://github.com/opencontainers/runtime-spec) compatible +runtimes for managing container instances. + +Kata Containers is an officially supported CRI-O and containerd +runtime. 
Refer to the following guides on how to set up Kata +Containers with Kubernetes: + +- [How to use Kata Containers and containerd](../../how-to/containerd-kata.md) +- [Run Kata Containers with Kubernetes](../../how-to/run-kata-with-k8s.md) From 233015a6d9e196929da11dd4692b1848c3a57671 Mon Sep 17 00:00:00 2001 From: "James O. D. Hunt" Date: Thu, 16 Dec 2021 14:09:38 +0000 Subject: [PATCH 7/7] docs: Split guest assets details out of arch doc Move the guest assets details out of the architecture doc and into a separate file. Fixes: #3246. Signed-off-by: James O. D. Hunt --- docs/design/architecture/README.md | 175 +++-------------------- docs/design/architecture/guest-assets.md | 150 +++++++++++++++++++ 2 files changed, 167 insertions(+), 158 deletions(-) create mode 100644 docs/design/architecture/guest-assets.md diff --git a/docs/design/architecture/README.md b/docs/design/architecture/README.md index c92750847a..9b5ccbe548 100644 --- a/docs/design/architecture/README.md +++ b/docs/design/architecture/README.md @@ -117,7 +117,7 @@ to study this table closely to make sense of what follows: | Type | Name | Virtualized | Containerized | rootfs | Rootfs device type | Mount type | Description | |-|-|-|-|-|-|-|-| | Host | Host | no `[1]` | no | Host specific | Host specific | Host specific | The environment provided by a standard, physical non virtualized system. | -| VM root | Guest VM | yes | no | rootfs inside the [guest image](#guest-image) | Hypervisor specific `[2]` | `ext4` | The first (or top) level VM environment created on a host system. | +| VM root | Guest VM | yes | no | rootfs inside the [guest image](guest-assets.md#guest-image) | Hypervisor specific `[2]` | `ext4` | The first (or top) level VM environment created on a host system. 
| | VM container root | Container | yes | yes | rootfs type requested by user ([`ubuntu` in the example](example-command.md)) | `kataShared` | [virtio FS](storage.md#virtio-fs) | The first (or top) level container environment created inside the VM. Based on the [OCI bundle](background.md#oci-bundle). | **Key:** @@ -158,9 +158,10 @@ created using the containerd container manager: 1. The container manager calls a set of shimv2 API functions on the runtime. 1. The Kata runtime launches the configured [hypervisor](#hypervisor). 1. The hypervisor creates and starts (_boots_) a VM using the - [guest assets](#guest-assets): + [guest assets](guest-assets.md#guest-assets): - - The hypervisor [DAX](#dax) shares the [guest image](#guest-image) + - The hypervisor [DAX](#dax) shares the + [guest image](guest-assets.md#guest-image) into the VM to become the VM [rootfs](background.md#root-filesystem) (mounted on a `/dev/pmem*` device), which is known as the [VM root environment](#environments). - The hypervisor mounts the [OCI bundle](background.md#oci-bundle), using [virtio FS](storage.md#virtio-fs), @@ -189,13 +190,13 @@ created using the containerd container manager: > a container environment created by the > [`runc`](https://github.com/opencontainers/runc) OCI runtime; > Linux cgroups and namespaces are created inside the VM by the - > [guest kernel](#guest-kernel) to isolate the workload from the - > VM environment the container is created in. See the - > [Environments](#environments) section for an explanation of why - > this is done. + > [guest kernel](guest-assets.md#guest-kernel) to isolate the + > workload from the VM environment the container is created in. + > See the [Environments](#environments) section for an + > explanation of why this is done. > - > - See the [guest image](#guest-image) section for details of - > exactly how the agent is started. + > - See the [guest image](guest-assets.md#guest-image) section for + > details of exactly how the agent is started. 
1. The container manager returns control of the container to the user running the `ctr` command. @@ -253,153 +254,11 @@ If the container manager requests the container be deleted, the ## Guest assets -Kata Containers creates a VM in which to run one or more containers. It -does this by launching a [hypervisor](#hypervisor) to create the VM. -The hypervisor needs two assets for this task: a Linux kernel and a -small root filesystem image to boot the VM. +The guest assets comprise a guest image and a guest kernel that are +used by the [hypervisor](#hypervisor). -### Guest kernel - -The [guest kernel](../../../tools/packaging/kernel) -is passed to the hypervisor and used to boot the VM. -The default kernel provided in Kata Containers is highly optimized for -kernel boot time and minimal memory footprint, providing only those -services required by a container workload. It is based on the latest -Linux LTS (Long Term Support) [kernel](https://www.kernel.org). - -### Guest image - -The hypervisor uses an image file which provides a minimal root -filesystem used by the guest kernel to boot the VM and host the Kata -Container. Kata Containers supports both initrd and rootfs based -minimal guest images. The [default packages](../../install/) provide both -an image and an initrd, both of which are created using the -[`osbuilder`](../../../tools/osbuilder) tool. - -> **Notes:** -> -> - Although initrd and rootfs based images are supported, not all -> [hypervisors](#hypervisor) support both types of image. -> -> - The guest image is *unrelated* to the image used in a container -> workload. -> -> For example, if a user creates a container that runs a shell in a -> BusyBox image, they will run that shell in a BusyBox environment. -> However, the guest image running inside the VM that is used to -> *host* that BusyBox image could be running Clear Linux, Ubuntu, -> Fedora or any other distribution potentially. 
->
->   The `osbuilder` tool provides
->   [configurations for various common Linux distributions](../../../tools/osbuilder/rootfs-builder)
->   which can be built into either initrd or rootfs guest images.
->
-> - If you are using a [packaged version of Kata
->   Containers](../../install), you can see image details by running the
->   [`kata-collect-data.sh`](../../../src/runtime/data/kata-collect-data.sh.in)
->   script as `root` and looking at the "Image details" section of the
->   output.
-
-#### Root filesystem image
-
-The default packaged rootfs image, sometimes referred to as the _mini
-O/S_, is a highly optimized container bootstrap system.
-
-If this image type is [configured](#configuration), when the user runs
-the [example command](example-command.md):
-
-- The [runtime](#runtime) will launch the configured [hypervisor](#hypervisor).
-- The hypervisor will boot the mini-OS image using the [guest kernel](#guest-kernel).
-- The kernel will start the init daemon as PID 1 (`systemd`) inside the VM root environment.
-- `systemd`, running inside the mini-OS context, will launch the [agent](#agent)
-  in the root context of the VM.
-- The agent will create a new container environment, setting its root
-  filesystem to that requested by the user (Ubuntu in [the example](example-command.md)).
-- The agent will then execute the command (`sh(1)` in [the example](example-command.md))
-  inside the new container.
-
-The table below summarises the default mini O/S showing the
-environments that are created, the services running in those
-environments (for all platforms) and the root filesystem used by
-each service:
-
-| Process | Environment | systemd service? | rootfs | User accessible | Notes |
-|-|-|-|-|-|-|
-| systemd | VM root | n/a | [VM guest image](#guest-image)| [debug console][debug-console] | The init daemon, running as PID 1 |
-| [Agent](#agent) | VM root | yes | [VM guest image](#guest-image)| [debug console][debug-console] | Runs as a systemd service |
-| `chronyd` | VM root | yes | [VM guest image](#guest-image)| [debug console][debug-console] | Used to synchronise the time with the host |
-| container workload (`sh(1)` in [the example](example-command.md)) | VM container | no | User specified (Ubuntu in [the example](example-command.md)) | [exec command](#exec-command) | Managed by the agent |
-
-See also the [process overview](#process-overview).
-
-> **Notes:**
->
-> - The "User accessible" column shows how an administrator can access
->   the environment.
->
-> - The container workload is running inside a full container
->   environment which itself is running within a VM environment.
->
-> - See the [configuration files for the `osbuilder` tool](../../../tools/osbuilder/rootfs-builder)
->   for details of the default distribution for platforms other than
->   Intel x86_64.
-
-#### Initrd image
-
-The initrd image is a compressed `cpio(1)` archive, created from a
-rootfs which is loaded into memory and used as part of the Linux
-startup process. During startup, the kernel unpacks it into a special
-instance of a `tmpfs` mount that becomes the initial root filesystem.
-
-If this image type is [configured](#configuration), when the user runs
-the [example command](example-command.md):
-
-- The [runtime](#runtime) will launch the configured [hypervisor](#hypervisor).
-- The hypervisor will boot the mini-OS image using the [guest kernel](#guest-kernel).
-- The kernel will start the init daemon as PID 1 (the [agent](#agent))
-  inside the VM root environment.
-- The [agent](#agent) will create a new container environment, setting its root
-  filesystem to that requested by the user (`ubuntu` in
-  [the example](example-command.md)).
-- The agent will then execute the command (`sh(1)` in [the example](example-command.md))
-  inside the new container.
-
-The table below summarises the default mini O/S showing the environments that are created,
-the processes running in those environments (for all platforms) and
-the root filesystem used by each service:
-
-| Process | Environment | rootfs | User accessible | Notes |
-|-|-|-|-|-|
-| [Agent](#agent) | VM root | [VM guest image](#guest-image) | [debug console][debug-console] | Runs as the init daemon (PID 1) |
-| container workload | VM container | User specified (Ubuntu in this example) | [exec command](#exec-command) | Managed by the agent |
-
-> **Notes:**
->
-> - The "User accessible" column shows how an administrator can access
->   the environment.
->
-> - It is possible to use a standard init daemon such as systemd with
->   an initrd image if this is desirable.
-
-See also the [process overview](#process-overview).
-
-#### Image summary
-
-| Image type | Default distro | Init daemon | Reason | Notes |
-|-|-|-|-|-|
-| [image](background.md#root-filesystem-image) | [Clear Linux](https://clearlinux.org) (for x86_64 systems)| systemd | Minimal and highly optimized | systemd offers flexibility |
-| [initrd](#initrd-image) | [Alpine Linux](https://alpinelinux.org) | Kata [agent](#agent) (as no systemd support) | Security hardened and tiny C library |
-
-See also:
-
-- The [osbuilder](../../../tools/osbuilder) tool
-
-  This is used to build all default image types.
-
-- The [versions database](../../../versions.yaml)
-
-  The `default-image-name` and `default-initrd-name` options specify
-  the default distributions for each image type.
+See the [guest assets](guest-assets.md) document for further
+information.
 
 ## Hypervisor
 
@@ -561,7 +420,7 @@ architecture.
 
 Kata Containers utilizes the Linux kernel DAX
 [(Direct Access filesystem)](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/dax.rst?h=v5.14)
-feature to efficiently map the [guest image](#guest-image) in the
+feature to efficiently map the [guest image](guest-assets.md#guest-image) in the
 [host environment](#environments) into the
 [guest VM environment](#environments) to become the VM's
 [rootfs](background.md#root-filesystem).
@@ -581,7 +440,7 @@ virtual device which is used to DAX map the VM's
 address space.
 
 The VM is then booted, specifying the `root=` kernel parameter to make
-the [guest kernel](#guest-kernel) use the appropriate emulated device
+the [guest kernel](guest-assets.md#guest-kernel) use the appropriate emulated device
 as its rootfs.
 
 ### DAX advantages
@@ -591,7 +450,7 @@ more traditional VM file and device mapping mechanisms:
 
 - Mapping as a direct access device allows the guest to directly
   access the host memory pages (such as via Execute In Place (XIP)),
-  bypassing the [guest kernel](#guest-kernel)'s page cache. This
+  bypassing the [guest kernel](guest-assets.md#guest-kernel)'s page cache. This
   zero copy provides both time and space optimizations.
 
 - Mapping as a direct access device inside the VM allows pages from the
diff --git a/docs/design/architecture/guest-assets.md b/docs/design/architecture/guest-assets.md
new file mode 100644
index 0000000000..9c4995268a
--- /dev/null
+++ b/docs/design/architecture/guest-assets.md
@@ -0,0 +1,150 @@
+# Guest assets
+
+Kata Containers creates a VM in which to run one or more containers.
+It does this by launching a [hypervisor](README.md#hypervisor) to
+create the VM. The hypervisor needs two assets for this task: a Linux
+kernel and a small root filesystem image to boot the VM.
+
+## Guest kernel
+
+The [guest kernel](../../../tools/packaging/kernel)
+is passed to the hypervisor and used to boot the VM.
+The default kernel provided in Kata Containers is highly optimized for
+kernel boot time and minimal memory footprint, providing only those
+services required by a container workload. It is based on the latest
+Linux LTS (Long Term Support) [kernel](https://www.kernel.org).
+
+## Guest image
+
+The hypervisor uses an image file which provides a minimal root
+filesystem used by the guest kernel to boot the VM and host the Kata
+Container. Kata Containers supports both initrd and rootfs based
+minimal guest images. The [default packages](../../install/) provide both
+an image and an initrd, both of which are created using the
+[`osbuilder`](../../../tools/osbuilder) tool.
+
+> **Notes:**
+>
+> - Although initrd and rootfs based images are supported, not all
+>   [hypervisors](README.md#hypervisor) support both types of image.
+>
+> - The guest image is *unrelated* to the image used in a container
+>   workload.
+>
+>   For example, if a user creates a container that runs a shell in a
+>   BusyBox image, they will run that shell in a BusyBox environment.
+>   However, the guest image running inside the VM that is used to
+>   *host* that BusyBox image could be running Clear Linux, Ubuntu,
+>   Fedora or any other distribution potentially.
+>
+>   The `osbuilder` tool provides
+>   [configurations for various common Linux distributions](../../../tools/osbuilder/rootfs-builder)
+>   which can be built into either initrd or rootfs guest images.
+>
+> - If you are using a [packaged version of Kata
+>   Containers](../../install), you can see image details by running the
+>   [`kata-collect-data.sh`](../../../src/runtime/data/kata-collect-data.sh.in)
+>   script as `root` and looking at the "Image details" section of the
+>   output.
+
+#### Root filesystem image
+
+The default packaged rootfs image, sometimes referred to as the _mini
+O/S_, is a highly optimized container bootstrap system.
+
+If this image type is [configured](README.md#configuration), when the
+user runs the [example command](example-command.md):
+
+- The [runtime](README.md#runtime) will launch the configured [hypervisor](README.md#hypervisor).
+- The hypervisor will boot the mini-OS image using the [guest kernel](#guest-kernel).
+- The kernel will start the init daemon as PID 1 (`systemd`) inside the VM root environment.
+- `systemd`, running inside the mini-OS context, will launch the [agent](README.md#agent)
+  in the root context of the VM.
+- The agent will create a new container environment, setting its root
+  filesystem to that requested by the user (Ubuntu in [the example](example-command.md)).
+- The agent will then execute the command (`sh(1)` in [the example](example-command.md))
+  inside the new container.
+
+The table below summarises the default mini O/S showing the
+environments that are created, the services running in those
+environments (for all platforms) and the root filesystem used by
+each service:
+
+| Process | Environment | systemd service? | rootfs | User accessible | Notes |
+|-|-|-|-|-|-|
+| systemd | VM root | n/a | [VM guest image](#guest-image)| [debug console][debug-console] | The init daemon, running as PID 1 |
+| [Agent](README.md#agent) | VM root | yes | [VM guest image](#guest-image)| [debug console][debug-console] | Runs as a systemd service |
+| `chronyd` | VM root | yes | [VM guest image](#guest-image)| [debug console][debug-console] | Used to synchronise the time with the host |
+| container workload (`sh(1)` in [the example](example-command.md)) | VM container | no | User specified (Ubuntu in [the example](example-command.md)) | [exec command](README.md#exec-command) | Managed by the agent |
+
+See also the [process overview](README.md#process-overview).
+
+> **Notes:**
+>
+> - The "User accessible" column shows how an administrator can access
+>   the environment.
+>
+> - The container workload is running inside a full container
+>   environment which itself is running within a VM environment.
+>
+> - See the [configuration files for the `osbuilder` tool](../../../tools/osbuilder/rootfs-builder)
+>   for details of the default distribution for platforms other than
+>   Intel x86_64.
+
+#### Initrd image
+
+The initrd image is a compressed `cpio(1)` archive, created from a
+rootfs which is loaded into memory and used as part of the Linux
+startup process. During startup, the kernel unpacks it into a special
+instance of a `tmpfs` mount that becomes the initial root filesystem.
+
+If this image type is [configured](README.md#configuration), when the user runs
+the [example command](example-command.md):
+
+- The [runtime](README.md#runtime) will launch the configured [hypervisor](README.md#hypervisor).
+- The hypervisor will boot the mini-OS image using the [guest kernel](#guest-kernel).
+- The kernel will start the init daemon as PID 1 (the
+  [agent](README.md#agent))
+  inside the VM root environment.
+- The [agent](README.md#agent) will create a new container environment, setting its root
+  filesystem to that requested by the user (`ubuntu` in
+  [the example](example-command.md)).
+- The agent will then execute the command (`sh(1)` in [the example](example-command.md))
+  inside the new container.
+
+The table below summarises the default mini O/S showing the environments that are created,
+the processes running in those environments (for all platforms) and
+the root filesystem used by each service:
+
+| Process | Environment | rootfs | User accessible | Notes |
+|-|-|-|-|-|
+| [Agent](README.md#agent) | VM root | [VM guest image](#guest-image) | [debug console][debug-console] | Runs as the init daemon (PID 1) |
+| container workload | VM container | User specified (Ubuntu in this example) | [exec command](README.md#exec-command) | Managed by the agent |
+
+> **Notes:**
+>
+> - The "User accessible" column shows how an administrator can access
+>   the environment.
+>
+> - It is possible to use a standard init daemon such as systemd with
+>   an initrd image if this is desirable.
+
+See also the [process overview](README.md#process-overview).
+
+#### Image summary
+
+| Image type | Default distro | Init daemon | Reason | Notes |
+|-|-|-|-|-|
+| [image](background.md#root-filesystem-image) | [Clear Linux](https://clearlinux.org) (for x86_64 systems)| systemd | Minimal and highly optimized | systemd offers flexibility |
+| [initrd](#initrd-image) | [Alpine Linux](https://alpinelinux.org) | Kata [agent](README.md#agent) (as no systemd support) | Security hardened and tiny C library |
+
+See also:
+
+- The [osbuilder](../../../tools/osbuilder) tool
+
+  This is used to build all default image types.
+
+- The [versions database](../../../versions.yaml)
+
+  The `default-image-name` and `default-initrd-name` options specify
+  the default distributions for each image type.