diff --git a/Limitations.md b/Limitations.md
index 3af5507363..d8b5cbf1a7 100644
--- a/Limitations.md
+++ b/Limitations.md
@@ -138,7 +138,7 @@ these commands is potentially challenging. See issue
https://github.com/clearcontainers/runtime/issues/341 and
[the constraints challenge](#the-constraints-challenge) for more
information. For CPUs resource management see
-[CPU constraints](design/cpu-constraints.md).
+[CPU constraints](design/vcpu-handling.md).

### docker run and shared memory

diff --git a/design/README.md b/design/README.md
index 6ca5796de7..da0faf7035 100644
--- a/design/README.md
+++ b/design/README.md
@@ -6,3 +6,5 @@ Kata Containers design documents:
- [API Design of Kata Containers](kata-api-design.md)
- [Design requirements for Kata Containers](kata-design-requirements.md)
- [VSocks](VSocks.md)
+- [VCPU handling](vcpu-handling.md)
+- [Host cgroups](host-cgroups.md)

diff --git a/design/VSocks.md b/design/VSocks.md
index 09831cf78f..884561ceaa 100644
--- a/design/VSocks.md
+++ b/design/VSocks.md
@@ -130,5 +130,5 @@ the containers are removed automatically.
[2]: https://github.com/kata-containers/proxy
[3]: https://github.com/hashicorp/yamux
[4]: https://wiki.qemu.org/Features/VirtioVsock
-[5]: ./cpu-constraints.md#virtual-cpus-and-kubernetes-pods
+[5]: ./vcpu-handling.md#virtual-cpus-and-kubernetes-pods
[6]: https://github.com/kata-containers/shim

diff --git a/design/host-cgroups.md b/design/host-cgroups.md
new file mode 100644
index 0000000000..11b4849c51
--- /dev/null
+++ b/design/host-cgroups.md
@@ -0,0 +1,208 @@
- [Host cgroup management](#host-cgroup-management)
  - [Introduction](#introduction)
  - [`SandboxCgroupOnly` enabled](#sandboxcgrouponly-enabled)
    - [What does Kata do in this configuration?](#what-does-kata-do-in-this-configuration)
    - [Why create a Kata-cgroup under the parent cgroup?](#why-create-a-kata-cgroup-under-the-parent-cgroup)
    - [Improvements](#improvements)
  - [`SandboxCgroupOnly` disabled (default, legacy)](#sandboxcgrouponly-disabled-default-legacy)
    - [What does this method do?](#what-does-this-method-do)
    - [Impact](#impact)
  - [Summary](#summary)

# Host cgroup management

## Introduction

In Kata Containers, workloads run in a virtual machine that is managed by a virtual
machine monitor (VMM) running on the host. As a result, Kata Containers run over two layers of cgroups. The
first layer is in the guest, where the workload is placed, while the second layer is on the host, where the
VMM and associated threads are running.

The OCI [runtime specification][linux-config] provides guidance on where the container cgroups should be placed:

> [`cgroupsPath`][cgroupspath]: (string, OPTIONAL) path to the cgroups. It can be used to either control the cgroups
> hierarchy for containers or to run a new process in an existing container

cgroups are hierarchical, and this can be seen with the following pod example:

- Pod 1: `cgroupsPath=/kubepods/pod1`
  - Container 1: `cgroupsPath=/kubepods/pod1/container1`
  - Container 2: `cgroupsPath=/kubepods/pod1/container2`

- Pod 2: `cgroupsPath=/kubepods/pod2`
  - Container 1: `cgroupsPath=/kubepods/pod2/container1`
  - Container 2: `cgroupsPath=/kubepods/pod2/container2`

Depending on the upper-level orchestrator, the cgroup under which the pod is placed is
managed by the orchestrator. In the case of Kubernetes, the pod-cgroup is created by Kubelet,
while the container cgroups are handled by the runtime. Kubelet will size the pod-cgroup
based on the container resource requirements.
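To tie this to the OCI specification quoted above, the `cgroupsPath` is delivered to the runtime inside each container's `config.json`. A minimal illustrative fragment (hypothetical, reusing the Pod 1 paths above; nearly all fields of a real configuration are omitted):

```
{
  "ociVersion": "1.0.1",
  "linux": {
    "cgroupsPath": "/kubepods/pod1/container1"
  }
}
```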
Kata Containers introduces a non-negligible overhead for running a sandbox (pod). Based on this, two scenarios are possible:
 1) The upper-layer orchestrator takes the overhead of running a sandbox into account when sizing the pod-cgroup, or
 2) Kata Containers does not fully constrain the VMM and associated processes, instead placing a subset of them outside of the pod-cgroup.

Kata Containers provides two options for how cgroups are handled on the host. Selection of these options is done through
the `SandboxCgroupOnly` flag within the Kata Containers [configuration](https://github.com/kata-containers/runtime#configuration)
file.

## `SandboxCgroupOnly` enabled

With `SandboxCgroupOnly` enabled, it is expected that the parent cgroup is sized to take the overhead of running
a sandbox into account. This is ideal, as all the applicable Kata Containers components can be placed within the
given cgroup-path.

In the context of Kubernetes, Kubelet will size the pod-cgroup to take the overhead of running a Kata-based sandbox
into account. This will be feasible in the Kubernetes 1.16 release through the `PodOverhead` feature.

```
+----------------------------------------------------+
| +------------------------------------------------+ |
| | +--------------------------------------------+ | |
| | | +----------------------------------------+ | | |
| | | | kata-shimv2, VMM and threads:          | | | |
| | | | (VMM, IO-threads, vCPU threads, etc)   | | | |
| | | |                                        | | | |
| | | | kata-sandbox-<sandbox-id>              | | | |
| | | +----------------------------------------+ | | |
| | |                                            | | |
| | |Pod 1                                       | | |
| | +--------------------------------------------+ | |
| |                                                | |
| | +--------------------------------------------+ | |
| | | +----------------------------------------+ | | |
| | | | kata-shimv2, VMM and threads:          | | | |
| | | | (VMM, IO-threads, vCPU threads, etc)   | | | |
| | | |                                        | | | |
| | | | kata-sandbox-<sandbox-id>              | | | |
| | | +----------------------------------------+ | | |
| | |Pod 2                                       | | |
| | +--------------------------------------------+ | |
| |kubepods                                        | |
| +------------------------------------------------+ |
|                                                    |
|Node                                                |
+----------------------------------------------------+
```

### What does Kata do in this configuration?

1. Given a `PodSandbox` container creation, let:

   ```
   podCgroup=Parent(container.CgroupsPath)
   KataSandboxCgroup=<podCgroup>/kata-sandbox-<sandbox-id>
   ```

2. Create the cgroup, `KataSandboxCgroup`

3. Join the `KataSandboxCgroup`

Any process created by the runtime will be created in `KataSandboxCgroup`.
The runtime will not limit the cgroup in the host, but the caller is free
to set the proper limits for the `podCgroup`.

In the example above the pod cgroups are `/kubepods/pod1` and `/kubepods/pod2`.
Kata creates the unrestricted sandbox cgroup under the pod cgroup.
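A shell-level sketch of what these steps amount to, assuming cgroup v1 with the `cpu` controller mounted at `/sys/fs/cgroup/cpu` and a hypothetical sandbox ID; this illustrates the mechanics, not the runtime's actual code:

```
# Pod cgroup created and sized by the caller (e.g. Kubelet):
podCgroup=/sys/fs/cgroup/cpu/kubepods/pod1

# Steps 1-2: derive and create the sandbox cgroup under the pod cgroup.
sandboxCgroup="${podCgroup}/kata-sandbox-49a7a27c"   # hypothetical sandbox ID
mkdir -p "${sandboxCgroup}"

# Step 3: join it. Every process the runtime spawns afterwards (VMM,
# vCPU/IO threads, shim) inherits this cgroup.
echo $$ > "${sandboxCgroup}/cgroup.procs"

# Kata sets no limits here; the caller constrains the pod cgroup instead,
# e.g. capping the entire sandbox (overhead included) to two CPUs:
echo 100000 > "${podCgroup}/cpu.cfs_period_us"
echo 200000 > "${podCgroup}/cpu.cfs_quota_us"
```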
### Why create a Kata-cgroup under the parent cgroup?

`Docker` does not have a notion of pods, and will not create a cgroup directory
to place a particular container in (i.e., all containers are placed in a path like
`/docker/container-id`). To simplify the implementation and continue to support `Docker`,
Kata Containers creates a sandbox cgroup in the case of Kubernetes, or a container cgroup in the case
of `Docker`.

### Improvements

- Get statistics about pod resources

If the Kata caller wants to know the resource usage on the host, it can get
statistics from the pod cgroup. All cgroup stats in the hierarchy will include
the Kata overhead. This makes it possible to gather usage statistics at the
pod level and the container level.

- Better host resource isolation

Because the Kata runtime will place all the Kata processes in the pod cgroup,
the resource limits that the caller applies to the pod cgroup will affect all
processes that belong to the Kata sandbox in the host. This improves the
isolation in the host, preventing Kata from becoming a noisy neighbor.

## `SandboxCgroupOnly` disabled (default, legacy)

If the cgroup provided to Kata is not sized appropriately, instability will be
introduced when fully constraining Kata components, and the user workload will
see only a subset of the resources that were requested. Based on this, the default
handling for Kata Containers is to not fully constrain the VMM and Kata
components on the host.

```
+-------------------------------------------------+
| +---------------------------------------------+ |
| | +-----------------------------------------+ | |
| | | +-------------------------------------+ | | |
| | | |Container 1      |-|Container 2      | | | |
| | | |                 |-|                 | | | |
| | | | Shim+container1 |-| Shim+container2 | | | |
| | | +-------------------------------------+ | | |
| | |                                         | | |
| | |Pod 1                                    | | |
| | +-----------------------------------------+ | |
| |                                             | |
| | +-----------------------------------------+ | |
| | | +-------------------------------------+ | | |
| | | |Container 1      |-|Container 2      | | | |
| | | |                 |-|                 | | | |
| | | | Shim+container1 |-| Shim+container2 | | | |
| | | +-------------------------------------+ | | |
| | |                                         | | |
| | |Pod 2                                    | | |
| | +-----------------------------------------+ | |
| |kubepods                                     | |
| +---------------------------------------------+ |
| +---------------------------------------------+ |
| | Hypervisor                                  | |
| |Kata                                         | |
| +---------------------------------------------+ |
|                                                 |
|Node                                             |
+-------------------------------------------------+
```

### What does this method do?

1. Given a container creation, let `containerCgroupHost=container.CgroupsPath`
1. Rename the `containerCgroupHost` path to add `kata_`
1. Let `PodCgroupPath=PodSandboxContainerCgroup`, where `PodSandboxContainerCgroup` is the cgroup of the container of type `PodSandbox`
1. Limit `PodCgroupPath` with the sum of all the container limits in the sandbox
1. Move only the vCPU threads of the hypervisor to `PodCgroupPath`
1. For each container, move its `kata-shim` to its own `containerCgroupHost`
1. Move the hypervisor and applicable threads to the memory cgroup `/kata`

_Note_: the Kata Containers runtime will not add all the hypervisor threads to
the requested cgroup path, only the vCPU threads; the remaining threads run unconstrained.

This mitigates the risk of the VMM and other threads hitting an out-of-memory (`OOM`) scenario.

### Impact

If resources are reserved at a system level to account for the overheads of
running sandbox containers, this configuration can be utilized with adequate
stability. In this scenario, non-negligible amounts of CPU and memory are
utilized on the host without being accounted for.

[linux-config]: https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md
[cgroupspath]: https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#cgroups-path

## Summary

| cgroup option | default? | status | pros | cons |
|-|-|-|-|-|
| `SandboxCgroupOnly=false` | yes | legacy | Easiest to make Kata work | Unaccounted-for memory and resource utilization |
| `SandboxCgroupOnly=true` | no | recommended | Complete tracking of Kata memory and CPU utilization. In Kubernetes, the Kubelet can fully constrain Kata via the pod cgroup | Requires an upper-layer orchestrator that sizes the sandbox cgroup appropriately |
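For reference, a minimal sketch of how the flag is set in the runtime's TOML configuration file, assuming the `sandbox_cgroup_only` key under the `[runtime]` section as the TOML spelling of `SandboxCgroupOnly` (the file location depends on the installation):

```
[runtime]
# Keep the VMM and all associated threads inside the sandbox (pod) cgroup
# provided by the caller; the caller must size that cgroup to include the
# sandbox overhead.
sandbox_cgroup_only = true
```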
diff --git a/design/cpu-constraints.md b/design/vcpu-handling.md
similarity index 55%
rename from design/cpu-constraints.md
rename to design/vcpu-handling.md
index 4b7da6840b..811b972a05 100644
--- a/design/cpu-constraints.md
+++ b/design/vcpu-handling.md
@@ -1,17 +1,12 @@
-* [CPU constraints in Kata Containers](#cpu-constraints-in-kata-containers)
-  * [Default number of virtual CPUs](#default-number-of-virtual-cpus)
-  * [Virtual CPUs and Kubernetes pods](#virtual-cpus-and-kubernetes-pods)
-  * [Container lifecycle](#container-lifecycle)
-  * [Container without CPU constraint](#container-without-cpu-constraint)
-  * [Container with CPU constraint](#container-with-cpu-constraint)
-  * [Do not waste resources](#do-not-waste-resources)
-  * [CPU cgroups](#cpu-cgroups)
-    * [cgroups in the guest](#cgroups-in-the-guest)
-      * [CPU pinning](#cpu-pinning)
-    * [cgroups in the host](#cgroups-in-the-host)
+- [Virtual machine vCPU sizing in Kata Containers](#virtual-machine-vcpu-sizing-in-kata-containers)
+  * [Default number of virtual CPUs](#default-number-of-virtual-cpus)
+  * [Virtual CPUs and Kubernetes pods](#virtual-cpus-and-kubernetes-pods)
+  * [Container lifecycle](#container-lifecycle)
+  * [Container without CPU constraint](#container-without-cpu-constraint)
+  * [Container with CPU constraint](#container-with-cpu-constraint)
+  * [Do not waste resources](#do-not-waste-resources)
-
-# CPU constraints in Kata Containers
+# Virtual machine vCPU sizing in Kata Containers

## Default number of virtual CPUs
@@ -171,83 +166,6 @@ docker run --cpus 4 -ti debian bash -c "nproc; cat /sys/fs/cgroup/cpu,cpuacct/cp
```

-## CPU cgroups
-
-Kata Containers runs over two layers of cgroups, the first layer is in the guest where
-only the workload is placed, the second layer is in the host that is more complex and
-might contain more than one process and task (thread) depending of the number of
-containers per POD and vCPUs per container. The following diagram represents a Nginx container
-created with `docker` with the default number of vCPUs.
-
-
-```
-$ docker run -dt --runtime=kata-runtime nginx
-
-
-       .-------.
-       | Nginx |
-    .--'-------'---.  .------------.
-    | Guest Cgroup |  | Kata agent |
-  .-'--------------'--'------------'.    .-----------.
-  | Thread: Hypervisor's vCPU 0     |    | Kata Shim |
- .'---------------------------------'.  .'-----------'.
- | Tasks                             |  | Processes   |
-.'-----------------------------------'--'-------------'.
-| Host Cgroup                                          |
-'------------------------------------------------------'
-```
-
-The next sections explain the difference between processes and tasks and why only hypervisor
-vCPUs are constrained.
-
-### cgroups in the guest
-
-Only the workload process including all its threads are placed into CPU cgroups, this means
-that `kata-agent` and `systemd` run without constraints in the guest.
-
-#### CPU pinning
-
-Kata Containers tries to apply and honor the cgroups but sometimes that is not possible.
-An example of this occurs with CPU cgroups when the number of virtual CPUs (in the guest)
-does not match the actual number of physical host CPUs.
-In Kata Containers to have a good performance and small memory footprint, the resources are -hot added when they are needed, therefore the number of virtual resources is not the same -as the number of physical resources. The problem with this approach is that it's not possible -to pin a process on a specific resource that is not present in the guest. To deal with this -limitation and to not fail when the container is being created, Kata Containers does not apply -the constraint in the first layer (guest) if the resource does not exist in the guest, but it -is applied in the second layer (host) where the hypervisor is running. The constraint is applied -in both layers when the resource is available in the guest and host. The next sections provide -further details on what parts of the hypervisor are constrained. - -### cgroups in the host - -In Kata Containers the workloads run in a virtual machine that is managed and represented by a -hypervisor running in the host. Like other processes the hypervisor might use threads to realize -several tasks, for example IO and Network operations. One of the most important uses for the -threads is as vCPUs. The processes running in the guest see these vCPUs as physical CPUs, while -in the host those vCPU are just threads that are part of a process. This is the key to ensure -workloads consumes only the amount of CPU resources that were assigned to it without impacting -other operations. From user perspective the easier approach to implement it would be to take the -whole hypervisor including its threads and move them into the cgroup, unfortunately this will -impact negatively the performance, since vCPUs, IO and Network threads will be fighting for -resources. The following table shows a random read performance comparison between a Kata Container -with all its hypervisor threads in the cgroup and other with only its hypervisor vCPU threads -constrained, the difference is huge. - - -| Bandwidth | All threads | vCPU threads | Units | -|:-------------:|:-------------:|:------------:|:-----:| -| 4k | 136.2 | 294.7 | MB/s | -| 8k | 166.6 | 579.4 | MB/s | -| 16k | 178.3 | 1093.3 | MB/s | -| 32k | 179.9 | 1931.5 | MB/s | -| 64k | 213.6 | 3994.2 | MB/s | - - -To have the best performance in Kata Containers only the vCPU threads are constrained. - - [1]: https://docs.docker.com/config/containers/resource_constraints/#cpu [2]: https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource [3]: https://kubernetes.io/docs/concepts/workloads/pods/pod/