docs: Add documentation about host cgroup management

Document how host cgroups are handled today and what is expected
for the upcoming SandboxCgroupOnly option.

The prior cgroup documentation is no longer accurate. Remove the cgroup
discussion from the CPU sizing document, and update the
cpu-constraints.md file name to reflect this.

Fixes: #542

Signed-off-by: Eric Ernst <eric.ernst@intel.com>
Signed-off-by: Jose Carlos Venegas Munoz <jose.carlos.venegas.munoz@intel.com>


@@ -138,7 +138,7 @@ these commands is potentially challenging.
See issue https://github.com/clearcontainers/runtime/issues/341 and [the constraints challenge](#the-constraints-challenge) for more information.
For CPUs resource management see
[CPU constraints](design/cpu-constraints.md).
[CPU constraints](design/vcpu-handling.md).
### docker run and shared memory


@@ -6,3 +6,5 @@ Kata Containers design documents:
- [API Design of Kata Containers](kata-api-design.md)
- [Design requirements for Kata Containers](kata-design-requirements.md)
- [VSocks](VSocks.md)
- [VCPU handling](vcpu-handling.md)
- [Host cgroups](host-cgroups.md)


@@ -130,5 +130,5 @@ the containers are removed automatically.
[2]: https://github.com/kata-containers/proxy
[3]: https://github.com/hashicorp/yamux
[4]: https://wiki.qemu.org/Features/VirtioVsock
[5]: ./cpu-constraints.md#virtual-cpus-and-kubernetes-pods
[5]: ./vcpu-handling.md#virtual-cpus-and-kubernetes-pods
[6]: https://github.com/kata-containers/shim

design/host-cgroups.md (new file)

@@ -0,0 +1,208 @@
- [Host cgroup management](#host-cgroup-management)
- [Introduction](#introduction)
- [`SandboxCgroupOnly` enabled](#sandboxcgrouponly-enabled)
- [What does Kata do in this configuration?](#what-does-kata-do-in-this-configuration)
- [Why create a Kata-cgroup under the parent cgroup?](#why-create-a-kata-cgroup-under-the-parent-cgroup)
- [Improvements](#improvements)
- [`SandboxCgroupOnly` disabled (default, legacy)](#sandboxcgrouponly-disabled-default-legacy)
- [What does this method do?](#what-does-this-method-do)
- [Impact](#impact)
- [Summary](#summary)
# Host cgroup management
## Introduction
In Kata Containers, workloads run in a virtual machine that is managed by a virtual
machine monitor (VMM) running on the host. As a result, Kata Containers run over two layers of cgroups. The
first layer is in the guest where the workload is placed, while the second layer is on the host where the
VMM and associated threads are running.
The OCI [runtime specification][linux-config] provides guidance on where the container cgroups should be placed:
> [`cgroupsPath`][cgroupspath]: (string, OPTIONAL) path to the cgroups. It can be used to either control the cgroups
> hierarchy for containers or to run a new process in an existing container
cgroups are hierarchical, and this can be seen with the following pod example:
- Pod 1: `cgroupsPath=/kubepods/pod1`
  - Container 1:
    `cgroupsPath=/kubepods/pod1/container1`
  - Container 2:
    `cgroupsPath=/kubepods/pod1/container2`
- Pod 2: `cgroupsPath=/kubepods/pod2`
  - Container 1:
    `cgroupsPath=/kubepods/pod2/container1`
  - Container 2:
    `cgroupsPath=/kubepods/pod2/container2`
Depending on the upper-level orchestrator, the cgroup under which the pod is placed is
managed by the orchestrator. In the case of Kubernetes, the pod-cgroup is created by Kubelet,
while the container cgroups are to be handled by the runtime. Kubelet will size the pod-cgroup
based on the container resource requirements.
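As a rough sketch of how a caller expresses this hierarchy, the example below builds two OCI container specs with the `cgroupsPath` values from the pod example above, using the OCI runtime-spec Go types (the paths and the surrounding program are purely illustrative, not Kata runtime code):
```
package main

import (
	"fmt"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	// Pod-level cgroup path chosen by the upper-level orchestrator.
	podCgroup := "/kubepods/pod1"

	// Each container spec points at a cgroup nested under the pod cgroup.
	container1 := specs.Spec{
		Linux: &specs.Linux{CgroupsPath: podCgroup + "/container1"},
	}
	container2 := specs.Spec{
		Linux: &specs.Linux{CgroupsPath: podCgroup + "/container2"},
	}

	fmt.Println(container1.Linux.CgroupsPath) // /kubepods/pod1/container1
	fmt.Println(container2.Linux.CgroupsPath) // /kubepods/pod1/container2
}
```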
Kata Containers introduces a non-negligible overhead for running a sandbox (pod). Based on this, two scenarios are possible:
1) The upper-layer orchestrator takes the overhead of running a sandbox into account when sizing the pod-cgroup, or
2) Kata Containers does not fully constrain the VMM and associated processes, instead placing a subset of them outside of the pod-cgroup.
Kata Containers provides two options for how cgroups are handled on the host. Selection of these options is done through
the `SandboxCgroupOnly` flag within the Kata Containers [configuration](https://github.com/kata-containers/runtime#configuration)
file.
## `SandboxCgroupOnly` enabled
With `SandboxCgroupOnly` enabled, it is expected that the parent cgroup is sized to take the overhead of running
a sandbox into account. This is ideal, as all the applicable Kata Containers components can be placed within the
given cgroup-path.
In the context of Kubernetes, Kubelet will size the pod-cgroup to take the overhead of running a Kata-based sandbox
into account. This will be feasible in the 1.16 Kubernetes release through the `PodOverhead` feature.
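For illustration only, a `RuntimeClass` carrying such an overhead could be declared with the Kubernetes `node.k8s.io/v1beta1` Go types roughly as follows; the handler name and the CPU/memory figures are assumptions, not measured Kata overheads:
```
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	nodev1beta1 "k8s.io/api/node/v1beta1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// RuntimeClass declaring a fixed per-pod overhead; Kubelet adds this to
	// the sum of container requests when sizing the pod cgroup.
	rc := nodev1beta1.RuntimeClass{
		ObjectMeta: metav1.ObjectMeta{Name: "kata"},
		Handler:    "kata", // assumed handler name
		Overhead: &nodev1beta1.Overhead{
			PodFixed: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("250m"),  // illustrative value
				corev1.ResourceMemory: resource.MustParse("160Mi"), // illustrative value
			},
		},
	}
	fmt.Printf("runtime class %q pod overhead: %v\n", rc.Name, rc.Overhead.PodFixed)
}
```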
```
+----------------------------------------------------------+
| +---------------------------------------------------+ |
| | +---------------------------------------------+ | |
| | | +--------------------------------------+ | | |
| | | | kata-shimv2, VMM and threads: | | | |
| | | | (VMM, IO-threads, vCPU threads, etc)| | | |
| | | | | | | |
| | | | kata-sandbox-<id> | | | |
| | | +--------------------------------------+ | | |
| | | | | |
| | |Pod 1 | | |
| | +---------------------------------------------+ | |
| | | |
| | +---------------------------------------------+ | |
| | | +--------------------------------------+ | | |
| | | | kata-shimv2, VMM and threads: | | | |
| | | | (VMM, IO-threads, vCPU threads, etc)| | | |
| | | | | | | |
| | | | kata-sandbox-<id> | | | |
| | | +--------------------------------------+ | | |
| | |Pod 2 | | |
| | +---------------------------------------------+ | |
| |kubepods | |
| +---------------------------------------------------+ |
| |
|Node |
+----------------------------------------------------------+
```
### What does Kata do in this configuration?
1. Given a `PodSandbox` container creation, let:
```
podCgroup=Parent(container.CgroupsPath)
KataSandboxCgroup=<podCgroup>/kata-sandbox-<PodSandboxID>
```
2. Create the cgroup, `KataSandboxCgroup`
3. Join the `KataSandboxCgroup`
Any process created by the runtime will be created in `KataSandboxCgroup`.
The runtime will not limit the cgroup in the host, but the caller is free
to set the proper limits for the `podCgroup`.
In the example above the pod cgroups are `/kubepods/pod1` and `/kubepods/pod2`.
Kata creates the unrestricted sandbox cgroup under the pod cgroup.
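The following is a minimal sketch of these steps, assuming a cgroups v1 hierarchy mounted at `/sys/fs/cgroup` and showing only the `cpu` controller; the helper names are illustrative and not the actual Kata runtime implementation:
```
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// sandboxCgroupPath derives the Kata sandbox cgroup from the cgroupsPath of
// the PodSandbox container, as described in the steps above.
func sandboxCgroupPath(containerCgroupsPath, sandboxID string) string {
	podCgroup := filepath.Dir(containerCgroupsPath) // Parent(container.CgroupsPath)
	return filepath.Join(podCgroup, "kata-sandbox-"+sandboxID)
}

// createAndJoin creates the sandbox cgroup for one controller and moves the
// calling process into it by writing its PID to cgroup.procs.
func createAndJoin(controller, cgroupPath string) error {
	hostPath := filepath.Join("/sys/fs/cgroup", controller, cgroupPath)
	if err := os.MkdirAll(hostPath, 0o755); err != nil {
		return err
	}
	pid := strconv.Itoa(os.Getpid())
	return os.WriteFile(filepath.Join(hostPath, "cgroup.procs"), []byte(pid), 0o644)
}

func main() {
	path := sandboxCgroupPath("/kubepods/pod1/container1", "abc123")
	fmt.Println(path) // /kubepods/pod1/kata-sandbox-abc123
	if err := createAndJoin("cpu", path); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```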
### Why create a Kata-cgroup under the parent cgroup?
`Docker` does not have a notion of pods, and will not create a cgroup directory
in which to place a particular container (i.e., all containers would be in a path like
`/docker/container-id`). To simplify the implementation and continue to support `Docker`,
Kata Containers creates the sandbox cgroup in the case of Kubernetes, or a container cgroup in the case
of `Docker`.
### Improvements
- Get statistics about pod resources
If the Kata caller wants to know the resource usage on the host, it can get
statistics from the pod cgroup. All cgroup stats in the hierarchy will include
the Kata overhead. This makes it possible to gather usage statistics at the
pod level and the container level (see the sketch after this list).
- Better host resource isolation
Because the Kata runtime will place all the Kata processes in the pod cgroup,
the resource limits that the caller applies to the pod cgroup will affect all
processes that belong to the Kata sandbox in the host. This improves
isolation in the host, preventing Kata from becoming a noisy neighbor.
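A minimal sketch of the statistics point above, assuming a cgroups v1 hierarchy mounted at `/sys/fs/cgroup`; the pod path is taken from the earlier example:
```
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readStat returns the raw value of a single cgroup v1 statistics file for
// the given controller and pod cgroup path, e.g. /kubepods/pod1.
func readStat(controller, podCgroup, file string) (string, error) {
	p := filepath.Join("/sys/fs/cgroup", controller, podCgroup, file)
	b, err := os.ReadFile(p)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}

func main() {
	pod := "/kubepods/pod1"
	// Both values include the Kata overhead (VMM, I/O threads, shim, ...)
	// because every Kata process lives somewhere under the pod cgroup.
	if mem, err := readStat("memory", pod, "memory.usage_in_bytes"); err == nil {
		fmt.Println("pod memory usage (bytes):", mem)
	}
	if cpu, err := readStat("cpuacct", pod, "cpuacct.usage"); err == nil {
		fmt.Println("pod CPU usage (ns):", cpu)
	}
}
```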
## `SandboxCgroupOnly` disabled (default, legacy)
If the cgroup provided to Kata is not sized appropriately, fully constraining
the Kata components would introduce instability, and the user workload would
receive only a subset of the resources it requested. Based on this, the default
handling for Kata Containers is to not fully constrain the VMM and Kata
components on the host.
```
+----------------------------------------------------------+
| +---------------------------------------------------+ |
| | +---------------------------------------------+ | |
| | | +--------------------------------------+ | | |
| | | |Container 1 |-|Container 2 | | | |
| | | | |-| | | | |
| | | | Shim+container1 |-| Shim+container2 | | | |
| | | +--------------------------------------+ | | |
| | | | | |
| | |Pod 1 | | |
| | +---------------------------------------------+ | |
| | | |
| | +---------------------------------------------+ | |
| | | +--------------------------------------+ | | |
| | | |Container 1 |-|Container 2 | | | |
| | | | |-| | | | |
| | | | Shim+container1 |-| Shim+container2 | | | |
| | | +--------------------------------------+ | | |
| | | | | |
| | |Pod 2 | | |
| | +---------------------------------------------+ | |
| |kubepods | |
| +---------------------------------------------------+ |
| +---------------------------------------------------+ |
| | Hypervisor | |
| |Kata | |
| +---------------------------------------------------+ |
| |
|Node |
+----------------------------------------------------------+
```
### What does this method do?
1. Given a container creation, let `containerCgroupHost=container.CgroupsPath`
1. Rename the `containerCgroupHost` path to add a `kata_` prefix
1. Let `PodCgroupPath=PodSandboxContainerCgroup`, where `PodSandboxContainerCgroup` is the cgroup of the container of type `PodSandbox`
1. Limit the `PodCgroupPath` with the sum of all the container limits in the sandbox
1. Move only the vCPU threads of the hypervisor to `PodCgroupPath`
1. For each container, move its `kata-shim` to its own `containerCgroupHost`
1. Move the hypervisor and applicable threads to the memory cgroup `/kata`
_Note_: the Kata Containers runtime adds only the vCPU threads, not all of the
hypervisor threads, to the requested cgroup path; the remaining threads run unconstrained.
This mitigates the risk of the VMM and other threads being killed in an out-of-memory (`OOM`) scenario.
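A hedged sketch of the vCPU step above, again assuming a cgroups v1 hierarchy mounted at `/sys/fs/cgroup`; `vcpuThreadIDs` is a placeholder for however the runtime learns the hypervisor's vCPU thread IDs and is not a real Kata helper:
```
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// moveThread places a single thread (task) into the given cgroup v1 path by
// appending its TID to the tasks file.
func moveThread(controller, cgroupPath string, tid int) error {
	tasks := filepath.Join("/sys/fs/cgroup", controller, cgroupPath, "tasks")
	f, err := os.OpenFile(tasks, os.O_WRONLY|os.O_APPEND, 0)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.WriteString(strconv.Itoa(tid))
	return err
}

// vcpuThreadIDs is a stand-in for querying the VMM for its vCPU thread IDs;
// the real discovery mechanism is not shown here.
func vcpuThreadIDs() []int {
	return []int{12345, 12346}
}

func main() {
	podCgroupPath := "/kubepods/pod1" // constrained with the sum of container limits
	for _, tid := range vcpuThreadIDs() {
		if err := moveThread("cpu", podCgroupPath, tid); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
	}
}
```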
#### Impact
If resources are reserved at the system level to account for the overhead of
running sandbox containers, this configuration can be utilized with adequate
stability. In this scenario, however, non-negligible amounts of CPU and memory are
consumed on the host without being accounted for.
[linux-config]: https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md
[cgroupspath]: https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#cgroups-path
## Summary
| cgroup option | default? | status | pros | cons
|-|-|-|-|-|
| `SandboxCgroupOnly=false` | yes | legacy | Easiest to make Kata work | Unaccounted for memory and resource utilization
| `SandboxCgroupOnly=true` | no | recommended | Complete tracking of Kata memory and CPU utilization. In Kubernetes, the Kubelet can fully constrain Kata via the pod cgroup | Requires upper layer orchestrator which sizes sandbox cgroup appropriately |


@@ -1,17 +1,12 @@
* [CPU constraints in Kata Containers](#cpu-constraints-in-kata-containers)
- [Virtual machine vCPU sizing in Kata Containers](#virtual-machine-vcpu-sizing-in-kata-containers)
* [Default number of virtual CPUs](#default-number-of-virtual-cpus)
* [Virtual CPUs and Kubernetes pods](#virtual-cpus-and-kubernetes-pods)
* [Container lifecycle](#container-lifecycle)
* [Container without CPU constraint](#container-without-cpu-constraint)
* [Container with CPU constraint](#container-with-cpu-constraint)
* [Do not waste resources](#do-not-waste-resources)
* [CPU cgroups](#cpu-cgroups)
* [cgroups in the guest](#cgroups-in-the-guest)
* [CPU pinning](#cpu-pinning)
* [cgroups in the host](#cgroups-in-the-host)
# CPU constraints in Kata Containers
# Virtual machine vCPU sizing in Kata Containers
## Default number of virtual CPUs
@@ -171,83 +166,6 @@ docker run --cpus 4 -ti debian bash -c "nproc; cat /sys/fs/cgroup/cpu,cpuacct/cp
```
## CPU cgroups
Kata Containers runs over two layers of cgroups. The first layer is in the guest, where
only the workload is placed; the second layer is in the host, which is more complex and
might contain more than one process and task (thread) depending on the number of
containers per pod and vCPUs per container. The following diagram represents an Nginx container
created with `docker` with the default number of vCPUs.
```
$ docker run -dt --runtime=kata-runtime nginx
.-------.
| Nginx |
.--'-------'---. .------------.
| Guest Cgroup | | Kata agent |
.-'--------------'--'------------'. .-----------.
| Thread: Hypervisor's vCPU 0 | | Kata Shim |
.'---------------------------------'. .'-----------'.
| Tasks | | Processes |
.'-----------------------------------'--'-------------'.
| Host Cgroup |
'------------------------------------------------------'
```
The next sections explain the difference between processes and tasks and why only hypervisor
vCPUs are constrained.
### cgroups in the guest
Only the workload process, including all its threads, is placed into CPU cgroups; this means
that `kata-agent` and `systemd` run without constraints in the guest.
#### CPU pinning
Kata Containers tries to apply and honor the cgroups but sometimes that is not possible.
An example of this occurs with CPU cgroups when the number of virtual CPUs (in the guest)
does not match the actual number of physical host CPUs.
In Kata Containers, to achieve good performance and a small memory footprint, resources are
hot added when they are needed, therefore the number of virtual resources is not the same
as the number of physical resources. The problem with this approach is that it is not possible
to pin a process to a specific resource that is not present in the guest. To deal with this
limitation, and to avoid failing when the container is being created, Kata Containers does not apply
the constraint in the first layer (guest) if the resource does not exist in the guest, but it
is applied in the second layer (host) where the hypervisor is running. The constraint is applied
in both layers when the resource is available in both the guest and the host. The next sections provide
further details on which parts of the hypervisor are constrained.
### cgroups in the host
In Kata Containers the workloads run in a virtual machine that is managed and represented by a
hypervisor running in the host. Like other processes, the hypervisor might use threads to perform
several tasks, for example I/O and network operations. One of the most important uses for these
threads is as vCPUs. The processes running in the guest see these vCPUs as physical CPUs, while
in the host those vCPUs are just threads that are part of a process. This is the key to ensuring a
workload consumes only the amount of CPU resources that was assigned to it, without impacting
other operations. From the user's perspective, the easiest approach would be to take the
whole hypervisor, including its threads, and move it into the cgroup; unfortunately, this negatively
impacts performance, since vCPU, I/O, and network threads end up fighting for
resources. The following table shows a random read performance comparison between a Kata Container
with all its hypervisor threads in the cgroup and another with only its hypervisor vCPU threads
constrained; the difference is significant.
| Block size | All threads | vCPU threads | Units |
|:-------------:|:-------------:|:------------:|:-----:|
| 4k | 136.2 | 294.7 | MB/s |
| 8k | 166.6 | 579.4 | MB/s |
| 16k | 178.3 | 1093.3 | MB/s |
| 32k | 179.9 | 1931.5 | MB/s |
| 64k | 213.6 | 3994.2 | MB/s |
To get the best performance, Kata Containers constrains only the vCPU threads.
[1]: https://docs.docker.com/config/containers/resource_constraints/#cpu
[2]: https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource
[3]: https://kubernetes.io/docs/concepts/workloads/pods/pod/