docs: Host cgroups documentation update

Update according to the new sandbox/overhead cgroup split.

Signed-off-by: Samuel Ortiz <samuel.e.ortiz@protonmail.com>

The OCI [runtime specification][linux-config] provides guidance on where the container cgroups should be placed:

> [`cgroupsPath`][cgroupspath]: (string, OPTIONAL) path to the cgroups. It can be used to either control the cgroups
> hierarchy for containers or to run a new process in an existing container

Cgroups are hierarchical, and this can be seen with the following pod example (a sketch of the resulting on-disk layout follows the list):

- Pod 1: `cgroupsPath=/kubepods/pod1`
  - Container 1: `cgroupsPath=/kubepods/pod1/container1`
  - Container 2: `cgroupsPath=/kubepods/pod1/container2`

- Pod 2: `cgroupsPath=/kubepods/pod2`
  - Container 1: `cgroupsPath=/kubepods/pod2/container1`
  - Container 2: `cgroupsPath=/kubepods/pod2/container2`

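On a cgroup v1 host, that hierarchy materializes as nested directories under each controller's mount point. A minimal sketch, assuming the memory controller is mounted at the conventional `/sys/fs/cgroup/memory` and the pod names used above:

```
$ tree -d /sys/fs/cgroup/memory/kubepods
/sys/fs/cgroup/memory/kubepods
├── pod1
│   ├── container1
│   └── container2
└── pod2
    ├── container1
    └── container2
```
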
Depending on the upper-level orchestration layers, the cgroup under which the pod is placed may or may not be
managed by the orchestrator. In the case of Kubernetes, the pod cgroup is created by the Kubelet,
while the container cgroups are to be handled by the runtime.
The Kubelet will size the pod cgroup based on the container resource requirements, to which it may add
a configured set of [pod resource overheads](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/).

Kata Containers introduces a non-negligible resource overhead for running a sandbox (pod). Typically, the Kata shim,
through its underlying VMM invocation, will create many additional threads compared to process-based container runtimes:
the para-virtualized I/O back-ends, the VMM instance and even the Kata shim process itself are all host processes that
consume memory and CPU time not directly tied to the container workload, and together they introduce a sandbox
resource overhead.

In order for a Kata workload to run without significant performance degradation, its sandbox overhead must be
provisioned accordingly. Two scenarios are possible:

1) The upper-layer orchestrator takes the overhead of running a sandbox into account when sizing the pod cgroup.
   For example, the Kubernetes [`PodOverhead`](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/)
   feature lets the orchestrator add a configured sandbox overhead to the sum of all of its containers' resources
   (a minimal `RuntimeClass` sketch follows this list). In that case, the pod cgroup is properly sized and all
   Kata created processes will run under the constraints and limits defined on the pod cgroup.

2) The upper-layer orchestrator does **not** take the sandbox overhead into account and the pod cgroup is not
   sized to properly run all Kata created processes. In that scenario, attaching all the Kata processes to the
   sandbox cgroup may lead to non-negligible workload performance degradations. As a consequence, Kata Containers
   will move all processes but the vCPU threads into a dedicated overhead cgroup under `/kata_overhead`. The Kata
   runtime will not apply any constraints or limits to that cgroup; it is up to the infrastructure owner to
   optionally set it up.

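To illustrate the first scenario, a Kubernetes `RuntimeClass` can declare the sandbox overhead that the Kubelet will add when sizing the pod cgroup. A minimal sketch; the handler name `kata` and the overhead values are illustrative and must match the actual deployment:

```
$ cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
overhead:
  podFixed:
    memory: "160Mi"
    cpu: "250m"
EOF
```
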
Those two scenarios are not dynamically detected by the Kata Containers runtime implementation, and thus the
infrastructure owner must configure the runtime according to how the upper-layer orchestrator creates and sizes the
pod cgroup. That configuration selection is done through the `sandbox_cgroup_only` flag within the Kata Containers
[configuration](../../src/runtime/README.md#configuration) file.

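For example, assuming the runtime configuration lives at `/etc/kata-containers/configuration.toml` (the path varies across installations; `/usr/share/defaults/kata-containers/configuration.toml` is another common location), the flag can be inspected and flipped with:

```
$ grep sandbox_cgroup_only /etc/kata-containers/configuration.toml
sandbox_cgroup_only=false
$ sudo sed -i 's/^sandbox_cgroup_only *=.*/sandbox_cgroup_only=true/' \
    /etc/kata-containers/configuration.toml
```
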
## `sandbox_cgroup_only = true`

Setting `sandbox_cgroup_only` to `true` in the Kata Containers configuration file means that the pod cgroup is
properly sized and takes the pod overhead into account. This is ideal, as all the applicable Kata Containers processes
can simply be placed within the given cgroup path.

In the context of Kubernetes, the Kubelet can size the pod cgroup to take the overhead of running a Kata-based sandbox
into account. This has been supported since the 1.16 Kubernetes release, through the
[`PodOverhead`](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/) feature.

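A pod opts into that overhead accounting by referencing the Kata runtime class. A minimal sketch, assuming a `RuntimeClass` named `kata` exists on the cluster:

```
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: kata-example
spec:
  runtimeClassName: kata
  containers:
  - name: app
    image: nginx
EOF
```
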
```
┌─────────────────────────────────────────┐
│                                         │
│  ┌───────────────────────────────────┐  │
│  │                                   │  │
│  │  ┌───────────────────────────┐    │  │
│  │  │                           │    │  │
│  │  │  ┌─────────────────────┐  │    │  │
│  │  │  │ vCPU threads        │  │    │  │
│  │  │  │ I/O threads         │  │    │  │
│  │  │  │ VMM                 │  │    │  │
│  │  │  │ Kata Shim           │  │    │  │
│  │  │  │                     │  │    │  │
│  │  │  │ /kata_<sandbox_id>  │  │    │  │
│  │  │  └─────────────────────┘  │    │  │
│  │  │ Pod 1                     │    │  │
│  │  └───────────────────────────┘    │  │
│  │                                   │  │
│  │  ┌───────────────────────────┐    │  │
│  │  │                           │    │  │
│  │  │  ┌─────────────────────┐  │    │  │
│  │  │  │ vCPU threads        │  │    │  │
│  │  │  │ I/O threads         │  │    │  │
│  │  │  │ VMM                 │  │    │  │
│  │  │  │ Kata Shim           │  │    │  │
│  │  │  │                     │  │    │  │
│  │  │  │ /kata_<sandbox_id>  │  │    │  │
│  │  │  └─────────────────────┘  │    │  │
│  │  │ Pod 2                     │    │  │
│  │  └───────────────────────────┘    │  │
│  │                                   │  │
│  │ /kubepods                         │  │
│  └───────────────────────────────────┘  │
│                                         │
│ Node                                    │
└─────────────────────────────────────────┘
```

### Implementation details

When `sandbox_cgroup_only` is enabled, the Kata shim will create a per-pod
sub-cgroup under the pod's dedicated cgroup. For example, in the Kubernetes context,
it will create a `/kata_<PodSandboxID>` under the `/kubepods` cgroup hierarchy.
On a typical cgroup v1 hierarchy mounted under `/sys/fs/cgroup/`, the memory cgroup
subsystem for a pod with sandbox ID `12345678` would live under
`/sys/fs/cgroup/memory/kubepods/kata_12345678`.

In most cases, the newly created `/kata_<PodSandboxID>` cgroup is unrestricted and inherits and shares all
constraints and limits from the parent cgroup (`/kubepods` in the Kubernetes case). The exceptions are
the `cpuset` and `devices` cgroup subsystems, which are managed by the Kata shim.

After creating the `/kata_<PodSandboxID>` cgroup, the Kata Containers shim will move itself into it, **before** starting
the virtual machine. As a consequence, all processes subsequently created by the Kata Containers shim (the VMM itself, and
all vCPU and I/O related threads) will be created in the `/kata_<PodSandboxID>` cgroup.

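The move itself boils down to a `cgroup.procs` write, which the shim performs for every cgroup subsystem it manages. A shell sketch of the equivalent operation for the memory controller, with an illustrative sandbox ID:

```
# Create the sandbox cgroup, then move the current process into it.
# Children forked afterwards (VMM, vCPU and I/O threads) start inside it.
$ sudo mkdir -p /sys/fs/cgroup/memory/kubepods/kata_12345678
$ echo $$ | sudo tee /sys/fs/cgroup/memory/kubepods/kata_12345678/cgroup.procs
```
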
### Why create a kata-cgroup under the parent cgroup?

And why not add the per-sandbox shim directly to the pod cgroup (e.g.
`/kubepods` in the Kubernetes context)?

The Kata Containers shim implementation creates a per-sandbox cgroup
(`/kata_<PodSandboxID>`) to support the `Docker` use case. Although `Docker` does not
have a notion of pods, Kata Containers still creates a sandbox to support the pod-less,
single container use case that `Docker` implements. Since `Docker` does not create any
cgroup hierarchy to place a container into, it would be very complex for Kata to map
a particular container to its sandbox without placing it under a `/kata_<containerID>`
sub-cgroup first.

### Advantages

Keeping all Kata Containers processes under a properly sized pod cgroup is ideal
and makes for a simpler Kata Containers implementation. It also helps with gathering
accurate statistics and preventing Kata workloads from being noisy neighbors.

#### Pod resources statistics

If the Kata caller wants to know the resource usage on the host, it can get
statistics from the pod cgroup. All cgroup stats in the hierarchy will include
the Kata overhead, which makes it possible to gather usage statistics at both the
pod level and the container level.

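For example, pod-level accounting can be read straight from the cgroup v1 filesystem (pod cgroup path and values illustrative):

```
# Total memory used by the pod, Kata overhead included:
$ cat /sys/fs/cgroup/memory/kubepods/pod1/memory.usage_in_bytes
268435456
# Cumulative CPU time consumed by the pod, in nanoseconds:
$ cat /sys/fs/cgroup/cpuacct/kubepods/pod1/cpuacct.usage
4200000000
```
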
#### Better host resource isolation

Because the Kata runtime will place all the Kata processes in the pod cgroup,
the resource limits that the caller applies to the pod cgroup will affect all
processes that belong to the Kata sandbox on the host. This improves the
isolation on the host, preventing Kata from becoming a noisy neighbor.

## `sandbox_cgroup_only = false` (default setting)

If the cgroup provided to Kata is not sized appropriately, Kata components will
consume resources that the actual container workloads expect to see and use.
This can cause instability and performance degradations.

To avoid that situation, Kata Containers creates an unconstrained overhead
cgroup and moves all non-workload related processes (anything but the virtual CPU
threads) to it. The name of this overhead cgroup is `/kata_overhead` and a per-sandbox
sub-cgroup will be created under it for each sandbox Kata Containers creates.

Kata Containers does not add any constraints or limitations on the overhead cgroup. It is up to the infrastructure
owner to either:

- Provision nodes with a pre-sized `/kata_overhead` cgroup, as sketched below. Kata Containers will
  load that existing cgroup and move all non-workload related processes to it.
- Let Kata Containers create the `/kata_overhead` cgroup, and leave it
  unconstrained or resize it after the fact.

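As an illustration of the first option, a node could pre-size the overhead cgroup at provisioning time. A cgroup v1 sketch with purely illustrative limits:

```
# Pre-create /kata_overhead on the relevant controllers:
$ sudo mkdir -p /sys/fs/cgroup/memory/kata_overhead /sys/fs/cgroup/cpu/kata_overhead
# Cap the aggregate Kata overhead at 512 MiB:
$ echo $((512*1024*1024)) | sudo tee /sys/fs/cgroup/memory/kata_overhead/memory.limit_in_bytes
# De-prioritize it relative to workload cgroups (default cpu.shares is 1024):
$ echo 512 | sudo tee /sys/fs/cgroup/cpu/kata_overhead/cpu.shares
```
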
```
┌──────────────────────────────────────────────────────────────────┐
│                                                                  │
│  ┌───────────────────────────┐    ┌───────────────────────────┐  │
│  │                           │    │                           │  │
│  │ ┌─────────────────────────┼────┼─────────────────────────┐ │  │
│  │ │                         │    │                         │ │  │
│  │ │ ┌─────────────────────┐ │    │ ┌─────────────────────┐ │ │  │
│  │ │ │ vCPU threads        │ │    │ │ VMM                 │ │ │  │
│  │ │ │                     │ │    │ │ I/O threads         │ │ │  │
│  │ │ │                     │ │    │ │ Kata Shim           │ │ │  │
│  │ │ │                     │ │    │ │                     │ │ │  │
│  │ │ │ /kata_<sandbox_id>  │ │    │ │ /<sandbox_id>       │ │ │  │
│  │ │ └─────────────────────┘ │    │ └─────────────────────┘ │ │  │
│  │ │                         │    │                         │ │  │
│  │ │ Pod 1                   │    │                         │ │  │
│  │ └─────────────────────────┼────┼─────────────────────────┘ │  │
│  │                           │    │                           │  │
│  │ ┌─────────────────────────┼────┼─────────────────────────┐ │  │
│  │ │                         │    │                         │ │  │
│  │ │ ┌─────────────────────┐ │    │ ┌─────────────────────┐ │ │  │
│  │ │ │ vCPU threads        │ │    │ │ VMM                 │ │ │  │
│  │ │ │                     │ │    │ │ I/O threads         │ │ │  │
│  │ │ │                     │ │    │ │ Kata Shim           │ │ │  │
│  │ │ │                     │ │    │ │                     │ │ │  │
│  │ │ │ /kata_<sandbox_id>  │ │    │ │ /<sandbox_id>       │ │ │  │
│  │ │ └─────────────────────┘ │    │ └─────────────────────┘ │ │  │
│  │ │                         │    │                         │ │  │
│  │ │ Pod 2                   │    │                         │ │  │
│  │ └─────────────────────────┼────┼─────────────────────────┘ │  │
│  │                           │    │                           │  │
│  │ /kubepods                 │    │ /kata_overhead            │  │
│  └───────────────────────────┘    └───────────────────────────┘  │
│                                                                  │
│ Node                                                             │
└──────────────────────────────────────────────────────────────────┘
```

### Implementation details

When `sandbox_cgroup_only` is disabled, the Kata Containers shim will create a per-pod
sub-cgroup under the pod's dedicated cgroup, and another one under the overhead cgroup.
For example, in the Kubernetes context, it will create a `/kata_<PodSandboxID>` under
the `/kubepods` cgroup hierarchy, and a `/<PodSandboxID>` under the `/kata_overhead` one.

On a typical cgroup v1 hierarchy mounted under `/sys/fs/cgroup/`, for a pod whose sandbox
ID is `12345678`, created with `sandbox_cgroup_only` disabled, the two memory cgroups
for the sandbox cgroup and the overhead cgroup would respectively live under
`/sys/fs/cgroup/memory/kubepods/kata_12345678` and `/sys/fs/cgroup/memory/kata_overhead/12345678`.

Unlike when `sandbox_cgroup_only` is enabled, the Kata Containers shim will move itself
to the overhead cgroup first, and then move the vCPU threads to the sandbox cgroup as
they're created. All Kata processes and threads will run under the overhead cgroup except for
the vCPU threads.

With `sandbox_cgroup_only` disabled, Kata Containers assumes the pod cgroup is only sized
to accommodate the actual container workload processes. For Kata, this maps
to the VMM-created virtual CPU threads, so they are the only ones running under the pod
cgroup. This mitigates the risk of the VMM, the Kata shim and the I/O threads going through
a catastrophic out of memory scenario (`OOM`).

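That split can be observed on a live host: the sandbox cgroup's `tasks` file only lists the vCPU thread IDs, while the shim, the VMM main thread and the I/O threads appear under the overhead cgroup (sandbox ID and thread IDs illustrative):

```
$ cat /sys/fs/cgroup/cpu/kubepods/kata_12345678/tasks
4321
4322
$ cat /sys/fs/cgroup/cpu/kata_overhead/12345678/tasks
4300
4301
4310
```
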
#### Pros and cons

Running all non-vCPU threads under an unconstrained overhead cgroup could lead to workloads
potentially consuming a large amount of host resources.

On the other hand, running all non-vCPU threads under a dedicated overhead cgroup can provide
accurate metrics on the actual Kata Containers pod overhead, allowing for tuning the overhead
cgroup size and constraints accordingly.

[linux-config]: https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md
[cgroupspath]: https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#cgroups-path

# Supported cgroups

Kata Containers currently only supports cgroups `v1`.

In the following sections each cgroup version is described briefly.

## Cgroups V1

A process can join a cgroup by writing its process id (`pid`) to the `cgroup.procs` file,
or join a cgroup partially by writing a task (thread) id (`tid`) to the `tasks` file.

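For instance, with a hypothetical `mygroup` cgroup under the `cpu` controller:

```
# Move a whole process (pid 1234) into the cgroup:
$ echo 1234 | sudo tee /sys/fs/cgroup/cpu/mygroup/cgroup.procs
# Move a single thread (tid 1235) only:
$ echo 1235 | sudo tee /sys/fs/cgroup/cpu/mygroup/tasks
```
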
Kata Containers only supports `v1`.
To know more about `cgroups v1`, see [cgroupsv1(7)][2].

## Cgroups V2

Same as `cgroups v1`, a process can join the cgroup by writing its process id (`pid`) to the
`cgroup.procs` file, or join a cgroup partially by writing the task (thread) id (`tid`) to the
`cgroup.threads` file.

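The v2 equivalent, again with a hypothetical `mygroup` under the unified hierarchy; note that per-thread placement requires the cgroup to be switched to `threaded` mode first:

```
# Move a whole process into the cgroup:
$ echo 1234 | sudo tee /sys/fs/cgroup/mygroup/cgroup.procs
# Enable threaded mode, then move a single thread:
$ echo threaded | sudo tee /sys/fs/cgroup/mygroup/cgroup.type
$ echo 1235 | sudo tee /sys/fs/cgroup/mygroup/cgroup.threads
```
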
Kata Containers does not support cgroups `v2` on the host.

### Distro Support

Many Linux distributions do not yet support `cgroups v2`, as it is quite a recent addition.
For more information about the status of this feature, see [issue #2494][4].

[1]: http://man7.org/linux/man-pages/man5/tmpfs.5.html
[2]: http://man7.org/linux/man-pages/man7/cgroups.7.html#CGROUPS_VERSION_1