Mirror of https://github.com/kata-containers/kata-containers.git
Merge pull request #543 from jcvenegas/SandboxCgroupOnly-docs
docs: Add documentation about host cgroup management
@@ -138,7 +138,7 @@ these commands is potentially challenging.
See issue https://github.com/clearcontainers/runtime/issues/341 and [the constraints challenge](#the-constraints-challenge) for more information.

For CPU resource management, see
-[CPU constraints](design/cpu-constraints.md).
+[CPU constraints](design/vcpu-handling.md).

### docker run and shared memory
@@ -6,3 +6,5 @@ Kata Containers design documents:
- [API Design of Kata Containers](kata-api-design.md)
- [Design requirements for Kata Containers](kata-design-requirements.md)
- [VSocks](VSocks.md)
+- [VCPU handling](vcpu-handling.md)
+- [Host cgroups](host-cgroups.md)
@@ -130,5 +130,5 @@ the containers are removed automatically.
[2]: https://github.com/kata-containers/proxy
[3]: https://github.com/hashicorp/yamux
[4]: https://wiki.qemu.org/Features/VirtioVsock
-[5]: ./cpu-constraints.md#virtual-cpus-and-kubernetes-pods
+[5]: ./vcpu-handling.md#virtual-cpus-and-kubernetes-pods
[6]: https://github.com/kata-containers/shim
design/host-cgroups.md · 208 lines · new file
@@ -0,0 +1,208 @@
- [Host cgroup management](#host-cgroup-management)
  - [Introduction](#introduction)
  - [`SandboxCgroupOnly` enabled](#sandboxcgrouponly-enabled)
    - [What does Kata do in this configuration?](#what-does-kata-do-in-this-configuration)
    - [Why create a Kata-cgroup under the parent cgroup?](#why-create-a-kata-cgroup-under-the-parent-cgroup)
    - [Improvements](#improvements)
  - [`SandboxCgroupOnly` disabled (default, legacy)](#sandboxcgrouponly-disabled-default-legacy)
    - [What does this method do?](#what-does-this-method-do)
      - [Impact](#impact)
  - [Summary](#summary)

# Host cgroup management

## Introduction

In Kata Containers, workloads run in a virtual machine that is managed by a virtual
machine monitor (VMM) running on the host. As a result, Kata Containers run over two layers of cgroups. The
first layer is in the guest where the workload is placed, while the second layer is on the host where the
VMM and associated threads are running.

The OCI [runtime specification][linux-config] provides guidance on where the container cgroups should be placed:

> [`cgroupsPath`][cgroupspath]: (string, OPTIONAL) path to the cgroups. It can be used to either control the cgroups
> hierarchy for containers or to run a new process in an existing container

cgroups are hierarchical, as the following pod example shows:

- Pod 1: `cgroupsPath=/kubepods/pod1`
  - Container 1: `cgroupsPath=/kubepods/pod1/container1`
  - Container 2: `cgroupsPath=/kubepods/pod1/container2`

- Pod 2: `cgroupsPath=/kubepods/pod2`
  - Container 1: `cgroupsPath=/kubepods/pod2/container1`
  - Container 2: `cgroupsPath=/kubepods/pod2/container2`
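
On a host that uses cgroup v1, this hierarchy maps onto per-controller directories under `/sys/fs/cgroup`. A minimal sketch of what the `cpu` controller could contain for the example above (paths are illustrative; the exact layout depends on the cgroup driver and orchestrator):

```
$ find /sys/fs/cgroup/cpu/kubepods -maxdepth 2 -type d
/sys/fs/cgroup/cpu/kubepods
/sys/fs/cgroup/cpu/kubepods/pod1
/sys/fs/cgroup/cpu/kubepods/pod1/container1
/sys/fs/cgroup/cpu/kubepods/pod1/container2
/sys/fs/cgroup/cpu/kubepods/pod2
/sys/fs/cgroup/cpu/kubepods/pod2/container1
/sys/fs/cgroup/cpu/kubepods/pod2/container2
```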

Depending on the upper-level orchestrator, the cgroup under which the pod is placed is
managed by the orchestrator. In the case of Kubernetes, the pod-cgroup is created by the Kubelet,
while the container cgroups are handled by the runtime. The Kubelet sizes the pod-cgroup
based on the container resource requirements.

Kata Containers introduces a non-negligible overhead for running a sandbox (pod). Based on this, two scenarios are possible:
1) The upper-layer orchestrator takes the overhead of running a sandbox into account when sizing the pod-cgroup, or
2) Kata Containers does not fully constrain the VMM and associated processes, instead placing a subset of them outside of the pod-cgroup.

Kata Containers provides two options for how cgroups are handled on the host. Selection between these options is made through
the `SandboxCgroupOnly` flag within the Kata Containers [configuration](https://github.com/kata-containers/runtime#configuration)
file.
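
As a sketch, the flag is set in the runtime section of the Kata Containers `configuration.toml` (the exact file location, for example `/usr/share/defaults/kata-containers/configuration.toml`, depends on the installation):

```
[runtime]
# Place all host-side Kata processes for a pod in a single sandbox cgroup
# under the parent cgroup provided by the caller.
sandbox_cgroup_only = true
```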

## `SandboxCgroupOnly` enabled

With `SandboxCgroupOnly` enabled, the parent cgroup is expected to be sized to take the overhead of running
a sandbox into account. This is the ideal case, as all the applicable Kata Containers components can be placed within the
given cgroup-path.

In the context of Kubernetes, the Kubelet will size the pod-cgroup to take the overhead of running a Kata-based sandbox
into account. This becomes feasible in the Kubernetes 1.16 release through the `PodOverhead` feature.
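
With the `PodOverhead` feature gate enabled, the overhead is declared on the `RuntimeClass` used for Kata. A sketch, assuming Kubernetes 1.16; the handler name and overhead values are illustrative and should be measured for each deployment:

```
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
overhead:
  podFixed:
    cpu: "250m"
    memory: "160Mi"
```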

```
+----------------------------------------------------+
| +------------------------------------------------+ |
| | +--------------------------------------------+ | |
| | | +----------------------------------------+ | | |
| | | | kata-shimv2, VMM and threads:          | | | |
| | | | (VMM, IO-threads, vCPU threads, etc)   | | | |
| | | |                                        | | | |
| | | | kata-sandbox-<id>                      | | | |
| | | +----------------------------------------+ | | |
| | |                                            | | |
| | |Pod 1                                       | | |
| | +--------------------------------------------+ | |
| |                                                | |
| | +--------------------------------------------+ | |
| | | +----------------------------------------+ | | |
| | | | kata-shimv2, VMM and threads:          | | | |
| | | | (VMM, IO-threads, vCPU threads, etc)   | | | |
| | | |                                        | | | |
| | | | kata-sandbox-<id>                      | | | |
| | | +----------------------------------------+ | | |
| | |                                            | | |
| | |Pod 2                                       | | |
| | +--------------------------------------------+ | |
| |kubepods                                        | |
| +------------------------------------------------+ |
|                                                    |
|Node                                                |
+----------------------------------------------------+
```

### What does Kata do in this configuration?

1. Given a `PodSandbox` container creation, let:

   ```
   podCgroup=Parent(container.CgroupsPath)
   KataSandboxCgroup=<podCgroup>/kata-sandbox-<PodSandboxID>
   ```

2. Create the cgroup, `KataSandboxCgroup`

3. Join the `KataSandboxCgroup`

Any process created by the runtime will be created in `KataSandboxCgroup`.
The runtime will not limit the cgroup in the host, but the caller is free
to set the proper limits for the `podCgroup`.

In the example above the pod cgroups are `/kubepods/pod1` and `/kubepods/pod2`.
Kata creates the unrestricted sandbox cgroup under the pod cgroup.
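
As an illustration, for Pod 1 above the runtime would create and join a path such as the following, and every host-side process it spawns for that sandbox (VMM, vCPU and I/O threads, shim) would then show up under it (a sketch, assuming cgroup v1 and the `cpu` controller):

```
$ cd /sys/fs/cgroup/cpu/kubepods/pod1/kata-sandbox-<PodSandboxID>
$ cat cgroup.procs    # PIDs of the VMM and other Kata processes for this sandbox
```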

### Why create a Kata-cgroup under the parent cgroup?

`Docker` does not have a notion of pods, and will not create a cgroup directory
to place a particular container in (i.e., all containers would be in a path like
`/docker/container-id`). To simplify the implementation and continue to support `Docker`,
Kata Containers creates the sandbox cgroup in the case of Kubernetes, or a container cgroup in the case
of `Docker`.

### Improvements

- Get statistics about pod resources

  If the Kata caller wants to know the resource usage on the host, it can get
  statistics from the pod cgroup. All cgroup stats in the hierarchy will include
  the Kata overhead. This makes it possible to gather usage statistics at the
  pod level and at the container level (see the sketch after this list).

- Better host resource isolation

  Because the Kata runtime will place all the Kata processes in the pod cgroup,
  the resource limits that the caller applies to the pod cgroup will affect all
  processes that belong to the Kata sandbox in the host. This improves
  isolation on the host, preventing Kata from becoming a noisy neighbor.
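
A sketch of reading pod-level usage, Kata overhead included, directly from the pod cgroup (cgroup v1 controller paths; the pod path is illustrative):

```
$ cat /sys/fs/cgroup/cpu,cpuacct/kubepods/pod1/cpuacct.usage      # total CPU time, in nanoseconds
$ cat /sys/fs/cgroup/memory/kubepods/pod1/memory.usage_in_bytes   # current memory usage
```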

## `SandboxCgroupOnly` disabled (default, legacy)

If the cgroup provided to Kata is not sized appropriately, fully constraining the Kata
components introduces instability, and the user workload will only see a subset of
the resources it requested. Because of this, the default
handling for Kata Containers is to not fully constrain the VMM and Kata
components on the host.

```
+----------------------------------------------------+
| +------------------------------------------------+ |
| | +--------------------------------------------+ | |
| | | +----------------------------------------+ | | |
| | | |Container 1       |-|Container 2        | | | |
| | | |                  |-|                   | | | |
| | | | Shim+container1  |-| Shim+container2   | | | |
| | | +----------------------------------------+ | | |
| | |                                            | | |
| | |Pod 1                                       | | |
| | +--------------------------------------------+ | |
| |                                                | |
| | +--------------------------------------------+ | |
| | | +----------------------------------------+ | | |
| | | |Container 1       |-|Container 2        | | | |
| | | |                  |-|                   | | | |
| | | | Shim+container1  |-| Shim+container2   | | | |
| | | +----------------------------------------+ | | |
| | |                                            | | |
| | |Pod 2                                       | | |
| | +--------------------------------------------+ | |
| |kubepods                                        | |
| +------------------------------------------------+ |
| +------------------------------------------------+ |
| | Hypervisor                                     | |
| |Kata                                            | |
| +------------------------------------------------+ |
|                                                    |
|Node                                                |
+----------------------------------------------------+
```

### What does this method do?

1. Given a container creation, let `containerCgroupHost=container.CgroupsPath`
1. Rename the `containerCgroupHost` path to add a `kata_` prefix
1. Let `PodCgroupPath=PodSandboxContainerCgroup`, where `PodSandboxContainerCgroup` is the cgroup of the container of type `PodSandbox`
1. Limit `PodCgroupPath` with the sum of all the container limits in the sandbox
1. Move only the vCPU threads of the hypervisor to `PodCgroupPath`
1. For each container, move its `kata-shim` to its own `containerCgroupHost`
1. Move the hypervisor and applicable threads to the memory cgroup `/kata`

_Note_: the Kata Containers runtime will not add all the hypervisor threads to
the requested cgroup path, only the vCPU threads. The remaining threads run unconstrained.

This mitigates the risk of the VMM and other threads hitting an out-of-memory (`OOM`) condition.
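
A rough sketch of the resulting host-side layout in this mode (cgroup v1; paths and names are illustrative):

```
/sys/fs/cgroup/cpu/kubepods/pod1/                      # PodCgroupPath: sum of container limits, vCPU threads only
/sys/fs/cgroup/cpu/kubepods/pod1/kata_<container-id>/  # renamed container cgroup, holds that container's kata-shim
/sys/fs/cgroup/memory/kata/                            # memory cgroup for the hypervisor and remaining threads
```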

#### Impact

If resources are reserved at a system level to account for the overheads of
running sandbox containers, this configuration can be used with adequate
stability. In this scenario, non-negligible amounts of CPU and memory will be
consumed on the host without being accounted for.
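
One way to reserve such resources at the node level is through the Kubelet reservation flags, for example (values are illustrative and need to be tuned to the expected sandbox density):

```
kubelet --system-reserved=cpu=500m,memory=1Gi \
        --kube-reserved=cpu=500m,memory=512Mi ...
```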

[linux-config]: https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md
[cgroupspath]: https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#cgroups-path

## Summary

| cgroup option | default? | status | pros | cons |
|-|-|-|-|-|
| `SandboxCgroupOnly=false` | yes | legacy | Easiest to make Kata work | Unaccounted-for memory and resource utilization |
| `SandboxCgroupOnly=true` | no | recommended | Complete tracking of Kata memory and CPU utilization. In Kubernetes, the Kubelet can fully constrain Kata via the pod cgroup | Requires an upper-layer orchestrator that sizes the sandbox cgroup appropriately |
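
To check which mode a given host is using, the installed configuration can be inspected directly (the path below is a common default and may differ per distribution):

```
$ grep sandbox_cgroup_only /usr/share/defaults/kata-containers/configuration.toml
sandbox_cgroup_only=true
```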
design/cpu-constraints.md → design/vcpu-handling.md (renamed)
@@ -1,17 +1,12 @@
-* [CPU constraints in Kata Containers](#cpu-constraints-in-kata-containers)
-    * [Default number of virtual CPUs](#default-number-of-virtual-cpus)
-    * [Virtual CPUs and Kubernetes pods](#virtual-cpus-and-kubernetes-pods)
-    * [Container lifecycle](#container-lifecycle)
-    * [Container without CPU constraint](#container-without-cpu-constraint)
-    * [Container with CPU constraint](#container-with-cpu-constraint)
-    * [Do not waste resources](#do-not-waste-resources)
-    * [CPU cgroups](#cpu-cgroups)
-        * [cgroups in the guest](#cgroups-in-the-guest)
-            * [CPU pinning](#cpu-pinning)
-        * [cgroups in the host](#cgroups-in-the-host)
+- [Virtual machine vCPU sizing in Kata Containers](#virtual-machine-vcpu-sizing-in-kata-containers)
+    * [Default number of virtual CPUs](#default-number-of-virtual-cpus)
+    * [Virtual CPUs and Kubernetes pods](#virtual-cpus-and-kubernetes-pods)
+    * [Container lifecycle](#container-lifecycle)
+    * [Container without CPU constraint](#container-without-cpu-constraint)
+    * [Container with CPU constraint](#container-with-cpu-constraint)
+    * [Do not waste resources](#do-not-waste-resources)

-# CPU constraints in Kata Containers
+# Virtual machine vCPU sizing in Kata Containers

## Default number of virtual CPUs
@@ -171,83 +166,6 @@ docker run --cpus 4 -ti debian bash -c "nproc; cat /sys/fs/cgroup/cpu,cpuacct/cp
```

-## CPU cgroups
-
-Kata Containers runs over two layers of cgroups: the first layer is in the guest, where
-only the workload is placed; the second layer is in the host, which is more complex and
-might contain more than one process and task (thread) depending on the number of
-containers per pod and vCPUs per container. The following diagram represents an Nginx container
-created with `docker` with the default number of vCPUs.
-
-```
-$ docker run -dt --runtime=kata-runtime nginx
-
-
-             .-------.
-             | Nginx |
-          .--'-------'---.  .------------.
-          | Guest Cgroup |  | Kata agent |
-        .-'--------------'--'------------'.    .-----------.
-        | Thread: Hypervisor's vCPU 0     |    | Kata Shim |
-       .'---------------------------------'.  .'-----------'.
-       | Tasks                             |  | Processes   |
-      .'-----------------------------------'--'-------------'.
-      | Host Cgroup                                          |
-      '------------------------------------------------------'
-```
-
-The next sections explain the difference between processes and tasks, and why only the hypervisor
-vCPUs are constrained.
-
-### cgroups in the guest
-
-Only the workload process, including all its threads, is placed into CPU cgroups; this means
-that `kata-agent` and `systemd` run without constraints in the guest.
-
-#### CPU pinning
-
-Kata Containers tries to apply and honor the cgroups, but sometimes that is not possible.
-An example of this occurs with CPU cgroups when the number of virtual CPUs (in the guest)
-does not match the actual number of physical host CPUs.
-To achieve good performance and a small memory footprint, Kata Containers hot adds resources
-when they are needed, so the number of virtual resources is not the same
-as the number of physical resources. The problem with this approach is that it is not possible
-to pin a process on a specific resource that is not present in the guest. To deal with this
-limitation, and to not fail when the container is being created, Kata Containers does not apply
-the constraint in the first layer (guest) if the resource does not exist in the guest, but it
-is applied in the second layer (host) where the hypervisor is running. The constraint is applied
-in both layers when the resource is available in both the guest and the host. The next sections provide
-further details on which parts of the hypervisor are constrained.
-
-### cgroups in the host
-
-In Kata Containers the workloads run in a virtual machine that is managed and represented by a
-hypervisor running in the host. Like other processes, the hypervisor might use threads to perform
-several tasks, for example I/O and network operations. One of the most important uses of these
-threads is as vCPUs. The processes running in the guest see these vCPUs as physical CPUs, while
-in the host those vCPUs are just threads that are part of a process. This is the key to ensuring
-that a workload consumes only the amount of CPU resources assigned to it without impacting
-other operations. From a user perspective, the simplest approach would be to take the
-whole hypervisor, including its threads, and move it into the cgroup; unfortunately this
-degrades performance, since vCPU, I/O and network threads end up fighting for
-resources. The following table shows a random read performance comparison between a Kata Container
-with all its hypervisor threads in the cgroup and another with only its hypervisor vCPU threads
-constrained; the difference is significant.
-
-| Block size | All threads | vCPU threads | Units |
-|:----------:|:-----------:|:------------:|:-----:|
-| 4k         | 136.2       | 294.7        | MB/s  |
-| 8k         | 166.6       | 579.4        | MB/s  |
-| 16k        | 178.3       | 1093.3       | MB/s  |
-| 32k        | 179.9       | 1931.5       | MB/s  |
-| 64k        | 213.6       | 3994.2       | MB/s  |
-
-To obtain the best performance in Kata Containers, only the vCPU threads are constrained.

[1]: https://docs.docker.com/config/containers/resource_constraints/#cpu
[2]: https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource
[3]: https://kubernetes.io/docs/concepts/workloads/pods/pod/