docs: document runtime-rs sandbox overhead sizing

Add a how-to describing how runtime-rs sizes static sandboxes from
overhead plus requested CPU/memory, including that fractional vCPU
results are rounded up for VMM-visible vCPU counts, and link it from the
how-to README.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
This commit is contained in:
Fabiano Fidêncio
2026-06-24 18:52:25 +02:00
parent a34c74a2d4
commit 346a3be9ad
2 changed files with 368 additions and 0 deletions

View File

@@ -51,6 +51,7 @@
- [How to use mem-agent to decrease the memory usage of Kata container](how-to-use-memory-agent.md)
- [How to use seccomp with runtime-rs](how-to-use-seccomp-with-runtime-rs.md)
- [How to use passthroughfd-IO with runtime-rs and Dragonball](how-to-use-passthroughfd-io-within-runtime-rs.md)
- [How to size sandbox overhead in runtime-rs](how-to-size-sandbox-overhead-runtime-rs.md)
- [How to use EROFS snapshotter with Kata Containers](how-to-use-erofs-snapshotter-with-kata.md)
- [How to use NUMA with Kata Containers](how-to-use-numa-with-kata.md)

View File

@@ -0,0 +1,367 @@
# How to size `overhead_*` for runtime-rs sandbox sizing
This document explains how `overhead_vcpus` and `overhead_memory` are expected
to be used in runtime-rs.
> [!WARNING]
> For runtime-rs, using `static_sandbox_resource_mgmt` is the recommended mode.
> Disabling it is not recommended for production sandbox sizing.
> [!IMPORTANT]
> For correct and predictable Kata sandbox sizing in Kubernetes, workload CPU
> and memory limits **must** be set. Without limits, runtime-rs falls back to
> `default_vcpus` and `default_memory`, which is a compatibility fallback and
> not the intended production sizing model.
## Why these fields exist
In runtime-rs, static sandbox sizing is enabled by default. Kata must pick VM resources before
starting the workload. In Kubernetes, pod limits represent workload resources,
but the VM also needs extra resources for guest/kernel/runtime overhead.
`overhead_vcpus` and `overhead_memory` represent that extra budget.
## Sizing model
With runtime-rs static sandbox sizing, Kata uses:
- If workload limits are present:
- `vm_vcpus = requested_vcpus + overhead_vcpus`
- `vm_memory = requested_memory + overhead_memory`
- If workload limits are not present:
- `vm_vcpus = default_vcpus`
- `vm_memory = default_memory`
In other words, `default_*` is the fallback for "no limits", while
`overhead_*` is the additive budget for "limits are set".
For CPU, runtime-rs sums workload and overhead values, and if the computed
result is fractional it is rounded up to the next integer (`ceil`), since VMMs
expose integer vCPU counts. A minimum of `1` vCPU is enforced for the
limit-driven path, including the `0 + 0` edge case.
## `podFixed` as a sizing function
Treat `RuntimeClass.overhead.podFixed` as a function of expected VM size:
larger VMs usually need larger overhead budgets, for both static and dynamic
allocation environments.
Operationally, this usually leads to one of two models:
- Single runtime class: one conservative `podFixed` value that works across all
expected workload sizes.
- Multiple runtime classes (for example S/M/L/XL): each class has a tailored
`podFixed` and runtime profile for tighter node-level accounting.
Kata cannot ship a single correct value for this function, because it depends on
a large number of deployment-specific factors, including:
- the hypervisor in use (each has a different memory/CPU footprint),
- the file-sharing mechanism (`virtio-fs` vs. others),
- the presence of CoCo guest components,
- the VM image in use (our released images, or downstream-modified ones),
- hardware features such as GPUs (or anything else requiring large DMA buffers).
These factors, the inherent brittleness of overhead measurements, and how much
headroom a cluster owner is willing to "waste" to guarantee stable operation,
all feed into the value. Downstream operators should therefore measure and tune
this function for their own deployments.
## Recommended operator/admin workflow
The Kubernetes documentation defines `RuntimeClass.overhead.podFixed` as:
> podFixed represents the fixed resource overhead associated with running a pod.
For Kata, that overhead has two parts: the *guest-side* overhead (the extra
CPU/memory the VM needs on top of the workload) and the *host-side* overhead
(the runtime, hypervisor, and helper processes running on the node). `podFixed`
must account for **both**, while Kata `overhead_*` accounts for the guest-side
part only.
A practical workflow is therefore:
1. Estimate (or measure) the guest-side overhead. Kata profiles ship with a
starting value, but you should refine it for your environment.
2. Set Kata `overhead_*` per runtime profile to that guest-side value.
3. Estimate (or measure) the host-side overhead.
4. Set `RuntimeClass.overhead.podFixed` to the sum of the guest-side and
host-side overhead. This naturally keeps `podFixed` higher than `overhead_*`.
5. Validate with representative workloads (small/medium/large). As rough
starting points for the measurements:
- guest-side overhead: subtract a container's used memory (for example,
`free` inside the container) from the nominal VM size;
- host-side overhead: subtract the nominal VM size from the pod's host
cgroup usage, for example
`cat /sys/fs/cgroup/kubepods.slice/**/memory.current`.
For production-oriented Kata deployments, assume users provide workload limits.
The no-limits path exists as a compatibility fallback, not as the primary sizing
model.
Kata profiles initialize `overhead_*` to values derived from Pod Overhead (for
example, 80% for CPU and memory), but this is only a policy input and should be
tuned by downstream operators and admins.
## Who sets what: admin vs user
In many environments, the "admin" and the "user" are different personas. In
smaller environments they may be the same person or team.
- Admin/operator responsibilities:
- Set runtime defaults (`default_*`) and overhead values (`overhead_*`).
- Set and maintain `RuntimeClass.overhead.podFixed`.
- Provide runtime classes that users can select per workload profile.
- Ensure those policies are aligned for each runtime profile.
- Validate behavior with representative workloads and adjust if needed.
- User/application responsibilities:
- Set pod/container CPU and memory limits for workload intent.
- Use the runtime class provided by admins for the workload profile.
- Avoid relying on default sizing when deterministic resources are required.
## Example 1: limits set on both CPU and memory
**Scenario intent:** show the standard production case with explicit workload limits.
**Consequence:** users get predictable sizing plus admin-defined overhead budget.
**`RuntimeClass.overhead.podFixed` relationship:** `podFixed` should be higher than
`overhead_*`, since `podFixed` must include host-side runtime components.
Given the runtime profile:
- `default_vcpus = 2`
- `default_memory = 1024`
- `overhead_vcpus = 0.5`
- `overhead_memory = 128`
And the matching `RuntimeClass.overhead.podFixed`:
- `cpu = 600m` (`0.6`)
- `memory = 160Mi`
Workload limits:
- CPU quota/period equivalent to `1.5 vCPUs`
- memory limit `600 MiB`
Kata VM sizing (guest side):
- `vm_vcpus = 1.5 + 0.5 = 2.0`
- `vm_memory = 600 + 128 = 728 MiB`
Kubernetes accounting for the whole pod (`sum(limits) + podFixed`):
- `pod_cpu = 1.5 + 0.6 = 2.1`
- `pod_memory = 600 + 160 = 760 MiB`
Note that `podFixed` (`160Mi`) is higher than `overhead_memory` (`128`), since it
must also cover the host-side runtime components that live outside the VM.
## Example 2: partial limits (split by dimension)
**Scenario intent:** show what happens when only one limit is provided.
**Consequence:** once any limit exists, overhead logic applies to both dimensions.
**`RuntimeClass.overhead.podFixed` relationship:** same rule as Example 1;
`podFixed` should remain higher than `overhead_*`.
Given:
- `default_vcpus = 2`
- `default_memory = 1024`
- `overhead_vcpus = 0.5`
- `overhead_memory = 128`
### 2A. Memory limit only
Workload sets:
- memory limit = `512 MiB`
- no CPU limit
Result:
- CPU is rounded up for boot: `vm_vcpus = ceil(0 + 0.5) = 1`
- Memory uses overhead formula: `vm_memory = 512 + 128 = 640 MiB`
### 2B. CPU limit only
Workload sets:
- CPU quota/period equivalent to `1.5 vCPUs`
- no memory limit
Result:
- CPU uses overhead formula: `vm_vcpus = 1.5 + 0.5 = 2.0`
- Memory still uses overhead baseline: `vm_memory = 0 + 128 = 128 MiB`
This is the reason workload memory limits **must** be set (see the note at the
top of this document): with a CPU limit but no memory limit, the VM is sized
with `overhead_memory` only, which is almost certainly too small to run a real
workload. It is the explicit overhead baseline, not a default fallback to
`default_memory`. As a safety net, if the computed sandbox memory would be `0`
(for example, a CPU-only workload with `overhead_memory = 0`), runtime-rs fails
early with an actionable error instead of booting an unusable VM.
This mirrors runtime-rs behavior: once limits are present for a sandbox, overhead
is applied on both dimensions, and any missing dimension uses `0 + overhead_*`
(with fractional CPU results rounded up).
## Example 3: `overhead_* = 0` (zero-overhead model)
**Scenario intent:** user-driven exact workload sizing by setting `overhead_* = 0`.
**Consequence:** users get exactly requested VM sizes when limits are set, but they
are accountable for accounting workload-related overhead in those limits.
**`RuntimeClass.overhead.podFixed` relationship:** `podFixed` is still required to
cover host-side resource usage (not guest-side), and should be tuned
independently.
Some deployments may choose to set:
- `overhead_vcpus = 0`
- `overhead_memory = 0`
With:
- `default_vcpus = 2`
- `default_memory = 1024`
### 3A. Limits set on both dimensions
Workload limits:
- CPU = `1.5 vCPUs`
- memory = `600 MiB`
Result:
- `vm_vcpus = 1.5 + 0 = 1.5`
- `vm_memory = 600 + 0 = 600 MiB`
### 3B. No limits
Result:
- `vm_vcpus = default_vcpus = 2`
- `vm_memory = default_memory = 1024 MiB`
This keeps defaults as fallback only, while limit-driven sizing becomes purely
workload-based.
## Example 4: no limits (fallback path)
**Scenario intent:** show compatibility fallback behavior when users do not
provide limits.
**Consequence:** VM sizing comes from admin-defined defaults. This is acceptable
for basic workloads and testing, **but not the intended production sizing
posture**.
**`RuntimeClass.overhead.podFixed` relationship:** in this case, `podFixed`
should be higher than the effective default baseline (`default_*`) to account
for host-side components as well. Kubernetes does not know Kata `default_*`
values; if `podFixed` is too low, host-side usage can exceed the pod budget and
the pod may be killed.
Given:
- `default_vcpus = 2`
- `default_memory = 1024` (MiB)
- `overhead_vcpus = 0.5`
- `overhead_memory = 128` (MiB)
Pod/container limits are not set.
Result:
- VM boots with `2 vCPUs` and `1024 MiB`.
- `overhead_*` is not used in this case.
## Runtime profile snippet
```toml
[hypervisor.qemu]
default_vcpus = 2
default_memory = 1024
overhead_vcpus = 0.5
overhead_memory = 128
```
## Helm examples
With kata-deploy Helm, the recommended pattern is to set `overhead_*` in a
runtime `dropIn` and set the corresponding `RuntimeClass.overhead.podFixed`
to a higher value in the same values file.
For runtime-rs, `static_sandbox_resource_mgmt` is already enabled by default, so
these examples focus on `overhead_*` and related policy values.
### Example A: custom runtime profile
```yaml
customRuntimes:
enabled: true
runtimes:
my-qemu-runtime-rs:
baseConfig: "qemu"
dropIn: |
[hypervisor.qemu]
default_vcpus = 2
default_memory = 1024
overhead_vcpus = 0.5
overhead_memory = 128
runtimeClass: |
kind: RuntimeClass
apiVersion: node.k8s.io/v1
metadata:
name: kata-my-qemu-runtime-rs
labels:
app.kubernetes.io/managed-by: kata-deploy
handler: kata-my-qemu-runtime-rs
overhead:
podFixed:
cpu: "600m"
memory: "160Mi"
scheduling:
nodeSelector:
katacontainers.io/kata-runtime: "true"
```
In this example:
- Kata overhead used for VM sizing is `0.5 vCPU` and `128Mi`.
- Kubernetes scheduler/accounting overhead is `600m` and `160Mi`.
- The gap (`podFixed` > `overhead_*`) leaves extra budget for components outside
the guest workload cgroup model.
### Example B: override a default shim with `shims.<shim>.dropIn`
If you do not need a new runtime class, you can patch an existing runtime-rs
shim directly:
```yaml
shims:
qemu:
enabled: true
dropIn: |
[hypervisor.qemu]
overhead_vcpus = 0.5
overhead_memory = 128
```
This updates Kata sizing behavior for that shim. If you also control the
runtime class YAML externally, keep `podFixed` greater than `overhead_*` under
the same sizing policy.
## Kubernetes alignment notes
- `RuntimeClass.overhead.podFixed` and Kata `overhead_*` should be managed by
the same operator/admin policy, with `podFixed` set higher than `overhead_*`.
- Mismatched values can produce surprising behavior under pressure.
- Upstream runtime-rs does not auto-fetch RuntimeClass overhead from Kubernetes;
the configured `overhead_*` values are the source used for VM sizing.