docs: document runtime-rs sandbox overhead sizing

Add a how-to describing how runtime-rs sizes static sandboxes from overhead plus requested CPU/memory, including that fractional vCPU results are rounded up for VMM-visible vCPU counts, and link it from the how-to README. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>
2026-06-30 14:06:46 +00:00 · 2026-06-24 18:52:25 +02:00
parent a34c74a2d4
commit 346a3be9ad
2 changed files with 368 additions and 0 deletions
--- a/docs/how-to/README.md
+++ b/docs/how-to/README.md
@@ -51,6 +51,7 @@
 - [How to use mem-agent to decrease the memory usage of Kata container](how-to-use-memory-agent.md)
 - [How to use seccomp with runtime-rs](how-to-use-seccomp-with-runtime-rs.md)
 - [How to use passthroughfd-IO with runtime-rs and Dragonball](how-to-use-passthroughfd-io-within-runtime-rs.md)
+- [How to size sandbox overhead in runtime-rs](how-to-size-sandbox-overhead-runtime-rs.md)
 - [How to use EROFS snapshotter with Kata Containers](how-to-use-erofs-snapshotter-with-kata.md)
 - [How to use NUMA with Kata Containers](how-to-use-numa-with-kata.md)

--- a/docs/how-to/how-to-size-sandbox-overhead-runtime-rs.md
+++ b/docs/how-to/how-to-size-sandbox-overhead-runtime-rs.md
@@ -0,0 +1,367 @@
+# How to size `overhead_*` for runtime-rs sandbox sizing
+
+This document explains how `overhead_vcpus` and `overhead_memory` are expected
+to be used in runtime-rs.
+
+> [!WARNING]
+> For runtime-rs, using `static_sandbox_resource_mgmt` is the recommended mode.
+> Disabling it is not recommended for production sandbox sizing.
+
+> [!IMPORTANT]
+> For correct and predictable Kata sandbox sizing in Kubernetes, workload CPU
+> and memory limits **must** be set. Without limits, runtime-rs falls back to
+> `default_vcpus` and `default_memory`, which is a compatibility fallback and
+> not the intended production sizing model.
+
+## Why these fields exist
+
+In runtime-rs, static sandbox sizing is enabled by default. Kata must pick VM resources before
+starting the workload. In Kubernetes, pod limits represent workload resources,
+but the VM also needs extra resources for guest/kernel/runtime overhead.
+
+`overhead_vcpus` and `overhead_memory` represent that extra budget.
+
+## Sizing model
+
+With runtime-rs static sandbox sizing, Kata uses:
+
+- If workload limits are present:
+  - `vm_vcpus = requested_vcpus + overhead_vcpus`
+  - `vm_memory = requested_memory + overhead_memory`
+- If workload limits are not present:
+  - `vm_vcpus = default_vcpus`
+  - `vm_memory = default_memory`
+
+In other words, `default_*` is the fallback for "no limits", while
+`overhead_*` is the additive budget for "limits are set".
+For CPU, runtime-rs sums workload and overhead values, and if the computed
+result is fractional it is rounded up to the next integer (`ceil`), since VMMs
+expose integer vCPU counts. A minimum of `1` vCPU is enforced for the
+limit-driven path, including the `0 + 0` edge case.
+
+## `podFixed` as a sizing function
+
+Treat `RuntimeClass.overhead.podFixed` as a function of expected VM size:
+larger VMs usually need larger overhead budgets, for both static and dynamic
+allocation environments.
+
+Operationally, this usually leads to one of two models:
+
+- Single runtime class: one conservative `podFixed` value that works across all
+  expected workload sizes.
+- Multiple runtime classes (for example S/M/L/XL): each class has a tailored
+  `podFixed` and runtime profile for tighter node-level accounting.
+
+Kata cannot ship a single correct value for this function, because it depends on
+a large number of deployment-specific factors, including:
+
+- the hypervisor in use (each has a different memory/CPU footprint),
+- the file-sharing mechanism (`virtio-fs` vs. others),
+- the presence of CoCo guest components,
+- the VM image in use (our released images, or downstream-modified ones),
+- hardware features such as GPUs (or anything else requiring large DMA buffers).
+
+These factors, the inherent brittleness of overhead measurements, and how much
+headroom a cluster owner is willing to "waste" to guarantee stable operation,
+all feed into the value. Downstream operators should therefore measure and tune
+this function for their own deployments.
+
+## Recommended operator/admin workflow
+
+The Kubernetes documentation defines `RuntimeClass.overhead.podFixed` as:
+
+> podFixed represents the fixed resource overhead associated with running a pod.
+
+For Kata, that overhead has two parts: the *guest-side* overhead (the extra
+CPU/memory the VM needs on top of the workload) and the *host-side* overhead
+(the runtime, hypervisor, and helper processes running on the node). `podFixed`
+must account for **both**, while Kata `overhead_*` accounts for the guest-side
+part only.
+
+A practical workflow is therefore:
+
+1. Estimate (or measure) the guest-side overhead. Kata profiles ship with a
+   starting value, but you should refine it for your environment.
+2. Set Kata `overhead_*` per runtime profile to that guest-side value.
+3. Estimate (or measure) the host-side overhead.
+4. Set `RuntimeClass.overhead.podFixed` to the sum of the guest-side and
+   host-side overhead. This naturally keeps `podFixed` higher than `overhead_*`.
+5. Validate with representative workloads (small/medium/large). As rough
+   starting points for the measurements:
+   - guest-side overhead: subtract a container's used memory (for example,
+     `free` inside the container) from the nominal VM size;
+   - host-side overhead: subtract the nominal VM size from the pod's host
+     cgroup usage, for example
+     `cat /sys/fs/cgroup/kubepods.slice/**/memory.current`.
+
+For production-oriented Kata deployments, assume users provide workload limits.
+The no-limits path exists as a compatibility fallback, not as the primary sizing
+model.
+
+Kata profiles initialize `overhead_*` to values derived from Pod Overhead (for
+example, 80% for CPU and memory), but this is only a policy input and should be
+tuned by downstream operators and admins.
+
+## Who sets what: admin vs user
+
+In many environments, the "admin" and the "user" are different personas. In
+smaller environments they may be the same person or team.
+
+- Admin/operator responsibilities:
+  - Set runtime defaults (`default_*`) and overhead values (`overhead_*`).
+  - Set and maintain `RuntimeClass.overhead.podFixed`.
+  - Provide runtime classes that users can select per workload profile.
+  - Ensure those policies are aligned for each runtime profile.
+  - Validate behavior with representative workloads and adjust if needed.
+
+- User/application responsibilities:
+  - Set pod/container CPU and memory limits for workload intent.
+  - Use the runtime class provided by admins for the workload profile.
+  - Avoid relying on default sizing when deterministic resources are required.
+
+## Example 1: limits set on both CPU and memory
+
+**Scenario intent:** show the standard production case with explicit workload limits.
+
+**Consequence:** users get predictable sizing plus admin-defined overhead budget.
+
+**`RuntimeClass.overhead.podFixed` relationship:** `podFixed` should be higher than
+`overhead_*`, since `podFixed` must include host-side runtime components.
+
+Given the runtime profile:
+
+- `default_vcpus = 2`
+- `default_memory = 1024`
+- `overhead_vcpus = 0.5`
+- `overhead_memory = 128`
+
+And the matching `RuntimeClass.overhead.podFixed`:
+
+- `cpu = 600m` (`0.6`)
+- `memory = 160Mi`
+
+Workload limits:
+
+- CPU quota/period equivalent to `1.5 vCPUs`
+- memory limit `600 MiB`
+
+Kata VM sizing (guest side):
+
+- `vm_vcpus = 1.5 + 0.5 = 2.0`
+- `vm_memory = 600 + 128 = 728 MiB`
+
+Kubernetes accounting for the whole pod (`sum(limits) + podFixed`):
+
+- `pod_cpu = 1.5 + 0.6 = 2.1`
+- `pod_memory = 600 + 160 = 760 MiB`
+
+Note that `podFixed` (`160Mi`) is higher than `overhead_memory` (`128`), since it
+must also cover the host-side runtime components that live outside the VM.
+
+## Example 2: partial limits (split by dimension)
+
+**Scenario intent:** show what happens when only one limit is provided.
+
+**Consequence:** once any limit exists, overhead logic applies to both dimensions.
+
+**`RuntimeClass.overhead.podFixed` relationship:** same rule as Example 1;
+`podFixed` should remain higher than `overhead_*`.
+
+Given:
+
+- `default_vcpus = 2`
+- `default_memory = 1024`
+- `overhead_vcpus = 0.5`
+- `overhead_memory = 128`
+
+### 2A. Memory limit only
+
+Workload sets:
+
+- memory limit = `512 MiB`
+- no CPU limit
+
+Result:
+
+- CPU is rounded up for boot: `vm_vcpus = ceil(0 + 0.5) = 1`
+- Memory uses overhead formula: `vm_memory = 512 + 128 = 640 MiB`
+
+### 2B. CPU limit only
+
+Workload sets:
+
+- CPU quota/period equivalent to `1.5 vCPUs`
+- no memory limit
+
+Result:
+
+- CPU uses overhead formula: `vm_vcpus = 1.5 + 0.5 = 2.0`
+- Memory still uses overhead baseline: `vm_memory = 0 + 128 = 128 MiB`
+
+This is the reason workload memory limits **must** be set (see the note at the
+top of this document): with a CPU limit but no memory limit, the VM is sized
+with `overhead_memory` only, which is almost certainly too small to run a real
+workload. It is the explicit overhead baseline, not a default fallback to
+`default_memory`. As a safety net, if the computed sandbox memory would be `0`
+(for example, a CPU-only workload with `overhead_memory = 0`), runtime-rs fails
+early with an actionable error instead of booting an unusable VM.
+
+This mirrors runtime-rs behavior: once limits are present for a sandbox, overhead
+is applied on both dimensions, and any missing dimension uses `0 + overhead_*`
+(with fractional CPU results rounded up).
+
+## Example 3: `overhead_* = 0` (zero-overhead model)
+
+**Scenario intent:** user-driven exact workload sizing by setting `overhead_* = 0`.
+
+**Consequence:** users get exactly requested VM sizes when limits are set, but they
+are accountable for accounting workload-related overhead in those limits.
+
+**`RuntimeClass.overhead.podFixed` relationship:** `podFixed` is still required to
+cover host-side resource usage (not guest-side), and should be tuned
+independently.
+
+Some deployments may choose to set:
+
+- `overhead_vcpus = 0`
+- `overhead_memory = 0`
+
+With:
+
+- `default_vcpus = 2`
+- `default_memory = 1024`
+
+### 3A. Limits set on both dimensions
+
+Workload limits:
+
+- CPU = `1.5 vCPUs`
+- memory = `600 MiB`
+
+Result:
+
+- `vm_vcpus = 1.5 + 0 = 1.5`
+- `vm_memory = 600 + 0 = 600 MiB`
+
+### 3B. No limits
+
+Result:
+
+- `vm_vcpus = default_vcpus = 2`
+- `vm_memory = default_memory = 1024 MiB`
+
+This keeps defaults as fallback only, while limit-driven sizing becomes purely
+workload-based.
+
+## Example 4: no limits (fallback path)
+
+**Scenario intent:** show compatibility fallback behavior when users do not
+provide limits.
+
+**Consequence:** VM sizing comes from admin-defined defaults. This is acceptable
+for basic workloads and testing, **but not the intended production sizing
+posture**.
+
+**`RuntimeClass.overhead.podFixed` relationship:** in this case, `podFixed`
+should be higher than the effective default baseline (`default_*`) to account
+for host-side components as well. Kubernetes does not know Kata `default_*`
+values; if `podFixed` is too low, host-side usage can exceed the pod budget and
+the pod may be killed.
+
+Given:
+
+- `default_vcpus = 2`
+- `default_memory = 1024` (MiB)
+- `overhead_vcpus = 0.5`
+- `overhead_memory = 128` (MiB)
+
+Pod/container limits are not set.
+
+Result:
+
+- VM boots with `2 vCPUs` and `1024 MiB`.
+- `overhead_*` is not used in this case.
+
+## Runtime profile snippet
+
+```toml
+[hypervisor.qemu]
+default_vcpus = 2
+default_memory = 1024
+overhead_vcpus = 0.5
+overhead_memory = 128
+```
+
+## Helm examples
+
+With kata-deploy Helm, the recommended pattern is to set `overhead_*` in a
+runtime `dropIn` and set the corresponding `RuntimeClass.overhead.podFixed`
+to a higher value in the same values file.
+
+For runtime-rs, `static_sandbox_resource_mgmt` is already enabled by default, so
+these examples focus on `overhead_*` and related policy values.
+
+### Example A: custom runtime profile
+
+```yaml
+customRuntimes:
+  enabled: true
+  runtimes:
+    my-qemu-runtime-rs:
+      baseConfig: "qemu"
+      dropIn: |
+        [hypervisor.qemu]
+        default_vcpus = 2
+        default_memory = 1024
+        overhead_vcpus = 0.5
+        overhead_memory = 128
+      runtimeClass: |
+        kind: RuntimeClass
+        apiVersion: node.k8s.io/v1
+        metadata:
+          name: kata-my-qemu-runtime-rs
+          labels:
+            app.kubernetes.io/managed-by: kata-deploy
+        handler: kata-my-qemu-runtime-rs
+        overhead:
+          podFixed:
+            cpu: "600m"
+            memory: "160Mi"
+        scheduling:
+          nodeSelector:
+            katacontainers.io/kata-runtime: "true"
+```
+
+In this example:
+
+- Kata overhead used for VM sizing is `0.5 vCPU` and `128Mi`.
+- Kubernetes scheduler/accounting overhead is `600m` and `160Mi`.
+- The gap (`podFixed` > `overhead_*`) leaves extra budget for components outside
+  the guest workload cgroup model.
+
+### Example B: override a default shim with `shims.<shim>.dropIn`
+
+If you do not need a new runtime class, you can patch an existing runtime-rs
+shim directly:
+
+```yaml
+shims:
+  qemu:
+    enabled: true
+    dropIn: |
+      [hypervisor.qemu]
+      overhead_vcpus = 0.5
+      overhead_memory = 128
+```
+
+This updates Kata sizing behavior for that shim. If you also control the
+runtime class YAML externally, keep `podFixed` greater than `overhead_*` under
+the same sizing policy.
+
+## Kubernetes alignment notes
+
+- `RuntimeClass.overhead.podFixed` and Kata `overhead_*` should be managed by
+  the same operator/admin policy, with `podFixed` set higher than `overhead_*`.
+- Mismatched values can produce surprising behavior under pressure.
+- Upstream runtime-rs does not auto-fetch RuntimeClass overhead from Kubernetes;
+  the configured `overhead_*` values are the source used for VM sizing.