Add fields to DmVerityInfo needed for dm-verity device creation:
(1) salt: Optional salt value for the hash computation
(2) hash_type: dm-verity version
(3) no_superblock: whether to skip the superblock at hash offset
Uses serde defaults for backward compatibility with existing serialized
data that lacks these fields.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
With sandbox_cgroup_only the shim, QEMU and virtiofsd run inside the
pod's memory cgroup, whose limit is the workload limit plus the
RuntimeClass pod overhead. On aarch64 the VMM host footprint is much
larger than on x86 (QEMU's own anon RSS is ~160Mi+ before any guest
RAM, on top of the shmem-backed guest memory), so the 160Mi overhead
is too small: small-memory-limit pods get their qemu-system process
OOM-killed by the pod cgroup (CONSTRAINT_MEMCG), and the agent vsock
never comes up (ENODEV), so the sandbox fails to start.
Raise the pod overhead to 320Mi for the qemu shims that run on
aarch64 (qemu, qemu-runtime-rs, qemu-coco-dev-runtime-rs). The value
is applied on all architectures for simplicity; x86 is over-provisioned
by ~160Mi, which is acceptable. The TEE/GPU shims already carry far
larger overhead and amd64-only shims (clh*, dragonball, fc) are
unaffected.
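
The arithmetic behind the OOM kill can be sketched as below. This is only an illustration: the function names are hypothetical, and the 160 MiB VMM footprint in the usage note is the approximate figure quoted above, not a measured constant.

```rust
/// Memory limit of the pod cgroup: the workload limit plus the
/// RuntimeClass pod overhead (all values in MiB).
fn pod_cgroup_limit_mib(workload_limit_mib: u64, pod_overhead_mib: u64) -> u64 {
    workload_limit_mib + pod_overhead_mib
}

/// What is left in the pod cgroup once the VMM's own host footprint
/// has been charged to it. Saturates at zero instead of underflowing.
fn headroom_after_vmm_mib(cgroup_limit_mib: u64, vmm_footprint_mib: u64) -> u64 {
    cgroup_limit_mib.saturating_sub(vmm_footprint_mib)
}
```

For a pod with a 64 MiB limit and an assumed 160 MiB QEMU footprint, a 160 MiB overhead leaves 64 MiB for guest memory, the shim and virtiofsd together, while a 320 MiB overhead leaves 224 MiB.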
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Add a how-to describing how runtime-rs sizes static sandboxes from
overhead plus requested CPU/memory, including that fractional vCPU
results are rounded up for VMM-visible vCPU counts, and link it from the
how-to README.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
When static sandbox sizing is enabled, keep configured defaults when
workloads do not specify CPU or memory limits. When limits are present,
size the VM as requested resources plus overhead_vcpus/overhead_memory
values derived from runtime-rs profile defaults.
Limit-driven vCPU sizing is clamped to a minimum of one vCPU so a 0.0
result never yields an unbootable VM, and sandbox setup fails early with
a clear, actionable error when the computed memory is 0 MiB (pointing at
memory limits or non-zero default/overhead memory settings).
This keeps static VM sizing predictable across runtime-rs profiles,
including NVIDIA ones.
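
A minimal sketch of the sizing rule described above. The struct and function names are hypothetical and do not mirror the runtime-rs types. It also assumes that CPU and memory are handled independently, so a missing CPU limit keeps the default vCPU count even if a memory limit is set.

```rust
/// Hypothetical sizing inputs; field names are illustrative only.
#[derive(Clone, Copy)]
struct SizingInput {
    default_vcpus: u32,
    default_memory_mib: u64,
    overhead_vcpus: f64,
    overhead_memory_mib: u64,
    cpu_limit: Option<f64>,        // workload CPU limit in cores, if any
    memory_limit_mib: Option<u64>, // workload memory limit in MiB, if any
}

#[derive(Debug, PartialEq)]
struct VmSize {
    vcpus: u32,
    memory_mib: u64,
}

fn size_static_sandbox(input: &SizingInput) -> Result<VmSize, String> {
    let vcpus = match input.cpu_limit {
        Some(limit) => {
            // Fractional results are rounded up, then clamped to at least 1.
            let total = (limit + input.overhead_vcpus).ceil();
            if total < 1.0 { 1 } else { total as u32 }
        }
        // No CPU limit: keep the configured default.
        None => input.default_vcpus,
    };

    let memory_mib = match input.memory_limit_mib {
        Some(limit) => limit + input.overhead_memory_mib,
        // No memory limit: keep the configured default.
        None => input.default_memory_mib,
    };

    // Fail early instead of trying to boot a VM with no memory.
    if memory_mib == 0 {
        return Err(
            "computed sandbox memory is 0 MiB; set a memory limit or a \
             non-zero default/overhead memory"
                .to_string(),
        );
    }

    Ok(VmSize { vcpus, memory_mib })
}
```

With an overhead of 0.25 vCPU and 256 MiB, a workload limited to 1.5 CPUs and 512 MiB is sized as 2 vCPUs and 768 MiB in this sketch.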
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
When cgroup v2 is enabled, exec can fail with EBUSY while writing the
process to cgroup.procs if the container process has been delegated to an
init subcgroup.
PR #10845 fixed this behavior for the systemd/D-Bus cgroup manager
path, which was related to #10733. The cgroupfs manager still writes the
process directly to the container cgroup, so apply the same init
subcgroup handling there.
Also fix the cgroupfs init-subcgroup existence check for absolute OCI
cgroup paths by joining the trimmed cgroup path under the cgroup root.
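
The path fix can be illustrated with Rust's standard path join, which discards the base when the joined component is absolute. The helper name and the "init" subcgroup name below are assumptions made for this sketch, not the identifiers used in the change.

```rust
use std::path::{Path, PathBuf};

/// Name of the init subcgroup; hypothetical, for illustration only.
const INIT_SUBCGROUP: &str = "init";

/// Build the init subcgroup path under the cgroup root for an OCI cgroup
/// path that may be absolute ("/kubepods/...") or relative ("kubepods/...").
fn init_subcgroup_path(cgroup_root: &Path, oci_cgroup_path: &str) -> PathBuf {
    // Path::join replaces the base when given an absolute path, so the
    // leading '/' is trimmed before joining under the root.
    let relative = oci_cgroup_path.trim_start_matches('/');
    cgroup_root.join(relative).join(INIT_SUBCGROUP)
}
```

Without the trim, joining "/kubepods/pod1/ctr" onto "/sys/fs/cgroup" yields "/kubepods/pod1/ctr", so an existence check would look outside the cgroup hierarchy.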
Fixes: #9701
Signed-off-by: Chris Ayoub <cayoub@openai.com>
Generated-By: OpenAI Codex
This addresses an issue where the disable_guest_empty_dir=true code paths did
not take into account that hugepage-backed emptyDirs should always be recreated
in the guest (using guest hugepages).
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
This makes the runtime share the host Kubelet emptyDir folder with the guest
instead of the agent creating an empty folder in the container rootfs. Doing so
enables the Kubelet to track emptyDir usage and evict greedy pods.
In other words, with virtio-fs the container rootfs uses host storage whether
disable_guest_empty_dir is true or false; with true, however, Kata uses the
k8s emptyDir folder, so the sizeLimit is properly enforced by k8s.
Addresses the ephemeral storage part of #12203.
History:
* Initially, emptyDirs are slow because they are shared from the host with 9p.
https://github.com/kata-containers/runtime/issues/1472
* To address the above, emptyDirs are hardcoded to be created by the agent in the
pause container's rootfs, potentially leveraging devicemapper and improving
perf.
https://github.com/kata-containers/runtime/pull/1485
* The previous PR regressed an (interesting?) use case where emptyDirs were
used to share data from the host to the guest, so the behavior was made
configurable and `disable_guest_empty_dir = false` is introduced, defaulting
to the behavior of the previous PR.
https://github.com/kata-containers/kata-containers/pull/2056
* Another resource accounting regression remains which is addressed in this PR.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
When the kata configuration does not set log_level to debug, the
containerd-shim-v2 defaults to WarnLevel, which suppresses important
diagnostic information logged at Info level.
Key Info-level logs that are currently hidden:
- QEMU command line (qemu.go:3566) - critical for debugging VM issues
- VM lifecycle events (creation, start, stop)
- Device hotplug operations (VFIO, network, volumes)
- Resource configuration (NUMA, memory)
- QMP socket details
Info level provides significantly better diagnostic data without
flooding logs with excessive detail (which would occur at Debug level).
This change improves troubleshooting capabilities for production
deployments where debug mode is not enabled.
Note: runtime-rs already defaults to Info level (see
src/runtime-rs/crates/shim/src/logger.rs:13,30), so this change only
affects the Go runtime.
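
The effect of the new default can be modelled with a small level filter. This is a generic sketch, not the logging API used by the Go runtime; the enum and function names are hypothetical.

```rust
/// Log levels ordered from least to most verbose.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level {
    Error,
    Warn,
    Info,
    Debug,
}

/// Threshold applied by the shim: Debug when debug is enabled,
/// otherwise Info (previously Warn).
fn effective_level(debug_enabled: bool) -> Level {
    if debug_enabled { Level::Debug } else { Level::Info }
}

/// A message is emitted when it is no more verbose than the threshold.
fn is_logged(threshold: Level, message: Level) -> bool {
    message <= threshold
}
```

Under the old Warn threshold an Info message such as the QEMU command line is dropped; under the new default it is kept, while Debug messages are still filtered out.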
Fixes: #13260
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
This skill tells AI agents how to properly write and format
docs in the new docs system. Nothing too fancy: it just reminds
agents to use mkdocs-materialx features instead of treating the
markdown like the legacy GitHub-based format.
Signed-off-by: LandonTClipp <lclipp@coreweave.com>
Update the composable-vm-images proposal with the design decisions we only
arrived at after experimenting with the implementation:
* Replace the hardcoded agent path-resolution table with the data-driven
components.toml manifest (process levels, args/optional_args, env,
wait_socket, ${...} substitution, and select/variants), keeping the agent
generic.
* Document the attester-variant contract: NVRC exports KATA_ATTESTER_VARIANT
and the manifest selects the stock vs NVIDIA attestation-agent.
* Document the runtime dependency requirements found during bring-up: the
nvidia attester's LD_LIBRARY_PATH (libnvat closure in the coco addon +
NVML in the gpu addon) and the NVML-init failure mode, plus CDH
secure_mount tooling placement -- plain storage (mke2fs/mkfs.ext4/dd) in
the base vs encrypted storage (cryptsetup) in the coco addon, the CDH
PATH, and the base/addon ABI lockstep.
* Reflect the storage tooling and bundled libraries in the base/coco-addon
build sections, and mark the GPU addon as implemented.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
The kata-monitor image has no job-dispatcher sidecar, so opt out of the
kata-deploy-specific dispatcher manifest derivation in the
payload-after-push workflow by setting
KATA_DEPLOY_PUBLISH_JOB_DISPATCHER=false, mirroring the same fix already
applied to the release workflows.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Network devices for VM-based containers may be placed in the host
netns to eliminate as many hops as possible, with the goal of
achieving near-native networking performance.
This commit introduces the `dan_conf` field to the configuration file.
This allows the runtime to specify the configuration path for
Direct Attached Network (DAN) devices, enabling interfaces to remain
in the host network namespace while being utilized by the VM-based (qemu)
containers.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The experimental configuration allows enabling features not yet
stable for production. These features may break compatibility and
are prepared for major version bumps.
Add documentation with force_guest_pull example across all
runtime-rs configuration files. This feature enables guest-side
image pulling in CoCo (Confidential Computing) scenarios.
Example usage:
experimental = ["force_guest_pull"]
Fixes inconsistent documentation across configuration files.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
When a container process is terminated by a signal, the agent's SIGCHLD
reaper stored the raw signal number as the process exit code. As a result
a process killed by SIGKILL(9) reported exit code 9 instead of the
conventional 137 (128+9).
Apply the standard shell convention of 128+signal_number so that
signal-terminated processes report the expected exit codes, e.g.
SIGKILL(9) -> 137, SIGTERM(15) -> 143, SIGINT(2) -> 130. This mimics
runc, which encodes wait-status exit codes the same way:
https://github.com/opencontainers/runc/blob/v1.4.3/libcontainer/utils/utils.go#L19
Both runc and this new Kata behaviour follow the conventional exit code
semantics documented at https://tldp.org/LDP/abs/html/exitcodes.html.
The conversion is factored into a small helper and covered by a unit
test. The runtime and shim already pass the exit code through unchanged,
so no further changes are needed for the corrected value to surface.
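
The conversion amounts to the following. This is a sketch with hypothetical names: the enum is a simplified stand-in for a wait status and is not the agent's actual helper.

```rust
/// How a reaped process ended; a simplified stand-in for a wait status.
enum WaitOutcome {
    /// Normal exit with the given exit status.
    Exited(i32),
    /// Termination by the given signal number.
    Signaled(i32),
}

/// Map a wait outcome to the exit code reported for the process,
/// using the 128 + signal_number convention for signal deaths.
fn exit_code(outcome: &WaitOutcome) -> i32 {
    match outcome {
        WaitOutcome::Exited(code) => *code,
        WaitOutcome::Signaled(signal) => 128 + *signal,
    }
}
```

Normal exits pass through unchanged; only signal terminations are offset by 128.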
Fixes: signal-terminated containers reporting raw signal numbers
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the "Not yet implemented" stub in QemuInner::remove_device()
with a working implementation that calls hotunplug_device() to perform
the QMP-level device removal, then cleans up the internal devices list
via retain() to remove stale coldplug entries.
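
The list cleanup step can be sketched with Vec::retain. The DeviceEntry type and the helper name are hypothetical, and the QMP-level removal that precedes the cleanup is omitted here.

```rust
/// Hypothetical entry in the hypervisor's internal device list.
struct DeviceEntry {
    id: String,
}

/// Drop every entry for `device_id` from the list and report how many
/// entries were removed.
fn forget_device(devices: &mut Vec<DeviceEntry>, device_id: &str) -> usize {
    let before = devices.len();
    // retain() keeps only the entries for which the closure returns true.
    devices.retain(|entry| entry.id != device_id);
    before - devices.len()
}
```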
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Introduce hotunplug_device() as the device-type dispatcher that routes
hot removal requests to the appropriate QMP method. Currently supports
Block and BlockModern device types, which are forwarded to
Qmp::hotunplug_block_device(). All other device types return an
explicit "unsupported" error.
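
A sketch of the dispatch shape, using simplified stand-in types. The enum variants carry only a device id, the Network and Vfio variants are examples of unsupported types chosen for illustration, and the QMP call is replaced by a stub.

```rust
/// Simplified stand-in for the runtime's device type enum.
#[derive(Debug)]
enum DeviceType {
    Block(String),
    BlockModern(String),
    Network(String),
    Vfio(String),
}

/// Stub standing in for the QMP-level block device removal.
fn hotunplug_block_device(device_id: &str) -> Result<(), String> {
    if device_id.is_empty() {
        return Err("empty device id".to_string());
    }
    Ok(())
}

/// Route a hot-removal request by device type. Only block devices are
/// handled; everything else gets an explicit "unsupported" error.
fn hotunplug_device(device: &DeviceType) -> Result<(), String> {
    match device {
        DeviceType::Block(id) | DeviceType::BlockModern(id) => hotunplug_block_device(id),
        other => Err(format!("hot-unplug unsupported for device {:?}", other)),
    }
}
```

Returning an error for the catch-all arm, rather than silently succeeding, keeps callers from assuming a device was removed when it was not.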
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Implement QMP-level block device hot-unplug by issuing device_del to
remove the frontend device and blockdev_del to remove the backend
blockdev node. For virtio-blk-ccw on s390x, the CCW subchannel slot
is also released.
Since QMP device_del is asynchronous and only initiates the removal
request, introduce wait_for_device_deleted() to poll for the
DEVICE_DELETED event before tearing down the backend. This prevents
blockdev_del from failing with "Node is still in use".
If blockdev_del fails, the error is logged but CCW cleanup still
proceeds before the error is propagated, ensuring consistent
subchannel state.
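
The ordering described above can be sketched against a mock QMP interface. The trait, its method signatures and the recording mock are assumptions made for this example. Only the sequence of steps and the error handling are meant to match the description.

```rust
/// Hypothetical QMP operations needed for block hot-unplug.
trait Qmp {
    fn device_del(&mut self, id: &str) -> Result<(), String>;
    fn wait_for_device_deleted(&mut self, id: &str) -> Result<(), String>;
    fn blockdev_del(&mut self, node: &str) -> Result<(), String>;
    fn release_ccw_slot(&mut self, id: &str);
}

fn hotunplug_block_device<Q: Qmp>(
    qmp: &mut Q,
    id: &str,
    node: &str,
    is_ccw: bool,
) -> Result<(), String> {
    // 1. Ask the guest to release the frontend device (asynchronous).
    qmp.device_del(id)?;
    // 2. Wait for DEVICE_DELETED so the backend node is no longer in use.
    qmp.wait_for_device_deleted(id)?;
    // 3. Remove the backend node, but defer error propagation.
    let backend = qmp.blockdev_del(node);
    if let Err(e) = &backend {
        eprintln!("blockdev_del failed for {}: {}", node, e);
    }
    // 4. Release the CCW slot even if the backend removal failed.
    if is_ccw {
        qmp.release_ccw_slot(id);
    }
    backend
}

/// Mock that records the order of calls; used only to exercise the sketch.
#[derive(Default)]
struct RecordingQmp {
    calls: Vec<String>,
    fail_blockdev_del: bool,
}

impl Qmp for RecordingQmp {
    fn device_del(&mut self, _id: &str) -> Result<(), String> {
        self.calls.push("device_del".to_string());
        Ok(())
    }
    fn wait_for_device_deleted(&mut self, _id: &str) -> Result<(), String> {
        self.calls.push("wait_for_device_deleted".to_string());
        Ok(())
    }
    fn blockdev_del(&mut self, node: &str) -> Result<(), String> {
        self.calls.push("blockdev_del".to_string());
        if self.fail_blockdev_del {
            Err(format!("Node {} is still in use", node))
        } else {
            Ok(())
        }
    }
    fn release_ccw_slot(&mut self, _id: &str) {
        self.calls.push("release_ccw_slot".to_string());
    }
}
```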
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Improve the reliability of block device hotplug by ensuring that
blockdev-add nodes are properly cleaned up when subsequent device_add
operations fail.
To address this, introduce a new method, device_add_with_rollback, which
performs device_add and removes the blockdev node again if it fails.
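
A minimal sketch of the rollback pattern, assuming the blockdev node has already been added by the caller. The trait and the fake backend are hypothetical. Returning the original device_add error when the cleanup itself fails is also an assumption of this example.

```rust
/// Hypothetical subset of QMP operations used for block hotplug.
trait BlockQmp {
    fn blockdev_add(&mut self, node: &str) -> Result<(), String>;
    fn device_add(&mut self, id: &str, node: &str) -> Result<(), String>;
    fn blockdev_del(&mut self, node: &str) -> Result<(), String>;
}

/// Run device_add; if it fails, remove the previously added blockdev node
/// so it does not leak, then return the original error.
fn device_add_with_rollback<Q: BlockQmp>(
    qmp: &mut Q,
    id: &str,
    node: &str,
) -> Result<(), String> {
    match qmp.device_add(id, node) {
        Ok(()) => Ok(()),
        Err(add_err) => {
            if let Err(del_err) = qmp.blockdev_del(node) {
                eprintln!("rollback of {} failed: {}", node, del_err);
            }
            Err(add_err)
        }
    }
}

/// Fake backend that tracks live blockdev nodes; for illustration only.
#[derive(Default)]
struct FakeQmp {
    nodes: Vec<String>,
    fail_device_add: bool,
}

impl BlockQmp for FakeQmp {
    fn blockdev_add(&mut self, node: &str) -> Result<(), String> {
        self.nodes.push(node.to_string());
        Ok(())
    }
    fn device_add(&mut self, _id: &str, _node: &str) -> Result<(), String> {
        if self.fail_device_add {
            Err("device_add failed".to_string())
        } else {
            Ok(())
        }
    }
    fn blockdev_del(&mut self, node: &str) -> Result<(), String> {
        self.nodes.retain(|n| n.as_str() != node);
        Ok(())
    }
}
```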
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Previously, a starting k8s pod that saw enable_template=true would:
1. Try NewFactory with fetchOnly=true.
2. When that failed (because template.Fetch could not find the artifacts),
retry with fetchOnly=false. This created a direct factory,
which built the template from scratch
(paying a full pod sandbox boot time)
and then restored from it. Boot times were therefore
strictly worse on this path.
Now, even when enable_template=true, we don't try to force a direct factory.
Instead we just revert to the standard sandbox boot path.
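
The new decision can be sketched as follows. The enum and function are hypothetical, and the closure stands in for the fetch-only factory lookup.

```rust
/// Which path a new sandbox takes; names are illustrative only.
#[derive(Debug, PartialEq)]
enum BootPath {
    /// Restore from an existing template.
    TemplateRestore,
    /// Regular sandbox boot, without any factory.
    StandardBoot,
}

/// Pick the boot path. `fetch_template` models a fetch-only factory
/// lookup that fails when the template artifacts are missing.
fn choose_boot_path(
    enable_template: bool,
    fetch_template: impl FnOnce() -> Result<(), String>,
) -> BootPath {
    if !enable_template {
        return BootPath::StandardBoot;
    }
    match fetch_template() {
        Ok(()) => BootPath::TemplateRestore,
        // No direct-factory retry: fall back to the standard boot path.
        Err(_) => BootPath::StandardBoot,
    }
}
```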
Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
Add k8s-vm-templating-test.bats which exercises pod create
with the factory initialized on the target node.
Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
Add support for VM Template factory on the clh path.
In order to support snapshot/restore-based VM templating,
the following changes were needed:
1. For clh.go, implement SaveVM, PauseVM, restoreVM, ResumeVM
2. Remove initrd config check for VM Templating path. The
root disk image (when using image mode) is created in memory
and therefore captured in the VM snapshot.
3. Truncate the memory file to the size of the VM at factory VM
create time. This allows CLH to use the memory file
as the backing for the template VM memory, allowing O(1)
snapshot times.
4. CLH uses memory zones as backing for its memory on the template paths
5. Update StartVM in CLH to use the restore path when template is
configured and available
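
Step 3 can be illustrated with a sparse file sized to the VM memory. This is a standalone sketch using the Rust standard library, with a hypothetical helper name. It is not the code added to clh.go.

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::Path;

/// Create (or reuse) the memory file and size it to the VM memory.
/// Returns the resulting file length in bytes.
fn prepare_memory_file(path: &Path, vm_memory_mib: u64) -> io::Result<u64> {
    let file = OpenOptions::new().create(true).write(true).open(path)?;
    // set_len extends the file without writing data, so on filesystems
    // with sparse file support no blocks are allocated up front.
    file.set_len(vm_memory_mib * 1024 * 1024)?;
    Ok(file.metadata()?.len())
}
```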
Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>