Add Go runtime support for the block-plain emptyDir mode.
Disk-backed Kubernetes emptyDir mounts remain bind mounts so the block
emptyDir handling path can intercept them. The runtime creates a sparse
disk.img in the kubelet emptyDir directory and records direct-volume
metadata for the agent-visible block storage path.
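As a rough illustration of the sparse-image step only, the sketch below
creates a disk.img of a given size without writing any data. It is
written in Rust for illustration, whereas this change lives in the Go
runtime. The function name and the create_new policy are assumptions,
not the runtime's actual code.

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::{Path, PathBuf};

/// Create a sparse `disk.img` of `size_bytes` inside `empty_dir`.
/// `set_len` extends the file without writing data, so filesystems that
/// support sparse files do not allocate blocks up front.
/// (Hypothetical helper; name and policy are illustrative.)
fn create_sparse_disk_image(empty_dir: &Path, size_bytes: u64) -> io::Result<PathBuf> {
    let image = empty_dir.join("disk.img");
    let file = OpenOptions::new()
        .write(true)
        .create_new(true) // assumption: refuse to clobber an existing image
        .open(&image)?;
    file.set_len(size_bytes)?;
    Ok(image)
}
```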
Fresh block emptyDirs request filesystem creation through a dedicated
metadata flag. Plain emptyDirs also record discard support on the block
device. Encrypted emptyDirs keep the existing ephemeral encryption
metadata and carry the same filesystem-creation signal.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
The IBM SEL runtime requires a larger overhead_memory budget than
other TEE runtimes (SNP, TDX) because the kernel command line baked
into the SE image sets:
swiotlb=262144 (262144 × 2 KiB slots = 512 MiB)
This buffer is pre-allocated at boot from the guest's physical RAM
before any workload runs.
With static_sandbox_resource_mgmt = true the VM gets:
vm_memory = overhead_memory + container_limit
In k8s-limit-range.bats, DEFOVERHEADMEMSZ_TEE (128 MiB) resulted in
a 256 MiB VM when a container with a 128 MiB memory limit was scheduled.
That is too small to fit even the swiotlb allocation, so the guest failed to boot.
k8s-oom.bats fails in the same way.
Introduce DEFOVERHEADMEMSZ_TEE_SE := 768 MiB, sized to cover:
- 512 MiB swiotlb bounce buffer (fixed by sealed kernel cmdline)
- ~128 MiB SE kernel + initrd + agent baseline
- ~128 MiB general headroom
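The arithmetic above can be checked with a small sketch. The function
names are illustrative only. The 2 KiB slot size and the
vm_memory formula are taken from this message.

```rust
const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;

/// Size of the swiotlb bounce buffer: each slot is 2 KiB.
fn swiotlb_bytes(slots: u64) -> u64 {
    slots * 2 * KIB
}

/// VM memory when static_sandbox_resource_mgmt = true.
fn vm_memory(overhead_memory: u64, container_limit: u64) -> u64 {
    overhead_memory + container_limit
}

/// True when the VM has room for the bounce buffer plus `baseline` bytes.
fn fits(vm_bytes: u64, slots: u64, baseline: u64) -> bool {
    vm_bytes >= swiotlb_bytes(slots) + baseline
}
```

With the old 128 MiB overhead and a 128 MiB container limit, fits()
is false even with a zero baseline. With 768 MiB overhead the same
container gets an 896 MiB VM, which covers the 512 MiB buffer and the
~128 MiB baseline.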
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Update the two affected entries in required-tests.yaml accordingly
so the gatekeeper keeps matching them instead of blocking subsequent
PRs after this one merges.
Co-authored-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Enable devicemapper (dm-verity) support in CI by forwarding the
USE_DEVMAPPER environment variable to the agent build step across
all architectures (amd64, arm64, s390x).
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The deploy will read EROFS_SNAPSHOTTER_MODE and EROFS_DMVERITY from
the environment to enable dmverity_mode and enable_dmverity in the
containerd erofs snapshotter/differ config.
Add validation for the mode value and use an explicit 300s timeout
for node-readiness checks during kata-deploy in GitHub CI.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Install libdevmapper-dev and pkg-config in the agent build container
so devicemapper-sys can link against libdevmapper. Add the GNU libc
rustup target alongside musl since USE_DEVMAPPER forces LIBC=gnu.
Forward USE_DEVMAPPER through build.sh and build-static-agent.sh.
To build kata-agent with device-mapper support:
```
$ make LIBC=gnu USE_DEVMAPPER=yes
```
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Pass USE_DEVMAPPER through the Docker environment in local build
scripts. Extract the OCI tag sanitization logic into a public helper,
sanitize_tag_component, to keep push and pull paths consistent.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Expose erofsSnapshotterMode in the helm chart values and render it as
the EROFS_SNAPSHOTTER_MODE environment variable in the kata-deploy pod.
Update gha-run-k8s-common.sh to load dm-mod/dm-verity kernel modules
and configure the erofs default size when the mode is "integrity".
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add k8s-erofs-dmverity.bats integration test that verifies dm-verity
protected EROFS layers work end-to-end, and register the integrity
mode in the CoCo test matrix.
This commit introduces two new files to enable it:
- k8s-erofs-dmverity.bats
- pod-erofs-dmverity-probe.yaml
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Replace the agent's inline devicemapper implementation with the libs
kata-types::dmverity module. The agent's devicemapper Cargo feature
now forwards to kata-types/devicemapper, removing the direct
libdevmapper link dependency from the agent crate. Gate all dm-verity
imports, constants, and call sites behind libdevmapper.
Add USE_DEVMAPPER Makefile variable (default no) that appends the
devicemapper feature flag and forces LIBC=gnu when enabled.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This commit is an enhancement with no functional changes.
Replace the sequential loop in handle_multi_layer_erofs_group with
join_all-based concurrent mounting. Base device paths and mount
directories are pre-resolved before spawning futures to avoid lock
contention. On partial failure, successfully mounted layers are
unmounted and dm-verity devices cleaned up before propagating the
error.
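The mount-then-roll-back pattern can be sketched as follows. To stay
free of external crates, this sketch uses scoped std threads rather
than join_all over futures. mount_layer and unmount_layer are
hypothetical stand-ins, and dm-verity cleanup is folded into the
unmount step for brevity.

```rust
use std::thread;

/// Hypothetical stand-in for the real mount operation.
fn mount_layer(layer: &str) -> Result<String, String> {
    if layer.contains("bad") {
        Err(format!("failed to mount {layer}"))
    } else {
        Ok(format!("/mnt/{layer}"))
    }
}

/// Hypothetical stand-in for unmount; records what was rolled back.
fn unmount_layer(mount_point: &str, undone: &mut Vec<String>) {
    undone.push(mount_point.to_string());
}

/// Mount all layers concurrently. On partial failure, unmount the layers
/// that succeeded and return the first error in input order.
fn mount_all(layers: &[&str], undone: &mut Vec<String>) -> Result<Vec<String>, String> {
    // One scoped thread per layer; results are collected in input order.
    let results: Vec<Result<String, String>> = thread::scope(|s| {
        let handles: Vec<_> = layers
            .iter()
            .copied()
            .map(|layer| s.spawn(move || mount_layer(layer)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("mount thread panicked"))
            .collect()
    });

    if let Some(err) = results.iter().find_map(|r| r.as_ref().err()) {
        // Roll back the successful mounts before propagating the error.
        for mount_point in results.iter().filter_map(|r| r.as_ref().ok()) {
            unmount_layer(mount_point, undone);
        }
        return Err(err.clone());
    }
    results.into_iter().collect()
}
```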
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Introduce a new `dmverity` module in kata-types that provides dm-verity
device creation, destruction and lifecycle management via devicemapper
ioctls. The module is conditionally compiled behind the `devicemapper`
feature flag, which also pulls in tokio for async device-node polling.
The workspace devicemapper dependency is pinned to a specific git
revision for reproducible builds.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
When a GPT-partitioned VMDK is split into individual partition images,
padding files may be generated between partitions to maintain correct
byte offsets. These were not tracked for cleanup, leading to stale
temporary files after container removal.
Iterate over the partition layout and check for pad-{idx}.img files
alongside the head image; add any that exist to gpt_metadata_paths
so they are removed during teardown.
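A minimal sketch of the pad-file scan, assuming zero-based partition
indices and a hypothetical helper name. The real code appends the
results to gpt_metadata_paths.

```rust
use std::path::{Path, PathBuf};

/// Collect the `pad-{idx}.img` files that exist next to the head image.
/// Assumption: indices run from 0 to partition_count - 1.
fn existing_pad_files(head_image: &Path, partition_count: usize) -> Vec<PathBuf> {
    let dir = head_image.parent().unwrap_or_else(|| Path::new("."));
    (0..partition_count)
        .map(|idx| dir.join(format!("pad-{idx}.img")))
        .filter(|path| path.exists()) // only track files that were actually generated
        .collect()
}
```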
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Wire the dm-verity helpers into the layer mount flow so that GPT
partitions carrying verity metadata are mounted through a verified
device-mapper target instead of the raw partition.
Refactor wait_and_mount_layer to resolve partition path and verity
device as separate steps: create a dm-verity device when
X-kata.dmverity-enabled=true is set, fall back to direct partition
mount otherwise, and return the verity device path for cleanup
tracking.
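The decision between the two mount sources can be sketched as below.
This only models the selection logic and does not create any device.
The function name and the device naming scheme are hypothetical.

```rust
/// Decide what to mount for a layer partition.
/// Returns (mount source, verity device path to track for cleanup).
/// Hypothetical helper: the real agent creates the device and generates
/// its own device names.
fn resolve_mount_source(partition: &str, options: &[&str]) -> (String, Option<String>) {
    let verity_enabled = options
        .iter()
        .any(|opt| *opt == "X-kata.dmverity-enabled=true");

    if verity_enabled {
        // Illustrative naming only: derive a mapper name from the partition path.
        let name = partition.trim_start_matches("/dev/").replace('/', "-");
        let device = format!("/dev/mapper/verity-{name}");
        (device.clone(), Some(device))
    } else {
        // No verity metadata: mount the raw partition directly.
        (partition.to_string(), None)
    }
}
```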
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add per-container verity_devices tracking in Sandbox and wire the
teardown path: destroy_partition_dmverity_device removes the
device-mapper target via deferred-remove ioctl and deletes the mknod
node, cleanup_dmverity_devices iterates all devices in reverse order.
Wire into remove_container_resources (rpc.rs) so verity devices are
torn down after unmount, and record verity device paths in
add_storages (storage/mod.rs) for tracking.
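The tracking and reverse-order teardown can be sketched with a plain
map. VerityTracker and its methods are hypothetical names, not the
Sandbox API. The actual device removal is left to the caller.

```rust
use std::collections::HashMap;

/// Per-container record of verity device paths (hypothetical type).
#[derive(Default)]
struct VerityTracker {
    // container id -> device paths, in creation order
    devices: HashMap<String, Vec<String>>,
}

impl VerityTracker {
    /// Remember a device created while adding storages for a container.
    fn record(&mut self, container_id: &str, device: &str) {
        self.devices
            .entry(container_id.to_string())
            .or_default()
            .push(device.to_string());
    }

    /// Remove and return the container's devices in reverse creation
    /// order, so the most recently created device is destroyed first.
    fn take_for_teardown(&mut self, container_id: &str) -> Vec<String> {
        let mut devices = self.devices.remove(container_id).unwrap_or_default();
        devices.reverse();
        devices
    }
}
```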
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
GPT-partitioned EROFS layers can carry dm-verity hashes appended after
the filesystem data within the same partition. The host runtime passes
the root hash and parameters as X-kata.dmverity.* storage options; the
agent must set up the kernel dm-verity target before mounting so that
every read is integrity-checked against the Merkle tree.
Implement dm-verity device creation: option parsing from storage
options, device name generation, and create helper via devicemapper
ioctls with hash_start_block calculation (accounting for v1 superblock
presence).
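One way to compute hash_start_block is sketched below. It assumes
the cryptsetup on-disk convention, in which a 512-byte superblock sits
at the hash offset and the hash tree starts at the next hash-block
boundary. The agent's actual calculation may differ.

```rust
/// On-disk verity superblock size in bytes (assumed, per cryptsetup's layout).
const VERITY_SUPERBLOCK_SIZE: u64 = 512;

/// First hash block of the Merkle tree, in units of `hash_block_size`.
/// `hash_offset` is the byte offset of the hash area within the partition.
fn hash_start_block(hash_offset: u64, hash_block_size: u64, no_superblock: bool) -> u64 {
    if no_superblock {
        // The hash tree starts right at the hash offset.
        return hash_offset / hash_block_size;
    }
    // Skip the superblock, then round up to the next hash block boundary.
    (hash_offset + VERITY_SUPERBLOCK_SIZE + hash_block_size - 1) / hash_block_size
}
```

For a hash area at 1 MiB with 4096-byte hash blocks, this gives block
256 without a superblock and block 257 with one.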
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The kata guest VM runs without udev, so device-mapper nodes under
/dev/mapper are never created automatically. Add the foundational
helpers that the subsequent dm-verity integration will rely on:
(1) DmOptions builders that disable all udev synchronization flags,
with read-only and deferred-remove variants.
(2) mknod-based device node creation/removal under /dev/mapper, since
devtmpfs nodes are not auto-created without udev.
Also add the devicemapper crate dependency (default-features = false).
Note that this commit depends on devicemapper-rs with no-udev support
from the PR: https://github.com/stratis-storage/devicemapper-rs/pull/1036
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Extract dm-verity metadata from containerd mount annotations and pass
them through to kata-agent as X-kata.dmverity.* storage options. This
enables the agent to create dm-verity devices for integrity-verified
EROFS partitions.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
When containerd creates dm-verity-protected EROFS layers, it stores
the root hash and parameters as OCI annotations — but the format
does not directly map to the kernel dm-verity table that the guest
agent needs to construct.
Bridge this gap with functions that parse containerd's dm-verity
annotation JSON, detect whether a v1 superblock is embedded at the
hash offset (to extract the salt automatically rather than relying
on containerd's hardcoded default), and produce the X-kata.dmverity.*
storage options the agent expects.
This keeps all dm-verity metadata translation on the host side, so
the agent can consume a flat list of options without understanding
the containerd annotation schema.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add fields to DmVerityInfo needed for dm-verity device creation:
(1) salt: Optional salt value for the hash computation
(2) hash_type: dm-verity version
(3) no_superblock: whether to skip the superblock at hash offset
Uses serde defaults for backward compatibility with existing serialized
data that lacks these fields.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add unit coverage for image config User values that include a
group component. For Kubernetes, containerd CRI ImageStatus exposes
only the user side before kubelet creates the container security
context, so genpolicy keeps treating those values like the user-only
form.
The fixture uses in-memory passwd and group data so the test does not
rely on private reproducer images.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
Peer pods don't support fs sharing, hence we need to be thoughtful about
removing disable_guest_empty_dir there (=false for peer pods today, missed it
in my previous PR).
So we preserve disable_guest_empty_dir=false behavior for peer pods only (i.e.
using guest-local mounts) but we detect the need for guest-local mounts directly
in code instead of using a config flag.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
Follow-up to #12373 which defaulted disable_guest_empty_dir=true for
runtime-go/rs.
Here we remove the config option entirely from runtime-rs to make 4.0
secure by design, as with disable_guest_empty_dir=false, a pod could starve
the host storage.
Closes: #12494
Generated-by: GitHub Copilot
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
With sandbox_cgroup_only the shim, QEMU and virtiofsd run inside the
pod's memory cgroup, whose limit is the workload limit plus the
RuntimeClass pod overhead. On aarch64 the VMM host footprint is much
larger than on x86 (QEMU's own anon RSS is ~160Mi+ before any guest
RAM, on top of the shmem-backed guest memory), so the 160Mi overhead
is too small: small-memory-limit pods get their qemu-system process
OOM-killed by the pod cgroup (CONSTRAINT_MEMCG), and the agent vsock
never comes up (ENODEV), so the sandbox fails to start.
Raise the pod overhead to 320Mi for the qemu shims that run on
aarch64 (qemu, qemu-runtime-rs, qemu-coco-dev-runtime-rs). The value
is applied on all architectures for simplicity; x86 is over-provisioned
by ~160Mi, which is acceptable. The TEE/GPU shims already carry far
larger overhead and amd64-only shims (clh*, dragonball, fc) are
unaffected.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Add a how-to describing how runtime-rs sizes static sandboxes from
overhead plus requested CPU/memory, including that fractional vCPU
results are rounded up for VMM-visible vCPU counts, and link it from the
how-to README.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
When static sandbox sizing is enabled, keep configured defaults when
workloads do not specify CPU or memory limits. When limits are present,
size the VM as requested resources plus overhead_vcpus/overhead_memory
values derived from runtime-rs profile defaults.
Limit-driven vCPU sizing is clamped to a minimum of one vCPU so a 0.0
result never yields an unbootable VM, and sandbox setup fails early with
a clear, actionable error when the computed memory is 0 MiB (pointing at
memory limits or non-zero default/overhead memory settings).
This keeps static VM sizing predictable across runtime-rs profiles,
including NVIDIA ones.
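The sizing rules above can be sketched as follows. The type and field
names are hypothetical, and treating CPU and memory independently is an
assumption of this sketch rather than a statement about runtime-rs.

```rust
#[derive(Debug, PartialEq)]
struct VmSize {
    vcpus: u32,
    memory_mib: u64,
}

/// Hypothetical view of the profile defaults used for sizing.
struct Profile {
    default_vcpus: u32,
    default_memory_mib: u64,
    overhead_vcpus: f64,
    overhead_memory_mib: u64,
}

fn size_vm(
    profile: &Profile,
    cpu_limit: Option<f64>,
    memory_limit_mib: Option<u64>,
) -> Result<VmSize, String> {
    let vcpus = match cpu_limit {
        // Round fractional results up and never go below one vCPU.
        Some(limit) => ((limit + profile.overhead_vcpus).ceil() as u32).max(1),
        // No limit: keep the configured default.
        None => profile.default_vcpus,
    };
    let memory_mib = match memory_limit_mib {
        Some(limit) => limit + profile.overhead_memory_mib,
        None => profile.default_memory_mib,
    };
    if memory_mib == 0 {
        // Fail early instead of trying to boot a VM with no memory.
        return Err(
            "computed VM memory is 0 MiB; set a memory limit or a non-zero \
             default/overhead memory"
                .to_string(),
        );
    }
    Ok(VmSize { vcpus, memory_mib })
}
```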
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
When cgroup v2 is enabled, exec can fail with EBUSY while writing the
process to cgroup.procs if the container process has been delegated to an
init subcgroup.
PR #10845 fixed this behavior for the systemd/D-Bus cgroup manager
path, which was related to #10733. The cgroupfs manager still writes the
process directly to the container cgroup, so apply the same init
subcgroup handling there.
Also fix the cgroupfs init-subcgroup existence check for absolute OCI
cgroup paths by joining the trimmed cgroup path under the cgroup root.
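The path handling matters because, in Rust, Path::join discards the
base when the argument is absolute. A minimal sketch of the fixed
lookup, with a hypothetical helper name and "init" as an assumed
subcgroup directory name:

```rust
use std::path::{Path, PathBuf};

/// Build the init subcgroup path under the cgroup root.
/// Leading slashes are trimmed first: joining an absolute path would
/// replace `cgroup_root` entirely instead of nesting under it.
fn init_subcgroup_path(cgroup_root: &Path, oci_cgroup_path: &str) -> PathBuf {
    cgroup_root
        .join(oci_cgroup_path.trim_start_matches('/'))
        .join("init") // assumed name of the init subcgroup
}
```

Absolute and relative OCI cgroup paths then resolve to the same
location under the root.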
Fixes: #9701
Signed-off-by: Chris Ayoub <cayoub@openai.com>
Generated-By: OpenAI Codex
This addresses an issue where the disable_guest_empty_dir=true code paths did
not take into account that hugepage-backed emptyDirs should always be recreated
in the guest (using guest hugepages).
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
This makes the runtime share the host Kubelet emptyDir folder with the guest
instead of the agent creating an empty folder in the container rootfs. Doing so
enables the Kubelet to track emptyDir usage and evict greedy pods.
In other words, with virtio-fs the container rootfs uses host storage whether
disable_guest_empty_dir is true or false. With true, however, Kata uses the k8s
emptyDir folder, so the sizeLimit is properly enforced by k8s.
Addresses the ephemeral storage part of #12203.
History:
* Initially, emptyDirs were slow because they were shared from the host with 9p.
https://github.com/kata-containers/runtime/issues/1472
* To address the above, emptyDirs were hardcoded to be created by the agent in the
pause container's rootfs, potentially leveraging devicemapper and improving
perf.
https://github.com/kata-containers/runtime/pull/1485
* The previous PR regressed an (interesting?) use case where emptyDirs were
used to share data from the host to the guest, so the behavior was made
configurable and `disable_guest_empty_dir = false` was introduced, defaulting
to the behavior of the previous PR.
https://github.com/kata-containers/kata-containers/pull/2056
* A resource accounting regression remained, which this PR addresses.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
When the kata configuration does not set log_level to debug, the
containerd-shim-v2 defaults to WarnLevel, which suppresses important
diagnostic information logged at Info level.
Key Info-level logs that are currently hidden:
- QEMU command line (qemu.go:3566) - critical for debugging VM issues
- VM lifecycle events (creation, start, stop)
- Device hotplug operations (VFIO, network, volumes)
- Resource configuration (NUMA, memory)
- QMP socket details
Info level provides significantly better diagnostic data without
flooding logs with excessive detail (which would occur at Debug level).
This change improves troubleshooting capabilities for production
deployments where debug mode is not enabled.
Note: runtime-rs already defaults to Info level (see
src/runtime-rs/crates/shim/src/logger.rs:13,30), so this change only
affects the Go runtime.
Fixes: #13260
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
This skill will inform AI agents how to properly write and format
docs in the new docs system. There is nothing too fancy, just reminding
agents to use mkdocs-materialx features instead of treating the
markdown like the legacy GitHub-based format.
Signed-off-by: LandonTClipp <lclipp@coreweave.com>