kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-07-01 14:38:33 +00:00

Author	SHA1	Message	Date
Manuel Huber	24c51cfbbf	runtime: support block-plain emptyDirs Add Go runtime support for the block-plain emptyDir mode. Disk-backed Kubernetes emptyDir mounts remain bind mounts so the block emptyDir handling path can intercept them. The runtime creates a sparse disk.img in the kubelet emptyDir directory and records direct-volume metadata for the agent-visible block storage path. Fresh block emptyDirs request filesystem creation through a dedicated metadata flag. Plain emptyDirs also record discard support on the block device. Encrypted emptyDirs keep the existing ephemeral encryption metadata and carry the same filesystem-creation signal. Signed-off-by: Manuel Huber <manuelh@nvidia.com> Assisted-by: OpenAI Codex <codex@openai.com>	2026-06-26 21:05:51 +00:00
manuelh-dev	b05d705ea0	Merge pull request #13286 from kata-containers/mahuber/genpolicy-image-user-group-tests genpolicy: test image user group handling	2026-06-26 13:55:44 -07:00
Hyounggyu Choi	5c7c49aa5d	Merge pull request #13291 from BbolroC/set-overhead_memory-ibm-sel runtime-rs: use SE-specific overhead_memory for qemu-se config	2026-06-26 16:27:04 +02:00
Hyounggyu Choi	b5aa4cef35	runtime-rs: use SE-specific overhead_memory for qemu-se config The IBM SEL runtime requires a larger overhead_memory budget than other TEE runtimes (SNP, TDX) because the kernel command line baked into the SE image sets: swiotlb=262144 (262144 × 2 KiB slots = 512 MiB) This buffer is pre-allocated at boot from the guest's physical RAM before any workload runs. With static_sandbox_resource_mgmt = true the VM gets: vm_memory = overhead_memory + container_limit In k8s-limit-range.bats, DEFOVERHEADMEMSZ_TEE (128 MiB) resulted in a 256 MiB VM when a container with a 128 MiB memory limit was scheduled — far too small to even fit the swiotlb allocation, causing boot failure. In a similar way, the failure is also observed for k8s-oom.bats. Introduce DEFOVERHEADMEMSZ_TEE_SE := 768 MiB, sized to cover: - 512 MiB swiotlb bounce buffer (fixed by sealed kernel cmdline) - ~128 MiB SE kernel + initrd + agent baseline - ~128 MiB headroom for other stuff Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>	2026-06-26 13:29:41 +02:00
Fupan Li	3f5ffa42a0	Merge pull request #12958 from Apokleos/integrated-erofslayers-gpt-vmdk runtime-rs: Support erofs snapshotter integrity with dmverity	2026-06-26 15:35:10 +08:00
Aurélien Bombo	66cb12d260	Merge pull request #13280 from kata-containers/disable-guest-empty-dir runtime-rs: remove disable_guest_empty_dir config	2026-06-25 22:35:20 -05:00
Alex Lyn	e77795f573	ci: Update libs required-test names for libdevmapper dependency Update the two affected entries in required-tests.yaml accordingly so the gatekeeper keeps matching them instead of blocking subsequent PRs after this one merges. Co-authored-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-26 09:52:47 +08:00
Alex Lyn	a5a0f1a5d0	gha: Pass USE_DEVMAPPER to agent static tarball builds Enable devicemapper (dm-verity) support in CI by forwarding the USE_DEVMAPPER environment variable to the agent build step across all architectures (amd64, arm64, s390x). Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-26 09:51:05 +08:00
Alex Lyn	adcbef0c53	kata-deploy: Configure containerd erofs for dm-verity integrity mode The deploy will read EROFS_SNAPSHOTTER_MODE and EROFS_DMVERITY from the environment to enable dmverity_mode and enable_dmverity in the containerd erofs snapshotter/differ config. Add validation for the mode value and use an explicit 300s timeout for node-readiness checks during kata-deploy in GitHub CI. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-26 09:51:05 +08:00
Alex Lyn	562c9acdb2	packaging: Add libdevmapper-dev and GNU target to agent Dockerfile Install libdevmapper-dev and pkg-config in the agent build container so devicemapper-sys can link against libdevmapper. Add the GNU libc rustup target alongside musl since USE_DEVMAPPER forces LIBC=gnu. Forward USE_DEVMAPPER through build.sh and build-static-agent.sh. And you can compile the device mapper in kata-agent as below: ``` $ make LIBC=gnu USE_DEVMAPPER=yes ``` Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-26 09:51:05 +08:00
Alex Lyn	b084c0df36	kata-deploy: Forward USE_DEVMAPPER in local build scripts Pass USE_DEVMAPPER through the Docker environment in local build scripts. Extract the OCI tag sanitization logic into a public helper of sanitize_tag_component to keep push and pull paths consistent. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-26 09:51:05 +08:00
Alex Lyn	2dd9426029	kata-deploy: Add erofsSnapshotterMode helm value and integrity mode Expose erofsSnapshotterMode in the helm chart values and render it as the EROFS_SNAPSHOTTER_MODE environment variable in the kata-deploy pod. Update gha-run-k8s-common.sh to load dm-mod/dm-verity kernel modules and configure the erofs default size when the mode is "integrity". Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-26 09:51:05 +08:00
Alex Lyn	59fc0613bc	tests: Add E2E test for erofs dm-verity integrity in CI tests Add k8s-erofs-dmverity.bats integration test that verifies dm-verity protected EROFS layers work end-to-end, and register the integrity mode in the CoCo test matrix. This commit introduces two new files to enable it: - k8s-erofs-dmverity.bats - pod-erofs-dmverity-probe.yaml Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-26 09:51:05 +08:00
Alex Lyn	b2d0e5b712	kata-agent: Use kata-types dmverity with optional devicemapper support Replace the agent's inline devicemapper implementation with the libs kata-types::dmverity module. The agent's devicemapper Cargo feature now forwards to kata-types/devicemapper, removing the direct libdevmapper link dependency from the agent crate. Gate all dm-verity imports, constants, and call sites behind libdevmapper. Add USE_DEVMAPPER Makefile variable (default no) that appends the devicemapper feature flag and forces LIBC=gnu when enabled. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-26 09:51:05 +08:00
Alex Lyn	274a904bf7	kata-agent: Mount multi-layer EROFS partitions concurrently This commit is just an enhancement without any functionality changes. Replace the sequential loop in handle_multi_layer_erofs_group with join_all-based concurrent mounting. Base device paths and mount directories are pre-resolved before spawning futures to avoid lock contention. On partial failure, successfully mounted layers are unmounted and dm-verity devices cleaned up before propagating the error. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-26 09:51:05 +08:00
Alex Lyn	c74bddddaf	kata-types: Add dmverity module with optional devicemapper support Introduce a new `dmverity` module in kata-types that provides dm-verity device creation, destruction and lifecycle management via devicemapper ioctls. The module is conditionally compiled behind the `devicemapper` feature flag, which also pulls in tokio for async device-node polling. The workspace devicemapper dependency is pinned to a specific git revision for reproducible builds. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-26 09:51:05 +08:00
Alex Lyn	a08267faaf	runtime-rs: Track GPT partition padding files for cleanup When a GPT-partitioned VMDK is split into individual partition images, padding files may be generated between partitions to maintain correct byte offsets. These were not tracked for cleanup, leading to stale temporary files after container removal. Iterate over the partition layout and check for pad-{idx}.img files alongside the head image; add any that exist to gpt_metadata_paths so they are removed during teardown. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-26 09:51:05 +08:00
Alex Lyn	51e8310ef3	kata-agent: Integrate dm-verity into multi-layer EROFS mount path Wire the dm-verity helpers into the layer mount flow so that GPT partitions carrying verity metadata are mounted through a verified device-mapper target instead of the raw partition. Refactor wait_and_mount_layer to resolve partition path and verity device as separate steps: create a dm-verity device when X-kata.dmverity-enabled=true is set, fall back to direct partition mount otherwise, and return the verity device path for cleanup tracking. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-26 09:51:05 +08:00
Alex Lyn	963ba6c6cd	kata-agent: Add dm-verity device cleanup for GPT-partitioned layers Add per-container verity_devices tracking in Sandbox and wire the teardown path: destroy_partition_dmverity_device removes the device-mapper target via deferred-remove ioctl and deletes the mknod node, cleanup_dmverity_devices iterates all devices in reverse order. Wire into remove_container_resources (rpc.rs) so verity devices are torn down after unmount, and record verity device paths in add_storages (storage/mod.rs) for tracking. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-26 09:51:05 +08:00
Alex Lyn	dce409bc35	kata-agent: Add dm-verity device creation for GPT-partitioned layers GPT-partitioned EROFS layers can carry dm-verity hashes appended after the filesystem data within the same partition. The host runtime passes the root hash and parameters as X-kata.dmverity.* storage options; the agent must set up the kernel dm-verity target before mounting so that every read is integrity-checked against the Merkle tree. Implement dm-verity device creation: option parsing from storage options, device name generation, and create helper via devicemapper ioctls with hash_start_block calculation (accounting for v1 superblock presence). Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-26 09:51:05 +08:00
Alex Lyn	e900eae388	kata-agent: Add no-udev DmOptions builders and mknod device node helpers The kata guest VM runs without udev, so device-mapper nodes under /dev/mapper are never created automatically. Add the foundational helpers that subsequent dm-verity integration will rely on, focusing on the following key points: (1) DmOptions builders that disable all udev synchronization flags, with read-only and deferred-remove variants. (2) mknod-based device node creation/removal under /dev/mapper, since devtmpfs nodes are not auto-created without udev. Also add the devicemapper crate dependency (default-features = false). Note that this commit depends on devicemapper no-udev support from the PR: https://github.com/stratis-storage/devicemapper-rs/pull/1036 Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-26 09:51:05 +08:00
Alex Lyn	c471644477	runtime-rs: Add dm-verity annotation extraction to GPT+VMDK integration Extract dm-verity metadata from containerd mount annotations and pass them through to kata-agent as X-kata.dmverity.* storage options. This enables the agent to create dm-verity devices for integrity-verified EROFS partitions. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-26 09:51:05 +08:00
Alex Lyn	3051b8d11a	runtime-rs: Add dm-verity utility functions to gpt_disk module When containerd creates dm-verity-protected EROFS layers, it stores the root hash and parameters as OCI annotations — but the format does not directly map to the kernel dm-verity table that the guest agent needs to construct. Bridge this gap with functions that parse containerd's dm-verity annotation JSON, detect whether a v1 superblock is embedded at the hash offset (to extract the salt automatically rather than relying on containerd's hardcoded default), and produce the X-kata.dmverity.* storage options the agent expects. This keeps all dm-verity metadata translation on the host side, so the agent can consume a flat list of options without understanding the containerd annotation schema. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-26 09:51:05 +08:00
Alex Lyn	499fefd972	kata-types: Extend DmVerityInfo with salt, hash_type, no_superblock fields Add fields to DmVerityInfo needed for dm-verity device creation: (1) salt: Optional salt value for the hash computation (2) hash_type: dm-verity version (3) no_superblock: whether to skip the superblock at hash offset Uses serde defaults for backward compatibility with existing serialized data that lacks these fields. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-26 09:51:05 +08:00
Manuel Huber	c6ee1c70a8	genpolicy: test image user group handling Add unit coverage for image config User values that include a group component. For Kubernetes, containerd CRI ImageStatus exposes only the user side before kubelet creates the container security context, so genpolicy keeps treating those values like the user-only form. The fixture uses in-memory passwd and group data so the test does not rely on private reproducer images. Signed-off-by: Manuel Huber <manuelh@nvidia.com> Assisted-by: OpenAI Codex <codex@openai.com>	2026-06-26 00:04:17 +00:00
Aurélien Bombo	b1e6b9449d	runtime-rs: special case emptyDirs with peer pods Peer pods don't support fs sharing, hence we need to be thoughtful about removing disable_guest_empty_dir there (=false for peer pods today, missed it in my previous PR). So we preserve disable_guest_empty_dir=false behavior for peer pods only (ie. using guest-local mounts) but we detect the need for guest-local mounts directly in code instead of using a config flag. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-06-25 16:37:34 -05:00
Aurélien Bombo	b20f974ddd	runtime-rs: remove disable_guest_empty_dir config Follow-up to #12373 which defaulted disable_guest_empty_dir=true for runtime-go/rs. Here we remove the config option entirely from runtime-rs to make 4.0 secure by design, as with disable_guest_empty_dir=false, a pod could starve the host storage. Closes: #12494 Generated-by: GitHub Copilot Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-06-25 15:31:46 -05:00
Fabiano Fidêncio	79cf2aed66	Merge pull request #13282 from fidencio/topic/revert-qos-test-skip Revert "tests: skip Guaranteed QoS test for SNP/TDX runtime-rs"	2026-06-25 22:18:21 +02:00
Fabiano Fidêncio	850b385f6b	Revert "tests: skip Guaranteed QoS test for SNP/TDX runtime-rs" This reverts commit `6588014b54`, as the needed PR[0] was merged this morning, allowing us to just revert the image. [0]: https://github.com/kata-containers/kata-containers/pull/13173 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-06-25 18:18:17 +02:00
Fabiano Fidêncio	31d349f999	Merge pull request #13173 from fidencio/topic/fixed-sandbox-sizing runtime-rs: size sandboxes with fixed overheads	2026-06-25 15:50:00 +02:00
Fabiano Fidêncio	a664595084	kata-deploy: bump qemu RuntimeClass overhead for the aarch64 VMM With sandbox_cgroup_only the shim, QEMU and virtiofsd run inside the pod's memory cgroup, whose limit is the workload limit plus the RuntimeClass pod overhead. On aarch64 the VMM host footprint is much larger than on x86 (QEMU's own anon RSS is ~160Mi+ before any guest RAM, on top of the shmem-backed guest memory), so the 160Mi overhead is too small: small-memory-limit pods get their qemu-system process OOM-killed by the pod cgroup (CONSTRAINT_MEMCG), and the agent vsock never comes up (ENODEV), so the sandbox fails to start. Raise the pod overhead to 320Mi for the qemu shims that run on aarch64 (qemu, qemu-runtime-rs, qemu-coco-dev-runtime-rs). The value is applied on all architectures for simplicity; x86 is over-provisioned by ~160Mi, which is acceptable. The TEE/GPU shims already carry far larger overhead and amd64-only shims (clh*, dragonball, fc) are unaffected. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-06-25 13:56:11 +02:00
Fabiano Fidêncio	b2f7314d31	tests: harden sandbox sizing manifests for k8s cpu workloads Route runtime-rs tests to dedicated manifests/templates and ensure the CPU allocation workloads always carry explicit memory limits, avoiding Dragonball sandbox startup failures from InvalidMemorySize(0). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-06-25 13:56:11 +02:00
Fabiano Fidêncio	346a3be9ad	docs: document runtime-rs sandbox overhead sizing Add a how-to describing how runtime-rs sizes static sandboxes from overhead plus requested CPU/memory, including that fractional vCPU results are rounded up for VMM-visible vCPU counts, and link it from the how-to README. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-06-25 13:56:11 +02:00
Fabiano Fidêncio	a34c74a2d4	runtime-rs: size static sandboxes with overhead values When static sandbox sizing is enabled, keep configured defaults when workloads do not specify CPU or memory limits. When limits are present, size the VM as requested resources plus overhead_vcpus/overhead_memory values derived from runtime-rs profile defaults. Limit-driven vCPU sizing is clamped to a minimum of one vCPU so a 0.0 result never yields an unbootable VM, and sandbox setup fails early with a clear, actionable error when the computed memory is 0 MiB (pointing at memory limits or non-zero default/overhead memory settings). This keeps static VM sizing predictable across runtime-rs profiles, including NVIDIA ones. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-06-25 13:56:11 +02:00
Fabiano Fidêncio	65a266f532	Merge pull request #13272 from cayoub-oai/codex/upstream-cgroupfs-init-subcgroup agent: Apply init subcgroup in cgroupfs manager	2026-06-25 13:54:19 +02:00
Aurélien Bombo	1217dd1584	Merge pull request #12373 from kata-containers/disable-guest-empty-dir runtime: Set `disable_guest_empty_dir = true` by default	2026-06-24 20:09:46 -05:00
Chris Ayoub	4e3d257dc0	agent: Apply init subcgroup in cgroupfs manager When cgroup v2 is enabled, exec can fail with EBUSY while writing the process to cgroup.procs if the container process has been delegated to an init subcgroup. PR #10845 fixed this behavior for the systemd/D-Bus cgroup manager path, which was related to #10733. The cgroupfs manager still writes the process directly to the container cgroup, so apply the same init subcgroup handling there. Also fix the cgroupfs init-subcgroup existence check for absolute OCI cgroup paths by joining the trimmed cgroup path under the cgroup root. Fixes: #9701 Signed-off-by: Chris Ayoub <cayoub@openai.com> Generated-By: OpenAI Codex	2026-06-24 21:25:49 +00:00
Aurélien Bombo	10cf6816aa	kernel: Fix FUSE crash with host emptyDir This patch was submitted by Miklos Szeredi: https://lore.kernel.org/fuse-devel/20260528142306.1792392-1-mszeredi@redhat.com/ It fixes a FUSE oops with the k8s-shared-volume.bats test. Fixes: #12589 Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-06-24 15:22:13 -05:00
Aurélien Bombo	77c3e36cf7	tests: Support GENPOLICY_SETTINGS_DIR with drop-in-examples Follow-up to `3dd77bf576`. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-06-24 15:22:13 -05:00
Aurélien Bombo	3acb618f6b	genpolicy: Assume `disable_guest_empty_dir = true` This option should be removed for 4.0, so we don't handle `false`. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-06-24 15:22:13 -05:00
Aurélien Bombo	e191c5b716	runtime-go/rs: Reconcile hugepage emptyDirs and disable_guest_empty_dir This addresses an issue where the disable_guest_empty_dir=true code paths did not take into account that hugepage-backed emptyDirs should always be recreated in the guest (using guest hugepages). Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-06-24 15:22:13 -05:00
Aurélien Bombo	a3e91d9ed2	runtime-go/rs: Set `disable_guest_empty_dir = true` by default This makes the runtime share the host Kubelet emptyDir folder with the guest instead of the agent creating an empty folder in the container rootfs. Doing so enables the Kubelet to track emptyDir usage and evict greedy pods. In other words, with virtio-fs the container rootfs uses host storage whether this is true or false, however with true, Kata uses the k8s emptyDir folder so the sizeLimit is properly enforced by k8s. Addresses the ephemeral storage part of #12203. History: * Initially, emptyDirs are slow because they are shared from the host with 9p. https://github.com/kata-containers/runtime/issues/1472 * To address above, emptyDirs are hardcoded to be created by the agent in the pause container's rootfs, potentially leveraging devicemapper and improving perf. https://github.com/kata-containers/runtime/pull/1485 * The previous PR regressed an (interesting?) use case where emptyDirs were used to share data from the host to the guest, so the behavior was made configurable and `disable_guest_empty_dir = false` is introduced, defaulting to the behavior of the previous PR. https://github.com/kata-containers/kata-containers/pull/2056 * Another resource accounting regression remains which is addressed in this PR. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-06-24 15:21:53 -05:00
Fabiano Fidêncio	6528e7a72f	Merge pull request #13228 from fidencio/topic/dont-set-slots-maxmem-for-confidential-guests runtime-rs: qemu: don't set slots/maxmem for confidential guests	2026-06-24 17:27:28 +02:00
Greg Kurz	13b3020c34	Merge pull request #13261 from c3d/bug/13260-Info-log-level runtime: Change default log level from Warn to Info	2026-06-24 08:57:13 +02:00
Fabiano Fidêncio	392b802f61	Merge pull request #12878 from Apokleos/fix-configs runtime-rs: Fix configs differences between runtime-rs and runtime-go	2026-06-23 13:53:16 +02:00
Steve Horsman	811914a372	Merge pull request #13246 from Apokleos/copyfile-with-gid-uid runtime-rs: correct uid/gid for K8s secret/configmap copy_file	2026-06-23 10:43:03 +01:00
Steve Horsman	3e429a8afb	Merge pull request #13234 from LandonTClipp/docs-skill docs: Add AI agent skill for doc contributions	2026-06-23 09:59:15 +01:00
Christophe de Dinechin	631fd96715	runtime: Change default log level from Warn to Info When the kata configuration does not set log_level to debug, the containerd-shim-v2 defaults to WarnLevel, which suppresses important diagnostic information logged at Info level. Key Info-level logs that are currently hidden: - QEMU command line (qemu.go:3566) - critical for debugging VM issues - VM lifecycle events (creation, start, stop) - Device hotplug operations (VFIO, network, volumes) - Resource configuration (NUMA, memory) - QMP socket details Info level provides significantly better diagnostic data without flooding logs with excessive detail (which would occur at Debug level). This change improves troubleshooting capabilities for production deployments where debug mode is not enabled. Note: runtime-rs already defaults to Info level (see src/runtime-rs/crates/shim/src/logger.rs:13,30), so this change only affects the Go runtime. Fixes: #13260 Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>	2026-06-23 10:29:33 +02:00
LandonTClipp	85e828cc9b	docs: Add AI agent skill for doc contributions This skill will inform AI agents how to properly write and format docs in the new docs system. There is nothing too fancy, just reminding agents to use mkdocs-materialx features instead of treating the markdown like the legacy Github-based format. Signed-off-by: LandonTClipp <lclipp@coreweave.com>	2026-06-23 08:57:37 +01:00
Fabiano Fidêncio	bbe714ae03	Merge pull request #13227 from fidencio/topic/rfc-composable-vm-images-update docs: detail composable image runtime contracts in proposal	2026-06-22 21:07:04 +02:00
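
The sandbox-sizing arithmetic described in commits b5aa4cef35, a34c74a2d4 and 346a3be9ad can be sketched as below. This is only an illustration of the numbers quoted in those commit messages, not the runtime-rs implementation: the function names are hypothetical, and the assumption that overhead and requested vCPUs are summed before rounding up is inferred from the messages rather than confirmed by the code.

```python
import math

# Slot size quoted in commit b5aa4cef35: each swiotlb slot is 2 KiB.
SWIOTLB_SLOT_KIB = 2


def swiotlb_mib(slots: int) -> int:
    """Size of the swiotlb bounce buffer in MiB for a given slot count."""
    return slots * SWIOTLB_SLOT_KIB // 1024


def static_vm_memory_mib(overhead_mib: int, container_limit_mib: int) -> int:
    """vm_memory = overhead_memory + container_limit (hypothetical helper).

    Mirrors the early failure described in a34c74a2d4: a computed
    size of 0 MiB is rejected instead of producing an unbootable VM.
    """
    total = overhead_mib + container_limit_mib
    if total == 0:
        raise ValueError(
            "computed sandbox memory is 0 MiB; set a memory limit or a "
            "non-zero default/overhead memory value"
        )
    return total


def static_vcpus(overhead_vcpus: float, requested_vcpus: float) -> int:
    """Round fractional vCPUs up and clamp to at least one vCPU.

    Assumption: overhead and request are added before rounding.
    """
    return max(1, math.ceil(overhead_vcpus + requested_vcpus))


if __name__ == "__main__":
    # swiotlb=262144 slots * 2 KiB = 512 MiB, as stated in the commit.
    print(swiotlb_mib(262144))
    # Old TEE overhead (128 MiB) + 128 MiB limit: smaller than swiotlb alone.
    print(static_vm_memory_mib(128, 128))
    # New SE overhead: 512 swiotlb + ~128 baseline + ~128 headroom = 768 MiB.
    print(static_vm_memory_mib(512 + 128 + 128, 128))
    print(static_vcpus(0.0, 0.0))
```

With the previous 128 MiB overhead, a container limited to 128 MiB yields a 256 MiB VM, which cannot hold the 512 MiB swiotlb allocation; the 768 MiB SE-specific overhead removes that shortfall.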


19491 Commits