Commit Graph

18953 Commits

Fabiano Fidêncio
6b802a4e30 nvidia: switch GPU rootfs images to erofs
Switch the NVIDIA GPU rootfs images (both standard and confidential)
from ext4 to erofs (Enhanced Read-Only File System).

Unlike ext4, which is a read-write filesystem mounted read-only by
convention, erofs is structurally read-only -- no journal, no write
metadata, no superblock write path. This eliminates accidental
mutation and reduces the attack surface inside the guest VM, which
is particularly important for confidential workloads using dm-verity.

Introduce a DEFROOTFSTYPE_NV Makefile variable (set to erofs) for
both Go and Rust runtimes, keeping the global DEFROOTFSTYPE as ext4
so non-NVIDIA configurations are unaffected.

Update all six NVIDIA GPU configuration templates (base, SNP, TDX
for both runtimes) to use @DEFROOTFSTYPE_NV@ instead of the global
@DEFROOTFSTYPE@.

Export FS_TYPE=erofs in install_image_nvidia_gpu() and
install_image_nvidia_gpu_confidential() so the build pipeline
produces erofs images via the image builder.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-10 17:18:05 +02:00
Fabiano Fidêncio
bfcd249f40 image-builder: add erofs dm-verity support and lz4hc compression
Add full dm-verity and measured rootfs support to
create_erofs_rootfs_image(), bringing it to parity with the ext4 path.

Unlike ext4, which is a read-write filesystem mounted read-only by
convention, erofs is structurally read-only -- no journal, no write
metadata, no superblock write path.

This is a natural fit for dm-verity: erofs never attempts writes, so
verity never has to reject anything. With ext4, the kernel must be
relied upon to skip journal replay on verity-protected devices, which
is a fragile arrangement.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-10 17:18:05 +02:00
Fabiano Fidêncio
d2e0555cf0 image-builder: refactor dm-verity setup into shared functions
Extract build_kernel_verity_params() and setup_verity() from the
inline block inside create_rootfs_image() into top-level functions.

This is a pure refactoring with no behavior change. The verity logic
is moved verbatim, with the only difference being that
build_kernel_verity_params() now takes the image path as an explicit
parameter instead of capturing it from the enclosing scope.

The extracted functions will be reused by create_erofs_rootfs_image()
in a subsequent commit to add dm-verity support for erofs images.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-10 17:18:05 +02:00
manuelh-dev
2ffd1538a2 Merge pull request #13021 from fidencio/topic/kata-deploy-log-level-containerd-version-4
kata-deploy: Fix containerd debug level path for config schema v4
2026-05-10 07:28:26 -07:00
Fabiano Fidêncio
341a0d366c kata-deploy: Fix containerd debug level path for config schema v4
Containerd 2.3 (config schema v4) uses the top-level [debug] table
for log level configuration, not the plugins."io.containerd.server.v1.debug"
table that the RC builds used.

Update containerd_debug_level_toml_path() to use .debug.level for all
schema versions, matching the released containerd behavior.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-10 12:02:24 +02:00
Fabiano Fidêncio
46b46589a6 Merge pull request #13020 from manuelh-dev/mahuber/nim-op-placement
tests: nvidia: place NIM service into namespace
2026-05-10 12:01:58 +02:00
Manuel Huber
1c081ff434 tests: nvidia: place NIM service into namespace
Place the NIM service into our test namespace. We are still observing
situations where, for reasons we do not yet understand, the NIM
service appears in the default namespace in our CI.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-10 07:36:23 +00:00
Fabiano Fidêncio
905303b6b0 Merge pull request #13013 from BbolroC/filter-vfio-gk-only-runtime-rs
runtime-rs: filter VFIO devices only in guest-kernel mode
2026-05-08 23:49:50 +02:00
Fabiano Fidêncio
a447a1fb03 Merge pull request #13015 from stevenhorsman/kernel-6.18.28-bump
version: Bump to latest 6.18 kernel
2026-05-08 21:12:50 +02:00
Fabiano Fidêncio
f7be57efe2 Merge pull request #13007 from manuelh-dev/mahuber/dbg-nim-svc
tests: nvidia: Wait for NIM operator pod and print
2026-05-08 20:58:51 +02:00
stevenhorsman
87664c608d version: Bump to latest 6.18 kernel
Pick up the latest kernel that fixes CVE-2026-43284

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-05-08 17:15:24 +01:00
Hyounggyu Choi
754707fe83 runtime-rs: filter VFIO devices only in guest-kernel mode
After #12857, the VFIO-AP hotplug test fails because runtime-rs
unconditionally removes all /dev/vfio/* devices from the OCI spec
before sending it to the kata agent. The agent then rejects
the container creation with:

```
Missing devices in OCI spec
```

Filter devices from the OCI spec conditionally based on the
vfio_mode configuration (e.g. guest-kernel). Also factor the
filtering logic out into a separate function and add unit tests.
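
A minimal sketch of the extracted helper's shape (the enum and struct
here are stand-ins for the real config and OCI types):

```
// Stand-ins for the actual vfio_mode config type and OCI device entry.
enum VfioMode { GuestKernel, Vfio }
struct LinuxDevice { path: String }

// Drop /dev/vfio/* entries only when the guest kernel itself consumes
// the devices; in vfio mode the entries must survive into the spec.
fn filter_vfio_devices(devices: Vec<LinuxDevice>, mode: &VfioMode) -> Vec<LinuxDevice> {
    match mode {
        VfioMode::GuestKernel => devices
            .into_iter()
            .filter(|d| !d.path.starts_with("/dev/vfio/"))
            .collect(),
        _ => devices,
    }
}
```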

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2026-05-08 15:39:16 +02:00
Fabiano Fidêncio
8e65e89ade Merge pull request #13011 from kata-containers/fix-warnings
runtime-rs: Fix warnings in rust runtime
2026-05-08 15:12:53 +02:00
Fabiano Fidêncio
a541827a7e Merge pull request #12984 from fidencio/topic/network-pair-use-name-for-lookup
runtime-rs: network: use provided name for virt interface lookup
2026-05-08 14:31:58 +02:00
Fabiano Fidêncio
09bbc70302 Merge pull request #13002 from manuelh-dev/mahuber/unrequire-nim-svc
gatekeeper: Unrequire NVIDIA GPU test (temporary)
2026-05-08 10:02:00 +02:00
Fabiano Fidêncio
2879619d07 Merge pull request #12981 from fidencio/topic/kata-deploy-reduce-memory-consumption
kata-deploy: reduce memory consumption
2026-05-08 09:51:47 +02:00
Alex Lyn
1441b2b84a runtime-rs: Fix warnings in rust runtime
Unformatted Rust code in the runtime, its libraries, and the agent
sources shows up as uncommitted changed files, which can easily be
found just by running `cargo fmt --all`.

Let's get rid of that noise by formatting the offending files.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-08 14:56:00 +08:00
Manuel Huber
714adec3f8 tests: nvidia: Wait for NIM operator pod and print
Wait for the NIM operator pod to be running before deploying NIM
services. Add a temporary debug function that prints resource placement
across the different namespaces; remove it again once the NIM tests
have stabilized.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-08 06:27:48 +00:00
Manuel Huber
edfb6f5716 gatekeeper: Unrequire NVIDIA GPU test (temporary)
Temporarily unrequire the NVIDIA GPU test. We are seeing situations
in which two NIM service instances get deployed almost simultaneously,
one into the kata-containers-k8s-tests namespace (the expected current
context) and one into the default namespace. The NIM operator then
creates a deployment in each namespace and schedules two pods at the
same time; the NIM pod in the default namespace usually fails and
lingers.
We cannot yet explain why this happens at all, nor why it does not
happen in the TEE CI path.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-07 14:39:24 +02:00
Fabiano Fidêncio
8dde5f39b7 tests: dump kata-deploy pod describe+logs on install timeout
When kubectl wait times out, the pod never reached Ready, so the
existing log collection (which only runs after a successful wait)
produces "-- No entries --" and zero useful information.

Capture kubectl describe and kubectl logs (including previous
container) immediately on timeout so the next CI run shows exactly
why the pod is stuck (ImagePullBackOff, OOMKilled, probe failures,
containerd restart hang, etc.).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
0f3160276b ci: k8s: skip no-op Helm uninstall on free runners
In cleanup_kata_deploy, bail out early when no kata-deploy Helm release
exists so baremetal-* pre-deploy cleanup on fresh clusters does not
block on helm uninstall --wait (up to 10m).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
f5533950e6 kata-deploy: helm: cap container RSS via resources block
Plumb a resources block into the kata-deploy DaemonSet container in
the Helm chart so the cluster can size its memory footprint
predictably.

Defaults are sized from real /proc/<pid>/status numbers on an
unpatched 3.30.0 build running on a ~220-vCPU GPU node:

  VmRSS:    9944 kB  (~9.7 MiB)   <- actual physical memory
  RssAnon:  2628 kB  (~2.6 MiB)   <- heap + dirty stack pages
  VmData: 464668 kB  (~454 MiB)   <- tokio multi-thread workers'
                                     reserved-but-untouched stacks
  Threads: 225                    <- num_cpus()-driven worker pool

That VmData number is the source of the original "kata-deploy is
using 400 MB" reports: any monitoring layer that surfaces virtual
data size, committed memory, or memory.usage_in_bytes on a kernel
that includes mapped-but-untouched memory will happily reproduce
~400 MB even though only ~10 MiB is ever made resident. The earlier
commits in this series (current_thread tokio, mimalloc, shared kube
client, JSONPath removal, post-install re-exec) collapse VmData into
the tens of MiB and drop the post-install resident set further.

The defaults below are picked accordingly:

  requests:
    cpu: 25m            # install is mostly I/O wait; the post-install
                        # waiter is genuinely idle
    memory: 16Mi        # ~2x headroom over the unpatched VmRSS we
                        # measured, far more over the patched waiter

Operators who hit OOMKilled on unusually large or churny clusters can
override `resources` directly in their Helm values (or set it to {}
to remove all requests and inherit cluster defaults).

Fixes: https://github.com/kata-containers/kata-containers/discussions/12976

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
9e99b21ec5 kata-deploy: re-exec into a tiny post-install waiter
After install completes the kata-deploy DaemonSet pod has nothing else
to do for the rest of its lifetime — it just blocks on SIGTERM and then
runs cleanup. Up to here, the install path has built up substantial
peak heap (kube clients, deserialised Node/RuntimeClass objects, hyper
+ rustls TLS pools, parsed JSON / YAML), and on musl essentially none
of that is ever returned to the kernel. Idling in the same process
therefore pins the pod's RSS at the install peak indefinitely.

Re-exec the binary into a hidden `internal-post-install-wait` action
the moment install succeeds. execve(2) discards the entire address
space, so the waiter starts up holding only the working set it actually
needs (a config struct, the SIGTERM handler, and the health server).

To avoid a probe-availability gap during the handover the install
process clears FD_CLOEXEC on the health listener and passes the raw
FD to the child via KATA_DEPLOY_HEALTH_FD. The child reattaches the
FD as a tokio TcpListener and resumes serving /healthz and /readyz
without ever closing the socket — the kubelet sees no failure.

The detected container runtime is similarly threaded through
KATA_DEPLOY_DETECTED_RUNTIME so the waiter doesn't have to re-query
the apiserver. The new action is tagged `#[clap(hide = true)]` so
`--help` doesn't expose it; users should never invoke it directly.

Add the FD-inheritance helpers in health.rs:

  - prepare_listener_for_exec(): clears FD_CLOEXEC on a listener and
    returns its raw fd number.
  - listener_from_inherited_fd(): wraps an inherited fd back into a
    tokio::net::TcpListener (and re-sets FD_CLOEXEC so future host
    shellouts don't leak the socket).
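
A minimal sketch of those two helpers, assuming the libc crate for the
fcntl calls (the real health.rs may differ in error handling):

```
use std::os::fd::{AsRawFd, FromRawFd, RawFd};

// Clear FD_CLOEXEC so the listener survives execve(2); the returned fd
// number is what gets passed down via KATA_DEPLOY_HEALTH_FD.
fn prepare_listener_for_exec(listener: &std::net::TcpListener) -> std::io::Result<RawFd> {
    let fd = listener.as_raw_fd();
    let flags = unsafe { libc::fcntl(fd, libc::F_GETFD) };
    if flags < 0 || unsafe { libc::fcntl(fd, libc::F_SETFD, flags & !libc::FD_CLOEXEC) } < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(fd)
}

// Rewrap the inherited fd as a tokio listener and re-set FD_CLOEXEC so
// future host shellouts don't leak the socket.
fn listener_from_inherited_fd(fd: RawFd) -> std::io::Result<tokio::net::TcpListener> {
    let std_listener = unsafe { std::net::TcpListener::from_raw_fd(fd) };
    unsafe {
        let flags = libc::fcntl(fd, libc::F_GETFD);
        libc::fcntl(fd, libc::F_SETFD, flags | libc::FD_CLOEXEC);
    }
    std_listener.set_nonblocking(true)?; // tokio requires a non-blocking fd
    tokio::net::TcpListener::from_std(std_listener)
}
```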

Fixes: https://github.com/kata-containers/kata-containers/discussions/12976

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
af03ab2228 kata-deploy: replace JSONPath node lookups with typed accessors
The two pieces of node metadata kata-deploy actually reads are
.status.nodeInfo.containerRuntimeVersion and a single label, both of
which were being fetched through a homegrown JSONPath walker:

  - get_node_field() serialised the entire Node object back into a
    serde_json::Value tree on every call,
  - split_jsonpath() / get_jsonpath_value() then walked that tree by
    string key.

Both the deep clone and the helpers themselves are unnecessary — kube's
Node type is already strongly typed. Replace get_node_field() with two
purpose-built accessors that read straight off the Node struct:

  - get_container_runtime_version(): pulls
    status.node_info.container_runtime_version with a clear error if
    the field isn't populated.
  - get_node_label(key): returns Option<String> directly from
    metadata.labels.
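
With kube's typed objects this is plain field access; a sketch against
k8s-openapi's Node (the exact signatures in the tree are assumptions):

```
use k8s_openapi::api::core::v1::Node;

// Read .status.nodeInfo.containerRuntimeVersion straight off the struct.
fn get_container_runtime_version(node: &Node) -> anyhow::Result<String> {
    node.status
        .as_ref()
        .and_then(|s| s.node_info.as_ref())
        .map(|info| info.container_runtime_version.clone())
        .ok_or_else(|| anyhow::anyhow!("node status.nodeInfo is not populated"))
}

// Read a single label from metadata.labels, if present.
fn get_node_label(node: &Node, key: &str) -> Option<String> {
    node.metadata.labels.as_ref().and_then(|l| l.get(key).cloned())
}
```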

Drop split_jsonpath, get_jsonpath_value, and their unit tests (which
existed only to cover the JSONPath walker we no longer have). Update
the three callers (config.rs, runtime/manager.rs, runtime/containerd.rs)
to use the typed accessors.

This removes the entire serde_json::Value clone-and-walk path from the
hot read path and meaningfully cuts allocator churn during install.

Fixes: https://github.com/kata-containers/kata-containers/discussions/12976

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
52e6a19253 kata-deploy: size-optimise the release profile
Apply per-package release-profile overrides for the kata-deploy crate
only:

  opt-level = "z"     # optimise for size, not speed
  codegen-units = 1   # let LLVM see the whole crate when inlining

The binary is throwaway: it runs once at DaemonSet pod start, finishes
the install in seconds, and then sits idle waiting for SIGTERM. There
is no hot path to optimise for speed, so trading a bit of compile time
and a few percent of CPU for a meaningfully smaller text segment is the
right call here.

These overrides live at the workspace root and are scoped via
[profile.release.package."kata-deploy"], so they do not affect the
agent, runtime-rs, dragonball, or any of the libs / tools crates.

Fixes: https://github.com/kata-containers/kata-containers/discussions/12976

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
6cd842494c kata-deploy: cap the tokio worker pool to 2 threads
The default #[tokio::main] expands with flavor = "multi_thread" and
worker_threads = num_cpus::get(). On a typical NVIDIA GPU node
(200+ vCPUs) that allocates 200+ worker threads with ~2 MiB stacks
each, which is the single largest contributor to the DaemonSet pod's
VmData reservation — hundreds of MiB of address space mapped but never
touched, easily reproducing the "kata-deploy is using ~400 MB" reports
on any monitoring layer that surfaces VSZ / committed virtual memory.

Switch to a fixed two-worker multi-thread runtime instead:

  #[tokio::main(flavor = "multi_thread", worker_threads = 2)]

Two workers is exactly the right number for kata-deploy:

  - the install path is overwhelmingly I/O-bound and runs serially;
    one worker is enough to drive the install future itself,
  - install does shell out to `nsenter --target 1 systemctl restart
    containerd` (and friends) via the synchronous std::process::
    Command::output(), which wedges the worker thread it runs on for
    tens of seconds; the second worker keeps the spawned health-server
    task able to answer kubelet probes inside timeoutSeconds while
    the first is blocked.
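
That division of labour, as a runnable sketch with the health server
reduced to a stub (assumes tokio's rt-multi-thread, macros, and time
features):

```
use std::{process::Command, time::Duration};

#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
async fn main() {
    // One worker keeps this task runnable while the other is wedged below.
    tokio::spawn(async {
        loop {
            println!("healthz: ok"); // stand-in for answering kubelet probes
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
    });

    // A synchronous shell-out pins its worker thread for the whole
    // duration, just like the nsenter/systemctl restarts do.
    let _ = Command::new("sleep").arg("30").output();
}
```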

flavor = "current_thread" would be tighter still on stacks (~4 MiB
saved) but is fundamentally unsafe here: with a single runtime thread,
any blocking host_systemctl call freezes the health server too, the
kubelet fails the readiness probe, and the pod is restarted long
before install completes. The CI lifecycle test reliably reproduces
this as a 15-minute timeout waiting for the kata-deploy DaemonSet pod
to become Ready.

Net result vs. upstream's num_cpus()-driven pool on a 200-vCPU node:
~200 fewer worker threads, ~400 MiB less VmData reservation, while
keeping kubelet probes responsive across the entire install path.

Add the "sync" tokio feature here too so subsequent commits in the
series can use tokio::sync primitives (OnceCell) without another
features bump.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
346119108e kata-deploy: drop unused kube features
The binary doesn't use kube::runtime (controllers, watchers, reflectors)
or kube::derive (the CustomResource macro). Pulling them in only added
transitive deps (kube-runtime, kube-derive, backon, educe, ahash,
async-broadcast, ...) and inflated the binary's static data segment for
no functional gain.

Set default-features = false and select only what the binary actually
calls into: the kube-client surface plus the rustls-tls backend that
hyper-rustls already pulled in transitively. Behaviour is unchanged.
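
The resulting Cargo.toml entry looks roughly like this (the version is
a placeholder):

```
[dependencies]
kube = { version = "x.y", default-features = false, features = ["client", "rustls-tls"] }
```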

Fixes: https://github.com/kata-containers/kata-containers/discussions/12976

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
114cacf439 Merge pull request #12857 from kata-containers/topic/runtime-rs-coldplug-gpu
runtime-rs: coldplug GPU support
2026-05-07 12:54:04 +02:00
Fabiano Fidêncio
19c194aa94 ci: Add runtime-rs GPU shims to NVIDIA GPU CI workflow
Add qemu-nvidia-gpu-runtime-rs and qemu-nvidia-gpu-snp-runtime-rs to
the NVIDIA GPU test matrix so CI covers the new runtime-rs shims.

Introduce a `coco` boolean field in each matrix entry and use it for
all CoCo-related conditionals (KBS, snapshotter, KBS deploy/cleanup
steps). This replaces fragile name-string comparisons that were already
broken for the runtime-rs variants: `nvidia-gpu (runtime-rs)` was
incorrectly getting KBS steps, and `nvidia-gpu-snp (runtime-rs)` was
not getting the right env vars.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-07 10:33:26 +02:00
Fabiano Fidêncio
1682b73e38 kata-deploy: Add qemu-nvidia-gpu-tdx-runtime-rs shim
Register the new qemu-nvidia-gpu-tdx-runtime-rs shim across the kata-deploy
stack so it is built, installed, and exposed as a RuntimeClass.

This adds the shim to the Rust binary's RUST_SHIMS list (so it uses the
runtime-rs binary), SHIMS list, the qemu-tdx-experimental share name
mapping, and the x86_64 default shim set. The Helm chart gets the new
shim entry in values.yaml, try-kata-nvidia-gpu.values.yaml, and the
RuntimeClass overhead definition in runtimeclasses.yaml.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-07 10:33:26 +02:00
Fabiano Fidêncio
8a33007806 runtime-rs: Add configuration-qemu-nvidia-gpu-tdx-runtime-rs.toml.in
Add a new runtime-rs configuration template that combines the NVIDIA GPU
cold-plug stack with Intel TDX confidential guest support. This is the
runtime-rs counterpart of the Go runtime's configuration-qemu-nvidia-gpu-tdx
template.

The template merges the GPU NV settings (VFIO cold-plug, Pod Resources API,
NV-specific kernel/image/firmware, extended timeouts) with TDX confidential
guest settings (confidential_guest, OVMF.inteltdx.fd firmware, TDX Quote
Generation Service socket, confidential NV kernel and image).

The Makefile is updated with the new config file registration and the
FIRMWARETDVFPATH_NV variable pointing to OVMF.inteltdx.fd.

Also removes a stray tdx_quote_generation_service_socket_port setting
from the SNP GPU template where it did not belong.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-07 10:33:26 +02:00
Fabiano Fidêncio
2280620cb9 kata-deploy: Add qemu-nvidia-gpu-snp-runtime-rs shim
Register the new qemu-nvidia-gpu-snp-runtime-rs shim across the kata-deploy
stack so it is built, installed, and exposed as a RuntimeClass.

This adds the shim to the Rust binary's RUST_SHIMS list (so it uses the
runtime-rs binary), SHIMS list, the qemu-snp-experimental share name
mapping, and the x86_64 default shim set. The Helm chart gets the new
shim entry in values.yaml, try-kata-nvidia-gpu.values.yaml, and the
RuntimeClass overhead definition in runtimeclasses.yaml.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-07 10:33:26 +02:00
Fabiano Fidêncio
e98a864285 runtime-rs: Add configuration-qemu-nvidia-gpu-snp-runtime-rs.toml.in
Add a new runtime-rs configuration template that combines the NVIDIA GPU
cold-plug stack with AMD SEV-SNP confidential guest support. This is the
runtime-rs counterpart of the Go runtime's configuration-qemu-nvidia-gpu-snp
template.

The template merges the GPU NV settings (VFIO cold-plug, Pod Resources API,
NV-specific kernel/image/firmware, extended timeouts) with the SNP
confidential guest settings (confidential_guest, sev_snp_guest, SNP ID
block/auth, guest policy, AMDSEV.fd firmware, confidential NV kernel and
image).

The Makefile is updated with the new config file registration, the
CONFIDENTIAL_NV image/kernel variables, and FIRMWARESNPPATH_NV pointing
to AMDSEV.fd.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-07 10:33:26 +02:00
Fabiano Fidêncio
92a8cd56d1 kata-deploy: Add qemu-nvidia-gpu-runtime-rs shim
Register the Rust NVIDIA GPU runtime as a kata-deploy shim so it gets
installed and configured alongside the existing Go-based
qemu-nvidia-gpu shim.

Add qemu-nvidia-gpu-runtime-rs to the RUST_SHIMS list and the default
enabled shims, create its RuntimeClass entry in the Helm chart, and
include it in the try-kata-nvidia-gpu values overlay. The kata-deploy
installer will now copy the runtime-rs configuration and create the
containerd runtime entry for it.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-07 10:33:26 +02:00
Fabiano Fidêncio
1ada256581 runtime-rs: Add configuration-qemu-nvidia-gpu-runtime-rs.toml.in
Add a QEMU configuration template for the NVIDIA GPU runtime-rs shim,
mirroring the Go runtime's configuration-qemu-nvidia-gpu.toml.in. The
template uses _NV-suffixed Makefile variables for kernel, image, and
verity params so the GPU-specific rootfs and kernel are selected at
build time.

Wire the new config into the runtime-rs Makefile: define
FIRMWAREPATH_NV with arch-specific OVMF/AAVMF paths (matching the Go
runtime's PR #12780), add EDK2_NAME for x86_64, and register the config
in CONFIGS/CONFIG_PATHS/SYSCONFIG_PATHS so it gets installed alongside
the other runtime-rs configurations.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-07 10:33:26 +02:00
Alex Lyn
a51e0b630e agent: Update VFIO device handling for GPU cold-plug
Extend the in-guest agent's VFIO device handler to support the cold-plug
flow. When the runtime cold-plugs a GPU before the VM boots, the agent
needs to bind the device to the vfio-pci driver inside the guest and
set up the correct /dev/vfio/ group nodes so the workload can access
the GPU.

This updates the device discovery logic to handle the PCI topology that
QEMU presents for cold-plugged vfio-pci devices and ensures the IOMMU
group is properly resolved from the guest's sysfs.
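
The bind itself goes through the standard sysfs ABI; a minimal sketch
(error handling and the unbind-from-previous-driver case elided):

```
use std::fs;

// Bind a cold-plugged device to vfio-pci inside the guest: pin the
// driver via driver_override, then ask the kernel to (re)probe it.
fn bind_to_vfio_pci(bdf: &str) -> std::io::Result<()> {
    fs::write(format!("/sys/bus/pci/devices/{bdf}/driver_override"), "vfio-pci")?;
    fs::write("/sys/bus/pci/drivers_probe", bdf)?;
    Ok(())
}
```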

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00
Fabiano Fidêncio
cb6fb51920 runtime-rs: Do not pass through audio device from IOMMU group
NVIDIA GPUs often have an HDA audio controller (PCI class 0x0403) in the
same IOMMU group. This device should not be passed through to the guest,
just like Host and PCI bridges.

Change filter_bridge_device() to accept a slice of PCI class bitmasks
and add 0x0403 (audio) to the ignore list alongside 0x0600 (host/PCI
bridge). This matches the Go runtime fix from NVIDIA/kata-containers#26.
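
A sketch of the check, assuming the ignore list holds the 16-bit
class/subclass prefix of the 24-bit sysfs class value:

```
const PCI_CLASS_BRIDGE: u32 = 0x0600; // host and PCI bridges
const PCI_CLASS_AUDIO: u32 = 0x0403;  // HDA audio controllers

// sysfs exposes class/subclass/prog-if as one 24-bit value; compare its
// top 16 bits against the ignore list.
fn is_ignored_device(sysfs_class: u32, ignored: &[u32]) -> bool {
    ignored.iter().any(|c| (sysfs_class >> 8) == *c)
}
```

So an HDA controller reporting 0x040300 is skipped, while a 3D
controller reporting 0x030200 still gets passed through.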

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00
Alex Lyn
7e2dff8179 runtime-rs: Wire BlockDeviceModern into rawblock volume and container
Use BlockCfgModern for rawblock volumes when the hypervisor supports it,
passing logical and physical sector sizes from the volume metadata.

In the container manager, clear Linux.Resources fields (Pids, BlockIO,
Network) that genpolicy expects to be null, and filter VFIO character
devices from Linux.Devices to avoid policy rejection.

Update Dragonball's inner_device to handle the DeviceType::VfioModern
variant in its no-op match arm.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00
Alex Lyn
eecb1a246c runtime-rs: Add resource manager VFIO modern handling and CDI wiring
Extend the resource manager to handle VfioModern and BlockModern device
types when building the agent's device list and storage list. For VFIO
modern devices, the manager resolves the container path and sets the
agent Device.id to match what genpolicy expects.

Rework CDI device annotation handling in container_device.rs:
- Strip the "vfio" prefix from device names when building CDI annotation
  keys (cdi.k8s.io/vfio0, cdi.k8s.io/vfio1, etc.)
- Remove the per-device index suffix that caused policy mismatches
- Add iommufd cdev path support alongside legacy VFIO group paths

Update the vfio driver to detect iommufd cdev vs legacy group from
the CDI device node path.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00
Alex Lyn
4f618d09d5 runtime-rs: Add Pod Resources CDI discovery in sandbox
Query the kubelet Pod Resources API during sandbox setup to discover
which GPU devices have been allocated to the pod. When cold_plug_vfio
is enabled, the sandbox resolves CDI device specs, extracts host PCI
addresses and IOMMU groups from sysfs, and creates VfioModernCfg
device entries that get passed to the hypervisor for cold-plug.

Add pod-resources and cdi crate dependencies to the runtimes and
virt_container workspace members.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00
Alex Lyn
21a47cfe8d runtime-rs: Wire VFIO cold-plug into QEMU inner
Implement add_device() and remove_device() support for
DeviceType::VfioModern and DeviceType::BlockModern in the QEMU inner
hypervisor layer.

For cold-plug (before VM boot): VfioDeviceConfig/VfioDeviceGroup
structs are constructed from the device's resolved PCI address, IOMMU
group, and bus assignment, then appended to the QEMU command line via
cmdline_generator.

Block devices use VirtioBlkDevice with the modern config's sector size
fields and are always cold-plugged onto the command line.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00
Alex Lyn
0f9ab37abe runtime-rs: Bump QMP timeouts for VFIO cold-plug
Bump QMP connection timeout from 10s to 30s and initial read timeout
from 250ms to 5s to accommodate the longer initialization time when
VFIO devices are cold-plugged (IOMMU domain setup and device reset
can be slow for GPUs).

Re-export cmdline_generator types from qemu/mod.rs for downstream use.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00
Alex Lyn
a975a998a6 runtime-rs: Add QEMU VFIO command-line parameter structs
Add QEMU command-line parameter types for VFIO device cold-plug:

- ObjectIommufd: /dev/iommu object for iommufd-backed passthrough
- PCIeVfioDevice: vfio-pci device on a PCIe root port or switch port,
  supporting both legacy VFIO group and iommufd cdev backends
- FWCfgDevice: firmware config device for fw_cfg blob injection
- VfioDeviceBase/VfioDeviceConfig/VfioDeviceGroup: high-level wrappers
  that compose the above into complete QEMU argument sets, resolving
  IOMMU groups, device nodes, and per-device fw_cfg entries

Refactor existing cmdline structs (BalloonDevice, VirtioNetDevice,
VirtioBlkDevice, etc.) to use a shared devices_to_params() helper
and align the ToQemuParams implementations.
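
For the legacy group backend this boils down to arguments of the
following shape (the struct fields are illustrative, not the actual
cmdline_generator types):

```
// Minimal shape of a vfio-pci command-line entry: the host BDF plus the
// id of the reserved root or switch downstream port.
struct PcieVfioDevice {
    host_bdf: String, // e.g. "0000:65:00.0"
    bus: String,      // e.g. "rp0"
}

impl PcieVfioDevice {
    fn to_qemu_params(&self) -> Vec<String> {
        vec![
            "-device".to_string(),
            format!("vfio-pci,host={},bus={}", self.host_bdf, self.bus),
        ]
    }
}
```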

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00
Alex Lyn
074e9e9423 runtime-rs: Add PCIe topology cold-plug port management
Extend PCIeTopology to support cold-plug port reservation and release
for VFIO devices. New fields track the topology mode (NoPort, RootPort,
SwitchPort), whether cold-plug dynamic expansion is enabled, and a map
of reserved bus assignments per device.

PCIeTopology::new() now infers the mode from the configured root-port
and switch-port counts, pre-seeds the port structures, and makes
add_root_ports_on_bus() idempotent so that PortDevice::attach can
safely call it again after the topology has already been initialized.

New methods:
- reserve_bus_for_device: allocate a free root port or switch downstream
  port for a device, expanding the port map when cold_plug is enabled
- release_bus_for_device: free the previously reserved port
- find_free_root_port / find_free_switch_down_port: internal helpers
- release_root_port / release_switch_down_port: internal helpers

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00
Alex Lyn
064aa340ab runtime-rs: Wire modern device types into device config and manager
Add DeviceConfig::VfioModernCfg and DeviceConfig::BlockCfgModern
variants so the device manager can accept creation requests for the
modern VFIO and block drivers introduced in the previous commits.

Wire find_device() to look up VfioModern devices by iommu_group_devnode
and BlockModern devices by path_on_host. Add create_block_device_modern()
for BlockConfigModern with the same driver-option normalization and
virt-path assignment as the legacy path.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00
Alex Lyn
6c0b53fe36 runtime-rs: Add BlockDeviceModern driver
Add a modern block device driver using the Arc<Mutex> pattern for
interior mutability, matching the VfioDeviceModern approach. The driver
implements the Device trait with attach/detach/hotplug lifecycle
management, and supports BlockConfigModern with logical and physical
sector size fields.

Add the DeviceType::BlockModern enum variant so the driver compiles.
The device_manager and hypervisor cold-plug wiring follow in subsequent
commits.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00
Alex Lyn
e72ed1c12e runtime-rs: Add VFIO modern device driver
Add the VfioDeviceModern driver for VFIO device passthrough in
runtime-rs. The driver handles device discovery through sysfs, detects
whether the host uses iommufd cdev or legacy VFIO group interfaces,
resolves PCI BDF addresses and IOMMU groups, and implements the Device
and PCIeDevice traits for hypervisor integration.

The module is structured as:
- core.rs: sysfs discovery, BDF parsing, IOMMU group resolution,
  device-node path logic for both iommufd cdev and legacy group paths
- device.rs: VfioDeviceModern/VfioDeviceModernHandle types, Device
  and PCIeDevice trait implementations
- mod.rs: host capability detection (iommufd vs legacy), backend
  selection logic

The DeviceType::VfioModern enum variant and stub PCIeTopology methods
(reserve_bus_for_device, release_bus_for_device) are added so the
driver compiles; full topology wiring follows in a subsequent commit.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00
Alex Lyn
564c39907a runtime-rs: Improve vsock connect with spawn_blocking and backoff
The vsock connect loop previously ran the blocking connect(2) syscall
directly on a tokio async worker thread, which could stall other async
tasks. Move the socket creation and connect(2) call into
spawn_blocking so the async runtime remains responsive.

Replace the fixed-interval retry loop with an Instant-based deadline
and bounded exponential backoff (10ms-500ms, doubling each attempt).
This avoids hammering the vsock endpoint during slow VM boots while
still converging quickly once the guest agent is ready.

Also improve log messages to include attempt counts and remaining time.
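
The retry loop reduces to roughly the following, written generically
over the blocking connect call so the sketch stays self-contained:

```
use std::time::{Duration, Instant};

async fn connect_with_deadline<T, F>(connect_once: F, timeout: Duration) -> std::io::Result<T>
where
    T: Send + 'static,
    F: Fn() -> std::io::Result<T> + Clone + Send + 'static,
{
    let deadline = Instant::now() + timeout;
    let mut delay = Duration::from_millis(10);
    loop {
        let attempt = connect_once.clone();
        // connect(2) runs on the blocking pool, not on an async worker.
        match tokio::task::spawn_blocking(attempt).await.expect("join error") {
            Ok(conn) => return Ok(conn),
            Err(e) if Instant::now() >= deadline => return Err(e),
            Err(_) => {
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_millis(500)); // 10ms..500ms
            }
        }
    }
}
```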

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00
Alex Lyn
b4768cfc61 dragonball: Adapt VFIO DMA calls to vfio-ioctls 0.6 API
The vfio-ioctls 0.6.0 crate changed the vfio_dma_map signature: the
host address parameter is now a raw pointer (*mut u8) instead of u64,
and the size parameter is usize instead of u64. Since the kernel uses
the host address to set up DMA mappings to physical memory — and the
caller must guarantee the memory behind that pointer remains valid for
the lifetime of the mapping — upstream marked vfio_dma_map as unsafe fn.

Wrap vfio_dma_map calls in unsafe blocks and adjust the type casts
accordingly. vfio_dma_unmap only needed the usize cast for the size
parameter (it does not take a host address, so it remains safe).
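
The call-site change looks roughly like this; the argument order and
error handling are assumptions carried over from the 0.5 call sites,
not verified against the 0.6 docs:

```
use vfio_ioctls::VfioContainer;

// Safety: the memory behind `host_addr` must remain valid and mapped
// for the lifetime of the DMA mapping -- the reason upstream made
// vfio_dma_map an `unsafe fn`.
unsafe fn dma_map(c: &VfioContainer, iova: u64, host_addr: u64, size: u64) -> anyhow::Result<()> {
    c.vfio_dma_map(iova, size as usize, host_addr as *mut u8)?;
    Ok(())
}
```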

Bump workspace dependencies:
- vfio-bindings 0.6.1 -> 0.6.2
- vfio-ioctls 0.5.0 -> 0.6.0

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00
Alex Lyn
0bb9b66815 kata-sys-util: Add PCI helpers for VFIO cold-plug paths
The VFIO cold-plug path needs to resolve a PCI device's sysfs address
from its /dev/vfio/ group or iommufd cdev node. Extend the PCI helpers
in kata-sys-util to support this: add a function that walks
/sys/bus/pci/devices to find a device by its IOMMU group, and expose the
guest BDF that the QEMU command line will reference.

These helpers are consumed by the runtime-rs hypervisor crate when
building VFIO device descriptors for the QEMU command line.
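
The group-to-device resolution reduces to a sysfs walk along these
lines (function name and signature are illustrative):

```
use std::{fs, path::PathBuf};

// Find the PCI device whose IOMMU group matches `group` by reading each
// device's iommu_group symlink under /sys/bus/pci/devices.
fn device_by_iommu_group(group: &str) -> std::io::Result<Option<PathBuf>> {
    for entry in fs::read_dir("/sys/bus/pci/devices")? {
        let dev = entry?.path();
        if let Ok(link) = fs::read_link(dev.join("iommu_group")) {
            if link.file_name().and_then(|n| n.to_str()) == Some(group) {
                return Ok(Some(dev));
            }
        }
    }
    Ok(None)
}
```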

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00