kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-07-02 23:23:44 +00:00

Author	SHA1	Message	Date
Fabiano Fidêncio	774e698aeb	Merge pull request #12293 from Apokleos/graceful-errors runtime-rs: make OOM watcher and signal handling lifecycle-aware	2026-06-16 15:02:54 +02:00
Alex Lyn	8fc1a16225	runtime-rs: Make signal_process idempotent for exited init processes Address the issue where signal_process returns an INTERNAL error when the container's init process has already exited, and ensure teardown is never aborted by signal failures. Introduce is_no_such_process_error() to detect "no such process" conditions (ESRCH/ENOENT codes or equivalent messages). When the init process is already gone, treat it as success with an info log instead of an error. In stop_process(), never propagate signal failures. During sandbox shutdown the agent connection is often already closed, causing AgentConnectionClosed errors that bypass is_no_such_process_error(). If stop_process() aborts on such errors, cleanup_container() is skipped and leftover mounts cause "Resource busy" failures in sandbox cleanup. Restore "always proceed to cleanup" semantics: log the failure as a warning, but never skip resource cleanup. Resource cleanup must be best-effort and idempotent regardless of kill outcome. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com> Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-06-16 15:12:28 +08:00
Alex Lyn	44dd2b1f34	runtime-rs: Refine OOM watcher error reporting for sandbox teardown This commit refines the error handling within the OOM watcher to distinguish between genuine failures and errors that occur as a natural consequence of sandbox shutdown via the helper is_normal_shutdown_error. Previously, various connection-related errors during teardown were logged as warnings, contributing to noisy logs. It aims to improve OOM error handling, distinguish error types: The logic now differentiates between "normal shutdown" errors (e.g., Connection reset by peer, broken pipe) and actual OOM watcher failures. This enhancement makes OOM event logs more informative and less prone to clutter during normal sandbox termination. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-16 15:12:24 +08:00
Alex Lyn	3095bd379b	runtime-rs: Introduce cancellation for OOM watcher during teardown This commit introduces an explicit cancellation mechanism for the OOM watcher loop within VirtSandbox. This addresses the issue where the watcher continues to poll for OOM events even when the sandbox is being stopped, leading to spurious "Connection reset by peer" errors. Key changes: (1) A CancellationToken is added to VirtSandbox to signal the watcher loop when the sandbox is undergoing teardown. (2) The OOM watcher loop in VirtSandbox::start() is now wrapped in a tokio::select! statement. This allows it to concurrently listen for two events: - cancel_token.cancelled(): Triggered when the sandbox/VM is stopping. - agent.get_oom_event(): The regular OOM event polling. (3) In the sandbox stop/teardown path, cancel_token.cancel() is called before stopping the VM. This ensures the OOM watcher loop exits cleanly via the cancellation token, preventing the occurrence of ECONNRESET/EOF errors on a closed channel. This change improves the robustness of OOM event handling during sandbox lifecycle management. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-16 12:56:54 +08:00
Alex Lyn	0ffdc576d3	runtime-rs: Introduce a helper to check if process/container exists Returns `true` if the error indicates that the target process/container no longer exists. This is used to determine if an operation, like signaling a process, failed because the target is no longer available. The function checks for standard OS error codes (`ESRCH`, `ENOENT`) and common error message patterns. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-16 12:56:54 +08:00
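
The check described in this commit could look roughly like the following. This is a minimal sketch, not the actual runtime-rs code: the `std::io::Error` parameter type, the hard-coded Linux errno values, and the message pattern list are all assumptions made for illustration.

```rust
use std::io;

// Linux errno values, hard-coded so the sketch needs no external crate.
const ENOENT: i32 = 2;
const ESRCH: i32 = 3;

/// Returns true if `err` suggests the target process/container no longer exists.
/// Hypothetical signature: the real helper may accept a different error type.
pub fn is_no_such_process_error(err: &io::Error) -> bool {
    // 1) Standard OS error codes.
    if matches!(err.raw_os_error(), Some(ESRCH) | Some(ENOENT)) {
        return true;
    }
    // 2) Message patterns (assumed list) for errors relayed as plain text.
    let msg = err.to_string().to_lowercase();
    ["no such process", "no such file or directory"]
        .iter()
        .any(|pattern| msg.contains(pattern))
}
```

A caller such as signal_process could treat a `true` result as success and log it at info level instead of returning an error.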
Alex Lyn	59677688ee	runtime-rs: Introduce a helper to check normal oom shutdown errors It is mainly used to check whether an error is a normal OOM shutdown error caused by a network disconnection. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-16 12:56:54 +08:00
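
A helper of this kind might be as small as the sketch below. The string-based signature and the pattern list are assumptions; only "Connection reset by peer" and "broken pipe" are named in the related commit messages.

```rust
/// Returns true when an error message looks like a side effect of sandbox
/// shutdown rather than a genuine OOM watcher failure.
/// Hypothetical signature: the real helper may inspect a typed error instead.
pub fn is_normal_shutdown_error(message: &str) -> bool {
    let message = message.to_lowercase();
    // Patterns taken from the examples in the commit messages.
    ["connection reset by peer", "broken pipe"]
        .iter()
        .any(|pattern| message.contains(pattern))
}
```

The OOM watcher could then log matching errors at a lower level and reserve warnings for everything else.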
Fabiano Fidêncio	cfab6f496b	runtime-rs: Propagate block device read-only flag to the VMM Block volumes and block-mode device nodes were attached to the guest read-write regardless of the volume's read-only intent, so the guest-visible virtio-blk device was always writable. This matters beyond simple write protection: filesystems such as XFS inspect the block device read-only state to decide whether to attempt journal/log recovery. When the device is writable, XFS tries to replay the log even on a read-only mount, which fails badly. Mounting with "-o ro" inside the guest is not sufficient; the device itself must advertise read-only (VIRTIO_BLK_F_RO), which only happens when the VMM opens the backing image read-only. Set is_readonly on the block device config from two signals, combined with OR so either one marks the device read-only: - the read-only intent from the OCI spec: * bind-mounted block volumes and direct-assigned (raw block) volumes derive it from the "ro" mount option, and * block-mode volumes (e.g. Kubernetes volumeDevices) arrive as device nodes in spec.Linux.Devices with no mount option; their intent is expressed only via the cgroup device access in spec.Linux.Resources.Devices ("rm" = read+mknod, no write, for read-only; "rwm" for read-write). handler_devices() derives the flag from the matching cgroup allow rule, and - the host block device's own read-only flag (queried via the BLKROGET ioctl). Both the volume path (block_volume/rawblock_volume) and the device-node path (handler_devices, resolving the host node via get_host_path) honor it, so a device that is physically read-only on the host is exposed read-only to the guest even when the intent is not encoded in the OCI spec. All in-tree hypervisors (qemu, cloud-hypervisor, dragonball) already honor BlockConfig.is_readonly, so no hypervisor changes are required. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor	2026-06-15 23:18:36 +02:00
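
The way the two signals combine can be illustrated as follows. This is a sketch under stated assumptions: all three function names are hypothetical, and the host flag is passed in as a plain `bool` rather than queried through the BLKROGET ioctl.

```rust
/// Read-only intent from a mount option list: true if "ro" is present.
pub fn ro_from_mount_options(options: &[&str]) -> bool {
    options.iter().any(|opt| *opt == "ro")
}

/// Read-only intent from a cgroup device access string:
/// "rm" (read + mknod, no write) means read-only, "rwm" means read-write.
pub fn ro_from_cgroup_access(access: &str) -> bool {
    access.contains('r') && !access.contains('w')
}

/// Either signal is enough to mark the device read-only (logical OR).
pub fn is_readonly(spec_intent_ro: bool, host_device_ro: bool) -> bool {
    spec_intent_ro || host_device_ro
}
```

With this shape, a device that is read-only on the host stays read-only in the guest even when the OCI spec carries no such intent.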
Fabiano Fidêncio	37c4a0b6a2	Merge pull request #13128 from nikolasgkou/fix/guest-protection-fallback runtime-rs: don't fail VM start when guest protection detection errors	2026-06-13 08:56:56 +02:00
nikolasgkou	80b8f592a0	runtime-rs: skip guest protection detection for non-confidential guests prepare_protection_device_config() called available_guest_protection() unconditionally and propagated any error before the "confidential_guest is not set" case was handled. On AMD hosts where the kvm_amd `sev` module parameter is "Y" but the CPU does not expose the SEV-SNP CPUID bit (8000_001f EAX[4]) -- e.g. consumer Ryzen -- available_guest_protection() returns Err("SEV not supported"), which blocked every non-confidential VM from booting even though no protection was requested. When confidential_guest is not set there is no reason to probe the host, so return Ok(None) before calling available_guest_protection(). Detection (and any error it produces) now runs only when a confidential guest is actually requested. Signed-off-by: nikolasgkou <nikolasgkou@disroot.org>	2026-06-12 22:20:13 +02:00
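
The control-flow change can be sketched like this. The enum, the function name, and the closure-based detection hook are hypothetical stand-ins for prepare_protection_device_config() and available_guest_protection().

```rust
/// Hypothetical subset of guest protection kinds.
#[derive(Debug, PartialEq)]
pub enum GuestProtection {
    Snp,
    Tdx,
}

/// Probe the host only when a confidential guest was requested.
/// `detect` stands in for the host detection call.
pub fn prepare_protection<F>(
    confidential_guest: bool,
    detect: F,
) -> Result<Option<GuestProtection>, String>
where
    F: FnOnce() -> Result<GuestProtection, String>,
{
    if !confidential_guest {
        // No protection requested: skip detection so its errors cannot block boot.
        return Ok(None);
    }
    detect().map(Some)
}
```

A detection error now surfaces only on the confidential path, where it is actually relevant.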
Gregory Ling	d90178c179	runtime-rs: Fix queue_size of zero in block_rootfs Fix BlockRootfs to save the queue_size, num_queues, logical_sector_size, and physical_sector_size of the hypervisor's block device info in the BlockConfig passed to the vm Fixes #13210 Signed-off-by: Gregory Ling <17791817+glingy@users.noreply.github.com>	2026-06-12 18:24:50 +02:00
Fabiano Fidêncio	110843d6e1	Merge pull request #13138 from manuelh-dev/mahuber/runt-rs-mem-file-removal runtime(-rs): remove file_mem_backend config option	2026-06-12 17:13:04 +02:00
Fupan Li	9553614f32	Merge pull request #12772 from Apokleos/nydus-standalone runtime-rs: Nydus standalone mode support in runtime-rs	2026-06-12 10:36:17 +08:00
Manuel Huber	86fd65271c	runtime-rs: remove file_mem_backend config option Although the config knob is parsed, it is unused in the rust shim, which renders it useless. Remove the file_mem_backend config option as there are no current users of it. As this option is usable in the go shim, we leave it intact there. For the rust shim, /dev/shm is still used in a similar way to the go shim when filesystem sharing is enabled (virtio-fs). Future use cases that need other file memory backends are currently planned to define them in a similar manner: determine the proper file memory backend from the configuration/platform, but do not let end users choose it. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-06-12 00:07:16 +00:00
Alex Lyn	fa84eecd2d	runtime-rs: Implement ShareVirtioFsNydus for standalone mode Introduce `ShareVirtioFsNydus` to enable standalone Nydus rootfs support. This implementation acts as the bridge between runtime-rs and the external `nydusd` daemon. Key Capabilities: (1) Trait Implementation: Implements `ShareFs` (for VM device/storage) and `NydusShareFs` (for RAFS lifecycle) traits. (2) Daemon Lifecycle Management: Handles `nydusd` spawning, supervision, and graceful shutdown. (3) Native Overlay Support: Configures `nydusd` with `passthrough_fs` backend to provide native overlay (upperdir/workdir) support. (4) API Integration: Utilizes `NydusClient` for granular control over RAFS mount/umount operations. (5) QEMU Integration: Enables `virtio-fs-nydus` device support, facilitating standalone mode execution. This implementation allows Kata containers to utilize an external `nydusd` process for Nydus rootfs management, providing a cleaner separation between the runtime and the Nydus daemon lifecycle. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-11 21:42:48 +02:00
Alex Lyn	edfe9ea403	runtime-rs: refine ShareFs abstraction with lifecycle and Nydus traits Refactor the `ShareFs` trait to improve modularity and support standalone Nydus mode: (1) Added `stop()` method to manage daemon teardown. (2) Introduced a dedicated trait for Nydus-specific data-plane operations. This refactoring cleans up the `ShareFs` trait by consolidating daemon lifecycle handling and isolating Nydus-specific extensions, paving the way for cleaner standalone Nydus implementation. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-11 21:42:48 +02:00
Alex Lyn	720a8688b4	runtime-rs: Add daemon manager for nydusd process lifecycle Implement Nydusd to manage nydusd daemon process: (1) start: spawn process, validate paths, wait for API ready, setup passthrough fs. (2) stop: kill process, cleanup socket files. (3) mount_rafs/mount_rafs_with_overlay: high-level filesystem mount operations. (4) build_args: construct virtiofs mode command line arguments. This provides process lifecycle management with internal NydusClient Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-11 21:42:48 +02:00
Alex Lyn	c1ebf269f7	runtime-rs: Add nydus client for nydusd API communication via HTTP Implement NydusClient to interact with nydusd daemon via Unix socket: (1) check_status: query daemon state via GET /api/v1/daemon. (2) mount/umount: manage filesystem mounts via POST/DELETE /api/v1/mount. (3) wait_until_ready: poll daemon until RUNNING state. This provides a lightweight, stateless HTTP client layer for nydusd API. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-11 21:42:48 +02:00
Alex Lyn	b0ebbc685d	runtime-rs: Add support for independent iothreads for virtio blk devices As independent iothreads can work in both virtio-scsi and virtio-blk devices, this commit aims to enable such feature in virtio-blk-pci devices. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-11 20:47:20 +02:00
Alex Lyn	980ecfdd96	runtime-rs: Add support for independent iothreads within virtio-blk 1. Determine the iothread for virtio-blk devices; only attach an iothread when: (1) enable_iothreads is true, (2) indep_iothreads > 0, and (3) the block driver is not virtio-scsi (i.e., it is virtio-blk). More complex cases will be handled by future enhancements. 2. Add the iothread parameter for virtio-blk devices if specified. If an iothread is set and passed, it must be set correctly for virtio-blk devices via the QMP device_add arguments. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-11 20:47:20 +02:00
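
The three conditions listed in this commit reduce to a single predicate. The function name and the string-typed driver parameter below are assumptions for illustration.

```rust
/// Decide whether a block device should get a dedicated iothread.
/// Mirrors the three conditions from the commit message.
pub fn should_attach_iothread(
    enable_iothreads: bool,
    indep_iothreads: u32,
    block_driver: &str,
) -> bool {
    enable_iothreads && indep_iothreads > 0 && block_driver != "virtio-scsi"
}
```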
Alex Lyn	36e626649d	runtime-rs: Add support for independent IO threads in qemu cmdline To make independent IO threads work for virtio-blk devices, add a new method that sets up independent IO threads for virtio-blk hotplug devices on the qemu command line. Note that ObjectIoThread already exists, so it can be reused directly in this case. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-11 20:47:20 +02:00
Alex Lyn	bdc57b16e5	runtime-rs: Add configurable indep_iothreads in configurations Setting indep_iothreads together with enable_iothreads is useful for high IO performance, so users need a way to set it. This commit introduces two configurable items: - Makefile: DEFINDEPIOTHREADS at build time. - configurations: indep_iothreads for users to set. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-06-11 20:47:20 +02:00
stevenhorsman	fb4600d66a	runtime-rs: Fix test breakage In #13147, for some reason a test block was added in the middle of code and the code was stale when merged, which meant that a second `mod test` section was added, breaking our tests. Merge the two to fix this. Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-06-11 19:03:33 +02:00
Fabiano Fidêncio	21657b9cd9	Merge pull request #13147 from manuelh-dev/mahuber/debug-go-rust runtime-rs: Honor enable_debug for logs and adjust debugging documentation	2026-06-11 08:57:36 +02:00
Hyounggyu Choi	7cc6767fa2	runtime*: use static_sandbox_resource_mgmt defaults for qemu-se Switch qemu-se config templates to use the TEE/CoCo-specific static_sandbox_resource_mgmt defaults instead of the generic QEMU defaults. qemu-se-runtime-rs config now uses DEFSTATICRESOURCEMGMT_COCO while runtime qemu-se config now uses DEFSTATICRESOURCEMGMT_TEE. This aligns static sandbox resource management behavior with confidential container expectations for qemu-se variants. Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>	2026-06-09 14:45:50 +02:00
Fabiano Fidêncio	4dc288401e	runtime-rs: make sandbox cgroup runtime attach idempotent The dragonball nerdctl CI job can race when creating and attaching the runtime process to the sandbox cgroup, surfacing an os error 17 (AlreadyExists) during shim task creation. Let's retry add_proc once on this pre-existing cgroup condition so startup remains robust. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Codex <codex@openai.com>	2026-06-08 13:11:34 +02:00
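
A retry-once policy of this kind might be sketched as follows. The wrapper name and the closure standing in for the cgroup add_proc call are hypothetical, and the real code may match on the raw error code instead.

```rust
use std::io;

/// Run `add_proc`, retrying exactly once if the first attempt fails with
/// AlreadyExists (os error 17 on Linux). Any other error is returned as-is.
pub fn add_proc_with_retry<F>(mut add_proc: F) -> io::Result<()>
where
    F: FnMut() -> io::Result<()>,
{
    match add_proc() {
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => add_proc(),
        other => other,
    }
}
```

Because there is only a single retry, a persistent AlreadyExists condition still surfaces as an error.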
Fabiano Fidêncio	4d569c22b4	runtime-rs: enforce a minimum vsock reconnect window Low-CPU sandboxes can take longer than a few seconds to complete guest boot and start the agent. Let's clamp the reconnect timeout to a safe minimum so sandbox startup does not fail early with transient vsock ECONNRESET. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Codex <codex@openai.com>	2026-06-08 13:11:34 +02:00
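
Clamping to a minimum is a one-liner in Rust. In the sketch below both the constant name and its 10-second value are placeholders; the commit message does not state the actual minimum.

```rust
use std::time::Duration;

/// Placeholder minimum; the real value is not given in the commit message.
const MIN_RECONNECT_TIMEOUT: Duration = Duration::from_secs(10);

/// Never let the configured reconnect window drop below the minimum.
pub fn effective_reconnect_timeout(configured: Duration) -> Duration {
    configured.max(MIN_RECONNECT_TIMEOUT)
}
```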
Fabiano Fidêncio	ed34d7811d	runtime-rs: supplement static sizing from sandbox annotations When static sandbox resource management is enabled, CRI CPU/memory sizing may live only in sandbox annotations and be missing from the OCI spec. Let's fill missing sizing fields from annotations before applying static VM sizing so runtime-rs follows the expected Kubernetes behavior for constrained pods. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Codex <codex@openai.com>	2026-06-08 13:11:34 +02:00
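
Filling only the missing fields can be expressed with `Option::or`. The `Sizing` struct and its field names are hypothetical; the real code reads these values from the OCI spec and from sandbox annotations.

```rust
/// Hypothetical container for CPU/memory sizing hints.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct Sizing {
    pub cpu_quota: Option<i64>,
    pub cpu_period: Option<u64>,
    pub memory_limit: Option<i64>,
}

/// Values from the OCI spec win; annotations only fill the gaps.
pub fn supplement_from_annotations(spec: Sizing, annotations: Sizing) -> Sizing {
    Sizing {
        cpu_quota: spec.cpu_quota.or(annotations.cpu_quota),
        cpu_period: spec.cpu_period.or(annotations.cpu_period),
        memory_limit: spec.memory_limit.or(annotations.memory_limit),
    }
}
```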
Fabiano Fidêncio	e93558e810	runtime-rs: default static sizing-related config flags to true Add top-level runtime-rs Makefile options `DEFSANDBOXCGROUP_ONLY` and `DEFSTATICRESOURCEMGMT`, both defaulting to true, and use them for the runtime defaults that previously disabled these paths. This aligns runtime-rs defaults with static sandbox resource management, which sizes sandbox memory up front instead of relying on memory hotplug, helping avoid architecture-specific hotplug limitations. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-06-08 12:57:40 +02:00
Fabiano Fidêncio	80e2473440	runtime-rs: shut down shim daemon on a failed create When CreateContainer fails before the runtime instance is registered (e.g. a hypervisor/cgroup error), no sandbox exists to drive the normal teardown. containerd's follow-up Shutdown RPC then reaches get_runtime_instance(), fails with "runtime not ready", and returns before the service loop is ever told to stop. Because the shim ignores SIGTERM, the containerd-shim-kata-v2 daemon is left running and orphaned. Make the Shutdown RPC force the daemon to exit when there is no runtime instance, emitting the same Action::Shutdown that sandbox.shutdown() sends on the normal path. This guarantees the shim process is reaped after a failed create instead of leaking. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <noreply@cursor.com>	2026-06-04 14:12:01 +02:00
Mikko Ylinen	2e625d0bab	runtime-rs: qemu: don't set maxcpus when confidential guest is enabled QEMU maxcpus enables CPU hotplug capabilities but it's unused when confidential guest is enabled. Change runtime-rs code to skip setting maxcpus QEMU cmdline if CPU hotplug is not needed. Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2026-06-03 15:27:35 +03:00
Fabiano Fidêncio	a2bb3f64b0	Merge pull request #12436 from mythi/tdx-updates-2026-3 runtime(-rs): tdx: use TDX QGS via unix-domain-socket by default	2026-06-03 08:50:26 +02:00
Fabiano Fidêncio	ecd9344dd1	Merge pull request #13144 from stevenhorsman/bump-rust-to-1.94 Bump rust to 1.94	2026-06-02 09:58:56 +02:00
Fabiano Fidêncio	230e01b04e	Merge pull request #13126 from kata-containers/topic/runtimes-introduce-azure-specific-configs runtime/runtime-rs: introduce Azure specific configs	2026-06-02 09:17:09 +02:00
Manuel Huber	57ee67a6aa	runtime-rs: Honor enable_debug for logs Make enable_debug promote the effective component log level from the default info level to debug for runtime, agent, and hypervisor logs. Keep an explicit log_level value authoritative so users can still choose trace, warn, or another level. Signed-off-by: Manuel Huber <manuelh@nvidia.com> Assisted-by: OpenAI Codex <codex@openai.com>	2026-06-01 21:29:08 +00:00
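
The precedence rule described here can be sketched as below. Modelling "explicit log_level" as an `Option<&str>` is an assumption; the real configuration type may differ.

```rust
/// Resolve the effective log level for a component.
/// An explicit `log_level` is authoritative; otherwise `enable_debug`
/// promotes the default "info" level to "debug".
pub fn effective_log_level(log_level: Option<&str>, enable_debug: bool) -> String {
    match log_level {
        Some(level) => level.to_string(),
        None if enable_debug => "debug".to_string(),
        None => "info".to_string(),
    }
}
```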
stevenhorsman	b1928cc22f	runtime-rs: run cargo fmt for Rust 1.94 Run cargo fmt on runtime-rs to ensure consistent formatting with Rust 1.94 toolchain. Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-06-01 17:32:06 +01:00
manuelh-dev	953b306ff3	Merge pull request #12979 from manuelh-dev/mahuber/erofs-tmpfs-mount runtime-rs/agent: support EROFS snapshots without a rwlayer	2026-05-29 13:50:27 -07:00
Fabiano Fidêncio	10e70a2a9f	runtime-rs: expose InfiniBand devices to VFIO containers The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a PCIDEVICE_* environment variable; it does not add the VFIO char device to linux.devices in the OCI spec. As a result the agent's container_has_vfio_device() gate stays closed and expose_guest_infiniband_devices() is never triggered — leaving /dev/infiniband absent from the container even though the guest kernel created the IB devices (mlx5_core.rdma.0 probes successfully). The cold_plug_bdfs map (host_bdf → guest_pci_path, built from network endpoints via host_bdf()) was already present inside handler_devices() but could never be consumed because the LinuxDeviceType::C loop has no entries to iterate over when linux.devices is empty. After that loop, iterate over any unmatched cold-plug BDFs, derive the VFIO group path via bdf_to_vfio_group_path() (reads /sys/bus/pci/devices/<bdf>/iommu_group), and push a vfio-pci-gk ContainerDevice. The vfio_group_to_bdf() short-circuit inside the loop handles the case where the device plugin does add VFIO char devices to linux.devices; it now supports both legacy (/dev/vfio/N) and iommufd (/dev/vfio/devices/vfioN) path formats. Add host_bdf() to the Endpoint trait (default: None) so that PhysicalEndpoint can expose its BDF for the cold_plug_bdfs map. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
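
Recognising the two VFIO path formats mentioned above might look like this. The enum and function are hypothetical and only classify the path string; they do not resolve a BDF, which the real vfio_group_to_bdf() does.

```rust
/// Which VFIO device-node format a path uses.
#[derive(Debug, PartialEq)]
pub enum VfioPath {
    /// Legacy group node: /dev/vfio/N
    LegacyGroup(u32),
    /// iommufd device node: /dev/vfio/devices/vfioN
    IommufdDevice(u32),
}

/// Classify a path, returning None for anything that is not a numbered node
/// (for example the /dev/vfio/vfio container node).
pub fn parse_vfio_path(path: &str) -> Option<VfioPath> {
    if let Some(rest) = path.strip_prefix("/dev/vfio/devices/vfio") {
        rest.parse().ok().map(VfioPath::IommufdDevice)
    } else if let Some(rest) = path.strip_prefix("/dev/vfio/") {
        rest.parse().ok().map(VfioPath::LegacyGroup)
    } else {
        None
    }
}
```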
Fabiano Fidêncio	60f2878c68	runtime-rs: call network.remove() during resource cleanup network.remove() — which detaches endpoints and rebinds VFs from vfio-pci back to the host driver — was never being called. ResourceManagerInner::cleanup() handled cgroups, bindmounts, share-fs, swap and ephemeral disks, but completely omitted the network teardown. Call network.remove() at the start of cleanup(), using the already-held self.hypervisor reference. Errors are logged as warnings rather than propagated, so they don't block the rest of the cleanup sequence. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	0b4b51dff6	runtime-rs: always detach endpoints on network removal network_with_netns::remove() bailed out early when network_created=false (i.e. the netns was created by the CNI, not by kata). This caused physical endpoint VFs to remain bound to vfio-pci after pod deletion, because PhysicalEndpoint::detach() — which calls bind_device_to_host() to rebind the VF from vfio-pci back to mlx5_core — was never reached. Separate endpoint detachment from netns deletion: always detach endpoints, but only remove the netns if kata created it. Detach errors are logged as warnings rather than propagated, to mirror the Go runtime's best-effort approach and avoid blocking sandbox teardown. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
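
The separation of endpoint detachment from netns deletion can be sketched as follows. The trait, the `RemoveOutcome` struct, and the fake endpoint are simplified, hypothetical stand-ins; the real remove() is asynchronous and actually deletes the netns.

```rust
pub trait Endpoint {
    fn detach(&mut self) -> Result<(), String>;
}

pub struct RemoveOutcome {
    /// Detach failures, collected as warnings instead of being propagated.
    pub warnings: Vec<String>,
    /// Whether the netns would be removed (only if kata created it).
    pub netns_removed: bool,
}

pub fn remove_network<E: Endpoint>(endpoints: &mut [E], network_created: bool) -> RemoveOutcome {
    let mut warnings = Vec::new();
    // Always detach every endpoint, best-effort.
    for (index, endpoint) in endpoints.iter_mut().enumerate() {
        if let Err(e) = endpoint.detach() {
            warnings.push(format!("failed to detach endpoint {index}: {e}"));
        }
    }
    // Only the netns deletion depends on who created the netns.
    RemoveOutcome { warnings, netns_removed: network_created }
}

/// Test double that records whether detach() was reached.
pub struct FakeEndpoint {
    pub detached: bool,
    pub fail: bool,
}

impl Endpoint for FakeEndpoint {
    fn detach(&mut self) -> Result<(), String> {
        self.detached = true;
        if self.fail { Err("rebind failed".to_string()) } else { Ok(()) }
    }
}
```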
Fabiano Fidêncio	be2ec02c9a	runtime-rs: resolve cold-plug VFIO guest PCI path via QMP The PCIe topology pre-computes a wrong path for cold-plugged physical- endpoint VFs because the root port has no explicit addr and QEMU auto- assigns its slot. The pre-computed PciPath { slots: [PciSlot(0)] } resolves to 0000:00:00.0 (the Q35 MCH), causing wait_for_pci_net_interface to time out looking for a netdev there. Add resolve_vfio_device_pci_path(hostdev_id) to the Hypervisor trait. Implement it in QemuInner using qmp.get_device_by_qdev_id(), which queries QEMU's query-pci to find the full guest PCIe path (e.g. "05/00" = slot 5 on pcie.0 / slot 0 on the root port bus). Store the QEMU device ID (hostdev_id) in PhysicalEndpoint during attach(). Add vfio_hostdev_id() and set_guest_pci_path() to the Endpoint trait and add an endpoints() accessor to the Network trait. In setup_after_start_vm(), call resolve_physical_endpoint_pci_paths() before apply_network_to_agent() to populate the correct path from QMP into each PhysicalEndpoint's guest_pci_path field. The field is then consumed by network_with_netns::interfaces() to fill Interface.device_path before update_interface is sent to the agent. This is the runtime-rs counterpart of the Go runtime's ResolveColdPlugVFIOGuestPciPaths / qomGetPciPath. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	f8ee9133e5	runtime-rs: populate device_path for cold-plug VFIO physical endpoints Without device_path the agent receives Interface.device_path="" in update_interface, falls back to a by-MAC link lookup, and fails for SR-IOV VFs whose firmware MAC differs from the CNI-assigned MAC after the vfio-pci unbind/rebind cycle. The guest PCI path is computed at attach() time by do_add_pcie_endpoint() inside VfioDevice::register() — no QMP query is needed. Cache it in PhysicalEndpoint.guest_pci_path (Mutex<Option<String>>) during attach() when do_handle_device() returns the DeviceType::Vfio with the path already filled in. Add a default-None guest_pci_path() method to the Endpoint trait; PhysicalEndpoint overrides it to return the cached path. In network_with_netns.rs::interfaces(), after building each Interface from network_info, fill device_path from endpoint.guest_pci_path() when the field would otherwise be empty. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	67843220f8	runtime-rs: set VF admin MAC before vfio-pci rebind for IB/RoCE support Without an admin MAC the guest mlx5_core inherits whatever firmware- default MAC the VF was created with. This MAC differs from the IB port HCA MAC, so mlx5_ib's GID cache refuses to populate /sys/class/infiniband/mlx5_/ports/N/gids/. RoCE appears active but every verb needing a GID fails. Before bind_device_to_vfio(), push the CNI-assigned MAC down to the VF as an "admin MAC" via the parent PF using RTM_SETLINK with IFLA_VFINFO_LIST — the netlink equivalent of ip link set <PF> vf <N> mac <MAC> The operation runs in a spawn_blocking closure that enters the host network namespace (via NetnsGuard("/proc/1/ns/net")), since attach() is called while the thread is inside the pod netns. Best-effort: failures are logged at warn and the existing agent-side MAC reconciliation (update_interface in rpc.rs) remains as a fallback for L2/L3 connectivity. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	9e9b50c79e	runtime-rs: cold-plug Vfio physical endpoints at VM launch DeviceType::Vfio (used by physical network VFs) was silently dropped in start_vm()'s cold-plug loop, falling through to the unsupported- device info log. The VF never appeared on the QEMU command line and therefore never became visible inside the guest. Add handling for DeviceType::Vfio in the start_vm() cold-plug loop. For each HostDevice in the VfioDevice, emit: -device vfio-pci,host=<bdf>,id=<hostdev_id>,bus=<root-port>, \ [x-pci-vendor-id=...,x-pci-device-id=...] The bus assignment and guest PCI path are already computed by do_add_pcie_endpoint() at VfioDevice::register() time (called from VfioDevice::attach() via the PCIe topology), so no additional QMP resolution is needed here. Add id= support to PCIeVfioDevice so the QEMU device name is stable and matchable in QMP queries. Add new_without_iommufd() constructor for the non-IOMMUFD (legacy VFIO container) path used by physical endpoints, and add_physical_vfio_device() to QemuCmdLine as a direct emission helper. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	f36c383b4f	runtime: generate dedicated CLH Azure config variants Create configuration-clh-azure{,-runtime-rs}.toml from the base CLH configs during build. This keeps Mariner-specific defaults in explicit config artifacts instead of ad-hoc runtime mutation. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-28 23:32:37 +02:00
Manuel Huber	ebf2c99df3	runtime-rs: allow EROFS rootfs without rwlayer Treat the containerd erofs snapshotter active snapshot as an EROFS lower plus overlay metadata, with an optional ext4 rwlayer when host rw backing is enabled. This also covers default_size=0, where containerd sends no rwlayer and the agent provides the writable upper inside the guest. Forward overlay mkdir hints on the EROFS storage so the guest agent sees them in both layouts, and add unit coverage for the dispatcher patterns. Assisted-by: OpenAI Codex <codex@openai.com> Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-05-27 17:12:20 +00:00
Mikko Ylinen	2b38d9f45e	runtime(-rs): tdx: use TDX QGS via unix-domain-socket by default TDX QGS takes raw TD report from QEMU/guest VM and signs it in an SGX enclave. Historically, QGS has supported two transports: vsock and unix-domain-socket. The former was necessary before the guest kernel supported the GetQuote "TDVMCALL" hypercall: DCAP library inside the guest used vsock to talk to QGS directly. However, with GetQuote, QEMU gets the TDREPORT and sends it to QGS. In process-to-process communication, unix-domain-socket is a better approach. This is also the only transport supported by libvirt by default. With that, align Kata default configuration to use unix-domain-socket as well. The change impacts the QEMU command line: old: "quote-generation-socket":{"type":"vsock","cid":"2","port":"4050"} new: "quote-generation-socket":{"type":"unix","path":"/var/run/tdx-qgs/qgs.socket"} Host QGS configuration must be changed to listen on a unix-domain socket. Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2026-05-26 17:08:56 +03:00
Mikko Ylinen	733c6791d3	runtime(-rs): make TDX QGS port=0 change backwards compatible Changing Kata runtime configurations to use TDX QGS port=0 (unix domain socket transport) means cluster admins must also reconfigure qgsd to the same and have /var/run/tdx-qgs/qgs.sock available. Since the early days of TDX attestation in Kata, the configuration has used vsock with cid=2, port=4050. To avoid unnecessary breakages when Kata default moves to unix domain socket, fall back to the old configuration if /var/run/tdx-qgs/qgs.sock is not available on the worker node. Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2026-05-26 17:01:52 +03:00
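
The fallback described in this commit could be sketched as follows. The enum and function names are hypothetical; the vsock values (cid=2, port=4050) come from the commit message, and the socket path is passed in by the caller.

```rust
use std::path::Path;

/// Transport used to reach the TDX Quote Generation Service.
#[derive(Debug, PartialEq)]
pub enum QgsTransport {
    Unix { path: String },
    Vsock { cid: u32, port: u32 },
}

/// Prefer the unix-domain socket, but fall back to the historical vsock
/// settings when the socket file is not present on the worker node.
pub fn select_qgs_transport(socket_path: &str) -> QgsTransport {
    if Path::new(socket_path).exists() {
        QgsTransport::Unix { path: socket_path.to_string() }
    } else {
        QgsTransport::Vsock { cid: 2, port: 4050 }
    }
}
```

Checking for the socket at VM start keeps existing vsock-based QGS deployments working until the host is reconfigured.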
Fabiano Fidêncio	3dc02a8604	Merge pull request #13085 from Apokleos/erofs-gpt-vmdk-only runtime-rs: Support erofs snapshotter with gpt vmdk mode	2026-05-25 16:29:59 +02:00
Zvonko Kaiser	aeadb1af35	Merge pull request #12948 from fidencio/topic/numa runtime (go): agent: Add NUMA support for QEMU	2026-05-25 15:33:14 +02:00
Alex Lyn	0bd150e5f1	runtime-rs: Integrate GPT+VMDK mode for multi-layer EROFS rootfs When multiple EROFS layers are present, wrap them into a single GPT-partitioned virtual disk delivered via one VMDK descriptor and a single block device hotplug, which significantly reduces PCI bus slot usage compared with the previous one-device-per-layer approach that exhausts virtio-blk slots for large layer counts. The host detects multi-layer mounts, computes the GPT layout, generates head metadata plus a VMDK descriptor referencing all EROFS images, and hot-plugs the composite disk. Per-partition Storage entries are created with X-kata.gpt-partitioned and X-kata.partition-number options so the guest agent can resolve each layer to its partition device. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00

1335 Commits