kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-07-01 22:50:54 +00:00

Author	SHA1	Message	Date
manuelh-dev	953b306ff3	Merge pull request #12979 from manuelh-dev/mahuber/erofs-tmpfs-mount runtime-rs/agent: support EROFS snapshots without a rwlayer	2026-05-29 13:50:27 -07:00
Aurélien Bombo	9acef4bc55	Merge pull request #13133 from microsoft/cameronbaird/upstream/revert-macvtap-simple Revert "runtime: Enforce >= 1 queue pairs for tapNetworkPair"	2026-05-29 14:57:07 -05:00
Cameron Baird	7a9d207ab2	Revert "runtime: Enforce >= 1 queue pairs for tapNetworkPair" This reverts commit `2799f7d36b`.	2026-05-29 17:05:40 +00:00
Fabiano Fidêncio	10e70a2a9f	runtime-rs: expose InfiniBand devices to VFIO containers The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a PCIDEVICE_* environment variable; it does not add the VFIO char device to linux.devices in the OCI spec. As a result the agent's container_has_vfio_device() gate stays closed and expose_guest_infiniband_devices() is never triggered — leaving /dev/infiniband absent from the container even though the guest kernel created the IB devices (mlx5_core.rdma.0 probes successfully). The cold_plug_bdfs map (host_bdf → guest_pci_path, built from network endpoints via host_bdf()) was already present inside handler_devices() but could never be consumed because the LinuxDeviceType::C loop has no entries to iterate over when linux.devices is empty. After that loop, iterate over any unmatched cold-plug BDFs, derive the VFIO group path via bdf_to_vfio_group_path() (reads /sys/bus/pci/devices/<bdf>/iommu_group), and push a vfio-pci-gk ContainerDevice. The vfio_group_to_bdf() short-circuit inside the loop handles the case where the device plugin does add VFIO char devices to linux.devices; it now supports both legacy (/dev/vfio/N) and iommufd (/dev/vfio/devices/vfioN) path formats. Add host_bdf() to the Endpoint trait (default: None) so that PhysicalEndpoint can expose its BDF for the cold_plug_bdfs map. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	60f2878c68	runtime-rs: call network.remove() during resource cleanup network.remove() — which detaches endpoints and rebinds VFs from vfio-pci back to the host driver — was never being called. ResourceManagerInner::cleanup() handled cgroups, bindmounts, share-fs, swap and ephemeral disks, but completely omitted the network teardown. Call network.remove() at the start of cleanup(), using the already-held self.hypervisor reference. Errors are logged as warnings rather than propagated, so they don't block the rest of the cleanup sequence. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	0b4b51dff6	runtime-rs: always detach endpoints on network removal network_with_netns::remove() bailed out early when network_created=false (i.e. the netns was created by the CNI, not by kata). This caused physical endpoint VFs to remain bound to vfio-pci after pod deletion, because PhysicalEndpoint::detach() — which calls bind_device_to_host() to rebind the VF from vfio-pci back to mlx5_core — was never reached. Separate endpoint detachment from netns deletion: always detach endpoints, but only remove the netns if kata created it. Detach errors are logged as warnings rather than propagated, to mirror the Go runtime's best-effort approach and avoid blocking sandbox teardown. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	be2ec02c9a	runtime-rs: resolve cold-plug VFIO guest PCI path via QMP The PCIe topology pre-computes a wrong path for cold-plugged physical- endpoint VFs because the root port has no explicit addr and QEMU auto- assigns its slot. The pre-computed PciPath { slots: [PciSlot(0)] } resolves to 0000:00:00.0 (the Q35 MCH), causing wait_for_pci_net_interface to time out looking for a netdev there. Add resolve_vfio_device_pci_path(hostdev_id) to the Hypervisor trait. Implement it in QemuInner using qmp.get_device_by_qdev_id(), which queries QEMU's query-pci to find the full guest PCIe path (e.g. "05/00" = slot 5 on pcie.0 / slot 0 on the root port bus). Store the QEMU device ID (hostdev_id) in PhysicalEndpoint during attach(). Add vfio_hostdev_id() and set_guest_pci_path() to the Endpoint trait and add an endpoints() accessor to the Network trait. In setup_after_start_vm(), call resolve_physical_endpoint_pci_paths() before apply_network_to_agent() to populate the correct path from QMP into each PhysicalEndpoint's guest_pci_path field. The field is then consumed by network_with_netns::interfaces() to fill Interface.device_path before update_interface is sent to the agent. This is the runtime-rs counterpart of the Go runtime's ResolveColdPlugVFIOGuestPciPaths / qomGetPciPath. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	f8ee9133e5	runtime-rs: populate device_path for cold-plug VFIO physical endpoints Without device_path the agent receives Interface.device_path="" in update_interface, falls back to a by-MAC link lookup, and fails for SR-IOV VFs whose firmware MAC differs from the CNI-assigned MAC after the vfio-pci unbind/rebind cycle. The guest PCI path is computed at attach() time by do_add_pcie_endpoint() inside VfioDevice::register() — no QMP query is needed. Cache it in PhysicalEndpoint.guest_pci_path (Mutex<Option<String>>) during attach() when do_handle_device() returns the DeviceType::Vfio with the path already filled in. Add a default-None guest_pci_path() method to the Endpoint trait; PhysicalEndpoint overrides it to return the cached path. In network_with_netns.rs::interfaces(), after building each Interface from network_info, fill device_path from endpoint.guest_pci_path() when the field would otherwise be empty. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	67843220f8	runtime-rs: set VF admin MAC before vfio-pci rebind for IB/RoCE support Without an admin MAC the guest mlx5_core inherits whatever firmware- default MAC the VF was created with. This MAC differs from the IB port HCA MAC, so mlx5_ib's GID cache refuses to populate /sys/class/infiniband/mlx5_/ports/N/gids/. RoCE appears active but every verb needing a GID fails. Before bind_device_to_vfio(), push the CNI-assigned MAC down to the VF as an "admin MAC" via the parent PF using RTM_SETLINK with IFLA_VFINFO_LIST — the netlink equivalent of ip link set <PF> vf <N> mac <MAC> The operation runs in a spawn_blocking closure that enters the host network namespace (via NetnsGuard("/proc/1/ns/net")), since attach() is called while the thread is inside the pod netns. Best-effort: failures are logged at warn and the existing agent-side MAC reconciliation (update_interface in rpc.rs) remains as a fallback for L2/L3 connectivity. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	9e9b50c79e	runtime-rs: cold-plug Vfio physical endpoints at VM launch DeviceType::Vfio (used by physical network VFs) was silently dropped in start_vm()'s cold-plug loop, falling through to the unsupported- device info log. The VF never appeared on the QEMU command line and therefore never became visible inside the guest. Add handling for DeviceType::Vfio in the start_vm() cold-plug loop. For each HostDevice in the VfioDevice, emit: -device vfio-pci,host=<bdf>,id=<hostdev_id>,bus=<root-port>, \ [x-pci-vendor-id=...,x-pci-device-id=...] The bus assignment and guest PCI path are already computed by do_add_pcie_endpoint() at VfioDevice::register() time (called from VfioDevice::attach() via the PCIe topology), so no additional QMP resolution is needed here. Add id= support to PCIeVfioDevice so the QEMU device name is stable and matchable in QMP queries. Add new_without_iommufd() constructor for the non-IOMMUFD (legacy VFIO container) path used by physical endpoints, and add_physical_vfio_device() to QemuCmdLine as a direct emission helper. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	91df041803	agent: expose guest InfiniBand devices to VFIO containers When a VF is cold-plugged in guest-kernel mode, mlx5_core binds to the PCI device inside the VM and mlx5_ib creates IB character devices under /dev/infiniband/ (uverbs, rdma_cm, umad). The container cannot reach these devices unless they are explicitly added to its OCI spec. Add expose_guest_infiniband_devices(), called from create_devices() when the container carries at least one VFIO device entry. The function: - Walks /dev/infiniband/ inside the guest VM. - Appends each char device to spec.linux.devices. - Inserts matching cgroup allow rules (rwm). - Is a no-op if /dev/infiniband/ is absent or empty (no IB driver, or VF not yet rebound), so non-RDMA pods are unaffected. Gate the call on container_has_vfio_device() so unrelated containers sharing the sandbox do not get IB device access widened. Add is_vfio_device_type() and snapshot_infiniband() to kata-sys-util/pcilibs. is_vfio_device_type() lets the agent check device type strings against the VFIO driver name constants without duplication. snapshot_infiniband() summarises /sys/class/infiniband, /sys/class/infiniband_verbs, and /dev/infiniband as a single diagnostic string for log context; it lives in pcilibs because it has no agent-specific dependencies (pure sysfs/devfs reads). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	025202a52a	runtime: expose InfiniBand devices to VFIO containers The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a PCIDEVICE_* environment variable; it does not add the VFIO char device to linux.devices in the OCI spec. As a result the agent's container_has_vfio_device() gate stays closed and expose_guest_infiniband_devices() is never triggered — leaving /dev/infiniband absent from the container even though the guest kernel created the IB devices (mlx5_core.rdma.0 probes successfully). Add appendPhysicalEndpointDevices() which runs after appendDevices() in createContainer(). It walks the sandbox network endpoints; for each PhysicalEndpoint with a resolved guest PCI path it derives the VFIO group char path from sysfs (iommu_group symlink) and synthesises a vfio-pci-gk Device entry. Both legacy group paths (/dev/vfio/N) and iommufd cdev paths (/dev/vfio/devices/vfioN) are supported by reading the iommu_group sysfs symlink. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	fa9a9f3aeb	runtime: set VF admin MAC before vfio-pci rebind for IB/RoCE support Without an admin MAC, the guest's mlx5_core inherits the VF's firmware-default MAC. This MAC differs from the IB port's HCA MAC, so mlx5_ib's GID cache refuses to populate /sys/class/infiniband/mlx5_/ports/N/gids/. RoCE then appears active (port = ACTIVE, link_layer = Ethernet) but every verb that needs a GID — RoCEv2 packets, address handles, librdmacm bind — fails silently. Push the CNI-assigned MAC down to the VF as an "admin MAC" via the PF using RTM_SETLINK before the bind-to-vfio-pci step. The firmware applies the admin MAC during the VF reset that accompanies the unbind/rebind cycle, so the guest sees a single consistent MAC across netdev, IB port, and HCA. Best-effort: failures are logged at warn and the existing agent-side MAC reconciliation (rpc.rs::update_interface) remains as a fallback for L2/L3 connectivity. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-28 21:54:52 +02:00
Fabiano Fidêncio	992a723392	runtime: resolve cold-plug VFIO guest PCI path via QMP For QEMU cold-plug + guest-kernel mode the guest BDF of a cold-plugged VFIO device is auto-allocated at boot (each pcie-root-port is added with chassis=N,slot=N but no pinned addr=, so QEMU picks the next free slot on pcie.0). The hot-plug path already queries QMP via qomGetPciPath; reuse that same mechanism for cold-plugged devices. Add ResolveColdPlugVFIOGuestPciPaths to the Hypervisor interface. Implement it in qemu.go using qomGetPciPath. Add no-op stubs for all other hypervisors. Call it at the start of setupNetworks so that the PCI paths are resolved before generateVCNetworkStructures emits the agent Interface proto. Also stamp the resolved path onto PhysicalEndpoints (used by SR-IOV VFs exposed as physical network devices) so that update_interface carries a non-empty devicePath. Without devicePath the agent falls back to a by-MAC link lookup which fails when the VF firmware MAC differs from the CNI-assigned MAC after the vfio-pci unbind/rebind cycle. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-28 21:54:52 +02:00
Fabiano Fidêncio	23c5250933	runtime/qemu: emit id= for VFIODevice on -device cmdline Without an explicit id= on the vfio-pci device, QEMU auto-generates an internal name that does not match vfioDev.ID, so any subsequent qomGetPciPath(vfioDev.ID) call via QMP fails with "Device 'X' not found". This breaks resolveColdPlugVFIOGuestPciPaths which needs the device ID to look up the guest PCI path, leaving GuestPciPath nil and causing update_interface to fail repeatedly as the agent can't find the interface to configure. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-28 21:54:52 +02:00
Fabiano Fidêncio	e6777f0866	runtime: keep cold-plug VFIO devices in guest-kernel mode Container.createDevices was dropping cold-plug VFIO entries from the container's deviceInfos whenever vfio_mode = "guest-kernel", which in turn meant the agent's CreateContainer request carried no vfio-pci-gk device entry and sandbox.pcimap[cid] stayed empty. The SR-IOV device plugin still set PCIDEVICE_<RES>=<host-BDF> on the workload container, so update_env_pci then aborted with "No PCI mapping found for container <id>" and the container failed with CrashLoopBackOff. Include cold-plug VFIO devices in deviceInfos for both VFIO modes. The existing vfio-pci-gk agent handler returns dev: None (so /dev/vfio/<group> is not materialised in the container spec, and constrainGRPCSpec(stripVfio=true) already strips it from the grpc spec for guest-kernel mode), while still recording the host->guest PCI mapping into sandbox.pcimap[cid] so env-var translation works. devManager.NewDevice calls FindDevice first, which matches the already cold-plugged sandbox-level device by HostPath / major / minor, so this does not double-attach. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-28 21:54:52 +02:00
Fabiano Fidêncio	9893b6dc03	runtime: correctly resolve cold-plug VFIO guest PCI paths Populate missing VFIO guest PCI paths via QMP before serializing container devices so guest-kernel PCI env translation has the mappings it needs. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-28 21:54:52 +02:00
Fabiano Fidêncio	118b7fa611	agent: reconcile VFIO netdev MAC before UpdateInterface lookup When a VFIO cold-plugged network device appears in guest with a different MAC than the runtime request, resolve the netdev by PCI path and apply the requested MAC before the normal by-MAC update flow. This preserves existing behavior while avoiding UpdateInterface mismatches in SR-IOV cold-plug cases. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-28 21:54:52 +02:00
Fabiano Fidêncio	e89eb77245	agent: keep PCIDEVICE env unchanged when pcimap is missing Avoid failing container creation when per-container PCI mappings are unavailable by preserving PCIDEVICE entries unchanged and warning instead. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-28 21:54:52 +02:00
Cameron Baird	2799f7d36b	runtime: Enforce >= 1 queue pairs for tapNetworkPair In the xConnectVMNetwork path, we have queues = 0 as a baseline, set to h.HypervisorConfig().NumVCPUs() iff h.Capabilities() advertise MultiQueueSupport. This is certainly incorrect as we always want, as a baseline, at least one queue pair. Make queues := 1 by default to ensure the NetworkPair has at least one queue pair for all virtio-net paths. Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>	2026-05-27 18:55:11 +00:00
Manuel Huber	ebf2c99df3	runtime-rs: allow EROFS rootfs without rwlayer Treat the containerd erofs snapshotter active snapshot as an EROFS lower plus overlay metadata, with an optional ext4 rwlayer when host rw backing is enabled. This also covers default_size=0, where containerd sends no rwlayer and the agent provides the writable upper inside the guest. Forward overlay mkdir hints on the EROFS storage so the guest agent sees them in both layouts, and add unit coverage for the dispatcher patterns. Assisted-by: OpenAI Codex <codex@openai.com> Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-05-27 17:12:20 +00:00
Manuel Huber	4fbfba2f79	agent: support run-backed EROFS upper Support multi-layer EROFS storage without an explicit ext4 upper layer. When runtime-rs sends only EROFS lower storage and overlay metadata, create the overlay upper/work directories under the container bundle in /run/kata-containers. Keep the explicit ext4 rwlayer path for disk-backed snapshots, and only track real temporary mount points for cleanup. The implicit /run-backed upper is bundle-scoped state and is removed with the container bundle. Assisted-by: OpenAI Codex <codex@openai.com> Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-05-27 17:12:20 +00:00
Fabiano Fidêncio	5adfb27297	Merge pull request #13118 from PiotrProkop/fix-missing-cwd agent: restore process CWD auto-creation	2026-05-27 13:32:05 +02:00
PiotrProkop	60a2e27f02	agent: Restore process CWD auto-creation Commit `b56313472` ("agent: Align agent OCI spec with oci-spec-rs", PR #9944) inverted the condition guarding the create_dir_all call for process.cwd: the leading `!` was dropped during the refactor. As a result, the CWD is created only when process.cwd is the empty string. When the guest then runs chdir(process.cwd) and CWD doesn't exist it returns ENOENT. The agent propagates that to the shim, which surfaces it to containerd as "failed to create shim task: ENOENT: No such file or directory" — indistinguishable from a missing argv[0]. This regressed the original fix in PR #2375 (Fixes #2374), which deliberately mirrored runc's behavior. Put the `!` back. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: PiotrProkop <pprokop@nvidia.com>	2026-05-27 09:59:15 +02:00
Manuel Huber	e838cd7d8d	agent: compact EROFS overlay lowerdirs Use kata_types::mount::Mount for the final multi-layer EROFS overlay mount instead of calling baremount() directly. The mount helper detects overlay option strings close to the kernel mount data limit. When lowerdir entries share a common parent, it changes into that directory and rewrites lowerdir to relative paths. That avoids repeating the same long prefix for every layer. Multi-layer EROFS images can have many lower layers under /run/kata-containers/<cid>/multi-layer. Passing the raw absolute lowerdir list can exceed the mount option buffer and fail the final overlay mount, even after all layer devices mounted successfully. Reuse the helper so this path follows Kata's normal overlay mount handling, including lowerdir compaction before mount(2). Assisted-by: OpenAI Codex <codex@openai.com> Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-05-26 18:42:11 +00:00
Dan Mihai	c81dadaba1	Merge pull request #13064 from burgerdev/add-arp-neighbour agent: use rtnetlink to add ARP neighbour	2026-05-26 09:59:44 -07:00
Fabiano Fidêncio	3dc02a8604	Merge pull request #13085 from Apokleos/erofs-gpt-vmdk-only runtime-rs: Support erofs snapshotter with gpt vmdk mode	2026-05-25 16:29:59 +02:00
Zvonko Kaiser	aeadb1af35	Merge pull request #12948 from fidencio/topic/numa runtime (go): agent: Add NUMA support for QEMU	2026-05-25 15:33:14 +02:00
Alex Lyn	2036e66bc3	kata-agent: Integrate GPT partition support into multi-layer handler In GPT mode, all partitions share the same base block device, so resolving it once per uevent source and caching the result avoids redundant hotplug waits that would otherwise scale linearly with layer count. Layers are sorted by partition number before mounting to guarantee correct overlay lowerdir precedence regardless of the order in which the host emits Storage entries. Also remove the dead_code attributes from code that is now in use. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	17fadde6d8	kata-agent: Add GPT partition utility functions The guest agent needs to resolve individual partition devices from a single GPT-partitioned block device, but the kernel does not always create partition nodes immediately after the base device appears, especially when another fd holds the device open during hot-plug. Add utility functions that handle two problems: (1) mapping a base device path to its partition path following the kernel naming convention (bare suffix vs 'p' separator), and (2) ensuring the partition node exists before mount. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	8119a561ae	kata-agent: Refactor wait_and_mount_layer to return LayerMountInfo This commit makes no functional change: all callers pass None, so every call still resolves the device via uevent exactly as before. It only prepares the multi-layer EROFS handler for GPT partition and dm-verity support by widening the wait_and_mount_layer() interface without changing behavior. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	0bd150e5f1	runtime-rs: Integrate GPT+VMDK mode for multi-layer EROFS rootfs When multiple EROFS layers are present, wrap them into a single GPT-partitioned virtual disk delivered via one VMDK descriptor and a single block device hotplug. This significantly reduces PCI bus slot usage compared with the previous one-device-per-layer approach, which exhausts virtio-blk slots for large layer counts. The host detects multi-layer mounts, computes the GPT layout, generates head metadata plus a VMDK descriptor referencing all EROFS images, and hot-plugs the composite disk. Per-partition Storage entries are created with X-kata.gpt-partitioned and X-kata.partition-number options so the guest agent can resolve each layer to its partition device. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	c3b06af4c7	kata-types: Add gpt_disk module for GPT metadata generation Introduce gpt_disk.rs to compute GPT partition layouts and generate metadata files for multi-layer EROFS rootfs. The module creates GPT head metadata that are combined with EROFS layer images via VMDK descriptors, presenting a single GPT-partitioned virtual disk to the guest VM — each EROFS layer mapped to its own partition. The layout engine calculates LBA positions for an arbitrary number of EROFS layers, then writes a full protective-MBR + GPT image and extracts the head (MBR + primary GPT table) segments as standalone files for VMDK extent assembly. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	148810312d	runtime-rs: Refactor VMDK writer and erofs rootfs handling logic Restructure the erofs rootfs handler to support multi-layer GPT+VMDK mode where multiple EROFS layers are wrapped into a single virtual disk with a GPT partition table. Extract VmdkDescriptorWriter as a reusable struct for atomic VMDK descriptor generation. Change erofs_storage from Option<Storage> to Vec<Storage> to hold per-layer metadata, and add GPT metadata path tracking for proper cleanup with path-traversal guards. Bump MAX_VIRTIO_BLK_DEVICES from 10 to 127 to accommodate GPT disks carrying many partitions. Pre-extract mkdir directives from overlay mounts before the main loop to avoid redundant option parsing. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	7086caaddf	kata-agent: Remove unused mode field from MkdirDirective Code previously marked with the dead_code attribute was never actually used, so remove it entirely: drop the mode field from the MkdirDirective structure along with its related test cases. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	39c512bc36	kata-agent: Enhance virtio block matcher to reject partition uevents Enhance VirtioBlkPciMatcher to only match whole-disk uevents. This prevents the matcher from incorrectly matching partition uevents (e.g., /dev/vdaX) which is critical for partitioned disks where partition uevents appear alongside whole-disk uevents. This commit aims to eliminate such bad cases. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	56f05aa534	kata-agent: Enhance SCSI block device matcher to reject partition uevents Refactor ScsiBlockMatcher to only match whole-disk uevents. This prevents the matcher from incorrectly matching partition uevents (e.g., block/sdd/sdd9) which is critical for partitioned disks where partition uevents appear alongside whole-disk uevents. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Fabiano Fidêncio	7ddea26137	Merge pull request #13086 from fvichot/flo-kata-monitor-fix kata-monitor: use full URI for connecting to containerd	2026-05-25 10:16:11 +02:00
Fabiano Fidêncio	8787da13a9	agent: Add NUMA-aware PCI path parsing Extend pcipath_from_dev_tree_path() to support the full NUMA-aware path format "root_complex/bus/device" (e.g. "10/00/02") in addition to the legacy "bus/device" format, defaulting to root complex "00" for backward compatibility. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	1cbe930fc9	runtime: Add pxb-pcie NUMA-aware PCIe topology for VFIO devices When NUMA placement is active and VFIO devices are cold-plugged, create a pxb-pcie (PCIe Expander Bridge) per NUMA node that has devices. Each pxb-pcie carries a numa_node property that gives the guest kernel correct NUMA affinity for all PCI devices beneath it. Root ports are created on each pxb-pcie bus instead of pcie.0, and VFIODevice.Attach() assigns each device to the root port on its host NUMA node's pxb bridge. Non-VFIO devices remain on pcie.0. NUMA placement is "active" when there is more than one guest NUMA node OR a single guest node mapped to a specific host node (the latter happens when maybeRightSizeAutoNUMA() collapses a multi-node sandbox to the GPU's host NUMA node). In both cases buildNUMATopology() also emits the matching memory-backend-ram,host-nodes=,policy=bind entries so guest memory is sourced from the right host node. So pxb-pcie can never capture a leaf virtio-pci device as the default bus, every virtio-pci device emitter (NetDevice, VSOCK, vhost-user-{net,scsi,blk,fs}) now appends bus=pcie.0 explicitly when the machine actually exposes a pcie.0 root. Detection is done via a new hasPCIeRoot() helper that returns true only for q35/virt machine types — ppc64le's pseries (pci.0), s390x's s390-ccw-virtio (CCW transport) and microvm (no PCI) intentionally skip the pin to avoid "Bus 'pcie.0' not found" at startup. This is the only QEMU mechanism that works for both regular and confidential (TDX/SNP) guests, as it operates through the PCI bus hierarchy rather than ACPI table injection. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	15292da217	config: Enable NUMA by default for nvidia-gpu configurations Enable enable_numa=true in the three nvidia-gpu QEMU configuration templates (base, SNP, TDX). On single-NUMA hosts this is a no-op since buildNUMATopology() returns nil when there is only one node. On multi-NUMA hosts it ensures GPU memory accesses are NUMA-local. Add documentation to all QEMU config templates explaining the VFIO device NUMA placement validation that occurs when NUMA is enabled. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	feeb5d8ecc	runtime-rs: Fix vCPU pinning race with backoff retry QEMU can report fewer vCPU threads during early startup, causing partial affinity setup. Let's retry with exponential backoff until the expected thread count is visible, then continue with best-effort pinning if the window is exhausted. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	f53f427859	runtime: Fix vCPU pinning race for Go runtime QEMU may not have spawned all vCPU threads when pinning starts, so query_cpus_fast can return an incomplete list and leave some vCPUs unpinned. To fix it, let's add exponential backoff retries before pinning and fall back to available threads if retries are exhausted. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	b688619314	runtime: oci: Fix sandbox CPU sizing with cpuManagerPolicy=static When cpuManagerPolicy=static is configured, kubelet sets the sandbox CPU quota to -1 (unconstrained) because it uses cpuset pinning instead of CFS quota. This causes CalculateSandboxSizing to compute 0 workload CPUs, resulting in the VM starting with only default_vcpus. Fall back to deriving the CPU count from sandbox CPU shares (1024 shares per CPU) when the quota-based calculation yields 0. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	12e5985dbd	runtime: Add NUMA-aware vCPU pinning and cpuset.mems forwarding Make checkVCPUsPinning() NUMA-aware: when GuestNUMANodes are configured, vCPU threads are pinned to host CPUs belonging to the same NUMA node as the vCPU's guest NUMA node assignment via checkVCPUsPinningNUMA(), preserving memory locality. vCPUs are distributed proportionally across NUMA nodes, matching the distribution in buildNUMATopology(). Stop unconditionally stripping cpuset.mems in constrainGRPCSpec() and container update(). When multi-NUMA is configured, translate host NUMA node IDs to guest NUMA node IDs using translateHostMemsToGuest() before forwarding to the agent. This allows the agent to enforce NUMA-aware memory placement for containers. Filter guest NUMA nodes at VM creation time: before calling CreateVM(), prune GuestNUMANodes to only those whose HostCPUs intersect the sandbox cpuset. This avoids exposing fake NUMA topology to the guest when Kubernetes allocates CPUs from fewer nodes than the host has (e.g. all CPUs from node 0 on a 2-node host), improving memory locality and avoiding unnecessary cross-node memory traffic. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	d0d7deb262	runtime: Add host NUMA distance discovery and build guest NUMA topology Add sysfs-based host NUMA distance reading (GetHostNUMADistances) that parses /sys/devices/system/node/nodeN/distance to mirror the host NUMA distance matrix into the guest via -numa dist entries. Implement buildNUMATopology() which translates the GuestNUMANodes configuration into govmm NUMANode and NUMADist slices. Each guest NUMA node gets a floor-divided share of vCPUs and memory, with the last node absorbing any remainder. This handles the common Kata case of +1 VMM overhead vCPU gracefully. Memory backends are selected based on hugepages/virtio-fs/file-backed-mem configuration. Guard multi-NUMA topology generation to amd64 and arm64 only, since other architectures (s390x, riscv64) do not support QEMU NUMA/DIMM. Wire buildNUMATopology() into CreateVM so the QEMU config includes NUMA nodes and distances. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	447e2a3faf	runtime: Add VFIO device NUMA node detection and placement validation Add PCISysFsDevicesNUMANode property and GetPCIDeviceNUMANode() helper to read /sys/bus/pci/devices/<BDF>/numa_node when discovering VFIO devices. Store the result in the new NUMANode field on VFIODev (-1 for unknown/no affinity). Wire NUMA node detection into both GetAllVFIODevicesFromIOMMUGroup() (legacy VFIO path) and GetDeviceFromVFIODev() (IOMMUFD path) so every discovered VFIO device carries its host NUMA node. Add validateVFIODeviceNUMAPlacement() which runs at the end of buildNUMATopology(). It checks every cold-plugged VFIO device's host NUMA node against the guest NUMA topology and logs a warning if a device is on a host NUMA node not covered by any guest NUMA node (indicating potential cross-NUMA memory access overhead), or an info message confirming correct placement. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	1ee8bb5740	runtime: Add NUMA-aware SMP topology Make cpuTopology() NUMA-aware by accepting a numNUMANodes parameter. When multiple NUMA nodes are configured, restructure the SMP topology so that Sockets=numNUMA and Cores=ceil(maxvcpus/numNUMA), grouping vCPUs by socket per NUMA node. Use ceiling division so that uneven vCPU counts (e.g. the +1 VMM overhead vCPU that Kata adds) produce a QEMU-valid SMP topology where MaxCPUs == Sockets * Cores * Threads. When numNUMANodes <= 1, the existing flat topology (Sockets=maxvcpus, Cores=1) is preserved. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	1e9da61d48	govmm: Add multi-NUMA memory backend and distance matrix support Introduce NUMANode and NUMADist types, add NUMANodes/NUMADists fields to Config, and implement appendMultiNUMAMemoryKnobs() to generate per-node memory-backend objects with host-nodes/policy=bind, -numa node entries with cpus= ranges, and -numa dist entries for the distance matrix. Gate the multi-NUMA path in appendMemoryKnobs() behind isDimmSupported() to ensure architectures without DIMM support (s390x, riscv64) fall back to the single-node path. Drop 386 from isDimmSupported since 32-bit x86 is not a supported Kata target. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	ed4d0fb51f	runtime-rs: qemu: pass `-bios` for non-confidential guests The `boot_info.firmware` field from the hypervisor configuration is loaded by kata-types and surfaces in the TOML as `firmware = "..."`, but the qemu cmdline generator never consumed it for non-CC guests. Today, `-bios <path>` is only appended via the `Bios` device pushed by `add_{sev,sev_snp,tdx}_protection_device()` in `QemuInner::start_vm()`, which use the firmware copied into the `ProtectionDeviceConfig`. That path is taken only when `confidential_guest = true` and a SEV/SEV-SNP/TDX protection device is configured. For plain Q35 profiles (notably the nvidia-gpu one, which needs OVMF to boot the GPU passthrough VM), the `firmware` set in the TOML was silently dropped and qemu fell back to its default BIOS. Wire `boot_info.firmware` directly in `QemuCmdLine::new()` when no protection device path is going to emit `-bios` (i.e. for non-CC guests). CC paths are left untouched so we don't end up with a duplicated `-bios` argument. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 15:05:26 +02:00
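
The arithmetic described in commits 1ee8bb5740 (NUMA-aware SMP topology) and d0d7deb262 (floor-divided per-node shares, last node absorbing the remainder) can be sketched roughly as follows. This is an illustrative Python sketch, not the Go code from the repository: the function names `cpu_topology` and `split_floor`, the dictionary layout, and the default of one thread per core are assumptions made for the example.

```python
def cpu_topology(max_vcpus, num_numa_nodes, threads=1):
    """Return a sockets/cores/threads layout as described in 1ee8bb5740.

    threads=1 is an assumption for this sketch; the commit message does
    not state the thread count.
    """
    if num_numa_nodes <= 1:
        # Flat topology: one socket per vCPU, one core per socket.
        return {"sockets": max_vcpus, "cores": 1, "threads": threads}
    # Ceiling division, so an uneven vCPU count (e.g. the +1 VMM overhead
    # vCPU) still fits into sockets * cores * threads.
    cores = -(-max_vcpus // num_numa_nodes)
    return {"sockets": num_numa_nodes, "cores": cores, "threads": threads}


def split_floor(total, num_nodes):
    """Give each node a floor-divided share; the last node takes the remainder."""
    base = total // num_nodes
    shares = [base] * num_nodes
    shares[-1] += total - base * num_nodes
    return shares


if __name__ == "__main__":
    topo = cpu_topology(9, 2)
    print(topo)
    print(topo["sockets"] * topo["cores"] * topo["threads"])
    print(split_floor(9, 2))
```

With 9 vCPUs across 2 guest NUMA nodes, the sketch yields 2 sockets of 5 cores each (capacity 10), while the per-node split assigns 4 vCPUs to the first node and 5 to the last.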

6495 Commits