kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-07-02 07:02:16 +00:00

Author	SHA1	Message	Date
Mikko Ylinen	e475d870fb	runtime: qemu: don't set maxcpus when confidential guest is enabled QEMU maxcpus enables CPU hotplug capabilities but it's unused when confidential guest is enabled. Change Go runtime code to skip setting maxcpus QEMU cmdline if CPU hotplug is not needed. Commit `07db945b09` built a relationship between kernel's cmdline nr_cpus and the maxcpus config. Now that maxcpus is dropped for confidential guests, drop nr_cpus from kernel commandline too. This hopefully helps with the reference values computation too. Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2026-06-03 15:27:35 +03:00
Aurélien Bombo	9acef4bc55	Merge pull request #13133 from microsoft/cameronbaird/upstream/revert-macvtap-simple Revert "runtime: Enforce >= 1 queue pairs for tapNetworkPair"	2026-05-29 14:57:07 -05:00
Cameron Baird	7a9d207ab2	Revert "runtime: Enforce >= 1 queue pairs for tapNetworkPair" This reverts commit `2799f7d36b`.	2026-05-29 17:05:40 +00:00
Fabiano Fidêncio	025202a52a	runtime: expose InfiniBand devices to VFIO containers The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a PCIDEVICE_* environment variable; it does not add the VFIO char device to linux.devices in the OCI spec. As a result the agent's container_has_vfio_device() gate stays closed and expose_guest_infiniband_devices() is never triggered — leaving /dev/infiniband absent from the container even though the guest kernel created the IB devices (mlx5_core.rdma.0 probes successfully). Add appendPhysicalEndpointDevices() which runs after appendDevices() in createContainer(). It walks the sandbox network endpoints; for each PhysicalEndpoint with a resolved guest PCI path it derives the VFIO group char path from sysfs (iommu_group symlink) and synthesises a vfio-pci-gk Device entry. Both legacy group paths (/dev/vfio/N) and iommufd cdev paths (/dev/vfio/devices/vfioN) are supported by reading the iommu_group sysfs symlink. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	fa9a9f3aeb	runtime: set VF admin MAC before vfio-pci rebind for IB/RoCE support Without an admin MAC, the guest's mlx5_core inherits the VF's firmware-default MAC. This MAC differs from the IB port's HCA MAC, so mlx5_ib's GID cache refuses to populate /sys/class/infiniband/mlx5_/ports/N/gids/. RoCE then appears active (port = ACTIVE, link_layer = Ethernet) but every verb that needs a GID — RoCEv2 packets, address handles, librdmacm bind — fails silently. Push the CNI-assigned MAC down to the VF as an "admin MAC" via the PF using RTM_SETLINK before the bind-to-vfio-pci step. The firmware applies the admin MAC during the VF reset that accompanies the unbind/rebind cycle, so the guest sees a single consistent MAC across netdev, IB port, and HCA. Best-effort: failures are logged at warn and the existing agent-side MAC reconciliation (rpc.rs::update_interface) remains as a fallback for L2/L3 connectivity. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-28 21:54:52 +02:00
Fabiano Fidêncio	992a723392	runtime: resolve cold-plug VFIO guest PCI path via QMP For QEMU cold-plug + guest-kernel mode the guest BDF of a cold-plugged VFIO device is auto-allocated at boot (each pcie-root-port is added with chassis=N,slot=N but no pinned addr=, so QEMU picks the next free slot on pcie.0). The hot-plug path already queries QMP via qomGetPciPath; reuse that same mechanism for cold-plugged devices. Add ResolveColdPlugVFIOGuestPciPaths to the Hypervisor interface. Implement it in qemu.go using qomGetPciPath. Add no-op stubs for all other hypervisors. Call it at the start of setupNetworks so that the PCI paths are resolved before generateVCNetworkStructures emits the agent Interface proto. Also stamp the resolved path onto PhysicalEndpoints (used by SR-IOV VFs exposed as physical network devices) so that update_interface carries a non-empty devicePath. Without devicePath the agent falls back to a by-MAC link lookup which fails when the VF firmware MAC differs from the CNI-assigned MAC after the vfio-pci unbind/rebind cycle. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-28 21:54:52 +02:00
Fabiano Fidêncio	e6777f0866	runtime: keep cold-plug VFIO devices in guest-kernel mode Container.createDevices was dropping cold-plug VFIO entries from the container's deviceInfos whenever vfio_mode = "guest-kernel", which in turn meant the agent's CreateContainer request carried no vfio-pci-gk device entry and sandbox.pcimap[cid] stayed empty. The SR-IOV device plugin still set PCIDEVICE_<RES>=<host-BDF> on the workload container, so update_env_pci then aborted with "No PCI mapping found for container <id>" and the container failed with CrashLoopBackOff. Include cold-plug VFIO devices in deviceInfos for both VFIO modes. The existing vfio-pci-gk agent handler returns dev: None (so /dev/vfio/<group> is not materialised in the container spec, and constrainGRPCSpec(stripVfio=true) already strips it from the grpc spec for guest-kernel mode), while still recording the host->guest PCI mapping into sandbox.pcimap[cid] so env-var translation works. devManager.NewDevice calls FindDevice first, which matches the already cold-plugged sandbox-level device by HostPath / major / minor, so this does not double-attach. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-28 21:54:52 +02:00
Fabiano Fidêncio	9893b6dc03	runtime: correctly resolve cold-plug VFIO guest PCI paths Populate missing VFIO guest PCI paths via QMP before serializing container devices so guest-kernel PCI env translation has the mappings it needs. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-28 21:54:52 +02:00
Cameron Baird	2799f7d36b	runtime: Enforce >= 1 queue pairs for tapNetworkPair In the xConnectVMNetwork path, we have queues = 0 as a baseline, set to h.HypervisorConfig().NumVCPUs() iff h.Capabilities() advertise MultiQueueSupport. This is certainly incorrect as we always want, as a baseline, at least one queue pair. Make queues := 1 by default to ensure the NetworkPair has at least one queue pair for all virtio-net paths. Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>	2026-05-27 18:55:11 +00:00
Fabiano Fidêncio	1cbe930fc9	runtime: Add pxb-pcie NUMA-aware PCIe topology for VFIO devices When NUMA placement is active and VFIO devices are cold-plugged, create a pxb-pcie (PCIe Expander Bridge) per NUMA node that has devices. Each pxb-pcie carries a numa_node property that gives the guest kernel correct NUMA affinity for all PCI devices beneath it. Root ports are created on each pxb-pcie bus instead of pcie.0, and VFIODevice.Attach() assigns each device to the root port on its host NUMA node's pxb bridge. Non-VFIO devices remain on pcie.0. NUMA placement is "active" when there is more than one guest NUMA node OR a single guest node mapped to a specific host node (the latter happens when maybeRightSizeAutoNUMA() collapses a multi-node sandbox to the GPU's host NUMA node). In both cases buildNUMATopology() also emits the matching memory-backend-ram,host-nodes=,policy=bind entries so guest memory is sourced from the right host node. So pxb-pcie can never capture a leaf virtio-pci device as the default bus, every virtio-pci device emitter (NetDevice, VSOCK, vhost-user-{net,scsi,blk,fs}) now appends bus=pcie.0 explicitly when the machine actually exposes a pcie.0 root. Detection is done via a new hasPCIeRoot() helper that returns true only for q35/virt machine types — ppc64le's pseries (pci.0), s390x's s390-ccw-virtio (CCW transport) and microvm (no PCI) intentionally skip the pin to avoid "Bus 'pcie.0' not found" at startup. This is the only QEMU mechanism that works for both regular and confidential (TDX/SNP) guests, as it operates through the PCI bus hierarchy rather than ACPI table injection. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	f53f427859	runtime: Fix vCPU pinning race for Go runtime QEMU may not have spawned all vCPU threads when pinning starts, so query_cpus_fast can return an incomplete list and leave some vCPUs unpinned. To fix it, let's add exponential backoff retries before pinning and fall back to available threads if retries are exhausted. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	12e5985dbd	runtime: Add NUMA-aware vCPU pinning and cpuset.mems forwarding Make checkVCPUsPinning() NUMA-aware: when GuestNUMANodes are configured, vCPU threads are pinned to host CPUs belonging to the same NUMA node as the vCPU's guest NUMA node assignment via checkVCPUsPinningNUMA(), preserving memory locality. vCPUs are distributed proportionally across NUMA nodes, matching the distribution in buildNUMATopology(). Stop unconditionally stripping cpuset.mems in constrainGRPCSpec() and container update(). When multi-NUMA is configured, translate host NUMA node IDs to guest NUMA node IDs using translateHostMemsToGuest() before forwarding to the agent. This allows the agent to enforce NUMA-aware memory placement for containers. Filter guest NUMA nodes at VM creation time: before calling CreateVM(), prune GuestNUMANodes to only those whose HostCPUs intersect the sandbox cpuset. This avoids exposing fake NUMA topology to the guest when Kubernetes allocates CPUs from fewer nodes than the host has (e.g. all CPUs from node 0 on a 2-node host), improving memory locality and avoiding unnecessary cross-node memory traffic. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	d0d7deb262	runtime: Add host NUMA distance discovery and build guest NUMA topology Add sysfs-based host NUMA distance reading (GetHostNUMADistances) that parses /sys/devices/system/node/nodeN/distance to mirror the host NUMA distance matrix into the guest via -numa dist entries. Implement buildNUMATopology() which translates the GuestNUMANodes configuration into govmm NUMANode and NUMADist slices. Each guest NUMA node gets a floor-divided share of vCPUs and memory, with the last node absorbing any remainder. This handles the common Kata case of +1 VMM overhead vCPU gracefully. Memory backends are selected based on hugepages/virtio-fs/file-backed-mem configuration. Guard multi-NUMA topology generation to amd64 and arm64 only, since other architectures (s390x, riscv64) do not support QEMU NUMA/DIMM. Wire buildNUMATopology() into CreateVM so the QEMU config includes NUMA nodes and distances. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	447e2a3faf	runtime: Add VFIO device NUMA node detection and placement validation Add PCISysFsDevicesNUMANode property and GetPCIDeviceNUMANode() helper to read /sys/bus/pci/devices/<BDF>/numa_node when discovering VFIO devices. Store the result in the new NUMANode field on VFIODev (-1 for unknown/no affinity). Wire NUMA node detection into both GetAllVFIODevicesFromIOMMUGroup() (legacy VFIO path) and GetDeviceFromVFIODev() (IOMMUFD path) so every discovered VFIO device carries its host NUMA node. Add validateVFIODeviceNUMAPlacement() which runs at the end of buildNUMATopology(). It checks every cold-plugged VFIO device's host NUMA node against the guest NUMA topology and logs a warning if a device is on a host NUMA node not covered by any guest NUMA node (indicating potential cross-NUMA memory access overhead), or an info message confirming correct placement. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	1ee8bb5740	runtime: Add NUMA-aware SMP topology Make cpuTopology() NUMA-aware by accepting a numNUMANodes parameter. When multiple NUMA nodes are configured, restructure the SMP topology so that Sockets=numNUMA and Cores=ceil(maxvcpus/numNUMA), grouping vCPUs by socket per NUMA node. Use ceiling division so that uneven vCPU counts (e.g. the +1 VMM overhead vCPU that Kata adds) produce a QEMU-valid SMP topology where MaxCPUs == Sockets * Cores * Threads. When numNUMANodes <= 1, the existing flat topology (Sockets=maxvcpus, Cores=1) is preserved. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	f9eafb3341	runtime: drop host time namespace from OCI spec Docker 29.5+ adds a private time namespace to container bundles by default, but the Kata agent only supports the classic namespace set and fails with "invalid namespace type". Let's strip time namespaces in both the Go and Rust runtimes before the spec reaches the agent, matching how network and cgroup namespaces are handled. Fixes: #13080 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-21 13:56:45 +02:00
Alex Lyn	8dca734008	Merge pull request #12959 from DataDog/mayeul/fix-race-condition-when-adding-qdisc shim: Add backoff retry to ingress qdisc creation to avoid potential race condition	2026-05-19 14:06:37 +08:00
Sebastian Wolf	26746c9ce8	runtime/fc: track real firecracker PID instead of jailer PID When the jailer is in use (the default for kata-fc), cmd.Process.Pid in fcInit() is the jailer's PID, not firecracker's. The jailer forks + execs firecracker as a separate child and exits. fc.info.PID was therefore stored as the (soon-to-be-dead) jailer PID. At sandbox shutdown, fcEnd() calls WaitLocalProcess(fc.info.PID, SIGTERM, ...). syscall.Kill on the dead jailer PID returns ESRCH, WaitLocalProcess returns nil immediately, and the real firecracker microVM never receives a signal. It gets reparented to init and stays alive indefinitely, holding open resources from the host. Over many container lifecycles this becomes a serious resource leak. Read the real PID from <jailerRoot>/firecracker.pid, which firecracker itself writes after the exec. Update fc.info.PID with that value so all downstream code (fcEnd, Save/Load, kill-0 alive checks, NewProc) operates on the actual firecracker process. Also fix a small adjacent bug in Sandbox.Stop where the per-container teardown loop ignored the force flag, causing any container.stop error to short-circuit Stop before stopVM ran. Signed-off-by: Sebastian Wolf <swolf@nvidia.com>	2026-05-18 21:09:51 +02:00
Mayeul Blanzat	26f60ddd9b	shim: Add backoff retry to ingress qdisc creation to avoid race condition We sometimes get this error when creating the pod sandbox: failed to create shim task: Failed to add qdisc for network index 2 : device or resource busy. Add a linear backoff retry when adding the qdisc to mitigate the issue at the source and avoid the cascading error. Signed-off-by: Mayeul Blanzat <mayeul.blanzat@datadoghq.com>	2026-05-18 17:46:50 +02:00
PiotrProkop	5065058d4a	runtime: fix device allowlist detection comparing pointers Because intptr() returns a fresh pointer on every call, those comparisons compared addresses, never values, so every check evaluated to false. As a result /dev/null, /dev/urandom, /dev/ptmx, /dev/loop-control and /dev/loop* were appended to devices allowlist for sandbox_cgroup even when the runtime spec already listed them, producing duplicate entries. Switch to nil-safe value comparisons via a type switch on the cgroup device type and dereferenced d.Major / d.Minor, keeping the same detection semantics but actually matching existing entries. Assisted-By: Claude 4.7 Signed-off-by: PiotrProkop <pprokop@nvidia.com>	2026-05-12 18:52:53 +02:00
PiotrProkop	5cd187619e	runtime: allow loopback devices for sandbox cgroup only When sandbox_cgroup_only is enabled, the kata shim threads inherit the sandbox device cgroup. For container rootfs whose mount source is a regular file backed by a loop device (notably the blockfile snapshotter), containerd's mount package opens /dev/loop-control to allocate a free /dev/loopN and then opens that block node to attach the backing file. Neither device is on the sandbox cgroup allowlist, so both opens fail with EPERM. This change adds /dev/loop-control (char 10:237) and the /dev/loopN block nodes (block major 7, any minor) to the sandbox device cgroup allowlist when sandbox_cgroup_only is true, mirroring the existing treatment of /dev/null, /dev/urandom and /dev/ptmx. The additions are gated on SandboxCgroupOnly because that is the only mode in which the shim itself is constrained by this cgroup. Assisted-By: Claude 4.7 Signed-off-by: PiotrProkop <pprokop@nvidia.com>	2026-05-12 18:48:58 +02:00
Greg Kurz	b44e56d3db	runtime: Remove vendor directory Now shipped in the vendored code tarball. Drop the git tree status check since it isn't needed anymore. Also stop building with `-mod=vendor`. This requires exposing GOMODCACHE, as suggested by Fabiano Fidêncio. Signed-off-by: Greg Kurz <groug@kaod.org>	2026-05-06 09:47:30 +02:00
Fabiano Fidêncio	6436922f5b	runtime: network: handle "device" type interfaces (mlx5 SFs) Interfaces whose drivers do not register a specific netlink kind (e.g. mlx5 Scalable Functions) are reported with the generic type "device". The endpoint creation code did not handle this type, causing sandbox creation to fail with: "Unsupported network interface: device" This is particularly visible on arm64 with Mellanox ConnectX NICs using Scalable Functions, where the ethtool BusInfo returns a non-PCI identifier (e.g. "mlx5_core.sf.4") so isPhysicalIface() cannot classify the interface as physical either. Handle "device" type interfaces the same way as veth endpoints, connecting them through a TAP + TC-filter bridge. Additionally, relax getLinkForEndpoint() for VethEndpoint so it accepts the concrete link type returned by the kernel instead of asserting netlink.Veth. A "device" type interface wrapped in a VethEndpoint returns netlink.Device from LinkByName(), which would fail the strict type assertion. All callers only need link.Attrs(), so accepting any link type is safe. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-25 12:26:20 +02:00
Fabiano Fidêncio	77e558deb0	runtime: Fix shellcheck issues in git_push.sh Fix shellcheck warnings and notes identified by running shellcheck --severity=style. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-24 08:14:07 +02:00
Saul Paredes	83bbfedc08	network: preseed default-gateway neighbor This change mirrors host networking into the guest as before, but now also includes the default gateway neighbor entry for each interface. Pods using overlay/synthetic gateways (e.g., 169.254.1.1) can hit a first-connect race while the guest performs the initial ARP. Preseeding the gateway neighbor removes that latency and makes early connections (e.g., to the API Service) deterministic. Signed-off-by: Saul Paredes <saulparedes@microsoft.com>	2026-04-20 10:00:19 -07:00
Fabiano Fidêncio	64c139208f	agent: add GetDiagnosticData RPC with termination log support Add a new extensible GetDiagnosticData RPC that retrieves diagnostic information from the guest VM. The request carries a log_type string field to specify what kind of data is requested, and a container_id field to identify the target container. The first supported log_type is "termination_log", which reads the Kubernetes termination message file from inside the guest. This is needed for shared_fs=none configurations where the host cannot directly access the guest filesystem. On the Go runtime side, the container stop() path now calls GetDiagnosticData to copy the termination message to the host when running with NoSharedFS and the terminationMessagePolicy annotation is set to "File". The call is best-effort: failures are logged as warnings rather than blocking container teardown. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Silenio Quarti <silenio_quarti@ca.ibm.com>	2026-04-17 13:01:13 +02:00
Fabiano Fidêncio	e8f34a2b26	agent: Update protocol This is not related to this PR, but rather to #12734, which ended up not running the `make src/agent generate-protocols`. While here, let's also fix it. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-10 14:47:01 +02:00
Fabiano Fidêncio	36a2d8e7f2	agent: Make launch_process_timeout configurable The hardcoded DEFAULT_LAUNCH_PROCESS_TIMEOUT of 6 seconds in the kata agent is insufficient for environments with NVIDIA GPUs and NVSwitches, where the attestation-agent needs significantly more time to collect evidence during initialization (e.g. ~2 seconds per NVSwitch). When the timeout expires, the agent (PID 1) exits with an error, causing the guest kernel to perform an orderly shutdown before the attestation-agent has finished starting. Make this timeout configurable via the kernel parameter agent.launch_process_timeout (in seconds), preserving the 6-second default for backward compatibility. The Go runtime is wired up to pass this value from the TOML config's [agent.kata] section through to the kernel command line. The NVIDIA GPU configs set the new default to 15 seconds. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-04-10 14:47:01 +02:00
llink5	f7878cc385	runtime: fix Docker 26+ networking by rescanning after Start Docker 26+ configures container networking (veth pair, IP addresses, routes) after task creation rather than before. Kata's endpoint scan runs during CreateSandbox, before the interfaces exist, resulting in VMs starting without network connectivity (no -netdev passed to QEMU). Add RescanNetwork() which runs asynchronously after the Start RPC. It polls the network namespace until Docker's interfaces appear, then hotplugs them to QEMU and informs the guest agent to configure them inside the VM. Additional fixes: - mountinfo parser: find fs type dynamically instead of hardcoded field index, fixing parsing with optional mount tags (shared:, master:) - IsDockerContainer: check CreateRuntime hooks for Docker 26+ - DockerNetnsPath: extract netns path from libnetwork-setkey hook args with path traversal protection - detectHypervisorNetns: verify PID ownership via /proc/pid/cmdline to guard against PID recycling - startVM guard: rescan when len(endpoints)==0 after VM start Fixes: #9340 Signed-off-by: llink5 <llink5@users.noreply.github.com>	2026-04-02 21:23:16 +02:00
PiotrProkop	64735222c6	runtime: allow specifying logical/physical sector size for block devices Add two new configuration knobs that control the logical and physical sector sizes advertised by virtio-blk devices to the guest: block_device_logical_sector_size (config file) block_device_physical_sector_size (config file) io.katacontainers.config.hypervisor.blk_logical_sector_size (annotation) io.katacontainers.config.hypervisor.blk_physical_sector_size (annotation) The annotation names are abbreviated relative to the config file keys because Kubernetes enforces a 63-character limit on annotation name segments, and the full names would exceed it. Both settings default to 0 (let QEMU decide). When set, they are passed as logical_block_size and physical_block_size in the QMP device_add command during block device hotplug. Setting logical_sector_size smaller than the container filesystem block size causes EINVAL on mount. The physical_sector_size can always be set independently. Values must be 0 or a power of 2 in the range [512, 65536]; other values are rejected with an error at sandbox creation time. Signed-off-by: PiotrProkop <pprokop@nvidia.com>	2026-03-27 18:56:54 +01:00
Roaa Sakr	858620d2e7	clh: Add VFIO device cold-plug support Enable VFIO device pass-through at VM creation time on Cloud Hypervisor, in addition to the existing hot-plug path. Signed-off-by: Roaa Sakr <romoh@microsoft.com>	2026-03-25 16:39:25 -07:00
Zvonko Kaiser	6a853a9684	gpu: Bump NVRC We have a new release; add this one to the next Kata release. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com> Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-03-15 09:53:32 -07:00
Zvonko Kaiser	8ff5d164c6	runtime: make CDI annotation vendor-agnostic with lookup table Replace hardcoded NVIDIA vendor ID (0x10de) and class (0x030) checks with a vendor-agnostic lookup table (cdiDeviceKind) that maps PCI vendor/class pairs to CDI device kinds. This makes it straightforward to add support for new device types by adding entries to the table. Refactor siblingAnnotation to resolve device BDFs once upfront and reuse them for both CDI type detection and sibling matching, eliminating redundant sysfs reads. Devices not in the lookup table (e.g. NVSwitches) are skipped with errNoSiblingFound, while known device types that fail to match a sibling produce a hard error. Consolidate the hot-plug and cold-plug device loops into a single loop over extracted container paths, removing duplicated filtering logic. Export GetPCIDeviceProperty from the device drivers package to allow vendor/class lookup from sysfs in the container annotation path. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-03-15 09:53:32 -07:00
Aurélien Bombo	a4fd32a29a	runtime: Support trusted ephemeral data storage * Introduces the `emptydir_mode` config flag to allow instructing the runtime to create a block device for emptyDir volumes. * The block device is created in the original emptyDir folder on the host so that Kubelet can monitor its disk usage and evict the pod if it exceeds its sizeLimit. This matches runc and virtio-fs. * The block device's disk image file is sparse to minimize host disk footprint. Fixes: #10560 Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-09 14:52:17 -05:00
Alex Lyn	fb743a304c	runtime: Support plugging a disk as an image file Some VMMs support plugging a disk as an image file instead of a block device, so we adapt the runtime to support that. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com> Signed-off-by: Aurélien Bombo <abombo@microsoft.com> Co-authored-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-09 14:52:17 -05:00
Fabiano Fidêncio	83dd7dcc75	runtimes: reject virtio-blk-mmio when confidential_guest is true Virtio-mmio transport is not hardened for confidential computing (unlike virtio-pci). Reject config that would use virtio-blk-mmio for rootfs/block when confidential_guest is set, so CoCo guests only use virtio-blk-pci. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 21:41:27 +01:00
Hyounggyu Choi	347ce5e3bc	runtime: Skip to call sandboxDevices() for remote hypervisor The remote hypervisor delegates VM creation to a remote service. The VM runs on cloud infrastructure, not the local host kernel. So requiring a KVM/MSHV device is semantically wrong and would cause a hard failure on any host where these devices are absent (e.g., a VM that doesn't expose nested virtualization). Skip sandboxDevices() entirely when the configured hypervisor type is remoteHypervisor{}. Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>	2026-03-03 13:44:12 +01:00
Steve Horsman	b147cb1319	Merge pull request #12587 from fidencio/topic/runtime-add-configurable-kubelet-root-dir runtimes: add configurable kubelet root dir	2026-02-28 19:06:14 +00:00
Zvonko Kaiser	eec397ac08	qemu: Remove PCIe root port BAR reserve sizing Stop computing and setting mem-reserve and pref64-reserve on PCIe root ports and switch ports. Remove getBARsMaxAddressableMemory() which scanned host GPU BARs to pre-calculate these values. The previous approach only considered GPU devices (IsGPU(), class 0x0302) when scanning for BAR sizes, so devices like NVSwitches (class 0x0680) with their 32MB non-prefetchable BAR0 were not accounted for and received the 4MB default. Additionally, GetTotalAddressableMemory() classifies BARs by 32/64-bit address width rather than by the prefetchable flag that QEMU's mem-reserve vs pref64-reserve maps to. Modern QEMU introspects VFIO device BARs when they are attached to root ports and sizes the MMIO windows accordingly. Modern OVMF (edk2-stable202502+) automatically calculates the 64-bit PCI MMIO aperture based on the BARs of actually present devices during PCI enumeration. Omitting the reserve parameters lets QEMU and OVMF handle MMIO window sizing correctly for all device types including GPUs, NVSwitches, and NICs without requiring host-side BAR scanning. This also removes the nvpci dependency from qemu_arch_base.go. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-02-27 22:54:31 +01:00
Zvonko Kaiser	bb7fd335f3	qemu: Remove OVMF X-PciMmio64Mb fw_cfg hint Modern OVMF (edk2-stable202502 and later) automatically sizes the 64-bit PCI MMIO aperture based on the BARs of actually attached devices during PCI enumeration. The opt/ovmf/X-PciMmio64Mb fw_cfg hint is no longer needed to ensure large-BAR devices like NVIDIA GPUs receive adequate MMIO space. The previous approach was fragile: the runtime scanned host PCI devices to estimate the required aperture size, but only considered GPU devices (class 0x0302), missing NVSwitches and other devices with large BARs. Removing this code avoids confusion about MMIO sizing responsibility. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-02-27 22:54:31 +01:00
Fabiano Fidêncio	0a73638744	runtime: add configurable kubelet root dir Some Kubernetes distributions, such as k0s, use a kubelet root dir other than the default /var/lib/kubelet, so ConfigMap and Secret volume propagation was failing. This adds a kubelet_root_dir config option that the Go runtime uses when matching volume paths, and kata-deploy now sets it automatically for k0s via a drop-in file. runtime-rs does not need this option: it identifies ConfigMap/Secret, projected, and downward-api volumes by volume-type path segment (kubernetes.io~configmap, etc.), not by kubelet root prefix. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-02-27 14:10:57 +01:00
Hyounggyu Choi	be5ae7d1e1	Merge pull request #12573 from BbolroC/support-memory-hotplug-go-runtime-s390x runtime: Support memory hotplug via virtio-mem on s390x	2026-02-27 09:59:40 +01:00
Hyounggyu Choi	b9f3d5aa67	runtime: Support memory hotplug with virtio-mem on s390x This commit adds logic to properly handle memory hotplug for QemuCCWVirtio in the ExecMemdevAdd() path. The new logic is triggered only when virtio-mem is enabled. Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>	2026-02-26 14:21:34 +01:00
Hyounggyu Choi	19771671c2	runtime: Handle virtio-mem resize in hotplugAddMemory() ResizeMemory() already contains the virtio-mem resize logic. However, hotplugAddMemory(), which is invoked via a different path, lacked this handling and always fell back to the pc-dimm path, even when virtio-mem was configured. This commit adds virtio-mem resize handling to hotplugAddMemory(). It also adds corresponding unit tests. Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>	2026-02-26 14:21:34 +01:00
Dan Mihai	7973e4e2a8	runtime: clh: disable nested vCPUs on MSHV The recently-added nested property is true by default, but is not supported yet on MSHV. See cloud-hypervisor/cloud-hypervisor#7408 for additional information. Signed-off-by: Dan Mihai <dmihai@microsoft.com>	2026-02-25 21:01:25 +00:00
Dan Mihai	dc398e801c	runtime: clh: specify raw image format Specify raw image format for all guest block devices. - Attempting to auto-detect the image format from CLH would be riskier for the Host. - Creating a new raw image file, auto-detecting its format, and then creating a filesystem from the Guest onto the block device is no longer supported by CLH v51. Therefore, Kata CI's k8s-block-volume.bats would fail without specifying the raw format when hot plugging its block device. - See cloud-hypervisor/cloud-hypervisor@b3e8e2a for additional information. Signed-off-by: Dan Mihai <dmihai@microsoft.com>	2026-02-25 21:01:25 +00:00
Dan Mihai	0629354ca0	versions: update cloud hypervisor to v51.1 ``` v51.1 ===== This is a bug fix release. The following issues have been addressed: * Fix image_type in OpenAPI definition (#7734) v51.0 ===== This release has been tracked in v51.0 group of our roadmap project. Security Fixes This release fixes a security vulnerability in disk image handling. Details can be found in GHSA-jmr4-g2hv-mjj6. * A new `backing_files=on\|off` option has been added to `--disk` to explicitly control whether QCOW2 backing files are permitted. This defaults to `off` to prevent the loading of backing files entirely. (#7685) * Explicit image type specification via the user interface, removing reliance on format autodetection (#7728). * Prevent sector-zero writes for autodetected raw images (#7728). Significant QCOW2 v3 Improvements A large number of QCOW2 v3 specification features have been implemented: * RAW backing file support for QCOW2 overlays (#7570) * Zero bit in L2 entries (#7627) * Incompatible feature bit validation (#7612) * Dirty bit support (#7636) * Variable refcount widths (1 to 64-bit) (#7633) * Corrupt bit detection and marking (#7639) * Autoclear feature bits handling (#7648) * Thread safety fix for multiple virtio queues (`num_queues > 1`) (#7661) * Correct zero-fill for reads beyond backing file size (#7678) * Live disk resize support (#7687) ACPI Generic Initiator Support ACPI Generic Initiator Affinity (SRAT Type 5) support has been added to associate VFIO-PCI devices with dedicated memory/CPU-less NUMA nodes. This enables the guest OS to make NUMA-aware memory allocation decisions for device workloads. A new `device_id` parameter has been added to `--numa` for specifying VFIO devices. (#7626) Block Device DISCARD and WRITE_ZEROES Support The `virtio-blk` device now supports `DISCARD` and `WRITE_ZEROES` operations for QCOW2 and RAW image formats. This enables thin provisioning and efficient space reclamation when guests trim filesystems. A new `sparse=on\|off` option has been added to `--disk` to control disk space management: `sparse=on` (default) enables thin provisioning with space reclamation, while `sparse=off` provides thick provisioning with consistent I/O latency. (#7666) Notable Performance Improvements * Transparent Huge Pages (THP) support has been extended to cover anonymous shared memory (`shared=on`) via `madvise`. Previously, THP was only used for non-shared memory. (#7646) * The `vhost-user-net` device now uses the default set of vhost-user virtio features, including `VIRTIO_F_RING_INDIRECT_DESC`, which provides a performance improvement. (#7653) MSHV Support Improvements * Optimize CPU state update after emulation by only updating special registers when changed (#7603) * Enable SMT for guests with `threads_per_core > 1` (#7668) * Stub `save_data_tables()` to unblock VM pause/resume (#7692) * Handle `GHCB_INFO_SPECIAL_DBGPRINT` VMG exit in SEV-SNP guest exit handler (#7703) * Fix CVM boot failure on MSHV (#7548) * Fix CPU topology detection for multithreaded configurations (#7576) Notable Bug Fixes * Fix VFIO device hot-remove leaving group and container file descriptors open, preventing re-add (#7676) * Fix snapshot restore when backing file is on read-only storage with `shared=false` (#7674) * Enforce `VIRTIO_BLK_F_RO` even if guest does not negotiate it (#7705) * Fix read-only block device FLUSH requests from OVMF preventing VMs from booting (#7706) * Fix vhost-user device not properly dropping unowned file descriptors (#7679) * Fix `vhost-user-block` `get_config` interoperability (#7617) * Fix vsock TOCTOU race condition by copying packet header from guest memory before processing (#7530) * Fix vsock handling of large TX packets spanning multiple data descriptors (#7680) * Add `gettid()` to all seccomp filters (#7596) * Fix MAC address parsing that wrongly allowed `+` instead of hex characters (#7579) * Improve UUID parse error message and `--net` fd help text (#7702) * Fix various inconsistencies in our OpenAPI specification file (#7716, #7726) * Various documentation fixes (#7602, #7606) ``` Signed-off-by: Dan Mihai <dmihai@microsoft.com>	2026-02-25 21:01:25 +00:00
stevenhorsman	ef1b0b2913	runtime: Fix mismatch in receiver names Fix: `ST1016: methods on the same type should have the same receiver name` Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:33:04 +00:00
stevenhorsman	1b2ca678e5	runtime: Fix identifier names Fix identifiers that are non-compliant with Go's naming conventions, e.g. not capitalising initialisms Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:33:04 +00:00
stevenhorsman	69fea195f9	runtime: Fix arm unit test I think that `c727332b0e` broke the arm unit test by removing the arm specific overrides, so update the expected output Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:33:04 +00:00
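
The `0a73638744` message above contrasts two ways of recognising kubelet-managed volumes: matching a configurable kubelet root prefix (Go runtime) and looking for a volume-type path segment such as `kubernetes.io~configmap` (runtime-rs). The sketch below illustrates the difference only; it is not the code from that commit. The helper names `isUnderKubeletRoot` and `hasVolumeTypeSegment` are hypothetical, and `/var/lib/k0s/kubelet` is an assumed example of a non-default kubelet root.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isUnderKubeletRoot is a hypothetical helper: it reports whether path is
// the kubelet root dir itself or lies somewhere beneath it. A match only
// succeeds when kubeletRootDir is configured for the distribution in use.
func isUnderKubeletRoot(path, kubeletRootDir string) bool {
	root := filepath.Clean(kubeletRootDir)
	p := filepath.Clean(path)
	return p == root || strings.HasPrefix(p, root+string(filepath.Separator))
}

// hasVolumeTypeSegment is a hypothetical helper: it reports whether any
// path component equals volumeType (for example "kubernetes.io~configmap").
// It does not depend on where the kubelet root dir is located.
func hasVolumeTypeSegment(path, volumeType string) bool {
	for _, seg := range strings.Split(filepath.Clean(path), string(filepath.Separator)) {
		if seg == volumeType {
			return true
		}
	}
	return false
}

func main() {
	// Assumed example path under a non-default kubelet root.
	vol := "/var/lib/k0s/kubelet/pods/abc/volumes/kubernetes.io~configmap/cfg"

	fmt.Println(isUnderKubeletRoot(vol, "/var/lib/kubelet"))          // false
	fmt.Println(isUnderKubeletRoot(vol, "/var/lib/k0s/kubelet"))      // true
	fmt.Println(hasVolumeTypeSegment(vol, "kubernetes.io~configmap")) // true
}
```

With the default root the prefix check misses the volume, which is the failure mode the commit describes; the segment check is unaffected by the root location.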


1352 Commits