kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-07-02 07:02:16 +00:00

Author	SHA1	Message	Date
Dan Mihai	c81dadaba1	Merge pull request #13064 from burgerdev/add-arp-neighbour agent: use rtnetlink to add ARP neighbour	2026-05-26 09:59:44 -07:00
Fabiano Fidêncio	3dc02a8604	Merge pull request #13085 from Apokleos/erofs-gpt-vmdk-only runtime-rs: Support erofs snapshotter with gpt vmdk mode	2026-05-25 16:29:59 +02:00
Zvonko Kaiser	aeadb1af35	Merge pull request #12948 from fidencio/topic/numa runtime (go): agent: Add NUMA support for QEMU	2026-05-25 15:33:14 +02:00
Alex Lyn	2036e66bc3	kata-agent: Integrate GPT partition support into multi-layer handler In GPT mode, all partitions share the same base block device, so resolving it once per uevent source and caching the result avoids redundant hotplug waits that would otherwise scale linearly with layer count. Layers are sorted by partition number before mounting to guarantee correct overlay lowerdir precedence regardless of the order the host emits Storage entries. The dead_code attributes are also removed now that the code is in use. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	17fadde6d8	kata-agent: Add GPT partition utility functions The guest agent needs to resolve individual partition devices from a single GPT-partitioned block device, but the kernel does not always create partition nodes immediately after the base device appears, especially when another fd holds the device open during hot-plug. Add utility functions that handle two problems: (1) Mapping a base device path to its partition path following the kernel naming convention (bare suffix vs 'p' separator). (2) Ensuring the partition node exists before mount. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	8119a561ae	kata-agent: Refactor wait_and_mount_layer to return LayerMountInfo This commit makes no functional change: all callers pass None, so every call still resolves the device via uevent exactly as before. It only prepares the multi-layer EROFS handler for GPT partition and dm-verity support by widening the wait_and_mount_layer() interface without changing behavior. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	0bd150e5f1	runtime-rs: Integrate GPT+VMDK mode for multi-layer EROFS rootfs When multiple EROFS layers are present, wrap them into a single GPT-partitioned virtual disk delivered via one VMDK descriptor and a single block device hotplug, which significantly reduces PCI bus slot usage compared with the previous one-device-per-layer approach that exhausts virtio-blk slots for large layer counts. The host detects multi-layer mounts, computes the GPT layout, generates head metadata plus a VMDK descriptor referencing all EROFS images, and hot-plugs the composite disk. Per-partition Storage entries are created with X-kata.gpt-partitioned and X-kata.partition-number options so the guest agent can resolve each layer to its partition device. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	c3b06af4c7	kata-types: Add gpt_disk module for GPT metadata generation Introduce gpt_disk.rs to compute GPT partition layouts and generate metadata files for multi-layer EROFS rootfs. The module creates GPT head metadata that is combined with EROFS layer images via VMDK descriptors, presenting a single GPT-partitioned virtual disk to the guest VM, with each EROFS layer mapped to its own partition. The layout engine calculates LBA positions for an arbitrary number of EROFS layers, then writes a full protective-MBR + GPT image and extracts the head (MBR + primary GPT table) segments as standalone files for VMDK extent assembly. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	148810312d	runtime-rs: Refactor VMDK writer and erofs rootfs handling logic Restructure the erofs rootfs handler to support multi-layer GPT+VMDK mode where multiple EROFS layers are wrapped into a single virtual disk with a GPT partition table. Extract VmdkDescriptorWriter as a reusable struct for atomic VMDK descriptor generation. Change erofs_storage from Option<Storage> to Vec<Storage> to hold per-layer metadata, and add GPT metadata path tracking for proper cleanup with path-traversal guards. Bump MAX_VIRTIO_BLK_DEVICES from 10 to 127 to accommodate GPT disks carrying many partitions. Pre-extract mkdir directives from overlay mounts before the main loop to avoid redundant option parsing. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	7086caaddf	kata-agent: Remove unused mode field from MkdirDirective The previously unused code was marked with the dead_code attribute and is never actually used, so remove it entirely. This removes the mode field from the MkdirDirective structure along with its relevant test cases. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	39c512bc36	kata-agent: Enhance virtio block matcher to reject partition uevents Enhance VirtioBlkPciMatcher to only match whole-disk uevents. This prevents the matcher from incorrectly matching partition uevents (e.g., /dev/vdaX) which is critical for partitioned disks where partition uevents appear alongside whole-disk uevents. This commit aims to eliminate such bad cases. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	56f05aa534	kata-agent: Enhance SCSI block device matcher to reject partition uevents Refactor ScsiBlockMatcher to only match whole-disk uevents. This prevents the matcher from incorrectly matching partition uevents (e.g., block/sdd/sdd9) which is critical for partitioned disks where partition uevents appear alongside whole-disk uevents. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Fabiano Fidêncio	7ddea26137	Merge pull request #13086 from fvichot/flo-kata-monitor-fix kata-monitor: use full URI for connecting to containerd	2026-05-25 10:16:11 +02:00
Fabiano Fidêncio	8787da13a9	agent: Add NUMA-aware PCI path parsing Extend pcipath_from_dev_tree_path() to support the full NUMA-aware path format "root_complex/bus/device" (e.g. "10/00/02") in addition to the legacy "bus/device" format, defaulting to root complex "00" for backward compatibility. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	1cbe930fc9	runtime: Add pxb-pcie NUMA-aware PCIe topology for VFIO devices When NUMA placement is active and VFIO devices are cold-plugged, create a pxb-pcie (PCIe Expander Bridge) per NUMA node that has devices. Each pxb-pcie carries a numa_node property that gives the guest kernel correct NUMA affinity for all PCI devices beneath it. Root ports are created on each pxb-pcie bus instead of pcie.0, and VFIODevice.Attach() assigns each device to the root port on its host NUMA node's pxb bridge. Non-VFIO devices remain on pcie.0. NUMA placement is "active" when there is more than one guest NUMA node OR a single guest node mapped to a specific host node (the latter happens when maybeRightSizeAutoNUMA() collapses a multi-node sandbox to the GPU's host NUMA node). In both cases buildNUMATopology() also emits the matching memory-backend-ram,host-nodes=,policy=bind entries so guest memory is sourced from the right host node. So pxb-pcie can never capture a leaf virtio-pci device as the default bus, every virtio-pci device emitter (NetDevice, VSOCK, vhost-user-{net,scsi,blk,fs}) now appends bus=pcie.0 explicitly when the machine actually exposes a pcie.0 root. Detection is done via a new hasPCIeRoot() helper that returns true only for q35/virt machine types — ppc64le's pseries (pci.0), s390x's s390-ccw-virtio (CCW transport) and microvm (no PCI) intentionally skip the pin to avoid "Bus 'pcie.0' not found" at startup. This is the only QEMU mechanism that works for both regular and confidential (TDX/SNP) guests, as it operates through the PCI bus hierarchy rather than ACPI table injection. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	15292da217	config: Enable NUMA by default for nvidia-gpu configurations Enable enable_numa=true in the three nvidia-gpu QEMU configuration templates (base, SNP, TDX). On single-NUMA hosts this is a no-op since buildNUMATopology() returns nil when there is only one node. On multi-NUMA hosts it ensures GPU memory accesses are NUMA-local. Add documentation to all QEMU config templates explaining the VFIO device NUMA placement validation that occurs when NUMA is enabled. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	feeb5d8ecc	runtime-rs: Fix vCPU pinning race with backoff retry QEMU can report fewer vCPU threads during early startup, causing partial affinity setup. Let's retry with exponential backoff until the expected thread count is visible, then continue with best-effort pinning if the window is exhausted. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	f53f427859	runtime: Fix vCPU pinning race for Go runtime QEMU may not have spawned all vCPU threads when pinning starts, so query_cpus_fast can return an incomplete list and leave some vCPUs unpinned. To fix it, let's add exponential backoff retries before pinning and fall back to available threads if retries are exhausted. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	b688619314	runtime: oci: Fix sandbox CPU sizing with cpuManagerPolicy=static When cpuManagerPolicy=static is configured, kubelet sets the sandbox CPU quota to -1 (unconstrained) because it uses cpuset pinning instead of CFS quota. This causes CalculateSandboxSizing to compute 0 workload CPUs, resulting in the VM starting with only default_vcpus. Fall back to deriving the CPU count from sandbox CPU shares (1024 shares per CPU) when the quota-based calculation yields 0. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	12e5985dbd	runtime: Add NUMA-aware vCPU pinning and cpuset.mems forwarding Make checkVCPUsPinning() NUMA-aware: when GuestNUMANodes are configured, vCPU threads are pinned to host CPUs belonging to the same NUMA node as the vCPU's guest NUMA node assignment via checkVCPUsPinningNUMA(), preserving memory locality. vCPUs are distributed proportionally across NUMA nodes, matching the distribution in buildNUMATopology(). Stop unconditionally stripping cpuset.mems in constrainGRPCSpec() and container update(). When multi-NUMA is configured, translate host NUMA node IDs to guest NUMA node IDs using translateHostMemsToGuest() before forwarding to the agent. This allows the agent to enforce NUMA-aware memory placement for containers. Filter guest NUMA nodes at VM creation time: before calling CreateVM(), prune GuestNUMANodes to only those whose HostCPUs intersect the sandbox cpuset. This avoids exposing fake NUMA topology to the guest when Kubernetes allocates CPUs from fewer nodes than the host has (e.g. all CPUs from node 0 on a 2-node host), improving memory locality and avoiding unnecessary cross-node memory traffic. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	d0d7deb262	runtime: Add host NUMA distance discovery and build guest NUMA topology Add sysfs-based host NUMA distance reading (GetHostNUMADistances) that parses /sys/devices/system/node/nodeN/distance to mirror the host NUMA distance matrix into the guest via -numa dist entries. Implement buildNUMATopology() which translates the GuestNUMANodes configuration into govmm NUMANode and NUMADist slices. Each guest NUMA node gets a floor-divided share of vCPUs and memory, with the last node absorbing any remainder. This handles the common Kata case of +1 VMM overhead vCPU gracefully. Memory backends are selected based on hugepages/virtio-fs/file-backed-mem configuration. Guard multi-NUMA topology generation to amd64 and arm64 only, since other architectures (s390x, riscv64) do not support QEMU NUMA/DIMM. Wire buildNUMATopology() into CreateVM so the QEMU config includes NUMA nodes and distances. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	447e2a3faf	runtime: Add VFIO device NUMA node detection and placement validation Add PCISysFsDevicesNUMANode property and GetPCIDeviceNUMANode() helper to read /sys/bus/pci/devices/<BDF>/numa_node when discovering VFIO devices. Store the result in the new NUMANode field on VFIODev (-1 for unknown/no affinity). Wire NUMA node detection into both GetAllVFIODevicesFromIOMMUGroup() (legacy VFIO path) and GetDeviceFromVFIODev() (IOMMUFD path) so every discovered VFIO device carries its host NUMA node. Add validateVFIODeviceNUMAPlacement() which runs at the end of buildNUMATopology(). It checks every cold-plugged VFIO device's host NUMA node against the guest NUMA topology and logs a warning if a device is on a host NUMA node not covered by any guest NUMA node (indicating potential cross-NUMA memory access overhead), or an info message confirming correct placement. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	1ee8bb5740	runtime: Add NUMA-aware SMP topology Make cpuTopology() NUMA-aware by accepting a numNUMANodes parameter. When multiple NUMA nodes are configured, restructure the SMP topology so that Sockets=numNUMA and Cores=ceil(maxvcpus/numNUMA), grouping vCPUs by socket per NUMA node. Use ceiling division so that uneven vCPU counts (e.g. the +1 VMM overhead vCPU that Kata adds) produce a QEMU-valid SMP topology where MaxCPUs == Sockets * Cores * Threads. When numNUMANodes <= 1, the existing flat topology (Sockets=maxvcpus, Cores=1) is preserved. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	1e9da61d48	govmm: Add multi-NUMA memory backend and distance matrix support Introduce NUMANode and NUMADist types, add NUMANodes/NUMADists fields to Config, and implement appendMultiNUMAMemoryKnobs() to generate per-node memory-backend objects with host-nodes/policy=bind, -numa node entries with cpus= ranges, and -numa dist entries for the distance matrix. Gate the multi-NUMA path in appendMemoryKnobs() behind isDimmSupported() to ensure architectures without DIMM support (s390x, riscv64) fall back to the single-node path. Drop 386 from isDimmSupported since 32-bit x86 is not a supported Kata target. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	ed4d0fb51f	runtime-rs: qemu: pass `-bios` for non-confidential guests The `boot_info.firmware` field from the hypervisor configuration is loaded by kata-types and surfaces in the TOML as `firmware = "..."`, but the qemu cmdline generator never consumed it for non-CC guests. Today, `-bios <path>` is only appended via the `Bios` device pushed by `add_{sev,sev_snp,tdx}_protection_device()` in `QemuInner::start_vm()`, which use the firmware copied into the `ProtectionDeviceConfig`. That path is taken only when `confidential_guest = true` and a SEV/SEV-SNP/TDX protection device is configured. For plain Q35 profiles (notably the nvidia-gpu one, which needs OVMF to boot the GPU passthrough VM), the `firmware` set in the TOML was silently dropped and qemu fell back to its default BIOS. Wire `boot_info.firmware` directly in `QemuCmdLine::new()` when no protection device path is going to emit `-bios` (i.e. for non-CC guests). CC paths are left untouched so we don't end up with a duplicated `-bios` argument. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 15:05:26 +02:00
Fabiano Fidêncio	4c1b3312ea	runtime-rs: nvidia-gpu: use _NV firmware substitutions in config template The `configuration-qemu-nvidia-gpu-runtime-rs.toml.in` template was using the generic `@FIRMWAREPATH@` / `@FIRMWAREVOLUMEPATH@` placeholders, which are left empty for the qemu hypervisor in the runtime-rs Makefile. As a result, no firmware (BIOS) was actually passed to qemu when launching a VM with the nvidia-gpu configuration, breaking OVMF based boot. Switch the placeholders to `@FIRMWAREPATH_NV@` / `@FIRMWAREVOLUMEPATH_NV@`, matching the runtime-go nvidia-gpu template and the substitutions exported by the runtime-rs Makefile, so the OVMF firmware path is properly plumbed through to qemu. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 14:59:11 +02:00
Florian Vichot	554e8f91b1	kata-monitor: use full URI for connecting to containerd Without the protocol in the URI, grpc-go defaults to the DNS resolver, which results in an error for unix sockets (`name resolver error: produced zero addresses`). We also remove the `getAddressAndDialer(...)` and `dial(...)` functions, as they are no longer necessary, grpc-go supports connecting to unix sockets directly. This also removes the matching tests. This also adds a `Makefile` and tweaks the Dockerfile to simplify building the Docker image. Fixes #12398 Signed-off-by: Florian Vichot <florian.vichot@gmail.com>	2026-05-23 16:47:46 +02:00
Fabiano Fidêncio	cbcdd999e4	Merge pull request #12957 from Apokleos/fix-sb-api runtime-rs: Fix sandbox-api lifecycle and CRI status handling	2026-05-23 09:26:14 +02:00
Alex Lyn	486f5f9412	runtime-rs: Align sandbox status with CRI expectations Update the sandbox status reporting to align with containerd/CRI requirements. This commit addresses the state mapping issue. Previously, internal state strings were returned, which containerd could not recognize, causing running sandboxes to be misinterpreted as SANDBOX_NOTREADY. This maps internal states to CRI constants: - Running -> SANDBOX_READY - Init | Stopped -> SANDBOX_NOTREADY These changes ensure the sandbox status is both accurately interpreted and fully compliant with the expected interface. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:42:43 +08:00
Alex Lyn	3f42929e2b	runtime-rs: Update sandbox status to include created_at field Ensure the `created_at` timestamp is correctly propagated in the sandbox status. Although `created_at` is present in the `SandboxStatus` and `SandboxStatusResponse` data structures, it was previously omitted during the status transition. This commit completes the implementation by passing the value recorded during sandbox initialization. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:42:43 +08:00
Alex Lyn	3358c7634b	runtime-rs: Avoid shutting down sandbox on container exit Prevent the sandbox from being prematurely shut down when a standard workload container exits. Previously, the shutdown logic incorrectly triggered a sandbox shutdown whenever the container list became empty. This resulted in unintended lifecycle termination for non-transient sandboxes. This change refines the `need_shutdown_sandbox()` criteria in `virt_container/src/container_manager/manager.rs` to only initiate a shutdown under specific conditions: - The shutdown request is explicit (`req.is_now`). - The request targets the sandbox itself (`req.container_id == self.sid`). By removing the implicit dependency on the empty container list, we ensure the sandbox remains active as expected after workload containers finish execution. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:42:43 +08:00
Alex Lyn	2b980b3a34	runtime-rs: Block WaitSandbox until sandbox exits Rework sandbox waiting so the WaitSandbox path blocks on sandbox lifetime rather than directly borrowing the hypervisor wait call. Once stop has been observed, the cached exit result is returned to later waiters. While the sandbox is still alive, waiters subscribe to the internal stop notifier and sleep until shutdown or VM exit records the final result. Together with the preceding support commits, this keeps the overall behaviour identical to the original WaitSandbox fix while making the dependency chain explicit. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:42:43 +08:00
Alex Lyn	ac2d39fc34	runtime-rs: Add sandbox exit notifier in VirtSandbox Add an internal exit_notify_tx channel to VirtSandbox and initialise it in both the regular and restore constructors. The later WaitSandbox rework needs a way to block until sandbox stop has been observed without polling runtime state. This commit only wires in the notifier so the follow-on behaviour change can subscribe to a dedicated stop signal. No WaitSandbox behaviour changes are made here yet. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:42:43 +08:00
Alex Lyn	116ae66025	runtime-rs: Introduce cached sandbox exit information Introduce an exit_info field in SandboxInner so sandbox teardown can store a stable exit result in runtime state. The follow-on WaitSandbox rework needs a place to keep the final SandboxExitInfo after the sandbox has already stopped. Without that cached result, later waiters would have no consistent value to return once the original stop event has passed. This change only adds the state holder. Behaviour changes follow in later commits. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:42:43 +08:00
dependabot[bot]	ac77c5fdff	build(deps): bump github.com/containerd/containerd in /src/runtime Bumps [github.com/containerd/containerd](https://github.com/containerd/containerd) from 1.7.29 to 1.7.32. - [Release notes](https://github.com/containerd/containerd/releases) - [Changelog](https://github.com/containerd/containerd/blob/main/RELEASES.md) - [Commits](https://github.com/containerd/containerd/compare/v1.7.29...v1.7.32) --- updated-dependencies: - dependency-name: github.com/containerd/containerd dependency-version: 1.7.32 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>	2026-05-21 21:56:06 +00:00
Fabiano Fidêncio	05f2bfcb0b	runtime-rs: drop unused std::env import in initdata_block tests The tests module imports std::env but never references it, which trips the unused_imports warning during CI builds. Remove the dead import to silence the warning. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-21 13:56:45 +02:00
Fabiano Fidêncio	f9eafb3341	runtime: drop host time namespace from OCI spec Docker 29.5+ adds a private time namespace to container bundles by default, but the kata agent only supports the classic namespace set and then fails with "invalid namespace type". Let's strip time namespaces in both the Go and Rust runtimes before the spec reaches the agent, matching how network and cgroup namespaces are handled. Fixes: #13080 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-21 13:56:45 +02:00
Alex Lyn	c919aea448	Merge pull request #13066 from RainaYL/rainax/guest_memfd_pr dragonball: Add implementation for KVM-managed guest memfd	2026-05-21 17:12:44 +08:00
Xiaofan Xxf	62af158842	dragonball: Add implementation for KVM-managed guest memfd A TDX VM requires that guest memfd is managed by KVM, so that KVM is able to toggle the memory attribute for the region to shared/private. Therefore, only anonymous guest memory is allowed for TDX VM, and the KVM-managed memfd should be created by KVM_CREATE_GUEST_MEMFD ioctl, instead of issuing memfd_create system call. Also, in order to bind this memfd with corresponding memory region, KVM_SET_USER_MEMORY_REGION2 should be invoked, instead of KVM_SET_USER_MEMORY_REGION. Signed-off-by: Xiaofan Xxf <xiaofan.xxf@antgroup.com>	2026-05-20 15:02:03 +08:00
Xiaofan Xxf	2506b24c66	dragonball: Add basic ACPI implementation for TDX boot Add a basic implementation of a few ACPI tables (MADT, FADT and DSDT). Td-shim does not support mptable, and requires the VMM to pass ACPI table contents to the virtual firmware via the HOB list. Note that this PR contains only a minimal implementation, enough for booting a TDX VM. More comprehensive ACPI support may require future updates. Signed-off-by: Xiaofan Xxf <xiaofan.xxf@antgroup.com>	2026-05-20 14:01:47 +08:00
stevenhorsman	6ee43475c3	agent-ctl: Fix CLH virtio-fs queue size configuration After commit `e2240b694a` ("runtime-rs: ch: source virtio-fs queue size from toml"), Cloud Hypervisor no longer provides fallback defaults for virtio-fs queue configuration. When queue_size or queue_num are 0, CH now uses those values directly instead of substituting defaults, which causes a panic in the device manager. The agent-ctl tool was hardcoding queue_size=0 and queue_num=0 in share_fs_utils.rs, relying on CH's fallback behavior. This broke the agent-api tests for Cloud Hypervisor while QEMU tests continued to pass. Fix by reading virtio_fs_queue_size from the hypervisor config and falling back to sensible defaults (1024 queue size, 1 queue) when not configured, matching the previous CH default behavior. Generated-by: IBM Bob Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-05-19 12:05:52 +01:00
Fabiano Fidêncio	ffa59ce3aa	Merge commit from fork runtime: disable virtiofsd extra-args annotation by default	2026-05-19 08:22:12 +02:00
Alex Lyn	8dca734008	Merge pull request #12959 from DataDog/mayeul/fix-race-condition-when-adding-qdisc shim: Add backoff retry to ingress qdisc creation to avoid potential race condition	2026-05-19 14:06:37 +08:00
Aurélien Bombo	e2240b694a	runtime-rs: ch: source virtio-fs queue size from toml Now that `prepare_virtiofs` populates `ShareFsConfig` from `SharedFsInfo.virtio_fs_queue_size`, the CH-side fallback that substitutes `DEFAULT_FS_QUEUE_SIZE` (1024) when the incoming `queue_num`/`queue_size` are zero is no longer needed. Drop it from both `handle_share_fs_device` and `TryFrom<ShareFsSettings> for FsConfig` and use the values straight from the config. Drop the now unused `DEFAULT_FS_QUEUES` and `DEFAULT_FS_QUEUE_SIZE` constants. This also removes a latent bug in both call sites: the previous code gated `queue_size` on `queue_num > 0`, so a user setting only the queue size and not the (currently unconfigurable) queue count would have had their `queue_size` silently overwritten by the default. The CH config template (`configuration-clh-runtime-rs.toml.in`) did not ship the `virtio_fs_queue_size` key (unlike the qemu-runtime-rs templates), so without an explicit override the field would have deserialized to 0 and the fallback would have been the only thing keeping CH working. Add the key to the template, defaulted to `@DEFVIRTIOFSQUEUESIZE@` (1024), matching the qemu-runtime-rs templates. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-05-19 06:14:24 +02:00
Aurélien Bombo	0d5bde2181	runtime-rs: virtio-fs: plumb virtio_fs_queue_size to qemu/CH The shared filesystem device builder in `prepare_virtiofs` was hardcoding `queue_size = 0` and `queue_num = 0` on the `ShareFsConfig` it hands to the hypervisor, ignoring `SharedFsInfo.virtio_fs_queue_size` parsed from `configuration.toml` entirely. For qemu, this is silently broken: the cmdline generator's `DeviceVhostUserFs::set_queue_size` treats 0 as "not set" and skips the `queue-size=` argument when emitting the `vhost-user-fs-pci` device, so QEMU falls back to its built-in default of 128, regardless of what the user configured. For Cloud Hypervisor it happens to work in practice today, but only because `ch::handle_share_fs_device` and `TryFrom<ShareFsSettings> for FsConfig` substitute a hardcoded 1024 when the incoming `queue_num`/`queue_size` are zero. That fallback masks the real bug; the toml value still never reaches the VMM. Add a `get_shared_fs_info` accessor on `DeviceManager` mirroring the existing `get_block_device_info` helper, and use it in `prepare_virtiofs` to populate `ShareFsConfig.queue_size` from `SharedFsInfo.virtio_fs_queue_size`. Use a single virtqueue (`queue_num = 1`), matching what runtime-go hardcodes for both qemu (govmm `QemuFSParams` does not emit `num-queues=`) and CH (`numQueues := int32(1)` in `clh.go`). The CH-side fallback and the CH config template are addressed in a follow-up commit. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-05-19 06:14:24 +02:00
Alex Lyn	e5a7f5b120	Merge pull request #13009 from sebwolf-de/swolf/kata-fc-jailer-pid-leak Fix #13008: runtime/fc track real firecracker PID instead of jailer PID	2026-05-19 11:59:24 +08:00
Alex Lyn	357921df62	Merge pull request #12437 from Apokleos/fix-katactl-exec kata-ctl: Fix failures when kata-ctl exec with short id	2026-05-19 09:13:17 +08:00
Aurélien Bombo	83e20877d8	Merge pull request #12882 from stevenhorsman/runtime-rs/cdh_api_timeout runtime-rs: Add cdh_api_timeout configuration parameter	2026-05-18 15:38:27 -05:00
Sebastian Wolf	26746c9ce8	runtime/fc: track real firecracker PID instead of jailer PID When the jailer is in use (the default for kata-fc), cmd.Process.Pid in fcInit() is the jailer's PID, not firecracker's. The jailer forks + execs firecracker as a separate child and exits. fc.info.PID was therefore stored as the (soon-to-be-dead) jailer PID. At sandbox shutdown, fcEnd() calls WaitLocalProcess(fc.info.PID, SIGTERM, ...). syscall.Kill on the dead jailer PID returns ESRCH, WaitLocalProcess returns nil immediately, and the real firecracker microVM never receives a signal. It gets reparented to init and stays alive indefinitely, holding open resources from the host. Over many container lifecycles this becomes a serious resource leak. Read the real PID from <jailerRoot>/firecracker.pid, which firecracker itself writes after the exec. Update fc.info.PID with that value so all downstream code (fcEnd, Save/Load, kill-0 alive checks, NewProc) operates on the actual firecracker process. Also fix a small adjacent bug in Sandbox.Stop where the per-container teardown loop ignored the force flag, causing any container.stop error to short-circuit Stop before stopVM ran. Signed-off-by: Sebastian Wolf <swolf@nvidia.com>	2026-05-18 21:09:51 +02:00
Fabiano Fidêncio	9044ee22d2	Merge pull request #13024 from SAY-5/fix-typo-occured dragonball: fix typo in VsockEpollListener doc comment	2026-05-18 20:39:33 +02:00
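
Commit 17fadde6d8 above mentions mapping a base device path to its partition path using the kernel naming convention (bare suffix vs 'p' separator). The following is a minimal sketch of that convention only, not the agent's actual code: the function name `partition_path` and its signature are hypothetical, and the step of waiting for the partition node to exist is left out.

```rust
/// Hypothetical helper (not taken from kata-agent): build the path of the
/// N-th partition of a base block device using the Linux naming convention.
fn partition_path(base: &str, partition: u32) -> String {
    // When the base name already ends in a digit (nvme0n1, loop0, mmcblk0),
    // the kernel inserts a 'p' so the partition number stays unambiguous.
    let ends_in_digit = base
        .chars()
        .last()
        .map_or(false, |c| c.is_ascii_digit());

    if ends_in_digit {
        format!("{}p{}", base, partition)
    } else {
        // Names such as vda or sdb take the partition number as a bare suffix.
        format!("{}{}", base, partition)
    }
}

fn main() {
    println!("{}", partition_path("/dev/vda", 2));
    println!("{}", partition_path("/dev/nvme0n1", 2));
}
```

With this rule `/dev/vda` partition 2 maps to `/dev/vda2`, while `/dev/nvme0n1` partition 2 maps to `/dev/nvme0n1p2`.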

6470 Commits
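
The arithmetic described in commits 1ee8bb5740 (NUMA-aware SMP topology) and d0d7deb262 (floor-divided per-node shares, with the last node absorbing the remainder) can be sketched as below. This is an illustration in Rust, not the Go runtime code: the type and function names are hypothetical, `threads = 1` is an assumption, and rounding `max_cpus` up to `sockets * cores * threads` is one way to satisfy the invariant stated in the commit message, not necessarily how the runtime does it.

```rust
/// Hypothetical container for the SMP values named in the commit message.
#[derive(Debug, PartialEq)]
struct SmpTopology {
    sockets: u32,
    cores: u32,
    threads: u32,
    max_cpus: u32,
}

/// Sketch of the described layout: one socket per NUMA node, with cores
/// computed by ceiling division so uneven vCPU counts still fit.
fn cpu_topology(max_vcpus: u32, num_numa_nodes: u32) -> SmpTopology {
    let threads = 1; // assumption: one thread per core
    if num_numa_nodes <= 1 {
        // Flat topology: one socket per vCPU, one core each.
        return SmpTopology { sockets: max_vcpus, cores: 1, threads, max_cpus: max_vcpus };
    }
    // ceil(max_vcpus / num_numa_nodes)
    let cores = (max_vcpus + num_numa_nodes - 1) / num_numa_nodes;
    SmpTopology {
        sockets: num_numa_nodes,
        cores,
        threads,
        // Assumption: keep MaxCPUs == Sockets * Cores * Threads by rounding up.
        max_cpus: num_numa_nodes * cores * threads,
    }
}

/// Sketch of the per-node split: each node gets the floor share and the
/// last node absorbs the remainder.
fn split_across_nodes(total: u32, nodes: u32) -> Vec<u32> {
    assert!(nodes > 0, "at least one NUMA node is required");
    let mut shares = vec![total / nodes; nodes as usize];
    if let Some(last) = shares.last_mut() {
        *last += total % nodes;
    }
    shares
}

fn main() {
    // 9 vCPUs (e.g. 8 workload vCPUs plus the +1 overhead vCPU) on 2 nodes.
    println!("{:?}", cpu_topology(9, 2));
    println!("{:?}", split_across_nodes(9, 2));
}
```

For 9 vCPUs on 2 nodes this gives 2 sockets of 5 cores and a per-node split of 4 and 5 vCPUs.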