kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-07-02 07:02:16 +00:00

Author	SHA1	Message	Date
Zvonko Kaiser	aeadb1af35	Merge pull request #12948 from fidencio/topic/numa runtime (go): agent: Add NUMA support for QEMU	2026-05-25 15:33:14 +02:00
Fabiano Fidêncio	7ddea26137	Merge pull request #13086 from fvichot/flo-kata-monitor-fix kata-monitor: use full URI for connecting to containerd	2026-05-25 10:16:11 +02:00
Fabiano Fidêncio	513d87db7e	Merge pull request #13106 from fidencio/topic/runtime-rs-ensure-bios-is-passed-to-qemu-on-non-CC-cases runtime-rs: qemu: pass -bios for non-confidential guests	2026-05-25 09:56:11 +02:00
Fabiano Fidêncio	407a6946f2	Merge pull request #13077 from hdp617/fix-kata-deploy-build packaging: fix parallel kernel build race and kata-deploy script bugs	2026-05-25 09:53:38 +02:00
Fabiano Fidêncio	f763e9cca9	tests: Add NUMA topology / GPU placement tests to the NV CIs Add k8s-nvidia-numa.bats with five tests that validate NUMA behaviour on hosts where NUMA is configured by default (qemu-nvidia-gpu, qemu-nvidia-gpu-snp, qemu-nvidia-gpu-tdx): 1. Multi-node sandbox (large workload spanning all host NUMA nodes): - Guest NUMA node count matches host - Guest vCPU distribution is balanced across nodes (max-min <= 1) - Guest memory is distributed across NUMA nodes - Host-side vCPU pinning is balanced across NUMA nodes 2. Right-sized single-node sandbox (small workload fitting one node): - Guest collapses to a single NUMA node - All host vCPU threads pinned to that one NUMA node 3. GPU passthrough with VFIO, multi-node: - Guest NUMA topology is balanced (same as test 1) - Guest GPU's NUMA node matches the host GPU's NUMA node (resolved via the vfio-pci,host=<BDF> from the QEMU command line and /sys/bus/pci/devices/<BDF>/numa_node) - QEMU command line contains pxb-pcie and policy=bind - Host vCPU pinning is balanced 4. GPU passthrough with VFIO, right-sized single-node: small workload plus GPU that fits in a single host NUMA node: - Guest collapses to a single NUMA node - The chosen node is the GPU's host NUMA node, not just any node that fits; verified by matching host-nodes= in the memory backend and pxb-pcie numa_node= against the GPU's host node - Guest GPU reports the same NUMA node as the host GPU 5. Explicit numa_mapping in the runtime TOML (QEMU-only): - Drops a config.d/ fragment that sets numa_mapping = ["1"], so the auto-derive + right-sizing path is bypassed entirely - Guest sees exactly 1 NUMA node - QEMU memory backend is bound to host node 1 (host-nodes=1, policy=bind), not host node 0 - Host-side vCPU threads land on host node 1 - Drop-in is removed on teardown so subsequent tests are unaffected Guest-side checks use a dedicated container image (quay.io/kata-containers/numa) that reads sysfs and prints results to stdout; no kubectl exec or CoCo policy overrides needed. Host-side checks (crictl, pgrep, taskset) run directly on the host via sudo; a standalone numa-pinning-check.sh script handles the vCPU thread affinity inspection. The config.d/ helpers used by test 5 are runtime-agnostic (probe Go vs runtime-rs layout on disk) but the test is gated to qemu-* shims since runtime-rs does not yet implement NUMA. Skips cleanly on single-NUMA hosts, unsupported hypervisors, or when no nvidia.com/pgpu resources are available (GPU tests only). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	20705470e9	docs: Add NUMA support guide for Kata Containers with QEMU Add a step-by-step how-to guide covering host inspection, Kata NUMA drop-in setup (via kata-deploy Helm and manual config.d/), pod deployment examples, and guest/host verification procedures. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	8787da13a9	agent: Add NUMA-aware PCI path parsing Extend pcipath_from_dev_tree_path() to support the full NUMA-aware path format "root_complex/bus/device" (e.g. "10/00/02") in addition to the legacy "bus/device" format, defaulting to root complex "00" for backward compatibility. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
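The path-format rule in the commit above (8787da13a9) can be illustrated with a small Python sketch. It is not the agent's Rust code: the function name `parse_pci_path` is hypothetical, and treating every component as exactly two hex digits is an assumption taken from the "10/00/02" example in the message.

```python
import string


def parse_pci_path(path: str) -> tuple:
    """Split a PCI path into (root_complex, bus, device).

    Accepts the NUMA-aware form "root_complex/bus/device" and the legacy
    form "bus/device"; the legacy form defaults to root complex "00".
    """
    parts = path.split("/")
    if len(parts) == 2:
        # Legacy format: no root complex given, fall back to "00".
        parts = ["00"] + parts
    elif len(parts) != 3:
        raise ValueError(f"unexpected PCI path format: {path!r}")

    for part in parts:
        # Assumption: each component is a two-digit hex value.
        if len(part) != 2 or any(c not in string.hexdigits for c in part):
            raise ValueError(f"invalid PCI path component: {part!r}")

    root_complex, bus, device = parts
    return (root_complex, bus, device)
```

With this sketch, a legacy path such as "00/02" resolves to root complex "00", so older callers see the same result as before.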
Fabiano Fidêncio	1cbe930fc9	runtime: Add pxb-pcie NUMA-aware PCIe topology for VFIO devices When NUMA placement is active and VFIO devices are cold-plugged, create a pxb-pcie (PCIe Expander Bridge) per NUMA node that has devices. Each pxb-pcie carries a numa_node property that gives the guest kernel correct NUMA affinity for all PCI devices beneath it. Root ports are created on each pxb-pcie bus instead of pcie.0, and VFIODevice.Attach() assigns each device to the root port on its host NUMA node's pxb bridge. Non-VFIO devices remain on pcie.0. NUMA placement is "active" when there is more than one guest NUMA node OR a single guest node mapped to a specific host node (the latter happens when maybeRightSizeAutoNUMA() collapses a multi-node sandbox to the GPU's host NUMA node). In both cases buildNUMATopology() also emits the matching memory-backend-ram,host-nodes=,policy=bind entries so guest memory is sourced from the right host node. So pxb-pcie can never capture a leaf virtio-pci device as the default bus, every virtio-pci device emitter (NetDevice, VSOCK, vhost-user-{net,scsi,blk,fs}) now appends bus=pcie.0 explicitly when the machine actually exposes a pcie.0 root. Detection is done via a new hasPCIeRoot() helper that returns true only for q35/virt machine types — ppc64le's pseries (pci.0), s390x's s390-ccw-virtio (CCW transport) and microvm (no PCI) intentionally skip the pin to avoid "Bus 'pcie.0' not found" at startup. This is the only QEMU mechanism that works for both regular and confidential (TDX/SNP) guests, as it operates through the PCI bus hierarchy rather than ACPI table injection. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	15292da217	config: Enable NUMA by default for nvidia-gpu configurations Enable enable_numa=true in the three nvidia-gpu QEMU configuration templates (base, SNP, TDX). On single-NUMA hosts this is a no-op since buildNUMATopology() returns nil when there is only one node. On multi-NUMA hosts it ensures GPU memory accesses are NUMA-local. Add documentation to all QEMU config templates explaining the VFIO device NUMA placement validation that occurs when NUMA is enabled. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	feeb5d8ecc	runtime-rs: Fix vCPU pinning race with backoff retry QEMU can report fewer vCPU threads during early startup, causing partial affinity setup. Let's retry with exponential backoff until the expected thread count is visible, then continue with best-effort pinning if the window is exhausted. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	f53f427859	runtime: Fix vCPU pinning race for Go runtime QEMU may not have spawned all vCPU threads when pinning starts, so query_cpus_fast can return an incomplete list and leave some vCPUs unpinned. To fix it, let's add exponential backoff retries before pinning and fall back to available threads if retries are exhausted. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
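The retry strategy described in the two vCPU pinning fixes above (feeb5d8ecc, f53f427859) might look roughly like the following Python sketch. The function name, the `query` callable, and the default attempt count and delays are all hypothetical; the commits do not state the actual values.

```python
import time


def wait_for_vcpu_threads(query, expected, max_attempts=5,
                          initial_delay=0.01, sleep=time.sleep):
    """Poll `query()` until it reports at least `expected` vCPU threads.

    The delay doubles after every unsuccessful attempt (exponential
    backoff). If the retries are exhausted, the last (possibly partial)
    list is returned so the caller can pin on a best-effort basis.
    """
    delay = initial_delay
    threads = query()
    for _ in range(max_attempts):
        if len(threads) >= expected:
            return threads
        sleep(delay)
        delay *= 2  # exponential backoff
        threads = query()
    return threads
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a caller would then pin whatever thread list comes back.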
Fabiano Fidêncio	b688619314	runtime: oci: Fix sandbox CPU sizing with cpuManagerPolicy=static When cpuManagerPolicy=static is configured, kubelet sets the sandbox CPU quota to -1 (unconstrained) because it uses cpuset pinning instead of CFS quota. This causes CalculateSandboxSizing to compute 0 workload CPUs, resulting in the VM starting with only default_vcpus. Fall back to deriving the CPU count from sandbox CPU shares (1024 shares per CPU) when the quota-based calculation yields 0. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
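The fallback in the commit above (b688619314) comes down to a few lines of arithmetic, sketched here in Python. The function name and the choice to round up are assumptions; the message only states the 1024-shares-per-CPU ratio and that the fallback applies when the quota-based result is 0.

```python
SHARES_PER_CPU = 1024  # ratio stated in the commit message


def workload_cpus(quota: int, period: int, shares: int) -> int:
    """Derive the workload CPU count for sandbox sizing.

    With cpuManagerPolicy=static the quota is -1 (unconstrained), so the
    quota-based result is 0 and the CPU shares are used instead.
    """
    cpus = 0
    if quota > 0 and period > 0:
        # Ceiling division without floats (rounding up is an assumption).
        cpus = -(-quota // period)
    if cpus == 0 and shares > 0:
        cpus = -(-shares // SHARES_PER_CPU)
    return cpus
```

For example, a quota of -1 with 4096 shares gives 4 CPUs under these assumptions, rather than leaving the VM at default_vcpus only.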
Fabiano Fidêncio	12e5985dbd	runtime: Add NUMA-aware vCPU pinning and cpuset.mems forwarding Make checkVCPUsPinning() NUMA-aware: when GuestNUMANodes are configured, vCPU threads are pinned to host CPUs belonging to the same NUMA node as the vCPU's guest NUMA node assignment via checkVCPUsPinningNUMA(), preserving memory locality. vCPUs are distributed proportionally across NUMA nodes, matching the distribution in buildNUMATopology(). Stop unconditionally stripping cpuset.mems in constrainGRPCSpec() and container update(). When multi-NUMA is configured, translate host NUMA node IDs to guest NUMA node IDs using translateHostMemsToGuest() before forwarding to the agent. This allows the agent to enforce NUMA-aware memory placement for containers. Filter guest NUMA nodes at VM creation time: before calling CreateVM(), prune GuestNUMANodes to only those whose HostCPUs intersect the sandbox cpuset. This avoids exposing fake NUMA topology to the guest when Kubernetes allocates CPUs from fewer nodes than the host has (e.g. all CPUs from node 0 on a 2-node host), improving memory locality and avoiding unnecessary cross-node memory traffic. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
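Two of the steps in the commit above (12e5985dbd), pruning guest NUMA nodes by cpuset and translating host node IDs to guest node IDs, can be sketched in Python as follows. The data layout is hypothetical, and mapping a host node to the index of its guest node in the pruned list is an assumption; the message does not spell out how translateHostMemsToGuest() numbers the guest nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuestNUMANode:
    host_node: int        # host NUMA node backing this guest node
    host_cpus: frozenset  # host CPU IDs belonging to that host node


def prune_guest_nodes(nodes, sandbox_cpuset):
    """Keep only guest nodes whose host CPUs intersect the sandbox cpuset."""
    return [n for n in nodes if n.host_cpus & set(sandbox_cpuset)]


def translate_host_mems_to_guest(host_mems, nodes):
    """Map host NUMA node IDs (cpuset.mems) to guest NUMA node IDs.

    Assumption: a guest node's ID is its position in `nodes`. Host nodes
    with no guest counterpart are dropped.
    """
    host_to_guest = {n.host_node: i for i, n in enumerate(nodes)}
    return sorted(host_to_guest[h] for h in host_mems if h in host_to_guest)
```

On a 2-node host where Kubernetes allocated CPUs only from node 1, pruning leaves a single guest node, and host node 1 then translates to guest node 0 under this numbering.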
Fabiano Fidêncio	d0d7deb262	runtime: Add host NUMA distance discovery and build guest NUMA topology Add sysfs-based host NUMA distance reading (GetHostNUMADistances) that parses /sys/devices/system/node/nodeN/distance to mirror the host NUMA distance matrix into the guest via -numa dist entries. Implement buildNUMATopology() which translates the GuestNUMANodes configuration into govmm NUMANode and NUMADist slices. Each guest NUMA node gets a floor-divided share of vCPUs and memory, with the last node absorbing any remainder. This handles the common Kata case of +1 VMM overhead vCPU gracefully. Memory backends are selected based on hugepages/virtio-fs/file-backed-mem configuration. Guard multi-NUMA topology generation to amd64 and arm64 only, since other architectures (s390x, riscv64) do not support QEMU NUMA/DIMM. Wire buildNUMATopology() into CreateVM so the QEMU config includes NUMA nodes and distances. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
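The distribution rule stated in the commit above (d0d7deb262), a floor-divided share per node with the last node absorbing the remainder, is shown here as a Python sketch. The function name is hypothetical, and the same helper is assumed to apply to both vCPUs and memory.

```python
def split_across_nodes(total: int, num_nodes: int) -> list:
    """Give each NUMA node total // num_nodes; the last node gets the rest."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    base = total // num_nodes
    shares = [base] * num_nodes
    # The remainder (e.g. the +1 VMM overhead vCPU) lands on the last node.
    shares[-1] += total - base * num_nodes
    return shares
```

With 9 vCPUs (8 for the workload plus the overhead vCPU) on 2 nodes, the split is 4 and 5.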
Fabiano Fidêncio	447e2a3faf	runtime: Add VFIO device NUMA node detection and placement validation Add PCISysFsDevicesNUMANode property and GetPCIDeviceNUMANode() helper to read /sys/bus/pci/devices/<BDF>/numa_node when discovering VFIO devices. Store the result in the new NUMANode field on VFIODev (-1 for unknown/no affinity). Wire NUMA node detection into both GetAllVFIODevicesFromIOMMUGroup() (legacy VFIO path) and GetDeviceFromVFIODev() (IOMMUFD path) so every discovered VFIO device carries its host NUMA node. Add validateVFIODeviceNUMAPlacement() which runs at the end of buildNUMATopology(). It checks every cold-plugged VFIO device's host NUMA node against the guest NUMA topology and logs a warning if a device is on a host NUMA node not covered by any guest NUMA node (indicating potential cross-NUMA memory access overhead), or an info message confirming correct placement. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
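The sysfs lookup described in the commit above (447e2a3faf) can be approximated in Python as below. The function name and the `sysfs_root` parameter are hypothetical (the parameter exists only so the sketch can run against a temporary directory); the path layout and the -1 convention for unknown or no affinity come from the message.

```python
from pathlib import Path


def pci_device_numa_node(bdf: str, sysfs_root: str = "/sys/bus/pci/devices") -> int:
    """Return the host NUMA node of a PCI device, or -1 if unknown.

    Reads <sysfs_root>/<BDF>/numa_node. A missing or unparsable file is
    treated the same as "no affinity".
    """
    try:
        return int((Path(sysfs_root) / bdf / "numa_node").read_text().strip())
    except (OSError, ValueError):
        return -1
```

A caller validating placement would compare this value against the host nodes covered by the guest NUMA topology and warn on a mismatch.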
Fabiano Fidêncio	1ee8bb5740	runtime: Add NUMA-aware SMP topology Make cpuTopology() NUMA-aware by accepting a numNUMANodes parameter. When multiple NUMA nodes are configured, restructure the SMP topology so that Sockets=numNUMA and Cores=ceil(maxvcpus/numNUMA), grouping vCPUs by socket per NUMA node. Use ceiling division so that uneven vCPU counts (e.g. the +1 VMM overhead vCPU that Kata adds) produce a QEMU-valid SMP topology where MaxCPUs == Sockets * Cores * Threads. When numNUMANodes <= 1, the existing flat topology (Sockets=maxvcpus, Cores=1) is preserved. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
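The SMP layout rule from the commit above (1ee8bb5740) is worked through in this Python sketch. The function name and dictionary keys are hypothetical, and two points are assumptions: one thread per core, and MaxCPUs being rounded up to Sockets * Cores * Threads when the vCPU count does not divide evenly.

```python
def cpu_topology(max_vcpus: int, num_numa_nodes: int) -> dict:
    """Compute an SMP topology with one socket per guest NUMA node."""
    threads = 1  # assumption: no SMT in the guest
    if num_numa_nodes <= 1:
        # Existing flat topology: one socket per vCPU.
        sockets, cores = max_vcpus, 1
    else:
        sockets = num_numa_nodes
        # Ceiling division so uneven counts still fit.
        cores = -(-max_vcpus // num_numa_nodes)
    return {
        "sockets": sockets,
        "cores": cores,
        "threads": threads,
        "maxcpus": sockets * cores * threads,
    }
```

For 9 vCPUs on 2 NUMA nodes this yields 2 sockets of 5 cores, so the product is 10 and the QEMU constraint MaxCPUs == Sockets * Cores * Threads holds.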
Fabiano Fidêncio	1e9da61d48	govmm: Add multi-NUMA memory backend and distance matrix support Introduce NUMANode and NUMADist types, add NUMANodes/NUMADists fields to Config, and implement appendMultiNUMAMemoryKnobs() to generate per-node memory-backend objects with host-nodes/policy=bind, -numa node entries with cpus= ranges, and -numa dist entries for the distance matrix. Gate the multi-NUMA path in appendMemoryKnobs() behind isDimmSupported() to ensure architectures without DIMM support (s390x, riscv64) fall back to the single-node path. Drop 386 from isDimmSupported since 32-bit x86 is not a supported Kata target. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	8d2ecaabb5	versions: Bump QEMU to v11.0.0 For more details see QEMU's release notes: https://www.qemu.org/2026/04/22/qemu-11-0-0/ GPU experimental variants are also using v11.0.0 plus one patch to solve issues related to NUMA mapping. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	ed4d0fb51f	runtime-rs: qemu: pass `-bios` for non-confidential guests The `boot_info.firmware` field from the hypervisor configuration is loaded by kata-types and surfaces in the TOML as `firmware = "..."`, but the qemu cmdline generator never consumed it for non-CC guests. Today, `-bios <path>` is only appended via the `Bios` device pushed by `add_{sev,sev_snp,tdx}_protection_device()` in `QemuInner::start_vm()`, which use the firmware copied into the `ProtectionDeviceConfig`. That path is taken only when `confidential_guest = true` and a SEV/SEV-SNP/TDX protection device is configured. For plain Q35 profiles (notably the nvidia-gpu one, which needs OVMF to boot the GPU passthrough VM), the `firmware` set in the TOML was silently dropped and qemu fell back to its default BIOS. Wire `boot_info.firmware` directly in `QemuCmdLine::new()` when no protection device path is going to emit `-bios` (i.e. for non-CC guests). CC paths are left untouched so we don't end up with a duplicated `-bios` argument. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 15:05:26 +02:00
Fabiano Fidêncio	4c1b3312ea	runtime-rs: nvidia-gpu: use _NV firmware substitutions in config template The `configuration-qemu-nvidia-gpu-runtime-rs.toml.in` template was using the generic `@FIRMWAREPATH@` / `@FIRMWAREVOLUMEPATH@` placeholders, which are left empty for the qemu hypervisor in the runtime-rs Makefile. As a result, no firmware (BIOS) was actually passed to qemu when launching a VM with the nvidia-gpu configuration, breaking OVMF based boot. Switch the placeholders to `@FIRMWAREPATH_NV@` / `@FIRMWAREVOLUMEPATH_NV@`, matching the runtime-go nvidia-gpu template and the substitutions exported by the runtime-rs Makefile, so the OVMF firmware path is properly plumbed through to qemu. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 14:59:11 +02:00
Florian Vichot	554e8f91b1	kata-monitor: use full URI for connecting to containerd Without the protocol in the URI, grpc-go defaults to the DNS resolver, which results in an error for unix sockets (`name resolver error: produced zero addresses`). We also remove the `getAddressAndDialer(...)` and `dial(...)` functions, as they are no longer necessary, grpc-go supports connecting to unix sockets directly. This also removes the matching tests. This also adds a `Makefile` and tweaks the Dockerfile to simplify building the Docker image. Fixes #12398 Signed-off-by: Florian Vichot <florian.vichot@gmail.com>	2026-05-23 16:47:46 +02:00
Fabiano Fidêncio	cbcdd999e4	Merge pull request #12957 from Apokleos/fix-sb-api runtime-rs: Fix sandbox-api lifecycle and CRI status handling	2026-05-23 09:26:14 +02:00
Fabiano Fidêncio	a7aa2576c6	Merge pull request #13089 from fidencio/topic/kata-deploy-fix-label-set-on-rke2 kata-deploy: verify kata-runtime label remains stable on rke2/k3s	2026-05-23 08:52:27 +02:00
Fabiano Fidêncio	7faeb9b727	Merge pull request #13091 from kata-containers/dependabot/go_modules/src/runtime/github.com/containerd/containerd-1.7.32 build(deps): bump github.com/containerd/containerd from 1.7.29 to 1.7.32 in /src/runtime	2026-05-23 08:51:36 +02:00
Huy Pham	3ec444a7df	kernel: bump config version Bump the Kata Containers kernel configuration version to 195. Signed-off-by: Huy Pham <huypham@google.com>	2026-05-22 12:26:53 -07:00
Huy Pham	c490373a78	kata-deploy: packaging: fix absolute path resolution in merge script The `kata-deploy-merge-builds.sh` script blindly prepended `PWD` to the `kata_versions_yaml_file` argument, assuming it was always a relative path. However, the `Makefile` passes an absolute path using `$(MK_DIR)`. This resulted in invalid double-concatenated paths like `/workspace/...//workspace/...` which failed to copy. Fix this by using `readlink -f` to safely resolve the path. This correctly handles both relative and absolute paths, preventing path corruption. Signed-off-by: Huy Pham <huypham@google.com>	2026-05-22 12:05:56 -07:00
Fabiano Fidêncio	5d3e1e6396	kata-deploy: verify kata-runtime label remains stable on rke2/k3s The retry loop added in `efd468df3f` still allows the install to declare success while inside the kubelet's post-restart re-register window. On rke2/k3s, `systemctl restart rke2-agent` restarts both containerd and the kubelet, but `wait_till_node_is_ready` polls `.status.conditions[Ready]` every 2 s and returns on the first `True` observation it sees. By default the kubelet only publishes node status every ~10 s, so that first `True` is almost always the stale value from before the restart; the kubelet hasn't actually finished restarting yet. `label_node_with_retry` then applies the label, sleeps 1 s, reads back "true" (still stale, kubelet still down), and returns Ok. Install completes, `/readyz` flips to 200, helm releases its `--wait`, and the bats test starts, and only then does the kubelet finish coming up, re-register the node, and clobber the label with its cached set. The lifecycle test sees an empty `katacontainers.io/kata-runtime` and fails: # Node label katacontainers.io/kata-runtime: not ok 1 Kata artifacts are present on host after install A single-shot verification can't distinguish "still stale true" from "truly stable true after kubelet re-register". Replace it with a stability window: after (re)applying the label, require it to remain at the expected value for STABILITY_CHECKS=6 consecutive observations spaced CHECK_INTERVAL=2 s apart (≈ 12 s, comfortably more than the kubelet's status-update period). If the value ever drifts inside the window, re-apply and restart the stability counter. Bounded by MAX_APPLY_ATTEMPTS=12, so worst case is ~3 min; happy path adds ~12 s to install. Also add a short polling loop to the test's own label assertion as belt-and-suspenders for any leftover transient race, matching the existing retry pattern used for the container-runtime version check. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-22 11:53:18 +02:00
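The stability window described in the commit above (5d3e1e6396) can be sketched in Python as follows. This is not the kata-deploy installer code: the function name and the `read_label` / `apply_label` callables are hypothetical stand-ins for the Kubernetes API calls, while the three default values are the ones quoted in the message.

```python
import time


def ensure_label_stable(read_label, apply_label, expected,
                        stability_checks=6, check_interval=2.0,
                        max_apply_attempts=12, sleep=time.sleep):
    """Apply a node label and require it to stay put for a full window.

    After each (re)apply, the label must match `expected` for
    `stability_checks` consecutive observations spaced `check_interval`
    seconds apart. Any drift restarts the counter with a fresh apply.
    Returns True once a full window passes, False if attempts run out.
    """
    for _ in range(max_apply_attempts):
        apply_label(expected)
        stable = 0
        while stable < stability_checks:
            sleep(check_interval)
            if read_label() != expected:
                break  # drifted inside the window: re-apply and start over
            stable += 1
        else:
            return True  # window completed without drift
    return False
```

A single read-back right after applying cannot tell a stale "true" from a stable one; requiring several spaced observations is what covers the kubelet's status-update period.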
Alex Lyn	adf6d43e24	test: skip TestContainerMemoryUpdate for sandbox api Temporarily skip the `TestContainerMemoryUpdate` test case for sandbox api. This test case is currently skipped in other VMMs (e.g., QEMU, Cloud-Hypervisor) due to known issues and environmental stability concerns. To maintain consistency across the project, we are skipping it for sandbox as well. A follow-up PR will be dedicated to addressing these issues and properly enabling/refining this test case for all VMMs. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:46:44 +08:00
Alex Lyn	b5349f4d78	versions: bump containerd to 2.3 for sandbox API tests containerd 2.3 requires Go 1.26.3, but Kata still pins Go 1.25.10. Use Go 1.26.3 for the sandbox-api job so that make cri-integration can build containerd from source. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:46:16 +08:00
Alex Lyn	9f78dc687f	tests: exclude TestContainerRestart from the cri-containerd test list Creating a new container in the same sandbox VM after the previous container has exited and been removed has never been supported by kata-containers (neither with the go-based nor the rust-based runtime). When the last container is removed the kata VM shuts down, so any attempt to start a new container in the same sandbox fails. This test exercises a use-case kata does not currently support, and it has never been part of the passing list for good reason. Mark it explicitly excluded with a comment so it is clear this is a deliberate omission rather than an oversight. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:45:50 +08:00
Alex Lyn	328fccfbbd	ci: Re-enable run-containerd-sandboxapi job The job was disabled because TestImageLoad was failing when using the shim sandboxer with runc due to a containerd bug (config.json not being written to the bundle directory). Now that check_daemon_setup uses podsandbox for the runc sanity check, the root cause of the failure is worked around on our side and the job can be re-enabled. Also update the runner to ubuntu-24.04. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:45:26 +08:00
Alex Lyn	a7739579d6	tests: Use podsandbox sandboxer for the runc sanity check The check_daemon_setup function verifies that containerd + runc are functional before the real kata tests run. Using the shim sandboxer for this runc check hits a known containerd bug where the OCI spec is not populated before NewBundle is called, so config.json is never written and containerd-shim-runc-v2 fails at startup. See containerd/containerd#11640 The sandboxer choice is irrelevant for this sanity check, so use podsandbox which works correctly with runc. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:44:38 +08:00
Alex Lyn	486f5f9412	runtime-rs: Align sandbox status with CRI expectations Update the sandbox status reporting to align with containerd/CRI requirements. This commit addresses the `State Mapping` issue. Previously, internal state strings were returned, which containerd could not recognize, causing running sandboxes to be misinterpreted as SANDBOX_NOTREADY. This maps internal states to CRI constants: - Running -> SANDBOX_READY - Init | Stopped -> SANDBOX_NOTREADY These changes ensure the sandbox status is both accurately interpreted and fully compliant with the expected interface. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:42:43 +08:00
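The state mapping listed in the commit above (486f5f9412) amounts to a small lookup, sketched here in Python. The function name is hypothetical, and raising on an unrecognised state is an assumption; the message only lists the three internal states shown.

```python
def cri_sandbox_state(internal_state: str) -> str:
    """Map a runtime-internal sandbox state to the CRI constant name."""
    mapping = {
        "Running": "SANDBOX_READY",
        "Init": "SANDBOX_NOTREADY",
        "Stopped": "SANDBOX_NOTREADY",
    }
    try:
        return mapping[internal_state]
    except KeyError:
        # Assumption: states not named in the commit are rejected.
        raise ValueError(f"unknown sandbox state: {internal_state!r}")
```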
Alex Lyn	3f42929e2b	runtime-rs: Update sandbox status to include created_at field Ensure the `created_at` timestamp is correctly propagated in the sandbox status. Although `created_at` is present in the `SandboxStatus` and `SandboxStatusResponse` data structures, it was previously omitted during the status transition. This commit completes the implementation by passing the value recorded during sandbox initialization. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:42:43 +08:00
Alex Lyn	3358c7634b	runtime-rs: Avoid shutting down sandbox on container exit Prevent the sandbox from being prematurely shut down when a standard workload container exits. Previously, the shutdown logic incorrectly triggered a sandbox shutdown whenever the container list became empty. This resulted in unintended lifecycle termination for non-transient sandboxes. This change refines the `need_shutdown_sandbox()` criteria in `virt_container/src/container_manager/manager.rs` to only initiate a shutdown under specific conditions: - The shutdown request is explicit (`req.is_now`). - The request targets the sandbox itself (`req.container_id == self.sid`). By removing the implicit dependency on the empty container list, we ensure the sandbox remains active as expected after workload containers finish execution. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:42:43 +08:00
Alex Lyn	2b980b3a34	runtime-rs: Block WaitSandbox until sandbox exits Rework sandbox waiting so the WaitSandbox path blocks on sandbox lifetime rather than directly borrowing the hypervisor wait call. Once stop has been observed, the cached exit result is returned to later waiters. While the sandbox is still alive, waiters subscribe to the internal stop notifier and sleep until shutdown or VM exit records the final result. Together with the preceding support commits, this keeps the overall behaviour identical to the original WaitSandbox fix while making the dependency chain explicit. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:42:43 +08:00
Alex Lyn	ac2d39fc34	runtime-rs: Add sandbox exit notifier in VirtSandbox Add an internal exit_notify_tx channel to VirtSandbox and initialise it in both the regular and restore constructors. The later WaitSandbox rework needs a way to block until sandbox stop has been observed without polling runtime state. This commit only wires in the notifier so the follow-on behaviour change can subscribe to a dedicated stop signal. No WaitSandbox behaviour changes are made here yet. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:42:43 +08:00
Alex Lyn	116ae66025	runtime-rs: Introduce a cached sandbox exit information Introduce an exit_info field in SandboxInner so sandbox teardown can store a stable exit result in runtime state. The follow-on WaitSandbox rework needs a place to keep the final SandboxExitInfo after the sandbox has already stopped. Without that cached result, later waiters would have no consistent value to return once the original stop event has passed. This change only adds the state holder. Behaviour changes follow in later commits. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:42:43 +08:00
dependabot[bot]	ac77c5fdff	build(deps): bump github.com/containerd/containerd in /src/runtime Bumps [github.com/containerd/containerd](https://github.com/containerd/containerd) from 1.7.29 to 1.7.32. - [Release notes](https://github.com/containerd/containerd/releases) - [Changelog](https://github.com/containerd/containerd/blob/main/RELEASES.md) - [Commits](https://github.com/containerd/containerd/compare/v1.7.29...v1.7.32) --- updated-dependencies: - dependency-name: github.com/containerd/containerd dependency-version: 1.7.32 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>	2026-05-21 21:56:06 +00:00
Huy Pham	ee4f756b75	kata-deploy: packaging: fix buggy return statements in cache check The `install_cached_tarball_component` function in the binaries packaging script contained syntax errors where it attempted to capture the empty stdout of the `cleanup_and_fail` function inside a return statement (e.g., `return "$(cleanup_and_fail ...)"`). Since `cleanup_and_fail` only returns an exit status and produces no stdout, this evaluated to `return ""`, which is invalid in bash and causes the script to crash with `numeric argument required` instead of returning the failure status. Fix this by replacing the buggy inline returns with proper `if` blocks that call `cleanup_and_fail` and explicitly return `1`. Signed-off-by: Huy Pham <huypham@google.com>	2026-05-21 09:21:05 -07:00
Huy Pham	9ddcc53f6f	kernel: build: resolve race condition in parallel config generation During parallel builds of different kernel variants (e.g., generic, debug, nvidia-gpu), the config generation script wrote to a shared static path: `tools/packaging/kernel/configs/fragments/x86_64/.config`. This caused critical race conditions where concurrent processes would overwrite or delete the `.config` file while another process was reading it, leading to sporadic build failures with "No such file or directory" errors. Resolve this by changing the temporary configuration path to be build-specific, writing it inside the unique kernel build directory (e.g., `kata-linux-.../.config.generated`). The final config is still copied to `.config` in the kernel source tree as before, but the intermediate merge process is now isolated. Signed-off-by: Huy Pham <huypham@google.com>	2026-05-21 09:19:45 -07:00
Fabiano Fidêncio	7536f2c616	Merge pull request #13055 from kata-containers/topic/kata-deploy-only-install-what-will-be-used kata-deploy: only install what will actually be used	2026-05-21 17:53:09 +02:00
Fabiano Fidêncio	90799f570d	Merge pull request #13082 from fidencio/topic/fix-docker-time-namespace runtime: drop host time namespace from OCI spec	2026-05-21 17:03:20 +02:00
Fabiano Fidêncio	05f2bfcb0b	runtime-rs: drop unused std::env import in initdata_block tests The tests module imports std::env but never references it, which trips the unused_imports warning during CI builds. Remove the dead import to silence the warning. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-21 13:56:45 +02:00
Fabiano Fidêncio	f9eafb3341	runtime: drop host time namespace from OCI spec Docker 29.5+ adds a private time namespace to container bundles by default, but kata agent only supports the classic namespace set and then fails with "invalid namespace type". Let's strip time namespaces in both the Go and rust runtimes before the spec reaches the agent, matching how network and cgroup namespaces are handled. Fixes: #13080 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-21 13:56:45 +02:00
Steve Horsman	bef049d07e	Merge pull request #13081 from stevenhorsman/cache-tag-updates kata-deploy: always add HEAD commit SHA tag to all builds	2026-05-21 11:15:23 +01:00
Alex Lyn	c919aea448	Merge pull request #13066 from RainaYL/rainax/guest_memfd_pr dragonball: Add implementation for KVM-managed guest memfd	2026-05-21 17:12:44 +08:00
Alex Lyn	0283097e91	Merge pull request #13063 from RainaYL/rainax/acpi_pr dragonball: Add basic ACPI implementation for TDX boot	2026-05-21 17:04:59 +08:00
Fabiano Fidêncio	efd468df3f	kata-deploy: retry node labeling after CRI restart On rke2/k3s a CRI restart also restarts the kubelet, which may briefly re-register the node with its cached label set and clobber the kata-runtime label that was just applied via the API. Replace the single label_node call with a retry loop that verifies the label value after setting it. If the label is missing or has the wrong value, it is re-applied (up to 10 attempts with 2 s back-off). This fixes a race condition that became more visible after the switch to individual tarball extraction, which made install take slightly longer and shifted the kubelet re-registration timing window. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-20 20:52:36 +02:00
Fabiano Fidêncio	291e4d37be	kata-deploy: implement selective tarball extraction in installer Add zstd and tar as Rust dependencies and rewrite the artifact installation logic to extract only the component tarballs required by the enabled runtime classes. extract_component_tarballs reads shim-components.json to determine which kata-static-<name>.tar.zst files are needed for the selected shims and current architecture. Shared components (e.g. kernel, shim-v2-go) are listed by multiple shims and must only be unpacked once per install run. Deduplication is handled with an in-memory set passed through the call, avoiding any risk of stale on-disk state surviving across pod restarts. Within each tarball, opt/kata path prefixes are stripped and absolute symlink / hard-link targets are rewritten to point at the resolved installation directory, correctly handling MULTI_INSTALL_SUFFIX. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-20 20:52:36 +02:00

19125 Commits