Commit Graph

6470 Commits

Author SHA1 Message Date
Dan Mihai
c81dadaba1 Merge pull request #13064 from burgerdev/add-arp-neighbour
agent: use rtnetlink to add ARP neighbour
2026-05-26 09:59:44 -07:00
Fabiano Fidêncio
3dc02a8604 Merge pull request #13085 from Apokleos/erofs-gpt-vmdk-only
runtime-rs: Support erofs snapshotter with gpt vmdk mode
2026-05-25 16:29:59 +02:00
Zvonko Kaiser
aeadb1af35 Merge pull request #12948 from fidencio/topic/numa
runtime (go): agent: Add NUMA support for QEMU
2026-05-25 15:33:14 +02:00
Alex Lyn
2036e66bc3 kata-agent: Integrate GPT partition support into multi-layer handler
In GPT mode, all partitions share the same base block device, so
resolving it once per uevent source and caching the result avoids
redundant hotplug waits that would otherwise scale linearly with
layer count.

Layers are sorted by partition number before mounting to guarantee
correct overlay lowerdir precedence regardless of the order the host
emits Storage entries.

Also remove the dead_code attributes, since this code is now in use.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
17fadde6d8 kata-agent: Add GPT partition utility functions
The guest agent needs to resolve individual partition devices from a
single GPT-partitioned block device, but the kernel does not always
create partition nodes immediately after the base device appears,
especially when another fd holds the device open during hot-plug.

Add utility functions that handle two problems:
(1) Mapping a base device path to its partition path following the
kernel naming convention (bare suffix vs 'p' separator).
(2) Ensuring the partition node exists before mounting.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
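The naming rule in (1) can be sketched as follows. This is a minimal illustration, assuming the usual Linux convention that a 'p' separator is inserted only when the base device name ends in a digit; `partition_path` is a hypothetical name, not the agent's actual function.

```rust
/// Map a base block device path to the path of partition `n`.
/// Hypothetical helper: the name and signature are illustrative only.
fn partition_path(base: &str, n: u32) -> String {
    // Names ending in a digit (nvme0n1, loop0) need a 'p' separator;
    // names ending in a letter (vda, sdb) take a bare numeric suffix.
    let needs_sep = base.chars().last().map_or(false, |c| c.is_ascii_digit());
    if needs_sep {
        format!("{}p{}", base, n)
    } else {
        format!("{}{}", base, n)
    }
}
```

Under this assumption, `/dev/vda` maps to `/dev/vda2` while `/dev/nvme0n1` maps to `/dev/nvme0n1p2`. Waiting for the node to appear, point (2), is not covered here.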
Alex Lyn
8119a561ae kata-agent: Refactor wait_and_mount_layer to return LayerMountInfo
This commit makes no functional change: all callers pass None, so
every call still resolves the device via uevent exactly as before.

It only prepares the multi-layer EROFS handler for GPT partition and
dm-verity support by widening the wait_and_mount_layer() interface
without changing behavior.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
0bd150e5f1 runtime-rs: Integrate GPT+VMDK mode for multi-layer EROFS rootfs
When multiple EROFS layers are present, wrap them into a single
GPT-partitioned virtual disk delivered via one VMDK descriptor and a
single block device hotplug. This significantly reduces PCI bus slot
usage compared with the previous one-device-per-layer approach, which
exhausts virtio-blk slots for large layer counts.

The host detects multi-layer mounts, computes the GPT layout, generates
head metadata plus a VMDK descriptor referencing all EROFS images, and
hot-plugs the composite disk. Per-partition Storage entries are created
with X-kata.gpt-partitioned and X-kata.partition-number options so the
guest agent can resolve each layer to its partition device.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
c3b06af4c7 kata-types: Add gpt_disk module for GPT metadata generation
Introduce gpt_disk.rs to compute GPT partition layouts and generate
metadata files for multi-layer EROFS rootfs. The module creates GPT
head metadata that is combined with EROFS layer images via VMDK
descriptors, presenting a single GPT-partitioned virtual disk to the
guest VM, with each EROFS layer mapped to its own partition.

The layout engine calculates LBA positions for an arbitrary number of
EROFS layers, then writes a full protective-MBR + GPT image and extracts
the head (MBR + primary GPT table) segments as standalone files for
VMDK extent assembly.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
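A rough sketch of the LBA calculation described above, not the gpt_disk.rs implementation. It assumes 512-byte sectors, the standard 128-entry GPT (so the first usable LBA is 34), non-empty layers, and partitions packed back to back with no extra alignment; `layout` and both constants are hypothetical names.

```rust
const SECTOR: u64 = 512;
// With 512-byte sectors: LBA 0 is the protective MBR, LBA 1 the GPT header,
// LBAs 2..=33 the 128-entry partition array, so LBA 34 is the first usable one.
const FIRST_USABLE_LBA: u64 = 34;

/// Return (first_lba, last_lba) for each layer, packed back to back.
/// Hypothetical layout: real code may align partition starts differently.
fn layout(layer_sizes: &[u64]) -> Vec<(u64, u64)> {
    let mut next = FIRST_USABLE_LBA;
    layer_sizes
        .iter()
        .map(|&bytes| {
            let sectors = (bytes + SECTOR - 1) / SECTOR; // round up to whole sectors
            let first = next;
            next += sectors;
            (first, next - 1)
        })
        .collect()
}
```

Each tuple would become one GPT partition entry, and the 34-sector head would be the standalone segment placed before the EROFS images in the VMDK extent list.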
Alex Lyn
148810312d runtime-rs: Refactor VMDK writer and erofs rootfs handling logic
Restructure the erofs rootfs handler to support multi-layer GPT+VMDK
mode where multiple EROFS layers are wrapped into a single virtual
disk with a GPT partition table.

Extract VmdkDescriptorWriter as a reusable struct for atomic VMDK
descriptor generation. Change erofs_storage from Option<Storage> to
Vec<Storage> to hold per-layer metadata, and add GPT metadata path
tracking for proper cleanup with path-traversal guards.

Bump MAX_VIRTIO_BLK_DEVICES from 10 to 127 to accommodate GPT disks
carrying many partitions. Pre-extract mkdir directives from overlay
mounts before the main loop to avoid redundant option parsing.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
7086caaddf kata-agent: Remove unused mode field from MkdirDirective
The previously unused code carrying the dead_code attribute is never
actually used, so remove it entirely.

This removes the mode field from the MkdirDirective structure along
with its relevant test cases.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
39c512bc36 kata-agent: Enhance virtio block matcher to reject partition uevents
Enhance VirtioBlkPciMatcher to only match whole-disk uevents. This
prevents the matcher from incorrectly matching partition uevents
(e.g., /dev/vdaX) which is critical for partitioned disks where
partition uevents appear alongside whole-disk uevents.

This commit eliminates such false matches.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
56f05aa534 kata-agent: Enhance SCSI block device matcher to reject partition uevents
Refactor ScsiBlockMatcher to only match whole-disk uevents. This
prevents the matcher from incorrectly matching partition uevents
(e.g., block/sdd/sdd9) which is critical for partitioned disks
where partition uevents appear alongside whole-disk uevents.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
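The whole-disk check shared by the two matcher commits above can be sketched as below. This assumes the distinction is visible in the uevent devpath, where a whole disk ends in `block/<disk>` and a partition adds one more component; `is_whole_disk` is a hypothetical helper and the real matchers may test other uevent fields.

```rust
/// Return true when a uevent devpath names a whole disk, not a partition.
/// Hypothetical check: a whole disk's devpath ends in `block/<disk>`, while
/// a partition adds one more component, e.g. `block/sdd/sdd9`.
fn is_whole_disk(devpath: &str) -> bool {
    let mut parts = devpath.rsplit('/');
    let _name = parts.next(); // last component: disk or partition name
    parts.next() == Some("block") // its parent must be the `block` directory
}
```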
Fabiano Fidêncio
7ddea26137 Merge pull request #13086 from fvichot/flo-kata-monitor-fix
kata-monitor: use full URI for connecting to containerd
2026-05-25 10:16:11 +02:00
Fabiano Fidêncio
8787da13a9 agent: Add NUMA-aware PCI path parsing
Extend pcipath_from_dev_tree_path() to support the full NUMA-aware path
format "root_complex/bus/device" (e.g. "10/00/02") in addition to the
legacy "bus/device" format, defaulting to root complex "00" for backward
compatibility.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
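The two accepted formats can be illustrated with a small parser. This is a sketch under the assumption that the path is split on '/' and that only the two- and three-component forms are valid; `parse_pci_path` is a hypothetical name and does not reproduce pcipath_from_dev_tree_path().

```rust
/// Split a path into (root_complex, bus, device).
/// Hypothetical parser: the legacy "bus/device" form gets root complex "00".
fn parse_pci_path(path: &str) -> Option<(String, String, String)> {
    let parts: Vec<&str> = path.split('/').collect();
    let (rc, bus, dev) = match parts.as_slice() {
        // Legacy format: default the root complex for backward compatibility.
        [bus, dev] => ("00", *bus, *dev),
        // NUMA-aware format: the first component names the root complex.
        [rc, bus, dev] => (*rc, *bus, *dev),
        _ => return None,
    };
    Some((rc.to_string(), bus.to_string(), dev.to_string()))
}
```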
Fabiano Fidêncio
1cbe930fc9 runtime: Add pxb-pcie NUMA-aware PCIe topology for VFIO devices
When NUMA placement is active and VFIO devices are cold-plugged,
create a pxb-pcie (PCIe Expander Bridge) per NUMA node that has
devices.  Each pxb-pcie carries a numa_node property that gives the
guest kernel correct NUMA affinity for all PCI devices beneath it.

Root ports are created on each pxb-pcie bus instead of pcie.0, and
VFIODevice.Attach() assigns each device to the root port on its host
NUMA node's pxb bridge.  Non-VFIO devices remain on pcie.0.

NUMA placement is "active" when there is more than one guest NUMA
node OR a single guest node mapped to a specific host node (the
latter happens when maybeRightSizeAutoNUMA() collapses a multi-node
sandbox to the GPU's host NUMA node).  In both cases
buildNUMATopology() also emits the matching
memory-backend-ram,host-nodes=,policy=bind entries so guest memory
is sourced from the right host node.

So that pxb-pcie can never capture a leaf virtio-pci device as the
default bus, every virtio-pci device emitter (NetDevice, VSOCK,
vhost-user-{net,scsi,blk,fs}) now appends bus=pcie.0 explicitly when
the machine actually exposes a pcie.0 root.  Detection is done via a
new hasPCIeRoot() helper that returns true only for q35/virt machine
types; ppc64le's pseries (pci.0), s390x's s390-ccw-virtio (CCW
transport) and microvm (no PCI) intentionally skip the pin to avoid
"Bus 'pcie.0' not found" at startup.

This is the only QEMU mechanism that works for both regular and
confidential (TDX/SNP) guests, as it operates through the PCI bus
hierarchy rather than ACPI table injection.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-24 22:00:46 +02:00
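The explicit bus pin can be sketched as follows. The commit's helper is Go; this is a Rust rendering for illustration only, and `has_pcie_root`, `pin_to_root_bus` and the device argument string are hypothetical.

```rust
/// Hypothetical mirror of hasPCIeRoot(): only q35 and virt expose pcie.0.
fn has_pcie_root(machine_type: &str) -> bool {
    matches!(machine_type, "q35" | "virt")
}

/// Append an explicit bus pin to a virtio-pci device argument when the
/// machine type has a pcie.0 root; otherwise leave the argument untouched.
fn pin_to_root_bus(device_arg: &str, machine_type: &str) -> String {
    if has_pcie_root(machine_type) {
        format!("{},bus=pcie.0", device_arg)
    } else {
        device_arg.to_string()
    }
}
```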
Fabiano Fidêncio
15292da217 config: Enable NUMA by default for nvidia-gpu configurations
Enable enable_numa=true in the three nvidia-gpu QEMU configuration
templates (base, SNP, TDX). On single-NUMA hosts this is a no-op since
buildNUMATopology() returns nil when there is only one node. On
multi-NUMA hosts it ensures GPU memory accesses are NUMA-local.

Add documentation to all QEMU config templates explaining the VFIO
device NUMA placement validation that occurs when NUMA is enabled.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
feeb5d8ecc runtime-rs: Fix vCPU pinning race with backoff retry
QEMU can report fewer vCPU threads during early startup, causing partial
affinity setup. Let's retry with exponential backoff until the expected
thread count is visible, then continue with best-effort pinning if the
window is exhausted.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-24 22:00:46 +02:00
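The retry pattern described here looks roughly like the sketch below. The function name, the attempt limit and the delay values are assumptions for illustration; the probe closure stands in for whatever query reports the vCPU thread count.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Poll `probe` until it reports at least `expected` threads, doubling the
/// delay after each miss. Returns the last observed count either way, so the
/// caller can continue with best-effort pinning once attempts run out.
/// Hypothetical helper: names, attempt count and delays are illustrative.
fn wait_for_vcpu_threads<F: FnMut() -> usize>(
    mut probe: F,
    expected: usize,
    max_attempts: u32,
    initial_delay: Duration,
) -> usize {
    let mut delay = initial_delay;
    let mut seen = probe();
    for _ in 1..max_attempts {
        if seen >= expected {
            break;
        }
        sleep(delay);
        delay *= 2; // exponential backoff
        seen = probe();
    }
    seen
}
```

A caller would compare the returned count with the expected one and log a warning before pinning whatever threads are visible.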
Fabiano Fidêncio
f53f427859 runtime: Fix vCPU pinning race for Go runtime
QEMU may not have spawned all vCPU threads when pinning starts, so
query_cpus_fast can return an incomplete list and leave some vCPUs
unpinned. To fix it, let's add exponential backoff retries before
pinning and fall back to available threads if retries are exhausted.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
b688619314 runtime: oci: Fix sandbox CPU sizing with cpuManagerPolicy=static
When cpuManagerPolicy=static is configured, kubelet sets the sandbox
CPU quota to -1 (unconstrained) because it uses cpuset pinning instead
of CFS quota. This causes CalculateSandboxSizing to compute 0 workload
CPUs, resulting in the VM starting with only default_vcpus.

Fall back to deriving the CPU count from sandbox CPU shares (1024
shares per CPU) when the quota-based calculation yields 0.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-24 22:00:46 +02:00
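The fallback arithmetic can be sketched as below. This is not CalculateSandboxSizing itself: `workload_cpus` is a hypothetical function, and rounding up in both branches is an assumption of the sketch.

```rust
/// Estimate workload vCPUs from cgroup CPU settings.
/// Hypothetical helper; rounding up is an assumption of this sketch.
fn workload_cpus(quota: i64, period: u64, shares: u64) -> u64 {
    // Quota-based sizing: a quota of -1 (unconstrained) yields 0.
    let from_quota = if quota > 0 && period > 0 {
        (quota as u64 + period - 1) / period
    } else {
        0
    };
    if from_quota > 0 {
        return from_quota;
    }
    // Fallback: 1024 CPU shares correspond to one CPU.
    (shares + 1023) / 1024
}
```

With a quota of -1 and 4096 shares, the sketch yields 4 workload CPUs instead of 0.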
Fabiano Fidêncio
12e5985dbd runtime: Add NUMA-aware vCPU pinning and cpuset.mems forwarding
Make checkVCPUsPinning() NUMA-aware: when GuestNUMANodes are configured,
vCPU threads are pinned to host CPUs belonging to the same NUMA node as
the vCPU's guest NUMA node assignment via checkVCPUsPinningNUMA(),
preserving memory locality. vCPUs are distributed proportionally across
NUMA nodes, matching the distribution in buildNUMATopology().

Stop unconditionally stripping cpuset.mems in constrainGRPCSpec() and
container update(). When multi-NUMA is configured, translate host NUMA
node IDs to guest NUMA node IDs using translateHostMemsToGuest() before
forwarding to the agent. This allows the agent to enforce NUMA-aware
memory placement for containers.

Filter guest NUMA nodes at VM creation time: before calling CreateVM(),
prune GuestNUMANodes to only those whose HostCPUs intersect the sandbox
cpuset. This avoids exposing fake NUMA topology to the guest when
Kubernetes allocates CPUs from fewer nodes than the host has (e.g. all
CPUs from node 0 on a 2-node host), improving memory locality and
avoiding unnecessary cross-node memory traffic.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
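The pruning step in the last paragraph amounts to a set-intersection filter. A minimal sketch, with the struct, its field names and `prune_nodes` all hypothetical (the commit's code is Go):

```rust
use std::collections::BTreeSet;

/// Guest NUMA node description; field names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
struct GuestNumaNode {
    host_node: u32,
    host_cpus: BTreeSet<u32>,
}

/// Keep only the nodes whose host CPUs intersect the sandbox cpuset.
fn prune_nodes(nodes: &[GuestNumaNode], cpuset: &BTreeSet<u32>) -> Vec<GuestNumaNode> {
    nodes
        .iter()
        .filter(|n| !n.host_cpus.is_disjoint(cpuset))
        .cloned()
        .collect()
}
```

On a 2-node host where the sandbox cpuset only contains CPUs of node 0, the result has a single entry, so the guest sees no second NUMA node.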
Fabiano Fidêncio
d0d7deb262 runtime: Add host NUMA distance discovery and build guest NUMA topology
Add sysfs-based host NUMA distance reading (GetHostNUMADistances) that
parses /sys/devices/system/node/nodeN/distance to mirror the host NUMA
distance matrix into the guest via -numa dist entries.

Implement buildNUMATopology() which translates the GuestNUMANodes
configuration into govmm NUMANode and NUMADist slices. Each guest NUMA
node gets a floor-divided share of vCPUs and memory, with the last node
absorbing any remainder. This handles the common Kata case of +1 VMM
overhead vCPU gracefully. Memory backends are selected based on
hugepages/virtio-fs/file-backed-mem configuration.

Guard multi-NUMA topology generation to amd64 and arm64 only, since
other architectures (s390x, riscv64) do not support QEMU NUMA/DIMM.

Wire buildNUMATopology() into CreateVM so the QEMU config includes NUMA
nodes and distances.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
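The per-node split (floor division, with the last node absorbing the remainder) can be sketched as below. `split_across_nodes` is a hypothetical name; the same rule would apply to vCPUs and to memory.

```rust
/// Split `total` across `nodes` with floor division; the last node
/// absorbs the remainder. Hypothetical helper mirroring the description.
fn split_across_nodes(total: u32, nodes: u32) -> Vec<u32> {
    if nodes == 0 {
        return Vec::new();
    }
    let share = total / nodes;
    let mut out = vec![share; nodes as usize];
    if let Some(last) = out.last_mut() {
        *last += total % nodes;
    }
    out
}
```

For 9 vCPUs (8 plus the VMM overhead vCPU) on 2 nodes this gives 4 and 5.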
Fabiano Fidêncio
447e2a3faf runtime: Add VFIO device NUMA node detection and placement validation
Add PCISysFsDevicesNUMANode property and GetPCIDeviceNUMANode() helper
to read /sys/bus/pci/devices/<BDF>/numa_node when discovering VFIO
devices. Store the result in the new NUMANode field on VFIODev (-1 for
unknown/no affinity).

Wire NUMA node detection into both GetAllVFIODevicesFromIOMMUGroup()
(legacy VFIO path) and GetDeviceFromVFIODev() (IOMMUFD path) so every
discovered VFIO device carries its host NUMA node.

Add validateVFIODeviceNUMAPlacement() which runs at the end of
buildNUMATopology(). It checks every cold-plugged VFIO device's host
NUMA node against the guest NUMA topology and logs a warning if a device
is on a host NUMA node not covered by any guest NUMA node (indicating
potential cross-NUMA memory access overhead), or an info message
confirming correct placement.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
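Reading the sysfs attribute can be sketched as follows. The commit's helper is Go; this Rust rendering is illustrative, both function names are hypothetical, and treating any read or parse failure as -1 is an assumption.

```rust
use std::fs;

/// Parse the content of a sysfs numa_node file; -1 means unknown/no affinity.
fn parse_numa_node(content: &str) -> i32 {
    content.trim().parse::<i32>().unwrap_or(-1)
}

/// Read /sys/bus/pci/devices/<BDF>/numa_node for a device.
/// Hypothetical Rust rendering of the Go helper named in the commit.
fn pci_device_numa_node(bdf: &str) -> i32 {
    let path = format!("/sys/bus/pci/devices/{}/numa_node", bdf);
    fs::read_to_string(path)
        .map(|s| parse_numa_node(&s))
        .unwrap_or(-1)
}
```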
Fabiano Fidêncio
1ee8bb5740 runtime: Add NUMA-aware SMP topology
Make cpuTopology() NUMA-aware by accepting a numNUMANodes parameter.
When multiple NUMA nodes are configured, restructure the SMP topology so
that Sockets=numNUMA and Cores=ceil(maxvcpus/numNUMA), grouping vCPUs by
socket per NUMA node. Use ceiling division so that uneven vCPU counts
(e.g. the +1 VMM overhead vCPU that Kata adds) produce a QEMU-valid SMP
topology where MaxCPUs == Sockets * Cores * Threads.

When numNUMANodes <= 1, the existing flat topology (Sockets=maxvcpus,
Cores=1) is preserved.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
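The topology rule can be written out as a short sketch. `cpu_topology` here is a hypothetical Rust function, and fixing Threads to 1 is an assumption not stated in the commit.

```rust
/// (sockets, cores, threads) for the guest; hypothetical helper.
fn cpu_topology(max_vcpus: u32, num_numa_nodes: u32) -> (u32, u32, u32) {
    if num_numa_nodes <= 1 {
        // Flat topology: one single-core socket per vCPU.
        return (max_vcpus, 1, 1);
    }
    // One socket per NUMA node; ceiling division keeps every vCPU covered.
    let cores = (max_vcpus + num_numa_nodes - 1) / num_numa_nodes;
    (num_numa_nodes, cores, 1)
}
```

With 9 vCPUs on 2 NUMA nodes this gives 2 sockets of 5 cores, so MaxCPUs becomes 2 * 5 * 1 = 10.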
Fabiano Fidêncio
1e9da61d48 govmm: Add multi-NUMA memory backend and distance matrix support
Introduce NUMANode and NUMADist types, add NUMANodes/NUMADists fields to
Config, and implement appendMultiNUMAMemoryKnobs() to generate per-node
memory-backend objects with host-nodes/policy=bind, -numa node entries
with cpus= ranges, and -numa dist entries for the distance matrix.

Gate the multi-NUMA path in appendMemoryKnobs() behind isDimmSupported()
to ensure architectures without DIMM support (s390x, riscv64) fall back
to the single-node path. Drop 386 from isDimmSupported since 32-bit x86
is not a supported Kata target.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
ed4d0fb51f runtime-rs: qemu: pass -bios for non-confidential guests
The `boot_info.firmware` field from the hypervisor configuration is
loaded by kata-types and surfaces in the TOML as `firmware = "..."`,
but the qemu cmdline generator never consumed it for non-CC guests.

Today, `-bios <path>` is only appended via the `Bios` device pushed by
`add_{sev,sev_snp,tdx}_protection_device()` in
`QemuInner::start_vm()`, which use the firmware copied into the
`ProtectionDeviceConfig`. That path is taken only when
`confidential_guest = true` and a SEV/SEV-SNP/TDX protection device is
configured. For plain Q35 profiles (notably the nvidia-gpu one, which
needs OVMF to boot the GPU passthrough VM), the `firmware` set in the
TOML was silently dropped and qemu fell back to its default BIOS.

Wire `boot_info.firmware` directly in `QemuCmdLine::new()` when no
protection device path is going to emit `-bios` (i.e. for non-CC
guests). CC paths are left untouched so we don't end up with a
duplicated `-bios` argument.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-24 15:05:26 +02:00
Fabiano Fidêncio
4c1b3312ea runtime-rs: nvidia-gpu: use _NV firmware substitutions in config template
The `configuration-qemu-nvidia-gpu-runtime-rs.toml.in` template was using
the generic `@FIRMWAREPATH@` / `@FIRMWAREVOLUMEPATH@` placeholders, which
are left empty for the qemu hypervisor in the runtime-rs Makefile. As a
result, no firmware (BIOS) was actually passed to qemu when launching a
VM with the nvidia-gpu configuration, breaking OVMF based boot.

Switch the placeholders to `@FIRMWAREPATH_NV@` / `@FIRMWAREVOLUMEPATH_NV@`,
matching the runtime-go nvidia-gpu template and the substitutions
exported by the runtime-rs Makefile, so the OVMF firmware path is
properly plumbed through to qemu.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-24 14:59:11 +02:00
Florian Vichot
554e8f91b1 kata-monitor: use full URI for connecting to containerd
Without the protocol in the URI, grpc-go defaults to the DNS resolver,
which results in an error for unix sockets (`name resolver error: produced
zero addresses`).

We also remove the `getAddressAndDialer(...)` and `dial(...)` functions, as
they are no longer necessary: grpc-go supports connecting to unix sockets
directly. This also removes the matching tests.

This also adds a `Makefile` and tweaks the Dockerfile to simplify building
the Docker image.

Fixes #12398

Signed-off-by: Florian Vichot <florian.vichot@gmail.com>
2026-05-23 16:47:46 +02:00
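The idea of always handing the gRPC client a full URI can be sketched as below. kata-monitor is written in Go and may simply build the URI at the call site; `containerd_target` is a hypothetical helper shown in Rust for illustration.

```rust
/// Prefix a bare socket path with the unix scheme; leave full URIs alone.
/// Hypothetical helper, not taken from kata-monitor.
fn containerd_target(addr: &str) -> String {
    if addr.contains("://") {
        addr.to_string()
    } else {
        format!("unix://{}", addr)
    }
}
```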
Fabiano Fidêncio
cbcdd999e4 Merge pull request #12957 from Apokleos/fix-sb-api
runtime-rs: Fix sandbox-api lifecycle and CRI status handling
2026-05-23 09:26:14 +02:00
Alex Lyn
486f5f9412 runtime-rs: Align sandbox status with CRI expectations
Update the sandbox status reporting to align with containerd/CRI
requirements. This commit addresses the `State Mapping` issue.

Previously, internal state strings were returned, which containerd
could not recognize, causing running sandboxes to be misinterpreted
as SANDBOX_NOTREADY. This maps internal states to CRI constants:
- Running -> SANDBOX_READY
- Init | Stopped -> SANDBOX_NOTREADY

These changes ensure the sandbox status is both accurately interpreted
and fully compliant with the expected interface.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-22 10:42:43 +08:00
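The mapping listed above is a plain match. A minimal sketch, with the enum and function names hypothetical and the CRI constants represented as strings:

```rust
/// Internal sandbox states named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SandboxState {
    Init,
    Running,
    Stopped,
}

/// Map an internal state to the CRI constant containerd expects.
fn cri_state(state: SandboxState) -> &'static str {
    match state {
        SandboxState::Running => "SANDBOX_READY",
        SandboxState::Init | SandboxState::Stopped => "SANDBOX_NOTREADY",
    }
}
```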
Alex Lyn
3f42929e2b runtime-rs: Update sandbox status to include created_at field
Ensure the `created_at` timestamp is correctly propagated in
the sandbox status.

Although `created_at` is present in the `SandboxStatus` and
`SandboxStatusResponse` data structures, it was previously
omitted during the status transition.

This commit completes the implementation by passing the value
recorded during sandbox initialization.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-22 10:42:43 +08:00
Alex Lyn
3358c7634b runtime-rs: Avoid shutting down sandbox on container exit
Prevent the sandbox from being prematurely shut down when a standard
workload container exits.

Previously, the shutdown logic incorrectly triggered a sandbox shutdown
whenever the container list became empty. This resulted in unintended
lifecycle termination for non-transient sandboxes.

This change refines the `need_shutdown_sandbox()` criteria in
`virt_container/src/container_manager/manager.rs` to only initiate a
shutdown under specific conditions:
- The shutdown request is explicit (`req.is_now`).
- The request targets the sandbox itself (`req.container_id ==
  self.sid`).

By removing the implicit dependency on the empty container list, we
ensure the sandbox remains active as expected after workload containers
finish execution.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-22 10:42:43 +08:00
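The refined criteria can be sketched as a predicate. This assumes either condition alone is sufficient; the request struct is a hypothetical stand-in that only carries the two fields named above.

```rust
/// Hypothetical stand-in for the shutdown request; only the two fields
/// mentioned in the commit message are modelled.
struct ShutdownRequest {
    container_id: String,
    is_now: bool,
}

/// Shut the sandbox down only on an explicit request or when the request
/// targets the sandbox itself. An empty container list is no longer a trigger.
fn need_shutdown_sandbox(sid: &str, req: &ShutdownRequest) -> bool {
    req.is_now || req.container_id == sid
}
```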
Alex Lyn
2b980b3a34 runtime-rs: Block WaitSandbox until sandbox exits
Rework sandbox waiting so the WaitSandbox path blocks on sandbox
lifetime rather than directly borrowing the hypervisor wait call.

Once stop has been observed, the cached exit result is returned to
later waiters. While the sandbox is still alive, waiters subscribe to
the internal stop notifier and sleep until shutdown or VM exit records
the final result.

Together with the preceding support commits, this keeps the overall
behaviour identical to the original WaitSandbox fix while making the
dependency chain explicit.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-22 10:42:43 +08:00
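The wait semantics (block while alive, return the cached result afterwards) can be sketched with standard-library primitives. This swaps in a `Mutex` and `Condvar` for whatever notifier the runtime actually uses; `ExitState` and its methods are hypothetical and the exit result is reduced to an `i32`.

```rust
use std::sync::{Condvar, Mutex};

/// Cached exit result plus a notifier; all names here are hypothetical.
struct ExitState {
    exit_code: Mutex<Option<i32>>,
    stopped: Condvar,
}

impl ExitState {
    fn new() -> Self {
        ExitState {
            exit_code: Mutex::new(None),
            stopped: Condvar::new(),
        }
    }

    /// Record the final result once and wake every waiter.
    fn record_exit(&self, code: i32) {
        let mut slot = self.exit_code.lock().unwrap();
        if slot.is_none() {
            *slot = Some(code);
        }
        self.stopped.notify_all();
    }

    /// Block until an exit result exists, then return the cached value.
    fn wait(&self) -> i32 {
        let mut slot = self.exit_code.lock().unwrap();
        loop {
            if let Some(code) = *slot {
                return code;
            }
            slot = self.stopped.wait(slot).unwrap();
        }
    }
}
```

Waiters that arrive after the stop has been recorded return immediately with the same cached value, which is the behaviour the commit describes for later waiters.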
Alex Lyn
ac2d39fc34 runtime-rs: Add sandbox exit notifier in VirtSandbox
Add an internal exit_notify_tx channel to VirtSandbox and initialise
it in both the regular and restore constructors.

The later WaitSandbox rework needs a way to block until sandbox stop
has been observed without polling runtime state. This commit only
wires in the notifier so the follow-on behaviour change can subscribe
to a dedicated stop signal.

No WaitSandbox behaviour changes are made here yet.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-22 10:42:43 +08:00
Alex Lyn
116ae66025 runtime-rs: Introduce a cached sandbox exit information
Introduce an exit_info field in SandboxInner so sandbox teardown can
store a stable exit result in runtime state.

The follow-on WaitSandbox rework needs a place to keep the final
SandboxExitInfo after the sandbox has already stopped. Without that
cached result, later waiters would have no consistent value to return
once the original stop event has passed.

This change only adds the state holder. Behaviour changes follow in
later commits.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-22 10:42:43 +08:00
dependabot[bot]
ac77c5fdff build(deps): bump github.com/containerd/containerd in /src/runtime
Bumps [github.com/containerd/containerd](https://github.com/containerd/containerd) from 1.7.29 to 1.7.32.
- [Release notes](https://github.com/containerd/containerd/releases)
- [Changelog](https://github.com/containerd/containerd/blob/main/RELEASES.md)
- [Commits](https://github.com/containerd/containerd/compare/v1.7.29...v1.7.32)

---
updated-dependencies:
- dependency-name: github.com/containerd/containerd
  dependency-version: 1.7.32
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-05-21 21:56:06 +00:00
Fabiano Fidêncio
05f2bfcb0b runtime-rs: drop unused std::env import in initdata_block tests
The tests module imports std::env but never references it, which trips
the unused_imports warning during CI builds. Remove the dead import to
silence the warning.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-21 13:56:45 +02:00
Fabiano Fidêncio
f9eafb3341 runtime: drop host time namespace from OCI spec
Docker 29.5+ adds a private time namespace to container bundles by
default, but kata agent only supports the classic namespace set and
then fails with "invalid namespace type".

Let's strip time namespaces in both the Go and Rust runtimes before the
spec reaches the agent, matching how network and cgroup namespaces are
handled.

Fixes: #13080

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-21 13:56:45 +02:00
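Stripping the namespace is a simple filter over the spec's namespace list. A minimal sketch, assuming a simplified `Namespace` struct rather than the real OCI spec types; both names are hypothetical.

```rust
/// Simplified stand-in for an OCI Linux namespace entry (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct Namespace {
    kind: String,
    path: Option<String>,
}

/// Drop every time namespace before the spec is sent to the agent.
fn strip_time_namespaces(namespaces: &mut Vec<Namespace>) {
    namespaces.retain(|ns| ns.kind != "time");
}
```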
Alex Lyn
c919aea448 Merge pull request #13066 from RainaYL/rainax/guest_memfd_pr
dragonball: Add implementation for KVM-managed guest memfd
2026-05-21 17:12:44 +08:00
Xiaofan Xxf
62af158842 dragonball: Add implementation for KVM-managed guest memfd
A TDX VM requires that guest memfd be managed by KVM, so that
KVM is able to toggle the memory attribute for the region between
shared and private. Therefore, only anonymous guest memory is allowed
for a TDX VM, and the KVM-managed memfd should be created with the
KVM_CREATE_GUEST_MEMFD ioctl instead of the memfd_create system
call. Also, in order to bind this memfd to the corresponding
memory region, KVM_SET_USER_MEMORY_REGION2 should be invoked
instead of KVM_SET_USER_MEMORY_REGION.

Signed-off-by: Xiaofan Xxf <xiaofan.xxf@antgroup.com>
2026-05-20 15:02:03 +08:00
Xiaofan Xxf
2506b24c66 dragonball: Add basic ACPI implementation for TDX boot
Add a basic implementation of a few ACPI tables (MADT, FADT and
DSDT). Td-shim does not support mptable, and requires the VMM to pass
ACPI table contents to the virtual firmware via the HOB list.

Note that this PR contains only the minimal implementation needed
to boot a TDX VM. More comprehensive ACPI support may require
future updates.

Signed-off-by: Xiaofan Xxf <xiaofan.xxf@antgroup.com>
2026-05-20 14:01:47 +08:00
stevenhorsman
6ee43475c3 agent-ctl: Fix CLH virtio-fs queue size configuration
After commit e2240b694a ("runtime-rs: ch: source virtio-fs queue size
from toml"), Cloud Hypervisor no longer provides fallback defaults for
virtio-fs queue configuration. When queue_size or queue_num are 0, CH
now uses those values directly instead of substituting defaults, which
causes a panic in the device manager.

The agent-ctl tool was hardcoding queue_size=0 and queue_num=0 in
share_fs_utils.rs, relying on CH's fallback behavior. This broke the
agent-api tests for Cloud Hypervisor while QEMU tests continued to pass.

Fix by reading virtio_fs_queue_size from the hypervisor config and
falling back to sensible defaults (1024 queue size, 1 queue) when not
configured, matching the previous CH default behavior.

Generated-by: IBM Bob

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-05-19 12:05:52 +01:00
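The fallback described in the last paragraph can be sketched as below. The constants use the values stated in the commit message (1024 queue size, 1 queue); the function and constant names are hypothetical and do not come from share_fs_utils.rs.

```rust
const DEFAULT_QUEUE_SIZE: u32 = 1024;
const DEFAULT_QUEUE_NUM: u32 = 1;

/// Return (queue_size, queue_num), substituting the default size when the
/// hypervisor config leaves virtio_fs_queue_size unset (0).
fn virtio_fs_queues(configured_size: u32) -> (u32, u32) {
    let size = if configured_size == 0 {
        DEFAULT_QUEUE_SIZE
    } else {
        configured_size
    };
    (size, DEFAULT_QUEUE_NUM)
}
```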
Fabiano Fidêncio
ffa59ce3aa Merge commit from fork
runtime: disable virtiofsd extra-args annotation by default
2026-05-19 08:22:12 +02:00
Alex Lyn
8dca734008 Merge pull request #12959 from DataDog/mayeul/fix-race-condition-when-adding-qdisc
shim: Add backoff retry to ingress qdisc creation to avoid potential race condition
2026-05-19 14:06:37 +08:00
Aurélien Bombo
e2240b694a runtime-rs: ch: source virtio-fs queue size from toml
Now that `prepare_virtiofs` populates `ShareFsConfig` from
`SharedFsInfo.virtio_fs_queue_size`, the CH-side fallback that
substitutes `DEFAULT_FS_QUEUE_SIZE` (1024) when the incoming
`queue_num`/`queue_size` are zero is no longer needed. Drop it from
both `handle_share_fs_device` and `TryFrom<ShareFsSettings> for
FsConfig` and use the values straight from the config. Drop the now
unused `DEFAULT_FS_QUEUES` and `DEFAULT_FS_QUEUE_SIZE` constants.

This also removes a latent bug in both call sites: the previous code
gated `queue_size` on `queue_num > 0`, so a user setting only the
queue size and not the (currently unconfigurable) queue count would
have had their `queue_size` silently overwritten by the default.

The CH config template (`configuration-clh-runtime-rs.toml.in`) did
not ship the `virtio_fs_queue_size` key (unlike the qemu-runtime-rs
templates), so without an explicit override the field would have
deserialized to 0 and the fallback would have been the only thing
keeping CH working. Add the key to the template, defaulted to
`@DEFVIRTIOFSQUEUESIZE@` (1024), matching the qemu-runtime-rs
templates.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-05-19 06:14:24 +02:00
Aurélien Bombo
0d5bde2181 runtime-rs: virtio-fs: plumb virtio_fs_queue_size to qemu/CH
The shared filesystem device builder in `prepare_virtiofs` was
hardcoding `queue_size = 0` and `queue_num = 0` on the `ShareFsConfig`
it hands to the hypervisor, ignoring `SharedFsInfo.virtio_fs_queue_size`
parsed from `configuration.toml` entirely.

For qemu, this is silently broken: the cmdline generator's
`DeviceVhostUserFs::set_queue_size` treats 0 as "not set" and skips the
`queue-size=` argument when emitting the `vhost-user-fs-pci` device, so
QEMU falls back to its built-in default of 128, regardless of what the
user configured.

For Cloud Hypervisor it happens to work in practice today, but only
because `ch::handle_share_fs_device` and `TryFrom<ShareFsSettings> for
FsConfig` substitute a hardcoded 1024 when the incoming
`queue_num`/`queue_size` are zero. That fallback masks the real bug; the
toml value still never reaches the VMM.

Add a `get_shared_fs_info` accessor on `DeviceManager` mirroring the
existing `get_block_device_info` helper, and use it in
`prepare_virtiofs` to populate `ShareFsConfig.queue_size` from
`SharedFsInfo.virtio_fs_queue_size`. Use a single virtqueue
(`queue_num = 1`), matching what runtime-go hardcodes for both qemu
(govmm `QemuFSParams` does not emit `num-queues=`) and CH
(`numQueues := int32(1)` in `clh.go`).

The CH-side fallback and the CH config template are addressed in a
follow-up commit.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-05-19 06:14:24 +02:00
Alex Lyn
e5a7f5b120 Merge pull request #13009 from sebwolf-de/swolf/kata-fc-jailer-pid-leak
Fix #13008: runtime/fc track real firecracker PID instead of jailer PID
2026-05-19 11:59:24 +08:00
Alex Lyn
357921df62 Merge pull request #12437 from Apokleos/fix-katactl-exec
kata-ctl: Fix failures when kata-ctl exec with short id
2026-05-19 09:13:17 +08:00
Aurélien Bombo
83e20877d8 Merge pull request #12882 from stevenhorsman/runtime-rs/cdh_api_timeout
runtime-rs: Add cdh_api_timeout configuration parameter
2026-05-18 15:38:27 -05:00
Sebastian Wolf
26746c9ce8 runtime/fc: track real firecracker PID instead of jailer PID
When the jailer is in use (the default for kata-fc), cmd.Process.Pid in
fcInit() is the jailer's PID, not firecracker's. The jailer forks +
execs firecracker as a separate child and exits. fc.info.PID was
therefore stored as the (soon-to-be-dead) jailer PID.

At sandbox shutdown, fcEnd() calls WaitLocalProcess(fc.info.PID, SIGTERM, ...).
syscall.Kill on the dead jailer PID returns ESRCH, WaitLocalProcess
returns nil immediately, and the real firecracker microVM never
receives a signal. It gets reparented to init and stays alive
indefinitely, holding open resources from the host. Over many
container lifecycles this becomes a serious resource leak.

Read the real PID from <jailerRoot>/firecracker.pid, which firecracker
itself writes after the exec. Update fc.info.PID with that value so all
downstream code (fcEnd, Save/Load, kill-0 alive checks, NewProc) operates
on the actual firecracker process.

Also fix a small adjacent bug in Sandbox.Stop where the per-container
teardown loop ignored the force flag, causing any container.stop error
to short-circuit Stop before stopVM ran.

Signed-off-by: Sebastian Wolf <swolf@nvidia.com>
2026-05-18 21:09:51 +02:00
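Reading the PID file can be sketched as follows. The actual fix lives in the Go runtime; this Rust rendering is illustrative, the function names are hypothetical, and rejecting non-positive values is an assumption of the sketch.

```rust
use std::fs;
use std::path::Path;

/// Parse a PID file's content; reject empty, malformed or non-positive values.
fn parse_pid(content: &str) -> Option<i32> {
    content.trim().parse::<i32>().ok().filter(|pid| *pid > 0)
}

/// Read the real firecracker PID from <jailerRoot>/firecracker.pid.
/// Hypothetical helper: returns None if the file is missing or invalid.
fn read_firecracker_pid(jailer_root: &Path) -> Option<i32> {
    let content = fs::read_to_string(jailer_root.join("firecracker.pid")).ok()?;
    parse_pid(&content)
}
```

A caller would overwrite the stored jailer PID only when this returns a value, since the file appears only after the jailer has exec'd firecracker.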
Fabiano Fidêncio
9044ee22d2 Merge pull request #13024 from SAY-5/fix-typo-occured
dragonball: fix typo in VsockEpollListener doc comment
2026-05-18 20:39:33 +02:00