Commit Graph

6495 Commits

Author SHA1 Message Date
manuelh-dev
953b306ff3 Merge pull request #12979 from manuelh-dev/mahuber/erofs-tmpfs-mount
runtime-rs/agent: support EROFS snapshots without a rwlayer
2026-05-29 13:50:27 -07:00
Aurélien Bombo
9acef4bc55 Merge pull request #13133 from microsoft/cameronbaird/upstream/revert-macvtap-simple
Revert "runtime: Enforce >= 1 queue pairs for tapNetworkPair"
2026-05-29 14:57:07 -05:00
Cameron Baird
7a9d207ab2 Revert "runtime: Enforce >= 1 queue pairs for tapNetworkPair"
This reverts commit 2799f7d36b.
2026-05-29 17:05:40 +00:00
Fabiano Fidêncio
10e70a2a9f runtime-rs: expose InfiniBand devices to VFIO containers
The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a
PCIDEVICE_* environment variable; it does not add the VFIO char device
to linux.devices in the OCI spec.  As a result the agent's
container_has_vfio_device() gate stays closed and
expose_guest_infiniband_devices() is never triggered — leaving
/dev/infiniband absent from the container even though the guest kernel
created the IB devices (mlx5_core.rdma.0 probes successfully).

The cold_plug_bdfs map (host_bdf → guest_pci_path, built from network
endpoints via host_bdf()) was already present inside handler_devices()
but could never be consumed because the LinuxDeviceType::C loop has
no entries to iterate over when linux.devices is empty.

After that loop, iterate over any unmatched cold-plug BDFs, derive the
VFIO group path via bdf_to_vfio_group_path() (reads
/sys/bus/pci/devices/<bdf>/iommu_group), and push a vfio-pci-gk
ContainerDevice.  The vfio_group_to_bdf() short-circuit inside the
loop handles the case where the device plugin does add VFIO char
devices to linux.devices; it now supports both legacy (/dev/vfio/N)
and iommufd (/dev/vfio/devices/vfioN) path formats.
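
The two path formats named above can be told apart with a small classifier. This is a minimal Rust sketch, not the actual vfio_group_to_bdf() code; the `VfioNode` enum and `classify_vfio_path` function are hypothetical names introduced only for illustration.

```rust
/// Which VFIO device-node layout a path uses (hypothetical type).
#[derive(Debug, PartialEq)]
enum VfioNode {
    /// Legacy group node: /dev/vfio/N
    LegacyGroup(u32),
    /// iommufd cdev node: /dev/vfio/devices/vfioN
    IommufdCdev(u32),
}

/// Classify a VFIO char-device path; returns None for anything else.
fn classify_vfio_path(path: &str) -> Option<VfioNode> {
    // Check the longer iommufd prefix first, since it also starts with /dev/vfio/.
    if let Some(rest) = path.strip_prefix("/dev/vfio/devices/vfio") {
        return rest.parse::<u32>().ok().map(VfioNode::IommufdCdev);
    }
    let rest = path.strip_prefix("/dev/vfio/")?;
    rest.parse::<u32>().ok().map(VfioNode::LegacyGroup)
}

fn main() {
    for p in ["/dev/vfio/42", "/dev/vfio/devices/vfio3", "/dev/vfio/vfio"] {
        println!("{} -> {:?}", p, classify_vfio_path(p));
    }
}
```

The control node /dev/vfio/vfio is neither format, so the sketch returns None for it.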

Add host_bdf() to the Endpoint trait (default: None) so that
PhysicalEndpoint can expose its BDF for the cold_plug_bdfs map.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
60f2878c68 runtime-rs: call network.remove() during resource cleanup
network.remove() — which detaches endpoints and rebinds VFs from
vfio-pci back to the host driver — was never being called.
ResourceManagerInner::cleanup() handled cgroups, bindmounts, share-fs,
swap and ephemeral disks, but completely omitted the network teardown.

Call network.remove() at the start of cleanup(), using the already-held
self.hypervisor reference.  Errors are logged as warnings rather than
propagated, so they don't block the rest of the cleanup sequence.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
0b4b51dff6 runtime-rs: always detach endpoints on network removal
network_with_netns::remove() bailed out early when network_created=false
(i.e. the netns was created by the CNI, not by kata). This caused
physical endpoint VFs to remain bound to vfio-pci after pod deletion,
because PhysicalEndpoint::detach() — which calls bind_device_to_host()
to rebind the VF from vfio-pci back to mlx5_core — was never reached.

Separate endpoint detachment from netns deletion: always detach
endpoints, but only remove the netns if kata created it.  Detach errors
are logged as warnings rather than propagated, to mirror the Go runtime's
best-effort approach and avoid blocking sandbox teardown.
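
The separation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the network_with_netns code: the `Endpoint` trait here is a simplified stand-in, and `FakeEndpoint` and `remove_network` are hypothetical names.

```rust
/// Hypothetical stand-in for the runtime's Endpoint trait.
trait Endpoint {
    fn name(&self) -> &str;
    fn detach(&self) -> Result<(), String>;
}

/// Test double; `fail` forces detach() to return an error.
struct FakeEndpoint {
    name: String,
    fail: bool,
}

impl Endpoint for FakeEndpoint {
    fn name(&self) -> &str {
        &self.name
    }
    fn detach(&self) -> Result<(), String> {
        if self.fail {
            Err("rebind to host driver failed".to_string())
        } else {
            Ok(())
        }
    }
}

/// Detach every endpoint, collecting failures as warnings instead of
/// returning early. The bool reports whether the netns should be deleted,
/// which is only the case when the runtime created it.
fn remove_network(endpoints: &[Box<dyn Endpoint>], network_created: bool) -> (Vec<String>, bool) {
    let mut warnings = Vec::new();
    for ep in endpoints {
        if let Err(err) = ep.detach() {
            warnings.push(format!("failed to detach {}: {}", ep.name(), err));
        }
    }
    (warnings, network_created)
}

fn main() {
    let endpoints: Vec<Box<dyn Endpoint>> = vec![
        Box::new(FakeEndpoint { name: "eth0".to_string(), fail: false }),
        Box::new(FakeEndpoint { name: "vf0".to_string(), fail: true }),
    ];
    let (warnings, delete_netns) = remove_network(&endpoints, false);
    println!("warnings={:?} delete_netns={}", warnings, delete_netns);
}
```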

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
be2ec02c9a runtime-rs: resolve cold-plug VFIO guest PCI path via QMP
The PCIe topology pre-computes a wrong path for cold-plugged physical-
endpoint VFs because the root port has no explicit addr and QEMU auto-
assigns its slot. The pre-computed PciPath { slots: [PciSlot(0)] }
resolves to 0000:00:00.0 (the Q35 MCH), causing
wait_for_pci_net_interface to time out looking for a netdev there.

Add resolve_vfio_device_pci_path(hostdev_id) to the Hypervisor trait.
Implement it in QemuInner using qmp.get_device_by_qdev_id(), which
queries QEMU's query-pci to find the full guest PCIe path (e.g. "05/00"
= slot 5 on pcie.0 / slot 0 on the root port bus).

Store the QEMU device ID (hostdev_id) in PhysicalEndpoint during
attach(). Add vfio_hostdev_id() and set_guest_pci_path() to the
Endpoint trait and add an endpoints() accessor to the Network trait.

In setup_after_start_vm(), call resolve_physical_endpoint_pci_paths()
before apply_network_to_agent() to populate the correct path from QMP
into each PhysicalEndpoint's guest_pci_path field. The field is then
consumed by network_with_netns::interfaces() to fill Interface.device_path
before update_interface is sent to the agent.

This is the runtime-rs counterpart of the Go runtime's
ResolveColdPlugVFIOGuestPciPaths / qomGetPciPath.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
f8ee9133e5 runtime-rs: populate device_path for cold-plug VFIO physical endpoints
Without device_path the agent receives Interface.device_path="" in
update_interface, falls back to a by-MAC link lookup, and fails for
SR-IOV VFs whose firmware MAC differs from the CNI-assigned MAC after
the vfio-pci unbind/rebind cycle.

The guest PCI path is computed at attach() time by do_add_pcie_endpoint()
inside VfioDevice::register() — no QMP query is needed. Cache it in
PhysicalEndpoint.guest_pci_path (Mutex<Option<String>>) during attach()
when do_handle_device() returns the DeviceType::Vfio with the path
already filled in.

Add a default-None guest_pci_path() method to the Endpoint trait;
PhysicalEndpoint overrides it to return the cached path. In
network_with_netns.rs::interfaces(), after building each Interface from
network_info, fill device_path from endpoint.guest_pci_path() when the
field would otherwise be empty.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
67843220f8 runtime-rs: set VF admin MAC before vfio-pci rebind for IB/RoCE support
Without an admin MAC the guest mlx5_core inherits whatever firmware-
default MAC the VF was created with. This MAC differs from the IB port
HCA MAC, so mlx5_ib's GID cache refuses to populate
/sys/class/infiniband/mlx5_*/ports/N/gids/*. RoCE appears active but
every verb needing a GID fails.

Before bind_device_to_vfio(), push the CNI-assigned MAC down to the VF
as an "admin MAC" via the parent PF using RTM_SETLINK with
IFLA_VFINFO_LIST — the netlink equivalent of
  ip link set <PF> vf <N> mac <MAC>

The operation runs in a spawn_blocking closure that enters the host
network namespace (via NetnsGuard("/proc/1/ns/net")), since attach() is
called while the thread is inside the pod netns.

Best-effort: failures are logged at warn and the existing agent-side MAC
reconciliation (update_interface in rpc.rs) remains as a fallback for
L2/L3 connectivity.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
9e9b50c79e runtime-rs: cold-plug Vfio physical endpoints at VM launch
DeviceType::Vfio (used by physical network VFs) was silently dropped
in start_vm()'s cold-plug loop, falling through to the unsupported-
device info log. The VF never appeared on the QEMU command line and
therefore never became visible inside the guest.

Add handling for DeviceType::Vfio in the start_vm() cold-plug loop.
For each HostDevice in the VfioDevice, emit:

  -device vfio-pci,host=<bdf>,id=<hostdev_id>,bus=<root-port>, \
      [x-pci-vendor-id=...,x-pci-device-id=...]

The bus assignment and guest PCI path are already computed by
do_add_pcie_endpoint() at VfioDevice::register() time (called from
VfioDevice::attach() via the PCIe topology), so no additional QMP
resolution is needed here.
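
As a rough illustration of the -device argument shape quoted above, the following Rust sketch assembles the string. The function name, the example BDF, device ID, and bus name, and the hex formatting of the optional vendor/device IDs are all assumptions for illustration; they are not taken from QemuCmdLine.

```rust
/// Build the value passed to `-device` for one cold-plugged VF.
/// `ids` carries the optional (vendor, device) override pair.
fn vfio_pci_device_arg(host_bdf: &str, id: &str, bus: &str, ids: Option<(u16, u16)>) -> String {
    let mut arg = format!("vfio-pci,host={},id={},bus={}", host_bdf, id, bus);
    if let Some((vendor, device)) = ids {
        // Assumption: IDs are rendered as 0x-prefixed, 4-digit hex.
        arg.push_str(&format!(
            ",x-pci-vendor-id={:#06x},x-pci-device-id={:#06x}",
            vendor, device
        ));
    }
    arg
}

fn main() {
    // All values below are made-up examples.
    println!("-device {}", vfio_pci_device_arg("0000:3b:00.2", "hostdev0", "rp0", None));
    println!(
        "-device {}",
        vfio_pci_device_arg("0000:3b:00.2", "hostdev0", "rp0", Some((0x15b3, 0x101e)))
    );
}
```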

Add id= support to PCIeVfioDevice so the QEMU device name is stable
and matchable in QMP queries. Add new_without_iommufd() constructor
for the non-IOMMUFD (legacy VFIO container) path used by physical
endpoints, and add_physical_vfio_device() to QemuCmdLine as a
direct emission helper.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
91df041803 agent: expose guest InfiniBand devices to VFIO containers
When a VF is cold-plugged in guest-kernel mode, mlx5_core binds to the
PCI device inside the VM and mlx5_ib creates IB character devices under
/dev/infiniband/ (uverbs*, rdma_cm, umad*). The container cannot reach
these devices unless they are explicitly added to its OCI spec.

Add expose_guest_infiniband_devices(), called from create_devices() when
the container carries at least one VFIO device entry. The function:

  - Walks /dev/infiniband/ inside the guest VM.
  - Appends each char device to spec.linux.devices.
  - Inserts matching cgroup allow rules (rwm).
  - Is a no-op if /dev/infiniband/ is absent or empty (no IB driver,
    or VF not yet rebound), so non-RDMA pods are unaffected.

Gate the call on container_has_vfio_device() so unrelated containers
sharing the sandbox do not get IB device access widened.

Add is_vfio_device_type() and snapshot_infiniband() to
kata-sys-util/pcilibs. is_vfio_device_type() lets the agent check
device type strings against the VFIO driver name constants without
duplication. snapshot_infiniband() summarises /sys/class/infiniband,
/sys/class/infiniband_verbs, and /dev/infiniband as a single diagnostic
string for log context; it lives in pcilibs because it has no
agent-specific dependencies (pure sysfs/devfs reads).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
025202a52a runtime: expose InfiniBand devices to VFIO containers
The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a
PCIDEVICE_* environment variable; it does not add the VFIO char device
to linux.devices in the OCI spec.  As a result the agent's
container_has_vfio_device() gate stays closed and
expose_guest_infiniband_devices() is never triggered — leaving
/dev/infiniband absent from the container even though the guest kernel
created the IB devices (mlx5_core.rdma.0 probes successfully).

Add appendPhysicalEndpointDevices() which runs after appendDevices()
in createContainer().  It walks the sandbox network endpoints; for
each PhysicalEndpoint with a resolved guest PCI path it derives the
VFIO group char path from sysfs (iommu_group symlink) and synthesises
a vfio-pci-gk Device entry.  Both legacy group paths (/dev/vfio/N)
and iommufd cdev paths (/dev/vfio/devices/vfioN) are supported by
reading the iommu_group sysfs symlink.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
fa9a9f3aeb runtime: set VF admin MAC before vfio-pci rebind for IB/RoCE support
Without an admin MAC, the guest's mlx5_core inherits the VF's
firmware-default MAC. This MAC differs from the IB port's HCA MAC, so
mlx5_ib's GID cache refuses to populate
/sys/class/infiniband/mlx5_*/ports/N/gids/*. RoCE then appears active
(port = ACTIVE, link_layer = Ethernet) but every verb that needs a GID
— RoCEv2 packets, address handles, librdmacm bind — fails silently.

Push the CNI-assigned MAC down to the VF as an "admin MAC" via the PF
using RTM_SETLINK before the bind-to-vfio-pci step. The firmware
applies the admin MAC during the VF reset that accompanies the
unbind/rebind cycle, so the guest sees a single consistent MAC across
netdev, IB port, and HCA.

Best-effort: failures are logged at warn and the existing agent-side
MAC reconciliation (rpc.rs::update_interface) remains as a fallback for
L2/L3 connectivity.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
992a723392 runtime: resolve cold-plug VFIO guest PCI path via QMP
For QEMU cold-plug + guest-kernel mode the guest BDF of a cold-plugged
VFIO device is auto-allocated at boot (each pcie-root-port is added with
chassis=N,slot=N but no pinned addr=, so QEMU picks the next free slot
on pcie.0). The hot-plug path already queries QMP via qomGetPciPath;
reuse that same mechanism for cold-plugged devices.

Add ResolveColdPlugVFIOGuestPciPaths to the Hypervisor interface.
Implement it in qemu.go using qomGetPciPath. Add no-op stubs for all
other hypervisors.

Call it at the start of setupNetworks so that the PCI paths are resolved
before generateVCNetworkStructures emits the agent Interface proto. Also
stamp the resolved path onto PhysicalEndpoints (used by SR-IOV VFs
exposed as physical network devices) so that update_interface carries a
non-empty devicePath. Without devicePath the agent falls back to a
by-MAC link lookup which fails when the VF firmware MAC differs from the
CNI-assigned MAC after the vfio-pci unbind/rebind cycle.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
23c5250933 runtime/qemu: emit id= for VFIODevice on -device cmdline
Without an explicit id= on the vfio-pci device, QEMU auto-generates
an internal name that does not match vfioDev.ID, so any subsequent
qomGetPciPath(vfioDev.ID) call via QMP fails with "Device 'X' not
found". This breaks resolveColdPlugVFIOGuestPciPaths which needs the
device ID to look up the guest PCI path, leaving GuestPciPath nil and
causing update_interface to fail repeatedly as the agent can't find
the interface to configure.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
e6777f0866 runtime: keep cold-plug VFIO devices in guest-kernel mode
Container.createDevices was dropping cold-plug VFIO entries from the
container's deviceInfos whenever vfio_mode = "guest-kernel", which
in turn meant the agent's CreateContainer request carried no
vfio-pci-gk device entry and sandbox.pcimap[cid] stayed empty. The
SR-IOV device plugin still set PCIDEVICE_<RES>=<host-BDF> on the
workload container, so update_env_pci then aborted with
"No PCI mapping found for container <id>" and the container failed
with CrashLoopBackOff.

Include cold-plug VFIO devices in deviceInfos for both VFIO modes.
The existing vfio-pci-gk agent handler returns dev: None (so
/dev/vfio/<group> is not materialised in the container spec, and
constrainGRPCSpec(stripVfio=true) already strips it from the grpc
spec for guest-kernel mode), while still recording the host->guest
PCI mapping into sandbox.pcimap[cid] so env-var translation works.

devManager.NewDevice calls FindDevice first, which matches the
already cold-plugged sandbox-level device by HostPath / major / minor,
so this does not double-attach.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
9893b6dc03 runtime: correctly resolve cold-plug VFIO guest PCI paths
Populate missing VFIO guest PCI paths via QMP before serializing
container devices so guest-kernel PCI env translation has the mappings
it needs.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
118b7fa611 agent: reconcile VFIO netdev MAC before UpdateInterface lookup
When a VFIO cold-plugged network device appears in the guest with a
different MAC than the runtime request, resolve the netdev by PCI path
and apply the requested MAC before the normal by-MAC update flow.

This preserves existing behavior while avoiding UpdateInterface
mismatches in SR-IOV cold-plug cases.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
e89eb77245 agent: keep PCIDEVICE env unchanged when pcimap is missing
Avoid failing container creation when per-container PCI mappings are
unavailable by preserving PCIDEVICE entries unchanged and warning
instead.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-28 21:54:52 +02:00
Cameron Baird
2799f7d36b runtime: Enforce >= 1 queue pairs for tapNetworkPair
In the xConnectVMNetwork path, we have queues = 0 as a baseline,
set to h.HypervisorConfig().NumVCPUs() iff h.Capabilities() advertise
MultiQueueSupport. This is certainly incorrect as we always want, as
a baseline, at least one queue pair. Make queues := 1 by default
to ensure the NetworkPair has at least one queue pair for all
virtio-net paths.

Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
2026-05-27 18:55:11 +00:00
Manuel Huber
ebf2c99df3 runtime-rs: allow EROFS rootfs without rwlayer
Treat the containerd erofs snapshotter active snapshot as an EROFS
lower plus overlay metadata, with an optional ext4 rwlayer when host
rw backing is enabled. This also covers default_size=0, where
containerd sends no rwlayer and the agent provides the writable upper
inside the guest.

Forward overlay mkdir hints on the EROFS storage so the guest agent
sees them in both layouts, and add unit coverage for the dispatcher
patterns.

Assisted-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-27 17:12:20 +00:00
Manuel Huber
4fbfba2f79 agent: support run-backed EROFS upper
Support multi-layer EROFS storage without an explicit ext4 upper
layer. When runtime-rs sends only EROFS lower storage and overlay
metadata, create the overlay upper/work directories under the
container bundle in /run/kata-containers.

Keep the explicit ext4 rwlayer path for disk-backed snapshots, and
only track real temporary mount points for cleanup. The implicit
/run-backed upper is bundle-scoped state and is removed with the
container bundle.

Assisted-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-27 17:12:20 +00:00
Fabiano Fidêncio
5adfb27297 Merge pull request #13118 from PiotrProkop/fix-missing-cwd
agent: restore process CWD auto-creation
2026-05-27 13:32:05 +02:00
PiotrProkop
60a2e27f02 agent: Restore process CWD auto-creation
Commit b56313472 ("agent: Align agent OCI spec with oci-spec-rs",
PR #9944) inverted the condition guarding the create_dir_all call
for process.cwd: the leading `!` was dropped during the refactor.
As a result, the CWD is created only when process.cwd is the empty
string.

When the guest then runs chdir(process.cwd) and CWD doesn't exist
it returns ENOENT.  The agent propagates that to the shim, which
surfaces it to containerd as "failed to create shim task: ENOENT:
No such file or directory" — indistinguishable from a missing
argv[0].
This regressed the original fix in PR #2375 (Fixes #2374), which
deliberately mirrored runc's behavior.  Put the `!` back.
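
A minimal Rust sketch of the restored guard, for illustration only. The `ensure_cwd` function and its `rootfs` parameter are hypothetical; the agent's real code operates on the OCI process spec, which is not reproduced here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Create the working directory when one is set. The `!` is the point of
/// the fix: without it the directory is created only for an empty cwd.
fn ensure_cwd(rootfs: &Path, cwd: &str) -> io::Result<()> {
    if !cwd.is_empty() {
        // Assumption: cwd is resolved relative to a rootfs directory.
        fs::create_dir_all(rootfs.join(cwd.trim_start_matches('/')))?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let root = std::env::temp_dir().join("cwd-sketch");
    ensure_cwd(&root, "/work/app")?;
    println!("created: {}", root.join("work/app").is_dir());
    fs::remove_dir_all(&root)
}
```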

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
2026-05-27 09:59:15 +02:00
Manuel Huber
e838cd7d8d agent: compact EROFS overlay lowerdirs
Use kata_types::mount::Mount for the final multi-layer EROFS
overlay mount instead of calling baremount() directly.

The mount helper detects overlay option strings close to the kernel
mount data limit. When lowerdir entries share a common parent, it
changes into that directory and rewrites lowerdir to relative paths.
That avoids repeating the same long prefix for every layer.
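
The rewrite step can be sketched in Rust as below. This covers only the path rewriting, under the simplifying assumption that all lowerdirs share the same immediate parent; the size check against the mount data limit and the directory change are omitted. `compact_lowerdirs` is a hypothetical name, not the kata_types helper.

```rust
/// If every lowerdir has the same parent directory, return that parent and
/// a colon-joined list of the relative names; otherwise return None.
fn compact_lowerdirs(lowerdirs: &[&str]) -> Option<(String, String)> {
    let (parent, _) = lowerdirs.first()?.rsplit_once('/')?;
    let mut names = Vec::new();
    for dir in lowerdirs {
        let (p, name) = dir.rsplit_once('/')?;
        if p != parent || name.is_empty() {
            return None;
        }
        names.push(name);
    }
    Some((parent.to_string(), names.join(":")))
}

fn main() {
    // Made-up layer paths for illustration.
    let dirs = ["/run/x/multi-layer/l2", "/run/x/multi-layer/l1", "/run/x/multi-layer/l0"];
    if let Some((parent, lowerdir)) = compact_lowerdirs(&dirs) {
        println!("chdir {} then mount with lowerdir={}", parent, lowerdir);
    }
}
```

The caller would change into the returned parent before mount(2), so the relative names resolve correctly.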

Multi-layer EROFS images can have many lower layers under
/run/kata-containers/<cid>/multi-layer. Passing the raw absolute
lowerdir list can exceed the mount option buffer and fail the final
overlay mount, even after all layer devices mounted successfully.

Reuse the helper so this path follows Kata's normal overlay mount
handling, including lowerdir compaction before mount(2).

Assisted-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-26 18:42:11 +00:00
Dan Mihai
c81dadaba1 Merge pull request #13064 from burgerdev/add-arp-neighbour
agent: use rtnetlink to add ARP neighbour
2026-05-26 09:59:44 -07:00
Fabiano Fidêncio
3dc02a8604 Merge pull request #13085 from Apokleos/erofs-gpt-vmdk-only
runtime-rs: Support erofs snapshotter with gpt vmdk mode
2026-05-25 16:29:59 +02:00
Zvonko Kaiser
aeadb1af35 Merge pull request #12948 from fidencio/topic/numa
runtime (go): agent: Add NUMA support for QEMU
2026-05-25 15:33:14 +02:00
Alex Lyn
2036e66bc3 kata-agent: Integrate GPT partition support into multi-layer handler
In GPT mode, all partitions share the same base block device, so
resolving it once per uevent source and caching the result avoids
redundant hotplug waits that would otherwise scale linearly with
layer count.

Layers are sorted by partition number before mounting to guarantee
correct overlay lowerdir precedence regardless of the order the host
emits Storage entries.

Also remove the dead_code attributes now that the code is in use.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
17fadde6d8 kata-agent: Add GPT partition utility functions
The guest agent needs to resolve individual partition devices from a
single GPT-partitioned block device, but the kernel does not always
create partition nodes immediately after the base device appears,
especially when another fd holds the device open during hot-plug.

Add utility functions that handle two problems:
(1) Mapping a base device path to its partition path following the
kernel naming convention (bare suffix vs 'p' separator).
(2) Ensuring the partition node exists before mount.
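
The naming convention in (1) can be illustrated with a short Rust sketch. `partition_path` is a hypothetical name; the rule it encodes is the kernel's usual one, where a 'p' separator is inserted when the base device name ends in a digit.

```rust
/// Map a base block device path to the path of partition `partition`.
fn partition_path(base: &str, partition: u32) -> String {
    let ends_with_digit = base.chars().last().map_or(false, |c| c.is_ascii_digit());
    if ends_with_digit {
        // e.g. /dev/nvme0n1 -> /dev/nvme0n1p1
        format!("{}p{}", base, partition)
    } else {
        // e.g. /dev/vda -> /dev/vda1
        format!("{}{}", base, partition)
    }
}

fn main() {
    println!("{}", partition_path("/dev/vda", 2));
    println!("{}", partition_path("/dev/nvme0n1", 1));
}
```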

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
8119a561ae kata-agent: Refactor wait_and_mount_layer to return LayerMountInfo
This commit makes no functional change: all callers pass None, so
every call still resolves the device via uevent exactly as before.

It only prepares the multi-layer EROFS handler for GPT partition and
dm-verity support by widening the wait_and_mount_layer() interface
without changing behavior.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
0bd150e5f1 runtime-rs: Integrate GPT+VMDK mode for multi-layer EROFS rootfs
When multiple EROFS layers are present, wrap them into a single
GPT-partitioned virtual disk delivered via one VMDK descriptor and a
single block device hotplug. This significantly reduces PCI bus slot
usage compared with the previous one-device-per-layer approach, which
exhausts virtio-blk slots for large layer counts.

The host detects multi-layer mounts, computes the GPT layout, generates
head metadata plus a VMDK descriptor referencing all EROFS images, and
hot-plugs the composite disk. Per-partition Storage entries are created
with X-kata.gpt-partitioned and X-kata.partition-number options so the
guest agent can resolve each layer to its partition device.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
c3b06af4c7 kata-types: Add gpt_disk module for GPT metadata generation
Introduce gpt_disk.rs to compute GPT partition layouts and generate
metadata files for multi-layer EROFS rootfs. The module creates GPT
head metadata that is combined with EROFS layer images via VMDK
descriptors, presenting a single GPT-partitioned virtual disk to the
guest VM — each EROFS layer mapped to its own partition.

The layout engine calculates LBA positions for an arbitrary number of
EROFS layers, then writes a full protective-MBR + GPT image and extracts
the head (MBR + primary GPT table) segments as standalone files for
VMDK extent assembly.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
148810312d runtime-rs: Refactor VMDK writer and erofs rootfs handling logic
Restructure the erofs rootfs handler to support multi-layer GPT+VMDK
mode where multiple EROFS layers are wrapped into a single virtual
disk with a GPT partition table.

Extract VmdkDescriptorWriter as a reusable struct for atomic VMDK
descriptor generation. Change erofs_storage from Option<Storage> to
Vec<Storage> to hold per-layer metadata, and add GPT metadata path
tracking for proper cleanup with path-traversal guards.

Bump MAX_VIRTIO_BLK_DEVICES from 10 to 127 to accommodate GPT disks
carrying many partitions. Pre-extract mkdir directives from overlay
mounts before the main loop to avoid redundant option parsing.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
7086caaddf kata-agent: Remove unused mode field from MkdirDirective
The code previously marked with the dead_code attribute is never
used, so remove it entirely.

Remove the mode field from the MkdirDirective structure along with
its relevant test cases.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
39c512bc36 kata-agent: Enhance virtio block matcher to reject partition uevents
Enhance VirtioBlkPciMatcher to only match whole-disk uevents. This
prevents the matcher from incorrectly matching partition uevents
(e.g., /dev/vdaX) which is critical for partitioned disks where
partition uevents appear alongside whole-disk uevents.

This commit aims to eliminate such bad cases.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
56f05aa534 kata-agent: Enhance SCSI block device matcher to reject partition uevents
Refactor ScsiBlockMatcher to only match whole-disk uevents. This
prevents the matcher from incorrectly matching partition uevents
(e.g., block/sdd/sdd9) which is critical for partitioned disks
where partition uevents appear alongside whole-disk uevents.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Fabiano Fidêncio
7ddea26137 Merge pull request #13086 from fvichot/flo-kata-monitor-fix
kata-monitor: use full URI for connecting to containerd
2026-05-25 10:16:11 +02:00
Fabiano Fidêncio
8787da13a9 agent: Add NUMA-aware PCI path parsing
Extend pcipath_from_dev_tree_path() to support the full NUMA-aware path
format "root_complex/bus/device" (e.g. "10/00/02") in addition to the
legacy "bus/device" format, defaulting to root complex "00" for backward
compatibility.
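
A minimal Rust sketch of the two accepted formats, for illustration. It only splits the string and applies the "00" default; it is not the agent's pcipath_from_dev_tree_path(), and `split_dev_tree_path` is a hypothetical name.

```rust
/// Split a device-tree path into (root_complex, bus, device).
/// The legacy two-part form gets the default root complex "00".
fn split_dev_tree_path(path: &str) -> Option<(String, String, String)> {
    let parts: Vec<&str> = path.split('/').collect();
    match parts.as_slice() {
        [bus, dev] => Some(("00".to_string(), bus.to_string(), dev.to_string())),
        [rc, bus, dev] => Some((rc.to_string(), bus.to_string(), dev.to_string())),
        _ => None,
    }
}

fn main() {
    println!("{:?}", split_dev_tree_path("10/00/02"));
    println!("{:?}", split_dev_tree_path("00/02"));
}
```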

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
1cbe930fc9 runtime: Add pxb-pcie NUMA-aware PCIe topology for VFIO devices
When NUMA placement is active and VFIO devices are cold-plugged,
create a pxb-pcie (PCIe Expander Bridge) per NUMA node that has
devices.  Each pxb-pcie carries a numa_node property that gives the
guest kernel correct NUMA affinity for all PCI devices beneath it.

Root ports are created on each pxb-pcie bus instead of pcie.0, and
VFIODevice.Attach() assigns each device to the root port on its host
NUMA node's pxb bridge.  Non-VFIO devices remain on pcie.0.

NUMA placement is "active" when there is more than one guest NUMA
node OR a single guest node mapped to a specific host node (the
latter happens when maybeRightSizeAutoNUMA() collapses a multi-node
sandbox to the GPU's host NUMA node).  In both cases
buildNUMATopology() also emits the matching
memory-backend-ram,host-nodes=,policy=bind entries so guest memory
is sourced from the right host node.

So that pxb-pcie can never capture a leaf virtio-pci device as the
default bus, every virtio-pci device emitter (NetDevice, VSOCK,
vhost-user-{net,scsi,blk,fs}) now appends bus=pcie.0 explicitly when
the machine actually exposes a pcie.0 root.  Detection is done via a
new hasPCIeRoot() helper that returns true only for q35/virt machine
types — ppc64le's pseries (pci.0), s390x's s390-ccw-virtio (CCW
transport) and microvm (no PCI) intentionally skip the pin to avoid
"Bus 'pcie.0' not found" at startup.
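
The machine-type check can be pictured as the following Rust sketch. The Go runtime's hasPCIeRoot() is not reproduced here; `has_pcie_root` and `virtio_bus_suffix` are hypothetical Rust renderings of the behavior the message describes.

```rust
/// True only for machine types that expose a pcie.0 root bus.
fn has_pcie_root(machine_type: &str) -> bool {
    matches!(machine_type, "q35" | "virt")
}

/// Suffix appended to a virtio-pci -device argument to pin it to pcie.0.
fn virtio_bus_suffix(machine_type: &str) -> &'static str {
    if has_pcie_root(machine_type) {
        ",bus=pcie.0"
    } else {
        ""
    }
}

fn main() {
    for m in ["q35", "virt", "pseries", "s390-ccw-virtio", "microvm"] {
        println!("{}: '{}'", m, virtio_bus_suffix(m));
    }
}
```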

This is the only QEMU mechanism that works for both regular and
confidential (TDX/SNP) guests, as it operates through the PCI bus
hierarchy rather than ACPI table injection.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
15292da217 config: Enable NUMA by default for nvidia-gpu configurations
Enable enable_numa=true in the three nvidia-gpu QEMU configuration
templates (base, SNP, TDX). On single-NUMA hosts this is a no-op since
buildNUMATopology() returns nil when there is only one node. On
multi-NUMA hosts it ensures GPU memory accesses are NUMA-local.

Add documentation to all QEMU config templates explaining the VFIO
device NUMA placement validation that occurs when NUMA is enabled.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
feeb5d8ecc runtime-rs: Fix vCPU pinning race with backoff retry
QEMU can report fewer vCPU threads during early startup, causing partial
affinity setup. Let's retry with exponential backoff until the expected
thread count is visible, then continue with best-effort pinning if the
window is exhausted.
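
The retry pattern might look like the Rust sketch below. The function name, the attempt limit, and the doubling factor are assumptions for illustration; the real runtime-rs code queries QEMU, which is replaced here by a closure.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Poll `query` until it reports at least `expected` vCPU threads, doubling
/// the delay between attempts. Returns the last count seen and whether the
/// expected count was reached; callers can still pin best-effort if not.
fn wait_for_vcpu_threads<F>(
    expected: usize,
    mut query: F,
    max_attempts: u32,
    initial_delay: Duration,
) -> (usize, bool)
where
    F: FnMut() -> usize,
{
    let mut delay = initial_delay;
    let mut seen = query();
    let mut attempt = 1;
    while seen < expected && attempt < max_attempts {
        sleep(delay);
        delay *= 2; // exponential backoff
        seen = query();
        attempt += 1;
    }
    (seen, seen >= expected)
}

fn main() {
    // Simulated hypervisor that reveals two more threads on each query.
    let mut visible = 0;
    let result = wait_for_vcpu_threads(
        4,
        || {
            visible += 2;
            visible
        },
        5,
        Duration::from_millis(1),
    );
    println!("{:?}", result);
}
```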

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
f53f427859 runtime: Fix vCPU pinning race for Go runtime
QEMU may not have spawned all vCPU threads when pinning starts, so
query_cpus_fast can return an incomplete list and leave some vCPUs
unpinned. To fix it, let's add exponential backoff retries before
pinning and fall back to available threads if retries are exhausted.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
b688619314 runtime: oci: Fix sandbox CPU sizing with cpuManagerPolicy=static
When cpuManagerPolicy=static is configured, kubelet sets the sandbox
CPU quota to -1 (unconstrained) because it uses cpuset pinning instead
of CFS quota. This causes CalculateSandboxSizing to compute 0 workload
CPUs, resulting in the VM starting with only default_vcpus.

Fall back to deriving the CPU count from sandbox CPU shares (1024
shares per CPU) when the quota-based calculation yields 0.
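
The fallback arithmetic can be sketched in Rust as follows. This is not CalculateSandboxSizing; the function name and the round-up behavior in both branches are assumptions made for the example.

```rust
/// One CPU corresponds to 1024 CPU shares.
const SHARES_PER_CPU: u64 = 1024;

/// Derive the workload CPU count from the CFS quota, falling back to CPU
/// shares when the quota yields 0 (for example quota = -1, unconstrained).
fn workload_cpus(quota: i64, period: u64, shares: u64) -> u64 {
    let from_quota = if quota > 0 && period > 0 {
        // Assumption: round up to whole CPUs.
        (quota as u64 + period - 1) / period
    } else {
        0
    };
    if from_quota > 0 {
        return from_quota;
    }
    (shares + SHARES_PER_CPU - 1) / SHARES_PER_CPU
}

fn main() {
    // Static CPU manager policy: quota is -1, shares reflect 4 CPUs.
    println!("{}", workload_cpus(-1, 100_000, 4096));
    // Regular CFS quota of 2.5 CPUs.
    println!("{}", workload_cpus(250_000, 100_000, 2048));
}
```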

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
12e5985dbd runtime: Add NUMA-aware vCPU pinning and cpuset.mems forwarding
Make checkVCPUsPinning() NUMA-aware: when GuestNUMANodes are configured,
vCPU threads are pinned to host CPUs belonging to the same NUMA node as
the vCPU's guest NUMA node assignment via checkVCPUsPinningNUMA(),
preserving memory locality. vCPUs are distributed proportionally across
NUMA nodes, matching the distribution in buildNUMATopology().

Stop unconditionally stripping cpuset.mems in constrainGRPCSpec() and
container update(). When multi-NUMA is configured, translate host NUMA
node IDs to guest NUMA node IDs using translateHostMemsToGuest() before
forwarding to the agent. This allows the agent to enforce NUMA-aware
memory placement for containers.

Filter guest NUMA nodes at VM creation time: before calling CreateVM(),
prune GuestNUMANodes to only those whose HostCPUs intersect the sandbox
cpuset. This avoids exposing fake NUMA topology to the guest when
Kubernetes allocates CPUs from fewer nodes than the host has (e.g. all
CPUs from node 0 on a 2-node host), improving memory locality and
avoiding unnecessary cross-node memory traffic.
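
The pruning step amounts to a set-intersection filter. A minimal Rust sketch, with a hypothetical `GuestNumaNode` struct standing in for the Go runtime's GuestNUMANodes entries:

```rust
use std::collections::BTreeSet;

/// Hypothetical guest NUMA node description.
struct GuestNumaNode {
    host_node: u32,
    host_cpus: BTreeSet<u32>,
}

/// Keep only the nodes whose host CPUs intersect the sandbox cpuset.
fn prune_nodes(nodes: Vec<GuestNumaNode>, sandbox_cpuset: &BTreeSet<u32>) -> Vec<GuestNumaNode> {
    nodes
        .into_iter()
        .filter(|n| !n.host_cpus.is_disjoint(sandbox_cpuset))
        .collect()
}

fn main() {
    // Two-node host; the sandbox only received CPUs from node 0.
    let nodes = vec![
        GuestNumaNode { host_node: 0, host_cpus: (0..4).collect() },
        GuestNumaNode { host_node: 1, host_cpus: (4..8).collect() },
    ];
    let cpuset: BTreeSet<u32> = (0..2).collect();
    for n in prune_nodes(nodes, &cpuset) {
        println!("keep host node {}", n.host_node);
    }
}
```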

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
d0d7deb262 runtime: Add host NUMA distance discovery and build guest NUMA topology
Add sysfs-based host NUMA distance reading (GetHostNUMADistances) that
parses /sys/devices/system/node/nodeN/distance to mirror the host NUMA
distance matrix into the guest via -numa dist entries.
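
For illustration, a Rust sketch of parsing one sysfs distance row and rendering -numa dist values. It is not GetHostNUMADistances; the function names are hypothetical, and emitting every matrix cell (including the diagonal) is an assumption.

```rust
/// Parse one /sys/devices/system/node/nodeN/distance line, e.g. "10 21".
fn parse_distance_row(line: &str) -> Option<Vec<u32>> {
    let row: Option<Vec<u32>> = line.split_whitespace().map(|f| f.parse().ok()).collect();
    row.filter(|r| !r.is_empty())
}

/// Render the matrix as values for QEMU's `-numa` option.
fn numa_dist_args(matrix: &[Vec<u32>]) -> Vec<String> {
    let mut args = Vec::new();
    for (src, row) in matrix.iter().enumerate() {
        for (dst, val) in row.iter().enumerate() {
            args.push(format!("dist,src={},dst={},val={}", src, dst, val));
        }
    }
    args
}

fn main() {
    let rows = ["10 21\n", "21 10\n"];
    let matrix: Vec<Vec<u32>> = rows.iter().filter_map(|r| parse_distance_row(r)).collect();
    for arg in numa_dist_args(&matrix) {
        println!("-numa {}", arg);
    }
}
```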

Implement buildNUMATopology() which translates the GuestNUMANodes
configuration into govmm NUMANode and NUMADist slices. Each guest NUMA
node gets a floor-divided share of vCPUs and memory, with the last node
absorbing any remainder. This handles the common Kata case of +1 VMM
overhead vCPU gracefully. Memory backends are selected based on
hugepages/virtio-fs/file-backed-mem configuration.

Restrict multi-NUMA topology generation to amd64 and arm64, since
other architectures (s390x, riscv64) do not support QEMU NUMA/DIMM.

Wire buildNUMATopology() into CreateVM so the QEMU config includes NUMA
nodes and distances.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
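
As a rough illustration of d0d7deb262, the Python sketch below shows the floor-divided split with the last node absorbing the remainder, plus parsing of one sysfs distance line. The function names are hypothetical and this is not the Go implementation. It assumes only that a `distance` file holds whitespace-separated integers, one per node.

```python
def split_across_nodes(total, num_nodes):
    """Give each node floor(total / num_nodes); the last node takes the remainder."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    base = total // num_nodes
    shares = [base] * num_nodes
    shares[-1] += total - base * num_nodes
    return shares


def parse_distance_line(text):
    """Parse the contents of /sys/devices/system/node/nodeN/distance."""
    return [int(tok) for tok in text.split()]


if __name__ == "__main__":
    # 8 workload vCPUs plus the +1 VMM overhead vCPU, split over 2 nodes.
    print(split_across_nodes(9, 2))
    print(parse_distance_line("10 21\n"))
```

With 9 vCPUs on 2 nodes the split is 4 and 5, so the odd overhead vCPU lands on the last node.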
Fabiano Fidêncio
447e2a3faf runtime: Add VFIO device NUMA node detection and placement validation
Add PCISysFsDevicesNUMANode property and GetPCIDeviceNUMANode() helper
to read /sys/bus/pci/devices/<BDF>/numa_node when discovering VFIO
devices. Store the result in the new NUMANode field on VFIODev (-1 for
unknown/no affinity).

Wire NUMA node detection into both GetAllVFIODevicesFromIOMMUGroup()
(legacy VFIO path) and GetDeviceFromVFIODev() (IOMMUFD path) so every
discovered VFIO device carries its host NUMA node.

Add validateVFIODeviceNUMAPlacement() which runs at the end of
buildNUMATopology(). It checks every cold-plugged VFIO device's host
NUMA node against the guest NUMA topology and logs a warning if a device
is on a host NUMA node not covered by any guest NUMA node (indicating
potential cross-NUMA memory access overhead), or an info message
confirming correct placement.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
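
The sysfs lookup and placement check from 447e2a3faf might look roughly like this in Python. The helper names are hypothetical, and skipping devices with an unknown node (-1) during validation is an assumption, not something the commit message states.

```python
from pathlib import Path


def parse_numa_node(text):
    """Parse a numa_node file's contents; -1 means unknown or no affinity."""
    try:
        return int(text.strip())
    except ValueError:
        return -1


def read_pci_numa_node(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Read <sysfs_root>/<BDF>/numa_node, returning -1 if it cannot be read."""
    try:
        return parse_numa_node((Path(sysfs_root) / bdf / "numa_node").read_text())
    except OSError:
        return -1


def devices_outside_guest_topology(device_nodes, covered_host_nodes):
    """Return BDFs whose host NUMA node is not covered by any guest node.

    Assumption: devices with an unknown node (-1) are skipped, not flagged.
    """
    return [
        bdf
        for bdf, node in device_nodes.items()
        if node >= 0 and node not in covered_host_nodes
    ]
```

A caller would log a warning for each returned BDF and an informational message for the rest.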
Fabiano Fidêncio
1ee8bb5740 runtime: Add NUMA-aware SMP topology
Make cpuTopology() NUMA-aware by accepting a numNUMANodes parameter.
When multiple NUMA nodes are configured, restructure the SMP topology so
that Sockets=numNUMANodes and Cores=ceil(maxvcpus/numNUMANodes), giving
one socket per NUMA node. Use ceiling division so that uneven vCPU
counts (e.g. the +1 VMM overhead vCPU that Kata adds) produce a
QEMU-valid SMP topology where MaxCPUs == Sockets * Cores * Threads.

When numNUMANodes <= 1, the existing flat topology (Sockets=maxvcpus,
Cores=1) is preserved.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
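
The arithmetic in 1ee8bb5740 can be sketched as follows. This is an illustrative Python rendering, not the Go cpuTopology(). It assumes Threads is fixed at 1 and that MaxCPUs is rounded up to Sockets * Cores, neither of which the commit message states explicitly.

```python
def cpu_topology(max_vcpus, num_numa_nodes):
    """Return an SMP layout satisfying maxcpus == sockets * cores * threads."""
    threads = 1  # assumption: one thread per core
    if num_numa_nodes <= 1:
        # Flat topology: one single-core socket per vCPU.
        sockets, cores = max_vcpus, 1
    else:
        # One socket per NUMA node; integer ceiling division for the cores.
        sockets = num_numa_nodes
        cores = (max_vcpus + num_numa_nodes - 1) // num_numa_nodes
    return {
        "sockets": sockets,
        "cores": cores,
        "threads": threads,
        "maxcpus": sockets * cores * threads,
    }


if __name__ == "__main__":
    print(cpu_topology(9, 2))
    print(cpu_topology(4, 1))
```

For 9 vCPUs on 2 NUMA nodes this gives 2 sockets of 5 cores, so the resulting maximum is 10. Floor division would give 2 x 4 = 8, which cannot hold all 9 vCPUs.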
Fabiano Fidêncio
1e9da61d48 govmm: Add multi-NUMA memory backend and distance matrix support
Introduce NUMANode and NUMADist types, add NUMANodes/NUMADists fields to
Config, and implement appendMultiNUMAMemoryKnobs() to generate per-node
memory-backend objects with host-nodes/policy=bind, -numa node entries
with cpus= ranges, and -numa dist entries for the distance matrix.

Gate the multi-NUMA path in appendMemoryKnobs() behind isDimmSupported()
to ensure architectures without DIMM support (s390x, riscv64) fall back
to the single-node path. Drop 386 from isDimmSupported since 32-bit x86
is not a supported Kata target.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
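
To make the shape of the arguments generated by 1e9da61d48 concrete, here is a hedged Python sketch that assembles a QEMU command-line fragment. It always uses `memory-backend-ram`, whereas the real code picks a backend from the hugepages/virtio-fs/file-backed-mem configuration. The dictionary keys, the `mem<N>` IDs, and the function name are assumptions, and the exact strings govmm emits may differ.

```python
def multi_numa_args(nodes, dists):
    """Build -object/-numa arguments for a multi-NUMA guest.

    nodes: list of dicts with keys id, mem_mib, host_node, cpus (assumed layout)
    dists: list of (src, dst, val) distance triples
    """
    args = []
    for n in nodes:
        memdev = f"mem{n['id']}"
        # Per-node memory backend bound to one host NUMA node.
        args += [
            "-object",
            f"memory-backend-ram,id={memdev},size={n['mem_mib']}M,"
            f"host-nodes={n['host_node']},policy=bind",
        ]
        # Guest NUMA node tied to that backend and a vCPU range.
        args += ["-numa", f"node,nodeid={n['id']},cpus={n['cpus']},memdev={memdev}"]
    for src, dst, val in dists:
        args += ["-numa", f"dist,src={src},dst={dst},val={val}"]
    return args


if __name__ == "__main__":
    nodes = [
        {"id": 0, "mem_mib": 1024, "host_node": 0, "cpus": "0-3"},
        {"id": 1, "mem_mib": 1024, "host_node": 1, "cpus": "4-8"},
    ]
    print(" ".join(multi_numa_args(nodes, [(0, 1, 21), (1, 0, 21)])))
```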
Fabiano Fidêncio
ed4d0fb51f runtime-rs: qemu: pass -bios for non-confidential guests
The `boot_info.firmware` field from the hypervisor configuration is
loaded by kata-types and surfaces in the TOML as `firmware = "..."`,
but the qemu cmdline generator never consumed it for non-CC guests.

Today, `-bios <path>` is only appended via the `Bios` device pushed by
`add_{sev,sev_snp,tdx}_protection_device()` in
`QemuInner::start_vm()`, which use the firmware copied into the
`ProtectionDeviceConfig`. That path is taken only when
`confidential_guest = true` and a SEV/SEV-SNP/TDX protection device is
configured. For plain Q35 profiles (notably the nvidia-gpu one, which
needs OVMF to boot the GPU passthrough VM), the `firmware` set in the
TOML was silently dropped and qemu fell back to its default BIOS.

Wire `boot_info.firmware` directly in `QemuCmdLine::new()` when no
protection device path is going to emit `-bios` (i.e. for non-CC
guests). CC paths are left untouched so we don't end up with a
duplicated `-bios` argument.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-24 15:05:26 +02:00
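
The decision described in ed4d0fb51f reduces to a single condition, sketched below in Python. The real change is in the Rust `QemuCmdLine::new()`. The function name and the boolean parameter are hypothetical stand-ins for "a protection device path will emit `-bios` itself".

```python
def bios_args(firmware, protection_device_emits_bios):
    """Return the -bios arguments to add when building the base command line.

    firmware: value of boot_info.firmware ("" or None if unset)
    protection_device_emits_bios: True for CC guests whose SEV/SEV-SNP/TDX
        protection device already appends -bios (assumed flag)
    """
    if firmware and not protection_device_emits_bios:
        return ["-bios", firmware]
    # Either no firmware is configured or the CC path adds -bios later;
    # adding it here as well would duplicate the argument.
    return []
```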