Commit Graph

19182 Commits

Author SHA1 Message Date
Fabiano Fidêncio
4e7b49fede Merge pull request #13103 from fidencio/topic/mlx-coldplug-support-v2
runtime / agent / kernel: fix cold-plug VFIO guest-kernel mode for SR-IOV RoCE/InfiniBand
2026-05-29 21:21:46 +02:00
Fabiano Fidêncio
10e70a2a9f runtime-rs: expose InfiniBand devices to VFIO containers
The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a
PCIDEVICE_* environment variable; it does not add the VFIO char device
to linux.devices in the OCI spec.  As a result the agent's
container_has_vfio_device() gate stays closed and
expose_guest_infiniband_devices() is never triggered — leaving
/dev/infiniband absent from the container even though the guest kernel
created the IB devices (mlx5_core.rdma.0 probes successfully).

The cold_plug_bdfs map (host_bdf → guest_pci_path, built from network
endpoints via host_bdf()) was already present inside handler_devices()
but could never be consumed because the LinuxDeviceType::C loop has
no entries to iterate over when linux.devices is empty.

After that loop, iterate over any unmatched cold-plug BDFs, derive the
VFIO group path via bdf_to_vfio_group_path() (reads
/sys/bus/pci/devices/<bdf>/iommu_group), and push a vfio-pci-gk
ContainerDevice.  The vfio_group_to_bdf() short-circuit inside the
loop handles the case where the device plugin does add VFIO char
devices to linux.devices; it now supports both legacy (/dev/vfio/N)
and iommufd (/dev/vfio/devices/vfioN) path formats.
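
The two path formats can be told apart with a small parser. The following is a minimal sketch only: `VfioDevPath` and `classify_vfio_path` are hypothetical names chosen for illustration and are not the helpers in the tree.

```rust
/// Which flavour of VFIO character device a path refers to.
#[derive(Debug, PartialEq)]
enum VfioDevPath {
    /// Legacy group node: /dev/vfio/<N>
    LegacyGroup(u32),
    /// iommufd per-device cdev: /dev/vfio/devices/vfio<N>
    IommufdCdev(u32),
}

/// Classify a VFIO char-device path (hypothetical helper).
/// Returns None for anything else, including the /dev/vfio/vfio control node.
fn classify_vfio_path(path: &str) -> Option<VfioDevPath> {
    let rest = path.strip_prefix("/dev/vfio/")?;
    if let Some(index) = rest.strip_prefix("devices/vfio") {
        return index.parse::<u32>().ok().map(VfioDevPath::IommufdCdev);
    }
    rest.parse::<u32>().ok().map(VfioDevPath::LegacyGroup)
}
```

The iommufd prefix is checked first so that the longer path is never mistaken for a group number.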

Add host_bdf() to the Endpoint trait (default: None) so that
PhysicalEndpoint can expose its BDF for the cold_plug_bdfs map.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
60f2878c68 runtime-rs: call network.remove() during resource cleanup
network.remove() — which detaches endpoints and rebinds VFs from
vfio-pci back to the host driver — was never being called.
ResourceManagerInner::cleanup() handled cgroups, bindmounts, share-fs,
swap and ephemeral disks, but completely omitted the network teardown.

Call network.remove() at the start of cleanup(), using the already-held
self.hypervisor reference.  Errors are logged as warnings rather than
propagated, so they don't block the rest of the cleanup sequence.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
0b4b51dff6 runtime-rs: always detach endpoints on network removal
network_with_netns::remove() bailed out early when network_created=false
(i.e. the netns was created by the CNI, not by kata). This caused
physical endpoint VFs to remain bound to vfio-pci after pod deletion,
because PhysicalEndpoint::detach() — which calls bind_device_to_host()
to rebind the VF from vfio-pci back to mlx5_core — was never reached.

Separate endpoint detachment from netns deletion: always detach
endpoints, but only remove the netns if kata created it.  Detach errors
are logged as warnings rather than propagated, to mirror the Go runtime's
best-effort approach and avoid blocking sandbox teardown.
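
The separation described above can be sketched as follows. This is an illustration under stated assumptions: the `Endpoint` trait here is a reduced stand-in, and `FakeEndpoint`, `Teardown` and `remove_network` are hypothetical names, not the runtime-rs types.

```rust
/// Reduced, hypothetical stand-in for an endpoint that can be detached.
trait Endpoint {
    fn name(&self) -> String;
    fn detach(&self) -> Result<(), String>;
}

/// Test double whose detach() fails on demand (illustration only).
struct FakeEndpoint {
    name: &'static str,
    fail: bool,
}

impl Endpoint for FakeEndpoint {
    fn name(&self) -> String {
        self.name.to_string()
    }
    fn detach(&self) -> Result<(), String> {
        if self.fail { Err("device busy".to_string()) } else { Ok(()) }
    }
}

/// Outcome of a teardown, returned so callers can inspect it.
#[derive(Debug, Default, PartialEq)]
struct Teardown {
    warnings: Vec<String>,
    netns_removed: bool,
}

fn remove_network(endpoints: &[Box<dyn Endpoint>], network_created: bool) -> Teardown {
    let mut out = Teardown::default();
    // Always detach, regardless of who created the netns.
    for ep in endpoints {
        if let Err(e) = ep.detach() {
            // Best effort: record a warning and keep going.
            out.warnings.push(format!("failed to detach {}: {}", ep.name(), e));
        }
    }
    // Only delete a netns that kata itself created.
    out.netns_removed = network_created;
    out
}
```

A failing endpoint produces a warning but does not stop the remaining endpoints from being detached.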

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
be2ec02c9a runtime-rs: resolve cold-plug VFIO guest PCI path via QMP
The PCIe topology pre-computes a wrong path for cold-plugged physical-
endpoint VFs because the root port has no explicit addr and QEMU auto-
assigns its slot. The pre-computed PciPath { slots: [PciSlot(0)] }
resolves to 0000:00:00.0 (the Q35 MCH), causing
wait_for_pci_net_interface to time out looking for a netdev there.

Add resolve_vfio_device_pci_path(hostdev_id) to the Hypervisor trait.
Implement it in QemuInner using qmp.get_device_by_qdev_id(), which
queries QEMU's query-pci to find the full guest PCIe path (e.g. "05/00"
= slot 5 on pcie.0 / slot 0 on the root port bus).
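
To make the path notation concrete, here is a small sketch that parses a path such as "05/00" and shows why a lone slot 0 lands on 0000:00:00.0. The helper names are hypothetical, and the format (two hex digits per element, slots limited to 0..=0x1f) is an assumption of this sketch.

```rust
/// Parse a guest PCI path such as "05/00" into its slot numbers.
/// Assumed format: '/'-separated, two hex digits each, slot in 0..=0x1f.
fn parse_pci_path(path: &str) -> Result<Vec<u8>, String> {
    path.split('/')
        .map(|s| {
            if s.len() != 2 {
                return Err(format!("bad slot element {:?}", s));
            }
            let slot = u8::from_str_radix(s, 16).map_err(|e| format!("{:?}: {}", s, e))?;
            if slot > 0x1f {
                return Err(format!("slot {:#x} out of range", slot));
            }
            Ok(slot)
        })
        .collect()
}

/// Address of a device sitting directly on bus 0 (single-element path).
fn root_bus_address(slot: u8) -> String {
    format!("0000:00:{:02x}.0", slot)
}
```

With these helpers, `root_bus_address(0)` yields the address that the pre-computed single-slot path pointed at.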

Store the QEMU device ID (hostdev_id) in PhysicalEndpoint during
attach(). Add vfio_hostdev_id() and set_guest_pci_path() to the
Endpoint trait and add an endpoints() accessor to the Network trait.

In setup_after_start_vm(), call resolve_physical_endpoint_pci_paths()
before apply_network_to_agent() to populate the correct path from QMP
into each PhysicalEndpoint's guest_pci_path field. The field is then
consumed by network_with_netns::interfaces() to fill Interface.device_path
before update_interface is sent to the agent.

This is the runtime-rs counterpart of the Go runtime's
ResolveColdPlugVFIOGuestPciPaths / qomGetPciPath.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
f8ee9133e5 runtime-rs: populate device_path for cold-plug VFIO physical endpoints
Without device_path the agent receives Interface.device_path="" in
update_interface, falls back to a by-MAC link lookup, and fails for
SR-IOV VFs whose firmware MAC differs from the CNI-assigned MAC after
the vfio-pci unbind/rebind cycle.

The guest PCI path is computed at attach() time by do_add_pcie_endpoint()
inside VfioDevice::register() — no QMP query is needed. Cache it in
PhysicalEndpoint.guest_pci_path (Mutex<Option<String>>) during attach()
when do_handle_device() returns the DeviceType::Vfio with the path
already filled in.

Add a default-None guest_pci_path() method to the Endpoint trait;
PhysicalEndpoint overrides it to return the cached path. In
network_with_netns.rs::interfaces(), after building each Interface from
network_info, fill device_path from endpoint.guest_pci_path() when the
field would otherwise be empty.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
67843220f8 runtime-rs: set VF admin MAC before vfio-pci rebind for IB/RoCE support
Without an admin MAC the guest mlx5_core inherits whatever firmware-
default MAC the VF was created with. This MAC differs from the IB port
HCA MAC, so mlx5_ib's GID cache refuses to populate
/sys/class/infiniband/mlx5_*/ports/N/gids/*. RoCE appears active but
every verb needing a GID fails.

Before bind_device_to_vfio(), push the CNI-assigned MAC down to the VF
as an "admin MAC" via the parent PF using RTM_SETLINK with
IFLA_VFINFO_LIST — the netlink equivalent of
  ip link set <PF> vf <N> mac <MAC>

The operation runs in a spawn_blocking closure that enters the host
network namespace (via NetnsGuard("/proc/1/ns/net")), since attach() is
called while the thread is inside the pod netns.

Best-effort: failures are logged at warn and the existing agent-side MAC
reconciliation (update_interface in rpc.rs) remains as a fallback for
L2/L3 connectivity.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
9e9b50c79e runtime-rs: cold-plug Vfio physical endpoints at VM launch
DeviceType::Vfio (used by physical network VFs) was silently dropped
in start_vm()'s cold-plug loop, falling through to the unsupported-
device info log. The VF never appeared on the QEMU command line and
therefore never became visible inside the guest.

Add handling for DeviceType::Vfio in the start_vm() cold-plug loop.
For each HostDevice in the VfioDevice, emit:

  -device vfio-pci,host=<bdf>,id=<hostdev_id>,bus=<root-port>, \
      [x-pci-vendor-id=...,x-pci-device-id=...]
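
A sketch of how such an argument string could be assembled. `vfio_pci_device_arg` is a hypothetical helper, not `add_physical_vfio_device()` itself, and rendering the optional IDs as 0x-prefixed hex is an assumption of this example.

```rust
/// Build the value of a QEMU `-device` option for a cold-plugged VF.
/// `ids` is the optional (vendor, device) override pair.
/// Hypothetical helper; the hex rendering of the IDs is an assumption.
fn vfio_pci_device_arg(
    bdf: &str,
    hostdev_id: &str,
    bus: &str,
    ids: Option<(u16, u16)>,
) -> String {
    let mut arg = format!("vfio-pci,host={},id={},bus={}", bdf, hostdev_id, bus);
    if let Some((vendor, device)) = ids {
        arg.push_str(&format!(
            ",x-pci-vendor-id={:#06x},x-pci-device-id={:#06x}",
            vendor, device
        ));
    }
    arg
}
```

Keeping `id=` in the string is what makes the device addressable by name in later QMP queries.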

The bus assignment and guest PCI path are already computed by
do_add_pcie_endpoint() at VfioDevice::register() time (called from
VfioDevice::attach() via the PCIe topology), so no additional QMP
resolution is needed here.

Add id= support to PCIeVfioDevice so the QEMU device name is stable
and matchable in QMP queries. Add new_without_iommufd() constructor
for the non-IOMMUFD (legacy VFIO container) path used by physical
endpoints, and add_physical_vfio_device() to QemuCmdLine as a
direct emission helper.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
91df041803 agent: expose guest InfiniBand devices to VFIO containers
When a VF is cold-plugged in guest-kernel mode, mlx5_core binds to the
PCI device inside the VM and mlx5_ib creates IB character devices under
/dev/infiniband/ (uverbs*, rdma_cm, umad*). The container cannot reach
these devices unless they are explicitly added to its OCI spec.

Add expose_guest_infiniband_devices(), called from create_devices() when
the container carries at least one VFIO device entry. The function:

  - Walks /dev/infiniband/ inside the guest VM.
  - Appends each char device to spec.linux.devices.
  - Inserts matching cgroup allow rules (rwm).
  - Is a no-op if /dev/infiniband/ is absent or empty (no IB driver,
    or VF not yet rebound), so non-RDMA pods are unaffected.
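
The directory walk and the no-op behaviour can be sketched with the standard library alone. `list_char_devices` is a hypothetical helper for illustration; it only collects paths and does not touch an OCI spec or cgroup rules.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::FileTypeExt;
use std::path::{Path, PathBuf};

/// Collect the character devices directly under `dir`, sorted by path.
/// A missing directory yields an empty list rather than an error, which
/// mirrors the "no-op if absent" behaviour described above.
fn list_char_devices(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let entries = match fs::read_dir(dir) {
        Ok(entries) => entries,
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(Vec::new()),
        Err(e) => return Err(e),
    };
    let mut devices = Vec::new();
    for entry in entries {
        let entry = entry?;
        // Skip regular files, directories and anything that is not a char device.
        if entry.file_type()?.is_char_device() {
            devices.push(entry.path());
        }
    }
    devices.sort();
    Ok(devices)
}
```

A caller would pass `Path::new("/dev/infiniband")` and append one device entry per returned path.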

Gate the call on container_has_vfio_device() so unrelated containers
sharing the sandbox do not get IB device access widened.

Add is_vfio_device_type() and snapshot_infiniband() to
kata-sys-util/pcilibs. is_vfio_device_type() lets the agent check
device type strings against the VFIO driver name constants without
duplication. snapshot_infiniband() summarises /sys/class/infiniband,
/sys/class/infiniband_verbs, and /dev/infiniband as a single diagnostic
string for log context; it lives in pcilibs because it has no
agent-specific dependencies (pure sysfs/devfs reads).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
9729ed9993 kernel: enable InfiniBand/RoCE support in mlx5 kernel config fragment
Add the kernel configuration options required for RDMA / RoCE operation
with Mellanox ConnectX / BlueField VFs:

  - CONFIG_INFINIBAND: IB subsystem core
  - CONFIG_INFINIBAND_ADDR_TRANS: RoCEv2 GID table management
  - CONFIG_INFINIBAND_USER_ACCESS: userspace verbs (/dev/infiniband/uverbs*)
  - CONFIG_INFINIBAND_USER_MAD: userspace MAD interface
  - CONFIG_MLX5_INFINIBAND: mlx5_ib ConnectX IB/RoCE driver
  - CONFIG_CGROUP_RDMA: RDMA cgroup controller (required by mlx5_ib)

Bump kata_config_version to 196 to trigger a kernel rebuild.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
025202a52a runtime: expose InfiniBand devices to VFIO containers
The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a
PCIDEVICE_* environment variable; it does not add the VFIO char device
to linux.devices in the OCI spec.  As a result the agent's
container_has_vfio_device() gate stays closed and
expose_guest_infiniband_devices() is never triggered — leaving
/dev/infiniband absent from the container even though the guest kernel
created the IB devices (mlx5_core.rdma.0 probes successfully).

Add appendPhysicalEndpointDevices() which runs after appendDevices()
in createContainer().  It walks the sandbox network endpoints; for
each PhysicalEndpoint with a resolved guest PCI path it derives the
VFIO group char path from sysfs (iommu_group symlink) and synthesises
a vfio-pci-gk Device entry.  Both legacy group paths (/dev/vfio/N)
and iommufd cdev paths (/dev/vfio/devices/vfioN) are supported by
reading the iommu_group sysfs symlink.
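
Deriving a group node from the sysfs symlink might look like the sketch below. It is written in Rust to match the other examples here rather than in the Go of this commit, covers only the legacy /dev/vfio/N form, and uses a hypothetical `vfio_group_path` helper with the sysfs root passed in so it can be exercised against a temporary directory.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Resolve the legacy VFIO group node for `bdf` by reading the
/// `iommu_group` symlink below `sysfs_pci` (normally /sys/bus/pci/devices).
/// Hypothetical helper; the iommufd cdev form is not handled here.
fn vfio_group_path(sysfs_pci: &Path, bdf: &str) -> io::Result<PathBuf> {
    let target = fs::read_link(sysfs_pci.join(bdf).join("iommu_group"))?;
    // The final component of the link target is the IOMMU group number.
    let group = target.file_name().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidData, "iommu_group link has no final component")
    })?;
    Ok(Path::new("/dev/vfio").join(group))
}
```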

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
fa9a9f3aeb runtime: set VF admin MAC before vfio-pci rebind for IB/RoCE support
Without an admin MAC, the guest's mlx5_core inherits the VF's
firmware-default MAC. This MAC differs from the IB port's HCA MAC, so
mlx5_ib's GID cache refuses to populate
/sys/class/infiniband/mlx5_*/ports/N/gids/*. RoCE then appears active
(port = ACTIVE, link_layer = Ethernet) but every verb that needs a GID
— RoCEv2 packets, address handles, librdmacm bind — fails silently.

Push the CNI-assigned MAC down to the VF as an "admin MAC" via the PF
using RTM_SETLINK before the bind-to-vfio-pci step. The firmware
applies the admin MAC during the VF reset that accompanies the
unbind/rebind cycle, so the guest sees a single consistent MAC across
netdev, IB port, and HCA.

Best-effort: failures are logged at warn and the existing agent-side
MAC reconciliation (rpc.rs::update_interface) remains as a fallback for
L2/L3 connectivity.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
992a723392 runtime: resolve cold-plug VFIO guest PCI path via QMP
For QEMU cold-plug + guest-kernel mode the guest BDF of a cold-plugged
VFIO device is auto-allocated at boot (each pcie-root-port is added with
chassis=N,slot=N but no pinned addr=, so QEMU picks the next free slot
on pcie.0). The hot-plug path already queries QMP via qomGetPciPath;
reuse that same mechanism for cold-plugged devices.

Add ResolveColdPlugVFIOGuestPciPaths to the Hypervisor interface.
Implement it in qemu.go using qomGetPciPath. Add no-op stubs for all
other hypervisors.

Call it at the start of setupNetworks so that the PCI paths are resolved
before generateVCNetworkStructures emits the agent Interface proto. Also
stamp the resolved path onto PhysicalEndpoints (used by SR-IOV VFs
exposed as physical network devices) so that update_interface carries a
non-empty devicePath. Without devicePath the agent falls back to a
by-MAC link lookup which fails when the VF firmware MAC differs from the
CNI-assigned MAC after the vfio-pci unbind/rebind cycle.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
23c5250933 runtime/qemu: emit id= for VFIODevice on -device cmdline
Without an explicit id= on the vfio-pci device, QEMU auto-generates
an internal name that does not match vfioDev.ID, so any subsequent
qomGetPciPath(vfioDev.ID) call via QMP fails with "Device 'X' not
found". This breaks resolveColdPlugVFIOGuestPciPaths which needs the
device ID to look up the guest PCI path, leaving GuestPciPath nil and
causing update_interface to fail repeatedly as the agent can't find
the interface to configure.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
e6777f0866 runtime: keep cold-plug VFIO devices in guest-kernel mode
Container.createDevices was dropping cold-plug VFIO entries from the
container's deviceInfos whenever vfio_mode = "guest-kernel", which
in turn meant the agent's CreateContainer request carried no
vfio-pci-gk device entry and sandbox.pcimap[cid] stayed empty. The
SR-IOV device plugin still set PCIDEVICE_<RES>=<host-BDF> on the
workload container, so update_env_pci then aborted with
"No PCI mapping found for container <id>" and the container failed
with CrashLoopBackOff.

Include cold-plug VFIO devices in deviceInfos for both VFIO modes.
The existing vfio-pci-gk agent handler returns dev: None (so
/dev/vfio/<group> is not materialised in the container spec, and
constrainGRPCSpec(stripVfio=true) already strips it from the grpc
spec for guest-kernel mode), while still recording the host->guest
PCI mapping into sandbox.pcimap[cid] so env-var translation works.

devManager.NewDevice calls FindDevice first, which matches the
already cold-plugged sandbox-level device by HostPath / major / minor,
so this does not double-attach.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
9893b6dc03 runtime: correctly resolve cold-plug VFIO guest PCI paths
Populate missing VFIO guest PCI paths via QMP before serializing
container devices so guest-kernel PCI env translation has the mappings
it needs.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
118b7fa611 agent: reconcile VFIO netdev MAC before UpdateInterface lookup
When a VFIO cold-plugged network device appears in the guest with a
MAC different from the one in the runtime's request, resolve the netdev
by PCI path and apply the requested MAC before the normal by-MAC update
flow.

This preserves existing behavior while avoiding UpdateInterface
mismatches in SR-IOV cold-plug cases.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
e89eb77245 agent: keep PCIDEVICE env unchanged when pcimap is missing
Avoid failing container creation when per-container PCI mappings are
unavailable by preserving PCIDEVICE entries unchanged and warning
instead.
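
The preserve-and-warn behaviour can be sketched as below. `translate_pci_env` is a hypothetical function, not the agent's `update_env_pci`; treating the value as a comma-separated list of host addresses, and leaving the whole entry unchanged when any address is unmapped, are assumptions of this sketch.

```rust
use std::collections::HashMap;

/// Rewrite host PCI addresses in `PCIDEVICE_*` variables to guest addresses.
/// Returns the (possibly rewritten) environment plus any warnings.
/// Hypothetical sketch: entries without a usable mapping are kept as-is.
fn translate_pci_env(
    env: &[String],
    pcimap: Option<&HashMap<String, String>>,
) -> (Vec<String>, Vec<String>) {
    let mut out = Vec::with_capacity(env.len());
    let mut warnings = Vec::new();
    for entry in env {
        let (key, val) = match entry.split_once('=') {
            Some((k, v)) if k.starts_with("PCIDEVICE_") => (k, v),
            _ => {
                // Not a PCIDEVICE variable: pass through untouched.
                out.push(entry.clone());
                continue;
            }
        };
        // Assumption: the value is a comma-separated list of host addresses.
        let translated: Option<Vec<&str>> = pcimap.and_then(|map| {
            val.split(',')
                .map(|host| map.get(host).map(String::as_str))
                .collect()
        });
        match translated {
            Some(guest) => out.push(format!("{}={}", key, guest.join(","))),
            None => {
                warnings.push(format!("no PCI mapping for {}, leaving it unchanged", key));
                out.push(entry.clone());
            }
        }
    }
    (out, warnings)
}
```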

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-28 21:54:52 +02:00
Zvonko Kaiser
7f4a85e833 Merge pull request #13047 from microsoft/cameronbaird/upstream/macvtap
runtime: Configure network num_queues properly for CLH
2026-05-28 20:39:15 +02:00
Cameron Baird
2799f7d36b runtime: Enforce >= 1 queue pairs for tapNetworkPair
In the xConnectVMNetwork path, we have queues = 0 as a baseline,
set to h.HypervisorConfig().NumVCPUs() iff h.Capabilities() advertise
MultiQueueSupport. This is certainly incorrect as we always want, as
a baseline, at least one queue pair. Make queues := 1 by default
to ensure the NetworkPair has at least one queue pair for all
virtio-net paths.
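
The intended baseline can be expressed in a few lines. This sketch is in Rust rather than the Go of this commit, `queue_pairs` is a hypothetical name, and the extra `max(1)` guard for a zero vCPU count is an assumption that goes beyond what the message states.

```rust
/// Number of queue pairs for a tap-backed virtio-net device (sketch).
fn queue_pairs(multi_queue_supported: bool, num_vcpus: u32) -> u32 {
    // Baseline: never fewer than one queue pair.
    let mut queues = 1;
    if multi_queue_supported {
        // Assumption of this sketch: also clamp a zero vCPU count to 1.
        queues = num_vcpus.max(1);
    }
    queues
}
```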

Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
2026-05-27 18:55:11 +00:00
Fabiano Fidêncio
a423cf9526 Merge pull request #13087 from bpradipt/landlock
kernel: Enable landlock LSM
2026-05-27 17:34:47 +02:00
Pradipta Banerjee
1487eaaaa2 kernel: Enable landlock LSM
Allows using landlock LSM for the container process

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
2026-05-27 13:33:46 +02:00
Fabiano Fidêncio
5adfb27297 Merge pull request #13118 from PiotrProkop/fix-missing-cwd
agent: restore process CWD auto-creation
2026-05-27 13:32:05 +02:00
Fabiano Fidêncio
614dff4bfc Merge pull request #13119 from manuelh-dev/mahuber/erofs-multi-layer-fix
agent: compact EROFS overlay lowerdirs
2026-05-27 11:27:46 +02:00
Fabiano Fidêncio
238dd51039 Merge pull request #13108 from thebigbone/containerd-config
containerd: use /etc/containerd/conf.d/ drop-in for containerd >= 2.2.0
2026-05-27 10:14:51 +02:00
PiotrProkop
60a2e27f02 agent: Restore process CWD auto-creation
Commit b56313472 ("agent: Align agent OCI spec with oci-spec-rs",
PR #9944) inverted the condition guarding the create_dir_all call
for process.cwd: the leading `!` was dropped during the refactor.
As a result, the CWD is created only when process.cwd is the empty
string.

When the guest then runs chdir(process.cwd) and CWD doesn't exist
it returns ENOENT.  The agent propagates that to the shim, which
surfaces it to containerd as "failed to create shim task: ENOENT:
No such file or directory" — indistinguishable from a missing
argv[0].

This regressed the original fix in PR #2375 (Fixes #2374), which
deliberately mirrored runc's behavior.  Put the `!` back.
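
The shape of the restored guard, as a minimal sketch. `ensure_cwd` is a hypothetical function name and the agent's surrounding code is not reproduced here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Create the process working directory when one is configured (sketch).
fn ensure_cwd(cwd: &str) -> io::Result<()> {
    // The guard must read "not empty". Without the leading `!`, the
    // directory would be created only for the empty string.
    if !cwd.is_empty() {
        fs::create_dir_all(Path::new(cwd))?;
    }
    Ok(())
}
```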

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
2026-05-27 09:59:15 +02:00
Fabiano Fidêncio
f1c100797b Merge pull request #12955 from zvonkok/nvgpu-target
build: add nvgpu-tarball target
2026-05-27 09:44:37 +02:00
Fabiano Fidêncio
64056add0d build: add passthrough mode to kata-deploy-merge-builds
kata-deploy now unpacks individual component tarballs itself, so the
final `kata-static.tar.zst` no longer needs to be a merged filesystem
payload. Merging everything has two downsides for that flow:

  - It pulls in everything kept on disk under build/, which previously
    forced us to also drop agent/busybox/coco-guest-components/nydus
    from the build set to keep them out of the final tarball.
  - The merged tarball duplicates content kata-deploy will repack on
    its own anyway.

Add a `passthrough` mode to kata-deploy-merge-builds.sh that, instead
of untarring each `kata-static-*.tar.zst` into a single filesystem
tree, copies the selected component tarballs into the final tarball
as-is. The existing `merge` mode remains the default to preserve the
non-kata-deploy install paths (e.g. `make install-tarball`).

Wire `nvgpu-tarball` to the new mode via `FINAL_TARBALL_MERGE_MODE=
passthrough`, paired with the existing `FINAL_TARBALL_INPUTS`
allowlist. This lets us keep agent/busybox/coco as build prereqs of
the GPU rootfs while shipping a final tarball that only contains the
NVIDIA-relevant components.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-26 21:55:08 +02:00
Zvonko Kaiser
9b85bff2b4 build: don't double-prefix absolute versions.yaml path in merge-builds
The Makefile passes $(MK_DIR)/../../../../versions.yaml — already an
absolute path — to kata-deploy-merge-builds.sh. The script then
unconditionally prepended ${PWD}/, producing a malformed path like:

  /repo//repo/tools/.../local-build//../../../../versions.yaml

which made cp fail with "No such file or directory" at the merge-builds
step (the very last step of `make nvgpu-tarball`).

Only prepend ${PWD}/ when the input is relative — that preserves the
original fix for the pushd-changes-cwd issue (commit ae6e8d2b3) without
mangling absolute paths from Makefile callers.
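
The same rule, sketched in Rust rather than in the shell of the actual script. `resolve_versions_yaml` is a hypothetical name used only for this illustration.

```rust
use std::path::{Path, PathBuf};

/// Resolve `input` against `pwd` only when it is relative (sketch).
fn resolve_versions_yaml(pwd: &Path, input: &str) -> PathBuf {
    let p = Path::new(input);
    if p.is_absolute() {
        // Already absolute: use it unchanged.
        p.to_path_buf()
    } else {
        // Relative: anchor it at the original working directory.
        pwd.join(p)
    }
}
```

Rust's `Path::join` already replaces the base when handed an absolute path; the explicit branch is kept here to mirror the shell logic.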

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Assisted-By: Claude <noreply@anthropic.com>
2026-05-26 21:55:08 +02:00
Zvonko Kaiser
5aa6229eba build: group parallel build output by target
With `make all -j N` running multiple tarballs concurrently and silent
mode redirecting each build's stdio to its per-target log, a failing
target's "Failed to build: <name>, logs:" banner gets interleaved with
other in-flight jobs' output, making it hard to tell which target
failed.

Pass `--output-sync=target` to the recursive make so each sub-make's
output is buffered and emitted as one block when the target finishes,
keeping the failure banner contiguous with its log dump.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Assisted-By: Claude <noreply@anthropic.com>
2026-05-26 21:55:08 +02:00
Zvonko Kaiser
3be370d2d6 qemu: clean stale clone before fetching sources
build-qemu.sh runs in the per-target builddir (e.g.
build/qemu-tarball/builddir/), which persists across runs. If a previous
build left the cloned `qemu` tree behind (e.g. after an interrupted
build), the next run errors out with:

  fatal: destination path 'qemu' already exists and is not an empty
  directory.

Wipe `qemu` before cloning so the build is repeatable from a dirty
builddir.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Assisted-By: Claude <noreply@anthropic.com>
2026-05-26 21:55:08 +02:00
Zvonko Kaiser
18cee00df9 build: guard parallel races on build symlink and ~/.docker
Parallel make jobs invoke kata-deploy-binaries-in-docker.sh concurrently
and collide on two shared paths:

  ln: Already exists
  mkdir: /home/$USER/.docker: File exists

Skip the symlink creation when the link is already in place. If a
parallel job wins the create race in the cold-start window, fall back to
re-checking that the link exists so a real ln failure (permission, disk
full, etc.) still propagates rather than being silently swallowed.

The `~/.docker` mkdir is guarded by a `[[ ! -d ]]` check that two
processes can pass simultaneously, after which one bare `mkdir` fails.
Switch to `mkdir -p` so the second invocation is a no-op.
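
Both guards can be illustrated with the standard library. This is a Rust sketch of the idea, not the shell code of the script; `ensure_symlink` and `ensure_dir` are hypothetical names.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::symlink;
use std::path::Path;

/// Create `link -> target`, tolerating a concurrent job that got there first.
fn ensure_symlink(target: &Path, link: &Path) -> io::Result<()> {
    match symlink(target, link) {
        Ok(()) => Ok(()),
        // Lost the race: accept it only if a symlink really is in place.
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => {
            if fs::symlink_metadata(link)?.file_type().is_symlink() {
                Ok(())
            } else {
                Err(e)
            }
        }
        // Permission, disk-full and similar failures still propagate.
        Err(e) => Err(e),
    }
}

/// Equivalent of `mkdir -p`: succeeds if the directory already exists.
fn ensure_dir(dir: &Path) -> io::Result<()> {
    fs::create_dir_all(dir)
}
```

Attempting the operation and then inspecting the error avoids the check-then-act window that the `[[ ! -d ]]` test left open.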

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-26 21:55:08 +02:00
Zvonko Kaiser
815ebc340d build: add nvgpu-tarball target
serial-targets now waits for the other BASE_TARBALLS items so the
inner rootfs assembly runs with DEPS= against already-built
artifacts. This also fixes a pre-existing race in the main flows
where the outer parallel and inner -j 1 makes could both build
kernel-tarball at the same time.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-26 21:55:08 +02:00
Zvonko Kaiser
6a367ab777 build: declare install-prebuilt-artifacts as .PHONY
Leftover from #12954's rebase: the substantive sed-hack -> DEPS= change
landed on main, but the .PHONY declaration didn't make it. Add it so
the recipe always runs even if a stale `kata-artifacts` file exists in
CWD.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Assisted-By: Claude <noreply@anthropic.com>
2026-05-26 21:55:08 +02:00
thebigbone
d9f2aa895e containerd: use /etc/containerd/conf.d/ drop-in for containerd >= 2.2.0
containerd 2.2.0+ always imports /etc/containerd/conf.d/*.toml,
so write kata-deploy runtime config there directly, avoiding
modification of the main containerd config's imports array.
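
The version gate could be sketched like this. `parse_version` and `uses_conf_d_dropin` are hypothetical helpers; ignoring any pre-release or build suffix when comparing is an assumption of this sketch and not necessarily what kata-deploy does.

```rust
/// Parse "MAJOR.MINOR.PATCH", optionally prefixed with 'v' and optionally
/// followed by a '-' or '+' suffix (which this sketch ignores).
fn parse_version(v: &str) -> Option<(u32, u32, u32)> {
    let core = v.trim_start_matches('v');
    let core = core.split(|c: char| c == '-' || c == '+').next()?;
    let mut parts = core.split('.').map(|p| p.parse::<u32>().ok());
    let version = (parts.next()??, parts.next()??, parts.next()??);
    if parts.next().is_some() {
        return None; // more than three numeric components
    }
    Some(version)
}

/// True when the drop-in directory can be relied on (containerd >= 2.2.0).
fn uses_conf_d_dropin(containerd_version: &str) -> bool {
    matches!(parse_version(containerd_version), Some(v) if v >= (2, 2, 0))
}
```

Unparseable versions fall back to `false`, so the older imports-based path would be taken in that case.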

Signed-off-by: thebigbone <pacman@duck.com>
2026-05-26 21:29:46 +02:00
Manuel Huber
e838cd7d8d agent: compact EROFS overlay lowerdirs
Use kata_types::mount::Mount for the final multi-layer EROFS
overlay mount instead of calling baremount() directly.

The mount helper detects overlay option strings close to the kernel
mount data limit. When lowerdir entries share a common parent, it
changes into that directory and rewrites lowerdir to relative paths.
That avoids repeating the same long prefix for every layer.
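
The compaction idea, reduced to its core: find the shared parent and keep only the final path components. `compact_lowerdirs` is a hypothetical function for illustration; it does not check the option length, change directory, or call mount(2).

```rust
use std::path::{Path, PathBuf};

/// If every lowerdir has the same parent directory, return that parent and
/// a colon-joined list of names relative to it; otherwise None (sketch).
fn compact_lowerdirs(lowerdirs: &[&str]) -> Option<(PathBuf, String)> {
    let first_parent = Path::new(*lowerdirs.first()?).parent()?;
    let mut names = Vec::with_capacity(lowerdirs.len());
    for dir in lowerdirs {
        let p = Path::new(*dir);
        if p.parent()? != first_parent {
            return None; // no single common parent, leave the list alone
        }
        names.push(p.file_name()?.to_str()?);
    }
    Some((first_parent.to_path_buf(), names.join(":")))
}
```

A caller would change into the returned parent directory and pass the short list as `lowerdir=`.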

Multi-layer EROFS images can have many lower layers under
/run/kata-containers/<cid>/multi-layer. Passing the raw absolute
lowerdir list can exceed the mount option buffer and fail the final
overlay mount, even after all layer devices mounted successfully.

Reuse the helper so this path follows Kata's normal overlay mount
handling, including lowerdir compaction before mount(2).

Assisted-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-26 18:42:11 +00:00
Fabiano Fidêncio
d75a91ee09 Merge pull request #13114 from manuelh-dev/mahuber/nv-fix-policy-check
tests: nvidia: No policy for runtime-rs path
2026-05-26 20:00:02 +02:00
Dan Mihai
c81dadaba1 Merge pull request #13064 from burgerdev/add-arp-neighbour
agent: use rtnetlink to add ARP neighbour
2026-05-26 09:59:44 -07:00
Manuel Huber
6a715cf4f7 tests: nvidia: No policy for runtime-rs path
The current if condition causes agent security policies to be
attached to the non-TEE NVIDIA runtime-rs runtime class. While
it is good to see that this works, it is not intended. Thus,
replace the condition with is_confidential_gpu_hypervisor.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-25 16:00:49 -07:00
Fabiano Fidêncio
25491fc20c Merge pull request #13104 from kata-containers/topic/kata-deploy-build-as-an-artefact
kata-deploy: prebuild payload-specific component artifacts
2026-05-25 22:56:55 +02:00
Fabiano Fidêncio
c65d64873b kata-deploy: prebuild payload-specific component artifacts
Build and publish the kata-deploy binary and CoCo guest-pull nydus
snapshotter as dedicated per-arch artifacts, then consume those tarballs
when assembling the kata-deploy image.

This avoids rebuilding those components in the payload image path
(where the builds would run serially) and reduces overall CI build time.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-25 22:13:41 +02:00
Fabiano Fidêncio
3dc02a8604 Merge pull request #13085 from Apokleos/erofs-gpt-vmdk-only
runtime-rs: Support erofs snapshotter with gpt vmdk mode
2026-05-25 16:29:59 +02:00
Zvonko Kaiser
6c6c5809f1 Merge pull request #13109 from fidencio/topic/build-validate-measured-rootfs-root-hashes-for-all-shims
build: Validate measured-rootfs root hashes all shims
2026-05-25 15:58:35 +02:00
Zvonko Kaiser
aeadb1af35 Merge pull request #12948 from fidencio/topic/numa
runtime (go): agent: Add NUMA support for QEMU
2026-05-25 15:33:14 +02:00
Alex Lyn
53699b0170 docs: Reset max_unmerged_layers = 0 for gpt+vmdk mode
max_unmerged_layers = 1 applies only to fsmerge mode. Since containerd
temporarily does not support fsmerge, reset it to the default of 0.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:13:28 +08:00
Alex Lyn
a359d13476 build: Validate measured-rootfs root hashes all shims
The cached shim-v2 tarballs ship per-variant `root_hash_*.txt` files
embedded in the matching measured-rootfs image. Until now only
shim-v2-rust validated those hashes against the freshly built rootfs
images on a cache hit; shim-v2-go reused whatever was cached without
checking, even though its bundled configuration files contain the
`KERNELVERITYPARAMS_*` values baked in at build time.

When a PR changes the agent (and therefore the rootfs image and its
dm-verity hash) but does not touch `src/runtime`, the shim-v2-go cache
key stays the same and the stale tarball is reused. The resulting
guest cmdline carries a verity hash that no longer matches the new
rootfs image, so the VM panics very early in boot:

    device-mapper: verity: 254:1: metadata block 0 is corrupted
    erofs (device dm-0): cannot read erofs superblock
    Kernel panic - not syncing: VFS: Unable to mount root fs ...

Generalize the shim-v2-rust cache validation so it also runs for
shim-v2-go, push the per-variant root-hash sidecar files for both
shims, and fall back to a full rebuild whenever the cached hash is
missing or differs from the image one.
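
The reuse decision reduces to a comparison that treats a missing hash as a mismatch. A minimal sketch, with `cache_is_reusable` as a hypothetical name; trimming whitespace before comparing is an assumption made here for hashes read from text files.

```rust
/// Decide whether a cached shim tarball may be reused (sketch).
/// `cached_root_hash` is None when the sidecar file is missing.
fn cache_is_reusable(cached_root_hash: Option<&str>, image_root_hash: &str) -> bool {
    match cached_root_hash {
        // Assumption: compare after trimming surrounding whitespace.
        Some(cached) => cached.trim() == image_root_hash.trim(),
        // No recorded hash: force a full rebuild.
        None => false,
    }
}
```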

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:12:52 +08:00
Alex Lyn
fd139a1143 kata-deploy: Reset max_unmerged_layers to "0" within erofs snapshotter
Set max_unmerged_layers = 0 for the erofs snapshotter in gpt-vmdk
mode.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
2036e66bc3 kata-agent: Integrate GPT partition support into multi-layer handler
In GPT mode, all partitions share the same base block device, so
resolving it once per uevent source and caching the result avoids
redundant hotplug waits that would otherwise scale linearly with
layer count.

Layers are sorted by partition number before mounting to guarantee
correct overlay lowerdir precedence regardless of the order the host
emits Storage entries.
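
Ordering by partition number is a one-line sort. The `Layer` struct and `sort_layers` function below are hypothetical and exist only to illustrate the step; the agent's own types are not shown.

```rust
/// Hypothetical description of one EROFS layer inside a GPT disk.
#[derive(Debug, Clone, PartialEq)]
struct Layer {
    partition: u32,
    mount_point: String,
}

/// Order layers by partition number, independent of arrival order.
fn sort_layers(mut layers: Vec<Layer>) -> Vec<Layer> {
    layers.sort_by_key(|l| l.partition);
    layers
}
```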

Also remove the dead_code attributes now that the code is in use.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
17fadde6d8 kata-agent: Add GPT partition utility functions
The guest agent needs to resolve individual partition devices from a
single GPT-partitioned block device, but the kernel does not always
create partition nodes immediately after the base device appears,
especially when another fd holds the device open during hot-plug.

Add utility functions that handle two problems:
(1) Mapping a base device path to its partition path following the
kernel naming convention (bare suffix vs 'p' separator).
(2) Ensuring the partition node exists before mount.
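
The naming convention in (1) is that the kernel inserts a 'p' when the base device name ends in a digit and appends the number directly otherwise. A minimal sketch, with `partition_path` as a hypothetical helper name:

```rust
/// Map a base block device path and a partition number to the partition
/// node path, following the kernel naming rule (sketch).
fn partition_path(base: &str, partition: u32) -> String {
    // Names ending in a digit (nvme0n1, loop0) need a 'p' separator.
    let needs_p = base.chars().last().map_or(false, |c| c.is_ascii_digit());
    if needs_p {
        format!("{}p{}", base, partition)
    } else {
        format!("{}{}", base, partition)
    }
}
```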

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
8119a561ae kata-agent: Refactor wait_and_mount_layer to return LayerMountInfo
No functional change: all callers pass None, so
every call still resolves the device via uevent exactly as before.

This only prepares the multi-layer EROFS handler for GPT partition and
dm-verity support by widening the wait_and_mount_layer() interface
without changing behavior.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00