Commit Graph

19217 Commits

Author SHA1 Message Date
Fabiano Fidêncio
a2bb3f64b0 Merge pull request #12436 from mythi/tdx-updates-2026-3
runtime(-rs): tdx: use TDX QGS via unix-domain-socket by default
2026-06-03 08:50:26 +02:00
Fabiano Fidêncio
ecd9344dd1 Merge pull request #13144 from stevenhorsman/bump-rust-to-1.94
Bump rust to 1.94
2026-06-02 09:58:56 +02:00
Fabiano Fidêncio
230e01b04e Merge pull request #13126 from kata-containers/topic/runtimes-introduce-azure-specific-configs
runtime/runtime-rs: introduce Azure specific configs
2026-06-02 09:17:09 +02:00
stevenhorsman
b1928cc22f runtime-rs: run cargo fmt for Rust 1.94
Run cargo fmt on runtime-rs to ensure consistent formatting
with Rust 1.94 toolchain.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
2026-06-01 17:32:06 +01:00
Fabiano Fidêncio
57de50f43c Merge pull request #13141 from fidencio/topic/kata-deploy-fix-stale-containerd-import
kata-deploy: scrub stale containerd import on conf.d migration
2026-06-01 18:13:08 +02:00
Steve Horsman
a3cc016e2f Merge pull request #13140 from fidencio/topic/fix-besteffort-sandbox-cpu-sizing
runtime: oci: Only derive sandbox CPUs from shares when quota is unconstrained
2026-06-01 17:09:12 +01:00
stevenhorsman
f9c95a279e dragonball: Remove unnecessary unsafe blocks in cpuid
Rust 1.94 now warns about unnecessary unsafe blocks around
__get_cpuid_max(), __cpuid_count(), and host_cpuid() calls.
Remove the unsafe blocks as they are no longer needed.

This fixes the following clippy warnings in dbs-arch:
- warning: unnecessary `unsafe` block at brand_string.rs:106
- warning: unnecessary `unsafe` block at brand_string.rs:114
- warning: unnecessary `unsafe` block at common.rs:28
- warning: unnecessary `unsafe` block at common.rs:36

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
2026-06-01 17:07:16 +01:00
stevenhorsman
a63a948b4a libs: Remove unnecessary unsafe blocks in protection.rs
Rust 1.94 now warns about unnecessary unsafe blocks around
x86_64::__cpuid() calls. Remove the unsafe blocks as they are
no longer needed.

This fixes the following clippy warnings:
- warning: unnecessary `unsafe` block at line 129
- warning: unnecessary `unsafe` block at line 142

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
2026-06-01 17:04:43 +01:00
stevenhorsman
9625bf8056 versions: Update MSRV to 1.94
With the bump to 1.94, we now rely on some 1.94+ APIs,
so update the MSRV to reflect this.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-06-01 17:02:20 +01:00
stevenhorsman
4987d79e26 versions: Bump rust to 1.94
Now that 1.96 has been released, our toolchain guidance says
we should bump to Rust 1.94.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Assisted-by: IBM Bob
2026-06-01 16:39:06 +01:00
Greg Kurz
8a49ecb159 Merge pull request #13097 from BbolroC/fix-shim-components-for-s390x
ci: Refactor boot-image-se build and update shim components
2026-06-01 11:43:42 +02:00
Fabiano Fidêncio
f788997253 kata-deploy: scrub stale containerd import on conf.d migration
Since the conf.d migration (containerd >= 2.2.0), kata-deploy writes its
drop-in to the auto-imported /etc/containerd/conf.d/ and no longer manages
the main config's `imports` array. A node upgraded from a pre-conf.d
kata-deploy keeps the legacy `{dest_dir}/containerd/config.d/kata-deploy.toml`
entry in `imports`, since the new code neither adds nor removes it.

On uninstall, remove_artifacts() deletes the artifacts dir (including the
file that import still points at) and then restarts containerd, which fails
to load the now-dangling import and wedges the node: pods get stuck
Terminating and new pods cannot start. This broke the lifecycle-manager E2E
tests (TC-02..TC-07) which repeatedly upgrade then reinstall across the
3.30.0 -> latest version boundary.

Defensively scrub the legacy import from the main containerd config in both
configure_containerd (at conf.d migration time) and cleanup_containerd
(before artifacts are removed and containerd is restarted). The helper is a
no-op when the config is absent, has no `imports` array, or does not contain
the legacy entry.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-06-01 11:07:13 +02:00
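The no-op behaviour described in f788997253 can be sketched as follows. This is only an illustration under assumptions: the `imports` array is modelled as an optional in-memory list rather than a parsed containerd TOML file, and `scrub_legacy_import` is a hypothetical name, not the helper kata-deploy actually ships.

```rust
/// Hypothetical helper: drop a legacy import entry from an `imports` list.
/// `imports` is `None` when the config is absent or has no `imports` array.
/// Returns true only if an entry was actually removed.
fn scrub_legacy_import(imports: Option<&mut Vec<String>>, legacy_entry: &str) -> bool {
    // Config absent or no `imports` array: nothing to do.
    let Some(list) = imports else {
        return false;
    };
    let before = list.len();
    // Keep every entry except the legacy one; other imports are untouched.
    list.retain(|entry| entry.as_str() != legacy_entry);
    list.len() != before
}
```

Returning a flag lets a caller skip rewriting the config file when nothing changed, which keeps the helper safe to call from both the configure and cleanup paths.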
Fabiano Fidêncio
9b5b829265 runtime: oci: derive sandbox CPUs from shares only if unconstrained
The shares-based fallback added for cpuManagerPolicy=static fired whenever
the quota-based CPU count was 0, including for BestEffort sandboxes that
have no CPU request. Those sandboxes still carry the cgroup-floor shares
value (2), so the fallback derived ceil(2/1024)=1 and inflated every such
sandbox by one vCPU. For peer-pods (static resource management) this
changed the VM sizing to default_vcpus+1, regressing the libvirt
instance-type CI checks.

Gate the fallback on the quota being explicitly unconstrained (< 0), which
is the actual cpuManagerPolicy=static signal, instead of on numCPU == 0.
BestEffort sandboxes (quota 0/absent) now correctly contribute 0 vCPUs
while the static-policy case still recovers the CPU count from shares.

Add unit tests covering the static-policy, rounding, BestEffort, and
explicit-quota cases.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-06-01 09:50:49 +02:00
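The gating rule from 9b5b829265 can be illustrated with a small sketch. It assumes the usual cgroup conventions (a negative quota means unconstrained, 1024 shares per CPU); `sandbox_vcpus` and its signature are hypothetical and do not mirror the Go runtime's actual code.

```rust
/// Hypothetical sketch of the sizing rule, not the runtime's implementation.
/// `quota` < 0 means "explicitly unconstrained"; 0 stands for absent.
fn sandbox_vcpus(quota: i64, period: u64, shares: u64) -> u64 {
    const SHARES_PER_CPU: u64 = 1024;
    if quota > 0 && period > 0 {
        // Explicit quota: ceil(quota / period).
        let q = quota as u64;
        return (q + period - 1) / period;
    }
    if quota < 0 {
        // Unconstrained quota (the static-policy signal):
        // recover the CPU count as ceil(shares / 1024).
        return (shares + SHARES_PER_CPU - 1) / SHARES_PER_CPU;
    }
    // Quota 0 or absent (e.g. BestEffort): contribute no vCPUs,
    // even though the cgroup-floor shares value is non-zero.
    0
}
```

With these assumptions, a BestEffort sandbox (`quota = 0`, `shares = 2`) contributes 0 vCPUs, while an unconstrained sandbox with 2048 shares contributes 2.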
Fabiano Fidêncio
02fd572195 Merge pull request #13134 from jojimt/rc-version
kata-deploy: Add a version annotation to runtimeclass
2026-06-01 08:21:30 +02:00
manuelh-dev
953b306ff3 Merge pull request #12979 from manuelh-dev/mahuber/erofs-tmpfs-mount
runtime-rs/agent: support EROFS snapshots without a rwlayer
2026-05-29 13:50:27 -07:00
Aurélien Bombo
9acef4bc55 Merge pull request #13133 from microsoft/cameronbaird/upstream/revert-macvtap-simple
Revert "runtime: Enforce >= 1 queue pairs for tapNetworkPair"
2026-05-29 14:57:07 -05:00
Fabiano Fidêncio
f349d19bf4 Merge pull request #12956 from zvonkok/nvgpu-tarball-chart
build: add kata-deploy-publish target
2026-05-29 21:22:44 +02:00
Fabiano Fidêncio
4e7b49fede Merge pull request #13103 from fidencio/topic/mlx-coldplug-support-v2
runtime / agent / kernel: fix cold-plug VFIO guest-kernel mode for SR-IOV RoCE/InfiniBand
2026-05-29 21:21:46 +02:00
Joji Mekkattuparamban
8549d71c6f kata-deploy: Add a version annotation to runtimeclass
Lets automation determine the version with only read RBAC on the
runtime class. This helps when versions need to match other tools
(e.g. genpolicy) or when a simple version lookup is needed for other
reasons.

Fixes #13123

Signed-off-by: Joji Mekkattuparamban <jojim@nvidia.com>
2026-05-29 10:50:19 -07:00
Cameron Baird
7a9d207ab2 Revert "runtime: Enforce >= 1 queue pairs for tapNetworkPair"
This reverts commit 2799f7d36b.
2026-05-29 17:05:40 +00:00
Zvonko Kaiser
7f906ec95d build: add kata-deploy-publish target
Mirror the CI payload publish flow in local builds, including image and
helm chart publishing, while reusing the same chart upload helper in
payload-after-push to avoid duplicated chart packaging logic.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-29 16:22:12 +02:00
Zvonko Kaiser
fb73ccc352 build: include kata-deploy static artifacts in nvgpu bundle
Build and package kata-deploy binary and nydus snapshotter component
tarballs as part of nvgpu-tarball so local publish can consume a single
kata-static.tar.zst without rebuilding extra artifacts.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-29 16:22:12 +02:00
Fabiano Fidêncio
10e70a2a9f runtime-rs: expose InfiniBand devices to VFIO containers
The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a
PCIDEVICE_* environment variable; it does not add the VFIO char device
to linux.devices in the OCI spec.  As a result the agent's
container_has_vfio_device() gate stays closed and
expose_guest_infiniband_devices() is never triggered — leaving
/dev/infiniband absent from the container even though the guest kernel
created the IB devices (mlx5_core.rdma.0 probes successfully).

The cold_plug_bdfs map (host_bdf → guest_pci_path, built from network
endpoints via host_bdf()) was already present inside handler_devices()
but could never be consumed because the LinuxDeviceType::C loop has
no entries to iterate over when linux.devices is empty.

After that loop, iterate over any unmatched cold-plug BDFs, derive the
VFIO group path via bdf_to_vfio_group_path() (reads
/sys/bus/pci/devices/<bdf>/iommu_group), and push a vfio-pci-gk
ContainerDevice.  The vfio_group_to_bdf() short-circuit inside the
loop handles the case where the device plugin does add VFIO char
devices to linux.devices; it now supports both legacy (/dev/vfio/N)
and iommufd (/dev/vfio/devices/vfioN) path formats.

Add host_bdf() to the Endpoint trait (default: None) so that
PhysicalEndpoint can expose its BDF for the cold_plug_bdfs map.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
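The two path formats named in 10e70a2a9f can be told apart with a small classifier. The sketch below is hypothetical: `VfioNode` and `parse_vfio_path` are illustrative names, and it only classifies the path string. It does not perform the sysfs lookup that the commit's vfio_group_to_bdf() would need.

```rust
/// Hypothetical classification of a VFIO device node path.
#[derive(Debug, PartialEq, Eq)]
enum VfioNode {
    /// Legacy group node: /dev/vfio/<N>
    Group(u32),
    /// iommufd cdev node: /dev/vfio/devices/vfio<N>
    Cdev(u32),
}

/// Returns None for anything that is not one of the two formats,
/// including the /dev/vfio/vfio control node.
fn parse_vfio_path(path: &str) -> Option<VfioNode> {
    let rest = path.strip_prefix("/dev/vfio/")?;
    if let Some(index) = rest.strip_prefix("devices/vfio") {
        return index.parse::<u32>().ok().map(VfioNode::Cdev);
    }
    rest.parse::<u32>().ok().map(VfioNode::Group)
}
```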
Fabiano Fidêncio
60f2878c68 runtime-rs: call network.remove() during resource cleanup
network.remove() — which detaches endpoints and rebinds VFs from
vfio-pci back to the host driver — was never being called.
ResourceManagerInner::cleanup() handled cgroups, bindmounts, share-fs,
swap and ephemeral disks, but completely omitted the network teardown.

Call network.remove() at the start of cleanup(), using the already-held
self.hypervisor reference.  Errors are logged as warnings rather than
propagated, so they don't block the rest of the cleanup sequence.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
0b4b51dff6 runtime-rs: always detach endpoints on network removal
network_with_netns::remove() bailed out early when network_created=false
(i.e. the netns was created by the CNI, not by kata). This caused
physical endpoint VFs to remain bound to vfio-pci after pod deletion,
because PhysicalEndpoint::detach() — which calls bind_device_to_host()
to rebind the VF from vfio-pci back to mlx5_core — was never reached.

Separate endpoint detachment from netns deletion: always detach
endpoints, but only remove the netns if kata created it.  Detach errors
are logged as warnings rather than propagated, to mirror the Go runtime's
best-effort approach and avoid blocking sandbox teardown.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
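The separation described in 0b4b51dff6 (always detach, delete the netns only if kata created it) might look roughly like the sketch below. All types here are simplified stand-ins invented for illustration. The real runtime-rs traits are async and carry more state.

```rust
/// Hypothetical stand-in for the runtime's endpoint abstraction.
trait Endpoint {
    fn detach(&mut self) -> Result<(), String>;
}

/// Hypothetical network state; `network_created` is true only when
/// the netns was created by kata rather than by the CNI.
struct Network<E: Endpoint> {
    endpoints: Vec<E>,
    network_created: bool,
}

/// Toy endpoint used only to exercise the sketch.
struct FakeEndpoint {
    fail: bool,
    detached: bool,
}

impl Endpoint for FakeEndpoint {
    fn detach(&mut self) -> Result<(), String> {
        if self.fail {
            return Err("rebind failed".to_string());
        }
        self.detached = true;
        Ok(())
    }
}

/// Detach every endpoint, collecting failures as warnings instead of
/// returning early, and report whether the netns should be deleted.
fn remove_network<E: Endpoint>(net: &mut Network<E>) -> (Vec<String>, bool) {
    let mut warnings = Vec::new();
    for ep in net.endpoints.iter_mut() {
        if let Err(e) = ep.detach() {
            // Best effort: record the failure and keep going.
            warnings.push(format!("endpoint detach failed: {e}"));
        }
    }
    (warnings, net.network_created)
}
```

The key point of the sketch is that the `network_created` check no longer guards the detach loop, so a failing endpoint cannot stop later endpoints from being detached.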
Fabiano Fidêncio
be2ec02c9a runtime-rs: resolve cold-plug VFIO guest PCI path via QMP
The PCIe topology pre-computes a wrong path for cold-plugged physical-
endpoint VFs because the root port has no explicit addr and QEMU auto-
assigns its slot. The pre-computed PciPath { slots: [PciSlot(0)] }
resolves to 0000:00:00.0 (the Q35 MCH), causing
wait_for_pci_net_interface to time out looking for a netdev there.

Add resolve_vfio_device_pci_path(hostdev_id) to the Hypervisor trait.
Implement it in QemuInner using qmp.get_device_by_qdev_id(), which
queries QEMU's query-pci to find the full guest PCIe path (e.g. "05/00"
= slot 5 on pcie.0 / slot 0 on the root port bus).

Store the QEMU device ID (hostdev_id) in PhysicalEndpoint during
attach(). Add vfio_hostdev_id() and set_guest_pci_path() to the
Endpoint trait and add an endpoints() accessor to the Network trait.

In setup_after_start_vm(), call resolve_physical_endpoint_pci_paths()
before apply_network_to_agent() to populate the correct path from QMP
into each PhysicalEndpoint's guest_pci_path field. The field is then
consumed by network_with_netns::interfaces() to fill Interface.device_path
before update_interface is sent to the agent.

This is the runtime-rs counterpart of the Go runtime's
ResolveColdPlugVFIOGuestPciPaths / qomGetPciPath.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
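The "05/00" path notation used in be2ec02c9a can be decoded with a short parser. This sketch assumes each component is a hexadecimal PCI slot number and relies on the fact that PCI slot (device) numbers are below 32. `parse_pci_path` is a hypothetical name, not the runtime's PciPath implementation.

```rust
/// Hypothetical parser for a guest PCI path such as "05/00":
/// one hexadecimal slot per bridge level, separated by '/'.
/// Returns None for empty input, non-hex components, or slots >= 32.
fn parse_pci_path(path: &str) -> Option<Vec<u8>> {
    if path.is_empty() {
        return None;
    }
    path.split('/')
        .map(|component| {
            let slot = u8::from_str_radix(component, 16).ok()?;
            // A PCI bus has at most 32 slots (5-bit device number).
            if slot < 32 {
                Some(slot)
            } else {
                None
            }
        })
        .collect()
}
```

Under these assumptions, "05/00" decodes to slot 5 on the root bus followed by slot 0 behind the root port, while a single "00" decodes to slot 0 on the root bus, which is where the wrongly pre-computed path pointed.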
Fabiano Fidêncio
f8ee9133e5 runtime-rs: populate device_path for cold-plug VFIO physical endpoints
Without device_path the agent receives Interface.device_path="" in
update_interface, falls back to a by-MAC link lookup, and fails for
SR-IOV VFs whose firmware MAC differs from the CNI-assigned MAC after
the vfio-pci unbind/rebind cycle.

The guest PCI path is computed at attach() time by do_add_pcie_endpoint()
inside VfioDevice::register() — no QMP query is needed. Cache it in
PhysicalEndpoint.guest_pci_path (Mutex<Option<String>>) during attach()
when do_handle_device() returns the DeviceType::Vfio with the path
already filled in.

Add a default-None guest_pci_path() method to the Endpoint trait;
PhysicalEndpoint overrides it to return the cached path. In
network_with_netns.rs::interfaces(), after building each Interface from
network_info, fill device_path from endpoint.guest_pci_path() when the
field would otherwise be empty.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
67843220f8 runtime-rs: set VF admin MAC before vfio-pci rebind for IB/RoCE support
Without an admin MAC the guest mlx5_core inherits whatever firmware-
default MAC the VF was created with. This MAC differs from the IB port
HCA MAC, so mlx5_ib's GID cache refuses to populate
/sys/class/infiniband/mlx5_*/ports/N/gids/*. RoCE appears active but
every verb needing a GID fails.

Before bind_device_to_vfio(), push the CNI-assigned MAC down to the VF
as an "admin MAC" via the parent PF using RTM_SETLINK with
IFLA_VFINFO_LIST — the netlink equivalent of
  ip link set <PF> vf <N> mac <MAC>

The operation runs in a spawn_blocking closure that enters the host
network namespace (via NetnsGuard("/proc/1/ns/net")), since attach() is
called while the thread is inside the pod netns.

Best-effort: failures are logged at warn and the existing agent-side MAC
reconciliation (update_interface in rpc.rs) remains as a fallback for
L2/L3 connectivity.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
9e9b50c79e runtime-rs: cold-plug Vfio physical endpoints at VM launch
DeviceType::Vfio (used by physical network VFs) was silently dropped
in start_vm()'s cold-plug loop, falling through to the unsupported-
device info log. The VF never appeared on the QEMU command line and
therefore never became visible inside the guest.

Add handling for DeviceType::Vfio in the start_vm() cold-plug loop.
For each HostDevice in the VfioDevice, emit:

  -device vfio-pci,host=<bdf>,id=<hostdev_id>,bus=<root-port>, \
      [x-pci-vendor-id=...,x-pci-device-id=...]

The bus assignment and guest PCI path are already computed by
do_add_pcie_endpoint() at VfioDevice::register() time (called from
VfioDevice::attach() via the PCIe topology), so no additional QMP
resolution is needed here.

Add id= support to PCIeVfioDevice so the QEMU device name is stable
and matchable in QMP queries. Add new_without_iommufd() constructor
for the non-IOMMUFD (legacy VFIO container) path used by physical
endpoints, and add_physical_vfio_device() to QemuCmdLine as a
direct emission helper.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
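The -device argument shown in 9e9b50c79e can be assembled as in the sketch below. The function name and parameters are hypothetical. The optional vendor and device IDs are passed through as opaque strings because the commit message does not spell out their format.

```rust
/// Hypothetical builder for the value that follows `-device`.
/// `pci_ids`, when present, is (vendor_id, device_id) as already-formatted
/// strings; this sketch does not validate or reformat them.
fn vfio_pci_device_arg(
    bdf: &str,
    hostdev_id: &str,
    bus: &str,
    pci_ids: Option<(&str, &str)>,
) -> String {
    // An explicit id= keeps the QEMU device name stable for later QMP queries.
    let mut arg = format!("vfio-pci,host={bdf},id={hostdev_id},bus={bus}");
    if let Some((vendor, device)) = pci_ids {
        arg.push_str(&format!(
            ",x-pci-vendor-id={vendor},x-pci-device-id={device}"
        ));
    }
    arg
}
```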
Fabiano Fidêncio
91df041803 agent: expose guest InfiniBand devices to VFIO containers
When a VF is cold-plugged in guest-kernel mode, mlx5_core binds to the
PCI device inside the VM and mlx5_ib creates IB character devices under
/dev/infiniband/ (uverbs*, rdma_cm, umad*). The container cannot reach
these devices unless they are explicitly added to its OCI spec.

Add expose_guest_infiniband_devices(), called from create_devices() when
the container carries at least one VFIO device entry. The function:

  - Walks /dev/infiniband/ inside the guest VM.
  - Appends each char device to spec.linux.devices.
  - Inserts matching cgroup allow rules (rwm).
  - Is a no-op if /dev/infiniband/ is absent or empty (no IB driver,
    or VF not yet rebound), so non-RDMA pods are unaffected.

Gate the call on container_has_vfio_device() so unrelated containers
sharing the sandbox do not get IB device access widened.

Add is_vfio_device_type() and snapshot_infiniband() to
kata-sys-util/pcilibs. is_vfio_device_type() lets the agent check
device type strings against the VFIO driver name constants without
duplication. snapshot_infiniband() summarises /sys/class/infiniband,
/sys/class/infiniband_verbs, and /dev/infiniband as a single diagnostic
string for log context; it lives in pcilibs because it has no
agent-specific dependencies (pure sysfs/devfs reads).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
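The directory walk described in 91df041803 can be approximated with the standard library alone. The sketch below is a simplified assumption of that step: `list_char_devices` is a hypothetical name, it only collects paths (no OCI spec or cgroup updates), and it scans a single directory level.

```rust
use std::fs;
use std::os::unix::fs::FileTypeExt;
use std::path::{Path, PathBuf};

/// Hypothetical helper: list character devices directly under `dir`.
/// Returns an empty list when the directory is absent or unreadable,
/// which mirrors the no-op behaviour for non-RDMA pods.
fn list_char_devices(dir: &Path) -> Vec<PathBuf> {
    let Ok(entries) = fs::read_dir(dir) else {
        return Vec::new();
    };
    let mut found: Vec<PathBuf> = entries
        .filter_map(|entry| entry.ok())
        // Keep only character devices such as uverbs*, rdma_cm, umad*.
        .filter(|entry| {
            entry
                .file_type()
                .map(|t| t.is_char_device())
                .unwrap_or(false)
        })
        .map(|entry| entry.path())
        .collect();
    // Sort for a deterministic order in logs and specs.
    found.sort();
    found
}
```

A caller would pass `Path::new("/dev/infiniband")` inside the guest and append one device entry per returned path.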
Fabiano Fidêncio
9729ed9993 kernel: enable InfiniBand/RoCE support in mlx5 kernel config fragment
Add the kernel configuration options required for RDMA / RoCE operation
with Mellanox ConnectX / BlueField VFs:

  - CONFIG_INFINIBAND: IB subsystem core
  - CONFIG_INFINIBAND_ADDR_TRANS: RoCEv2 GID table management
  - CONFIG_INFINIBAND_USER_ACCESS: userspace verbs (/dev/infiniband/uverbs*)
  - CONFIG_INFINIBAND_USER_MAD: userspace MAD interface
  - CONFIG_MLX5_INFINIBAND: mlx5_ib ConnectX IB/RoCE driver
  - CONFIG_CGROUP_RDMA: RDMA cgroup controller (required by mlx5_ib)

Bump kata_config_version to 196 to trigger a kernel rebuild.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
025202a52a runtime: expose InfiniBand devices to VFIO containers
The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a
PCIDEVICE_* environment variable; it does not add the VFIO char device
to linux.devices in the OCI spec.  As a result the agent's
container_has_vfio_device() gate stays closed and
expose_guest_infiniband_devices() is never triggered — leaving
/dev/infiniband absent from the container even though the guest kernel
created the IB devices (mlx5_core.rdma.0 probes successfully).

Add appendPhysicalEndpointDevices() which runs after appendDevices()
in createContainer().  It walks the sandbox network endpoints; for
each PhysicalEndpoint with a resolved guest PCI path it derives the
VFIO group char path from sysfs (iommu_group symlink) and synthesises
a vfio-pci-gk Device entry.  Both legacy group paths (/dev/vfio/N)
and iommufd cdev paths (/dev/vfio/devices/vfioN) are supported by
reading the iommu_group sysfs symlink.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Hyounggyu Choi
3175bf683e GHA: Remove secret CI_HKD_PATH from workflows
As the boot-image-se builds a fake image, the secret
CI_HKD_PATH is not necessary anymore.
Remove it from the workflows.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2026-05-29 11:35:40 +02:00
Hyounggyu Choi
640fa488a5 ci: Refactor boot-image-se build and update shim components
- Add FAKE_SE_IMAGE mode support in SE image build scripts for CI without real SE setup
- Simplify workflow by removing build-asset-boot-image-se job
- Integrate fake-boot-image-se into build matrix instead of separate job
- Skip attestation for fake-boot-image-se builds
- Update qemu-se and qemu-se-runtime-rs shim components to use:
  - rootfs-initrd-confidential instead of rootfs-image-confidential
  - boot-image-se component

This change streamlines the s390x SE build process and makes it easier
to test without requiring actual Secure Execution infrastructure.
This fixes deployment issues on non-TEE systems where TEE-specific artifacts
(like boot-image-se for IBM SEL) are not included in the kata-deploy image,
while ensuring TEE systems still get all required components.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2026-05-29 11:35:40 +02:00
Manuel Huber
7d9a143747 ci: cover EROFS snapshotter default_size=0 path
kata-deploy currently hard-codes the EROFS snapshotter
default_size to "10G", so the CoCo EROFS CI lane only
exercises the path where the snapshotter provides an rwlayer.

Use the generic containerd.userDropIn support for the EROFS
default_size and thread it through the Kubernetes CI helpers.
Keep the kata-deploy default at "10G" to preserve current
behavior, but allow the workflow to set "0" for the runtime-rs
no-rwlayer path.

Expand the existing EROFS snapshotter job to run both values.
The override is written to containerd as a TOML string so "0"
is not parsed as an integer.

Assisted-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-28 22:54:56 +00:00
Fabiano Fidêncio
744ab0b548 ci: improve kata-deploy pod wait and timeout diagnostics
Make kata-deploy deployment waits more robust by deriving the pod
selector from the rendered helm values and using it consistently for
readiness checks and logs.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-28 23:32:37 +02:00
Fabiano Fidêncio
bddf1ecab4 build: stop producing cloud-hypervisor-glibc artifacts
Drop cloud-hypervisor-glibc from local and CI kata-deploy build targets
now that Azure CLH uses the standard cloud-hypervisor artifact set.

This removes obsolete build matrix entries and installer target
handling.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-28 23:32:37 +02:00
Fabiano Fidêncio
81ce51a9aa ci: target Azure CLH runtimes directly in AKS tests
Switch AKS Mariner matrix entries to clh-azure handlers and remove the
temporary host-OS based helm value overrides.

Update integration test wiring and required test labels so CI tracks the
new runtime names.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-28 23:32:37 +02:00
Fabiano Fidêncio
8c3a2c1a95 kata-deploy: register clh-azure shim families
Add clh-azure and clh-azure-runtime-rs as first-class shims across
installer logic, helm defaults, runtimeclass overhead mapping, and shim
component catalogs.

This aligns deploy payload selection with the new native Azure-specific
CLH configs.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-28 23:32:37 +02:00
Fabiano Fidêncio
f36c383b4f runtime: generate dedicated CLH Azure config variants
Create configuration-clh-azure{,-runtime-rs}.toml from the base CLH
configs during build.

This keeps Mariner-specific defaults in explicit config artifacts
instead of ad-hoc runtime mutation.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-28 23:32:37 +02:00
Fabiano Fidêncio
fa9a9f3aeb runtime: set VF admin MAC before vfio-pci rebind for IB/RoCE support
Without an admin MAC, the guest's mlx5_core inherits the VF's
firmware-default MAC. This MAC differs from the IB port's HCA MAC, so
mlx5_ib's GID cache refuses to populate
/sys/class/infiniband/mlx5_*/ports/N/gids/*. RoCE then appears active
(port = ACTIVE, link_layer = Ethernet) but every verb that needs a GID
— RoCEv2 packets, address handles, librdmacm bind — fails silently.

Push the CNI-assigned MAC down to the VF as an "admin MAC" via the PF
using RTM_SETLINK before the bind-to-vfio-pci step. The firmware
applies the admin MAC during the VF reset that accompanies the
unbind/rebind cycle, so the guest sees a single consistent MAC across
netdev, IB port, and HCA.

Best-effort: failures are logged at warn and the existing agent-side
MAC reconciliation (rpc.rs::update_interface) remains as a fallback for
L2/L3 connectivity.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
992a723392 runtime: resolve cold-plug VFIO guest PCI path via QMP
For QEMU cold-plug + guest-kernel mode the guest BDF of a cold-plugged
VFIO device is auto-allocated at boot (each pcie-root-port is added with
chassis=N,slot=N but no pinned addr=, so QEMU picks the next free slot
on pcie.0). The hot-plug path already queries QMP via qomGetPciPath;
reuse that same mechanism for cold-plugged devices.

Add ResolveColdPlugVFIOGuestPciPaths to the Hypervisor interface.
Implement it in qemu.go using qomGetPciPath. Add no-op stubs for all
other hypervisors.

Call it at the start of setupNetworks so that the PCI paths are resolved
before generateVCNetworkStructures emits the agent Interface proto. Also
stamp the resolved path onto PhysicalEndpoints (used by SR-IOV VFs
exposed as physical network devices) so that update_interface carries a
non-empty devicePath. Without devicePath the agent falls back to a
by-MAC link lookup which fails when the VF firmware MAC differs from the
CNI-assigned MAC after the vfio-pci unbind/rebind cycle.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
23c5250933 runtime/qemu: emit id= for VFIODevice on -device cmdline
Without an explicit id= on the vfio-pci device, QEMU auto-generates
an internal name that does not match vfioDev.ID, so any subsequent
qomGetPciPath(vfioDev.ID) call via QMP fails with "Device 'X' not
found". This breaks resolveColdPlugVFIOGuestPciPaths which needs the
device ID to look up the guest PCI path, leaving GuestPciPath nil and
causing update_interface to fail repeatedly as the agent can't find
the interface to configure.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
e6777f0866 runtime: keep cold-plug VFIO devices in guest-kernel mode
Container.createDevices was dropping cold-plug VFIO entries from the
container's deviceInfos whenever vfio_mode = "guest-kernel", which
in turn meant the agent's CreateContainer request carried no
vfio-pci-gk device entry and sandbox.pcimap[cid] stayed empty. The
SR-IOV device plugin still set PCIDEVICE_<RES>=<host-BDF> on the
workload container, so update_env_pci then aborted with
"No PCI mapping found for container <id>" and the container failed
with CrashLoopBackOff.

Include cold-plug VFIO devices in deviceInfos for both VFIO modes.
The existing vfio-pci-gk agent handler returns dev: None (so
/dev/vfio/<group> is not materialised in the container spec, and
constrainGRPCSpec(stripVfio=true) already strips it from the grpc
spec for guest-kernel mode), while still recording the host->guest
PCI mapping into sandbox.pcimap[cid] so env-var translation works.

devManager.NewDevice calls FindDevice first, which matches the
already cold-plugged sandbox-level device by HostPath / major / minor,
so this does not double-attach.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
9893b6dc03 runtime: correctly resolve cold-plug VFIO guest PCI paths
Populate missing VFIO guest PCI paths via QMP before serializing
container devices so guest-kernel PCI env translation has the mappings
it needs.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
118b7fa611 agent: reconcile VFIO netdev MAC before UpdateInterface lookup
When a VFIO cold-plugged network device appears in the guest with a
different MAC than the one in the runtime request, resolve the netdev
by PCI path and apply the requested MAC before the normal by-MAC
update flow.

This preserves existing behavior while avoiding UpdateInterface
mismatches in SR-IOV cold-plug cases.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
e89eb77245 agent: keep PCIDEVICE env unchanged when pcimap is missing
Avoid failing container creation when per-container PCI mappings are
unavailable by preserving PCIDEVICE entries unchanged and warning
instead.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-28 21:54:52 +02:00
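The fallback in e89eb77245 can be sketched as an environment-variable rewrite that degrades gracefully. The names below are hypothetical. The sketch also assumes that the value is a comma-separated list of host BDFs and that an unmapped BDF is left as-is. Neither assumption is stated in the commit message.

```rust
use std::collections::HashMap;

/// Hypothetical translation of one "NAME=value" environment entry.
/// `pcimap` maps host BDFs to guest BDFs for this container; `None`
/// models the "mapping unavailable" case, which leaves the entry unchanged.
fn translate_pcidevice_env(var: &str, pcimap: Option<&HashMap<String, String>>) -> String {
    let Some((name, value)) = var.split_once('=') else {
        return var.to_string();
    };
    if !name.starts_with("PCIDEVICE_") {
        return var.to_string();
    }
    // No mapping: keep the entry as-is instead of failing container
    // creation (the real agent would also log a warning here).
    let Some(map) = pcimap else {
        return var.to_string();
    };
    let translated: Vec<&str> = value
        .split(',')
        .map(|bdf| map.get(bdf).map(String::as_str).unwrap_or(bdf))
        .collect();
    format!("{name}={}", translated.join(","))
}
```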
Zvonko Kaiser
7f4a85e833 Merge pull request #13047 from microsoft/cameronbaird/upstream/macvtap
runtime: Configure network num_queues properly for CLH
2026-05-28 20:39:15 +02:00
Cameron Baird
2799f7d36b runtime: Enforce >= 1 queue pairs for tapNetworkPair
In the xConnectVMNetwork path, queues starts at 0 and is set to
h.HypervisorConfig().NumVCPUs() iff h.Capabilities() advertises
MultiQueueSupport. This is incorrect, because we always want at
least one queue pair as a baseline. Make queues := 1 the default
to ensure the NetworkPair has at least one queue pair for all
virtio-net paths.

Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
2026-05-27 18:55:11 +00:00
Fabiano Fidêncio
76212b9e0c kata-deploy: allow containerd user drop-in overrides
Add an optional user-provided containerd drop-in that is loaded after
kata-deploy's generated drop-in so operators can override snapshotter
and other runtime settings without patching kata-deploy.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-27 17:26:55 +00:00