Commit Graph

6537 Commits

Author SHA1 Message Date
Alex Lyn
fa84eecd2d runtime-rs: Implement ShareVirtioFsNydus for standalone mode
Introduce `ShareVirtioFsNydus` to enable standalone Nydus rootfs
support. This implementation acts as the bridge between runtime-rs
and the external `nydusd` daemon.

Key Capabilities:
(1) Trait Implementation: Implements `ShareFs` (for VM device/storage) and
  `NydusShareFs` (for RAFS lifecycle) traits.
(2) Daemon Lifecycle Management: Handles `nydusd` spawning, supervision,
  and graceful shutdown.
(3) Native Overlay Support: Configures `nydusd` with `passthrough_fs`
  backend to provide native overlay (upperdir/workdir) support.
(4) API Integration: Utilizes `NydusClient` for granular control over RAFS
  mount/umount operations.
(5) QEMU Integration: Enables `virtio-fs-nydus` device support,
  facilitating standalone mode execution.

This implementation allows Kata containers to utilize an external `nydusd`
process for Nydus rootfs management, providing a cleaner separation between
the runtime and the Nydus daemon lifecycle.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-06-11 21:42:48 +02:00
Alex Lyn
edfe9ea403 runtime-rs: refine ShareFs abstraction with lifecycle and Nydus traits
Refactor the `ShareFs` trait to improve modularity and support
standalone Nydus mode:

(1) Added `stop()` method to manage daemon teardown.
(2) Introduced a dedicated trait for Nydus-specific data-plane
operations.

This refactoring cleans up the `ShareFs` trait by consolidating
daemon lifecycle handling and isolating Nydus-specific extensions,
paving the way for cleaner standalone Nydus implementation.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-06-11 21:42:48 +02:00
Alex Lyn
720a8688b4 runtime-rs: Add daemon manager for nydusd process lifecycle
Implement Nydusd to manage nydusd daemon process:
(1) start: spawn process, validate paths, wait for API ready,
    setup passthrough fs.
(2) stop: kill process, cleanup socket files.
(3) mount_rafs/mount_rafs_with_overlay: high-level filesystem
    mount operations.
(4) build_args: construct virtiofs mode command line arguments.

This provides process lifecycle management with internal NydusClient

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-06-11 21:42:48 +02:00
Alex Lyn
c1ebf269f7 runtime-rs: Add nydus client for nydusd API communication via HTTP
Implement NydusClient to interact with nydusd daemon via Unix
socket:
(1) check_status: query daemon state via GET /api/v1/daemon.
(2) mount/umount: manage filesystem mounts via POST/DELETE
  /api/v1/mount.
(3) wait_until_ready: poll daemon until RUNNING state.

This provides a lightweight, stateless HTTP client layer for nydusd
API.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-06-11 21:42:48 +02:00
Alex Lyn
4c63b8e3de agent: handle ENOSYS in overlayfs storage handler
In standalone nydusd mode with virtio-fs passthrough, the guest-side
mkdir may fail with ENOSYS. Update the overlayfs storage handler to
skip directory creation when the directory already exists, logging a
warning instead of failing.

This ensures container rootfs setup succeeds when nydusd's native
overlay manages the directory structure.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-06-11 21:25:18 +02:00
Alex Lyn
8eb564dfb8 kata-sys-util: handle ENOSYS gracefully in mount destination creation
When using virtio-fs with nydusd's passthrough_fs, mkdir operations may
return ENOSYS on certain filesystem configurations. This causes mount
destination creation to fail unexpectedly.

Handle ENOSYS errors gracefully alongside AlreadyExists by verifying the
directory exists after the failed mkdir attempt, allowing the mount to
proceed if the directory is already present.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-06-11 21:25:18 +02:00
Alex Lyn
b50f803a4e kata-types: add virtio-fs-nydus shared fs configuration support
Add "virtio-fs-nydus" as a recognized shared filesystem type in the
hypervisor configuration. This enables the standalone nydusd mode where
nydusd runs as a separate process alongside virtiofsd.

The key changes:
(1) Add VIRTIO_FS_NYDUS constant for the new shared fs type.
(2) Register virtio-fs-nydus in adjust() and validate() paths, reusing
  the same virtio-fs validation logic since both use vhost-user protocol

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-06-11 21:25:18 +02:00
stevenhorsman
fb4600d66a runtime-rs: Fix test breakage
In #13147, for some reason a test block was added in the middle of code
and the code was stale when merged, which meant that a second
`mod test` section was added, breaking our tests. Merge the two
to fix this.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-06-11 19:03:33 +02:00
Fabiano Fidêncio
21657b9cd9 Merge pull request #13147 from manuelh-dev/mahuber/debug-go-rust
runtime-rs: Honor enable_debug for logs and adjust debugging documentation
2026-06-11 08:57:36 +02:00
Hyounggyu Choi
7cc6767fa2 runtime*: use static_sandbox_resource_mgmt defaults for qemu-se
Switch qemu-se config templates to use the TEE/CoCo-specific
static_sandbox_resource_mgmt defaults instead of the generic
QEMU defaults.

qemu-se-runtime-rs config now uses DEFSTATICRESOURCEMGMT_COCO
while runtime qemu-se config now uses DEFSTATICRESOURCEMGMT_TEE.
This aligns static sandbox resource management behavior with confidential
container expectations for qemu-se variants.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2026-06-09 14:45:50 +02:00
Alex Lyn
6500e018c0 Merge pull request #13093 from RainaYL/rainax/tdx_boot_pr
dragonball: Add steps to boot TDX VM
2026-06-09 10:13:57 +08:00
Fabiano Fidêncio
4dc288401e runtime-rs: make sandbox cgroup runtime attach idempotent
The dragonball nerdctl CI job can race when creating and attaching the
runtime process to the sandbox cgroup, surfacing an os error 17
(AlreadyExists) during shim task creation.

Let's retry add_proc once on this pre-existing cgroup condition so
startup remains robust.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Codex <codex@openai.com>
2026-06-08 13:11:34 +02:00
Fabiano Fidêncio
4d569c22b4 runtime-rs: enforce a minimum vsock reconnect window
Low-CPU sandboxes can take longer than a few seconds to complete guest
boot and start the agent.

Let's clamp the reconnect timeout to a safe minimum so sandbox startup
does not fail early with transient vsock ECONNRESET.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Codex <codex@openai.com>
2026-06-08 13:11:34 +02:00
Fabiano Fidêncio
ed34d7811d runtime-rs: supplement static sizing from sandbox annotations
When static sandbox resource management is enabled, CRI CPU/memory
sizing may live only in sandbox annotations and be missing from the OCI
spec.

Let's fill missing sizing fields from annotations before applying static
VM sizing so runtime-rs follows the expected Kubernetes behavior for
constrained pods.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Codex <codex@openai.com>
2026-06-08 13:11:34 +02:00
Fabiano Fidêncio
e93558e810 runtime-rs: default static sizing-related config flags to true
Add top-level runtime-rs Makefile options `DEFSANDBOXCGROUP_ONLY` and
`DEFSTATICRESOURCEMGMT`, both defaulting to true, and use them for the
runtime defaults that previously disabled these paths.

This aligns runtime-rs defaults with static sandbox resource management,
which sizes sandbox memory up front instead of relying on memory hotplug,
helping avoid architecture-specific hotplug limitations.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-06-08 12:57:40 +02:00
Steve Horsman
2ac6bb173b Merge pull request #13036 from stevenhorsman/jaeger-to-otlp-tracing-switch
trace-forwarder: migrate from Jaeger to OTLP exporter
2026-06-05 14:30:26 +01:00
Steve Horsman
1624ebe362 Merge pull request #13135 from kata-containers/dependabot/cargo/tar-0.4.46
build(deps): bump tar from 0.4.45 to 0.4.46
2026-06-05 09:44:46 +01:00
stevenhorsman
b737ae48bf trace-forwarder: migrate from Jaeger to OTLP exporter
Migrate trace-forwarder from the deprecated opentelemetry-jaeger
exporter to the modern opentelemetry-otlp exporter.

This change remediates GHSA-2f9f-gq7v-9h6m (CVE-2026-43868), a
medium-severity vulnerability in Apache Thrift. The opentelemetry-jaeger
crate is no longer maintained and depends on vulnerable thrift versions
(0.13.0 and 0.16.0). The opentelemetry-otlp exporter does not use thrift
and is actively maintained.

Changes:
- Replace opentelemetry-jaeger with opentelemetry-otlp in Cargo.toml
- Update tracer.rs to use OTLP exporter instead of Jaeger exporter
- Replace --jaeger-host/--jaeger-port flags with --otlp-endpoint flag
- Update server.rs to use TracerProvider instead of SpanExporter
- Update documentation to reflect OTLP migration
- Add examples for common OTLP-compatible collectors

Breaking change: Users must update their trace-forwarder invocations
to use --otlp-endpoint instead of --jaeger-host and --jaeger-port.

Default endpoint: http://localhost:4317 (OTLP gRPC)

Generated-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Co-authored-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2026-06-04 19:39:47 +01:00
Dan Mihai
c78ccc2e9f Merge pull request #13088 from kata-containers/dependabot/cargo/openssl-0.10.80
build(deps): bump openssl from 0.10.79 to 0.10.80
2026-06-04 11:38:08 -07:00
Fabiano Fidêncio
743b0a4839 Merge pull request #13165 from stevenhorsman/bump-go-to-1.25.11
versions: bump golang to 1.25.11
2026-06-04 20:24:57 +02:00
Fabiano Fidêncio
80e2473440 runtime-rs: shut down shim daemon on a failed create
When CreateContainer fails before the runtime instance is registered
(e.g. a hypervisor/cgroup error), no sandbox exists to drive the normal
teardown. containerd's follow-up Shutdown RPC then reaches
get_runtime_instance(), fails with "runtime not ready", and returns
before the service loop is ever told to stop. Because the shim ignores
SIGTERM, the containerd-shim-kata-v2 daemon is left running and orphaned.

Make the Shutdown RPC force the daemon to exit when there is no runtime
instance, emitting the same Action::Shutdown that sandbox.shutdown()
sends on the normal path. This guarantees the shim process is reaped
after a failed create instead of leaking.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <noreply@cursor.com>
2026-06-04 14:12:01 +02:00
Fabiano Fidêncio
2a1ce7b8c4 Merge pull request #12539 from mythi/no-vcpu-hotplug
Disable CPU hotplug when confidential guest setting enabled
2026-06-04 10:56:52 +02:00
dependabot[bot]
4ab63d0a5d build(deps): bump tar from 0.4.45 to 0.4.46
Bumps [tar](https://github.com/composefs/tar-rs) from 0.4.45 to 0.4.46.
- [Release notes](https://github.com/composefs/tar-rs/releases)
- [Commits](https://github.com/composefs/tar-rs/compare/0.4.45...0.4.46)

---
updated-dependencies:
- dependency-name: tar
  dependency-version: 0.4.46
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-06-04 07:52:44 +00:00
dependabot[bot]
d155f1a4ab build(deps): bump openssl from 0.10.79 to 0.10.80
Bumps [openssl](https://github.com/rust-openssl/rust-openssl) from 0.10.79 to 0.10.80.
- [Release notes](https://github.com/rust-openssl/rust-openssl/releases)
- [Commits](https://github.com/rust-openssl/rust-openssl/compare/openssl-v0.10.79...openssl-v0.10.80)

---
updated-dependencies:
- dependency-name: openssl
  dependency-version: 0.10.80
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-06-04 07:51:50 +00:00
stevenhorsman
879912be25 versions: bump golang to 1.25.11
Bump the go version to resolve CVEs:
- GO-2026-5037
- GO-2026-5038
- GO-2026-5039

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
2026-06-04 08:49:17 +01:00
Mikko Ylinen
e475d870fb runtime: qemu: don't set maxcpus when confidential guest is enabled
QEMU maxcpus enables CPU hotplug capabilities but it's unused when
confidential guest is enabled.

Change Go runtime code to skip setting maxcpus QEMU cmdline if CPU hotplug
is not needed.

Commit 07db945b09 built a relationship between kernel's cmdline nr_cpus and
the maxcpus config. Now that maxcpus is dropped for confidential guests, drop
nr_cpus from kernel commandline too. This hopefully helps with the reference
values computation too.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2026-06-03 15:27:35 +03:00
Mikko Ylinen
2e625d0bab runtime-rs: qemu: don't set maxcpus when confidential guest is enabled
QEMU maxcpus enables CPU hotplug capabilities but it's unused when
confidential guest is enabled.

Change runtime-rs code to skip setting maxcpus QEMU cmdline if CPU hotplug
is not needed.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2026-06-03 15:27:35 +03:00
stevenhorsman
46d704a7ab log-parser: bump golang.org/x/sys dependency
Bump golang.org/x/sys from v0.1.0 to v0.44.0 to resolve CVE:
- GO-2026-5024

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
2026-06-03 09:56:54 +01:00
stevenhorsman
08ab789d9a csi-kata-directvolume: bump golang.org/x dependencies
Bump golang.org/x/net from v0.53.0 to v0.55.0 and golang.org/x/sys
from v0.43.0 to v0.44.0 to resolve CVEs:
- GO-2026-5024
- GO-2026-5025
- GO-2026-5026
- GO-2026-5027
- GO-2026-5028
- GO-2026-5029
- GO-2026-5030

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
2026-06-03 09:56:54 +01:00
stevenhorsman
c0f549860e runtime: bump golang.org/x dependencies
Bump golang.org/x/net from v0.53.0 to v0.55.0 and golang.org/x/sys
from v0.43.0 to v0.44.0 to resolve CVEs:
- GO-2026-5024
- GO-2026-5025
- GO-2026-5026
- GO-2026-5027
- GO-2026-5028
- GO-2026-5029
- GO-2026-5030

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
2026-06-03 09:56:54 +01:00
Fabiano Fidêncio
a2bb3f64b0 Merge pull request #12436 from mythi/tdx-updates-2026-3
runtime(-rs): tdx: use TDX QGS via unix-domain-socket by default
2026-06-03 08:50:26 +02:00
Fabiano Fidêncio
ecd9344dd1 Merge pull request #13144 from stevenhorsman/bump-rust-to-1.94
Bump rust to 1.94
2026-06-02 09:58:56 +02:00
Fabiano Fidêncio
230e01b04e Merge pull request #13126 from kata-containers/topic/runtimes-introduce-azure-specific-configs
runtime/runtime-rs: introduce Azure specific configs
2026-06-02 09:17:09 +02:00
Manuel Huber
57ee67a6aa runtime-rs: Honor enable_debug for logs
Make enable_debug promote the effective component log level from the
default info level to debug for runtime, agent, and hypervisor logs.
Keep an explicit log_level value authoritative so users can still choose
trace, warn, or another level.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
2026-06-01 21:29:08 +00:00
stevenhorsman
b1928cc22f runtime-rs: run cargo fmt for Rust 1.94
Run cargo fmt on runtime-rs to ensure consistent formatting
with Rust 1.94 toolchain.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-06-01 17:32:06 +01:00
stevenhorsman
f9c95a279e dragonball: Remove unnecessary unsafe blocks in cpuid
Rust 1.94 now warns about unnecessary unsafe blocks around
__get_cpuid_max(), __cpuid_count(), and host_cpuid() calls.
Remove the unsafe blocks as they are no longer needed.

This fixes the following clippy warnings in dbs-arch:
- warning: unnecessary `unsafe` block at brand_string.rs:106
- warning: unnecessary `unsafe` block at brand_string.rs:114
- warning: unnecessary `unsafe` block at common.rs:28
- warning: unnecessary `unsafe` block at common.rs:36

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
2026-06-01 17:07:16 +01:00
stevenhorsman
a63a948b4a libs: Remove unnecessary unsafe blocks in protection.rs
Rust 1.94 now warns about unnecessary unsafe blocks around
x86_64::__cpuid() calls. Remove the unsafe blocks as they are
no longer needed.

This fixes the following clippy warnings:
- warning: unnecessary `unsafe` block at line 129
- warning: unnecessary `unsafe` block at line 142

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
2026-06-01 17:04:43 +01:00
Fabiano Fidêncio
9b5b829265 runtime: oci: derive sandbox CPUs from shares only if unconstrained
The shares-based fallback added for cpuManagerPolicy=static fired whenever
the quota-based CPU count was 0, including for BestEffort sandboxes that
have no CPU request. Those sandboxes still carry the cgroup-floor shares
value (2), so the fallback derived ceil(2/1024)=1 and inflated every such
sandbox by one vCPU. For peer-pods (static resource management) this
changed the VM sizing to default_vcpus+1, regressing the libvirt
instance-type CI checks.

Gate the fallback on the quota being explicitly unconstrained (< 0), which
is the actual cpuManagerPolicy=static signal, instead of on numCPU == 0.
BestEffort sandboxes (quota 0/absent) now correctly contribute 0 vCPUs
while the static-policy case still recovers the CPU count from shares.

Add unit tests covering the static-policy, rounding, BestEffort, and
explicit-quota cases.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-06-01 09:50:49 +02:00
manuelh-dev
953b306ff3 Merge pull request #12979 from manuelh-dev/mahuber/erofs-tmpfs-mount
runtime-rs/agent: support EROFS snapshots without a rwlayer
2026-05-29 13:50:27 -07:00
Aurélien Bombo
9acef4bc55 Merge pull request #13133 from microsoft/cameronbaird/upstream/revert-macvtap-simple
Revert "runtime: Enforce >= 1 queue pairs for tapNetworkPair"
2026-05-29 14:57:07 -05:00
Cameron Baird
7a9d207ab2 Revert "runtime: Enforce >= 1 queue pairs for tapNetworkPair"
This reverts commit 2799f7d36b.
2026-05-29 17:05:40 +00:00
Fabiano Fidêncio
10e70a2a9f runtime-rs: expose InfiniBand devices to VFIO containers
The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a
PCIDEVICE_* environment variable; it does not add the VFIO char device
to linux.devices in the OCI spec.  As a result the agent's
container_has_vfio_device() gate stays closed and
expose_guest_infiniband_devices() is never triggered — leaving
/dev/infiniband absent from the container even though the guest kernel
created the IB devices (mlx5_core.rdma.0 probes successfully).

The cold_plug_bdfs map (host_bdf → guest_pci_path, built from network
endpoints via host_bdf()) was already present inside handler_devices()
but could never be consumed because the LinuxDeviceType::C loop has
no entries to iterate over when linux.devices is empty.

After that loop, iterate over any unmatched cold-plug BDFs, derive the
VFIO group path via bdf_to_vfio_group_path() (reads
/sys/bus/pci/devices/<bdf>/iommu_group), and push a vfio-pci-gk
ContainerDevice.  The vfio_group_to_bdf() short-circuit inside the
loop handles the case where the device plugin does add VFIO char
devices to linux.devices; it now supports both legacy (/dev/vfio/N)
and iommufd (/dev/vfio/devices/vfioN) path formats.

Add host_bdf() to the Endpoint trait (default: None) so that
PhysicalEndpoint can expose its BDF for the cold_plug_bdfs map.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
60f2878c68 runtime-rs: call network.remove() during resource cleanup
network.remove() — which detaches endpoints and rebinds VFs from
vfio-pci back to the host driver — was never being called.
ResourceManagerInner::cleanup() handled cgroups, bindmounts, share-fs,
swap and ephemeral disks, but completely omitted the network teardown.

Call network.remove() at the start of cleanup(), using the already-held
self.hypervisor reference.  Errors are logged as warnings rather than
propagated, so they don't block the rest of the cleanup sequence.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
0b4b51dff6 runtime-rs: always detach endpoints on network removal
network_with_netns::remove() bailed out early when network_created=false
(i.e. the netns was created by the CNI, not by kata). This caused
physical endpoint VFs to remain bound to vfio-pci after pod deletion,
because PhysicalEndpoint::detach() — which calls bind_device_to_host()
to rebind the VF from vfio-pci back to mlx5_core — was never reached.

Separate endpoint detachment from netns deletion: always detach
endpoints, but only remove the netns if kata created it.  Detach errors
are logged as warnings rather than propagated, to mirror the Go runtime's
best-effort approach and avoid blocking sandbox teardown.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
be2ec02c9a runtime-rs: resolve cold-plug VFIO guest PCI path via QMP
The PCIe topology pre-computes a wrong path for cold-plugged physical-
endpoint VFs because the root port has no explicit addr and QEMU auto-
assigns its slot. The pre-computed PciPath { slots: [PciSlot(0)] }
resolves to 0000:00:00.0 (the Q35 MCH), causing
wait_for_pci_net_interface to time out looking for a netdev there.

Add resolve_vfio_device_pci_path(hostdev_id) to the Hypervisor trait.
Implement it in QemuInner using qmp.get_device_by_qdev_id(), which
queries QEMU's query-pci to find the full guest PCIe path (e.g. "05/00"
= slot 5 on pcie.0 / slot 0 on the root port bus).

Store the QEMU device ID (hostdev_id) in PhysicalEndpoint during
attach(). Add vfio_hostdev_id() and set_guest_pci_path() to the
Endpoint trait and add an endpoints() accessor to the Network trait.

In setup_after_start_vm(), call resolve_physical_endpoint_pci_paths()
before apply_network_to_agent() to populate the correct path from QMP
into each PhysicalEndpoint's guest_pci_path field. The field is then
consumed by network_with_netns::interfaces() to fill Interface.device_path
before update_interface is sent to the agent.

This is the runtime-rs counterpart of the Go runtime's
ResolveColdPlugVFIOGuestPciPaths / qomGetPciPath.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
f8ee9133e5 runtime-rs: populate device_path for cold-plug VFIO physical endpoints
Without device_path the agent receives Interface.device_path="" in
update_interface, falls back to a by-MAC link lookup, and fails for
SR-IOV VFs whose firmware MAC differs from the CNI-assigned MAC after
the vfio-pci unbind/rebind cycle.

The guest PCI path is computed at attach() time by do_add_pcie_endpoint()
inside VfioDevice::register() — no QMP query is needed. Cache it in
PhysicalEndpoint.guest_pci_path (Mutex<Option<String>>) during attach()
when do_handle_device() returns the DeviceType::Vfio with the path
already filled in.

Add a default-None guest_pci_path() method to the Endpoint trait;
PhysicalEndpoint overrides it to return the cached path. In
network_with_netns.rs::interfaces(), after building each Interface from
network_info, fill device_path from endpoint.guest_pci_path() when the
field would otherwise be empty.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
67843220f8 runtime-rs: set VF admin MAC before vfio-pci rebind for IB/RoCE support
Without an admin MAC the guest mlx5_core inherits whatever firmware-
default MAC the VF was created with. This MAC differs from the IB port
HCA MAC, so mlx5_ib's GID cache refuses to populate
/sys/class/infiniband/mlx5_*/ports/N/gids/*. RoCE appears active but
every verb needing a GID fails.

Before bind_device_to_vfio(), push the CNI-assigned MAC down to the VF
as an "admin MAC" via the parent PF using RTM_SETLINK with
IFLA_VFINFO_LIST — the netlink equivalent of
  ip link set <PF> vf <N> mac <MAC>

The operation runs in a spawn_blocking closure that enters the host
network namespace (via NetnsGuard("/proc/1/ns/net")), since attach() is
called while the thread is inside the pod netns.

Best-effort: failures are logged at warn and the existing agent-side MAC
reconciliation (update_interface in rpc.rs) remains as a fallback for
L2/L3 connectivity.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
9e9b50c79e runtime-rs: cold-plug Vfio physical endpoints at VM launch
DeviceType::Vfio (used by physical network VFs) was silently dropped
in start_vm()'s cold-plug loop, falling through to the unsupported-
device info log. The VF never appeared on the QEMU command line and
therefore never became visible inside the guest.

Add handling for DeviceType::Vfio in the start_vm() cold-plug loop.
For each HostDevice in the VfioDevice, emit:

  -device vfio-pci,host=<bdf>,id=<hostdev_id>,bus=<root-port>, \
      [x-pci-vendor-id=...,x-pci-device-id=...]

The bus assignment and guest PCI path are already computed by
do_add_pcie_endpoint() at VfioDevice::register() time (called from
VfioDevice::attach() via the PCIe topology), so no additional QMP
resolution is needed here.

Add id= support to PCIeVfioDevice so the QEMU device name is stable
and matchable in QMP queries. Add new_without_iommufd() constructor
for the non-IOMMUFD (legacy VFIO container) path used by physical
endpoints, and add_physical_vfio_device() to QemuCmdLine as a
direct emission helper.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
91df041803 agent: expose guest InfiniBand devices to VFIO containers
When a VF is cold-plugged in guest-kernel mode, mlx5_core binds to the
PCI device inside the VM and mlx5_ib creates IB character devices under
/dev/infiniband/ (uverbs*, rdma_cm, umad*). The container cannot reach
these devices unless they are explicitly added to its OCI spec.

Add expose_guest_infiniband_devices(), called from create_devices() when
the container carries at least one VFIO device entry. The function:

  - Walks /dev/infiniband/ inside the guest VM.
  - Appends each char device to spec.linux.devices.
  - Inserts matching cgroup allow rules (rwm).
  - Is a no-op if /dev/infiniband/ is absent or empty (no IB driver,
    or VF not yet rebound), so non-RDMA pods are unaffected.

Gate the call on container_has_vfio_device() so unrelated containers
sharing the sandbox do not get IB device access widened.

Add is_vfio_device_type() and snapshot_infiniband() to
kata-sys-util/pcilibs. is_vfio_device_type() lets the agent check
device type strings against the VFIO driver name constants without
duplication. snapshot_infiniband() summarises /sys/class/infiniband,
/sys/class/infiniband_verbs, and /dev/infiniband as a single diagnostic
string for log context; it lives in pcilibs because it has no
agent-specific dependencies (pure sysfs/devfs reads).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
025202a52a runtime: expose InfiniBand devices to VFIO containers
The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a
PCIDEVICE_* environment variable; it does not add the VFIO char device
to linux.devices in the OCI spec.  As a result the agent's
container_has_vfio_device() gate stays closed and
expose_guest_infiniband_devices() is never triggered — leaving
/dev/infiniband absent from the container even though the guest kernel
created the IB devices (mlx5_core.rdma.0 probes successfully).

Add appendPhysicalEndpointDevices() which runs after appendDevices()
in createContainer().  It walks the sandbox network endpoints; for
each PhysicalEndpoint with a resolved guest PCI path it derives the
VFIO group char path from sysfs (iommu_group symlink) and synthesises
a vfio-pci-gk Device entry.  Both legacy group paths (/dev/vfio/N)
and iommufd cdev paths (/dev/vfio/devices/vfioN) are supported by
reading the iommu_group sysfs symlink.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00