kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-07-01 22:50:54 +00:00

Author	SHA1	Message	Date
Fabiano Fidêncio	4dc288401e	runtime-rs: make sandbox cgroup runtime attach idempotent The dragonball nerdctl CI job can race when creating and attaching the runtime process to the sandbox cgroup, surfacing an os error 17 (AlreadyExists) during shim task creation. Let's retry add_proc once on this pre-existing cgroup condition so startup remains robust. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Codex <codex@openai.com>	2026-06-08 13:11:34 +02:00
Fabiano Fidêncio	4d569c22b4	runtime-rs: enforce a minimum vsock reconnect window Low-CPU sandboxes can take longer than a few seconds to complete guest boot and start the agent. Let's clamp the reconnect timeout to a safe minimum so sandbox startup does not fail early with transient vsock ECONNRESET. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Codex <codex@openai.com>	2026-06-08 13:11:34 +02:00
Fabiano Fidêncio	ed34d7811d	runtime-rs: supplement static sizing from sandbox annotations When static sandbox resource management is enabled, CRI CPU/memory sizing may live only in sandbox annotations and be missing from the OCI spec. Let's fill missing sizing fields from annotations before applying static VM sizing so runtime-rs follows the expected Kubernetes behavior for constrained pods. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Codex <codex@openai.com>	2026-06-08 13:11:34 +02:00
Fabiano Fidêncio	e93558e810	runtime-rs: default static sizing-related config flags to true Add top-level runtime-rs Makefile options `DEFSANDBOXCGROUP_ONLY` and `DEFSTATICRESOURCEMGMT`, both defaulting to true, and use them for the runtime defaults that previously disabled these paths. This aligns runtime-rs defaults with static sandbox resource management, which sizes sandbox memory up front instead of relying on memory hotplug, helping avoid architecture-specific hotplug limitations. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-06-08 12:57:40 +02:00
Steve Horsman	2ac6bb173b	Merge pull request #13036 from stevenhorsman/jaeger-to-otlp-tracing-switch trace-forwarder: migrate from Jaeger to OTLP exporter	2026-06-05 14:30:26 +01:00
Steve Horsman	1624ebe362	Merge pull request #13135 from kata-containers/dependabot/cargo/tar-0.4.46 build(deps): bump tar from 0.4.45 to 0.4.46	2026-06-05 09:44:46 +01:00
stevenhorsman	b737ae48bf	trace-forwarder: migrate from Jaeger to OTLP exporter Migrate trace-forwarder from the deprecated opentelemetry-jaeger exporter to the modern opentelemetry-otlp exporter. This change remediates GHSA-2f9f-gq7v-9h6m (CVE-2026-43868), a medium-severity vulnerability in Apache Thrift. The opentelemetry-jaeger crate is no longer maintained and depends on vulnerable thrift versions (0.13.0 and 0.16.0). The opentelemetry-otlp exporter does not use thrift and is actively maintained. Changes: - Replace opentelemetry-jaeger with opentelemetry-otlp in Cargo.toml - Update tracer.rs to use OTLP exporter instead of Jaeger exporter - Replace --jaeger-host/--jaeger-port flags with --otlp-endpoint flag - Update server.rs to use TracerProvider instead of SpanExporter - Update documentation to reflect OTLP migration - Add examples for common OTLP-compatible collectors Breaking change: Users must update their trace-forwarder invocations to use --otlp-endpoint instead of --jaeger-host and --jaeger-port. Default endpoint: http://localhost:4317 (OTLP gRPC) Generated-by: IBM Bob Signed-off-by: stevenhorsman <steven@uk.ibm.com> Co-authored-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>	2026-06-04 19:39:47 +01:00
Dan Mihai	c78ccc2e9f	Merge pull request #13088 from kata-containers/dependabot/cargo/openssl-0.10.80 build(deps): bump openssl from 0.10.79 to 0.10.80	2026-06-04 11:38:08 -07:00
Fabiano Fidêncio	743b0a4839	Merge pull request #13165 from stevenhorsman/bump-go-to-1.25.11 versions: bump golang to 1.25.11	2026-06-04 20:24:57 +02:00
Fabiano Fidêncio	80e2473440	runtime-rs: shut down shim daemon on a failed create When CreateContainer fails before the runtime instance is registered (e.g. a hypervisor/cgroup error), no sandbox exists to drive the normal teardown. containerd's follow-up Shutdown RPC then reaches get_runtime_instance(), fails with "runtime not ready", and returns before the service loop is ever told to stop. Because the shim ignores SIGTERM, the containerd-shim-kata-v2 daemon is left running and orphaned. Make the Shutdown RPC force the daemon to exit when there is no runtime instance, emitting the same Action::Shutdown that sandbox.shutdown() sends on the normal path. This guarantees the shim process is reaped after a failed create instead of leaking. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <noreply@cursor.com>	2026-06-04 14:12:01 +02:00
Fabiano Fidêncio	2a1ce7b8c4	Merge pull request #12539 from mythi/no-vcpu-hotplug Disable CPU hotplug when confidential guest setting enabled	2026-06-04 10:56:52 +02:00
dependabot[bot]	4ab63d0a5d	build(deps): bump tar from 0.4.45 to 0.4.46 Bumps [tar](https://github.com/composefs/tar-rs) from 0.4.45 to 0.4.46. - [Release notes](https://github.com/composefs/tar-rs/releases) - [Commits](https://github.com/composefs/tar-rs/compare/0.4.45...0.4.46) --- updated-dependencies: - dependency-name: tar dependency-version: 0.4.46 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>	2026-06-04 07:52:44 +00:00
dependabot[bot]	d155f1a4ab	build(deps): bump openssl from 0.10.79 to 0.10.80 Bumps [openssl](https://github.com/rust-openssl/rust-openssl) from 0.10.79 to 0.10.80. - [Release notes](https://github.com/rust-openssl/rust-openssl/releases) - [Commits](https://github.com/rust-openssl/rust-openssl/compare/openssl-v0.10.79...openssl-v0.10.80) --- updated-dependencies: - dependency-name: openssl dependency-version: 0.10.80 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>	2026-06-04 07:51:50 +00:00
stevenhorsman	879912be25	versions: bump golang to 1.25.11 Bump the go version to resolve CVEs: - GO-2026-5037 - GO-2026-5038 - GO-2026-5039 Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob	2026-06-04 08:49:17 +01:00
Mikko Ylinen	e475d870fb	runtime: qemu: don't set maxcpus when confidential guest is enabled QEMU maxcpus enables CPU hotplug capabilities but it's unused when confidential guest is enabled. Change Go runtime code to skip setting maxcpus QEMU cmdline if CPU hotplug is not needed. Commit `07db945b09` built a relationship between kernel's cmdline nr_cpus and the maxcpus config. Now that maxcpus is dropped for confidential guests, drop nr_cpus from kernel commandline too. This hopefully helps with the reference values computation too. Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2026-06-03 15:27:35 +03:00
Mikko Ylinen	2e625d0bab	runtime-rs: qemu: don't set maxcpus when confidential guest is enabled QEMU maxcpus enables CPU hotplug capabilities but it's unused when confidential guest is enabled. Change runtime-rs code to skip setting maxcpus QEMU cmdline if CPU hotplug is not needed. Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2026-06-03 15:27:35 +03:00
stevenhorsman	46d704a7ab	log-parser: bump golang.org/x/sys dependency Bump golang.org/x/sys from v0.1.0 to v0.44.0 to resolve CVE: - GO-2026-5024 Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob	2026-06-03 09:56:54 +01:00
stevenhorsman	08ab789d9a	csi-kata-directvolume: bump golang.org/x dependencies Bump golang.org/x/net from v0.53.0 to v0.55.0 and golang.org/x/sys from v0.43.0 to v0.44.0 to resolve CVEs: - GO-2026-5024 - GO-2026-5025 - GO-2026-5026 - GO-2026-5027 - GO-2026-5028 - GO-2026-5029 - GO-2026-5030 Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob	2026-06-03 09:56:54 +01:00
stevenhorsman	c0f549860e	runtime: bump golang.org/x dependencies Bump golang.org/x/net from v0.53.0 to v0.55.0 and golang.org/x/sys from v0.43.0 to v0.44.0 to resolve CVEs: - GO-2026-5024 - GO-2026-5025 - GO-2026-5026 - GO-2026-5027 - GO-2026-5028 - GO-2026-5029 - GO-2026-5030 Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob	2026-06-03 09:56:54 +01:00
Fabiano Fidêncio	a2bb3f64b0	Merge pull request #12436 from mythi/tdx-updates-2026-3 runtime(-rs): tdx: use TDX QGS via unix-domain-socket by default	2026-06-03 08:50:26 +02:00
Fabiano Fidêncio	ecd9344dd1	Merge pull request #13144 from stevenhorsman/bump-rust-to-1.94 Bump rust to 1.94	2026-06-02 09:58:56 +02:00
Fabiano Fidêncio	230e01b04e	Merge pull request #13126 from kata-containers/topic/runtimes-introduce-azure-specific-configs runtime/runtime-rs: introduce Azure specific configs	2026-06-02 09:17:09 +02:00
stevenhorsman	b1928cc22f	runtime-rs: run cargo fmt for Rust 1.94 Run cargo fmt on runtime-rs to ensure consistent formatting with Rust 1.94 toolchain. Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-06-01 17:32:06 +01:00
stevenhorsman	f9c95a279e	dragonball: Remove unnecessary unsafe blocks in cpuid Rust 1.94 now warns about unnecessary unsafe blocks around __get_cpuid_max(), __cpuid_count(), and host_cpuid() calls. Remove the unsafe blocks as they are no longer needed. This fixes the following clippy warnings in dbs-arch: - warning: unnecessary `unsafe` block at brand_string.rs:106 - warning: unnecessary `unsafe` block at brand_string.rs:114 - warning: unnecessary `unsafe` block at common.rs:28 - warning: unnecessary `unsafe` block at common.rs:36 Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob	2026-06-01 17:07:16 +01:00
stevenhorsman	a63a948b4a	libs: Remove unnecessary unsafe blocks in protection.rs Rust 1.94 now warns about unnecessary unsafe blocks around x86_64::__cpuid() calls. Remove the unsafe blocks as they are no longer needed. This fixes the following clippy warnings: - warning: unnecessary `unsafe` block at line 129 - warning: unnecessary `unsafe` block at line 142 Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob	2026-06-01 17:04:43 +01:00
Fabiano Fidêncio	9b5b829265	runtime: oci: derive sandbox CPUs from shares only if unconstrained The shares-based fallback added for cpuManagerPolicy=static fired whenever the quota-based CPU count was 0, including for BestEffort sandboxes that have no CPU request. Those sandboxes still carry the cgroup-floor shares value (2), so the fallback derived ceil(2/1024)=1 and inflated every such sandbox by one vCPU. For peer-pods (static resource management) this changed the VM sizing to default_vcpus+1, regressing the libvirt instance-type CI checks. Gate the fallback on the quota being explicitly unconstrained (< 0), which is the actual cpuManagerPolicy=static signal, instead of on numCPU == 0. BestEffort sandboxes (quota 0/absent) now correctly contribute 0 vCPUs while the static-policy case still recovers the CPU count from shares. Add unit tests covering the static-policy, rounding, BestEffort, and explicit-quota cases. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-06-01 09:50:49 +02:00
manuelh-dev	953b306ff3	Merge pull request #12979 from manuelh-dev/mahuber/erofs-tmpfs-mount runtime-rs/agent: support EROFS snapshots without a rwlayer	2026-05-29 13:50:27 -07:00
Aurélien Bombo	9acef4bc55	Merge pull request #13133 from microsoft/cameronbaird/upstream/revert-macvtap-simple Revert "runtime: Enforce >= 1 queue pairs for tapNetworkPair"	2026-05-29 14:57:07 -05:00
Cameron Baird	7a9d207ab2	Revert "runtime: Enforce >= 1 queue pairs for tapNetworkPair" This reverts commit `2799f7d36b`.	2026-05-29 17:05:40 +00:00
Fabiano Fidêncio	10e70a2a9f	runtime-rs: expose InfiniBand devices to VFIO containers The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a PCIDEVICE_* environment variable; it does not add the VFIO char device to linux.devices in the OCI spec. As a result the agent's container_has_vfio_device() gate stays closed and expose_guest_infiniband_devices() is never triggered — leaving /dev/infiniband absent from the container even though the guest kernel created the IB devices (mlx5_core.rdma.0 probes successfully). The cold_plug_bdfs map (host_bdf → guest_pci_path, built from network endpoints via host_bdf()) was already present inside handler_devices() but could never be consumed because the LinuxDeviceType::C loop has no entries to iterate over when linux.devices is empty. After that loop, iterate over any unmatched cold-plug BDFs, derive the VFIO group path via bdf_to_vfio_group_path() (reads /sys/bus/pci/devices/<bdf>/iommu_group), and push a vfio-pci-gk ContainerDevice. The vfio_group_to_bdf() short-circuit inside the loop handles the case where the device plugin does add VFIO char devices to linux.devices; it now supports both legacy (/dev/vfio/N) and iommufd (/dev/vfio/devices/vfioN) path formats. Add host_bdf() to the Endpoint trait (default: None) so that PhysicalEndpoint can expose its BDF for the cold_plug_bdfs map. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	60f2878c68	runtime-rs: call network.remove() during resource cleanup network.remove() — which detaches endpoints and rebinds VFs from vfio-pci back to the host driver — was never being called. ResourceManagerInner::cleanup() handled cgroups, bindmounts, share-fs, swap and ephemeral disks, but completely omitted the network teardown. Call network.remove() at the start of cleanup(), using the already-held self.hypervisor reference. Errors are logged as warnings rather than propagated, so they don't block the rest of the cleanup sequence. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	0b4b51dff6	runtime-rs: always detach endpoints on network removal network_with_netns::remove() bailed out early when network_created=false (i.e. the netns was created by the CNI, not by kata). This caused physical endpoint VFs to remain bound to vfio-pci after pod deletion, because PhysicalEndpoint::detach() — which calls bind_device_to_host() to rebind the VF from vfio-pci back to mlx5_core — was never reached. Separate endpoint detachment from netns deletion: always detach endpoints, but only remove the netns if kata created it. Detach errors are logged as warnings rather than propagated, to mirror the Go runtime's best-effort approach and avoid blocking sandbox teardown. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	be2ec02c9a	runtime-rs: resolve cold-plug VFIO guest PCI path via QMP The PCIe topology pre-computes a wrong path for cold-plugged physical- endpoint VFs because the root port has no explicit addr and QEMU auto- assigns its slot. The pre-computed PciPath { slots: [PciSlot(0)] } resolves to 0000:00:00.0 (the Q35 MCH), causing wait_for_pci_net_interface to time out looking for a netdev there. Add resolve_vfio_device_pci_path(hostdev_id) to the Hypervisor trait. Implement it in QemuInner using qmp.get_device_by_qdev_id(), which queries QEMU's query-pci to find the full guest PCIe path (e.g. "05/00" = slot 5 on pcie.0 / slot 0 on the root port bus). Store the QEMU device ID (hostdev_id) in PhysicalEndpoint during attach(). Add vfio_hostdev_id() and set_guest_pci_path() to the Endpoint trait and add an endpoints() accessor to the Network trait. In setup_after_start_vm(), call resolve_physical_endpoint_pci_paths() before apply_network_to_agent() to populate the correct path from QMP into each PhysicalEndpoint's guest_pci_path field. The field is then consumed by network_with_netns::interfaces() to fill Interface.device_path before update_interface is sent to the agent. This is the runtime-rs counterpart of the Go runtime's ResolveColdPlugVFIOGuestPciPaths / qomGetPciPath. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	f8ee9133e5	runtime-rs: populate device_path for cold-plug VFIO physical endpoints Without device_path the agent receives Interface.device_path="" in update_interface, falls back to a by-MAC link lookup, and fails for SR-IOV VFs whose firmware MAC differs from the CNI-assigned MAC after the vfio-pci unbind/rebind cycle. The guest PCI path is computed at attach() time by do_add_pcie_endpoint() inside VfioDevice::register() — no QMP query is needed. Cache it in PhysicalEndpoint.guest_pci_path (Mutex<Option<String>>) during attach() when do_handle_device() returns the DeviceType::Vfio with the path already filled in. Add a default-None guest_pci_path() method to the Endpoint trait; PhysicalEndpoint overrides it to return the cached path. In network_with_netns.rs::interfaces(), after building each Interface from network_info, fill device_path from endpoint.guest_pci_path() when the field would otherwise be empty. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	67843220f8	runtime-rs: set VF admin MAC before vfio-pci rebind for IB/RoCE support Without an admin MAC the guest mlx5_core inherits whatever firmware- default MAC the VF was created with. This MAC differs from the IB port HCA MAC, so mlx5_ib's GID cache refuses to populate /sys/class/infiniband/mlx5_/ports/N/gids/. RoCE appears active but every verb needing a GID fails. Before bind_device_to_vfio(), push the CNI-assigned MAC down to the VF as an "admin MAC" via the parent PF using RTM_SETLINK with IFLA_VFINFO_LIST — the netlink equivalent of ip link set <PF> vf <N> mac <MAC> The operation runs in a spawn_blocking closure that enters the host network namespace (via NetnsGuard("/proc/1/ns/net")), since attach() is called while the thread is inside the pod netns. Best-effort: failures are logged at warn and the existing agent-side MAC reconciliation (update_interface in rpc.rs) remains as a fallback for L2/L3 connectivity. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	9e9b50c79e	runtime-rs: cold-plug Vfio physical endpoints at VM launch DeviceType::Vfio (used by physical network VFs) was silently dropped in start_vm()'s cold-plug loop, falling through to the unsupported- device info log. The VF never appeared on the QEMU command line and therefore never became visible inside the guest. Add handling for DeviceType::Vfio in the start_vm() cold-plug loop. For each HostDevice in the VfioDevice, emit: -device vfio-pci,host=<bdf>,id=<hostdev_id>,bus=<root-port>, \ [x-pci-vendor-id=...,x-pci-device-id=...] The bus assignment and guest PCI path are already computed by do_add_pcie_endpoint() at VfioDevice::register() time (called from VfioDevice::attach() via the PCIe topology), so no additional QMP resolution is needed here. Add id= support to PCIeVfioDevice so the QEMU device name is stable and matchable in QMP queries. Add new_without_iommufd() constructor for the non-IOMMUFD (legacy VFIO container) path used by physical endpoints, and add_physical_vfio_device() to QemuCmdLine as a direct emission helper. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	91df041803	agent: expose guest InfiniBand devices to VFIO containers When a VF is cold-plugged in guest-kernel mode, mlx5_core binds to the PCI device inside the VM and mlx5_ib creates IB character devices under /dev/infiniband/ (uverbs, rdma_cm, umad). The container cannot reach these devices unless they are explicitly added to its OCI spec. Add expose_guest_infiniband_devices(), called from create_devices() when the container carries at least one VFIO device entry. The function: - Walks /dev/infiniband/ inside the guest VM. - Appends each char device to spec.linux.devices. - Inserts matching cgroup allow rules (rwm). - Is a no-op if /dev/infiniband/ is absent or empty (no IB driver, or VF not yet rebound), so non-RDMA pods are unaffected. Gate the call on container_has_vfio_device() so unrelated containers sharing the sandbox do not get IB device access widened. Add is_vfio_device_type() and snapshot_infiniband() to kata-sys-util/pcilibs. is_vfio_device_type() lets the agent check device type strings against the VFIO driver name constants without duplication. snapshot_infiniband() summarises /sys/class/infiniband, /sys/class/infiniband_verbs, and /dev/infiniband as a single diagnostic string for log context; it lives in pcilibs because it has no agent-specific dependencies (pure sysfs/devfs reads). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	025202a52a	runtime: expose InfiniBand devices to VFIO containers The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a PCIDEVICE_* environment variable; it does not add the VFIO char device to linux.devices in the OCI spec. As a result the agent's container_has_vfio_device() gate stays closed and expose_guest_infiniband_devices() is never triggered — leaving /dev/infiniband absent from the container even though the guest kernel created the IB devices (mlx5_core.rdma.0 probes successfully). Add appendPhysicalEndpointDevices() which runs after appendDevices() in createContainer(). It walks the sandbox network endpoints; for each PhysicalEndpoint with a resolved guest PCI path it derives the VFIO group char path from sysfs (iommu_group symlink) and synthesises a vfio-pci-gk Device entry. Both legacy group paths (/dev/vfio/N) and iommufd cdev paths (/dev/vfio/devices/vfioN) are supported by reading the iommu_group sysfs symlink. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	f36c383b4f	runtime: generate dedicated CLH Azure config variants Create configuration-clh-azure{,-runtime-rs}.toml from the base CLH configs during build. This keeps Mariner-specific defaults in explicit config artifacts instead of ad-hoc runtime mutation. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-28 23:32:37 +02:00
Fabiano Fidêncio	fa9a9f3aeb	runtime: set VF admin MAC before vfio-pci rebind for IB/RoCE support Without an admin MAC, the guest's mlx5_core inherits the VF's firmware-default MAC. This MAC differs from the IB port's HCA MAC, so mlx5_ib's GID cache refuses to populate /sys/class/infiniband/mlx5_/ports/N/gids/. RoCE then appears active (port = ACTIVE, link_layer = Ethernet) but every verb that needs a GID — RoCEv2 packets, address handles, librdmacm bind — fails silently. Push the CNI-assigned MAC down to the VF as an "admin MAC" via the PF using RTM_SETLINK before the bind-to-vfio-pci step. The firmware applies the admin MAC during the VF reset that accompanies the unbind/rebind cycle, so the guest sees a single consistent MAC across netdev, IB port, and HCA. Best-effort: failures are logged at warn and the existing agent-side MAC reconciliation (rpc.rs::update_interface) remains as a fallback for L2/L3 connectivity. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-28 21:54:52 +02:00
Fabiano Fidêncio	992a723392	runtime: resolve cold-plug VFIO guest PCI path via QMP For QEMU cold-plug + guest-kernel mode the guest BDF of a cold-plugged VFIO device is auto-allocated at boot (each pcie-root-port is added with chassis=N,slot=N but no pinned addr=, so QEMU picks the next free slot on pcie.0). The hot-plug path already queries QMP via qomGetPciPath; reuse that same mechanism for cold-plugged devices. Add ResolveColdPlugVFIOGuestPciPaths to the Hypervisor interface. Implement it in qemu.go using qomGetPciPath. Add no-op stubs for all other hypervisors. Call it at the start of setupNetworks so that the PCI paths are resolved before generateVCNetworkStructures emits the agent Interface proto. Also stamp the resolved path onto PhysicalEndpoints (used by SR-IOV VFs exposed as physical network devices) so that update_interface carries a non-empty devicePath. Without devicePath the agent falls back to a by-MAC link lookup which fails when the VF firmware MAC differs from the CNI-assigned MAC after the vfio-pci unbind/rebind cycle. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-28 21:54:52 +02:00
Fabiano Fidêncio	23c5250933	runtime/qemu: emit id= for VFIODevice on -device cmdline Without an explicit id= on the vfio-pci device, QEMU auto-generates an internal name that does not match vfioDev.ID, so any subsequent qomGetPciPath(vfioDev.ID) call via QMP fails with "Device 'X' not found". This breaks resolveColdPlugVFIOGuestPciPaths which needs the device ID to look up the guest PCI path, leaving GuestPciPath nil and causing update_interface to fail repeatedly as the agent can't find the interface to configure. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-28 21:54:52 +02:00
Fabiano Fidêncio	e6777f0866	runtime: keep cold-plug VFIO devices in guest-kernel mode Container.createDevices was dropping cold-plug VFIO entries from the container's deviceInfos whenever vfio_mode = "guest-kernel", which in turn meant the agent's CreateContainer request carried no vfio-pci-gk device entry and sandbox.pcimap[cid] stayed empty. The SR-IOV device plugin still set PCIDEVICE_<RES>=<host-BDF> on the workload container, so update_env_pci then aborted with "No PCI mapping found for container <id>" and the container failed with CrashLoopBackOff. Include cold-plug VFIO devices in deviceInfos for both VFIO modes. The existing vfio-pci-gk agent handler returns dev: None (so /dev/vfio/<group> is not materialised in the container spec, and constrainGRPCSpec(stripVfio=true) already strips it from the grpc spec for guest-kernel mode), while still recording the host->guest PCI mapping into sandbox.pcimap[cid] so env-var translation works. devManager.NewDevice calls FindDevice first, which matches the already cold-plugged sandbox-level device by HostPath / major / minor, so this does not double-attach. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-28 21:54:52 +02:00
Fabiano Fidêncio	9893b6dc03	runtime: correctly resolve cold-plug VFIO guest PCI paths Populate missing VFIO guest PCI paths via QMP before serializing container devices so guest-kernel PCI env translation has the mappings it needs. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-28 21:54:52 +02:00
Fabiano Fidêncio	118b7fa611	agent: reconcile VFIO netdev MAC before UpdateInterface lookup When a VFIO cold-plugged network device appears in guest with a different MAC than the runtime request, resolve the netdev by PCI path and apply the requested MAC before the normal by-MAC update flow. This preserves existing behavior while avoiding UpdateInterface mismatches in SR-IOV cold-plug cases. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-28 21:54:52 +02:00
Fabiano Fidêncio	e89eb77245	agent: keep PCIDEVICE env unchanged when pcimap is missing Avoid failing container creation when per-container PCI mappings are unavailable by preserving PCIDEVICE entries unchanged and warning instead. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-28 21:54:52 +02:00
Cameron Baird	2799f7d36b	runtime: Enforce >= 1 queue pairs for tapNetworkPair In the xConnectVMNetwork path, we have queues = 0 as a baseline, set to h.HypervisorConfig().NumVCPUs() iff h.Capabilities() advertise MultiQueueSupport. This is certainly incorrect as we always want, as a baseline, at least one queue pair. Make queues := 1 by default to ensure the NetworkPair has at least one queue pair for all virtio-net paths. Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>	2026-05-27 18:55:11 +00:00
Manuel Huber	ebf2c99df3	runtime-rs: allow EROFS rootfs without rwlayer Treat the containerd erofs snapshotter active snapshot as an EROFS lower plus overlay metadata, with an optional ext4 rwlayer when host rw backing is enabled. This also covers default_size=0, where containerd sends no rwlayer and the agent provides the writable upper inside the guest. Forward overlay mkdir hints on the EROFS storage so the guest agent sees them in both layouts, and add unit coverage for the dispatcher patterns. Assisted-by: OpenAI Codex <codex@openai.com> Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-05-27 17:12:20 +00:00
Manuel Huber	4fbfba2f79	agent: support run-backed EROFS upper Support multi-layer EROFS storage without an explicit ext4 upper layer. When runtime-rs sends only EROFS lower storage and overlay metadata, create the overlay upper/work directories under the container bundle in /run/kata-containers. Keep the explicit ext4 rwlayer path for disk-backed snapshots, and only track real temporary mount points for cleanup. The implicit /run-backed upper is bundle-scoped state and is removed with the container bundle. Assisted-by: OpenAI Codex <codex@openai.com> Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-05-27 17:12:20 +00:00
Fabiano Fidêncio	5adfb27297	Merge pull request #13118 from PiotrProkop/fix-missing-cwd agent: restore process CWD auto-creation	2026-05-27 13:32:05 +02:00
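
Note on commit 9b5b829265 (runtime: oci: derive sandbox CPUs from shares only if unconstrained): the sizing rule it describes can be sketched as below. This is an illustration only, written in Rust for brevity even though the change itself is in the Go runtime. The function and constant names are hypothetical, and the sketch assumes that the quota-based count is ceil(quota / period) and that 1024 shares correspond to one CPU; only the ceil(2/1024)=1 arithmetic and the "quota < 0 means unconstrained" gate come from the commit message.

```rust
// Illustrative sketch only; not the Kata Containers source.
// Assumptions: all names are hypothetical, the quota-based count is
// ceil(quota / period), and 1024 shares correspond to one CPU.

const SHARES_PER_CPU: u64 = 1024;

/// Ceiling division for unsigned integers; returns 0 when the divisor is 0.
fn ceil_div(n: u64, d: u64) -> u64 {
    if d == 0 {
        0
    } else {
        (n + d - 1) / d
    }
}

/// vCPUs that one container contributes to the sandbox size.
/// `quota`: None = absent, Some(q) with q < 0 = explicitly unconstrained.
fn container_vcpus(quota: Option<i64>, period: u64, shares: u64) -> u64 {
    match quota {
        // Explicit quota: derive the count from quota and period.
        Some(q) if q > 0 => ceil_div(q as u64, period),
        // Explicitly unconstrained quota (the static-policy signal):
        // recover the count from shares.
        Some(q) if q < 0 => ceil_div(shares, SHARES_PER_CPU),
        // Quota 0 or absent (BestEffort): contribute nothing, even though
        // the cgroup-floor shares value (2) is still present.
        _ => 0,
    }
}

fn main() {
    let cases = [
        ("static policy", Some(-1), 100_000, 2048),
        ("rounding", Some(-1), 100_000, 1536),
        ("BestEffort", None, 0, 2),
        ("explicit quota", Some(150_000), 100_000, 1024),
    ];
    for (name, quota, period, shares) in cases {
        println!("{name}: {} vCPU(s)", container_vcpus(quota, period, shares));
    }
}
```

Gating on the sign of the quota rather than on a zero CPU count is what keeps a BestEffort sandbox from being inflated by one vCPU: with the old condition, the floor shares value alone would round up to 1.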

6524 Commits