kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-07-01 22:50:54 +00:00

Author	SHA1	Message	Date
Markus Rudy	4d0f32ce41	runtime-rs: use proper temp dirs in initdata tests The test currently uses a static directory at `/tmp/initimg_test`. This introduces non-determinism into the unit test: * Files that already exist in that dir might alter test results. * If the directory is owned by root, the test will fail due to permissions. Switch to using the tempfile crate instead. Fixes: #13053 Signed-off-by: Markus Rudy <mr@edgeless.systems>	2026-05-16 20:39:13 +02:00
Fabiano Fidêncio	33de5a6c22	runtime-rs: refactor handler_volumes to use VolumeContext Group the shared-context parameters (share_fs, device_manager, sid, agent, emptydir_mode) into a VolumeContext struct so handler_volumes stays within clippy's argument count limit and avoids -D warnings breakage in CI. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-14 22:56:11 +02:00
Fabiano Fidêncio	aa7392b1b9	runtime-rs: add emptydir_mode to config templates Add the emptydir_mode configuration option to all runtime-rs config template files. CoCo configs (snp, tdx, se, coco-dev, nvidia-gpu-snp, nvidia-gpu-tdx) default to block-encrypted via @DEFEMPTYDIRMODE_COCO@, while non-CoCo configs (qemu, nvidia-gpu, fc) default to shared-fs via @DEFEMPTYDIRMODE@. Also add DEFEMPTYDIRMODE and DEFEMPTYDIRMODE_COCO variables to the runtime-rs Makefile for template substitution. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-14 22:56:11 +02:00
Fabiano Fidêncio	5e2ca6d6ee	runtime-rs: skip local type conversion for block-encrypted emptyDirs When emptydir_mode is "block-encrypted", host emptyDir paths must remain as "bind" mounts so the EncryptedEmptyDirVolume handler can intercept them in the volume dispatch chain. Previously, update_ephemeral_storage_type() would unconditionally convert them to "local" type, causing them to be handled as plain local volumes instead. Add the emptydir_mode parameter to update_ephemeral_storage_type() and its call chain (amend_spec in container.rs) and skip the host-emptyDir-to-local conversion when the mode is block-encrypted. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-14 22:56:11 +02:00
Fabiano Fidêncio	d3a9669be5	runtime-rs: implement EncryptedEmptyDirVolume Add the core volume handler for block-encrypted emptyDir support in runtime-rs, bringing it to parity with the Go runtime (PR #10559). When emptydir_mode is set to "block-encrypted", host emptyDir bind mounts are intercepted and handled as follows: 1. A sparse disk image (disk.img) is created inside the emptyDir folder, sized to match the host filesystem capacity. 2. A mountInfo.json is written under the kata direct-volume root with volume_type "blk", fs_type "ext4", and metadata encryptionKey=ephemeral. 3. The disk image is plugged into the guest VM as a virtio-blk device via the hypervisor device manager. 4. An agent::Storage is built with driver_options containing encryption_key=ephemeral and shared=true, so the kata-agent delegates formatting and encryption to CDH using LUKS2. The volume is registered in the dispatch chain before the regular block-volume check, and ephemeral disk metadata is tracked for sandbox-level cleanup at teardown. Also re-exports EMPTYDIR_MODE_* constants from kata-types::config so downstream crates can reference them. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-14 22:56:11 +02:00
Fabiano Fidêncio	0b1e103886	runtime-rs: agent: add shared field to Storage struct The proto Storage message already has a "shared" field (field 8), but the runtime-rs agent crate's internal Storage struct was missing it, so it was never forwarded to the kata-agent. Add the field to the Rust struct and its From<Storage> translation, and update all explicit struct initialisers across the resource crate to include shared: false so the build stays clean. This is needed for trusted ephemeral data storage, where the agent uses the shared flag to avoid premature cleanup of volumes that are shared across containers in a pod. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-14 15:42:20 +02:00
Fabiano Fidêncio	44b356c654	Merge pull request #13033 from microsoft/saul/static_maxvcpus runtime-rs: static resources: always set maxvcpus equal to vcpus	2026-05-14 11:16:35 +02:00
Saul Paredes	d930fc42b8	runtime-rs: static resources: always set maxvcpus equal to vcpus based on current runtime-go behaviour introduced in https://github.com/kata-containers/kata-containers/pull/9195 When using static resources, always set maxvcpus value equal to the vcpus value. This is because the static resources case does not support dynamic CPU hotplugging, and therefore the maximum number of vCPUs should be limited to the number of vCPUs. Booting with a high number of max vCPUs is a bit slower compared to a lower number. Signed-off-by: Saul Paredes <saulparedes@microsoft.com>	2026-05-13 13:21:56 -07:00
Aurélien Bombo	555b7738fe	runtime-rs: align virtiofsd args on runtime-go Runtime-go doesn't hardcode --sandbox none --seccomp none [1], so mirror that in runtime-rs. [1]: `733ccb3254/src/runtime/virtcontainers/virtiofsd.go (L183)` Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-05-12 12:51:32 -05:00
Fabiano Fidêncio	6b802a4e30	nvidia: switch GPU rootfs images to erofs Switch the NVIDIA GPU rootfs images (both standard and confidential) from ext4 to erofs (Enhanced Read-Only File System). Unlike ext4, which is a read-write filesystem mounted read-only by convention, erofs is structurally read-only -- no journal, no write metadata, no superblock write path. This eliminates accidental mutation and reduces the attack surface inside the guest VM, which is particularly important for confidential workloads using dm-verity. Introduce a DEFROOTFSTYPE_NV Makefile variable (set to erofs) for both Go and Rust runtimes, keeping the global DEFROOTFSTYPE as ext4 so non-NVIDIA configurations are unaffected. Update all six NVIDIA GPU configuration templates (base, SNP, TDX for both runtimes) to use @DEFROOTFSTYPE_NV@ instead of the global @DEFROOTFSTYPE@. Export FS_TYPE=erofs in install_image_nvidia_gpu() and install_image_nvidia_gpu_confidential() so the build pipeline produces erofs images via the image builder. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-10 17:18:05 +02:00
Fabiano Fidêncio	905303b6b0	Merge pull request #13013 from BbolroC/filter-vfio-gk-only-runtime-rs runtime-rs: filter VFIO devices only in guest-kernel mode	2026-05-08 23:49:50 +02:00
Hyounggyu Choi	754707fe83	runtime-rs: filter VFIO devices only in guest-kernel mode After #12857, the VFIO-AP hotplug test fails because runtime-rs unconditionally removes all /dev/vfio/* devices from the OCI spec before sending it to the kata agent. The agent then rejects the container creation with: ``` Missing devices in OCI spec ``` Filter devices from the OCI spec conditionally based on the vfio_mode configuration (e.g. guest-kernel). Also factor the filtering logic out into a separate function and add unit tests. Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>	2026-05-08 15:39:16 +02:00
Fabiano Fidêncio	8e65e89ade	Merge pull request #13011 from kata-containers/fix-warnings runtime-rs: Fix warnings in rust runtime	2026-05-08 15:12:53 +02:00
Fabiano Fidêncio	a541827a7e	Merge pull request #12984 from fidencio/topic/network-pair-use-name-for-lookup runtime-rs: network: use provided name for virt interface lookup	2026-05-08 14:31:58 +02:00
Alex Lyn	1441b2b84a	runtime-rs: Fix warnings in rust runtime A lot of unformatted Rust code in the Rust runtime, its libraries, and the agent sources leaves uncommitted changes behind, which is easy to reproduce by running `cargo fmt --all`. Let's reduce that noise. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-08 14:56:00 +08:00
Fabiano Fidêncio	8a33007806	runtime-rs: Add configuration-qemu-nvidia-gpu-tdx-runtime-rs.toml.in Add a new runtime-rs configuration template that combines the NVIDIA GPU cold-plug stack with Intel TDX confidential guest support. This is the runtime-rs counterpart of the Go runtime's configuration-qemu-nvidia-gpu-tdx template. The template merges the GPU NV settings (VFIO cold-plug, Pod Resources API, NV-specific kernel/image/firmware, extended timeouts) with TDX confidential guest settings (confidential_guest, OVMF.inteltdx.fd firmware, TDX Quote Generation Service socket, confidential NV kernel and image). The Makefile is updated with the new config file registration and the FIRMWARETDVFPATH_NV variable pointing to OVMF.inteltdx.fd. Also removes a stray tdx_quote_generation_service_socket_port setting from the SNP GPU template where it did not belong. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-07 10:33:26 +02:00
Fabiano Fidêncio	e98a864285	runtime-rs: Add configuration-qemu-nvidia-gpu-snp-runtime-rs.toml.in Add a new runtime-rs configuration template that combines the NVIDIA GPU cold-plug stack with AMD SEV-SNP confidential guest support. This is the runtime-rs counterpart of the Go runtime's configuration-qemu-nvidia-gpu-snp template. The template merges the GPU NV settings (VFIO cold-plug, Pod Resources API, NV-specific kernel/image/firmware, extended timeouts) with the SNP confidential guest settings (confidential_guest, sev_snp_guest, SNP ID block/auth, guest policy, AMDSEV.fd firmware, confidential NV kernel and image). The Makefile is updated with the new config file registration, the CONFIDENTIAL_NV image/kernel variables, and FIRMWARESNPPATH_NV pointing to AMDSEV.fd. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-07 10:33:26 +02:00
Fabiano Fidêncio	1ada256581	runtime-rs: Add configuration-qemu-nvidia-gpu-runtime-rs.toml.in Add a QEMU configuration template for the NVIDIA GPU runtime-rs shim, mirroring the Go runtime's configuration-qemu-nvidia-gpu.toml.in. The template uses _NV-suffixed Makefile variables for kernel, image, and verity params so the GPU-specific rootfs and kernel are selected at build time. Wire the new config into the runtime-rs Makefile: define FIRMWAREPATH_NV with arch-specific OVMF/AAVMF paths (matching the Go runtime's PR #12780), add EDK2_NAME for x86_64, and register the config in CONFIGS/CONFIG_PATHS/SYSCONFIG_PATHS so it gets installed alongside the other runtime-rs configurations. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-07 10:33:26 +02:00
Fabiano Fidêncio	cb6fb51920	runtime-rs: Do not pass through audio device from IOMMU group NVIDIA GPUs often have an HDA audio controller (PCI class 0x0403) in the same IOMMU group. This device should not be passed through to the guest, just like Host and PCI bridges. Change filter_bridge_device() to accept a slice of PCI class bitmasks and add 0x0403 (audio) to the ignore list alongside 0x0600 (host/PCI bridge). This matches the Go runtime fix from NVIDIA/kata-containers#26. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-07 10:33:26 +02:00
Alex Lyn	7e2dff8179	runtime-rs: Wire BlockDeviceModern into rawblock volume and container Use BlockCfgModern for rawblock volumes when the hypervisor supports it, passing logical and physical sector sizes from the volume metadata. In the container manager, clear Linux.Resources fields (Pids, BlockIO, Network) that genpolicy expects to be null, and filter VFIO character devices from Linux.Devices to avoid policy rejection. Update Dragonball's inner_device to handle the DeviceType::VfioModern variant in its no-op match arm. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com> Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-07 10:33:26 +02:00
Alex Lyn	eecb1a246c	runtime-rs: Add resource manager VFIO modern handling and CDI wiring Extend the resource manager to handle VfioModern and BlockModern device types when building the agent's device list and storage list. For VFIO modern devices, the manager resolves the container path and sets the agent Device.id to match what genpolicy expects. Rework CDI device annotation handling in container_device.rs: - Strip the "vfio" prefix from device names when building CDI annotation keys (cdi.k8s.io/vfio0, cdi.k8s.io/vfio1, etc.) - Remove the per-device index suffix that caused policy mismatches - Add iommufd cdev path support alongside legacy VFIO group paths Update the vfio driver to detect iommufd cdev vs legacy group from the CDI device node path. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com> Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-07 10:33:26 +02:00
Alex Lyn	4f618d09d5	runtime-rs: Add Pod Resources CDI discovery in sandbox Query the kubelet Pod Resources API during sandbox setup to discover which GPU devices have been allocated to the pod. When cold_plug_vfio is enabled, the sandbox resolves CDI device specs, extracts host PCI addresses and IOMMU groups from sysfs, and creates VfioModernCfg device entries that get passed to the hypervisor for cold-plug. Add pod-resources and cdi crate dependencies to the runtimes and virt_container workspace members. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com> Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-07 10:33:26 +02:00
Alex Lyn	21a47cfe8d	runtime-rs: Wire VFIO cold-plug into QEMU inner Implement add_device() and remove_device() support for DeviceType::VfioModern and DeviceType::BlockModern in the QEMU inner hypervisor layer. For cold-plug (before VM boot): VfioDeviceConfig/VfioDeviceGroup structs are constructed from the device's resolved PCI address, IOMMU group, and bus assignment, then appended to the QEMU command line via cmdline_generator. Block devices use VirtioBlkDevice with the modern config's sector size fields and are always cold-plugged onto the command line. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com> Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-07 10:33:26 +02:00
Alex Lyn	0f9ab37abe	runtime-rs: Bump QMP timeouts for VFIO cold-plug Bump QMP connection timeout from 10s to 30s and initial read timeout from 250ms to 5s to accommodate the longer initialization time when VFIO devices are cold-plugged (IOMMU domain setup and device reset can be slow for GPUs). Re-export cmdline_generator types from qemu/mod.rs for downstream use. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com> Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-07 10:33:26 +02:00
Alex Lyn	a975a998a6	runtime-rs: Add QEMU VFIO command-line parameter structs Add QEMU command-line parameter types for VFIO device cold-plug: - ObjectIommufd: /dev/iommu object for iommufd-backed passthrough - PCIeVfioDevice: vfio-pci device on a PCIe root port or switch port, supporting both legacy VFIO group and iommufd cdev backends - FWCfgDevice: firmware config device for fw_cfg blob injection - VfioDeviceBase/VfioDeviceConfig/VfioDeviceGroup: high-level wrappers that compose the above into complete QEMU argument sets, resolving IOMMU groups, device nodes, and per-device fw_cfg entries Refactor existing cmdline structs (BalloonDevice, VirtioNetDevice, VirtioBlkDevice, etc.) to use a shared devices_to_params() helper and align the ToQemuParams implementations. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com> Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-07 10:33:26 +02:00
Alex Lyn	074e9e9423	runtime-rs: Add PCIe topology cold-plug port management Extend PCIeTopology to support cold-plug port reservation and release for VFIO devices. New fields track the topology mode (NoPort, RootPort, SwitchPort), whether cold-plug dynamic expansion is enabled, and a map of reserved bus assignments per device. PCIeTopology::new() now infers the mode from the configured root-port and switch-port counts, pre-seeds the port structures, and makes add_root_ports_on_bus() idempotent so that PortDevice::attach can safely call it again after the topology has already been initialized. New methods: - reserve_bus_for_device: allocate a free root port or switch downstream port for a device, expanding the port map when cold_plug is enabled - release_bus_for_device: free the previously reserved port - find_free_root_port / find_free_switch_down_port: internal helpers - release_root_port / release_switch_down_port: internal helpers Signed-off-by: Alex Lyn <alex.lyn@antgroup.com> Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-07 10:33:26 +02:00
Alex Lyn	064aa340ab	runtime-rs: Wire modern device types into device config and manager Add DeviceConfig::VfioModernCfg and DeviceConfig::BlockCfgModern variants so the device manager can accept creation requests for the modern VFIO and block drivers introduced in the previous commits. Wire find_device() to look up VfioModern devices by iommu_group_devnode and BlockModern devices by path_on_host. Add create_block_device_modern() for BlockConfigModern with the same driver-option normalization and virt-path assignment as the legacy path. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com> Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-07 10:33:26 +02:00
Alex Lyn	6c0b53fe36	runtime-rs: Add BlockDeviceModern driver Add a modern block device driver using the Arc<Mutex> pattern for interior mutability, matching the VfioDeviceModern approach. The driver implements the Device trait with attach/detach/hotplug lifecycle management, and supports BlockConfigModern with logical and physical sector size fields. Add the DeviceType::BlockModern enum variant so the driver compiles. The device_manager and hypervisor cold-plug wiring follow in subsequent commits. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com> Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-07 10:33:26 +02:00
Alex Lyn	e72ed1c12e	runtime-rs: Add VFIO modern device driver Add the VfioDeviceModern driver for VFIO device passthrough in runtime-rs. The driver handles device discovery through sysfs, detects whether the host uses iommufd cdev or legacy VFIO group interfaces, resolves PCI BDF addresses and IOMMU groups, and implements the Device and PCIeDevice traits for hypervisor integration. The module is structured as: - core.rs: sysfs discovery, BDF parsing, IOMMU group resolution, device-node path logic for both iommufd cdev and legacy group paths - device.rs: VfioDeviceModern/VfioDeviceModernHandle types, Device and PCIeDevice trait implementations - mod.rs: host capability detection (iommufd vs legacy), backend selection logic The DeviceType::VfioModern enum variant and stub PCIeTopology methods (reserve_bus_for_device, release_bus_for_device) are added so the driver compiles; full topology wiring follows in a subsequent commit. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com> Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-07 10:33:26 +02:00
Alex Lyn	564c39907a	runtime-rs: Improve vsock connect with spawn_blocking and backoff The vsock connect loop previously ran the blocking connect(2) syscall directly on a tokio async worker thread, which could stall other async tasks. Move the socket creation and connect(2) call into spawn_blocking so the async runtime remains responsive. Replace the fixed-interval retry loop with an Instant-based deadline and bounded exponential backoff (10ms-500ms, doubling each attempt). This avoids hammering the vsock endpoint during slow VM boots while still converging quickly once the guest agent is ready. Also improve log messages to include attempt counts and remaining time. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com> Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-07 10:33:26 +02:00
Greg Kurz	bb933f65e4	vendor: Remove `make vendor` across the repo `make vendor` isn't required anymore. People who need vendored code should use the `tools/packaging/release/generate_vendor.sh` script instead. Assisted-by: Claude AI Signed-off-by: Greg Kurz <groug@kaod.org>	2026-05-06 09:49:52 +02:00
Fabiano Fidêncio	210ad5de98	runtime-rs: Bump netlinks for Linux 6.17+ IPv6 dev conf RTNetlink Upgrade netlink-packet-route and rtnetlink so IFLA_INET6_CONF matches the kernel's 240-byte layout (DEVCONF_FORCE_FORWARDING). Adapt to API changes: NeighbourAttribute::LinkLayerAddress and bool MulticastSnooping. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-05 13:56:44 +02:00
Fabiano Fidêncio	258ab1eab4	runtime-rs: network: use provided name for virt interface lookup NetworkPair::new() always constructed the virtual interface name as "eth{idx}" and looked it up in the network namespace. This works for regular veth endpoints created by CNI (which names them eth0, eth1, etc.), but fails for interfaces injected by Multus with different names (e.g. "net1" for mlx5 Scalable Functions). The `name` parameter was only applied after the lookup to override the stored name, which is too late — the lookup already failed with "No such device (os error 19)". Use the provided name directly for the lookup when it is non-empty, falling back to "eth{idx}" only when no name is given. This also removes the now-redundant post-creation name override. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-05 12:07:06 +02:00
Fabiano Fidêncio	746d182c1a	runtime-rs: qemu: add CCW network hotplug & retry update_interface On s390x, QEMU uses the CCW bus instead of PCI. The network device hotplug path was hardcoded to find a PCI slot, which fails with "no free slots on PCI bridges" on s390x. Add CCW support to `hotplug_network_device`: when running on a native CCW bus, allocate a CCW subchannel address and use `devno` instead of PCI `bus`/`addr`/`vectors`. Additionally, after hotplugging a network device, the guest kernel needs time to probe the CCW device before the network interface appears. Add a retry loop (up to 10 attempts, 100ms apart) to `handle_interfaces` so that `update_interface` succeeds once the guest has created the link. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>	2026-05-03 19:26:39 +02:00
Steve Horsman	2435970fe8	Merge pull request #12933 from fidencio/topic/runtime-rs-decouple-dragonball-from-non-x86-checks runtime-rs: drop misleading unsupported arches gating	2026-04-28 18:36:16 +01:00
Aurélien Bombo	cf6a91a104	runtime-rs/config: rename cloud-hypervisor to clh This aligns with the previous commit and with runtime-go. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-04-28 10:58:01 -05:00
Aurélien Bombo	e4fbddb91a	ci: rename cloud-hypervisor to clh-runtime-rs This aligns with the qemu-runtime-rs naming and makes more sense. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-04-28 10:58:01 -05:00
Fabiano Fidêncio	7e5cc37fab	runtime-rs: resource: discover hugetlbfs page sizes from sysfs in test `volume::hugepage::tests::test_get_huge_page_size` was hard-coded to exercise the round-trip through `get_huge_page_option` / `get_page_size` for two hugetlbfs page sizes: `let format_sizes = ["1Gi", "2Mi"];` These are the sizes x86_64 Ubuntu kernels expose by default (`/sys/kernel/mm/hugepages/hugepages-{1048576,2048}kB`), but other architectures use different sizes: * s390x: typically `hugepages-1048576kB` only (1 GiB; no 2 MiB pool) -- the kernel returns `EINVAL` for the missing 2 MiB iteration: thread 'volume::hugepage::tests::test_get_huge_page_size' panicked at .../resource/src/volume/hugepage.rs:242:14: called `Result::unwrap()` on an `Err` value: EINVAL * ppc64le: page sizes vary by kernel build (e.g. 16M/16G with 64K base pages, 2M/1G with 4K base pages), and may not match `["1Gi", "2Mi"]` exactly. Same EINVAL on the iteration whose size isn't a registered hstate. The reason this never bit before is the same as the SELinux test in the previous-but-one commit: the runtime-rs `Makefile` wrapped `test` in an `ifeq UNSUPPORTED_ARCHS` block that turned it into `echo ...; exit 0` on s390x/ppc64le/riscv64gc, so the test was only ever exercised on x86_64 (and aarch64, which happens to have the same default hugetlb page sizes). Dropping that gate is what exposed the latent assumption. Replace the hard-coded list with a small helper that lists the hugetlbfs page sizes the running kernel actually exposes via `/sys/kernel/mm/hugepages/hugepages-NkB`, rendered as binary-unit strings (e.g. "2Mi", "1Gi") that are accepted both by the kernel's `pagesize=...` mount option and by `byte_unit::Byte::parse_str(s, /*allow_binary=*/ true)`. If `/sys/kernel/mm/hugepages` doesn't exist or the directory is empty (e.g. hugetlbfs is unconfigured in the test environment) the test simply returns -- there's nothing meaningful to round-trip.
On x86_64 the discovered list comes out as `["1Gi", "2Mi"]` (the same coverage as before). On s390x it becomes `["1Gi"]`, on ppc64le whatever that kernel build supports. Sysfs alone, however, is a necessary-but-not-sufficient signal: it tells us the kernel registered the page size, not whether this process is allowed to mount hugetlbfs. The ubuntu-24.04-s390x GHA runner demonstrates the gap -- it exposes `hugepages-1048576kB` via /sys but runs the build inside a user/mount namespace where mount(2) of hugetlbfs returns EPERM even when the test is invoked through sudo: thread 'volume::hugepage::tests::test_get_huge_page_size' panicked at .../resource/src/volume/hugepage.rs:292:14: called `Result::unwrap()` on an `Err` value: EPERM There's no portable capability bit we can sniff for that, so probe once with the first discovered size before iterating; if the probe mount fails, skip the test (rather than panic on something it can't control). A real regression on a host where mount() does work will still surface inside the loop below, since the per-size mount calls there continue to assert via `.unwrap()`. While here, feed the kernel-native shorthand (e.g. "2M", "1G") rather than the IEC form ("2Mi", "1Gi") to mount(2). hugetlbfs parses `pagesize=` via `memparse()`, which understands K/M/G but not the IEC `Ki/Mi/Gi`; today the kernel happens to silently drop the trailing `i` (memparse just stops scanning), but that leniency is incidental. /proc/mounts in turn always renders the option back as `pagesize=<N>{K,M,G}`, which is exactly the form `get_page_size()` already expects -- it strips `pagesize=` and unconditionally appends `i` before handing the result to byte_unit. Stripping the `i` for the mount option keeps the test's input aligned with the kernel's canonical syntax, while leaving the IEC form intact for the `Byte::parse_str(..., /*allow_binary=*/ true)` comparison. Also drop the unused `Ok` re-export from `use anyhow::{anyhow, Context, Ok, Result}`.
Every existing `Ok(...)` site in this module is the variant-constructor form, for which the prelude's `Result::Ok` already works fine in `anyhow::Result<T>` context (same enum, with `E = anyhow::Error` inferred from the surrounding return type), so nothing actually needed `anyhow::Ok` to begin with. Removing the import lets the new helper use plain `let Ok(entries) = ... else` / `let Ok(name) = ... else` patterns directly instead of funneling everything through `.ok()` + `if let Some(...)` to dodge the shadowing. Made-with: Cursor Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-28 16:25:31 +02:00
Fabiano Fidêncio	cd67638618	runtime-rs: hypervisor: don't assert kernel LSM behaviour in selinux test `selinux::tests::test_set_exec_label` had two branches: when SELinux is enabled it asserts that `set_exec_label` succeeds and round-trips the label through `/proc/thread-self/attr/exec`, and when SELinux is NOT enabled it asserted that `set_exec_label` returns `Err`. The second assertion is wrong -- it's a claim about the kernel/LSM interface, not about `set_exec_label` itself. `/proc/thread-self/attr/exec` is a generic LSM interface, not SELinux-specific. When no LSM owns the slot, kernel behaviour is arch/distro/build dependent: some kernels return `EINVAL` (observed on x86_64 Ubuntu CI runners, where the test was originally written and was passing), others silently accept the write (observed on ppc64le Ubuntu CI runners, which is what made this surface): thread 'selinux::tests::test_set_exec_label' panicked at src/runtime-rs/crates/hypervisor/src/selinux.rs:62:13: Expecting error, Got Ok(()) The reason this never blew up before is that the previous-but-one commit's `ifeq UNSUPPORTED_ARCHS ... exit 0` block in the runtime-rs `Makefile` made `make test` a no-op on s390x/ppc64le/riscv64gc. Dropping that gate (so `make test` actually runs on every arch that runtime-rs builds on) is what surfaced the latent bug. Drop the `else { assert!(ret.is_err(), ...); }` branch and replace it with a comment explaining why we deliberately don't assert on `ret` in that path. The "SELinux is enabled" branch is the only side that exercises anything we own; the no-SELinux path is a kernel detail that's not ours to normalize. Made-with: Cursor Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-28 16:25:31 +02:00
Fabiano Fidêncio	48ef1be3be	runtime-rs: Drop misleading "unsupported arch" gates The Makefile pretended to reject s390x, powerpc64le and riscv64gc by wrapping `default`, `test` and `install` in `ifeq UNSUPPORTED_ARCHS`, and `check` in `ifeq ($(ARCH),x86_64)`. In reality `default` and `install` were byte-for-byte identical in both branches, so only `test` and `check` were ever skipped. The user-visible "$(ARCH) is not currently supported" message and the bare `exit 0` made it look like the build was a no-op when in fact builds and installs were proceeding -- which has burned at least one maintainer trying to debug a downstream packaging failure (issue #12914). The original reasons those targets were skipped were: * `test` (commit `389ae9702`, 2022): `cargo test` would pull in the dragonball crate, which only builds on x86_64/aarch64. * `check`: delegates to `standard_rust_check` in utils.mk, which runs `cargo clippy --all-targets --all-features`. `--all-features` unconditionally turns on the `dragonball` (and `cloud-hypervisor`) feature regardless of arch, breaking the build wherever those crates can't compile. Both are now obsolete. The preceding commit arch-gated the dragonball and firecracker drivers (and their dependencies) at the Cargo and Rust source level, so on s390x/ppc64le/riscv64gc: * the `dragonball` cargo feature is a safe no-op -- enabling it just doesn't pull in the dep, * the `cloud-hypervisor` cargo feature still pulls in `ch-config` (which is portable Rust), but the `ch` driver module that uses it remains arch-gated at the source level, * `dbs-utils` and `hyperlocal` are not built at all. That means `cargo clippy --all-targets --all-features` -- exactly what `standard_rust_check` runs -- is safe on every architecture, and no runtime-rs-local override of `check` is needed. Drop both `ifeq` blocks and let `test` and `check` run on every arch the way `default` and `install` already did. 
Net result: `make {default,test,check,install}` now Just Works everywhere, with no arch-specific code paths in this Makefile and no misleading "not currently supported" messages. Fixes: #12914 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-04-28 16:25:31 +02:00
Fabiano Fidêncio	6a1d7f7d85	runtime-rs: Arch-gate dragonball and firecracker hypervisors Two of the in-tree hypervisor drivers, dragonball and firecracker, along with three of their transitive dependencies (the dragonball crate itself, dbs-utils, hyperlocal), are built unconditionally on every architecture even though both upstream projects only support x86_64 and aarch64: * dragonball: the dragonball VMM crate is x86_64+aarch64 only. The runtime-rs `dragonball` cargo feature is already gated via `USE_BUILTIN_DB` -> `ARCH_SUPPORT_DB` in the Makefile, so the default `make` flow does the right thing today. But anything that bypasses that gate -- a contributor running `cargo clippy --all-features`, a CI matrix that forces the feature on, etc. -- fails to build on s390x/ppc64le/riscv64gc, because the optional `dragonball` dependency is declared without a target predicate and Rust source sites reference it under a feature gate alone. * firecracker: firecracker upstream only releases for x86_64 and aarch64 (https://github.com/firecracker-microvm/firecracker/releases/tag/v1.15.1). The Makefile already reflects this -- `FCCMD` is only defined in the x86_64/aarch64 arch options files -- but the in-tree `firecracker` driver module compiles unconditionally, so on s390x/ppc64le/riscv64gc we still ship a runtime that thinks it can drive a hypervisor binary that doesn't exist on the platform. Decouple both at the Cargo and Rust source level, mirroring the existing cloud-hypervisor pattern. * Cargo.toml: move the optional `dragonball` dependency, plus `dbs-utils` and `hyperlocal` (whose only consumers are the dragonball and firecracker driver modules), into a target- specific dependency block: [target.'cfg(any(target_arch = "x86_64", target_arch = "aarch64"))'.dependencies] dbs-utils = { workspace = true } hyperlocal = { workspace = true } dragonball = { workspace = true, features = [ ... ], optional = true } On x86_64/aarch64 the resolved dep graph is unchanged. 
On s390x/ppc64le/riscv64gc enabling the `dragonball` feature becomes a safe no-op, and the dep graph for the `hypervisor` crate is completely free of any dragonball or firecracker artifacts. This also makes the gating self-policing: any future `use dbs_utils::...` or `use hyperlocal::...` outside an arch-gated module will fail to build on non-x86 instead of silently shipping dead code. * Rust modules: combine the existing `feature = "dragonball"` gate with `target_arch = "x86_64"|"aarch64"` on `pub mod dragonball;` and the dragonball-only constants (`DEV_HUGEPAGES`, `SHMEM`, `HUGE_SHMEM`) in `crates/hypervisor/src/lib.rs`. Add the same target_arch gate to `pub mod firecracker;` (matching the existing gate on `pub mod ch;`) and to every site in `crates/runtimes/virt_container/src/{lib,sandbox}.rs` that names a now-gated type (`Dragonball`, `Firecracker`, `DragonballConfig`, `FirecrackerConfig`). * `pub(crate) enum VmmState` in `crates/hypervisor/src/lib.rs` gets the same target_arch gate -- its only consumers are the `ch`, `dragonball` and `firecracker` modules, all of which are gated to x86_64+aarch64. Without it, `cargo clippy --all-features -- -D warnings` (i.e. what `make check` runs via `standard_rust_check`) would fail on non-x86 with "enum `VmmState` is never used". The plain `HYPERVISOR_DRAGONBALL` and `HYPERVISOR_FIRECRACKER` string constants stay ungated, and the persist-side match arms in `sandbox.rs` that only compare against those strings also stay ungated, mirroring how `HYPERVISOR_NAME_CH` is already handled. Verified with `cargo tree --target=<triple> --features dragonball -p hypervisor` for x86_64/aarch64/s390x/powerpc64le/riscv64gc: * x86_64/aarch64: full dragonball stack (dbs_address_space, dbs_allocator, dbs_arch, dbs_boot, dbs_device, dbs_interrupt, dbs_legacy_devices, dbs_pci, dbs_upcall, dbs-utils, hyperlocal, ...) is pulled in, as before. 
* s390x/ppc64le/riscv64gc: the dep graph for the `hypervisor` crate is completely free of any dragonball or firecracker artifacts, even with `--features dragonball` explicitly enabled. `cargo clippy --target=s390x-unknown-linux-gnu --all-targets --all-features --release --locked -- -D warnings` is also clean, and `make check` on x86_64 with the default `USE_BUILTIN_DB=true` still passes. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-04-28 16:25:31 +02:00
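The source-level half of the arch gating described in commit 6a1d7f7d85 can be sketched as below. This is a minimal illustration, not the code from `crates/hypervisor/src/lib.rs`: the module body, `available_drivers`, and the assumption that a `qemu` driver exists on every architecture are all hypothetical, and the `feature = "dragonball"` half of the gate is left out so the snippet builds without a Cargo manifest.

```rust
// Driver module that only exists on the architectures the VMM supports.
#[cfg(any(target_arch = "x86_64", target_arch = "aarch64"))]
pub mod firecracker {
    /// Name the (hypothetical) driver registers itself under.
    pub const NAME: &str = "firecracker";
}

/// Plain string constant: stays ungated, so code that only compares
/// against the name still builds on every architecture.
pub const HYPERVISOR_FIRECRACKER: &str = "firecracker";

/// Hypothetical helper listing the drivers compiled into this build.
/// "qemu" being present everywhere is an assumption of this sketch.
pub fn available_drivers() -> Vec<&'static str> {
    let mut drivers = vec!["qemu"];
    // Every site that names the gated module needs the same predicate.
    #[cfg(any(target_arch = "x86_64", target_arch = "aarch64"))]
    drivers.push(firecracker::NAME);
    drivers
}
```

On s390x, ppc64le, or riscv64gc the `firecracker` module is not compiled at all, so any ungated reference to it becomes a build error instead of dead code.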
Fabiano Fidêncio	19bb8746f8	runtime-rs: rescan network at Start RPC for Docker 26+ Docker 26+ configures the container's veth pair between the Create and Start RPCs by bind-mounting `/proc/<vmm_pid>/ns/net`. The Rust shim's network scan during sandbox creation finds no interfaces because they don't exist yet. The Go shim (commit `f7878cc`) solves this with `detectHypervisorNetns` inside `addAllEndpoints`: when the placeholder netns is empty, it switches to the hypervisor's network namespace and rescans there. Port this approach to the Rust shim: - Add `rescan_network()` to the `Sandbox` trait - Implement it on `VirtSandbox`: build a rescan config that always targets the hypervisor's netns (`/proc/<vmm_pid>/ns/net`), bypassing the placeholder netns and the `network_created` flag - Call `sandbox.rescan_network()` synchronously in the `StartProcess` handler, before `cm.start_process()`, so interfaces are wired before the container process runs Fixes: #9340 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-04-28 10:20:18 +02:00
Fabiano Fidêncio	0da2f00488	runtime-rs: resource: add network rescan polling for Docker 26+ Docker 26+ configures veth pairs in the hypervisor's network namespace between the Create and Start RPCs. The initial network scan during sandbox creation finds no interfaces because they do not exist yet. Add `rescan_network_if_unconfigured` which polls the network namespace (50ms intervals, 5s timeout) until interfaces appear, then pushes the configuration to the guest agent. This mirrors the Go runtime's `RescanNetwork` (commit `f7878cc`). Supporting changes: - Derive `Clone` on `NetworkWithNetNsConfig` so it can be reused across poll iterations - Add `tokio/time` feature to the resource crate - Add `apply_network_to_agent` helper to push interfaces, routes, and neighbors to the guest Fixes: #9340 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-04-28 10:20:18 +02:00
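The polling behaviour described in commit 0da2f00488 (retry at a fixed interval until interfaces appear or a timeout expires) might look roughly like this. The commit uses `tokio/time`; this sketch swaps in blocking `std::thread::sleep` so it stays dependency-free, and `poll_until` and its closure-based probe are hypothetical names, not the signature of `rescan_network_if_unconfigured`.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Call `probe` every `interval` until it yields `Some`, or give up
/// once `timeout` has elapsed. The probe always runs at least once.
pub fn poll_until<T>(
    interval: Duration,
    timeout: Duration,
    mut probe: impl FnMut() -> Option<T>,
) -> Option<T> {
    let deadline = Instant::now() + timeout;
    loop {
        // In the commit's terms, a probe would scan the netns for interfaces.
        if let Some(found) = probe() {
            return Some(found);
        }
        if Instant::now() >= deadline {
            return None;
        }
        sleep(interval);
    }
}
```

With the values from the commit message, a caller would pass `Duration::from_millis(50)` and `Duration::from_secs(5)`, and push the result to the guest agent only when `Some` comes back.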
Fabiano Fidêncio	67679ddd15	runtime-rs: detect Docker 26+ netns from hook args and filter /proc/0/ Docker 26+ with `runtimeType` may not publish the network namespace in `linux.namespaces` at create time. Instead, the netns path can be discovered from `libnetwork-setkey` hook arguments. Additionally, filter out the invalid `/proc/0/ns/net` placeholder that appears when the task PID is not yet known. This mirrors the Go runtime's `DockerNetnsPath` fallback logic. Fixes: #9340 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-04-28 10:20:18 +02:00
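The placeholder filtering from commit 67679ddd15 can be illustrated with a small sketch. Both function names are hypothetical, and the parsing of the `libnetwork-setkey` hook arguments is deliberately not shown; the hook-derived path is just passed in as an already-extracted string.

```rust
/// Return the path only if it can refer to a real network namespace.
/// `/proc/0/ns/net` appears when the task PID is not yet known.
pub fn usable_netns_path(path: &str) -> Option<&str> {
    let trimmed = path.trim();
    if trimmed.is_empty() || trimmed == "/proc/0/ns/net" {
        None
    } else {
        Some(trimmed)
    }
}

/// Prefer the path from the OCI spec; fall back to one discovered
/// from hook arguments (extraction of that path is assumed done).
pub fn pick_netns<'a>(
    from_spec: Option<&'a str>,
    from_hook: Option<&'a str>,
) -> Option<&'a str> {
    from_spec
        .and_then(usable_netns_path)
        .or_else(|| from_hook.and_then(usable_netns_path))
}
```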
Fabiano Fidêncio	3ad2de584f	runtime-rs: return hypervisor PID from container manager methods Docker and containerd use the PID returned by the shim to construct `/proc/<pid>/ns/net` for network namespace operations. The Rust shim was returning the shim's own PID instead of the hypervisor's PID, which meant Docker would look at the wrong network namespace. Update `create_container`, `start_process`, `state_process`, `pid`, and `connect_container` to return the VMM master thread/process ID (`vmm_master_tid`) instead of `self.pid`. For QEMU this is the QEMU process PID; for Dragonball this is the VMM thread ID — both are valid for `/proc/<id>/ns/net` on Linux. Fixes: #9340 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-04-28 10:20:18 +02:00
Fabiano Fidêncio	b1393f03c4	runtime-rs: fix ConnectResponse to set both shim_pid and task_pid The containerd runtime v2 `shimTask.Create()` discards the `CreateTaskResponse.Pid` and instead retrieves the task PID by calling the shim's Connect RPC, reading `ConnectResponse.task_pid`. The Rust shim only set `shim_pid` in the ConnectResponse, leaving `task_pid` at its default zero value. This caused Docker to call `sb.SetKey("/proc/0/ns/net", ...)` which fails with "no such file or directory". Set `shim_pid` to the actual shim process ID and `task_pid` to the hypervisor PID (vmm_master_tid), matching the Go shim's Connect handler behavior. Fixes: #9340 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-04-28 10:20:18 +02:00
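Why a zero `task_pid` breaks Docker, as described in commit b1393f03c4, is easy to see in a sketch. `ConnectReply` is a hypothetical stand-in for the generated `ConnectResponse` type, and `connect_reply` and `netns_path` are illustrative helpers rather than functions from the shim.

```rust
/// Hypothetical stand-in for the Connect RPC reply.
#[derive(Debug, Default, PartialEq)]
pub struct ConnectReply {
    pub shim_pid: u32,
    pub task_pid: u32,
}

/// Fill in both fields: the shim's own PID and the hypervisor PID.
pub fn connect_reply(vmm_pid: u32) -> ConnectReply {
    ConnectReply {
        shim_pid: std::process::id(),
        task_pid: vmm_pid,
    }
}

/// Path a client would derive from the reply to reach the netns.
pub fn netns_path(reply: &ConnectReply) -> String {
    format!("/proc/{}/ns/net", reply.task_pid)
}
```

A reply left at its default yields `/proc/0/ns/net`, which is the path that fails with "no such file or directory" in the commit message.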
Steve Horsman	d5785b4eba	Merge pull request #12872 from stevenhorsman/bump-rust-to-1.93 Bump rust to 1.93	2026-04-27 09:01:00 +01:00
Fabiano Fidêncio	b3ed669d16	Merge pull request #12913 from pmores/fix-exec runtime-rs: fix exec when selinux is disabled on guest	2026-04-25 17:34:46 +02:00
stevenhorsman	1dbfd4b7f4	runtime-rs: Fix clippy warnings for Rust 1.93 - Replace is_ok() check followed by unwrap_err() with if let Err pattern - Replace .err().expect() with .expect_err() - Replace is_some() check followed by unwrap() with if let Some pattern These changes address clippy::unnecessary_unwrap and clippy::err_expect warnings in Rust 1.93. Assisted-by: IBM Bob Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-04-25 11:27:39 +01:00
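The three rewrites listed in commit 1dbfd4b7f4 follow a common shape, sketched here on made-up functions (none of these names come from runtime-rs). Each doc comment shows the form the commit message says was replaced.

```rust
/// Parse a port number, returning a descriptive error on failure.
pub fn parse_port(s: &str) -> Result<u16, String> {
    s.parse::<u16>()
        .map_err(|e| format!("invalid port {s:?}: {e}"))
}

/// Replaces an `is_ok()` check followed by `unwrap_err()`:
/// bind the error directly with `if let Err`.
pub fn describe(r: &Result<u16, String>) -> String {
    if let Err(e) = r {
        return format!("error: {e}");
    }
    "ok".to_string()
}

/// Replaces an `is_some()` check followed by `unwrap()`:
/// bind the value directly with `if let Some`.
pub fn first_or_zero(o: Option<u16>) -> u16 {
    if let Some(v) = o {
        return v;
    }
    0
}

/// Replaces `.err().expect(..)` with `.expect_err(..)`, which panics
/// with the given message if the result is `Ok`.
pub fn error_of(r: Result<u16, String>) -> String {
    r.expect_err("expected the parse to fail")
}
```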
Pavel Mores	d3f56cd3a6	runtime-rs: remove process selinux label on exec if disable_guest_selinux Without this commit any attempt to exec a command in a container will fail if SELinux is disabled in the guest but an SELinux label is given for the new process. That will happen pretty much any time SELinux is enabled on the host (and the container is not privileged). Signed-off-by: Pavel Mores <pmores@redhat.com>	2026-04-25 11:27:15 +01:00


1258 Commits