Add qemu-nvidia-gpu-runtime-rs and qemu-nvidia-gpu-snp-runtime-rs to
the NVIDIA GPU test matrix so CI covers the new runtime-rs shims.
Introduce a `coco` boolean field in each matrix entry and use it for
all CoCo-related conditionals (KBS, snapshotter, KBS deploy/cleanup
steps). This replaces fragile name-string comparisons that were already
broken for the runtime-rs variants: `nvidia-gpu (runtime-rs)` was
incorrectly getting KBS steps, and `nvidia-gpu-snp (runtime-rs)` was
not getting the right env vars.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Register the new qemu-nvidia-gpu-tdx-runtime-rs shim across the kata-deploy
stack so it is built, installed, and exposed as a RuntimeClass.
This adds the shim to the Rust binary's RUST_SHIMS list (so it uses the
runtime-rs binary), the SHIMS list, the qemu-tdx-experimental share name
mapping, and the x86_64 default shim set. The Helm chart gets the new
shim entry in values.yaml, try-kata-nvidia-gpu.values.yaml, and the
RuntimeClass overhead definition in runtimeclasses.yaml.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add a new runtime-rs configuration template that combines the NVIDIA GPU
cold-plug stack with Intel TDX confidential guest support. This is the
runtime-rs counterpart of the Go runtime's configuration-qemu-nvidia-gpu-tdx
template.
The template merges the GPU NV settings (VFIO cold-plug, Pod Resources API,
NV-specific kernel/image/firmware, extended timeouts) with TDX confidential
guest settings (confidential_guest, OVMF.inteltdx.fd firmware, TDX Quote
Generation Service socket, confidential NV kernel and image).
The Makefile is updated with the new config file registration and the
FIRMWARETDVFPATH_NV variable pointing to OVMF.inteltdx.fd.
Also remove a stray tdx_quote_generation_service_socket_port setting
from the SNP GPU template, where it did not belong.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Register the new qemu-nvidia-gpu-snp-runtime-rs shim across the kata-deploy
stack so it is built, installed, and exposed as a RuntimeClass.
This adds the shim to the Rust binary's RUST_SHIMS list (so it uses the
runtime-rs binary), the SHIMS list, the qemu-snp-experimental share name
mapping, and the x86_64 default shim set. The Helm chart gets the new
shim entry in values.yaml, try-kata-nvidia-gpu.values.yaml, and the
RuntimeClass overhead definition in runtimeclasses.yaml.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add a new runtime-rs configuration template that combines the NVIDIA GPU
cold-plug stack with AMD SEV-SNP confidential guest support. This is the
runtime-rs counterpart of the Go runtime's configuration-qemu-nvidia-gpu-snp
template.
The template merges the GPU NV settings (VFIO cold-plug, Pod Resources API,
NV-specific kernel/image/firmware, extended timeouts) with the SNP
confidential guest settings (confidential_guest, sev_snp_guest, SNP ID
block/auth, guest policy, AMDSEV.fd firmware, confidential NV kernel and
image).
The Makefile is updated with the new config file registration, the
CONFIDENTIAL_NV image/kernel variables, and FIRMWARESNPPATH_NV pointing
to AMDSEV.fd.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Register the Rust NVIDIA GPU runtime as a kata-deploy shim so it gets
installed and configured alongside the existing Go-based
qemu-nvidia-gpu shim.
Add qemu-nvidia-gpu-runtime-rs to the RUST_SHIMS list and the default
enabled shims, create its RuntimeClass entry in the Helm chart, and
include it in the try-kata-nvidia-gpu values overlay. The kata-deploy
installer will now copy the runtime-rs configuration and create the
containerd runtime entry for it.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add a QEMU configuration template for the NVIDIA GPU runtime-rs shim,
mirroring the Go runtime's configuration-qemu-nvidia-gpu.toml.in. The
template uses _NV-suffixed Makefile variables for kernel, image, and
verity params so the GPU-specific rootfs and kernel are selected at
build time.
Wire the new config into the runtime-rs Makefile: define
FIRMWAREPATH_NV with arch-specific OVMF/AAVMF paths (matching the Go
runtime's PR #12780), add EDK2_NAME for x86_64, and register the config
in CONFIGS/CONFIG_PATHS/SYSCONFIG_PATHS so it gets installed alongside
the other runtime-rs configurations.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Extend the in-guest agent's VFIO device handler to support the cold-plug
flow. When the runtime cold-plugs a GPU before the VM boots, the agent
needs to bind the device to the vfio-pci driver inside the guest and
set up the correct /dev/vfio/ group nodes so the workload can access
the GPU.
This updates the device discovery logic to handle the PCI topology that
QEMU presents for cold-plugged vfio-pci devices and ensures the IOMMU
group is properly resolved from the guest's sysfs.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Use BlockCfgModern for rawblock volumes when the hypervisor supports it,
passing logical and physical sector sizes from the volume metadata.
In the container manager, clear Linux.Resources fields (Pids, BlockIO,
Network) that genpolicy expects to be null, and filter VFIO character
devices from Linux.Devices to avoid policy rejection.
Update Dragonball's inner_device to handle the DeviceType::VfioModern
variant in its no-op match arm.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Extend the resource manager to handle VfioModern and BlockModern device
types when building the agent's device list and storage list. For VFIO
modern devices, the manager resolves the container path and sets the
agent Device.id to match what genpolicy expects.
Rework CDI device annotation handling in container_device.rs:
- Strip the "vfio" prefix from device names when building CDI annotation
keys (cdi.k8s.io/vfio0, cdi.k8s.io/vfio1, etc.)
- Remove the per-device index suffix that caused policy mismatches
- Add iommufd cdev path support alongside legacy VFIO group paths
Update the vfio driver to detect iommufd cdev vs legacy group from
the CDI device node path.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Query the kubelet Pod Resources API during sandbox setup to discover
which GPU devices have been allocated to the pod. When cold_plug_vfio
is enabled, the sandbox resolves CDI device specs, extracts host PCI
addresses and IOMMU groups from sysfs, and creates VfioModernCfg
device entries that get passed to the hypervisor for cold-plug.
Add pod-resources and cdi crate dependencies to the runtimes and
virt_container workspace members.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Implement add_device() and remove_device() support for
DeviceType::VfioModern and DeviceType::BlockModern in the QEMU inner
hypervisor layer.
For cold-plug (before VM boot): VfioDeviceConfig/VfioDeviceGroup
structs are constructed from the device's resolved PCI address, IOMMU
group, and bus assignment, then appended to the QEMU command line via
cmdline_generator.
For hotplug (after VM boot): the iommufd QMP path is used to live-add
VFIO devices via object-add + device_add.
Block devices use VirtioBlkDevice with the modern config's sector size
fields and are always cold-plugged onto the command line.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add hotplug_iommufd() to QmpClient for attaching VFIO devices through
the iommufd backend at runtime via QMP object-add + device_add.
Bump QMP connection and command timeouts from 10s to 30s to accommodate
the longer initialization time when VFIO devices are cold-plugged
(IOMMU domain setup and device reset can be slow for GPUs).
Re-export cmdline_generator types from qemu/mod.rs for downstream use.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add QEMU command-line parameter types for VFIO device cold-plug:
- ObjectIommufd: /dev/iommu object for iommufd-backed passthrough
- PCIeVfioDevice: vfio-pci device on a PCIe root port or switch port,
supporting both legacy VFIO group and iommufd cdev backends
- FWCfgDevice: firmware config device for fw_cfg blob injection
- VfioDeviceBase/VfioDeviceConfig/VfioDeviceGroup: high-level wrappers
that compose the above into complete QEMU argument sets, resolving
IOMMU groups, device nodes, and per-device fw_cfg entries
Refactor existing cmdline structs (BalloonDevice, VirtioNetDevice,
VirtioBlkDevice, etc.) to use a shared devices_to_params() helper
and align the ToQemuParams implementations.
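As a rough illustration of how such parameter types compose into a QEMU
argument list, here is a minimal synchronous sketch. The trait ToParams, the
structs IommufdObject and VfioPciDevice, and the helper flatten_params are
hypothetical stand-ins, not the ToQemuParams trait or the structs added by
this commit; only the emitted `-object iommufd,...` and
`-device vfio-pci,...` option shapes follow QEMU's command-line syntax.

```rust
/// Hypothetical, synchronous stand-in for a "render to QEMU args" trait.
trait ToParams {
    fn to_params(&self) -> Vec<String>;
}

/// Models an iommufd backend object (rendered as `-object iommufd,id=...`).
struct IommufdObject {
    id: String,
}

impl ToParams for IommufdObject {
    fn to_params(&self) -> Vec<String> {
        vec!["-object".to_string(), format!("iommufd,id={}", self.id)]
    }
}

/// Models a vfio-pci device plugged into a named PCIe port.
struct VfioPciDevice {
    host_bdf: String,
    bus: String,
    /// Some(id) selects the iommufd backend; None keeps the legacy group path.
    iommufd_id: Option<String>,
}

impl ToParams for VfioPciDevice {
    fn to_params(&self) -> Vec<String> {
        let mut opts = vec![
            "vfio-pci".to_string(),
            format!("host={}", self.host_bdf),
            format!("bus={}", self.bus),
        ];
        if let Some(id) = &self.iommufd_id {
            opts.push(format!("iommufd={id}"));
        }
        vec!["-device".to_string(), opts.join(",")]
    }
}

/// Concatenate the arguments of several devices, preserving their order.
fn flatten_params(devices: &[&dyn ToParams]) -> Vec<String> {
    devices.iter().flat_map(|d| d.to_params()).collect()
}
```

Keeping each type responsible for its own option string means a shared helper
only has to concatenate the results in order.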
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Extend PCIeTopology to support cold-plug port reservation and release
for VFIO devices. New fields track the topology mode (NoPort, RootPort,
SwitchPort), whether cold-plug dynamic expansion is enabled, and a map
of reserved bus assignments per device.
PCIeTopology::new() now infers the mode from the configured root-port
and switch-port counts, pre-seeds the port structures, and makes
add_root_ports_on_bus() idempotent so that PortDevice::attach can
safely call it again after the topology has already been initialized.
New methods:
- reserve_bus_for_device: allocate a free root port or switch downstream
port for a device, expanding the port map when cold_plug is enabled
- release_bus_for_device: free the previously reserved port
- find_free_root_port / find_free_switch_down_port: internal helpers
- release_root_port / release_switch_down_port: internal helpers
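The reservation behaviour described above can be modelled roughly as follows.
This is a simplified sketch, not the PCIeTopology code: PortPool, infer_mode,
and the field names are hypothetical, and the rule that switch ports take
precedence over root ports when both are configured is an assumption.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TopologyMode {
    NoPort,
    RootPort,
    SwitchPort,
}

/// Infer the mode from the configured port counts.
/// Assumption: switch ports win when both counts are non-zero.
fn infer_mode(root_ports: u32, switch_ports: u32) -> TopologyMode {
    if switch_ports > 0 {
        TopologyMode::SwitchPort
    } else if root_ports > 0 {
        TopologyMode::RootPort
    } else {
        TopologyMode::NoPort
    }
}

/// Hypothetical pool of ports; each port maps to the device holding it.
struct PortPool {
    ports: BTreeMap<u32, Option<String>>,
    /// When true, reserve() may grow the pool instead of failing.
    cold_plug_expand: bool,
}

impl PortPool {
    fn new(count: u32, cold_plug_expand: bool) -> Self {
        let ports = (0..count).map(|id| (id, None)).collect();
        Self { ports, cold_plug_expand }
    }

    /// Reserve the lowest free port, expanding the pool if allowed.
    fn reserve(&mut self, device_id: &str) -> Option<u32> {
        let free = self
            .ports
            .iter()
            .find(|(_, owner)| owner.is_none())
            .map(|(&id, _)| id);
        let id = match free {
            Some(id) => id,
            // Port ids are contiguous from 0, so len() is the next unused id.
            None if self.cold_plug_expand => self.ports.len() as u32,
            None => return None,
        };
        self.ports.insert(id, Some(device_id.to_string()));
        Some(id)
    }

    /// Free the port held by `device_id`; returns false if it held none.
    fn release(&mut self, device_id: &str) -> bool {
        for owner in self.ports.values_mut() {
            if owner.as_deref() == Some(device_id) {
                *owner = None;
                return true;
            }
        }
        false
    }
}
```

With a fixed pool, a second reservation fails once every port is taken; with
expansion enabled, the pool grows by one port instead.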
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add DeviceConfig::VfioModernCfg and DeviceConfig::BlockCfgModern
variants so the device manager can accept creation requests for the
modern VFIO and block drivers introduced in the previous commits.
Wire find_device() to look up VfioModern devices by iommu_group_devnode
and BlockModern devices by path_on_host. Add create_block_device_modern()
for BlockConfigModern with the same driver-option normalization and
virt-path assignment as the legacy path.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add a modern block device driver using the Arc<Mutex> pattern for
interior mutability, matching the VfioDeviceModern approach. The driver
implements the Device trait with attach/detach/hotplug lifecycle
management, and supports BlockConfigModern with logical and physical
sector size fields.
Add the DeviceType::BlockModern enum variant so the driver compiles.
The device_manager and hypervisor cold-plug wiring follow in subsequent
commits.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add the VfioDeviceModern driver for VFIO device passthrough in
runtime-rs. The driver handles device discovery through sysfs, detects
whether the host uses iommufd cdev or legacy VFIO group interfaces,
resolves PCI BDF addresses and IOMMU groups, and implements the Device
and PCIeDevice traits for hypervisor integration.
The module is structured as:
- core.rs: sysfs discovery, BDF parsing, IOMMU group resolution,
device-node path logic for both iommufd cdev and legacy group paths
- device.rs: VfioDeviceModern/VfioDeviceModernHandle types, Device
and PCIeDevice trait implementations
- mod.rs: host capability detection (iommufd vs legacy), backend
selection logic
The DeviceType::VfioModern enum variant and stub PCIeTopology methods
(reserve_bus_for_device, release_bus_for_device) are added so the
driver compiles; full topology wiring follows in a subsequent commit.
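For readers unfamiliar with the BDF notation handled in core.rs, the parsing
step can be sketched as below. This is an illustrative sketch only: the Bdf
struct and parse_bdf function are hypothetical names, and accepting the short
"bb:dd.f" form with a default domain of 0 is an assumption rather than
documented driver behaviour.

```rust
/// A PCI address in domain:bus:device.function form, e.g. 0000:01:00.0.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Bdf {
    domain: u16,
    bus: u8,
    device: u8,
    function: u8,
}

/// Parse "dddd:bb:dd.f" or the short "bb:dd.f" form (domain defaults to 0).
/// All fields are hexadecimal.
fn parse_bdf(s: &str) -> Option<Bdf> {
    let (addr, function) = s.rsplit_once('.')?;
    let fields: Vec<&str> = addr.split(':').collect();
    let (domain, bus, device) = match fields.as_slice() {
        [domain, bus, device] => (u16::from_str_radix(domain, 16).ok()?, *bus, *device),
        [bus, device] => (0, *bus, *device),
        _ => return None,
    };
    let bdf = Bdf {
        domain,
        bus: u8::from_str_radix(bus, 16).ok()?,
        device: u8::from_str_radix(device, 16).ok()?,
        function: u8::from_str_radix(function, 16).ok()?,
    };
    // The device number is 5 bits wide and the function number 3 bits.
    (bdf.device < 32 && bdf.function < 8).then_some(bdf)
}
```

Rejecting out-of-range device and function numbers early keeps malformed sysfs
names from propagating into later IOMMU group lookups.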
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The vsock connect loop previously ran the blocking connect(2) syscall
directly on a tokio async worker thread, which could stall other async
tasks. Move the socket creation and connect(2) call into
spawn_blocking so the async runtime remains responsive.
Replace the fixed-interval retry loop with an Instant-based deadline
and bounded exponential backoff (10ms-500ms, doubling each attempt).
This avoids hammering the vsock endpoint during slow VM boots while
still converging quickly once the guest agent is ready.
Also improve log messages to include attempt counts and remaining time.
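The retry schedule can be sketched as follows. This is a simplified,
synchronous model of the deadline and backoff logic only: it does not use
tokio or spawn_blocking, and connect_with_deadline, next_backoff, and the
try_connect closure are hypothetical names standing in for the real connect
path.

```rust
use std::time::{Duration, Instant};

/// Initial and maximum retry delays, taken from the 10ms-500ms range above.
const INITIAL_BACKOFF: Duration = Duration::from_millis(10);
const MAX_BACKOFF: Duration = Duration::from_millis(500);

/// Double the delay after each failed attempt, capped at MAX_BACKOFF.
fn next_backoff(current: Duration) -> Duration {
    std::cmp::min(current * 2, MAX_BACKOFF)
}

/// Retry `try_connect` (a stand-in for the blocking connect) until it
/// succeeds or `timeout` elapses. On success, returns the attempt count.
fn connect_with_deadline<F>(mut try_connect: F, timeout: Duration) -> Result<u32, String>
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    let mut delay = INITIAL_BACKOFF;
    let mut attempt = 0u32;
    loop {
        attempt += 1;
        if try_connect() {
            return Ok(attempt);
        }
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            return Err(format!("timed out after {attempt} attempts"));
        }
        // Never sleep past the deadline.
        std::thread::sleep(std::cmp::min(delay, remaining));
        delay = next_backoff(delay);
    }
}
```

Measuring against a fixed deadline rather than counting attempts keeps the
total wait bounded even as the per-attempt delay grows.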
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The vfio-ioctls 0.6.0 crate changed the vfio_dma_map signature: the
host address parameter is now a raw pointer (*mut u8) instead of u64,
and the size parameter is usize instead of u64. Since the kernel uses
the host address to set up DMA mappings to physical memory — and the
caller must guarantee the memory behind that pointer remains valid for
the lifetime of the mapping — upstream marked vfio_dma_map as unsafe fn.
Wrap vfio_dma_map calls in unsafe blocks and adjust the type casts
accordingly. vfio_dma_unmap only needed the usize cast for the size
parameter (it does not take a host address, so it remains safe).
Bump workspace dependencies:
- vfio-bindings 0.6.1 -> 0.6.2
- vfio-ioctls 0.5.0 -> 0.6.0
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The VFIO cold-plug path needs to resolve a PCI device's sysfs address
from its /dev/vfio/ group or iommufd cdev node. Extend the PCI helpers
in kata-sys-util to support this: add a function that walks
/sys/bus/pci/devices to find a device by its IOMMU group, and expose the
guest BDF that the QEMU command line will reference.
These helpers are consumed by the runtime-rs hypervisor crate when
building VFIO device descriptors for the QEMU command line.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The Go runtime already exposes a [runtime] pod_resource_api_sock option
that tells the shim where to find the kubelet Pod Resources API socket.
The runtime-rs VFIO cold-plug code needs the same setting so it can
query assigned GPU devices before the VM starts.
Add the field to RuntimeConfig and wire it through deserialization so
that configuration-*.toml files can set it.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add a gRPC client crate that speaks the kubelet PodResourcesLister
service (v1). The runtime-rs VFIO cold-plug path needs this to discover
which GPU devices the kubelet has assigned to a pod so they can be
passed through to the guest before the VM boots.
The crate is intentionally kept minimal: it wraps the upstream
pod_resources.proto, exposes a Unix-domain-socket client, and
re-exports the generated types.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Update the name and move it to the static checks, as we don't
need to ensure it runs for non-code changes.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The generated cargo-deny action doesn't seem to work
and seems unnecessarily complex, so try using
EmbarkStudios/cargo-deny-action instead.
Fixes: #11218
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The new version of image-rs supports more types of signed images. First,
we added support for a few more key types. Second, we added support
for multi-arch images where the manifest digest is signed but the
individual arch manifest is not. These images are relatively common, so
let's pick up the fix ASAP.
Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
I don't think agent-ctl will benefit from the new image-rs features, but
let's update it to be complete.
Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
This is not related to this PR, but rather to #12734, which ended up not
running `make src/agent generate-protocols`.
While here, let's also fix it.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The hardcoded DEFAULT_LAUNCH_PROCESS_TIMEOUT of 6 seconds in the kata
agent is insufficient for environments with NVIDIA GPUs and NVSwitches,
where the attestation-agent needs significantly more time to collect
evidence during initialization (e.g. ~2 seconds per NVSwitch).
When the timeout expires, the agent (PID 1) exits with an error, causing
the guest kernel to perform an orderly shutdown before the
attestation-agent has finished starting.
Make this timeout configurable via the kernel parameter
agent.launch_process_timeout (in seconds), preserving the 6-second
default for backward compatibility. The Go runtime is wired up to pass
this value from the TOML config's [agent.kata] section through to the
kernel command line.
The NVIDIA GPU configs set the new default to 15 seconds.
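Parsing such a parameter from the kernel command line could look roughly like
the sketch below. The function name launch_process_timeout and the choice to
fall back to the default on an unparsable value are assumptions made for
illustration; the agent's actual parsing and error handling may differ.

```rust
use std::time::Duration;

/// Default preserved for backward compatibility (6 seconds).
const DEFAULT_LAUNCH_PROCESS_TIMEOUT: Duration = Duration::from_secs(6);
const PARAM_PREFIX: &str = "agent.launch_process_timeout=";

/// Return the timeout requested on the kernel command line, in seconds,
/// or the default when the parameter is absent.
/// Assumption: an unparsable value also falls back to the default.
fn launch_process_timeout(cmdline: &str) -> Duration {
    cmdline
        .split_whitespace()
        .find_map(|token| token.strip_prefix(PARAM_PREFIX))
        .and_then(|value| value.parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(DEFAULT_LAUNCH_PROCESS_TIMEOUT)
}
```

Passing the contents of /proc/cmdline to this function would yield 15 seconds
for a guest booted with agent.launch_process_timeout=15.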
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Add two new configuration knobs that control the logical and physical
sector sizes advertised by virtio-blk devices to the guest:
  block_device_logical_sector_size (config file)
  block_device_physical_sector_size (config file)
  io.katacontainers.config.hypervisor.blk_logical_sector_size (annotation)
  io.katacontainers.config.hypervisor.blk_physical_sector_size (annotation)
The annotation names are abbreviated relative to the config file keys
because Kubernetes enforces a 63-character limit on annotation name
segments, and the full names would exceed it.
Both settings default to 0 (let QEMU decide). When set, they are passed
as logical_block_size and physical_block_size in the QMP device_add
command during block device hotplug.
Setting logical_sector_size smaller than the container filesystem
block size will cause EINVAL on mount. The physical_sector_size can
always be set independently.
Values must be 0 or a power of 2 in the range [512, 65536]; other
values are rejected with an error at sandbox creation time.
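The validation rule can be expressed compactly. The following is a Rust sketch
of the rule as stated, not the runtime's own implementation;
validate_sector_size and its error text are hypothetical.

```rust
/// Accept 0 (let QEMU decide) or a power of two in [512, 65536].
fn validate_sector_size(name: &str, value: u32) -> Result<u32, String> {
    if value == 0 || (value.is_power_of_two() && (512..=65536).contains(&value)) {
        Ok(value)
    } else {
        Err(format!(
            "{name} must be 0 or a power of 2 in [512, 65536], got {value}"
        ))
    }
}
```

For example, 4096 passes, while 256 (too small), 1000 (not a power of two),
and 131072 (too large) are rejected.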
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
Add a global and per-shim configurable switch to enable/disable
the overhead section in generated RuntimeClasses. This allows users
to omit overhead when it's not needed or managed externally.
Priority: per-shim > global > default(true).
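The precedence can be modelled as a chain of optional values. This Rust sketch
only illustrates the resolution order; the chart itself implements it in Helm
templates, and overhead_enabled is a hypothetical name.

```rust
/// Resolve the switch: per-shim setting, else global setting, else true.
fn overhead_enabled(per_shim: Option<bool>, global: Option<bool>) -> bool {
    per_shim.or(global).unwrap_or(true)
}
```

A per-shim value of false therefore disables the overhead section for that
shim even when the global switch is true.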
Signed-off-by: LizZhang315 <123134987@qq.com>
Users were confused about which configuration file to edit because
kata-deploy copied the base config into a per-shim runtime directory
(runtimes/<shim>/) for config.d support, leaving the original file
in place untouched. This made it look like the original was the
authoritative config, when in reality the runtime was loading the
copy from the per-shim directory.
Replace the original config file with a symlink pointing to the
per-shim runtime copy after the copy is made. The runtime's
ResolvePath / EvalSymlinks follows the symlink and lands in the
per-shim directory, where it naturally finds config.d/ with all
drop-in fragments. This makes it immediately obvious that the
real configuration lives in the per-shim directory and removes the
ambiguity about which file to inspect or modify.
During cleanup, the symlink at the original location is explicitly
removed before the runtime directory is deleted.
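The copy-then-symlink step can be sketched as below for a Unix host. The
function install_config and its arguments are hypothetical and do not mirror
kata-deploy's actual code; the sketch is also not atomic, since the original
file is removed before the symlink is created.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::symlink;
use std::path::Path;

/// Copy `original` into `runtime_dir`, then replace `original` with a
/// symlink pointing at the copy. Does nothing if `original` is already
/// a symlink (for example, on a repeated install).
fn install_config(original: &Path, runtime_dir: &Path) -> io::Result<()> {
    if fs::symlink_metadata(original)?.file_type().is_symlink() {
        return Ok(());
    }
    let file_name = original
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "path has no file name"))?;
    fs::create_dir_all(runtime_dir)?;
    let copy = runtime_dir.join(file_name);
    fs::copy(original, &copy)?;
    fs::remove_file(original)?;
    symlink(&copy, original)
}
```

After this runs, reading the original path follows the symlink into the
per-shim directory, so both paths show the same content.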
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The k8s-confidential-attestation test extracts the QEMU command line
from journal logs to compute the SNP launch measurement. It only
matched the Go runtime's log format ("launching <path> with: [<args>]"),
but runtime-rs logs differently ("qemu args: <args>").
Handle both formats so the test works with qemu-snp-runtime-rs.
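The dual-format matching can be illustrated with the sketch below. It is
written in Rust purely for illustration and is not the test's own
implementation; extract_qemu_args is a hypothetical name, and real journal
lines may carry extra quoting or escaping that this does not handle.

```rust
/// Extract the QEMU argument string from a log line in either format.
fn extract_qemu_args(line: &str) -> Option<&str> {
    // runtime-rs format: qemu args: <args>
    if let Some((_, args)) = line.split_once("qemu args: ") {
        return Some(args.trim());
    }
    // Go runtime format: launching <path> with: [<args>]
    let (_, rest) = line.split_once("launching ")?;
    let (_, bracketed) = rest.split_once(" with: [")?;
    let end = bracketed.rfind(']')?;
    Some(bracketed[..end].trim())
}
```

Lines matching neither format yield None, so a caller can simply skip them
while scanning the journal.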
Made-with: Cursor
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
As we're in the process of stabilising runtime-rs for the coming 4.0.0
release, we had better start running as many tests as possible with it.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>