Extend the in-guest agent's VFIO device handler to support the cold-plug
flow. When the runtime cold-plugs a GPU before the VM boots, the agent
needs to bind the device to the vfio-pci driver inside the guest and
set up the correct /dev/vfio/ group nodes so the workload can access
the GPU.
This updates the device discovery logic to handle the PCI topology that
QEMU presents for cold-plugged vfio-pci devices and ensures the IOMMU
group is properly resolved from the guest's sysfs.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Use BlockCfgModern for rawblock volumes when the hypervisor supports it,
passing logical and physical sector sizes from the volume metadata.
In the container manager, clear Linux.Resources fields (Pids, BlockIO,
Network) that genpolicy expects to be null, and filter VFIO character
devices from Linux.Devices to avoid policy rejection.
Update Dragonball's inner_device to handle the DeviceType::VfioModern
variant in its no-op match arm.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Extend the resource manager to handle VfioModern and BlockModern device
types when building the agent's device list and storage list. For VFIO
modern devices, the manager resolves the container path and sets the
agent Device.id to match what genpolicy expects.
Rework CDI device annotation handling in container_device.rs:
- Strip the "vfio" prefix from device names when building CDI annotation
keys (cdi.k8s.io/vfio0, cdi.k8s.io/vfio1, etc.)
- Remove the per-device index suffix that caused policy mismatches
- Add iommufd cdev path support alongside legacy VFIO group paths
Update the vfio driver to detect iommufd cdev vs legacy group from
the CDI device node path.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Query the kubelet Pod Resources API during sandbox setup to discover
which GPU devices have been allocated to the pod. When cold_plug_vfio
is enabled, the sandbox resolves CDI device specs, extracts host PCI
addresses and IOMMU groups from sysfs, and creates VfioModernCfg
device entries that get passed to the hypervisor for cold-plug.
Add pod-resources and cdi crate dependencies to the runtimes and
virt_container workspace members.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Implement add_device() and remove_device() support for
DeviceType::VfioModern and DeviceType::BlockModern in the QEMU inner
hypervisor layer.
For cold-plug (before VM boot): VfioDeviceConfig/VfioDeviceGroup
structs are constructed from the device's resolved PCI address, IOMMU
group, and bus assignment, then appended to the QEMU command line via
cmdline_generator.
Block devices use VirtioBlkDevice with the modern config's sector size
fields and are always cold-plugged onto the command line.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Bump QMP connection timeout from 10s to 30s and initial read timeout
from 250ms to 5s to accommodate the longer initialization time when
VFIO devices are cold-plugged (IOMMU domain setup and device reset
can be slow for GPUs).
Re-export cmdline_generator types from qemu/mod.rs for downstream use.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add QEMU command-line parameter types for VFIO device cold-plug:
- ObjectIommufd: /dev/iommu object for iommufd-backed passthrough
- PCIeVfioDevice: vfio-pci device on a PCIe root port or switch port,
supporting both legacy VFIO group and iommufd cdev backends
- FWCfgDevice: firmware config device for fw_cfg blob injection
- VfioDeviceBase/VfioDeviceConfig/VfioDeviceGroup: high-level wrappers
that compose the above into complete QEMU argument sets, resolving
IOMMU groups, device nodes, and per-device fw_cfg entries
Refactor existing cmdline structs (BalloonDevice, VirtioNetDevice,
VirtioBlkDevice, etc.) to use a shared devices_to_params() helper
and align the ToQemuParams implementations.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Extend PCIeTopology to support cold-plug port reservation and release
for VFIO devices. New fields track the topology mode (NoPort, RootPort,
SwitchPort), whether cold-plug dynamic expansion is enabled, and a map
of reserved bus assignments per device.
PCIeTopology::new() now infers the mode from the configured root-port
and switch-port counts, pre-seeds the port structures, and makes
add_root_ports_on_bus() idempotent so that PortDevice::attach can
safely call it again after the topology has already been initialized.
New methods:
- reserve_bus_for_device: allocate a free root port or switch downstream
port for a device, expanding the port map when cold_plug is enabled
- release_bus_for_device: free the previously reserved port
- find_free_root_port / find_free_switch_down_port: internal helpers
- release_root_port / release_switch_down_port: internal helpers
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add DeviceConfig::VfioModernCfg and DeviceConfig::BlockCfgModern
variants so the device manager can accept creation requests for the
modern VFIO and block drivers introduced in the previous commits.
Wire find_device() to look up VfioModern devices by iommu_group_devnode
and BlockModern devices by path_on_host. Add create_block_device_modern()
for BlockConfigModern with the same driver-option normalization and
virt-path assignment as the legacy path.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add a modern block device driver using the Arc<Mutex> pattern for
interior mutability, matching the VfioDeviceModern approach. The driver
implements the Device trait with attach/detach/hotplug lifecycle
management, and supports BlockConfigModern with logical and physical
sector size fields.
Add the DeviceType::BlockModern enum variant so the driver compiles.
The device_manager and hypervisor cold-plug wiring follow in subsequent
commits.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add the VfioDeviceModern driver for VFIO device passthrough in
runtime-rs. The driver handles device discovery through sysfs, detects
whether the host uses iommufd cdev or legacy VFIO group interfaces,
resolves PCI BDF addresses and IOMMU groups, and implements the Device
and PCIeDevice traits for hypervisor integration.
The module is structured as:
- core.rs: sysfs discovery, BDF parsing, IOMMU group resolution,
device-node path logic for both iommufd cdev and legacy group paths
- device.rs: VfioDeviceModern/VfioDeviceModernHandle types, Device
and PCIeDevice trait implementations
- mod.rs: host capability detection (iommufd vs legacy), backend
selection logic
The DeviceType::VfioModern enum variant and stub PCIeTopology methods
(reserve_bus_for_device, release_bus_for_device) are added so the
driver compiles; full topology wiring follows in a subsequent commit.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The vsock connect loop previously ran the blocking connect(2) syscall
directly on a tokio async worker thread, which could stall other async
tasks. Move the socket creation and connect(2) call into
spawn_blocking so the async runtime remains responsive.
Replace the fixed-interval retry loop with an Instant-based deadline
and bounded exponential backoff (10ms-500ms, doubling each attempt).
This avoids hammering the vsock endpoint during slow VM boots while
still converging quickly once the guest agent is ready.
Also improve log messages to include attempt counts and remaining time.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The vfio-ioctls 0.6.0 crate changed the vfio_dma_map signature: the
host address parameter is now a raw pointer (*mut u8) instead of u64,
and the size parameter is usize instead of u64. Since the kernel uses
the host address to set up DMA mappings to physical memory — and the
caller must guarantee the memory behind that pointer remains valid for
the lifetime of the mapping — upstream marked vfio_dma_map as unsafe fn.
Wrap vfio_dma_map calls in unsafe blocks and adjust the type casts
accordingly. vfio_dma_unmap only needed the usize cast for the size
parameter (it does not take a host address, so it remains safe).
Bump workspace dependencies:
- vfio-bindings 0.6.1 -> 0.6.2
- vfio-ioctls 0.5.0 -> 0.6.0
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The VFIO cold-plug path needs to resolve a PCI device's sysfs address
from its /dev/vfio/ group or iommufd cdev node. Extend the PCI helpers
in kata-sys-util to support this: add a function that walks
/sys/bus/pci/devices to find a device by its IOMMU group, and expose the
guest BDF that the QEMU command line will reference.
These helpers are consumed by the runtime-rs hypervisor crate when
building VFIO device descriptors for the QEMU command line.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
The Go runtime already exposes a [runtime] pod_resource_api_sock option
that tells the shim where to find the kubelet Pod Resources API socket.
The runtime-rs VFIO cold-plug code needs the same setting so it can
query assigned GPU devices before the VM starts.
Add the field to RuntimeConfig and wire it through deserialization so
that configuration-*.toml files can set it.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add a gRPC client crate that speaks the kubelet PodResourcesLister
service (v1). The runtime-rs VFIO cold-plug path needs this to discover
which GPU devices the kubelet has assigned to a pod so they can be
passed through to the guest before the VM boots.
The crate is intentionally kept minimal: it wraps the upstream
pod_resources.proto, exposes a Unix-domain-socket client, and
re-exports the generated types.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Add BATS tests for the GetDiagnosticData termination log feature on
CoCo platforms where shared_fs=none.
Three test cases cover:
- Successful exit (exit 0): termination message is propagated when
GetDiagnosticDataRequest is allowed by policy.
- Failed exit (exit 1): termination message is propagated when
GetDiagnosticDataRequest is allowed by policy.
- Policy denied: with default CoCo policy (GetDiagnosticDataRequest
is false), the container stops cleanly but no termination message
is propagated (best-effort behavior).
Tests are skipped on non-CoCo platforms where shared_fs is not "none".
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add runtime-rs support for the GetDiagnosticData RPC. This extends
the Agent trait, types, and protocol translation layer with the new
request/response types.
During container stop, when shared_fs is "none" and the
terminationMessagePolicy annotation is "File", the runtime copies
the termination log from the guest via GetDiagnosticData. The call
is best-effort to avoid blocking container teardown.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add policy rules for the new GetDiagnosticDataRequest RPC.
The request is denied by default in genpolicy-generated policies,
ensuring CoCo workloads do not expose diagnostic data unless
explicitly opted in via policy_data.request_defaults.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Silenio Quarti <silenio_quarti@ca.ibm.com>
Add a new extensible GetDiagnosticData RPC that retrieves diagnostic
information from the guest VM. The request carries a log_type string
field to specify what kind of data is requested, and a container_id
field to identify the target container.
The first supported log_type is "termination_log", which reads the
Kubernetes termination message file from inside the guest. This is
needed for shared_fs=none configurations where the host cannot
directly access the guest filesystem.
On the Go runtime side, the container stop() path now calls
GetDiagnosticData to copy the termination message to the host
when running with NoSharedFS and the terminationMessagePolicy
annotation is set to "File". The call is best-effort: failures
are logged as warnings rather than blocking container teardown.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Silenio Quarti <silenio_quarti@ca.ibm.com>
Update the stale issues workflow to run more frequently:
- Weekdays: Every 4 hours (6x per day) at 00:00, 06:00, 12:00, 18:00 UTC
- Weekends: Every hour (24x per day)
Previously ran once daily at midnight UTC. This change reduces the time
it will take for us to get through our backlog, particularly increasing
the runs at the weekend, when we should have less other CI running,
which it could impact due to GH API rate limiting.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Utilise the new hypervisor helpers in our CI and test
code to help add clarity and reduce duplication
Note: `kubernetes_dir` is declared as readonly in
tests/integration/kubernetes/setup.sh which is sourced
by tests_common.sh, so we update it to only be set if
unset
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Add a pure shell script which the CI and integration tests can
use to check for different categories of runtime
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
I'm doing some bookkeeping in the Azure subscription that requires we move
from eastus to eastus2. This should have no user-facing impact.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
I've seen several cases of the CLH tests just being killed due to the 60
minutes timeout. Let's bump it to 75 and see how it goes.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Adding the pod annotation config to the doc site. A symlink is created
at docs/pod-annotations.md that points to
how-to/how-to-set-sandbox-config-kata.md so that the URL for this file will be
created at `/pod-annotations`. Also adding brief contrbuting guidelines and
how-to's for running the documentation site locally for local previews.
Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
Newer kernels and containerd versions (>= 2.2.3) may add extra mount
options to /sys/fs/cgroup that genpolicy does not embed in the policy
(e.g. nsdelegate, memory_recursiveprot). This causes the Kata agent to
reject CreateContainerRequest with PERMISSION_DENIED because the
check_mount rules require an exact match.
Rather than hard-coding the allowed extras in Rego, make them
configurable via genpolicy-settings.json under
cluster_config.cgroup_mount_extras_allowed. The corresponding Rego rule
(check_mount 4) reads the list from policy_data.cluster_config and
allows only those named options beyond the policy-embedded set.
To support this, cluster_config is now included in PolicyData so that
it gets serialized into the Rego policy_data object at generation time.
This follows the established pattern of keeping site- and
version-specific tunables in genpolicy-settings.json so they can be
overridden via JSON-Patch drop-ins without touching the Rego source.
A policy test case is added to verify that the default allowed extras
(nsdelegate, memory_recursiveprot) are accepted and that unknown extras
are rejected.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
I've updaed the images on the Confidential Containers side, in order to
add arm64 support, but I didn't realize it'd break tests not using
those.
Apologies!
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>