Address the issue where signal_process returns an INTERNAL error when
the container's init process has already exited, and ensure teardown
is never aborted by signal failures.
Introduce is_no_such_process_error() to detect "no such process"
conditions (ESRCH/ENOENT codes or equivalent messages). When the init
process is already gone, treat it as success with an info log instead
of an error.
In stop_process(), never propagate signal failures. During sandbox
shutdown the agent connection is often already closed, causing
AgentConnectionClosed errors that bypass is_no_such_process_error().
If stop_process() aborts on such errors, cleanup_container() is skipped
and leftover mounts cause "Resource busy" failures in sandbox cleanup.
Restore "always proceed to cleanup" semantics: log the failure as a
warning, but never skip resource cleanup.
Resource cleanup must be best-effort and idempotent regardless of kill
outcome.
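The "always proceed to cleanup" rule can be sketched as follows. This is a minimal illustration, not the actual runtime-rs code: `SignalError`, the closure parameters, and the returned log lines are hypothetical stand-ins for the agent call, cleanup_container(), and the logger.

```rust
/// Hypothetical error type; the real code works with the agent's error values.
#[derive(Debug)]
enum SignalError {
    NoSuchProcess,
    AgentConnectionClosed,
}

/// Signal the init process, then always run cleanup.
/// Returns the log lines it would have emitted so the behaviour is observable.
fn stop_process<S, C>(signal: S, cleanup: C) -> Vec<String>
where
    S: FnOnce() -> Result<(), SignalError>,
    C: FnOnce(),
{
    let mut log = Vec::new();
    match signal() {
        Ok(()) => {}
        // Init already gone: nothing left to kill, so this counts as success.
        Err(SignalError::NoSuchProcess) => {
            log.push("info: init process already exited".to_string());
        }
        // Any other failure is logged but must not abort teardown.
        Err(e) => log.push(format!("warn: signal failed: {:?}", e)),
    }
    cleanup(); // always reached, regardless of the signal outcome
    log
}
```

Because the function returns no error, a caller cannot accidentally skip the cleanup step with `?`.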
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Refine the error handling within the OOM watcher, using the helper
is_normal_shutdown_error to distinguish genuine failures from errors
that occur as a natural consequence of sandbox shutdown.
Previously, various connection-related errors during teardown were logged
as warnings, contributing to noisy logs.
The logic now differentiates between "normal shutdown" errors (e.g.,
Connection reset by peer, broken pipe) and actual OOM watcher failures.
This makes OOM event logs more informative and less cluttered during
normal sandbox termination.
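A rough sketch of the classification, assuming the check is a case-insensitive match on the error message. The two patterns are the ones named above; `LogLevel` and `oom_watcher_log_level` are hypothetical names used only for illustration.

```rust
#[derive(Debug, PartialEq)]
enum LogLevel {
    Info,
    Warn,
}

/// True when the message looks like a side effect of sandbox teardown.
fn is_normal_shutdown_error(msg: &str) -> bool {
    let msg = msg.to_lowercase();
    ["connection reset by peer", "broken pipe"]
        .iter()
        .any(|pattern| msg.contains(*pattern))
}

/// Pick the log level for an OOM watcher error (hypothetical helper).
fn oom_watcher_log_level(msg: &str) -> LogLevel {
    if is_normal_shutdown_error(msg) {
        LogLevel::Info
    } else {
        LogLevel::Warn
    }
}
```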
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This commit introduces an explicit cancellation mechanism for the OOM
watcher loop within VirtSandbox. This addresses the issue where the
watcher continues to poll for OOM events even when the sandbox is being
stopped, leading to spurious "Connection reset by peer" errors.
Key changes:
(1) A CancellationToken is added to VirtSandbox to signal the watcher
loop when the sandbox is undergoing teardown.
(2) The OOM watcher loop in VirtSandbox::start() is now wrapped in a
tokio::select! statement. This allows it to concurrently listen for
two events:
- cancel_token.cancelled(): Triggered when the sandbox/VM is stopping.
- agent.get_oom_event(): The regular OOM event polling.
(3) In the sandbox stop/teardown path, cancel_token.cancel() is called
before stopping the VM. This ensures the OOM watcher loop exits cleanly
via the cancellation token, preventing the occurrence of ECONNRESET/EOF
errors on a closed channel.
This change improves the robustness of OOM event handling during sandbox
lifecycle management.
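The control flow can be illustrated with a std-only analogue. Note the substitutions: an `AtomicBool` stands in for the CancellationToken, and a channel polled with a short timeout stands in for `tokio::select!` over agent.get_oom_event(). All names below are hypothetical; this is not the async code from the commit.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::sync::Arc;
use std::time::Duration;

/// Why the watcher loop ended.
#[derive(Debug, PartialEq)]
enum WatcherExit {
    Cancelled,
    ChannelClosed,
}

/// Poll for OOM events until cancelled or until the event source goes away.
fn watch_oom_events(
    cancel: Arc<AtomicBool>,
    events: Receiver<String>,
    mut on_oom: impl FnMut(String),
) -> WatcherExit {
    loop {
        // Teardown sets the flag before stopping the VM, so the loop
        // exits here instead of hitting an error on a closed channel.
        if cancel.load(Ordering::SeqCst) {
            return WatcherExit::Cancelled;
        }
        match events.recv_timeout(Duration::from_millis(10)) {
            Ok(container_id) => on_oom(container_id),
            Err(RecvTimeoutError::Timeout) => continue,
            Err(RecvTimeoutError::Disconnected) => return WatcherExit::ChannelClosed,
        }
    }
}
```

The ordering matters in both versions: the cancel signal must be raised before the event source is torn down, otherwise the watcher still observes the disconnect.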
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Returns `true` if the error indicates that the target process/container
no longer exists.
This is used to determine if an operation, like signaling a process,
failed because the target is no longer available. The function checks
for standard OS error codes (`ESRCH`, `ENOENT`) and common error message
patterns.
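A minimal sketch of such a check over `std::io::Error`, assuming Linux errno values. The real function may take a different error type, and the message pattern below is an assumption rather than the exact list it uses.

```rust
use std::io;

// Linux errno values, hard-coded to keep the sketch dependency-free.
const ENOENT: i32 = 2;
const ESRCH: i32 = 3;

/// True if the error says the target process/container no longer exists.
fn is_no_such_process_error(err: &io::Error) -> bool {
    // First choice: a structured OS error code.
    if matches!(err.raw_os_error(), Some(ESRCH) | Some(ENOENT)) {
        return true;
    }
    // Fallback: errors that crossed an RPC boundary often keep only the text.
    err.to_string().to_lowercase().contains("no such process")
}
```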
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Block volumes and block-mode device nodes were attached to the guest
read-write regardless of the volume's read-only intent, so the
guest-visible virtio-blk device was always writable.
This matters beyond simple write protection: filesystems such as XFS
inspect the block device read-only state to decide whether to attempt
journal/log recovery. When the device is writable, XFS tries to replay
the log even on a read-only mount, which fails badly. Mounting with
"-o ro" inside the guest is not sufficient; the device itself must
advertise read-only (VIRTIO_BLK_F_RO), which only happens when the VMM
opens the backing image read-only.
Set is_readonly on the block device config from two signals, combined
with OR so either one marks the device read-only:
- the read-only intent from the OCI spec:
* bind-mounted block volumes and direct-assigned (raw block)
volumes derive it from the "ro" mount option, and
* block-mode volumes (e.g. Kubernetes volumeDevices) arrive as
device nodes in spec.Linux.Devices with no mount option; their
intent is expressed only via the cgroup device access in
spec.Linux.Resources.Devices ("rm" = read+mknod, no write, for
read-only; "rwm" for read-write). handler_devices() derives the
flag from the matching cgroup allow rule, and
- the host block device's own read-only flag (queried via the BLKROGET
ioctl). Both the volume path (block_volume/rawblock_volume) and the
device-node path (handler_devices, resolving the host node via
get_host_path) honor it, so a device that is physically read-only on
the host is exposed read-only to the guest even when the intent is
not encoded in the OCI spec.
All in-tree hypervisors (qemu, cloud-hypervisor, dragonball) already
honor BlockConfig.is_readonly, so no hypervisor changes are required.
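The way the signals combine can be sketched as below. The function names are hypothetical, and the host flag is passed in as a plain bool instead of being read via the BLKROGET ioctl.

```rust
/// Read-only intent from a mount option list: true if "ro" is present.
fn ro_from_mount_options(options: &[&str]) -> bool {
    options.iter().any(|o| *o == "ro")
}

/// Read-only intent from a cgroup device access string:
/// readable but not writable, e.g. "rm" (read+mknod) versus "rwm".
fn ro_from_cgroup_access(access: &str) -> bool {
    access.contains('r') && !access.contains('w')
}

/// Either signal alone marks the device read-only.
fn is_readonly(spec_intent_ro: bool, host_device_ro: bool) -> bool {
    spec_intent_ro || host_device_ro
}
```

For example, a volume mounted read-write from a device that is read-only on the host still ends up read-only in the guest.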
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor
prepare_protection_device_config() called available_guest_protection()
unconditionally and propagated any error before the "confidential_guest
is not set" case was handled.
On AMD hosts where the kvm_amd `sev` module parameter is "Y" but the CPU
does not expose the SEV-SNP CPUID bit (8000_001f EAX[4]) -- e.g. consumer
Ryzen -- available_guest_protection() returns Err("SEV not supported"),
which blocked every non-confidential VM from booting even though no
protection was requested.
When confidential_guest is not set there is no reason to probe the host,
so return Ok(None) before calling available_guest_protection(). Detection
(and any error it produces) now runs only when a confidential guest is
actually requested.
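The reordering amounts to an early return. In this sketch the `GuestProtection` enum, the `String` error, and the `probe` closure are hypothetical simplifications of the real types and of available_guest_protection().

```rust
/// Hypothetical subset of the protection types the host may offer.
#[derive(Debug, PartialEq)]
enum GuestProtection {
    Snp,
    Tdx,
}

/// Probe the host only when a confidential guest was requested.
fn prepare_protection_device_config<F>(
    confidential_guest: bool,
    probe: F,
) -> Result<Option<GuestProtection>, String>
where
    F: FnOnce() -> Result<GuestProtection, String>,
{
    if !confidential_guest {
        // No protection requested: skip detection and any error it produces.
        return Ok(None);
    }
    probe().map(Some)
}
```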
Signed-off-by: nikolasgkou <nikolasgkou@disroot.org>
Fix BlockRootfs to save the queue_size, num_queues, logical_sector_size,
and physical_sector_size of the hypervisor's block device info in the
BlockConfig passed to the VM.
Fixes #13210
Signed-off-by: Gregory Ling <17791817+glingy@users.noreply.github.com>
Although the config knob is parsed, it is unused in the Rust shim,
which renders it useless. Remove the file_mem_backend config option,
as there are no current users of it.
The option is still usable in the Go shim, so we leave it intact there.
For the Rust shim, /dev/shm is still used in a similar way to
the Go shim when filesystem sharing is enabled (virtio-fs). Future
use cases that need other file memory backends are currently expected
to define them in a similar manner: determine the proper file memory
backend from the configuration/platform, but do not let end users
choose the file memory backend.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Introduce `ShareVirtioFsNydus` to enable standalone Nydus rootfs
support. This implementation acts as the bridge between runtime-rs
and the external `nydusd` daemon.
Key Capabilities:
(1) Trait Implementation: Implements `ShareFs` (for VM device/storage) and
`NydusShareFs` (for RAFS lifecycle) traits.
(2) Daemon Lifecycle Management: Handles `nydusd` spawning, supervision,
and graceful shutdown.
(3) Native Overlay Support: Configures `nydusd` with `passthrough_fs`
backend to provide native overlay (upperdir/workdir) support.
(4) API Integration: Utilizes `NydusClient` for granular control over RAFS
mount/umount operations.
(5) QEMU Integration: Enables `virtio-fs-nydus` device support,
facilitating standalone mode execution.
This implementation allows Kata containers to utilize an external `nydusd`
process for Nydus rootfs management, providing a cleaner separation between
the runtime and the Nydus daemon lifecycle.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Refactor the `ShareFs` trait to improve modularity and support
standalone Nydus mode:
(1) Added `stop()` method to manage daemon teardown.
(2) Introduced a dedicated trait for Nydus-specific data-plane
operations.
This refactoring cleans up the `ShareFs` trait by consolidating
daemon lifecycle handling and isolating Nydus-specific extensions,
paving the way for cleaner standalone Nydus implementation.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Implement NydusClient to interact with nydusd daemon via Unix
socket:
(1) check_status: query daemon state via GET /api/v1/daemon.
(2) mount/umount: manage filesystem mounts via POST/DELETE
/api/v1/mount.
(3) wait_until_ready: poll daemon until RUNNING state.
This provides a lightweight, stateless HTTP client layer for nydusd
API.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
As independent iothreads can work with both virtio-scsi and virtio-blk
devices, this commit enables the feature for virtio-blk-pci
devices.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
1. Determine the iothread for virtio-blk devices. Only attach an iothread
when:
(1) enable_iothreads is true
(2) indep_iothreads > 0
(3) the block driver is not virtio-scsi (i.e., it's
virtio-blk)
More complex cases will be handled by future enhancements.
2. Add the iothread parameter for virtio-blk devices if specified.
If an iothread is set and passed in, set it on the virtio-blk
device via the QMP device_add arguments.
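The three conditions and the optional argument can be sketched like this. The function names and the key/value representation of the device_add arguments are hypothetical; only the conditions and the `iothread` property come from the description above.

```rust
/// All three conditions must hold before an iothread is attached.
fn should_attach_iothread(
    enable_iothreads: bool,
    indep_iothreads: u32,
    block_driver: &str,
) -> bool {
    enable_iothreads && indep_iothreads > 0 && block_driver != "virtio-scsi"
}

/// Build device_add arguments, adding `iothread` only when one is given.
fn device_add_arguments(
    driver: &str,
    id: &str,
    iothread: Option<&str>,
) -> Vec<(String, String)> {
    let mut args = vec![
        ("driver".to_string(), driver.to_string()),
        ("id".to_string(), id.to_string()),
    ];
    if let Some(iothread_id) = iothread {
        args.push(("iothread".to_string(), iothread_id.to_string()));
    }
    args
}
```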
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Make independent IO threads work for virtio-blk devices.
Add a new method that sets up independent IO threads for virtio-blk
hotplug devices on the qemu command line.
Note that ObjectIoThread already exists, so it can be reused directly
in this case.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Setting indep_iothreads together with enable_iothreads is useful for
high IO performance, so we need to provide a way for people to
set it if needed.
This commit introduces two configurable items:
- Makefile: DEFINDEPIOTHREADS, set at build time (make).
- configurations: indep_iothreads for people to set.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
In #13147, for some reason a test block was added in the middle of code
and the code was stale when merged, which meant that a second
`mod test` section was added, breaking our tests. Merge the two
to fix this.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Switch qemu-se config templates to use the TEE/CoCo-specific
static_sandbox_resource_mgmt defaults instead of the generic
QEMU defaults.
qemu-se-runtime-rs config now uses DEFSTATICRESOURCEMGMT_COCO
while runtime qemu-se config now uses DEFSTATICRESOURCEMGMT_TEE.
This aligns static sandbox resource management behavior with confidential
container expectations for qemu-se variants.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
The dragonball nerdctl CI job can race when creating and attaching the
runtime process to the sandbox cgroup, surfacing an os error 17
(AlreadyExists) during shim task creation.
Let's retry add_proc once on this pre-existing cgroup condition so
startup remains robust.
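A minimal sketch of the single retry, assuming the cgroup call surfaces a `std::io::Error` (os error 17 is EEXIST, which maps to `ErrorKind::AlreadyExists` on Linux). The wrapper name and closure parameter are hypothetical.

```rust
use std::io;

/// Call `add_proc`, retrying exactly once if it reports AlreadyExists.
fn add_proc_with_retry<F>(mut add_proc: F) -> io::Result<()>
where
    F: FnMut() -> io::Result<()>,
{
    match add_proc() {
        // The cgroup appeared between check and create: try once more.
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => add_proc(),
        // Success or any other error is returned unchanged.
        other => other,
    }
}
```

A second AlreadyExists is returned to the caller, so a persistent problem is not hidden by the retry.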
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Codex <codex@openai.com>
Low-CPU sandboxes can take longer than a few seconds to complete guest
boot and start the agent.
Let's clamp the reconnect timeout to a safe minimum so sandbox startup
does not fail early with transient vsock ECONNRESET.
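The clamp itself is a one-liner. The 30-second floor below is a placeholder chosen for illustration, not the value used by the change.

```rust
use std::time::Duration;

// Hypothetical floor; the actual minimum is defined by the runtime.
const MIN_RECONNECT_TIMEOUT: Duration = Duration::from_secs(30);

/// Never use a reconnect timeout below the safe minimum.
fn effective_reconnect_timeout(configured: Duration) -> Duration {
    configured.max(MIN_RECONNECT_TIMEOUT)
}
```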
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Codex <codex@openai.com>
When static sandbox resource management is enabled, CRI CPU/memory
sizing may live only in sandbox annotations and be missing from the OCI
spec.
Let's fill missing sizing fields from annotations before applying static
VM sizing so runtime-rs follows the expected Kubernetes behavior for
constrained pods.
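The fill step only touches fields the OCI spec left unset. In this sketch the `Sizing` struct is a hypothetical simplification, and the annotation keys are assumed to be the containerd CRI sandbox sizing annotations; verify them against the CRI in use.

```rust
use std::collections::HashMap;

// Annotation keys assumed for this sketch.
const CPU_QUOTA: &str = "io.kubernetes.cri.sandbox-cpu-quota";
const CPU_PERIOD: &str = "io.kubernetes.cri.sandbox-cpu-period";
const MEMORY: &str = "io.kubernetes.cri.sandbox-memory";

#[derive(Debug, Default, PartialEq)]
struct Sizing {
    cpu_quota: Option<i64>,
    cpu_period: Option<u64>,
    memory_limit: Option<i64>,
}

/// Parse one annotation value, ignoring missing or malformed entries.
fn parse_annotation<T: std::str::FromStr>(
    ann: &HashMap<String, String>,
    key: &str,
) -> Option<T> {
    ann.get(key).and_then(|v| v.parse().ok())
}

/// Values already present in the spec win; annotations only fill gaps.
fn fill_missing_sizing(mut spec: Sizing, ann: &HashMap<String, String>) -> Sizing {
    spec.cpu_quota = spec.cpu_quota.or_else(|| parse_annotation(ann, CPU_QUOTA));
    spec.cpu_period = spec.cpu_period.or_else(|| parse_annotation(ann, CPU_PERIOD));
    spec.memory_limit = spec.memory_limit.or_else(|| parse_annotation(ann, MEMORY));
    spec
}
```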
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Codex <codex@openai.com>
Add top-level runtime-rs Makefile options `DEFSANDBOXCGROUP_ONLY` and
`DEFSTATICRESOURCEMGMT`, both defaulting to true, and use them for the
runtime defaults that previously disabled these paths.
This aligns runtime-rs defaults with static sandbox resource management,
which sizes sandbox memory up front instead of relying on memory hotplug,
helping avoid architecture-specific hotplug limitations.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When CreateContainer fails before the runtime instance is registered
(e.g. a hypervisor/cgroup error), no sandbox exists to drive the normal
teardown. containerd's follow-up Shutdown RPC then reaches
get_runtime_instance(), fails with "runtime not ready", and returns
before the service loop is ever told to stop. Because the shim ignores
SIGTERM, the containerd-shim-kata-v2 daemon is left running and orphaned.
Make the Shutdown RPC force the daemon to exit when there is no runtime
instance, emitting the same Action::Shutdown that sandbox.shutdown()
sends on the normal path. This guarantees the shim process is reaped
after a failed create instead of leaking.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <noreply@cursor.com>
QEMU maxcpus enables CPU hotplug capabilities, but it is unused when
confidential guest is enabled.
Change the runtime-rs code to skip setting maxcpus on the QEMU command
line if CPU hotplug is not needed.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Make enable_debug promote the effective component log level from the
default info level to debug for runtime, agent, and hypervisor logs.
Keep an explicit log_level value authoritative so users can still choose
trace, warn, or another level.
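The precedence can be sketched as follows, assuming an unset log_level is represented as `None`. The function name and the `Option` encoding are hypothetical.

```rust
/// Explicit log_level wins; otherwise enable_debug promotes info to debug.
fn effective_log_level(enable_debug: bool, log_level: Option<&str>) -> String {
    match log_level {
        Some(level) => level.to_string(),
        None if enable_debug => "debug".to_string(),
        None => "info".to_string(),
    }
}
```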
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
Run cargo fmt on runtime-rs to ensure consistent formatting
with Rust 1.94 toolchain.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a
PCIDEVICE_* environment variable; it does not add the VFIO char device
to linux.devices in the OCI spec. As a result the agent's
container_has_vfio_device() gate stays closed and
expose_guest_infiniband_devices() is never triggered — leaving
/dev/infiniband absent from the container even though the guest kernel
created the IB devices (mlx5_core.rdma.0 probes successfully).
The cold_plug_bdfs map (host_bdf → guest_pci_path, built from network
endpoints via host_bdf()) was already present inside handler_devices()
but could never be consumed because the LinuxDeviceType::C loop has
no entries to iterate over when linux.devices is empty.
After that loop, iterate over any unmatched cold-plug BDFs, derive the
VFIO group path via bdf_to_vfio_group_path() (reads
/sys/bus/pci/devices/<bdf>/iommu_group), and push a vfio-pci-gk
ContainerDevice. The vfio_group_to_bdf() short-circuit inside the
loop handles the case where the device plugin does add VFIO char
devices to linux.devices; it now supports both legacy (/dev/vfio/N)
and iommufd (/dev/vfio/devices/vfioN) path formats.
Add host_bdf() to the Endpoint trait (default: None) so that
PhysicalEndpoint can expose its BDF for the cold_plug_bdfs map.
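Recognising the two VFIO path formats mentioned above might look like this. It is a sketch only: the `VfioPath` enum and `parse_vfio_path` are hypothetical, and the real vfio_group_to_bdf() goes on to resolve a BDF, which is not shown.

```rust
#[derive(Debug, PartialEq)]
enum VfioPath {
    /// Legacy group node: /dev/vfio/N
    LegacyGroup(u32),
    /// iommufd device node: /dev/vfio/devices/vfioN
    IommufdDevice(u32),
}

/// Classify a VFIO char device path, or return None for anything else
/// (including the /dev/vfio/vfio control node).
fn parse_vfio_path(path: &str) -> Option<VfioPath> {
    if let Some(rest) = path.strip_prefix("/dev/vfio/devices/vfio") {
        return rest.parse::<u32>().ok().map(VfioPath::IommufdDevice);
    }
    path.strip_prefix("/dev/vfio/")?
        .parse::<u32>()
        .ok()
        .map(VfioPath::LegacyGroup)
}
```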
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
network.remove() — which detaches endpoints and rebinds VFs from
vfio-pci back to the host driver — was never being called.
ResourceManagerInner::cleanup() handled cgroups, bindmounts, share-fs,
swap and ephemeral disks, but completely omitted the network teardown.
Call network.remove() at the start of cleanup(), using the already-held
self.hypervisor reference. Errors are logged as warnings rather than
propagated, so they don't block the rest of the cleanup sequence.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
network_with_netns::remove() bailed out early when network_created=false
(i.e. the netns was created by the CNI, not by kata). This caused
physical endpoint VFs to remain bound to vfio-pci after pod deletion,
because PhysicalEndpoint::detach() — which calls bind_device_to_host()
to rebind the VF from vfio-pci back to mlx5_core — was never reached.
Separate endpoint detachment from netns deletion: always detach
endpoints, but only remove the netns if kata created it. Detach errors
are logged as warnings rather than propagated, to mirror the Go runtime's
best-effort approach and avoid blocking sandbox teardown.
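The separated flow can be sketched like this. The closures stand in for endpoint detach and netns deletion, and warnings are collected in a Vec instead of being logged; all names are hypothetical.

```rust
/// Always detach every endpoint; delete the netns only if we created it.
/// Detach failures become warnings and never stop the loop.
fn remove_network<D, R>(
    endpoints: &[&str],
    network_created: bool,
    mut detach: D,
    remove_netns: R,
) -> Result<Vec<String>, String>
where
    D: FnMut(&str) -> Result<(), String>,
    R: FnOnce() -> Result<(), String>,
{
    let mut warnings = Vec::new();
    for &ep in endpoints {
        if let Err(e) = detach(ep) {
            warnings.push(format!("failed to detach endpoint {}: {}", ep, e));
        }
    }
    if network_created {
        remove_netns()?;
    }
    Ok(warnings)
}
```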
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
The PCIe topology pre-computes a wrong path for cold-plugged physical-
endpoint VFs because the root port has no explicit addr and QEMU auto-
assigns its slot. The pre-computed PciPath { slots: [PciSlot(0)] }
resolves to 0000:00:00.0 (the Q35 MCH), causing
wait_for_pci_net_interface to time out looking for a netdev there.
Add resolve_vfio_device_pci_path(hostdev_id) to the Hypervisor trait.
Implement it in QemuInner using qmp.get_device_by_qdev_id(), which
queries QEMU's query-pci to find the full guest PCIe path (e.g. "05/00"
= slot 5 on pcie.0 / slot 0 on the root port bus).
Store the QEMU device ID (hostdev_id) in PhysicalEndpoint during
attach(). Add vfio_hostdev_id() and set_guest_pci_path() to the
Endpoint trait and add an endpoints() accessor to the Network trait.
In setup_after_start_vm(), call resolve_physical_endpoint_pci_paths()
before apply_network_to_agent() to populate the correct path from QMP
into each PhysicalEndpoint's guest_pci_path field. The field is then
consumed by network_with_netns::interfaces() to fill Interface.device_path
before update_interface is sent to the agent.
This is the runtime-rs counterpart of the Go runtime's
ResolveColdPlugVFIOGuestPciPaths / qomGetPciPath.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Without device_path the agent receives Interface.device_path="" in
update_interface, falls back to a by-MAC link lookup, and fails for
SR-IOV VFs whose firmware MAC differs from the CNI-assigned MAC after
the vfio-pci unbind/rebind cycle.
The guest PCI path is computed at attach() time by do_add_pcie_endpoint()
inside VfioDevice::register() — no QMP query is needed. Cache it in
PhysicalEndpoint.guest_pci_path (Mutex<Option<String>>) during attach()
when do_handle_device() returns the DeviceType::Vfio with the path
already filled in.
Add a default-None guest_pci_path() method to the Endpoint trait;
PhysicalEndpoint overrides it to return the cached path. In
network_with_netns.rs::interfaces(), after building each Interface from
network_info, fill device_path from endpoint.guest_pci_path() when the
field would otherwise be empty.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Without an admin MAC the guest mlx5_core inherits whatever firmware-
default MAC the VF was created with. This MAC differs from the IB port
HCA MAC, so mlx5_ib's GID cache refuses to populate
/sys/class/infiniband/mlx5_*/ports/N/gids/*. RoCE appears active but
every verb needing a GID fails.
Before bind_device_to_vfio(), push the CNI-assigned MAC down to the VF
as an "admin MAC" via the parent PF using RTM_SETLINK with
IFLA_VFINFO_LIST — the netlink equivalent of
ip link set <PF> vf <N> mac <MAC>
The operation runs in a spawn_blocking closure that enters the host
network namespace (via NetnsGuard("/proc/1/ns/net")), since attach() is
called while the thread is inside the pod netns.
Best-effort: failures are logged at warn and the existing agent-side MAC
reconciliation (update_interface in rpc.rs) remains as a fallback for
L2/L3 connectivity.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
DeviceType::Vfio (used by physical network VFs) was silently dropped
in start_vm()'s cold-plug loop, falling through to the unsupported-
device info log. The VF never appeared on the QEMU command line and
therefore never became visible inside the guest.
Add handling for DeviceType::Vfio in the start_vm() cold-plug loop.
For each HostDevice in the VfioDevice, emit:
-device vfio-pci,host=<bdf>,id=<hostdev_id>,bus=<root-port>, \
[x-pci-vendor-id=...,x-pci-device-id=...]
The bus assignment and guest PCI path are already computed by
do_add_pcie_endpoint() at VfioDevice::register() time (called from
VfioDevice::attach() via the PCIe topology), so no additional QMP
resolution is needed here.
Add id= support to PCIeVfioDevice so the QEMU device name is stable
and matchable in QMP queries. Add new_without_iommufd() constructor
for the non-IOMMUFD (legacy VFIO container) path used by physical
endpoints, and add_physical_vfio_device() to QemuCmdLine as a
direct emission helper.
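The shape of the emitted -device value can be sketched as below. The helper name and parameter list are hypothetical and do not reflect the actual PCIeVfioDevice or add_physical_vfio_device() signatures; only the option layout follows the line shown above.

```rust
/// Build the value passed to `-device` for one cold-plugged VF.
/// `ids` carries the optional (vendor, device) override pair.
fn vfio_pci_device_arg(
    bdf: &str,
    hostdev_id: &str,
    bus: &str,
    ids: Option<(&str, &str)>,
) -> String {
    let mut arg = format!("vfio-pci,host={},id={},bus={}", bdf, hostdev_id, bus);
    if let Some((vendor, device)) = ids {
        arg.push_str(&format!(
            ",x-pci-vendor-id={},x-pci-device-id={}",
            vendor, device
        ));
    }
    arg
}
```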
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Create configuration-clh-azure{,-runtime-rs}.toml from the base CLH
configs during build.
This keeps Mariner-specific defaults in explicit config artifacts
instead of ad-hoc runtime mutation.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Treat the containerd erofs snapshotter active snapshot as an EROFS
lower plus overlay metadata, with an optional ext4 rwlayer when host
rw backing is enabled. This also covers default_size=0, where
containerd sends no rwlayer and the agent provides the writable upper
inside the guest.
Forward overlay mkdir hints on the EROFS storage so the guest agent
sees them in both layouts, and add unit coverage for the dispatcher
patterns.
Assisted-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
TDX QGS takes the raw TD report from QEMU/guest VM and signs it in an SGX
enclave. Historically, QGS has supported two transports: vsock and
unix-domain-socket. The former was necessary before the guest kernel
supported the GetQuote "TDVMCALL" hypercall: DCAP library inside the
guest used vsock to talk to QGS directly.
However, with GetQuote, QEMU gets the TDREPORT and sends it to QGS.
In process-to-process communication, unix-domain-socket is a better
approach. This is also the only transport supported by libvirt by default.
With that, align the Kata default configuration to use unix-domain-socket
as well. The change impacts the QEMU command line:
old:
"quote-generation-socket":{"type":"vsock","cid":"2","port":"4050"}
new:
"quote-generation-socket":{"type":"unix","path":"/var/run/tdx-qgs/qgs.socket"}
Host QGS configuration must be changed to listen on a unix-domain-socket.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Changing Kata runtime configurations to use TDX QGS port=0 (unix domain
socket transport) means cluster admins must also reconfigure qgsd to
the same and have /var/run/tdx-qgs/qgs.sock available.
Since the early days of TDX attestation in Kata, the configuration has used
vsock with cid=2, port=4050. To avoid unnecessary breakage when the Kata default
moves to unix domain socket, fall back to the old configuration if
/var/run/tdx-qgs/qgs.sock is not available on the worker node.
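The fallback decision can be sketched as follows. The enum and function are hypothetical, the socket path is a parameter, and a plain existence check is assumed; the real code may test availability differently.

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum QgsTransport {
    Unix { path: String },
    Vsock { cid: u32, port: u32 },
}

/// Prefer the unix domain socket; fall back to the legacy vsock settings
/// (cid=2, port=4050) when the socket is not present on the node.
fn select_qgs_transport(socket_path: &str) -> QgsTransport {
    if Path::new(socket_path).exists() {
        QgsTransport::Unix { path: socket_path.to_string() }
    } else {
        QgsTransport::Vsock { cid: 2, port: 4050 }
    }
}
```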
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
When multiple EROFS layers are present, wrap them into a single
GPT-partitioned virtual disk delivered via one VMDK descriptor and a
single block device hotplug. This significantly reduces PCI bus slot
usage compared with the previous one-device-per-layer approach, which
exhausts virtio-blk slots for large layer counts.
The host detects multi-layer mounts, computes the GPT layout, generates
head metadata plus a VMDK descriptor referencing all EROFS images, and
hot-plugs the composite disk. Per-partition Storage entries are created
with X-kata.gpt-partitioned and X-kata.partition-number options so the
guest agent can resolve each layer to its partition device.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>