Commit Graph

1218 Commits

Author SHA1 Message Date
Aurélien Bombo
e4fbddb91a ci: rename cloud-hypervisor to clh-runtime-rs
This aligns on qemu-runtime-rs and makes more sense.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-04-28 10:58:01 -05:00
Fabiano Fidêncio
19bb8746f8 runtime-rs: rescan network at Start RPC for Docker 26+
Docker 26+ configures the container's veth pair between the Create
and Start RPCs by bind-mounting `/proc/<vmm_pid>/ns/net`.  The Rust
shim's network scan during sandbox creation finds no interfaces
because they don't exist yet.

The Go shim (commit f7878cc) solves this with `detectHypervisorNetns`
inside `addAllEndpoints`: when the placeholder netns is empty, it
switches to the hypervisor's network namespace and rescans there.

Port this approach to the Rust shim:

- Add `rescan_network()` to the `Sandbox` trait
- Implement it on `VirtSandbox`: build a rescan config that always
  targets the hypervisor's netns (`/proc/<vmm_pid>/ns/net`),
  bypassing the placeholder netns and the `network_created` flag
- Call `sandbox.rescan_network()` synchronously in the `StartProcess`
  handler, before `cm.start_process()`, so interfaces are wired
  before the container process runs

Fixes: #9340

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-28 10:20:18 +02:00
Fabiano Fidêncio
0da2f00488 runtime-rs: resource: add network rescan polling for Docker 26+
Docker 26+ configures veth pairs in the hypervisor's network
namespace between the Create and Start RPCs.  The initial network
scan during sandbox creation finds no interfaces because they do
not exist yet.

Add `rescan_network_if_unconfigured` which polls the network
namespace (50ms intervals, 5s timeout) until interfaces appear,
then pushes the configuration to the guest agent.  This mirrors
the Go runtime's `RescanNetwork` (commit f7878cc).

Supporting changes:
- Derive `Clone` on `NetworkWithNetNsConfig` so it can be reused
  across poll iterations
- Add `tokio/time` feature to the resource crate
- Add `apply_network_to_agent` helper to push interfaces, routes,
  and neighbors to the guest

Fixes: #9340

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-28 10:20:18 +02:00
Fabiano Fidêncio
67679ddd15 runtime-rs: detect Docker 26+ netns from hook args and filter /proc/0/
Docker 26+ with `runtimeType` may not publish the network namespace
in `linux.namespaces` at create time.  Instead, the netns path can
be discovered from `libnetwork-setkey` hook arguments.

Additionally, filter out the invalid `/proc/0/ns/net` placeholder
that appears when the task PID is not yet known.

This mirrors the Go runtime's `DockerNetnsPath` fallback logic.

Fixes: #9340

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-28 10:20:18 +02:00
Fabiano Fidêncio
3ad2de584f runtime-rs: return hypervisor PID from container manager methods
Docker and containerd use the PID returned by the shim to construct
`/proc/<pid>/ns/net` for network namespace operations.  The Rust shim
was returning the shim's own PID instead of the hypervisor's PID,
which meant Docker would look at the wrong network namespace.

Update `create_container`, `start_process`, `state_process`, `pid`,
and `connect_container` to return the VMM master thread/process ID
(`vmm_master_tid`) instead of `self.pid`.  For QEMU this is the QEMU
process PID; for Dragonball this is the VMM thread ID — both are
valid for `/proc/<id>/ns/net` on Linux.

Fixes: #9340

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-28 10:20:18 +02:00
Fabiano Fidêncio
b1393f03c4 runtime-rs: fix ConnectResponse to set both shim_pid and task_pid
The containerd runtime v2 `shimTask.Create()` discards the
`CreateTaskResponse.Pid` and instead retrieves the task PID by
calling the shim's Connect RPC, reading `ConnectResponse.task_pid`.

The Rust shim only set `shim_pid` in the ConnectResponse, leaving
`task_pid` at its default zero value.  This caused Docker to call
`sb.SetKey("/proc/0/ns/net", ...)` which fails with "no such file
or directory".

Set `shim_pid` to the actual shim process ID and `task_pid` to the
hypervisor PID (vmm_master_tid), matching the Go shim's Connect
handler behavior.

Fixes: #9340

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-28 10:20:18 +02:00
Steve Horsman
d5785b4eba Merge pull request #12872 from stevenhorsman/bump-rust-to-1.93
Bump rust to 1.93
2026-04-27 09:01:00 +01:00
Fabiano Fidêncio
b3ed669d16 Merge pull request #12913 from pmores/fix-exec
runtime-rs: fix exec when selinux is disabled on guest
2026-04-25 17:34:46 +02:00
stevenhorsman
1dbfd4b7f4 runtime-rs: Fix clippy warnings for Rust 1.93
- Replace is_ok() check followed by unwrap_err() with if let Err pattern
- Replace .err().expect() with .expect_err()
- Replace is_some() check followed by unwrap() with if let Some pattern

These changes address clippy::unnecessary_unwrap and clippy::err_expect
warnings in Rust 1.93.

Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-25 11:27:39 +01:00
Pavel Mores
d3f56cd3a6 runtime-rs: remove process selinux label on exec if disable_guest_selinux
Without this commit any attempt to exec a command in a container will fail
if SELinux is disabled in the guest but an SELinux label is given for
the new process.  That will happen pretty much any time SELinux is enabled
on the host (and the container is not privileged).

Signed-off-by: Pavel Mores <pmores@redhat.com>
2026-04-25 11:27:15 +01:00
Pavel Mores
1390ad650b runtime-rs: factor getting disable_guest_linux value out to own function
We'll need to get the `disable_guest_linux` value in the exec handler, too.
This will allow us to avoid duplicating the get.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2026-04-25 11:27:15 +01:00
Fabiano Fidêncio
8c3a0e692b runtime-rs: network: handle "device" type interfaces (mlx5 SFs)
Same fix as the Go runtime: interfaces whose drivers do not register
a specific netlink kind (e.g. mlx5 Scalable Functions) are reported
with the generic type "device", which is not handled by the endpoint
creation match, causing sandbox creation to fail with:

  "unsupported link type: device"

Add "device" as an alternative pattern alongside "veth" so these
interfaces are connected through a TAP + TC-filter bridge.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-25 12:26:20 +02:00
Fabiano Fidêncio
e0927e0e0c Merge pull request #12846 from RainaYL/rainax/split_irqchip_pr
dragonball: Implement userspace IOAPIC to enable split irqchip
2026-04-24 19:07:45 +02:00
Fabiano Fidêncio
12bb497ce2 runtime-rs: Set QEMU as the default hypervisor
Dragonball is only supported on x86_64 and aarch64, so using it as the
default hypervisor means architectures like s390x, powerpc64le, and
riscv64gc have no working default. Switch to QEMU, which is available
across all supported architectures.

Dragonball is still compiled as a feature on x86_64 and aarch64 via
USE_BUILTIN_DB, and users can still override the default with
HYPERVISOR=dragonball.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-24 09:42:10 +02:00
Xiaofan Xxf
fd39117a21 dragonball: Implement userspace IOAPIC to enable split irqchip
From Linux 6.14, creating a TDX VM requires that split irqchip is
enabled. Under this circumstance, device IOAPIC would be managed
in userspace, instead of KVM, so a manager is needed to handle
MMIO read/write to emulated IOAPIC registers.
Also, with split irqchip, irqfd is no longer able to trigger an
interrupt after device IO is completed. Instead, KVM_SIGNAL_MSI
is used for interrupt triggering.

Note that only legacy irq with edge-triggered interrupt is
implemented here. And split irqchip feature is only enabled
when confidential VM type is set to TDX.

Signed-off-by: Xiaofan Xxf <xiaofan.xxf@antgroup.com>
2026-04-24 10:33:05 +08:00
Saul Paredes
ed44b363ba runtime-rs: ch: disable nested vCPUs on MSHV
This is a runtime-rs port for 7973e4e2a8

The recently-added nested property is true by default, but is not
supported yet on MSHV.

See https://github.com/cloud-hypervisor/cloud-hypervisor/pull/7408 for additional information.

Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
2026-04-23 21:04:53 -05:00
Fupan Li
18378145d2 Merge pull request #12821 from fidencio/topic/runtime-rs-cpu-pinning
runtime-rs: Add vCPU thread pinning support
2026-04-23 16:49:18 +08:00
Aurélien Bombo
206c1d3be8 Merge pull request #12889 from fidencio/topic/ch-config
hypervisor: Enable cloud-hypervisor feature by default
2026-04-21 11:04:31 -05:00
Fabiano Fidêncio
48669a894e runtime-rs: Add vCPU thread pinning support
Port the Go runtime's enable_vcpus_pinning feature to runtime-rs.

The Go runtime already lets users pin each vCPU thread to a specific
host CPU when the vCPU count matches the sandbox cpuset size, using
sched_setaffinity. This is useful for latency-sensitive workloads that
benefit from eliminating cross-CPU migration of vCPU threads.

The approach mirrors the Go implementation:

After VM start and on every container add/update/delete, we fetch the
vCPU thread IDs (via QMP query-cpus-fast for QEMU), compute the union of
all containers' OCI cpusets, and if the two counts match, pin vCPU i to
cpuset[i]. If they diverge (hotplug, container removal, etc.) we reset
all threads back to the full cpuset so nothing gets stuck on a single
core.

The pinning check lives in CgroupsResourceInner::update_sandbox_cgroups,
which already runs at exactly the right points in the lifecycle. The
enable_vcpus_pinning flag flows from the TOML config through
CgroupConfig into the cgroup resource layer, and can also be overridden
per-pod via the io.katacontainers.config.runtime.enable_vcpus_pinning
annotation.

The QEMU config templates default to false. The NV GPU configs will get
their own default (true) in a follow-up once those templates are added.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-21 12:45:56 +02:00
Fabiano Fidêncio
2bfa94b7cb hypervisor: Enable cloud-hypervisor feature by default
The cloud-hypervisor feature has been fully functional for some time
now: it's enabled by default in virt_container, used by agent-ctl,
and exercised in CI.  Drop the stale comments referencing issue #6264
and promote the feature to a default.

Fixes: #6264

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-21 11:26:12 +02:00
Aurélien Bombo
3cf9581fbe runtime-rs/ch: Fix errors on pod deletion
* get_rootless_symlink_sandbox_path() would get without first checking for
   is_rootless(), meaning cleanup() would ALWAYS fail (see below error), even
   though the shim/CH would NOT leak thanks to containerd's recovery routine.

 * Cleanup wouldn't be idempotent (in case the CRI issues multiple shutdown requests).
   This was fixed by introducing remove_dir_all_if_exists().

   Apr 17 17:53:21 containerd[4078033]: time="2026-04-17T17:53:21.821624475-05:00" level=error msg="failed to shutdown shim task and the shim might be leaked" error="Others(\"failed to handle message handler TaskRequest\\n\\nCaused by:\\n    0: do shutdown\\n    1: do the clean up\\n    2: delete hypervisor\\n    3: No such file or directory (os error 2)\\n\\nStack backtrace:\\n   0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from\\n   1: <hypervisor::ch::CloudHypervisor as hypervisor::Hypervisor>::cleanup::{{closure}}\\n   2: <virt_container::sandbox::VirtSandbox as common::sandbox::Sandbox>::cleanup::{{closure}}\\n   3: <virt_container::sandbox::VirtSandbox as common::sandbox::Sandbox>::shutdown::{{closure}}\\n   4: runtimes::manager::RuntimeHandlerManager::handler_task_message::{{closure}}::{{closure}}\\n   5: runtimes::manager::RuntimeHandlerManager::handler_task_message::{{closure}}\\n   6: <service::task_service::TaskService as containerd_shim_protos::shim::shim_ttrpc_async::Task>::shutdown::{{closure}}\\n   7: <containerd_shim_protos::shim::shim_ttrpc_async::ShutdownMethod as ttrpc::asynchronous::utils::MethodHandler>::handler::{{closure}}\\n   8: ttrpc::asynchronous::server::HandlerContext::handle_msg::{{closure}}\\n   9: <core::future::poll_fn::PollFn<F> as core::future::future::Future>::poll\\n  10: <ttrpc::asynchronous::server::ServerReader as ttrpc::asynchronous::connection::ReaderDelegate>::handle_msg::{{closure}}::{{closure}}\\n  11: tokio::runtime::task::core::Core<T,S>::poll\\n  12: tokio::runtime::task::harness::Harness<T,S>::poll\\n  13: tokio::runtime::scheduler::multi_thread::worker::Context::run_task\\n  14: tokio::runtime::scheduler::multi_thread::worker::Context::run\\n  15: tokio::runtime::context::scoped::Scoped<T>::set\\n  16: tokio::runtime::context::runtime::enter_runtime\\n  17: tokio::runtime::scheduler::multi_thread::worker::run\\n  18: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll\\n  19: tokio::runtime::task::core::Core<T,S>::poll\\n  20: tokio::runtime::task::harness::Harness<T,S>::poll\\n  21: tokio::runtime::blocking::pool::Inner::run\\n  22: std::sys::backtrace::__rust_begin_short_backtrace\\n  23: core::ops::function::FnOnce::call_once{{vtable.shim}}\\n  24: std::sys::thread::unix::Thread::new::thread_start\\n  25: <unknown>\\n  26: <unknown>\")" id=fca6a162b8f0ed7ef2b33cd99b6f1b58124e85c5489c193ceac487db0e4acdde

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-04-20 15:36:18 -05:00
Aurélien Bombo
93bd2899fb runtime-rs/ch: Fix hang on pod deletion
This serializes CH API calls to avoid a race condition where deleting a pod
would hang indefinitely and leak both the shim and CH processes.

The race happened because the CRI can send multiple shutdown requests for the
same pod, however the CH socket wasn't guarded against concurrent usage, hence
it was possible that HTTP responses would interleave (see below) on the
shutdown path, leading to an error.

This would repro in <15 iterations (sometime 2-3) using a 2-container pod.
With this commit, I haven't observed a repro in 200+ iterations.

Fixes: #12858

ORIGINAL REPRO:

while true; do
  kubectl apply -f busybox.yaml
  kubectl wait --for=condition=ready po busybox
  kubectl exec busybox -- echo foo
  kubectl delete po busybox
done

ORIGINAL ERROR:

 Apr 17 20:15:54 kata[2297383]: Failed to stop process, process = ContainerProcess { container_id: ContainerID { container_id: "d4eb8984d630111bbf808c7ea30b7a21274c0193cdb8d501d20e4f26a0a69151" }, exec_id: "", process_type: Container }, err = failed to update_mem_resource

                               Caused by:
                                   0: resize memory
                                   1: get vminfo
                                   2: failed to serde {"config":{"cpus":{"boot_vcpus":1,"max_vcpus":32,"topology":{"threads_per_core":1,"cores_per_die":32,"dies_per_package":1,"packages":1},"kvm_hyperv":false,"max_phys_bits":46,"affinity":null,"features":{"amx":false},"nested":null},"memory":{"size":2147483648,"mergeable":false,"hotplug_method":"Acpi","hotplug_size":132024107008,"hotplugged_size":null,"shared":true,"hugepages":false,"hugepage_size":null,"prefault":false,"zones":null,"thp":true},"payload":{"firmware":null,"kernel":"/usr/share/cloud-hypervisor/vmlinux.bin","cmdline":"reboot=k panic=1 systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service agent.log_vport=1025 console=ttyS0,115200n8 root=/dev/vda1 rootflags=data=ordered,errors=remount-ro ro rootfstype=ext4 no_timer_check noreplace-smp systemd.log_target=console agent.container_pipe_size=1 agent.log=debug cgroup_no_v1=all systemd.unified_cgroup_hierarchy=1","initramfs":null},"rate_limit_groups":null,"disks":[{"path":"/usr/share/kata-containers/kata-containers.img","readonly":true,"direct":false,"iommu":false,"num_queues":1,"queue_size":128,"vhost_user":false,"vhost_socket":null,"rate_limit_group":null,"rate_limiter_config":null,"id":"_disk0","disable_io_uring":false,"disable_aio":false,"pci_segment":0,"serial":null,"queue_affinity":null,"backing_files":false}],"net":[{"tap":null,"ip":"192.168.249.1","mask":"255.255.255.0","mac":"9e:7e:13:ee:03:5c","host_mac":null,"mtu":null,"iommu":false,"num_queues":2,"queue_size":256,"vhost_user":false,"vhost_socket":null,"vhost_mode":"Client","id":"_net1","fds":[-1],"rate_limiter_config":null,"pci_segment":0,"offload_tso":true,"offload_ufo":true,"offload_csum":true}],"rng":{"src":"/dev/urandom","iommu":false},"balloon":null,"fs":[{"tag":"kataShared","socket":"/run/kata/e1ae0a05f575a13a535aa95a9990d1fded4766a759f76be0e528c7912d3a5e39/root/virtiofsd.sock","num_queues":1,"queue_size":1024,"id":"_fs2","pci_segment":0}],"pmem":null:"/run/kata/e1ae0a05f575a13a535aa95a9990d1fded4766a759f76be0e528c7912d3a5e39/ch-vm.sock","iommu":false,"id":"_vsock3","pci_segment":0},"pvpanic":false,"iommu":false,"numa":null,"watchdog":false,"pci_segments":null,"platform":null,"tpm":null,"landlock_enabl"index":0,"base":3891789824,"size":524288,"type_":"Mmio32","prefetchable":false}}],"parent":null,"children":["_disk0"],"pci_bdf":"0000:00:01.0"},"_virtio-pci-_vsock3":{"id":"_virtio-pci-_vsock3","resources":[{"PciBar":{"index":0,"base":70367622201344,"sizee":false}}],"parent":null,"children":["_fs2"],"pci_bdf":"0000:00:04.0"},"_vsock3":{"id":"_vsock3","resources":[],"parent":"_virtio-pci-_vsock3","children":[],"pci_bdf":null},"_net1":{"id":"_net1","resources":[],"parent":"_virtio-pci-_net1","children":[],"presources":[{"PciBar":{"index":0,"base":70367623774208,"size":524288,"type_":"Mmio64","prefetchable":false}}],"parent":null,"children":["_net1"],"pci_bdf":"0000:00:02.0"},"_virtio-pci-__rng":{"id":"_virtio-pci-__rng","resources":[{"PciBar":{"index":0,"baseesources":[],"parent":null,"children":[],"pci_bdf":null}}}HTTP/1.1 200
                                      Server: Cloud Hypervisor API
                                      Connection: keep-alive
                                      Content-Type: application/json
                                      Content-Length: 4285

                                      {"config":{"cpus":{"boot_vcpus":1,"max_vcpus":32,"topology":{"threads_per_core":1,"cores_per_die":32,"dies_per_package":1,"packagesepage_size":null,"prefault":false,"zones":null,"thp":true},"payload":{"firmware":null,"kernel":"/usr/share/cloud-hypervisor/vmlinux.bin","cmdline":"reboot=k panic=1 systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service agent.log_vport=1025 console=ttyS0,115200n8 root=/dev/vda1 rootflags=data=ordered,errors=remount-ro ro rootfstype=ext4 no_timer_check noreplace-smp systemd.log_target=console agent.container_pipe_size=1 agent.log=debug cgroup_no_v1=all systemd.unified_cgroup_hierarchy=1","miter_config":null,"id":"_disk0","disable_io_uring":false,"disable_aio":false,"pci_segment":0,"serial":null,"queue_affinity":null,"backing_files":false}],"net":[{"tap":null,"ip":"192.168.249.1","mask":"255.255.255.0","mac":"9e:7e:13:ee:03:5c","host_mac":nu,"serial":{"file":null,"mode":"Tty","iommu":false,"socket":null},"console":{"file":null,"mode":"Off","iommu":false,"socket":null},"debug_console":{"file":null,"mode":"Off","iobase":233},"devices":[],"user_devices":null,"vdpa":null,"vsock":{"cid":3,"socket"
                                   3: expected `,` or `}` at line 1 column 1924

                               Stack backtrace:
                                  0: <E as anyhow::context::ext::StdError>::ext_context
                                  1: anyhow::context::<impl anyhow::Context<T,E> for core::result::Result<T,E>>::with_context
                                  2: <hypervisor::ch::CloudHypervisor as hypervisor::Hypervisor>::resize_memory::{{closure}}
                                  3: resource::manager_inner::ResourceManagerInner::update_linux_resource::{{closure}}
                                  4: virt_container::container_manager::container::Container::stop_process::{{closure}}
                                  5: virt_container::container_manager::process::Process::run_io_wait::{{closure}}::{{closure}}
                                  6: tokio::runtime::task::core::Core<T,S>::poll
                                  7: tokio::runtime::task::harness::Harness<T,S>::poll
                                  8: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
                                  9: tokio::runtime::scheduler::multi_thread::worker::Context::run
                                 10: tokio::runtime::context::scoped::Scoped<T>::set
                                 11: tokio::runtime::context::runtime::enter_runtime
                                 12: tokio::runtime::scheduler::multi_thread::worker::run
                                 13: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
                                 14: tokio::runtime::task::core::Core<T,S>::poll
                                 15: tokio::runtime::task::harness::Harness<T,S>::poll
                                 16: tokio::runtime::blocking::pool::Inner::run
                                 17: std::sys::backtrace::__rust_begin_short_backtrace
                                 18: core::ops::function::FnOnce::call_once{{vtable.shim}}
                                 19: std::sys::thread::unix::Thread::new::thread_start
                                 20: <unknown>
                                 21: <unknown>

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-04-20 15:36:00 -05:00
Fupan Li
2629df2785 Merge pull request #12763 from Apokleos/fsmerged-erofs-rs
runtime-rs: support erofs snapshotter with Fsmerge enabled
2026-04-20 11:54:19 +08:00
Alex Lyn
e975b3158b Merge pull request #12837 from stevenhorsman/rand-bump-GHSA-cq8v-f236-94qc
versions: Bump rand crate where possible
2026-04-20 10:05:19 +08:00
Alex Lyn
be47c2e932 runtime-rs: Avoid share-rw on readonly virtio-scsi/blk devices
Hotplugging a readonly block device could fail with:

  Block node is read-only

The backend block node was created readonly, but the virtio-scsi/blk
frontend path still forced share-rw=true. This is unnecessary and can
cause QEMU to reject the attach because the frontend configuration
does not match the readonly backend.

Fix the virtio-scsi/blk hotplug path by:
- setting read-only for readonly devices where supported
- skipping share-rw for readonly devices

Readonly handling remains in the backend block node configuration,
while the frontend keeps normal disk semantics for block devices.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-19 13:24:31 +02:00
Alex Lyn
02f975f88b runtime-rs: Enforce read-only and shared access for RO block devices
Explicitly configure `read_only` and `force_share` for readonly block
devices to ensure consistency between the image's read-only state and
QEMU's  access mode.

Motivation:
Previously, EROFS images were being accessed in a way that triggered
QEMU's exclusive locking (e.g., the 'resize' lock), even when the images
were intended to be read-only. This conflicted with external processes
(e.g., containerd snapshotter) that held read-only handles, resulting in
"Failed to get shared 'resize' lock" errors during blockdev-add.

Changes:
- Set `read_only=true` and `force_share=true` on both format and file
  nodes for VMDK descriptors and Raw images.
- This ensures QEMU requests shared locks, correctly matching the
  read-only nature of EROFS filesystems and preventing write-mode
  locking conflicts with concurrent processes.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-19 13:24:31 +02:00
Alex Lyn
526126904e runtime-rs: Add support for handling vmdk hotplugging with scsi
We should also support virtio-scsi driver for handling vmdk format
block device, and this will help address more cases.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-19 13:24:31 +02:00
Alex Lyn
d8db044c63 runtime-rs: Add erofs rootfs handling logic in handler_rootfs
Add handling for multi-layer EROFS rootfs in RootFsResource
handler_rootfs method. It will correctly handle the multi-layers
erofs rootfs.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-19 13:24:31 +02:00
Alex Lyn
8d7051436a runtime-rs: Add support for erofs rootfs with multi-layer
Add erofs_rootfs.rs implementing ErofsMultiLayerRootfs for
multi-layer EROFS rootfs with VMDK descriptor generation.

It's the core implementation of Erofs rootfs within runtime.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-19 13:24:31 +02:00
Alex Lyn
cb706219ae runtime-rs: Change Rootfs::get_storage return type
Change Rootfs::get_storage to return Option<Vec<Storage>>
to support multi-layer rootfs with multiple storages.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-18 22:46:33 +02:00
Alex Lyn
c06bc388c2 runtime-rs: Add format argument to hotplug_block_device method
Add format argument to hotplug_block_device for flexibly specifying
different block formats.
With this, we can support kinds of formats, currently raw and vmdk are
supported, and some other formats will be supported in future.

Aside the formats, the corresponding handling logics are also required
to properly handle its options needed in QMP blockdev-add.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-18 22:46:33 +02:00
Alex Lyn
15740439eb runtime-rs: Add BlockDeviceFormat enum to support more block formats
In practice, we need more kinds of block formats, not limited to `Raw`.

This commit aims to add BlockDeviceFormat enum for kinds of block device
formats support, like RAW, VMDK, etc. And it will do some following actions
to make this changes work well, including format field in BlockConfig.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-18 19:00:44 +02:00
Alex Lyn
8ed4fa1406 runtime-rs: Add RUNTIME_ALLOW_MOUNTS to RuntimeInfo
Add RUNTIME_ALLOW_MOUNTS annotation to RuntimeInfo to specify
custom mount types allowed by the runtime.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-18 19:00:44 +02:00
Fabiano Fidêncio
d04bb98e09 runtime-rs: Increase reconnect_timeout_ms for confidential VMs
The Go runtime's CoCo dev config uses dial_timeout = 45s, but all
runtime-rs confidential VM configs had reconnect_timeout_ms set to
3000ms (3s) or 5000ms (SE). This is too short for confidential VMs,
especially on arm64 where UEFI firmware (AAVMF) adds significant
boot time on top of the measured boot process, causing ECONNRESET
errors on the vsock connection before the agent is ready.

Bump reconnect_timeout_ms to 45000ms across all confidential VM
configs (coco-dev, SNP, TDX, SE) to match the Go runtime.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-18 00:48:13 +02:00
Saul Paredes
6f6e45522e Merge pull request #11562 from Apokleos/clh-initdata
runtime-rs: Add CoCo/protected device for initdata within runtime-rs/Cloud Hypervisor
2026-04-17 11:09:19 -07:00
Fabiano Fidêncio
690f5a2b62 Merge pull request #12862 from fidencio/topic/runtime-rs-enable-measured-rootfs-tests
runtime-rs: enable measured rootfs for qemu-coco-dev-runtime-rs
2026-04-17 18:48:47 +02:00
stevenhorsman
35be1a938d versions: Bump rand crate where possible
Update all versions of rand that are controlled by us to remediate
GHSA-cq8v-f236-94qc.

Note: There are still some usages of rand 0.8.5 it that are from
transitive dependencies which we can't currently update:
- fail
- phf_generator
- opentelemetry
due to them being archived, or our usage being 17 versions out of date

Also update the rand API breakages e.g. :
- rand::thread_rng() → rand::rng() (function renamed)
- rand::distributions::Alphanumeric → rand::distr::Alphanumeric (module renamed)
- rng.gen_range() → rng.random_range() (function renamed)

Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-17 15:58:58 +01:00
Fabiano Fidêncio
1ec0e344e5 runtime-rs: enable measured rootfs for qemu-coco-dev-runtime-rs
Add kernel_verity_params to the qemu-coco-dev-runtime-rs configuration
so the runtime can assemble dm-verity kernel parameters, and remove the
test skip that was disabling measured rootfs tests for this hypervisor.

Fixes: #12851

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-17 15:22:17 +02:00
Fabiano Fidêncio
eda3bc6190 runtime-rs: wire GetDiagnosticData for termination logs
Add runtime-rs support for the GetDiagnosticData RPC. This extends
the Agent trait, types, and protocol translation layer with the new
request/response types.

During container stop, when shared_fs is "none" and the
terminationMessagePolicy annotation is "File", the runtime copies
the termination log from the guest via GetDiagnosticData. The call
is best-effort to avoid blocking container teardown.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-17 13:16:25 +02:00
Alex Lyn
c546b3c585 Merge pull request #12843 from microsoft/saul/build-opt
runtime-rs: add build optimization flags
2026-04-16 09:05:20 +08:00
Alex Lyn
2f6319f130 runtime-rs: Fix unformatted code in runtime-rs
When build runtime-rs, one unformatted code block comes up,as below:
```
-        config
-            .hypervisor
-            .entry("qemu".to_owned())
-            .and_modify(|hv| {
-                hv.cpu_info.default_vcpus = default_vcpus;
-                hv.cpu_info.default_maxvcpus = default_maxvcpus;
-                hv.memory_info.default_memory = default_memory;
-                hv.memory_info.default_maxmemory = default_maxmemory;
-            });
+        config.hypervisor.entry("qemu".to_owned()).and_modify(|hv| {
+            hv.cpu_info.default_vcpus = default_vcpus;
+            hv.cpu_info.default_maxvcpus = default_maxvcpus;
+            hv.memory_info.default_memory = default_memory;
+            hv.memory_info.default_maxmemory = default_maxmemory;
+        });
```
Let's format it now.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-15 14:48:23 +02:00
Saul Paredes
9404104aba runtime-rs: add build optimization flags
Enable the following optimizations when building runtime-rs in release mode:
- lto: true
- codegen-units=1:

Setting these reduce the binary size and improve performance at the cost of longer build times.

Without these flags:
- build time: 4m 55s
- binary size: 51 MB

With these flags:
- build time: 7m 21s
- binary size: 38MB

Per https://github.com/kata-containers/kata-containers/issues/1125 and local experiments,
a smaller binary size leads to a smaller shim memory footprint.

- https://nnethercote.github.io/perf-book/build-configuration.html#codegen-units
- https://nnethercote.github.io/perf-book/build-configuration.html#link-time-optimization

Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
2026-04-14 15:52:38 -07:00
stevenhorsman
5bcc006447 runtime-rs: Add missing license
The ch-config crate was missing a license

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-11 08:46:32 +01:00
Fabiano Fidêncio
3ce3644c3c Merge pull request #12807 from PiotrProkop/blk-sector-rust
runtime-rs: allow specifying logical/physical sector size for block devices
2026-04-11 00:42:45 +02:00
Fabiano Fidêncio
36a2d8e7f2 agent: Make launch_process_timeout configurable
The hardcoded DEFAULT_LAUNCH_PROCESS_TIMEOUT of 6 seconds in the kata
agent is insufficient for environments with NVIDIA GPUs and NVSwitches,
where the attestation-agent needs significantly more time to collect
evidence during initialization (e.g. ~2 seconds per NVSwitch).

When the timeout expires, the agent (PID 1) exits with an error, causing
the guest kernel to perform an orderly shutdown before the
attestation-agent has finished starting.

Make this timeout configurable via the kernel parameter
agent.launch_process_timeout (in seconds), preserving the 6-second
default for backward compatibility. The Go runtime is wired up to pass
this value from the TOML config's [agent.kata] section through to the
kernel command line.

The NVIDIA GPU configs set the new default to 15 seconds.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-10 14:47:01 +02:00
PiotrProkop
82de35c720 runtime-rs: allow specifying logical/physical sector size for block devices
Add two new configuration knobs that control the logical and physical
sector sizes advertised by virtio-blk devices to the guest:

  block_device_logical_sector_size  (config file)
  block_device_physical_sector_size (config file)

  io.katacontainers.config.hypervisor.blk_logical_sector_size  (annotation)
  io.katacontainers.config.hypervisor.blk_physical_sector_size (annotation)

The annotation names are abbreviated relative to the config file keys
because Kubernetes enforces a 63-character limit on annotation name
segments, and the full names would exceed it.

Both settings default to 0 (let QEMU decide). When set, they are passed
as logical_block_size and physical_block_size in the QMP device_add
command during block device hotplug.

Setting logical_sector_size smaller then container filesystem
block size will cause EINVAL on mount. The physical_sector_size can
always be set independently.

Values must be 0 or a power of 2 in the range [512, 65536]; other
values are rejected with an error at sandbox creation time.

Signed-off-by: PiotrProkop <pprokop@nvidia.com>
2026-04-10 11:14:51 +02:00
Fabiano Fidêncio
218077506b Merge pull request #12769 from RuoqingHe/runtime-rs-allow-install-on-riscv
runtime-rs: Allow installation on RISC-V platforms
2026-04-10 10:24:40 +02:00
Steve Horsman
9e8069569e Merge pull request #12734 from Apokleos/rm-v9p-rs
runtime-rs: Remove virtio-9p Shared Filesystem Support
2026-04-09 16:15:55 +01:00
Hyounggyu Choi
f15f7f49f1 Merge pull request #12787 from fidencio/topic/runtime-rs-qemu-arm64-use-static-sandbox-resource-mgmt
runtime: qemu: Enable static sandbox resource management on ARM & s390x
2026-04-09 09:18:11 +02:00
Ruoqing He
98ee385220 runtime-rs: Consolidate unsupported arch
Consolidate arch we don't support at the moment, and avoid hard coding
error messages per arch.

Signed-off-by: Ruoqing He <ruoqing.he@lingcage.com>
2026-04-09 04:18:50 +00:00