The cloud-hypervisor feature has been fully functional for some time
now: it's enabled by default in virt_container, used by agent-ctl,
and exercised in CI. Drop the stale comments referencing issue #6264
and promote the feature to a default.
Fixes: #6264
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
* get_rootless_symlink_sandbox_path() would be called without first checking
is_rootless(), meaning cleanup() would ALWAYS fail (see the error below), even
though the shim/CH would NOT leak thanks to containerd's recovery routine.
* Cleanup wouldn't be idempotent (in case the CRI issues multiple shutdown requests).
This was fixed by introducing remove_dir_all_if_exists().
Apr 17 17:53:21 containerd[4078033]: time="2026-04-17T17:53:21.821624475-05:00" level=error msg="failed to shutdown shim task and the shim might be leaked" error="Others(\"failed to handle message handler TaskRequest\\n\\nCaused by:\\n 0: do shutdown\\n 1: do the clean up\\n 2: delete hypervisor\\n 3: No such file or directory (os error 2)\\n\\nStack backtrace:\\n 0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from\\n 1: <hypervisor::ch::CloudHypervisor as hypervisor::Hypervisor>::cleanup::{{closure}}\\n 2: <virt_container::sandbox::VirtSandbox as common::sandbox::Sandbox>::cleanup::{{closure}}\\n 3: <virt_container::sandbox::VirtSandbox as common::sandbox::Sandbox>::shutdown::{{closure}}\\n 4: runtimes::manager::RuntimeHandlerManager::handler_task_message::{{closure}}::{{closure}}\\n 5: runtimes::manager::RuntimeHandlerManager::handler_task_message::{{closure}}\\n 6: <service::task_service::TaskService as containerd_shim_protos::shim::shim_ttrpc_async::Task>::shutdown::{{closure}}\\n 7: <containerd_shim_protos::shim::shim_ttrpc_async::ShutdownMethod as ttrpc::asynchronous::utils::MethodHandler>::handler::{{closure}}\\n 8: ttrpc::asynchronous::server::HandlerContext::handle_msg::{{closure}}\\n 9: <core::future::poll_fn::PollFn<F> as core::future::future::Future>::poll\\n 10: <ttrpc::asynchronous::server::ServerReader as ttrpc::asynchronous::connection::ReaderDelegate>::handle_msg::{{closure}}::{{closure}}\\n 11: tokio::runtime::task::core::Core<T,S>::poll\\n 12: tokio::runtime::task::harness::Harness<T,S>::poll\\n 13: tokio::runtime::scheduler::multi_thread::worker::Context::run_task\\n 14: tokio::runtime::scheduler::multi_thread::worker::Context::run\\n 15: tokio::runtime::context::scoped::Scoped<T>::set\\n 16: tokio::runtime::context::runtime::enter_runtime\\n 17: tokio::runtime::scheduler::multi_thread::worker::run\\n 18: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll\\n 19: 
tokio::runtime::task::core::Core<T,S>::poll\\n 20: tokio::runtime::task::harness::Harness<T,S>::poll\\n 21: tokio::runtime::blocking::pool::Inner::run\\n 22: std::sys::backtrace::__rust_begin_short_backtrace\\n 23: core::ops::function::FnOnce::call_once{{vtable.shim}}\\n 24: std::sys::thread::unix::Thread::new::thread_start\\n 25: <unknown>\\n 26: <unknown>\")" id=fca6a162b8f0ed7ef2b33cd99b6f1b58124e85c5489c193ceac487db0e4acdde
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
This serializes CH API calls to avoid a race condition where deleting a pod
would hang indefinitely and leak both the shim and CH processes.
The race happened because the CRI can send multiple shutdown requests for the
same pod, but the CH socket wasn't guarded against concurrent usage, so
HTTP responses could interleave (see below) on the shutdown path, leading
to an error.
This would repro in <15 iterations (sometimes 2-3) using a 2-container pod.
With this commit, I haven't observed a repro in 200+ iterations.
Fixes: #12858
ORIGINAL REPRO:
while true; do
kubectl apply -f busybox.yaml
kubectl wait --for=condition=ready po busybox
kubectl exec busybox -- echo foo
kubectl delete po busybox
done
ORIGINAL ERROR:
Apr 17 20:15:54 kata[2297383]: Failed to stop process, process = ContainerProcess { container_id: ContainerID { container_id: "d4eb8984d630111bbf808c7ea30b7a21274c0193cdb8d501d20e4f26a0a69151" }, exec_id: "", process_type: Container }, err = failed to update_mem_resource
Caused by:
0: resize memory
1: get vminfo
2: failed to serde {"config":{"cpus":{"boot_vcpus":1,"max_vcpus":32,"topology":{"threads_per_core":1,"cores_per_die":32,"dies_per_package":1,"packages":1},"kvm_hyperv":false,"max_phys_bits":46,"affinity":null,"features":{"amx":false},"nested":null},"memory":{"size":2147483648,"mergeable":false,"hotplug_method":"Acpi","hotplug_size":132024107008,"hotplugged_size":null,"shared":true,"hugepages":false,"hugepage_size":null,"prefault":false,"zones":null,"thp":true},"payload":{"firmware":null,"kernel":"/usr/share/cloud-hypervisor/vmlinux.bin","cmdline":"reboot=k panic=1 systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service agent.log_vport=1025 console=ttyS0,115200n8 root=/dev/vda1 rootflags=data=ordered,errors=remount-ro ro rootfstype=ext4 no_timer_check noreplace-smp systemd.log_target=console agent.container_pipe_size=1 agent.log=debug cgroup_no_v1=all systemd.unified_cgroup_hierarchy=1","initramfs":null},"rate_limit_groups":null,"disks":[{"path":"/usr/share/kata-containers/kata-containers.img","readonly":true,"direct":false,"iommu":false,"num_queues":1,"queue_size":128,"vhost_user":false,"vhost_socket":null,"rate_limit_group":null,"rate_limiter_config":null,"id":"_disk0","disable_io_uring":false,"disable_aio":false,"pci_segment":0,"serial":null,"queue_affinity":null,"backing_files":false}],"net":[{"tap":null,"ip":"192.168.249.1","mask":"255.255.255.0","mac":"9e:7e:13:ee:03:5c","host_mac":null,"mtu":null,"iommu":false,"num_queues":2,"queue_size":256,"vhost_user":false,"vhost_socket":null,"vhost_mode":"Client","id":"_net1","fds":[-1],"rate_limiter_config":null,"pci_segment":0,"offload_tso":true,"offload_ufo":true,"offload_csum":true}],"rng":{"src":"/dev/urandom","iommu":false},"balloon":null,"fs":[{"tag":"kataShared","socket":"/run/kata/e1ae0a05f575a13a535aa95a9990d1fded4766a759f76be0e528c7912d3a5e39/root/virtiofsd.sock","num_queues":1,"queue_size":1024,"id":"_fs2","pci_segment":0}],"pmem":null:"/run/kata/e1ae0a05f575a13a535aa95a9990d1fded4766a759f76b
e0e528c7912d3a5e39/ch-vm.sock","iommu":false,"id":"_vsock3","pci_segment":0},"pvpanic":false,"iommu":false,"numa":null,"watchdog":false,"pci_segments":null,"platform":null,"tpm":null,"landlock_enabl"index":0,"base":3891789824,"size":524288,"type_":"Mmio32","prefetchable":false}}],"parent":null,"children":["_disk0"],"pci_bdf":"0000:00:01.0"},"_virtio-pci-_vsock3":{"id":"_virtio-pci-_vsock3","resources":[{"PciBar":{"index":0,"base":70367622201344,"sizee":false}}],"parent":null,"children":["_fs2"],"pci_bdf":"0000:00:04.0"},"_vsock3":{"id":"_vsock3","resources":[],"parent":"_virtio-pci-_vsock3","children":[],"pci_bdf":null},"_net1":{"id":"_net1","resources":[],"parent":"_virtio-pci-_net1","children":[],"presources":[{"PciBar":{"index":0,"base":70367623774208,"size":524288,"type_":"Mmio64","prefetchable":false}}],"parent":null,"children":["_net1"],"pci_bdf":"0000:00:02.0"},"_virtio-pci-__rng":{"id":"_virtio-pci-__rng","resources":[{"PciBar":{"index":0,"baseesources":[],"parent":null,"children":[],"pci_bdf":null}}}HTTP/1.1 200
Server: Cloud Hypervisor API
Connection: keep-alive
Content-Type: application/json
Content-Length: 4285
{"config":{"cpus":{"boot_vcpus":1,"max_vcpus":32,"topology":{"threads_per_core":1,"cores_per_die":32,"dies_per_package":1,"packagesepage_size":null,"prefault":false,"zones":null,"thp":true},"payload":{"firmware":null,"kernel":"/usr/share/cloud-hypervisor/vmlinux.bin","cmdline":"reboot=k panic=1 systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service agent.log_vport=1025 console=ttyS0,115200n8 root=/dev/vda1 rootflags=data=ordered,errors=remount-ro ro rootfstype=ext4 no_timer_check noreplace-smp systemd.log_target=console agent.container_pipe_size=1 agent.log=debug cgroup_no_v1=all systemd.unified_cgroup_hierarchy=1","miter_config":null,"id":"_disk0","disable_io_uring":false,"disable_aio":false,"pci_segment":0,"serial":null,"queue_affinity":null,"backing_files":false}],"net":[{"tap":null,"ip":"192.168.249.1","mask":"255.255.255.0","mac":"9e:7e:13:ee:03:5c","host_mac":nu,"serial":{"file":null,"mode":"Tty","iommu":false,"socket":null},"console":{"file":null,"mode":"Off","iommu":false,"socket":null},"debug_console":{"file":null,"mode":"Off","iobase":233},"devices":[],"user_devices":null,"vdpa":null,"vsock":{"cid":3,"socket"
3: expected `,` or `}` at line 1 column 1924
Stack backtrace:
0: <E as anyhow::context::ext::StdError>::ext_context
1: anyhow::context::<impl anyhow::Context<T,E> for core::result::Result<T,E>>::with_context
2: <hypervisor::ch::CloudHypervisor as hypervisor::Hypervisor>::resize_memory::{{closure}}
3: resource::manager_inner::ResourceManagerInner::update_linux_resource::{{closure}}
4: virt_container::container_manager::container::Container::stop_process::{{closure}}
5: virt_container::container_manager::process::Process::run_io_wait::{{closure}}::{{closure}}
6: tokio::runtime::task::core::Core<T,S>::poll
7: tokio::runtime::task::harness::Harness<T,S>::poll
8: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
9: tokio::runtime::scheduler::multi_thread::worker::Context::run
10: tokio::runtime::context::scoped::Scoped<T>::set
11: tokio::runtime::context::runtime::enter_runtime
12: tokio::runtime::scheduler::multi_thread::worker::run
13: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
14: tokio::runtime::task::core::Core<T,S>::poll
15: tokio::runtime::task::harness::Harness<T,S>::poll
16: tokio::runtime::blocking::pool::Inner::run
17: std::sys::backtrace::__rust_begin_short_backtrace
18: core::ops::function::FnOnce::call_once{{vtable.shim}}
19: std::sys::thread::unix::Thread::new::thread_start
20: <unknown>
21: <unknown>
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
Related to the previous commit, which adds the default gateway neighbor;
that entry has the REACHABLE state.
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
This change mirrors host networking into the guest as before, but now also
includes the default gateway neighbor entry for each interface.
Pods using overlay/synthetic gateways (e.g., 169.254.1.1) can hit a
first-connect race while the guest performs the initial ARP. Preseeding the
gateway neighbor removes that latency and makes early connections (e.g.,
to the API Service) deterministic.
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
reqwest 0.11 required rustls-webpki 0.101.x, so we had to bump reqwest
in order to use rustls-webpki 0.103.12 and fix these CVEs:
- RUSTSEC-2026-0098
- RUSTSEC-2026-0099
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Hotplugging a readonly block device could fail with:
Block node is read-only
The backend block node was created readonly, but the virtio-scsi/blk
frontend path still forced share-rw=true. This is unnecessary and can
cause QEMU to reject the attach because the frontend configuration
does not match the readonly backend.
Fix the virtio-scsi/blk hotplug path by:
- setting read-only for readonly devices where supported
- skipping share-rw for readonly devices
Readonly handling remains in the backend block node configuration,
while the frontend keeps normal disk semantics for block devices.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Explicitly configure `read_only` and `force_share` for readonly block
devices to ensure consistency between the image's read-only state and
QEMU's access mode.
Motivation:
Previously, EROFS images were being accessed in a way that triggered
QEMU's exclusive locking (e.g., the 'resize' lock), even when the images
were intended to be read-only. This conflicted with external processes
(e.g., containerd snapshotter) that held read-only handles, resulting in
"Failed to get shared 'resize' lock" errors during blockdev-add.
Changes:
- Set `read_only=true` and `force_share=true` on both format and file
nodes for VMDK descriptors and Raw images.
- This ensures QEMU requests shared locks, correctly matching the
read-only nature of EROFS filesystems and preventing write-mode
locking conflicts with concurrent processes.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
We should also support the virtio-scsi driver for handling VMDK-format
block devices; this will help address more cases.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
When using multi-layer EROFS snapshotter, the destroy() method fails to
kill container processes, causing process leaks in shared PID namespace
scenarios.
Problem Background:
1. Multi-layer EROFS creates temporary mount points under the container's
root directory:
- /run/kata-containers/<cid>/multi-layer/upper (ext4, writable)
- /run/kata-containers/<cid>/multi-layer/lower-0 (EROFS, read-only)
2. The original destroy() method executed in this order:
(1) umount rootfs
(2) fs::remove_dir_all(&self.root) <- FAILS with "Read-only file system"
(3) cgroup cleanup and process killing <- NEVER EXECUTED
3. When remove_dir_all() encounters the read-only EROFS mount point, it
returns EROFS error (os error 30), causing destroy() to exit early
without killing processes.
Why This Fix:
1. The test case k8s-kill-all-process-in-container.bats creates an init
container with a background process (tail -f /dev/null), expecting it
to be killed when the init container is destroyed.
2. With shared PID namespace (shareProcessNamespace: true), the orphaned
process continues running, causing the test to fail.
Solution:
1. Reorder the destroy() method to kill processes BEFORE attempting to
remove the container directory:
(1) Get PIDs from cgroup and send SIGKILL
(2) Destroy cgroup
(3) umount rootfs
(4) fs::remove_dir_all(&self.root)
2. This ensures processes are always killed regardless of filesystem
cleanup status, matching the behavior of overlayfs snapshotter.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Refactor the multi-layer EROFS storage handling to improve code
maintainability and reduce duplication.
Key changes:
(1) Extract update_storage_device() to unify device state management
for both multi-layer and standard storages
(2) Simplify handle_multi_layer_storage() to focus on device creation,
returning MultiLayerProcessResult struct instead of managing state
(3) Unify the processing flow in add_storages() with a clear separation
of concerns
(4) Support multiple EROFS lower layers with dynamic lower-N mount paths
(5) Improve mkdir directive handling with deferred {{ mount 1 }}
resolution
This reduces code duplication, improves readability, and makes the
storage handling logic more consistent across different storage types.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Introduce MultiLayerErofsHandler and the handle_multi_layer_storage
method for multi-layer storage:
(1) Register MultiLayerErofsHandler to STORAGE_HANDLERS to handle
multi-layer EROFS storage with driver type 'multi-layer-erofs'.
(2) Add handle_multi_layer_erofs function to process multiple EROFS
storages with X-kata.multi-layer marker together in guest.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add multi_layer_erofs.rs implementing the guest-side processing logic
for multi-layer EROFS rootfs with overlay mount support.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add handling for multi-layer EROFS rootfs in the RootFsResource
handler_rootfs method, so that multi-layer EROFS rootfs is handled
correctly.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add erofs_rootfs.rs implementing ErofsMultiLayerRootfs for
multi-layer EROFS rootfs with VMDK descriptor generation.
It's the core implementation of the EROFS rootfs within the runtime.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Change Rootfs::get_storage to return Option<Vec<Storage>>
to support multi-layer rootfs with multiple storages.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add a format argument to hotplug_block_device for flexibly specifying
different block formats.
With this, we can support multiple formats: currently raw and vmdk are
supported, and other formats will follow in the future.
Besides the format itself, the corresponding handling logic is also
required to properly set the options needed for QMP blockdev-add.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
In practice, we need more block formats, not limited to `Raw`.
This commit adds a BlockDeviceFormat enum to support multiple block
device formats, such as RAW and VMDK, and makes the accompanying
changes needed for this to work, including a format field in BlockConfig.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add RUNTIME_ALLOW_MOUNTS annotation to RuntimeInfo to specify
custom mount types allowed by the runtime.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The Go runtime's CoCo dev config uses dial_timeout = 45s, but all
runtime-rs confidential VM configs had reconnect_timeout_ms set to
3000ms (3s) or 5000ms (SE). This is too short for confidential VMs,
especially on arm64 where UEFI firmware (AAVMF) adds significant
boot time on top of the measured boot process, causing ECONNRESET
errors on the vsock connection before the agent is ready.
Bump reconnect_timeout_ms to 45000ms across all confidential VM
configs (coco-dev, SNP, TDX, SE) to match the Go runtime.
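The change itself is a one-line configuration bump per file, roughly (the section name below is illustrative; the key is from this commit):

```toml
# In each confidential VM configuration (coco-dev, SNP, TDX, SE):
[agent.kata]
# Was 3000 (5000 on SE); now matches the Go runtime's 45s dial_timeout.
reconnect_timeout_ms = 45000
```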
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Update all versions of rand that are controlled by us to remediate
GHSA-cq8v-f236-94qc.
Note: There are still some usages of rand 0.8.5 that come from
transitive dependencies which we can't currently update:
- fail
- phf_generator
- opentelemetry
due to those crates being archived, or our usage being 17 versions out of date.
Also update call sites for the rand API breakages, e.g.:
- rand::thread_rng() → rand::rng() (function renamed)
- rand::distributions::Alphanumeric → rand::distr::Alphanumeric (module renamed)
- rng.gen_range() → rng.random_range() (function renamed)
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Add kernel_verity_params to the qemu-coco-dev-runtime-rs configuration
so the runtime can assemble dm-verity kernel parameters, and remove the
test skip that was disabling measured rootfs tests for this hypervisor.
Fixes: #12851
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add runtime-rs support for the GetDiagnosticData RPC. This extends
the Agent trait, types, and protocol translation layer with the new
request/response types.
During container stop, when shared_fs is "none" and the
terminationMessagePolicy annotation is "File", the runtime copies
the termination log from the guest via GetDiagnosticData. The call
is best-effort to avoid blocking container teardown.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add policy rules for the new GetDiagnosticDataRequest RPC.
The request is denied by default in genpolicy-generated policies,
ensuring CoCo workloads do not expose diagnostic data unless
explicitly opted in via policy_data.request_defaults.
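Opting in would then be a small settings override, roughly like this fragment of genpolicy-settings.json (shown set to true, i.e. allowed; the generated default is false):

```json
{
  "request_defaults": {
    "GetDiagnosticDataRequest": true
  }
}
```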
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Silenio Quarti <silenio_quarti@ca.ibm.com>
Add a new extensible GetDiagnosticData RPC that retrieves diagnostic
information from the guest VM. The request carries a log_type string
field to specify what kind of data is requested, and a container_id
field to identify the target container.
The first supported log_type is "termination_log", which reads the
Kubernetes termination message file from inside the guest. This is
needed for shared_fs=none configurations where the host cannot
directly access the guest filesystem.
On the Go runtime side, the container stop() path now calls
GetDiagnosticData to copy the termination message to the host
when running with NoSharedFS and the terminationMessagePolicy
annotation is set to "File". The call is best-effort: failures
are logged as warnings rather than blocking container teardown.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Silenio Quarti <silenio_quarti@ca.ibm.com>
Newer kernels and containerd versions (>= 2.2.3) may add extra mount
options to /sys/fs/cgroup that genpolicy does not embed in the policy
(e.g. nsdelegate, memory_recursiveprot). This causes the Kata agent to
reject CreateContainerRequest with PERMISSION_DENIED because the
check_mount rules require an exact match.
Rather than hard-coding the allowed extras in Rego, make them
configurable via genpolicy-settings.json under
cluster_config.cgroup_mount_extras_allowed. The corresponding Rego rule
(check_mount 4) reads the list from policy_data.cluster_config and
allows only those named options beyond the policy-embedded set.
To support this, cluster_config is now included in PolicyData so that
it gets serialized into the Rego policy_data object at generation time.
This follows the established pattern of keeping site- and
version-specific tunables in genpolicy-settings.json so they can be
overridden via JSON-Patch drop-ins without touching the Rego source.
A policy test case is added to verify that the default allowed extras
(nsdelegate, memory_recursiveprot) are accepted and that unknown extras
are rejected.
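The default drop-in would look roughly like this fragment of genpolicy-settings.json (the key names are from this commit):

```json
{
  "cluster_config": {
    "cgroup_mount_extras_allowed": [
      "nsdelegate",
      "memory_recursiveprot"
    ]
  }
}
```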
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Wait() was releasing s.mu immediately after getContainer(), then
calling getExec() — which reads c.execs — without holding any lock.
Concurrent Exec() or Delete() calls that write to c.execs under s.mu
triggered a "concurrent map read and map write" fatal panic.
Add a dedicated sync.RWMutex to the container struct that protects the
execs map. getExec() now acquires a read lock internally, and all
writes go through new setExec()/deleteExec() helpers that acquire the
write lock. This keeps the locking concern local to the map and avoids
complicating the s.mu usage in Wait().
Add a regression test (TestConcurrentExecAccess) that exercises
concurrent getExec reads against setExec/deleteExec writes; this
reliably reproduces the panic under the race detector without the fix.
Fixes: #12825
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
I don't think agent-ctl will benefit from the new image-rs features, but
let's update it for completeness.
Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>