18578 Commits

Author SHA1 Message Date
Fabiano Fidêncio
8dccf4cf37 Merge pull request #12896 from fidencio/release/3.29.0
release: Bump version to 3.29.0
3.29.0
2026-04-22 20:45:50 +02:00
Fabiano Fidêncio
1b9e49eb27 Merge commit from fork
genpolicy: restrict symlinks in CopyFile
2026-04-22 20:05:03 +02:00
Fabiano Fidêncio
ed3f8b4efe release: Bump version to 3.29.0
Bump VERSION and helm-charts versions.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-22 15:57:39 +02:00
Markus Rudy
639ff3578d genpolicy: restrict symlinks in CopyFile
Allowing arbitrary symlinks in the shared directory is unsafe for
confidential VM use cases. In order to make CopyFile safe both for the
VM and for the consuming containers, we implement the following
rules for symlinks (in addition to the existing rules for other files):

1. Symlinks may not be placed directly into the shared directory.
2. Symlinks must not point 'upwards', i.e. contain `..` as a path
   element.
3. Symlinks must be relative.

These rules ensure that all writes initiated by CopyFile are restricted
to the shared directory (protecting the VM), and that symlinks can't
point outside their mount points (protecting the container).

These new restrictions mean that we can't support arbitrary mount
sources (which might not follow these rules), but the usual k8s suspects
(ConfigMap, Secret, ServiceAccountToken) should still pass.

In order to aid writing the policy, we convert the CopyFileRequest to a
structure that does not contain binary data, but well-defined strings
and types.

Signed-off-by: Markus Rudy <mr@edgeless.systems>
2026-04-22 15:46:12 +02:00
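The three symlink rules listed in 639ff3578d can be sketched as a standalone check. This is a minimal Python illustration and not the genpolicy implementation: the function name `symlink_allowed`, its parameters, and the example paths are hypothetical, and the extra check on the link path itself is an added assumption rather than one of the listed rules.

```python
from pathlib import PurePosixPath


def symlink_allowed(shared_dir: str, link_path: str, target: str) -> bool:
    """Return True if a symlink at link_path pointing to target obeys the rules."""
    shared = PurePosixPath(shared_dir)
    link = PurePosixPath(link_path)
    tgt = PurePosixPath(target)

    # The link itself has to live somewhere below the shared directory.
    try:
        rel = link.relative_to(shared)
    except ValueError:
        return False

    # Assumption (not one of the three rules): the link path has no ".." either.
    if ".." in rel.parts:
        return False

    # Rule 1: no symlinks directly in the shared directory,
    # so at least one intermediate directory is required.
    if len(rel.parts) < 2:
        return False

    # Rule 2: the target must not contain ".." as a path element.
    if ".." in tgt.parts:
        return False

    # Rule 3: the target must be relative.
    if tgt.is_absolute():
        return False

    return True
```

Names such as `..data` are accepted because only a path element that is exactly `..` counts as pointing upwards, which is why ConfigMap-style layouts can still pass.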
Markus Rudy
d6bd666b3f agent: fix naming for symlinks in CopyFile
The agent referred to the `data` field of an incoming CopyFileRequest
as the 'src'. This is misleading, because 'source' is not mentioned
in the specification (where links are just a path with attached
bytes), and because the documentation for the `ln` utility calls the
path LINK_NAME and the data TARGET. This commit fixes the glitch and
calls the first argument to `symlinkat` the target.

Signed-off-by: Markus Rudy <mr@edgeless.systems>
2026-04-22 15:46:12 +02:00
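The naming adopted in d6bd666b3f follows `ln`: the path is the link name and the attached data is the target. A small sketch of that convention, assuming a POSIX system; `make_link`, `demo`, and the file names are hypothetical. Python's `os.symlink` takes the target first, in the same order as `symlinkat`.

```python
import os
import tempfile


def make_link(link_name: str, target: str) -> str:
    """Create link_name pointing at target and return what the link stores."""
    # The target comes first, mirroring symlinkat(target, newdirfd, linkpath).
    os.symlink(target, link_name)
    return os.readlink(link_name)


def demo() -> str:
    # The target does not need to exist; a symlink only stores the string.
    with tempfile.TemporaryDirectory() as tmp:
        return make_link(os.path.join(tmp, "link"), "data/file")
```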
Markus Rudy
5c362adcff agent: add required features for standalone build
Building the kata-agent-policy crate only succeeded when its parents
(agent and genpolicy) pulled in the required features. This commit adds
the required features to the crate itself, such that it can be built
standalone and IDEs don't show errors while browsing it.

Signed-off-by: Markus Rudy <mr@edgeless.systems>
2026-04-22 15:46:12 +02:00
Fabiano Fidêncio
47dea24409 Merge pull request #12895 from fidencio/topic/kata-deploy-avoid-shipping-what-we-do-not-test
kata-deploy: Remove arm64 and qemu-cca shim support
2026-04-22 15:42:43 +02:00
Fabiano Fidêncio
726992cde3 Merge pull request #12702 from Apokleos/update-docs2
docs: Update docs of kata-containers
2026-04-22 12:04:48 +02:00
Fabiano Fidêncio
9b62021049 kata-deploy: Remove untested arm64 and qemu-cca shim support
We should not ship configurations that we do not actively test.

This commit drops the following from the kata-deploy helm chart:

values.yaml:
- arm64 from supportedArches for the clh shim
- arm64 from supportedArches for the cloud-hypervisor shim
- arm64 from supportedArches for the dragonball shim
- arm64 from supportedArches for the fc shim
- arm64 from supportedArches for the qemu-nvidia-gpu shim
- the entire qemu-cca shim definition

try-kata-tee.values.yaml:
- CCA from the file description comment
- qemu-cca from the TEE shims list comment
- the entire qemu-cca shim definition
- arm64: qemu-cca from the defaultShim mapping, replaced with
  arm64: qemu-coco-dev-runtime-rs (which is tested)

try-kata-nvidia-gpu.values.yaml:
- arm64 from supportedArches for the qemu-nvidia-gpu shim
- arm64: qemu-nvidia-gpu from the defaultShim mapping

Once arm64 and qemu-cca support are properly tested, they can be
re-added.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-22 10:55:29 +02:00
Alex Lyn
978f40d631 docs: Remove obsolete and update documentation index
This commit prunes the documentation tree by removing files
that are either no longer relevant to the current architecture
or have been superseded by newer guides.

Specifically, it removes the doc Intel-Discrete-GPU-passthrough-and-Kata.md
and updates the using-Intel-QAT-and-kata.md index in nav.yaml.

Refining the documentation helps ensure that new contributors
find accurate and up-to-date information.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-22 16:29:46 +08:00
Alex Lyn
59609463e0 docs: Update kernel modules loading document
- Restructure document with clearer sections and better readability
- Add configuration format examples for both runtimes
- Add technical details including data flow and implementation references
- Add debugging section for troubleshooting

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-22 16:29:46 +08:00
Alex Lyn
d6308ffb8c docs: Update SPDK vhost-user guide with CSI driver
- Add support for runtime-rs with Dragonball
- Add CSI driver integration method for Kubernetes
- Add kata-ctl direct-volume method for manual setup
- Preserve SPDK vhost-user Target Overview principles
- Fix minor typo (can exposes -> can expose)

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-22 16:29:46 +08:00
Fabiano Fidêncio
0c80372cf5 Merge pull request #12881 from stevenhorsman/bump-web-pki-to-0.103.12
Bump web pki to 0.103.12
2026-04-21 18:11:26 +02:00
Aurélien Bombo
206c1d3be8 Merge pull request #12889 from fidencio/topic/ch-config
hypervisor: Enable cloud-hypervisor feature by default
2026-04-21 11:04:31 -05:00
Fabiano Fidêncio
1c2d5cb57d Merge pull request #12848 from kata-containers/sprt/fix-block-vol-test
tests: make k8s-block-volume more robust
2026-04-21 11:27:43 +02:00
Fabiano Fidêncio
2bfa94b7cb hypervisor: Enable cloud-hypervisor feature by default
The cloud-hypervisor feature has been fully functional for some time
now: it's enabled by default in virt_container, used by agent-ctl,
and exercised in CI.  Drop the stale comments referencing issue #6264
and promote the feature to a default.

Fixes: #6264

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-21 11:26:12 +02:00
Fabiano Fidêncio
3b481813f9 Merge pull request #12887 from kata-containers/sprt/fix-runtime-rs-ch-cleanup
runtime-rs/ch: Fix pod deletion hang and make deletion idempotent
2026-04-21 11:21:09 +02:00
Aurélien Bombo
a401266f0e Merge pull request #11704 from microsoft/saulparedes/allow_default_gateway_neigh
network: preseed default-gateway neighbor
2026-04-20 15:43:55 -05:00
Aurélien Bombo
d64fce3998 Revert "ci: k8s: Adjust timeout on free runners"
This reverts commit 8d6f1d6f34.
2026-04-20 15:36:35 -05:00
Aurélien Bombo
3cf9581fbe runtime-rs/ch: Fix errors on pod deletion
* get_rootless_symlink_sandbox_path() would get without first checking for
   is_rootless(), meaning cleanup() would ALWAYS fail (see below error), even
   though the shim/CH would NOT leak thanks to containerd's recovery routine.

 * Cleanup wouldn't be idempotent (in case the CRI issues multiple shutdown requests).
   This was fixed by introducing remove_dir_all_if_exists().

   Apr 17 17:53:21 containerd[4078033]: time="2026-04-17T17:53:21.821624475-05:00" level=error msg="failed to shutdown shim task and the shim might be leaked" error="Others(\"failed to handle message handler TaskRequest\\n\\nCaused by:\\n    0: do shutdown\\n    1: do the clean up\\n    2: delete hypervisor\\n    3: No such file or directory (os error 2)\\n\\nStack backtrace:\\n   0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from\\n   1: <hypervisor::ch::CloudHypervisor as hypervisor::Hypervisor>::cleanup::{{closure}}\\n   2: <virt_container::sandbox::VirtSandbox as common::sandbox::Sandbox>::cleanup::{{closure}}\\n   3: <virt_container::sandbox::VirtSandbox as common::sandbox::Sandbox>::shutdown::{{closure}}\\n   4: runtimes::manager::RuntimeHandlerManager::handler_task_message::{{closure}}::{{closure}}\\n   5: runtimes::manager::RuntimeHandlerManager::handler_task_message::{{closure}}\\n   6: <service::task_service::TaskService as containerd_shim_protos::shim::shim_ttrpc_async::Task>::shutdown::{{closure}}\\n   7: <containerd_shim_protos::shim::shim_ttrpc_async::ShutdownMethod as ttrpc::asynchronous::utils::MethodHandler>::handler::{{closure}}\\n   8: ttrpc::asynchronous::server::HandlerContext::handle_msg::{{closure}}\\n   9: <core::future::poll_fn::PollFn<F> as core::future::future::Future>::poll\\n  10: <ttrpc::asynchronous::server::ServerReader as ttrpc::asynchronous::connection::ReaderDelegate>::handle_msg::{{closure}}::{{closure}}\\n  11: tokio::runtime::task::core::Core<T,S>::poll\\n  12: tokio::runtime::task::harness::Harness<T,S>::poll\\n  13: tokio::runtime::scheduler::multi_thread::worker::Context::run_task\\n  14: tokio::runtime::scheduler::multi_thread::worker::Context::run\\n  15: tokio::runtime::context::scoped::Scoped<T>::set\\n  16: tokio::runtime::context::runtime::enter_runtime\\n  17: tokio::runtime::scheduler::multi_thread::worker::run\\n  18: <tokio::runtime::blocking::task::BlockingTask<T> as 
core::future::future::Future>::poll\\n  19: tokio::runtime::task::core::Core<T,S>::poll\\n  20: tokio::runtime::task::harness::Harness<T,S>::poll\\n  21: tokio::runtime::blocking::pool::Inner::run\\n  22: std::sys::backtrace::__rust_begin_short_backtrace\\n  23: core::ops::function::FnOnce::call_once{{vtable.shim}}\\n  24: std::sys::thread::unix::Thread::new::thread_start\\n  25: <unknown>\\n  26: <unknown>\")" id=fca6a162b8f0ed7ef2b33cd99b6f1b58124e85c5489c193ceac487db0e4acdde

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-04-20 15:36:18 -05:00
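The idempotent cleanup described in 3cf9581fbe can be illustrated with a Python analogue of `remove_dir_all_if_exists()`. This is only a sketch of the idea, not the runtime-rs code; the `demo` helper is hypothetical.

```python
import os
import shutil
import tempfile


def remove_dir_all_if_exists(path: str) -> None:
    """Remove a directory tree, treating an already-missing path as success."""
    try:
        shutil.rmtree(path)
    except FileNotFoundError:
        # A previous cleanup already removed it; nothing left to do.
        pass


def demo() -> bool:
    path = tempfile.mkdtemp()
    # Calling cleanup twice models repeated shutdown requests.
    remove_dir_all_if_exists(path)
    remove_dir_all_if_exists(path)
    return os.path.exists(path)
```

Other errors, such as permission problems, still propagate, so only the "already gone" case is silenced.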
Aurélien Bombo
93bd2899fb runtime-rs/ch: Fix hang on pod deletion
This serializes CH API calls to avoid a race condition where deleting a pod
would hang indefinitely and leak both the shim and CH processes.

The race happened because the CRI can send multiple shutdown requests for the
same pod, however the CH socket wasn't guarded against concurrent usage, hence
it was possible that HTTP responses would interleave (see below) on the
shutdown path, leading to an error.

This would repro in <15 iterations (sometimes 2-3) using a 2-container pod.
With this commit, I haven't observed a repro in 200+ iterations.

Fixes: #12858

ORIGINAL REPRO:

while true; do
  kubectl apply -f busybox.yaml
  kubectl wait --for=condition=ready po busybox
  kubectl exec busybox -- echo foo
  kubectl delete po busybox
done

ORIGINAL ERROR:

 Apr 17 20:15:54 kata[2297383]: Failed to stop process, process = ContainerProcess { container_id: ContainerID { container_id: "d4eb8984d630111bbf808c7ea30b7a21274c0193cdb8d501d20e4f26a0a69151" }, exec_id: "", process_type: Container }, err = failed to update_mem_resource

                               Caused by:
                                   0: resize memory
                                   1: get vminfo
                                   2: failed to serde {"config":{"cpus":{"boot_vcpus":1,"max_vcpus":32,"topology":{"threads_per_core":1,"cores_per_die":32,"dies_per_package":1,"packages":1},"kvm_hyperv":false,"max_phys_bits":46,"affinity":null,"features":{"amx":false},"nested":null},"memory":{"size":2147483648,"mergeable":false,"hotplug_method":"Acpi","hotplug_size":132024107008,"hotplugged_size":null,"shared":true,"hugepages":false,"hugepage_size":null,"prefault":false,"zones":null,"thp":true},"payload":{"firmware":null,"kernel":"/usr/share/cloud-hypervisor/vmlinux.bin","cmdline":"reboot=k panic=1 systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service agent.log_vport=1025 console=ttyS0,115200n8 root=/dev/vda1 rootflags=data=ordered,errors=remount-ro ro rootfstype=ext4 no_timer_check noreplace-smp systemd.log_target=console agent.container_pipe_size=1 agent.log=debug cgroup_no_v1=all systemd.unified_cgroup_hierarchy=1","initramfs":null},"rate_limit_groups":null,"disks":[{"path":"/usr/share/kata-containers/kata-containers.img","readonly":true,"direct":false,"iommu":false,"num_queues":1,"queue_size":128,"vhost_user":false,"vhost_socket":null,"rate_limit_group":null,"rate_limiter_config":null,"id":"_disk0","disable_io_uring":false,"disable_aio":false,"pci_segment":0,"serial":null,"queue_affinity":null,"backing_files":false}],"net":[{"tap":null,"ip":"192.168.249.1","mask":"255.255.255.0","mac":"9e:7e:13:ee:03:5c","host_mac":null,"mtu":null,"iommu":false,"num_queues":2,"queue_size":256,"vhost_user":false,"vhost_socket":null,"vhost_mode":"Client","id":"_net1","fds":[-1],"rate_limiter_config":null,"pci_segment":0,"offload_tso":true,"offload_ufo":true,"offload_csum":true}],"rng":{"src":"/dev/urandom","iommu":false},"balloon":null,"fs":[{"tag":"kataShared","socket":"/run/kata/e1ae0a05f575a13a535aa95a9990d1fded4766a759f76be0e528c7912d3a5e39/root/virtiofsd.sock","num_queues":1,"queue_size":1024,"id":"_fs2","pci_segment":0}],"pmem":null:"/run/kata/e1ae0a05f57
5a13a535aa95a9990d1fded4766a759f76be0e528c7912d3a5e39/ch-vm.sock","iommu":false,"id":"_vsock3","pci_segment":0},"pvpanic":false,"iommu":false,"numa":null,"watchdog":false,"pci_segments":null,"platform":null,"tpm":null,"landlock_enabl"index":0,"base":3891789824,"size":524288,"type_":"Mmio32","prefetchable":false}}],"parent":null,"children":["_disk0"],"pci_bdf":"0000:00:01.0"},"_virtio-pci-_vsock3":{"id":"_virtio-pci-_vsock3","resources":[{"PciBar":{"index":0,"base":70367622201344,"sizee":false}}],"parent":null,"children":["_fs2"],"pci_bdf":"0000:00:04.0"},"_vsock3":{"id":"_vsock3","resources":[],"parent":"_virtio-pci-_vsock3","children":[],"pci_bdf":null},"_net1":{"id":"_net1","resources":[],"parent":"_virtio-pci-_net1","children":[],"presources":[{"PciBar":{"index":0,"base":70367623774208,"size":524288,"type_":"Mmio64","prefetchable":false}}],"parent":null,"children":["_net1"],"pci_bdf":"0000:00:02.0"},"_virtio-pci-__rng":{"id":"_virtio-pci-__rng","resources":[{"PciBar":{"index":0,"baseesources":[],"parent":null,"children":[],"pci_bdf":null}}}HTTP/1.1 200
                                      Server: Cloud Hypervisor API
                                      Connection: keep-alive
                                      Content-Type: application/json
                                      Content-Length: 4285

                                      {"config":{"cpus":{"boot_vcpus":1,"max_vcpus":32,"topology":{"threads_per_core":1,"cores_per_die":32,"dies_per_package":1,"packagesepage_size":null,"prefault":false,"zones":null,"thp":true},"payload":{"firmware":null,"kernel":"/usr/share/cloud-hypervisor/vmlinux.bin","cmdline":"reboot=k panic=1 systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service agent.log_vport=1025 console=ttyS0,115200n8 root=/dev/vda1 rootflags=data=ordered,errors=remount-ro ro rootfstype=ext4 no_timer_check noreplace-smp systemd.log_target=console agent.container_pipe_size=1 agent.log=debug cgroup_no_v1=all systemd.unified_cgroup_hierarchy=1","miter_config":null,"id":"_disk0","disable_io_uring":false,"disable_aio":false,"pci_segment":0,"serial":null,"queue_affinity":null,"backing_files":false}],"net":[{"tap":null,"ip":"192.168.249.1","mask":"255.255.255.0","mac":"9e:7e:13:ee:03:5c","host_mac":nu,"serial":{"file":null,"mode":"Tty","iommu":false,"socket":null},"console":{"file":null,"mode":"Off","iommu":false,"socket":null},"debug_console":{"file":null,"mode":"Off","iobase":233},"devices":[],"user_devices":null,"vdpa":null,"vsock":{"cid":3,"socket"
                                   3: expected `,` or `}` at line 1 column 1924

                               Stack backtrace:
                                  0: <E as anyhow::context::ext::StdError>::ext_context
                                  1: anyhow::context::<impl anyhow::Context<T,E> for core::result::Result<T,E>>::with_context
                                  2: <hypervisor::ch::CloudHypervisor as hypervisor::Hypervisor>::resize_memory::{{closure}}
                                  3: resource::manager_inner::ResourceManagerInner::update_linux_resource::{{closure}}
                                  4: virt_container::container_manager::container::Container::stop_process::{{closure}}
                                  5: virt_container::container_manager::process::Process::run_io_wait::{{closure}}::{{closure}}
                                  6: tokio::runtime::task::core::Core<T,S>::poll
                                  7: tokio::runtime::task::harness::Harness<T,S>::poll
                                  8: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
                                  9: tokio::runtime::scheduler::multi_thread::worker::Context::run
                                 10: tokio::runtime::context::scoped::Scoped<T>::set
                                 11: tokio::runtime::context::runtime::enter_runtime
                                 12: tokio::runtime::scheduler::multi_thread::worker::run
                                 13: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
                                 14: tokio::runtime::task::core::Core<T,S>::poll
                                 15: tokio::runtime::task::harness::Harness<T,S>::poll
                                 16: tokio::runtime::blocking::pool::Inner::run
                                 17: std::sys::backtrace::__rust_begin_short_backtrace
                                 18: core::ops::function::FnOnce::call_once{{vtable.shim}}
                                 19: std::sys::thread::unix::Thread::new::thread_start
                                 20: <unknown>
                                 21: <unknown>

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-04-20 15:36:00 -05:00
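The serialization described in 93bd2899fb amounts to holding a lock across each request/response pair on the shared socket. A minimal Python sketch of that pattern follows; `FakeApiSocket`, `SerializedClient`, and `run_concurrent` are hypothetical stand-ins, not the Cloud Hypervisor client used by runtime-rs.

```python
import threading


class FakeApiSocket:
    """Hypothetical stand-in for a shared request/response channel."""

    def __init__(self):
        self._chunks = []

    def send(self, request: str) -> None:
        # The response arrives in two chunks, like a header and a body.
        self._chunks.append(f"<{request}|")
        self._chunks.append(f"{request}>")

    def recv(self) -> str:
        return self._chunks.pop(0) + self._chunks.pop(0)


class SerializedClient:
    """Guards the socket so one caller finishes before the next starts."""

    def __init__(self, sock: FakeApiSocket):
        self._sock = sock
        self._lock = threading.Lock()

    def call(self, request: str) -> str:
        # Holding the lock over send and recv keeps responses from interleaving.
        with self._lock:
            self._sock.send(request)
            return self._sock.recv()


def run_concurrent(n: int) -> dict:
    client = SerializedClient(FakeApiSocket())
    results = {}

    def worker(i: int) -> None:
        results[i] = client.call(f"req{i}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Without the lock, two callers could each read one chunk of the other's response, which is the kind of interleaving visible in the error above.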
Fabiano Fidêncio
847f0f40cb Merge pull request #12880 from fidencio/topic/improve-qemu-cache
ci: cache: qemu: Take configure-hypervisor.sh into account
2026-04-20 19:16:01 +02:00
Saul Paredes
f1bcfb8a62 policy: allow neighbors with reachable state
Related to the previous commit, which adds the default gateway neighbor;
that entry has the reachable state.

Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
2026-04-20 10:00:23 -07:00
Saul Paredes
83bbfedc08 network: preseed default-gateway neighbor
This change mirrors host networking into the guest as before, but now also
includes the default gateway neighbor entry for each interface.

Pods using overlay/synthetic gateways (e.g., 169.254.1.1) can hit a
first-connect race while the guest performs the initial ARP. Preseeding the
gateway neighbor removes that latency and makes early connections (e.g.,
to the API Service) deterministic.

Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
2026-04-20 10:00:19 -07:00
Dan Mihai
b2ea9a8fc6 Merge pull request #12460 from microsoft/danmihai1/k8s-openvpn-runtime
tests: annotations for all k8s-openvpn yaml files
2026-04-20 09:47:02 -07:00
stevenhorsman
6b1fd4c782 kata-ctl: Bump reqwest to 0.12
reqwest 0.11 required rustls-webpki 0.101.x, so we had to bump it
to use 0.103.12 to fix CVEs:
- RUSTSEC-2026-0098
- RUSTSEC-2026-0099

Assisted-by IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-20 17:20:54 +01:00
stevenhorsman
9fbdf513ca kata-deploy: Delete Cargo.lock
In #12776 kata-deploy's binary was moved to the main cargo workspace,
but the Cargo.lock wasn't deleted. As it shares the main Cargo.lock, tidy
this up.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-20 17:09:21 +01:00
stevenhorsman
a59afa3154 versions: Update rustls-webpki to 0.103.12
Simple bump to fix CVEs:
- RUSTSEC-2026-0098
- RUSTSEC-2026-0099

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-20 16:24:20 +01:00
Fabiano Fidêncio
b64673196a ci: cache: qemu: Take configure-hypervisor.sh into account
The script is used to change the options used to build QEMU and **must**
be taken into consideration in case something changes, otherwise the
QEMU used by the CI would be the old cached one (ignoring any flag newly
added).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-20 14:52:57 +02:00
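The point of b64673196a is that every file influencing the build must feed into the cache key. A hedged Python sketch of that idea: `cache_key` and `demo_key_changes` are hypothetical, and the actual CI cache computation may differ.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def cache_key(input_files) -> str:
    """Hash the names and contents of every file that influences the build."""
    digest = hashlib.sha256()
    for path in sorted(str(p) for p in input_files):
        digest.update(path.encode())
        digest.update(b"\0")
        digest.update(Path(path).read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def demo_key_changes() -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        # File name taken from the commit; the contents are made up.
        script = os.path.join(tmp, "configure-hypervisor.sh")
        Path(script).write_text("--disable-vmdk\n")
        before = cache_key([script])
        Path(script).write_text("--enable-vmdk\n")
        after = cache_key([script])
        return before != after
```

If the script were left out of `input_files`, a changed build flag would produce the same key and the stale cached build would be reused.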
Fabiano Fidêncio
07731cde21 Merge pull request #12879 from stevenhorsman/confidential-tests-fixes
Confidential tests fixes
2026-04-20 14:33:02 +02:00
stevenhorsman
c75c432c01 ci: Update TEE scope
`k8s-confidential.bats` technically doesn't need attestation, but only runs
on TEE hardware, so include it in the attestation list so we can test it in PRs

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-20 09:36:10 +01:00
stevenhorsman
7179e92142 tests/confidentials: Remove pointless skip
The skip conditional is wrong, but it's not needed as the setup
and teardown only allow confidential hardware anyway

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-20 09:36:10 +01:00
Fupan Li
2629df2785 Merge pull request #12763 from Apokleos/fsmerged-erofs-rs
runtime-rs: support erofs snapshotter with Fsmerge enabled
2026-04-20 11:54:19 +08:00
Alex Lyn
e975b3158b Merge pull request #12837 from stevenhorsman/rand-bump-GHSA-cq8v-f236-94qc
versions: Bump rand crate where possible
2026-04-20 10:05:19 +08:00
Fabiano Fidêncio
d6f0b15578 ci: erofs: restrict to runtime-rs only
The erofs snapshotter configuration is node-wide (a single containerd
drop-in) and cannot be split per runtime handler.  The Go runtime does
not support fsmerged EROFS — it rejects fsmeta.erofs mount sources with
"unsupported mount source" — so erofs is only usable with runtime-rs.

Drop qemu-coco-dev (Go) from the erofs CI matrix and add a check in
kata-deploy's configure_erofs_snapshotter() that inspects the
SNAPSHOTTER_HANDLER_MAPPING: if any Go shim is explicitly mapped to
erofs, emit a prominent warning and bail out with a clear error telling
the operator to fix the mapping.

Since all shims are now guaranteed to be runtime-rs when erofs is
active, remove the conditional is_rust_shim gating and always emit the
full erofs configuration (differ options, default_size,
max_unmerged_layers=1).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-19 13:24:31 +02:00
Fabiano Fidêncio
cf1e6f82f2 tests: Show full kata-deploy pod logs in CI
Remove --tail=N limits from `kubectl logs` for kata-deploy pods so
the complete output is visible in CI job logs for debugging.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-19 13:24:31 +02:00
Alex Lyn
c26f647a3a test: Improve process verification and robustness in kill test
During tests, the following error occurred:
```
..k8s-kill-all-process-in-container.bats: line 40: [: too many arguments
```
This commit addresses the issue as follows:
(1) Update process query command to "ps aux || ps" to ensure
  compatibility across different container images while maximizing
  process visibility.
(2) Use "[t]ail" in grep to reliably match the process without
  self-matching.
(3) Quote variable in assertion to resolve "too many arguments" bash
  error.
(4) Improve test reliability by ensuring the process list is actually
  visible to the verification logic.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-19 13:24:31 +02:00
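The "[t]ail" pattern from c26f647a3a works because the bracket expression matches the text `tail` but not the literal string `[t]ail` that appears in the matcher's own command line. A small Python illustration with a made-up process list; `matching_processes` and the sample lines are hypothetical.

```python
import re


def matching_processes(ps_lines, pattern):
    """Return the process-list lines that match the given regex."""
    return [line for line in ps_lines if re.search(pattern, line)]


# Made-up output resembling a process listing taken while grep is running.
WITH_BRACKETS = ["1 tail -f /dev/null", "7 grep [t]ail"]
WITHOUT_BRACKETS = ["1 tail -f /dev/null", "7 grep tail"]
```

With the plain pattern `tail`, both lines of `WITHOUT_BRACKETS` match, so the matcher counts itself; with `[t]ail`, only the real `tail` process in `WITH_BRACKETS` matches.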
Alex Lyn
f4f6c78e9e tests: Update expectation for no-layer-image test case
The 'no-layer-image' test case was failing because the underlying shim
returned an "unsupported rootfs mounts count" error instead of the
expected application-level "file not found" or "ENOENT" error.

This change updates the BATS test to accept the shim-level rootfs
validation error as a valid failure condition for this unsupported
image scenario, ensuring the CI remains green while reflecting
current runtime behavior.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-19 13:24:31 +02:00
Alex Lyn
be47c2e932 runtime-rs: Avoid share-rw on readonly virtio-scsi/blk devices
Hotplugging a readonly block device could fail with:

  Block node is read-only

The backend block node was created readonly, but the virtio-scsi/blk
frontend path still forced share-rw=true. This is unnecessary and can
cause QEMU to reject the attach because the frontend configuration
does not match the readonly backend.

Fix the virtio-scsi/blk hotplug path by:
- setting read-only for readonly devices where supported
- skipping share-rw for readonly devices

Readonly handling remains in the backend block node configuration,
while the frontend keeps normal disk semantics for block devices.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-19 13:24:31 +02:00
Alex Lyn
02f975f88b runtime-rs: Enforce read-only and shared access for RO block devices
Explicitly configure `read_only` and `force_share` for readonly block
devices to ensure consistency between the image's read-only state and
QEMU's access mode.

Motivation:
Previously, EROFS images were being accessed in a way that triggered
QEMU's exclusive locking (e.g., the 'resize' lock), even when the images
were intended to be read-only. This conflicted with external processes
(e.g., containerd snapshotter) that held read-only handles, resulting in
"Failed to get shared 'resize' lock" errors during blockdev-add.

Changes:
- Set `read_only=true` and `force_share=true` on both format and file
  nodes for VMDK descriptors and Raw images.
- This ensures QEMU requests shared locks, correctly matching the
  read-only nature of EROFS filesystems and preventing write-mode
  locking conflicts with concurrent processes.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-19 13:24:31 +02:00
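For 02f975f88b, the shape of a read-only, shared `blockdev-add` request for a raw image might look like the following. This is a sketch built on QMP's `read-only` and `force-share` block options; the helper name, node name, and file path are hypothetical, and the message runtime-rs actually sends may be structured differently.

```python
import json


def readonly_raw_blockdev(node_name: str, filename: str) -> dict:
    """Build a QMP blockdev-add command with read-only, shared access."""
    return {
        "execute": "blockdev-add",
        "arguments": {
            # Format node.
            "driver": "raw",
            "node-name": node_name,
            "read-only": True,
            "force-share": True,
            # File node underneath the format node.
            "file": {
                "driver": "file",
                "filename": filename,
                "read-only": True,
                "force-share": True,
            },
        },
    }


def as_json(command: dict) -> str:
    return json.dumps(command, sort_keys=True)
```

Setting both options on both nodes mirrors the commit's description of applying them to the format and file nodes.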
Fabiano Fidêncio
9c803d86a6 ci: erofs: Bump containerd to v2.3
To ensure we're using the latest released version of the project, as I
think we're missing patches on v2.2.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-19 13:24:31 +02:00
Fabiano Fidêncio
cdd09c3c65 ci: enable erofs tests with runtime-rs
Now that erofs snapshotter support has been added, let's make sure this is tested.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-19 13:24:31 +02:00
Alex Lyn
7f7cca16fa kata-deploy: Complete containerd config for erofs snapshotter
Add missing containerd configuration items for erofs snapshotter to
enable fsmerged erofs feature:

Add snapshotter plugin configuration:
 - default_size: "10G" # can be customized
 - max_unmerged_layers: 1 # Fixed with 1

These configurations align with the documentation in
docs/how-to/how-to-use-fsmerged-erofs-with-kata.md Step 2,
ensuring the CI workflow run-k8s-tests-coco-nontee-with-erofs-snapshotter
can properly configure containerd for erofs fsmerged rootfs.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-19 13:24:31 +02:00
Fabiano Fidêncio
04e0f1c403 qemu: Enable VMDK block format support
The multi-layer EROFS rootfs feature relies on QEMU's VMDK flat-extent
driver to merge multiple EROFS layers into a single virtual block
device. Replace --disable-vmdk with an explicit --enable-vmdk so the
Kata static QEMU build includes VMDK support.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-19 13:24:31 +02:00
Alex Lyn
27341f45f1 docs: Add how-to guide for using fsmerged EROFS rootfs with Kata
Document the end-to-end workflow for using the containerd EROFS
snapshotter with Kata Containers runtime-rs, covering containerd
configuration, Kata QEMU settings, and pod deployment examples
via crictl/ctr/Kubernetes.

Include prerequisites (containerd >= 2.2, runtime-rs main branch),
QEMU VMDK format verification command, architecture diagram,
VMDK descriptor format reference, and troubleshooting guide.

Note that Cloud Hypervisor, Firecracker, and Dragonball do not
support VMDK block devices and are currently unsupported for
fsmerged EROFS rootfs.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-19 13:24:31 +02:00
Alex Lyn
526126904e runtime-rs: Add support for handling vmdk hotplugging with scsi
We should also support the virtio-scsi driver for handling vmdk format
block devices, which covers more use cases.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-19 13:24:31 +02:00
Alex Lyn
ce3473d272 agent: Kill processes before removing container directory in destroy()
When using multi-layer EROFS snapshotter, the destroy() method fails to
kill container processes, causing process leaks in shared PID namespace
scenarios.

Problem Background:
1. Multi-layer EROFS creates temporary mount points under the container's
  root directory:
  - /run/kata-containers/<cid>/multi-layer/upper (ext4, writable)
  - /run/kata-containers/<cid>/multi-layer/lower-0 (EROFS, read-only)
2. The original destroy() method executed in this order:
  (1) umount rootfs
  (2) fs::remove_dir_all(&self.root) <- FAILS with "Read-only file system"
  (3) cgroup cleanup and process killing <- NEVER EXECUTED
3. When remove_dir_all() encounters the read-only EROFS mount point, it
  returns EROFS error (os error 30), causing destroy() to exit early
  without killing processes.

Why This Fix:
1. The test case k8s-kill-all-process-in-container.bats creates an init
  container with a background process (tail -f /dev/null), expecting it
  to be killed when the init container is destroyed.
2. With shared PID namespace (shareProcessNamespace: true), the orphaned
  process continues running, causing the test to fail.

Solution:
1. Reorder the destroy() method to kill processes BEFORE attempting to
  remove the container directory:
  (1) Get PIDs from cgroup and send SIGKILL
  (2) Destroy cgroup
  (3) umount rootfs
  (4) fs::remove_dir_all(&self.root)
2. This ensures processes are always killed regardless of filesystem
  cleanup status, matching the behavior of overlayfs snapshotter.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-19 13:24:31 +02:00
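The reordering in ce3473d272 can be shown with a toy model: process killing runs first, so a later failure in directory removal no longer skips it. The callables and the `demo` helper below are hypothetical placeholders, not the agent's `destroy()` code.

```python
import errno


def destroy(kill_processes, destroy_cgroup, umount_rootfs, remove_root) -> None:
    """Run teardown in the new order: kill first, filesystem cleanup last."""
    kill_processes()
    destroy_cgroup()
    umount_rootfs()
    remove_root()


def demo() -> list:
    calls = []

    def remove_root():
        calls.append("remove_root")
        # Simulate the read-only EROFS mount point making removal fail.
        raise OSError(errno.EROFS, "Read-only file system")

    try:
        destroy(
            lambda: calls.append("kill"),
            lambda: calls.append("cgroup"),
            lambda: calls.append("umount"),
            remove_root,
        )
    except OSError as err:
        calls.append(f"error:{err.errno == errno.EROFS}")
    return calls
```

Even though removal still fails in this model, the kill and cgroup steps have already run by the time the error surfaces.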
Alex Lyn
c745d18e00 agent: Add virtio-scsi for multilayer erofs storage handler
It aims to support virtio-scsi driver for handling vmdk and rwlayer
storage in kata-agent.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-19 13:24:31 +02:00
Alex Lyn
37a542c20f agent: Refactor multi-layer EROFS handling with unified flow
Refactor the multi-layer EROFS storage handling to improve code
maintainability and reduce duplication.

Key changes:
(1) Extract update_storage_device() to unify device state management
  for both multi-layer and standard storages
(2) Simplify handle_multi_layer_storage() to focus on device creation,
  returning MultiLayerProcessResult struct instead of managing state
(3) Unify the processing flow in add_storages() with clear separation:
(4) Support multiple EROFS lower layers with dynamic lower-N mount paths
(5) Improve mkdir directive handling with deferred {{ mount 1 }}
  resolution

This reduces code duplication, improves readability, and makes the
storage handling logic more consistent across different storage types.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-19 13:24:31 +02:00
Alex Lyn
27c59f15a0 agent: Register MultiLayerErofsHandler and process multiple EROFS
Introduce MultiLayerErofsHandler and the
handle_multi_layer_storage method for multi-layer storage:
(1) Register MultiLayerErofsHandler to STORAGE_HANDLERS to handle
multi-layer EROFS storage with driver type 'multi-layer-erofs'.
(2) Add handle_multi_layer_erofs function to process multiple EROFS
storages with X-kata.multi-layer marker together in guest.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-19 13:24:31 +02:00