Support multi-layer EROFS storage without an explicit ext4 upper
layer. When runtime-rs sends only EROFS lower storage and overlay
metadata, create the overlay upper/work directories under the
container bundle in /run/kata-containers.
Keep the explicit ext4 rwlayer path for disk-backed snapshots, and
only track real temporary mount points for cleanup. The implicit
/run-backed upper is bundle-scoped state and is removed with the
container bundle.
Assisted-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Commit b56313472 ("agent: Align agent OCI spec with oci-spec-rs",
PR #9944) inverted the condition guarding the create_dir_all call
for process.cwd: the leading `!` was dropped during the refactor.
As a result, the CWD is created only when process.cwd is the empty
string.
When the guest then runs chdir(process.cwd) and CWD doesn't exist
it returns ENOENT. The agent propagates that to the shim, which
surfaces it to containerd as "failed to create shim task: ENOENT:
No such file or directory" — indistinguishable from a missing
argv[0].
This regressed the original fix in PR #2375 (Fixes#2374), which
deliberately mirrored runc's behavior. Put the `!` back.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
Use kata_types::mount::Mount for the final multi-layer EROFS
overlay mount instead of calling baremount() directly.
The mount helper detects overlay option strings close to the kernel
mount data limit. When lowerdir entries share a common parent, it
changes into that directory and rewrites lowerdir to relative paths.
That avoids repeating the same long prefix for every layer.
Multi-layer EROFS images can have many lower layers under
/run/kata-containers/<cid>/multi-layer. Passing the raw absolute
lowerdir list can exceed the mount option buffer and fail the final
overlay mount, even after all layer devices mounted successfully.
Reuse the helper so this path follows Kata's normal overlay mount
handling, including lowerdir compaction before mount(2).
Assisted-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
In GPT mode, all partitions share the same base block device, so
resolving it once per uevent source and caching the result avoids
redundant hotplug waits that would otherwise scale linearly with
layer count.
Layers are sorted by partition number before mounting to guarantee
correct overlay lowerdir precedence regardless of the order the host
emits Storage entries.
And it will remove dead_code attributes to mark the codes working.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The guest agent needs to resolve individual partition devices from a
single GPT-partitioned block device, but the kernel does not always
create partition nodes immediately after the base device appears,
especially when another fd holds the device open during hot-plug.
Add utility functions that handle two problems:
(1) Mapping a base device path to its partition path following the
kernel naming convention (bare suffix vs 'p' separator).
(2) And ensuring the partition node exists before mount.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This commit has No functional change — all callers pass None, so
every call still resolves the device via uevent exactly as before.
It just prepare the multi-layer EROFS handler for GPT partition and
dm-verity support by widening the wait_and_mount_layer() interface
without changing behavior.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
As previous unused codes are with attribute of dead_code which
actually are never used, we'd better remove them totally.
It will remove the mode field from MkdirDirective structure and
also remove its relavent test cases.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Enhance VirtioBlkPciMatcher to only match whole-disk uevents. This
prevents the matcher from incorrectly matching partition uevents
(e.g., /dev/vdaX) which is critical for partitioned disks where
partition uevents appear alongside whole-disk uevents.
This commit aims to eliminate such bad cases.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Refactor ScsiBlockMatcher to only match whole-disk uevents. This
prevents the matcher from incorrectly matching partition uevents
(e.g., block/sdd/sdd9) which is critical for partitioned disks
where partition uevents appear alongside whole-disk uevents.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Extend pcipath_from_dev_tree_path() to support the full NUMA-aware path
format "root_complex/bus/device" (e.g. "10/00/02") in addition to the
legacy "bus/device" format, defaulting to root complex "00" for backward
compatibility.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
The rtnetlink crate has had an API for neighbours since 0.11. The last
attempt to use this API caused problems on AKS, but looking at it again
shows that not all functionality was ported back then (state, flags and
lladdr). Attempt the migration again, considering all parameters.
Fixes: #11942
Signed-off-by: Markus Rudy <mr@edgeless.systems>
VirtioBlkCcwHandler::create_device was calling common_storage_handler
directly, bypassing the handle_block_storage function that checks for
the encryption_key=ephemeral driver option. This meant that encrypted
emptyDir volumes on s390x would attempt a plain mount of the raw block
device instead of setting up dm-crypt via the CDH, resulting in an
EINVAL mount error.
Route CCW block devices through handle_block_storage, matching the
pattern used by VirtioBlkPciHandler.
Fixes: failed to mount /dev/vda to .../storage/..., EINVAL
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
If the policy loading encounters an error, we `abort(3)` the agent for
safety. Since abort causes the process to stop immediately, the async
logs might not be flushed yet, and thus won't make it to the runtime,
hiding the reason for the abort. Wait a bit before aborting so that the
logs are fully written.
Fixes: #13031
Signed-off-by: Markus Rudy <mr@edgeless.systems>
Add the core volume handler for block-encrypted emptyDir support
in runtime-rs, bringing it to parity with the Go runtime (PR #10559).
When emptydir_mode is set to "block-encrypted", host emptyDir bind
mounts are intercepted and handled as follows:
1. A sparse disk image (disk.img) is created inside the emptyDir
folder, sized to match the host filesystem capacity.
2. A mountInfo.json is written under the kata direct-volume root
with volume_type "blk", fs_type "ext4", and metadata
encryptionKey=ephemeral.
3. The disk image is plugged into the guest VM as a virtio-blk
device via the hypervisor device manager.
4. An agent::Storage is built with driver_options containing
encryption_key=ephemeral and shared=true, so the kata-agent
delegates formatting and encryption to CDH using LUKS2.
The volume is registered in the dispatch chain before the regular
block-volume check, and ephemeral disk metadata is tracked for
sandbox-level cleanup at teardown.
Also re-exports EMPTYDIR_MODE_* constants from kata-types::config
so downstream crates can reference them.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
So many unformatted rust codes cause uncommitted change files in
rust runtime and its libs or agent sources, which can be easily
found just by `cargo fmt --all`.
Let's reduce such noisy bad experiences
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Extend the in-guest agent's VFIO device handler to support the cold-plug
flow. When the runtime cold-plugs a GPU before the VM boots, the agent
needs to bind the device to the vfio-pci driver inside the guest and
set up the correct /dev/vfio/ group nodes so the workload can access
the GPU.
This updates the device discovery logic to handle the PCI topology that
QEMU presents for cold-plugged vfio-pci devices and ensures the IOMMU
group is properly resolved from the guest's sysfs.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
`make vendor` isn't required anymore. People who need vendored code should
use the `tools/packaging/release/generate_vendor.sh` script instead.
Assisted-by: Claude AI
Signed-off-by: Greg Kurz <groug@kaod.org>
libc::S_IF* are u16 on Darwin/BSD and u32 on Linux. The match in
FileType::from and its tests mix both widths and don't compile on
Darwin. Cast everything to u32; on Linux that's a no-op, hence the
clippy::unnecessary_cast allow (rust-lang/rust-clippy#6466).
Fixes: #12916
Signed-off-by: Spyros Seimenis <sse@edgeless.systems>
regorus 0.9.0 introduced a hard, per-engine ceiling on parsed-policy
size (1024 columns / 1 MiB / 20 000 lines, see lexer.rs:30 in
microsoft/regorus). The 1024-column cap rejects realistic policies
emitted by `genpolicy`: the `NVIDIA_REQUIRE_CUDA` environment variable
on `nvcr.io/nvidia/k8s/cuda-sample` is roughly 1.3 KiB on a single line,
so the agent's `set_policy()` returns an error, the agent (PID 1) exits,
the guest kernel reboots, and the runtime eventually times out
connecting to the agent's vsock.
regorus PR #624 ("feat: make policy length limits configurable per
engine") adds `Engine::set_policy_length_config`, but it has not been
released yet -- the latest published version is still 0.9.1, which
predates that change.
Pin `regorus` to the upstream commit that includes #624 and call the
new setter from `AgentPolicy::new_engine()` with values that comfortably
fit any policy we expect to evaluate (64 KiB per line, 16 MiB per file,
200 000 lines) while still rejecting pathological/minified input. Once
a regorus release > 0.9.1 ships with #624, the dependency can be moved
back to crates.io.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The version we used before was released in 2024, it's about time to use
a newer version. The new version of the crate comes with a license,
which addresses a `cargo deny` finding.
Signed-off-by: Markus Rudy <mr@edgeless.systems>
Remove unnecessary let binding for unit value expression to fix clippy
warning in Rust 1.93.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Remove unnecessary reference operator from expression that is
immediately dereferenced by the compiler to fix clippy warning in
Rust 1.93.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Replace assert_eq! with literal bool values with assert! or assert!
with negation for more idiomatic assertions to fix clippy warnings in
Rust 1.93.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Replace 'as u8' casts with type suffix literals (_u8) for binary
literals to fix clippy warnings in Rust 1.93.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Replace .iter().any(|&ap| ap == p) with .contains(&p) for more
idiomatic code to fix clippy warning in Rust 1.93.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Remove unnecessary reference operators from format!() calls passed to
Command::arg() to fix clippy warnings in Rust 1.93.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Replace vec![] with array literals [] for immutable test data to fix
clippy warnings in Rust 1.93.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Replace octal escape sequences (\040) with hex escape sequences (\x20)
for space characters in mountinfo test data to fix clippy warning in
Rust 1.93.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Replace decimal literal with cast (0660 as u32) with proper octal
literal syntax (0o660) to fix clippy warning in Rust 1.93.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Replace is_some() check followed by unwrap() with if let pattern
to address clippy::unnecessary_unwrap warning in Rust 1.93.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Device plugins may set PCIDEVICE_* environment variables with
non-PCI identifiers (e.g. "mlx5_core.sf.10" for mlx5 Scalable
Functions). The update_env_pci() function assumed all values were
PCI BDF addresses and failed to parse them, causing container
creation to fail with:
"PCI address mlx5_core.sf.10 should have the format DDDD:BB:SS.F"
Skip PCIDEVICE_* entries whose values don't parse as PCI addresses,
leaving them untouched for the workload. The corresponding _INFO
variable is also left as-is since no mapping is collected.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Allowing arbitrary symlinks in the shared directory is unsafe for
confidential VM use cases. In order to make CopyFile safe both for the
VM as well for the consuming containers, we implement the following
rules for symlinks (in addition to the existing rules for other files):
1. Symlinks may not be placed directly into the shared directory.
2. Symlinks must not point 'upwards', i.e. contain `..` as a path
element.
3. Symlinks must be relative.
These rules ensure that all writes initiated by CopyFile are restricted
to the shared directory (protecting the VM), and that symlinks can't
point outside their mount points (protecting the container).
These new restrictions mean that we can't support arbitrary mount
sources (which might not follow these rules), but the usual k8s suspects
(ConfigMap, Secret, ServiceAccountToken) should still pass.
In order to aid writing the policy, we convert the CopyFileRequest to a
structure that does not contain binary data, but well-defined strings
and types.
Signed-off-by: Markus Rudy <mr@edgeless.systems>
The agent referred to the `data` field of an incoming CopyFileRequest
as the 'src'. This is misleading, because 'source' is not mentioned
in the specification (where links are just a path with attached
bytes), and because the documentation for the `ln` utility calls the
path LINK_NAME and the data TARGET. This commit fixes the glitch and
calls the first argument to `symlinkat` the target.
Signed-off-by: Markus Rudy <mr@edgeless.systems>
Building the kata-agent-policy crate only succeeded when its parents
(agent and genpolicy) pulled in the required features. This commit adds
the required features to the crate itself, such that it can be built
standalone and IDEs don't show errors while browsing it.
Signed-off-by: Markus Rudy <mr@edgeless.systems>
When using multi-layer EROFS snapshotter, the destroy() method fails to
kill container processes, causing process leaks in shared PID namespace
scenarios.
Problem Background:
1. Multi-layer EROFS creates temporary mount points under the container's
root directory:
- /run/kata-containers/<cid>/multi-layer/upper (ext4, writable)
- /run/kata-containers/<cid>/multi-layer/lower-0 (EROFS, read-only)
2. The original destroy() method executed in this order:
(1) umount rootfs
(2) fs::remove_dir_all(&self.root) <- FAILS with "Read-only file system"
(3) cgroup cleanup and process killing <- NEVER EXECUTED
3. When remove_dir_all() encounters the read-only EROFS mount point, it
returns EROFS error (os error 30), causing destroy() to exit early
without killing processes.
Why This Fix:
1. The test case k8s-kill-all-process-in-container.bats creates an init
container with a background process (tail -f /dev/null), expecting it
to be killed when the init container is destroyed.
2. With shared PID namespace (shareProcessNamespace: true), the orphaned
process continues running, causing the test to fail.
Solution:
1. Reorder the destroy() method to kill processes BEFORE attempting to
remove the container directory:
(1) Get PIDs from cgroup and send SIGKILL
(2) Destroy cgroup
(3) umount rootfs
(4) fs::remove_dir_all(&self.root)
2. This ensures processes are always killed regardless of filesystem
cleanup status, matching the behavior of overlayfs snapshotter.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Refactor the multi-layer EROFS storage handling to improve code
maintainability and reduce duplication.
Key changes:
(1) Extract update_storage_device() to unify device state management
for both multi-layer and standard storages
(2) Simplify handle_multi_layer_storage() to focus on device creation,
returning MultiLayerProcessResult struct instead of managing state
(3) Unify the processing flow in add_storages() with clear separation:
(4) Support multiple EROFS lower layers with dynamic lower-N mount paths
(5) Improve mkdir directive handling with deferred {{ mount 1 }}
resolution
This reduces code duplication, improves readability, and makes the
storage handling logic more consistent across different storage types.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Introduce MultiLayerErofsHandler and method of
handle_multi_layer_storage for multi-layer storage:
(1) Register MultiLayerErofsHandler to STORAGE_HANDLERS to handle
multi-layer EROFS storage with driver type 'multi-layer-erofs'.
(2) Add handle_multi_layer_erofs function to process multiple EROFS
storages with X-kata.multi-layer marker together in guest.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add multi_layer_erofs.rs implementing guest-side processing logics
of multi-layer EROFS rootfs with overlay mount support.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add a new extensible GetDiagnosticData RPC that retrieves diagnostic
information from the guest VM. The request carries a log_type string
field to specify what kind of data is requested, and a container_id
field to identify the target container.
The first supported log_type is "termination_log", which reads the
Kubernetes termination message file from inside the guest. This is
needed for shared_fs=none configurations where the host cannot
directly access the guest filesystem.
On the Go runtime side, the container stop() path now calls
GetDiagnosticData to copy the termination message to the host
when running with NoSharedFS and the terminationMessagePolicy
annotation is set to "File". The call is best-effort: failures
are logged as warnings rather than blocking container teardown.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Silenio Quarti <silenio_quarti@ca.ibm.com>
The hardcoded DEFAULT_LAUNCH_PROCESS_TIMEOUT of 6 seconds in the kata
agent is insufficient for environments with NVIDIA GPUs and NVSwitches,
where the attestation-agent needs significantly more time to collect
evidence during initialization (e.g. ~2 seconds per NVSwitch).
When the timeout expires, the agent (PID 1) exits with an error, causing
the guest kernel to perform an orderly shutdown before the
attestation-agent has finished starting.
Make this timeout configurable via the kernel parameter
agent.launch_process_timeout (in seconds), preserving the 6-second
default for backward compatibility. The Go runtime is wired up to pass
this value from the TOML config's [agent.kata] section through to the
kernel command line.
The NVIDIA GPU configs set the new default to 15 seconds.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Remove the Virtio9pHandler implementation and its registration
from the storage handler manager:
(1) Remove Virtio9pHandler struct and StorageHandler implementation.
(2) Remove DRIVER_9P_TYPE and Virtio9pHandler from STORAGE_HANDLERS
registration.
(3) Update watcher.rs comments to remove 9p references.
This completes the removal of virtio-9p support in the agent.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
After agent was moved to root workspace, the products are now under the
repo root. Change the TARGET_PATH accordingly to tell Makefile where to
lookup output.
Signed-off-by: Jiahao Wang <jiahao.wang@lingcage.com>