Allowing arbitrary symlinks in the shared directory is unsafe for
confidential VM use cases. To make CopyFile safe both for the VM and
for the consuming containers, we implement the following rules for
symlinks (in addition to the existing rules for other files):
1. Symlinks may not be placed directly into the shared directory.
2. Symlinks must not point 'upwards', i.e. contain `..` as a path
element.
3. Symlinks must be relative.
These rules ensure that all writes initiated by CopyFile are restricted
to the shared directory (protecting the VM), and that symlinks can't
point outside their mount points (protecting the container).
These new restrictions mean that we can't support arbitrary mount
sources (which might not follow these rules), but the usual k8s suspects
(ConfigMap, Secret, ServiceAccountToken) should still pass.
In order to aid writing the policy, we convert the CopyFileRequest to a
structure that does not contain binary data, but well-defined strings
and types.
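The three symlink rules above can be sketched as a small check. This is only an illustration, assuming Unix path semantics and a link path given relative to the shared directory; the function and parameter names are hypothetical and not taken from the agent or the policy.

```rust
use std::path::{Component, Path};

/// Checks a symlink against the three rules listed above.
/// `link_path` is the link location relative to the shared directory,
/// `target` is what the link points to. Both names are hypothetical.
fn symlink_allowed(link_path: &Path, target: &Path) -> bool {
    // Rule 1: the link must live in a subdirectory, so its relative
    // path needs at least two regular path elements.
    let nested = link_path
        .components()
        .filter(|c| matches!(c, Component::Normal(_)))
        .count()
        >= 2;
    // Rule 2: no `..` path element anywhere in the target.
    let no_parent = !target
        .components()
        .any(|c| matches!(c, Component::ParentDir));
    // Rule 3: the target must be relative.
    let relative = target.is_relative();
    nested && no_parent && relative
}
```

Note that a target such as `..data/token` passes rule 2, because `..data` is an ordinary directory name and not the `..` path element.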
Signed-off-by: Markus Rudy <mr@edgeless.systems>
The agent referred to the `data` field of an incoming CopyFileRequest
as the 'src'. This is misleading, because 'source' is not mentioned
in the specification (where links are just a path with attached
bytes), and because the documentation for the `ln` utility calls the
path LINK_NAME and the data TARGET. This commit fixes the glitch and
calls the first argument to `symlinkat` the target.
Signed-off-by: Markus Rudy <mr@edgeless.systems>
Building the kata-agent-policy crate only succeeded when its parents
(agent and genpolicy) pulled in the required features. This commit adds
the required features to the crate itself, such that it can be built
standalone and IDEs don't show errors while browsing it.
Signed-off-by: Markus Rudy <mr@edgeless.systems>
When using multi-layer EROFS snapshotter, the destroy() method fails to
kill container processes, causing process leaks in shared PID namespace
scenarios.
Problem Background:
1. Multi-layer EROFS creates temporary mount points under the container's
   root directory:
   - /run/kata-containers/<cid>/multi-layer/upper (ext4, writable)
   - /run/kata-containers/<cid>/multi-layer/lower-0 (EROFS, read-only)
2. The original destroy() method executed in this order:
   (1) umount rootfs
   (2) fs::remove_dir_all(&self.root) <- FAILS with "Read-only file system"
   (3) cgroup cleanup and process killing <- NEVER EXECUTED
3. When remove_dir_all() encounters the read-only EROFS mount point, it
   returns EROFS error (os error 30), causing destroy() to exit early
   without killing processes.
Why This Fix:
1. The test case k8s-kill-all-process-in-container.bats creates an init
   container with a background process (tail -f /dev/null), expecting it
   to be killed when the init container is destroyed.
2. With shared PID namespace (shareProcessNamespace: true), the orphaned
   process continues running, causing the test to fail.
Solution:
1. Reorder the destroy() method to kill processes BEFORE attempting to
   remove the container directory:
   (1) Get PIDs from cgroup and send SIGKILL
   (2) Destroy cgroup
   (3) umount rootfs
   (4) fs::remove_dir_all(&self.root)
2. This ensures processes are always killed regardless of filesystem
   cleanup status, matching the behavior of overlayfs snapshotter.
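The effect of the new ordering can be illustrated with a toy model. The `Teardown` type below is a hypothetical stand-in that only records step names; the real destroy() works on cgroups and mounts. The point of the sketch is that a failing directory removal comes last, so it can no longer skip the process kill.

```rust
/// Records which teardown steps ran. All names are hypothetical.
struct Teardown {
    steps: Vec<&'static str>,
    /// Simulates remove_dir_all() hitting a read-only mount point.
    remove_dir_fails: bool,
}

impl Teardown {
    fn step(&mut self, name: &'static str, fails: bool) -> Result<(), String> {
        self.steps.push(name);
        if fails {
            Err(format!("{name} failed"))
        } else {
            Ok(())
        }
    }

    /// New order: kill processes first, clean up the filesystem last.
    fn destroy(&mut self) -> Result<(), String> {
        self.step("kill processes", false)?;
        self.step("destroy cgroup", false)?;
        self.step("umount rootfs", false)?;
        let fails = self.remove_dir_fails;
        self.step("remove_dir_all", fails)
    }
}
```

Even when the last step returns an error, `steps` already contains the kill and cgroup steps.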
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Refactor the multi-layer EROFS storage handling to improve code
maintainability and reduce duplication.
Key changes:
(1) Extract update_storage_device() to unify device state management
for both multi-layer and standard storages
(2) Simplify handle_multi_layer_storage() to focus on device creation,
returning MultiLayerProcessResult struct instead of managing state
(3) Unify the processing flow in add_storages() with clear separation:
(4) Support multiple EROFS lower layers with dynamic lower-N mount paths
(5) Improve mkdir directive handling with deferred {{ mount 1 }}
resolution
This reduces code duplication, improves readability, and makes the
storage handling logic more consistent across different storage types.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Introduce MultiLayerErofsHandler and the handle_multi_layer_storage
method for multi-layer storage:
(1) Register MultiLayerErofsHandler to STORAGE_HANDLERS to handle
multi-layer EROFS storage with driver type 'multi-layer-erofs'.
(2) Add handle_multi_layer_erofs function to process multiple EROFS
storages with X-kata.multi-layer marker together in guest.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add multi_layer_erofs.rs, implementing the guest-side processing logic
for multi-layer EROFS rootfs with overlay mount support.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add a new extensible GetDiagnosticData RPC that retrieves diagnostic
information from the guest VM. The request carries a log_type string
field to specify what kind of data is requested, and a container_id
field to identify the target container.
The first supported log_type is "termination_log", which reads the
Kubernetes termination message file from inside the guest. This is
needed for shared_fs=none configurations where the host cannot
directly access the guest filesystem.
On the Go runtime side, the container stop() path now calls
GetDiagnosticData to copy the termination message to the host
when running with NoSharedFS and the terminationMessagePolicy
annotation is set to "File". The call is best-effort: failures
are logged as warnings rather than blocking container teardown.
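A rough sketch of how an extensible log_type string could be dispatched is shown here. The field names follow the description above; the Rust struct, the enum, and the function are hypothetical stand-ins and not the generated protobuf types or the agent's handler.

```rust
/// Hypothetical stand-in for the request message.
struct GetDiagnosticDataRequest {
    log_type: String,
    container_id: String,
}

/// Kinds of diagnostic data this sketch knows about.
#[derive(Debug, PartialEq)]
enum DiagnosticKind {
    TerminationLog,
}

/// Maps the free-form `log_type` string to a known kind and rejects
/// anything else. A new kind would add a variant and a match arm.
fn diagnostic_kind(req: &GetDiagnosticDataRequest) -> Result<DiagnosticKind, String> {
    match req.log_type.as_str() {
        "termination_log" => Ok(DiagnosticKind::TerminationLog),
        other => Err(format!(
            "unsupported log_type {other:?} for container {}",
            req.container_id
        )),
    }
}
```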
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Silenio Quarti <silenio_quarti@ca.ibm.com>
The hardcoded DEFAULT_LAUNCH_PROCESS_TIMEOUT of 6 seconds in the kata
agent is insufficient for environments with NVIDIA GPUs and NVSwitches,
where the attestation-agent needs significantly more time to collect
evidence during initialization (e.g. ~2 seconds per NVSwitch).
When the timeout expires, the agent (PID 1) exits with an error, causing
the guest kernel to perform an orderly shutdown before the
attestation-agent has finished starting.
Make this timeout configurable via the kernel parameter
agent.launch_process_timeout (in seconds), preserving the 6-second
default for backward compatibility. The Go runtime is wired up to pass
this value from the TOML config's [agent.kata] section through to the
kernel command line.
The NVIDIA GPU configs set the new default to 15 seconds.
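As an illustration of the kernel parameter handling, the following sketch extracts the value from a command line string. The function name and the fallback to the default on unparsable input are assumptions of this example, not a description of the agent's actual parser.

```rust
use std::time::Duration;

/// Default from the description above: 6 seconds.
const DEFAULT_LAUNCH_PROCESS_TIMEOUT: Duration = Duration::from_secs(6);

/// Looks for `agent.launch_process_timeout=<seconds>` in a kernel
/// command line. Falls back to the default when the parameter is
/// missing or not a number (the latter is an assumption).
fn launch_process_timeout(cmdline: &str) -> Duration {
    cmdline
        .split_whitespace()
        .find_map(|param| param.strip_prefix("agent.launch_process_timeout="))
        .and_then(|value| value.parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or(DEFAULT_LAUNCH_PROCESS_TIMEOUT)
}
```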
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Remove the Virtio9pHandler implementation and its registration
from the storage handler manager:
(1) Remove Virtio9pHandler struct and StorageHandler implementation.
(2) Remove DRIVER_9P_TYPE and Virtio9pHandler from STORAGE_HANDLERS
registration.
(3) Update watcher.rs comments to remove 9p references.
This completes the removal of virtio-9p support in the agent.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
After the agent was moved to the root workspace, the build products are
now under the repo root. Change TARGET_PATH accordingly to tell the
Makefile where to look up the output.
Signed-off-by: Jiahao Wang <jiahao.wang@lingcage.com>
This commit adds the kata agent to the root workspace, as follow-up work
to #12413.
Remove the agent from the exclude list and make it a member of the root
workspace.
Signed-off-by: Jiahao Wang <jiahao.wang@lingcage.com>
It's a dev-dependency that doesn't seem to be used, so
remove it and resolve RUSTSEC-2025-0052.
Assisted-By: Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
- Bump tracing-subscriber to 0.3.20 to resolve RUSTSEC-2025-0055
- Switch deprecated `slog_info!` for `slog::info!`
Generated-By: Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Add an mkfs_opts parameter to cdh_secure_mount so that its users
can parametrize these options depending on their needs. For now,
there are two users providing explicit values (container image
layer storage and container data storage features).
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Calling .unwrap() after an .is_some() check is considered non-idiomatic,
as it performs redundant work and makes the code more verbose.
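A generic before/after pair, not taken from the changed code, shows the pattern. The function names are illustrative only.

```rust
/// Before: checks `is_some()` and then unwraps the same value again.
fn name_len_before(name: Option<&str>) -> usize {
    if name.is_some() {
        name.unwrap().len()
    } else {
        0
    }
}

/// After: `if let` tests and binds the inner value in one step.
fn name_len_after(name: Option<&str>) -> usize {
    if let Some(n) = name {
        n.len()
    } else {
        0
    }
}
```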
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
With the new CDH version, the secure_mount API changes.
Further, the new CDH version no longer uses the luks-encrypt-storage
script but utilizes libcryptsetup as well as mkfs.ext4 and dd. Hence, adapt
some of the CDH and Kata component build steps.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
* Introduces a new cluster_config setting encrypted_emptydir defaulting to true.
* Adapts genpolicy for encrypted emptyDirs.
Crucially, the rules.rego change checks that the mount and the storage are
well-formed together:
* i_storage.source matches a known regex.
* i_storage.mount_point == $(spath)/BASE64(i_storage.source)
* i_storage.mount_point == p_storage.mount_point
* i_storage.mount_point == i_mount.source
Note that policy enforcement is necessary to prevent rogue device injection.
E.g. the agent could not blindly encrypt all block devices as some use cases
only need dm-verity.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
Handles block-based emptyDirs plugged via virtio-blk and virtio-scsi by
encrypting and formatting them.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
Introduce host_memory_mib() with OS-specific implementations
(Linux/Android via nix::sysinfo,
macOS via sysctl) selected at compile time. This improves
portability and allows consistent host memory sizing/validation
across different platforms.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Memory-related information is already serialized at sandbox
initialization, specifically when the configuration TOML is parsed.
This commit therefore refactors the MemoryInfo initialization logic:
(1) Remove memory sizing/host-memory adjustment logic from QEMU cmdline
Memory::new()
(2) Initialize/adjust memory values via kata-types MemoryInfo (single
source of truth)
(3) Replace sysinfo::System::new_with_specifics with
nix::sys::sysinfo::sysinfo() to get host RAM
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
With some older kernels some fs implementations don't handle empty
options strings well. This leads to failures in "setup rootfs" step.
E.g. `cgroup: cgroup2: unknown option ""`.
This is fixed by mapping empty string to `None` before passing to
`nix::mount`.
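The mapping itself is small. A minimal sketch follows, with a hypothetical helper name; the result would be handed to the mount call as its optional data argument.

```rust
/// Maps an empty mount option string to `None`, so that no option
/// data at all is passed on to the mount call.
fn mount_data(options: &str) -> Option<&str> {
    if options.is_empty() {
        None
    } else {
        Some(options)
    }
}
```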
Signed-off-by: Jacek Tomasiak <jtomasiak@arista.com>
Signed-off-by: Jacek Tomasiak <jacek.tomasiak@gmail.com>
Update time to resolve CVE-2026-25727.
Note: this involved bumping the versions of slog-term and slog-json
and bumping the MSRV to 1.88.0 which time 0.3.47 requires.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The previous comment incorrectly implied that `biased` prevents data
loss and that the exit notifier would never be polled before all
buffered data is read. For details, see the Tokio documentation:
https://docs.rs/tokio/latest/src/tokio/macros/select.rs.html#67
Tokio's `biased` only makes polling order deterministic (top-to-bottom)
when multiple branches are ready in the same poll, and it makes fairness
the caller's responsibility. Output can still be truncated if the exit
notification becomes ready while `read_stream` is pending.
This change updates the comment to reflect the actual semantics and
caveats. No functional behavior change.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Short-lived processes (e.g., `kubectl exec echo`) in legacy-io mode
occasionally lose the last segments of their output.
The root cause is a race condition where the `term_exit_notifier`
triggers before the pipe buffers are fully drained. In the previous
implementation, once the exit notification was received, the agent
immediately returned an EOF, causing the runtime's `run_io_copy` to
terminate and drop any residual data in the pipe.
This patch introduces a "drain after exit" mechanism:
- Upon receiving an exit notification, the agent enters a 500ms window
in which it keeps polling `read_stream` to flush remaining data from
the buffer.
- A true EOF is only returned if the stream is confirmed empty or the
timeout is reached.
This ensures reliable output delivery for transient exec tasks under
high concurrency.
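The drain window can be sketched with a synchronous stand-in. The agent works on async streams; this example swaps in a `std::sync::mpsc` channel to keep it self-contained, and the function name is hypothetical.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Collects whatever is still queued on `stream` after the exit
/// notification, for at most `window`. Returns once the stream is
/// closed or the window has elapsed, whichever comes first.
fn drain_after_exit(stream: &Receiver<Vec<u8>>, window: Duration) -> Vec<u8> {
    let deadline = Instant::now() + window;
    let mut out = Vec::new();
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if remaining.is_zero() {
            break; // window elapsed: report EOF now
        }
        match stream.recv_timeout(remaining) {
            Ok(chunk) => out.extend_from_slice(&chunk),
            // Sender gone: the stream is confirmed empty.
            Err(RecvTimeoutError::Disconnected) => break,
            // Nothing arrived before the deadline.
            Err(RecvTimeoutError::Timeout) => break,
        }
    }
    out
}
```

In the terms used above, the caller would pass a 500ms window and return a true EOF only after this function has returned.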
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Legacy IO uses shim polling via read_stdout/read_stderr. The agent
previously mapped pipe EOF (read() == 0) and term_exit_notifier to
errors ("read meet eof"/"eof"), which became ttrpc INTERNAL failures.
This caused runtime IO copy to abort early, leading to lost
stdout/stderr for short-lived exec (e.g. "echo") and spurious failures.
Normalize EOF semantics: read_stream now returns Ok(empty) on EOF
instead of Err("read meet eof").
This makes legacy IO behave like a proper stream: data until EOF, no
INTERNAL errors for normal termination.
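The normalized EOF semantics can be sketched over any `std::io::Read`. The name `read_stream` is reused from the text, but the signature here is an assumption and not the agent's async implementation.

```rust
use std::io::{self, Read};

/// Reads up to `max` bytes. A read of zero bytes (EOF) is reported as
/// `Ok` with an empty buffer rather than as an error.
fn read_stream(reader: &mut impl Read, max: usize) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; max];
    let n = reader.read(&mut buf)?;
    buf.truncate(n);
    Ok(buf)
}
```

A caller can then treat an empty buffer as the end of the stream and stop polling, while real I/O failures still surface as `Err`.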
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
We're introducing a root_complex to assign each and every device
either to a NUMA node or to the default root_complex="00", aka pcie.0.
The current qom path has the form bus/device == "00/02"; with NUMA,
it needs to be extended to root_complex/bus/device == "10/00/02".
This patch introduces the proper handling for that.
We're defaulting to root_complex="00".
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
This comment was first introduced in e111093 with secure_join()
but then we forgot to remove it when we switched to the safe-path
lib in c0ceaf6
Signed-off-by: Qingyuan Hou <lenohou@gmail.com>
Downstream builders at Red Hat complain that `Cargo.lock` doesn't match
`Cargo.toml`.
Run `cargo check` to refresh `Cargo.lock`.
`git bisect` shows that 7cfb97d41b is the first commit where
`cargo check` has an effect in `src/agent`.
Signed-off-by: Greg Kurz <groug@kaod.org>
Change the secure_storage_integrity option's default value to true.
With this, integrity protection for encrypted block device contents
will be requested from the confidential data hub by default, see the
agent's cdh_handler_trusted_storage function in rpc.rs.
This behavior can be disabled by explicitly setting the
agent.secure_storage_integrity parameter to 0 or false via kernel
command line parameters.
This will affect the trusted storage implementation for the guest-pull
mechanism, and it will affect future implementations using this code
path, such as implementations for ephemeral secure storage.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Clippy is recommending that format args are inlined for
better clarity, so update our code to remove these warnings
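A generic before/after pair, not taken from the changed code, shows what the lint (clippy's `uninlined_format_args`) asks for. The function names are illustrative only.

```rust
/// Before: positional arguments after the format string.
fn describe_before(cid: &str, pid: u32) -> String {
    format!("container {} has init pid {}", cid, pid)
}

/// After: the variables are named inside the format string.
fn describe_after(cid: &str, pid: u32) -> String {
    format!("container {cid} has init pid {pid}")
}
```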
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
In #12151 the version was bumped in Cargo.toml, but the update was not
done, so run `cargo update -p container-device-interface` to apply it.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
VirtioBlkCcwDeviceHandler and VirtioBlkCcwHandler
are only constructed on s390x, so add #[cfg(target_arch = "s390x")]
to all the code
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
When updating ephemeral storages, MS_REMOUNT is passed explicitly
because, for instance, `/dev/shm` should be remounted after memory is
hotplugged.
Until now, Kata Containers has been explicitly ignoring such updates,
leaving the containers' `/dev/shm` with the size of "half of the
memory allocated at startup time", which goes against the expected
behaviour.
Signed-off-by: Fabiano Fidêncio <fidencio@northflank.com>
- Replace generic errors in sandbox operations with typed SandboxError variants (InvalidContainerId, InitProcessNotFound, InvalidExecId).
- This enables the kata shim to handle specific failure cases differently.
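A minimal sketch of such typed errors, using only the standard library, is given here. The variant names come from the list above; the payloads, messages, and helper functions are assumptions of this example.

```rust
use std::fmt;

/// Variant names follow the text; payloads are assumptions.
#[derive(Debug, PartialEq)]
enum SandboxError {
    InvalidContainerId(String),
    InitProcessNotFound,
    InvalidExecId(String),
}

impl fmt::Display for SandboxError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SandboxError::InvalidContainerId(cid) => write!(f, "invalid container id: {cid}"),
            SandboxError::InitProcessNotFound => write!(f, "init process not found"),
            SandboxError::InvalidExecId(eid) => write!(f, "invalid exec id: {eid}"),
        }
    }
}

impl std::error::Error for SandboxError {}

/// Hypothetical check that returns a typed error.
fn check_container_id(cid: &str) -> Result<(), SandboxError> {
    if cid.is_empty() {
        Err(SandboxError::InvalidContainerId(cid.to_string()))
    } else {
        Ok(())
    }
}

/// A caller can branch on the variant instead of parsing a message.
fn is_caller_error(err: &SandboxError) -> bool {
    matches!(
        err,
        SandboxError::InvalidContainerId(_) | SandboxError::InvalidExecId(_)
    )
}
```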
Fixes #12120
Signed-off-by: Adeet Phanse <adeet.phanse@mongodb.com>
Allow users to build the Kata Agent using INIT_DATA=no to disable the
detect_initdata_device() code loop and associated debug log output.
Future additional improvements related to Init Data are tracked by #11532.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>