The shared filesystem device builder in `prepare_virtiofs` was
hardcoding `queue_size = 0` and `queue_num = 0` on the `ShareFsConfig`
it hands to the hypervisor, ignoring `SharedFsInfo.virtio_fs_queue_size`
parsed from `configuration.toml` entirely.
For qemu, this is silently broken: the cmdline generator's
`DeviceVhostUserFs::set_queue_size` treats 0 as "not set" and skips the
`queue-size=` argument when emitting the `vhost-user-fs-pci` device, so
QEMU falls back to its built-in default of 128, regardless of what the
user configured.
For Cloud Hypervisor it happens to work in practice today, but only
because `ch::handle_share_fs_device` and `TryFrom<ShareFsSettings> for
FsConfig` substitute a hardcoded 1024 when the incoming
`queue_num`/`queue_size` are zero. That fallback masks the real bug; the
toml value still never reaches the VMM.
Add a `get_shared_fs_info` accessor on `DeviceManager` mirroring the
existing `get_block_device_info` helper, and use it in
`prepare_virtiofs` to populate `ShareFsConfig.queue_size` from
`SharedFsInfo.virtio_fs_queue_size`. Use a single virtqueue
(`queue_num = 1`), matching what runtime-go hardcodes for both qemu
(govmm `QemuFSParams` does not emit `num-queues=`) and CH
(`numQueues := int32(1)` in `clh.go`).
The CH-side fallback and the CH config template are addressed in a
follow-up commit.
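
The "0 means not set" behaviour described above can be sketched as
follows. This is a minimal illustration, not the runtime-rs cmdline
generator: `VhostUserFsArgs` and `to_cmdline` are hypothetical names,
and only the `queue-size=` handling is modelled.

```rust
/// Hypothetical stand-in for the vhost-user-fs device entry; the real
/// runtime-rs type has more fields and a different API.
struct VhostUserFsArgs {
    queue_size: u64,
}

impl VhostUserFsArgs {
    /// Render the device argument. A queue size of 0 is treated as
    /// "not set", so `queue-size=` is omitted and the VMM falls back
    /// to its own default.
    fn to_cmdline(&self) -> String {
        let mut params = vec!["vhost-user-fs-pci".to_string()];
        if self.queue_size != 0 {
            params.push(format!("queue-size={}", self.queue_size));
        }
        params.join(",")
    }
}
```

With `queue_size` hardcoded to 0 the argument is dropped regardless of
the configured value; once the toml value is plumbed through, it shows
up in the rendered device string.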
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
When the jailer is in use (the default for kata-fc), cmd.Process.Pid in
fcInit() is the jailer's PID, not firecracker's. The jailer forks +
execs firecracker as a separate child and exits. fc.info.PID was
therefore stored as the (soon-to-be-dead) jailer PID.
At sandbox shutdown, fcEnd() calls WaitLocalProcess(fc.info.PID, SIGTERM, ...).
syscall.Kill on the dead jailer PID returns ESRCH, WaitLocalProcess
returns nil immediately, and the real firecracker microVM never
receives a signal. It gets reparented to init and stays alive
indefinitely, holding host resources open. Over many
container lifecycles this becomes a serious resource leak.
Read the real PID from <jailerRoot>/firecracker.pid, which firecracker
itself writes after the exec. Update fc.info.PID with that value so all
downstream code (fcEnd, Save/Load, kill-0 alive checks, NewProc) operates
on the actual firecracker process.
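
The PID file lookup can be sketched as below. The actual change lives
in the Go runtime; this Rust version only illustrates the idea.
`parse_pid` and `read_firecracker_pid` are hypothetical helpers, and
the file is assumed to contain a decimal PID with optional trailing
whitespace.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Parse PID file contents: a decimal PID with optional surrounding
/// whitespace. Returns None for empty, non-numeric or non-positive input.
fn parse_pid(contents: &str) -> Option<i32> {
    let pid: i32 = contents.trim().parse().ok()?;
    if pid > 0 {
        Some(pid)
    } else {
        None
    }
}

/// Read the PID from `<jailer_root>/firecracker.pid`.
fn read_firecracker_pid(jailer_root: &Path) -> io::Result<i32> {
    let raw = fs::read_to_string(jailer_root.join("firecracker.pid"))?;
    parse_pid(&raw)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "invalid PID file"))
}
```

Rejecting non-positive values matters here because signalling PID 0 or
a negative PID addresses a process group rather than a single process.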
Also fix a small adjacent bug in Sandbox.Stop where the per-container
teardown loop ignored the force flag, causing any container.stop error
to short-circuit Stop before stopVM ran.
Signed-off-by: Sebastian Wolf <swolf@nvidia.com>
The two code blocks that extract block device storage source
information for DeviceType::BlockModern and DeviceType::Block are
essentially identical except for the async lock operation.
Extract the common logic into a helper function.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
- Add agent-ctl as a workspace member to simplify dependency
management.
- Also add a test target, as we've been running it in static-checks
without it doing anything.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Fixes the spelling "one ore more events have occured" to
"one or more events have occurred" in the doc comment for the
VsockEpollListener::notify trait method.
Signed-off-by: SAY-5 <say.apm35@gmail.com>
Two tests relied on the side-effect of create_dir_all (removed in
the previous commit) to pass:
(1) test_get_uds_with_sid_ok: use a directory name that actually
starts with the search prefix so prefix matching works without
creating dirs.
(2) test_get_uds_with_sid_with_zero: assert Err on zero matches
instead of Ok, matching the corrected lookup behavior.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
When running kata-ctl exec <short-id>, kata-ctl may fail with:
"more than one sandbox exists with the provided prefix "ed07",
please provide a unique prefix".
At the same time, a new subdirectory named <short-id> is incorrectly
created under /run/kata/. This is wrong behavior: a short ID should be
used only to match an existing sandbox by prefix, and must not trigger
creation of a new sandbox directory when lookup fails or is ambiguous.
Update the exec path to perform prefix matching and return an error on
no match or non-unique matches, without creating any new directories.
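
A minimal sketch of the intended lookup semantics, assuming the
sandbox IDs have already been listed from /run/kata/.
`resolve_sandbox_id` is a hypothetical name, and the "no sandbox
found" wording is illustrative; the real kata-ctl code may be
structured differently.

```rust
/// Resolve a short ID to exactly one sandbox ID by prefix.
/// `sandbox_ids` stands in for the directory names under /run/kata/.
/// Nothing is created on the filesystem, whatever the outcome.
fn resolve_sandbox_id<'a>(prefix: &str, sandbox_ids: &[&'a str]) -> Result<&'a str, String> {
    let matches: Vec<&'a str> = sandbox_ids
        .iter()
        .copied()
        .filter(|id| id.starts_with(prefix))
        .collect();
    match matches.as_slice() {
        // Zero matches is an error, not a reason to create a directory.
        [] => Err(format!("no sandbox found with the provided prefix {:?}", prefix)),
        [only] => Ok(*only),
        _ => Err(format!(
            "more than one sandbox exists with the provided prefix {:?}, please provide a unique prefix",
            prefix
        )),
    }
}
```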
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
sb_storage_path() is a path accessor shared by both server (shim) and
client (kata-ctl). Having it call create_dir_all(KATA_PATH) on every
invocation is incorrect: the client side should never create directories
— if /run/kata/ does not exist, no shim is running.
Move the directory creation to MgmtServer::new(), which is the
server-side component that manages the shim management socket under
KATA_PATH.
Make sb_storage_path() a pure accessor returning &'static str directly.
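
The split between a pure accessor and server-side setup might look
like the sketch below. The value of `KATA_PATH` is assumed from the
/run/kata/ path mentioned above, and `init_server_storage` is a
hypothetical stand-in for the setup done in MgmtServer::new().

```rust
use std::fs;
use std::io;

/// Assumed value, based on the /run/kata/ path mentioned above.
const KATA_PATH: &str = "/run/kata";

/// Pure accessor: no filesystem access, so it is safe to call from
/// both the client (kata-ctl) and the server (shim).
fn sb_storage_path() -> &'static str {
    KATA_PATH
}

/// Hypothetical server-side setup: only the server creates the
/// directory, and it takes the root as a parameter for testability.
fn init_server_storage(root: &str) -> io::Result<()> {
    fs::create_dir_all(root)
}
```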
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
We are already masking systemd-networkd.service, which causes systemd to
log an error about the socket still being enabled. In runtime-go, we're
masking the socket, so mask it in runtime-rs, too.
Signed-off-by: Markus Rudy <mr@edgeless.systems>
genpolicy supports building and testing on Darwin, both for Kata
developers and for users of the tool. In CI, we currently only test
the binary build on Darwin; the tests are only executed on Linux.
Since we aim to support development on darwin, including test execution,
we need to prevent regressions such as [1]. This commit adds the test
binaries to the `make build` target, such that they are covered by
`ci/darwin-tests.sh`.
In order to avoid unnecessary recompilation between the build and test
target, we align the `--release` handling between the two.
[1]: 639ff3578d
Signed-off-by: Markus Rudy <mr@edgeless.systems>
The test currently uses a static directory at `/tmp/initimg_test`. This
introduces non-determinism into the unit test:
* Files that already exist in that dir might alter test results.
* If the directory is owned by root, the test will fail due to
permissions.
Switch to using the tempfile crate instead.
Fixes: #13053
Signed-off-by: Markus Rudy <mr@edgeless.systems>
The initdata is currently being decoded, and then re-encoded with the
to_string function. This will usually not preserve the original initdata
document, and thus the initdata hash will differ between the annotation
and the block device.
This commit changes the logic to only decode the base64, but keep the
initdata document intact. Since the error message is now nested, adjust
the tests to look for the expected error in the chain.
Fixes: #12951
Signed-off-by: Markus Rudy <mr@edgeless.systems>
VirtioBlkCcwHandler::create_device was calling common_storage_handler
directly, bypassing the handle_block_storage function that checks for
the encryption_key=ephemeral driver option. This meant that encrypted
emptyDir volumes on s390x would attempt a plain mount of the raw block
device instead of setting up dm-crypt via the CDH, resulting in an
EINVAL mount error.
Route CCW block devices through handle_block_storage, matching the
pattern used by VirtioBlkPciHandler.
Fixes: failed to mount /dev/vda to .../storage/..., EINVAL
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Store the hotplugged CCW address in BlockModern configs and use it when
building storage sources so s390x encrypted emptyDir paths no longer
fall back to /dev/vda.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
If the policy loading encounters an error, we `abort(3)` the agent for
safety. Since abort causes the process to stop immediately, the async
logs might not be flushed yet, and thus won't make it to the runtime,
hiding the reason for the abort. Wait a bit before aborting so that the
logs are fully written.
Fixes: #13031
Signed-off-by: Markus Rudy <mr@edgeless.systems>
Group the shared-context parameters (share_fs, device_manager, sid,
agent, emptydir_mode) into a VolumeContext struct so handler_volumes
stays within clippy's argument count limit and avoids -D warnings
breakage in CI.
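
A rough sketch of the grouping, using placeholder types so it compiles
on its own. The field types, the `Option` on `share_fs`, and the
`describe_volume` function are assumptions for illustration; only the
field names come from this commit.

```rust
// Placeholder types; the real runtime-rs types differ.
struct ShareFs;
struct DeviceManager;
struct Agent;

/// Shared context passed to volume handlers as a single argument.
struct VolumeContext<'a> {
    share_fs: Option<&'a ShareFs>,
    device_manager: &'a DeviceManager,
    sid: &'a str,
    agent: &'a Agent,
    emptydir_mode: &'a str,
}

/// Hypothetical handler: one context plus per-call data, instead of
/// five separate shared parameters.
fn describe_volume(ctx: &VolumeContext<'_>, mount_source: &str) -> String {
    format!(
        "sandbox {} handles {} in {} mode",
        ctx.sid, mount_source, ctx.emptydir_mode
    )
}
```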
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Remove the runtime-rs skip from the trusted ephemeral data storage
test now that runtime-rs implements block-encrypted emptyDir volumes.
Also remove the genpolicy drop-in that disabled encrypted_emptydir
for runtime-rs and the corresponding copy logic in tests_common.sh.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Add the emptydir_mode configuration option to all runtime-rs config
template files. CoCo configs (snp, tdx, se, coco-dev, nvidia-gpu-snp,
nvidia-gpu-tdx) default to block-encrypted via @DEFEMPTYDIRMODE_COCO@,
while non-CoCo configs (qemu, nvidia-gpu, fc) default to shared-fs
via @DEFEMPTYDIRMODE@.
Also add DEFEMPTYDIRMODE and DEFEMPTYDIRMODE_COCO variables to the
runtime-rs Makefile for template substitution.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
When emptydir_mode is "block-encrypted", host emptyDir paths must
remain as "bind" mounts so the EncryptedEmptyDirVolume handler can
intercept them in the volume dispatch chain. Previously,
update_ephemeral_storage_type() would unconditionally convert them
to "local" type, causing them to be handled as plain local volumes
instead.
Add the emptydir_mode parameter to update_ephemeral_storage_type()
and its call chain (amend_spec in container.rs) and skip the
host-emptyDir-to-local conversion when the mode is block-encrypted.
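
The decision described above can be sketched as a small pure
function. `ephemeral_storage_type` and its parameters are
hypothetical; the real update_ephemeral_storage_type() operates on
the OCI spec mounts rather than on plain strings.

```rust
const EMPTYDIR_MODE_BLOCK_ENCRYPTED: &str = "block-encrypted";

/// Decide the mount type for a mount entry. Host emptyDir mounts are
/// converted to "local" unless the mode is block-encrypted, in which
/// case the original type (e.g. "bind") is kept so a later handler
/// can intercept it.
fn ephemeral_storage_type(
    current_type: &str,
    is_host_emptydir: bool,
    emptydir_mode: &str,
) -> String {
    if is_host_emptydir && emptydir_mode != EMPTYDIR_MODE_BLOCK_ENCRYPTED {
        "local".to_string()
    } else {
        current_type.to_string()
    }
}
```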
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Add the core volume handler for block-encrypted emptyDir support
in runtime-rs, bringing it to parity with the Go runtime (PR #10559).
When emptydir_mode is set to "block-encrypted", host emptyDir bind
mounts are intercepted and handled as follows:
1. A sparse disk image (disk.img) is created inside the emptyDir
folder, sized to match the host filesystem capacity.
2. A mountInfo.json is written under the kata direct-volume root
with volume_type "blk", fs_type "ext4", and metadata
encryptionKey=ephemeral.
3. The disk image is plugged into the guest VM as a virtio-blk
device via the hypervisor device manager.
4. An agent::Storage is built with driver_options containing
encryption_key=ephemeral and shared=true, so the kata-agent
delegates formatting and encryption to CDH using LUKS2.
The volume is registered in the dispatch chain before the regular
block-volume check, and ephemeral disk metadata is tracked for
sandbox-level cleanup at teardown.
Also re-export the EMPTYDIR_MODE_* constants from kata-types::config
so downstream crates can reference them.
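
Step 1 above (the sparse disk image) can be sketched with the
standard library alone. `create_sparse_disk` is a hypothetical helper,
and the size is passed in as a parameter here; querying the host
filesystem capacity is left out of the sketch.

```rust
use std::fs::File;
use std::io;
use std::path::{Path, PathBuf};

/// Create `disk.img` inside `emptydir` with an apparent size of
/// `size_bytes`.
fn create_sparse_disk(emptydir: &Path, size_bytes: u64) -> io::Result<PathBuf> {
    let image = emptydir.join("disk.img");
    let file = File::create(&image)?;
    // set_len extends the file without writing data; on filesystems
    // that support sparse files, no data blocks are allocated yet.
    file.set_len(size_bytes)?;
    Ok(image)
}
```

Because blocks are only allocated as the guest writes, sizing the
image to the full filesystem capacity does not consume that space up
front.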
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
The proto Storage message already has a "shared" field (field 8),
but the runtime-rs agent crate's internal Storage struct was
missing it, so it was never forwarded to the kata-agent.
Add the field to the Rust struct and its From<Storage> translation,
and update all explicit struct initialisers across the resource
crate to include shared: false so the build stays clean.
This is needed for trusted ephemeral data storage, where the
agent uses the shared flag to avoid premature cleanup of volumes
that are shared across containers in a pod.
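
A simplified sketch of forwarding the field. Both structs are
stand-ins with only three fields each (the real Storage types carry
more), and `ProtoStorage` is a hypothetical name for the generated
protobuf message.

```rust
/// Simplified stand-in for the runtime-side Storage struct.
#[derive(Default)]
struct Storage {
    driver: String,
    source: String,
    shared: bool,
}

/// Simplified stand-in for the generated protobuf message.
#[derive(Default, Debug, PartialEq)]
struct ProtoStorage {
    driver: String,
    source: String,
    shared: bool,
}

impl From<Storage> for ProtoStorage {
    fn from(s: Storage) -> Self {
        ProtoStorage {
            driver: s.driver,
            source: s.source,
            // Without this line the flag would silently default to
            // false on the wire.
            shared: s.shared,
        }
    }
}
```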
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Add add_volume_mount_info(), is_volume_mounted(), and
remove_volume_path() to the mount module. These mirror the Go
helpers (AddMountInfo, IsVolumeMounted, Remove) in
src/runtime/pkg/direct-volume/utils.go and are needed by the
upcoming EncryptedEmptyDirVolume to write and clean up
mountInfo.json metadata for block-encrypted emptyDir volumes.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Add the emptydir_mode field to the Runtime configuration struct,
allowing runtime-rs to read the emptyDir handling mode from the
TOML config file. This is groundwork for trusted ephemeral data
storage support in runtime-rs (parity with the Go runtime).
Two modes are supported:
- shared-fs (default): share emptyDir via virtio-fs/9p.
- block-encrypted: plug a block device encrypted in-guest via
CDH/LUKS2.
Empty values default to "shared-fs"; unknown values are rejected
during validation.
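
The defaulting and validation rules can be sketched as below.
`normalize_emptydir_mode` is a hypothetical name and the error text is
illustrative; the constant values are the two mode strings listed
above.

```rust
const EMPTYDIR_MODE_SHARED_FS: &str = "shared-fs";
const EMPTYDIR_MODE_BLOCK_ENCRYPTED: &str = "block-encrypted";

/// Map a configured value to a known mode. An empty value falls back
/// to shared-fs; anything unrecognised is rejected.
fn normalize_emptydir_mode(value: &str) -> Result<&'static str, String> {
    match value {
        "" | EMPTYDIR_MODE_SHARED_FS => Ok(EMPTYDIR_MODE_SHARED_FS),
        EMPTYDIR_MODE_BLOCK_ENCRYPTED => Ok(EMPTYDIR_MODE_BLOCK_ENCRYPTED),
        other => Err(format!("invalid emptydir_mode {:?}", other)),
    }
}
```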
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Add a firmware module to the dbs_boot crate that allows the guest VM
to load tdshim into memory, a prerequisite for booting a TDX VM. The
other sections (including the kernel payload and cmdline) are also
loaded at the correct guest physical addresses according to the
tdshim layout.
Signed-off-by: Xiaofan Xxf <xiaofan.xxf@antgroup.com>
Based on the current runtime-go behaviour introduced in
https://github.com/kata-containers/kata-containers/pull/9195.
When using static resources, always set maxvcpus equal to vcpus.
The static resources case does not support CPU hotplug, so the
maximum number of vCPUs should be limited to the number of vCPUs.
Booting with a high maximum vCPU count is also slightly slower than
booting with a lower one.
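
A minimal sketch of the rule, with a hypothetical `CpuInfo` struct and
`apply_static_resources` function standing in for the real
configuration types:

```rust
/// Hypothetical subset of the CPU settings.
struct CpuInfo {
    default_vcpus: u32,
    default_maxvcpus: u32,
}

/// With static resources there is no CPU hotplug, so the maximum is
/// clamped to the boot-time vCPU count. Otherwise it is left alone.
fn apply_static_resources(cpu: &mut CpuInfo, static_resources: bool) {
    if static_resources {
        cpu.default_maxvcpus = cpu.default_vcpus;
    }
}
```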
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
Add Kubernetes nodeAffinity structures so genpolicy can parse Pod
YAMLs that carry scheduling constraints ignored by policy.
Cover the shape in the ignored-fields fixture alongside the
existing Pod affinity and anti-affinity data.
Assisted-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Because intptr() returns a fresh pointer on every call, the existing
comparisons compared addresses, never values, so every check evaluated
to false.
As a result, /dev/null, /dev/urandom, /dev/ptmx, /dev/loop-control and
/dev/loop* were appended to the devices allowlist for sandbox_cgroup
even when the runtime spec already listed them, producing duplicate
entries.
Switch to nil-safe value comparisons via a type switch on the cgroup
device type and dereferenced *d.Major / *d.Minor, keeping the same
detection semantics but actually matching existing entries.
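
The bug class can be shown with a Rust analog. The fix itself is in
Go; `int_ptr`, `same_address` and `major_equals` are hypothetical
names used only to contrast address comparison with nil-safe value
comparison.

```rust
/// Returns a fresh heap allocation on every call, like a helper that
/// returns a pointer to a copy of its argument.
fn int_ptr(v: i64) -> Box<i64> {
    Box::new(v)
}

/// Buggy analog: compares addresses, so two separately allocated
/// values never match even when they hold the same number.
fn same_address(a: &i64, b: &i64) -> bool {
    std::ptr::eq(a, b)
}

/// Nil-safe value comparison: an absent field never matches, and a
/// present one is compared by its contents.
fn major_equals(major: Option<&i64>, want: i64) -> bool {
    major.map_or(false, |m| *m == want)
}
```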
Assisted-By: Claude 4.7
Signed-off-by: PiotrProkop <pprokop@nvidia.com>