A TDX VM requires that the guest memfd be managed by KVM, so that
KVM can toggle the memory attribute of the region between shared
and private. Therefore, only anonymous guest memory is allowed
for a TDX VM, and the KVM-managed memfd must be created with the
KVM_CREATE_GUEST_MEMFD ioctl instead of the memfd_create
system call. Also, to bind this memfd to the corresponding
memory region, KVM_SET_USER_MEMORY_REGION2 must be invoked
instead of KVM_SET_USER_MEMORY_REGION.
Signed-off-by: Xiaofan Xxf <xiaofan.xxf@antgroup.com>
The script was creating .cargo/config.toml but referencing .cargo/config
in the vendor_dir_list, causing tar to fail with 'Cannot stat' error.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
After commit e2240b694a ("runtime-rs: ch: source virtio-fs queue size
from toml"), Cloud Hypervisor no longer provides fallback defaults for
virtio-fs queue configuration. When queue_size or queue_num are 0, CH
now uses those values directly instead of substituting defaults, which
causes a panic in the device manager.
The agent-ctl tool was hardcoding queue_size=0 and queue_num=0 in
share_fs_utils.rs, relying on CH's fallback behavior. This broke the
agent-api tests for Cloud Hypervisor while QEMU tests continued to pass.
Fix by reading virtio_fs_queue_size from the hypervisor config and
falling back to sensible defaults (1024 queue size, 1 queue) when not
configured, matching the previous CH default behavior.
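A minimal sketch of the fallback described above. The function and
constant names are hypothetical (they are not the identifiers used in
share_fs_utils.rs), and it assumes the queue count is always the
single-queue default because only the queue size is read from the config.

```rust
/// Hypothetical defaults mirroring the values named above
/// (1024 queue size, 1 queue).
const FALLBACK_QUEUE_SIZE: u32 = 1024;
const FALLBACK_QUEUE_NUM: u32 = 1;

/// Returns (queue_size, queue_num), treating a configured size of 0
/// as "not configured" and substituting the fallback.
fn resolve_fs_queues(configured_queue_size: u32) -> (u32, u32) {
    let queue_size = if configured_queue_size == 0 {
        FALLBACK_QUEUE_SIZE
    } else {
        configured_queue_size
    };
    (queue_size, FALLBACK_QUEUE_NUM)
}
```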
Generated-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The agent-ctl tests are failing in CI, but no logs are reported,
so debugging is not possible. Add some debug output to help.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Now that `prepare_virtiofs` populates `ShareFsConfig` from
`SharedFsInfo.virtio_fs_queue_size`, the CH-side fallback that
substitutes `DEFAULT_FS_QUEUE_SIZE` (1024) when the incoming
`queue_num`/`queue_size` are zero is no longer needed. Drop it from
both `handle_share_fs_device` and `TryFrom<ShareFsSettings> for
FsConfig` and use the values straight from the config. Drop the now
unused `DEFAULT_FS_QUEUES` and `DEFAULT_FS_QUEUE_SIZE` constants.
This also removes a latent bug in both call sites: the previous code
gated `queue_size` on `queue_num > 0`, so a user setting only the
queue size and not the (currently unconfigurable) queue count would
have had their `queue_size` silently overwritten by the default.
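An illustrative sketch of that latent bug. The function names, the
integer types and the value of `DEFAULT_FS_QUEUES` are assumptions; only
`DEFAULT_FS_QUEUE_SIZE` = 1024 is taken from the text above.

```rust
const DEFAULT_FS_QUEUES: usize = 1; // assumed value
const DEFAULT_FS_QUEUE_SIZE: u16 = 1024;

/// Shape of the old fallback as described above: both values are
/// decided by `queue_num` alone, so a configured `queue_size` is
/// discarded whenever `queue_num` is 0.
fn old_fallback(queue_num: usize, queue_size: u16) -> (usize, u16) {
    if queue_num > 0 {
        (queue_num, queue_size)
    } else {
        (DEFAULT_FS_QUEUES, DEFAULT_FS_QUEUE_SIZE)
    }
}

/// After the change: the values from the config are used unchanged.
fn new_passthrough(queue_num: usize, queue_size: u16) -> (usize, u16) {
    (queue_num, queue_size)
}
```

With `queue_num = 0` and `queue_size = 256`, the old shape yields the
default size rather than 256.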
The CH config template (`configuration-clh-runtime-rs.toml.in`) did
not ship the `virtio_fs_queue_size` key (unlike the qemu-runtime-rs
templates), so without an explicit override the field would have
deserialized to 0 and the fallback would have been the only thing
keeping CH working. Add the key to the template, defaulted to
`@DEFVIRTIOFSQUEUESIZE@` (1024), matching the qemu-runtime-rs
templates.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
The shared filesystem device builder in `prepare_virtiofs` was
hardcoding `queue_size = 0` and `queue_num = 0` on the `ShareFsConfig`
it hands to the hypervisor, ignoring `SharedFsInfo.virtio_fs_queue_size`
parsed from `configuration.toml` entirely.
For qemu, this is silently broken: the cmdline generator's
`DeviceVhostUserFs::set_queue_size` treats 0 as "not set" and skips the
`queue-size=` argument when emitting the `vhost-user-fs-pci` device, so
QEMU falls back to its built-in default of 128, regardless of what the
user configured.
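A rough sketch of the "0 means not set" behaviour described above. The
function `queue_size_arg` is hypothetical and is not the actual
`DeviceVhostUserFs` code; it only shows why a hardcoded 0 never reaches
the QEMU command line.

```rust
/// Returns the `queue-size=` device argument, or None when the value
/// is 0 and is therefore treated as "not set".
fn queue_size_arg(queue_size: u64) -> Option<String> {
    if queue_size == 0 {
        None
    } else {
        Some(format!("queue-size={queue_size}"))
    }
}
```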
For Cloud Hypervisor it happens to work in practice today, but only
because `ch::handle_share_fs_device` and `TryFrom<ShareFsSettings> for
FsConfig` substitute a hardcoded 1024 when the incoming
`queue_num`/`queue_size` are zero. That fallback masks the real bug; the
toml value still never reaches the VMM.
Add a `get_shared_fs_info` accessor on `DeviceManager` mirroring the
existing `get_block_device_info` helper, and use it in
`prepare_virtiofs` to populate `ShareFsConfig.queue_size` from
`SharedFsInfo.virtio_fs_queue_size`. Use a single virtqueue
(`queue_num = 1`), matching what runtime-go hardcodes for both qemu
(govmm `QemuFSParams` does not emit `num-queues=`) and CH
(`numQueues := int32(1)` in `clh.go`).
The CH-side fallback and the CH config template are addressed in a
follow-up commit.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
When the jailer is in use (the default for kata-fc), cmd.Process.Pid in
fcInit() is the jailer's PID, not firecracker's. The jailer forks +
execs firecracker as a separate child and exits. fc.info.PID was
therefore stored as the (soon-to-be-dead) jailer PID.
At sandbox shutdown, fcEnd() calls WaitLocalProcess(fc.info.PID, SIGTERM, ...).
syscall.Kill on the dead jailer PID returns ESRCH, WaitLocalProcess
returns nil immediately, and the real firecracker microVM never
receives a signal. It gets reparented to init and stays alive
indefinitely, holding open resources from the host. Over many
container lifecycles this becomes a serious resource leak.
Read the real PID from <jailerRoot>/firecracker.pid, which firecracker
itself writes after the exec. Update fc.info.PID with that value so all
downstream code (fcEnd, Save/Load, kill-0 alive checks, NewProc) operates
on the actual firecracker process.
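The PID lookup can be sketched as follows. This is an illustration in
Rust, not the Go code of this change; `read_pid_file` is a hypothetical
helper, and a real implementation may additionally have to wait for the
file to appear after the exec.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reads and parses a PID from a pid file such as
/// `<jailerRoot>/firecracker.pid`. Surrounding whitespace is ignored;
/// unparsable content is reported as InvalidData.
fn read_pid_file(path: &Path) -> io::Result<i32> {
    let raw = fs::read_to_string(path)?;
    raw.trim()
        .parse::<i32>()
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```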
Also fix a small adjacent bug in Sandbox.Stop where the per-container
teardown loop ignored the force flag, causing any container.stop error
to short-circuit Stop before stopVM ran.
Signed-off-by: Sebastian Wolf <swolf@nvidia.com>
We sometimes get this error when creating the pod sandbox:
failed to create shim task: Failed to add qdisc for network index 2 : device or resource busy.
Add a linear backoff retry when adding the qdisc to help mitigate the
issue at the source and avoid the cascading error.
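A generic sketch of a linear backoff retry. `retry_linear`, its
parameters and the delay policy are assumptions made for illustration;
the attempt count and delay used by the actual change are not stated
here.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Calls `op` at least once and at most `max_attempts` times. After
/// the n-th failed attempt it sleeps for `base_delay * n`, so the wait
/// grows linearly. The last error is returned if every attempt fails.
fn retry_linear<T, E, F>(max_attempts: u32, base_delay: Duration, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    let mut attempt = 1;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => {
                sleep(base_delay * attempt);
                attempt += 1;
            }
        }
    }
}
```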
Signed-off-by: Mayeul Blanzat <mayeul.blanzat@datadoghq.com>
The two code blocks that extract the storage source information of a
block device for DeviceType::BlockModern/Block are
essentially identical except for the async lock operation.
Extract the common logic into a helper function.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
kata-deploy is a per-node infrastructure DaemonSet; if it gets evicted
under node memory/CPU pressure the node loses its Kata runtime until
the pod is rescheduled. Default to system-node-critical so the kubelet
evicts lower-priority workloads first.
The value is configurable via `priorityClassName` in values.yaml.
Fixes: #13068
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Moving agent-ctl into the root workspace moves the target
directory, so update this target to be in root, not src/tools.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
- Make agent-ctl a workspace member to simplify
dependency management.
- Also add a test target, as we've been running it in static-checks
without it doing anything.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The tests haven't been run since at least the move to GHA,
so in the spirit of lean and mean, let's clear them out.
Fixes: #10957
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Fixes the spelling "one ore more events have occured" to
"one or more events have occurred" in the doc comment for the
VsockEpollListener::notify trait method.
Signed-off-by: SAY-5 <say.apm35@gmail.com>
Two tests relied on the side-effect of create_dir_all (removed in
the previous commit) to pass:
(1) test_get_uds_with_sid_ok: use a directory name that actually
starts with the search prefix so prefix matching works without
creating dirs.
(2) test_get_uds_with_sid_with_zero: assert Err on zero matches
instead of Ok, matching the corrected lookup behavior.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
When running kata-ctl exec <short-id>, kata-ctl may fail with:
"more than one sandbox exists with the provided prefix "ed07",
please provide a unique prefix".
At the same time, a new subdirectory named <short-id> is incorrectly
created under /run/kata/. This is wrong behavior: a short ID should be
used only to match an existing sandbox by prefix, and must not trigger
creation of a new sandbox directory when lookup fails or is ambiguous.
Update the exec path to perform prefix matching and return an error on
no match or non-unique matches, without creating any new directories.
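The intended lookup can be sketched as below. `resolve_sandbox_id` is a
hypothetical helper working on an in-memory list of IDs; the real code
enumerates the sandboxes under /run/kata/, and the wording of the
"no match" error is an assumption.

```rust
/// Resolves a short ID against the known sandbox IDs. It has no side
/// effects: it returns the unique match, or an error when there is no
/// match or more than one.
fn resolve_sandbox_id<'a>(ids: &[&'a str], prefix: &str) -> Result<&'a str, String> {
    let mut matches = ids.iter().copied().filter(|id| id.starts_with(prefix));
    match (matches.next(), matches.next()) {
        (Some(id), None) => Ok(id),
        (None, _) => Err(format!(
            "no sandbox found with the provided prefix {prefix:?}"
        )),
        (Some(_), Some(_)) => Err(format!(
            "more than one sandbox exists with the provided prefix {prefix:?}, \
             please provide a unique prefix"
        )),
    }
}
```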
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
sb_storage_path() is a path accessor shared by both server (shim) and
client (kata-ctl). Having it call create_dir_all(KATA_PATH) on every
invocation is incorrect: the client side should never create directories
— if /run/kata/ does not exist, no shim is running.
Move the directory creation to MgmtServer::new(), which is the server-
side component that manages the shim management socket under KATA_PATH.
Make sb_storage_path() a pure accessor returning &'static str directly.
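A simplified sketch of the resulting split. The value of `KATA_PATH` and
the body of `MgmtServer::new()` are assumptions; the real constructor
takes arguments and does more than create the directory.

```rust
use std::fs;
use std::io;

/// Assumed value; the text above only refers to /run/kata/.
const KATA_PATH: &str = "/run/kata";

/// Pure accessor: no filesystem access, so it is safe to call from
/// the client side.
fn sb_storage_path() -> &'static str {
    KATA_PATH
}

/// Hypothetical stand-in for the server-side component.
struct MgmtServer;

impl MgmtServer {
    /// Only the server creates the directory that holds its socket.
    fn new() -> io::Result<Self> {
        fs::create_dir_all(sb_storage_path())?;
        Ok(MgmtServer)
    }
}
```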
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
We are already masking systemd-networkd.service, which causes systemd to
log an error about the socket still being enabled. In runtime-go, we're
masking the socket, so mask it in runtime-rs, too.
Signed-off-by: Markus Rudy <mr@edgeless.systems>
Since gc and trustee were bumped (#13046), the test
"Cannot get CDH resource when affirming policy is set without reference values"
has started failing for IBM SEL.
The attestation policy for IBM SEL returns an "affirming"
result whenever the claim can be parsed successfully,
meaning the evidence verification succeeds. As a result,
the negative test above always produces a positive result.
Skip this negative test for IBM SEL environments
(e.g. qemu-se*).
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
genpolicy supports building and testing on Darwin, both for Kata
developers and for users of the tool. In CI, we're currently only
testing the binary build on Darwin; the tests are only executed on Linux.
Since we aim to support development on Darwin, including test execution,
we need to prevent regressions such as [1]. This commit adds the test
binaries to the `make build` target, such that they are covered by
`ci/darwin-tests.sh`.
In order to avoid unnecessary recompilation between the build and test
targets, we align the `--release` handling between the two.
[1]: 639ff3578d
Signed-off-by: Markus Rudy <mr@edgeless.systems>
The test currently uses a static directory at `/tmp/initimg_test`. This
introduces non-determinism into the unit test:
* Files that already exist in that dir might alter test results.
* If the directory is owned by root, the test will fail due to
permissions.
Switch to using the tempfile crate instead.
Fixes: #13053
Signed-off-by: Markus Rudy <mr@edgeless.systems>
The initdata is currently being decoded, and then re-encoded with the
to_string function. This will usually not preserve the original initdata
document, and thus the initdata hash will differ between the annotation
and the block device.
This commit changes the logic to only decode the base64, but keep the
initdata document intact. Since the error message is now nested, adjust
the tests to look for the expected error in the chain.
Fixes: #12951
Signed-off-by: Markus Rudy <mr@edgeless.systems>
VirtioBlkCcwHandler::create_device was calling common_storage_handler
directly, bypassing the handle_block_storage function that checks for
the encryption_key=ephemeral driver option. This meant that encrypted
emptyDir volumes on s390x would attempt a plain mount of the raw block
device instead of setting up dm-crypt via the CDH, resulting in an
EINVAL mount error.
Route CCW block devices through handle_block_storage, matching the
pattern used by VirtioBlkPciHandler.
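A rough sketch of the decision that was being bypassed. The enum and
`plan_block_storage` are hypothetical and do not reproduce the actual
`handle_block_storage` signature; they only show the check on the
`encryption_key=ephemeral` driver option.

```rust
/// Hypothetical outcome of inspecting the driver options.
#[derive(Debug, PartialEq)]
enum BlockSetup {
    /// Mount the raw block device directly.
    PlainMount,
    /// Set up dm-crypt via the CDH before mounting.
    EncryptedViaCdh,
}

/// Chooses the setup path based on the `encryption_key=ephemeral`
/// driver option.
fn plan_block_storage(driver_options: &[&str]) -> BlockSetup {
    let ephemeral = driver_options.iter().any(|opt| {
        matches!(opt.split_once('='), Some(("encryption_key", "ephemeral")))
    });
    if ephemeral {
        BlockSetup::EncryptedViaCdh
    } else {
        BlockSetup::PlainMount
    }
}
```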
Fixes: failed to mount /dev/vda to .../storage/..., EINVAL
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>