Replace the agent's inline devicemapper implementation with the libs
kata-types::dmverity module. The agent's devicemapper Cargo feature
now forwards to kata-types/devicemapper, removing the direct
libdevmapper link dependency from the agent crate. Gate all dm-verity
imports, constants, and call sites behind libdevmapper.
Add USE_DEVMAPPER Makefile variable (default no) that appends the
devicemapper feature flag and forces LIBC=gnu when enabled.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This commit is just an enhancement with no functional changes.
Replace the sequential loop in handle_multi_layer_erofs_group with
join_all-based concurrent mounting. Base device paths and mount
directories are pre-resolved before spawning futures to avoid lock
contention. On partial failure, successfully mounted layers are
unmounted and dm-verity devices cleaned up before propagating the
error.
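The partial-failure handling can be sketched as follows. This is a minimal illustration rather than the agent's code: `mount_layer` and `mount_all` are hypothetical names, and std scoped threads stand in for `join_all` so the example needs no async runtime.

```rust
use std::thread;

/// Hypothetical stand-in for mounting one layer; fails for a layer named "bad".
fn mount_layer(layer: &str) -> Result<String, String> {
    if layer == "bad" {
        Err(format!("failed to mount {layer}"))
    } else {
        Ok(format!("/mnt/{layer}"))
    }
}

/// Mount all layers concurrently. On partial failure, pass every mount point
/// that did succeed to `unmount` (in reverse order) and return the first error.
fn mount_all(
    layers: &[&str],
    mut unmount: impl FnMut(&str),
) -> Result<Vec<String>, String> {
    // Let every mount run to completion before inspecting the results.
    let results: Vec<Result<String, String>> = thread::scope(|s| {
        let handles: Vec<_> = layers
            .iter()
            .map(|&layer| s.spawn(move || mount_layer(layer)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    let mut mounted = Vec::new();
    let mut first_err = None;
    for result in results {
        match result {
            Ok(path) => mounted.push(path),
            Err(e) => {
                if first_err.is_none() {
                    first_err = Some(e);
                }
            }
        }
    }

    match first_err {
        None => Ok(mounted),
        Some(e) => {
            // Roll back the layers that were mounted successfully.
            for path in mounted.iter().rev() {
                unmount(path);
            }
            Err(e)
        }
    }
}
```

In the real flow the rollback step would also clean up any dm-verity device created for a layer.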
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Wire the dm-verity helpers into the layer mount flow so that GPT
partitions carrying verity metadata are mounted through a verified
device-mapper target instead of the raw partition.
Refactor wait_and_mount_layer to resolve partition path and verity
device as separate steps: create a dm-verity device when
X-kata.dmverity-enabled=true is set, fall back to direct partition
mount otherwise, and return the verity device path for cleanup
tracking.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add per-container verity_devices tracking in Sandbox and wire the
teardown path: destroy_partition_dmverity_device removes the
device-mapper target via deferred-remove ioctl and deletes the mknod
node; cleanup_dmverity_devices iterates over all devices in reverse order.
Wire into remove_container_resources (rpc.rs) so verity devices are
torn down after unmount, and record verity device paths in
add_storages (storage/mod.rs) for tracking.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
GPT-partitioned EROFS layers can carry dm-verity hashes appended after
the filesystem data within the same partition. The host runtime passes
the root hash and parameters as X-kata.dmverity.* storage options; the
agent must set up the kernel dm-verity target before mounting so that
every read is integrity-checked against the Merkle tree.
Implement dm-verity device creation: option parsing from storage
options, device name generation, and create helper via devicemapper
ioctls with hash_start_block calculation (accounting for v1 superblock
presence).
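A minimal sketch of one way to compute the hash start block. It assumes the same rule cryptsetup applies (a 512-byte superblock at the start of the hash area, rounded up to a whole hash block); the function name and parameters are hypothetical and the agent's actual helper may differ.

```rust
/// Size of the on-disk verity superblock in bytes (assumption, as in cryptsetup).
const VERITY_SB_SIZE: u64 = 512;

/// Block index (in units of `hash_block_size`) where the Merkle tree starts.
/// `hash_area_offset` is the byte offset of the hash area inside the partition.
fn hash_start_block(hash_area_offset: u64, hash_block_size: u64, has_superblock: bool) -> u64 {
    if !has_superblock {
        return hash_area_offset / hash_block_size;
    }
    // Skip the superblock, rounding up to the next whole hash block.
    (hash_area_offset + VERITY_SB_SIZE + hash_block_size - 1) / hash_block_size
}
```

With a hash area at 1 MiB and 4096-byte hash blocks, this gives block 256 without a superblock and block 257 with one.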
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The kata guest VM runs without udev, so device-mapper nodes under
/dev/mapper are never created automatically. Add the foundational
helpers that subsequent dm-verity integration will rely on. They focus
on the following key points:
(1) DmOptions builders that disable all udev synchronization flags,
with read-only and deferred-remove variants.
(2) mknod-based device node creation/removal under /dev/mapper, since
devtmpfs nodes are not auto-created without udev.
Also add the devicemapper crate dependency (default-features = false).
Note that this commit depends on no-udev support in devicemapper-rs,
added by the PR: https://github.com/stratis-storage/devicemapper-rs/pull/1036
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
When cgroup v2 is enabled, exec can fail with EBUSY while writing the
process to cgroup.procs if the container process has been delegated to an
init subcgroup.
PR #10845 fixed this behavior for the systemd/D-Bus cgroup manager
path, which was related to #10733. The cgroupfs manager still writes the
process directly to the container cgroup, so apply the same init
subcgroup handling there.
Also fix the cgroupfs init-subcgroup existence check for absolute OCI
cgroup paths by joining the trimmed cgroup path under the cgroup root.
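The absolute-path pitfall can be illustrated with a short sketch. `cgroup_dir` is a hypothetical helper, not the agent's function; the point is that `Path::join` discards the base when the argument is absolute, so the leading `/` has to be trimmed first.

```rust
use std::path::{Path, PathBuf};

/// Resolve an OCI cgroup path (possibly absolute) underneath the cgroup root.
fn cgroup_dir(cgroup_root: &Path, oci_cgroup_path: &str) -> PathBuf {
    // Without the trim, joining "/kubepods/..." would replace the root entirely.
    cgroup_root.join(oci_cgroup_path.trim_start_matches('/'))
}
```

For example, `cgroup_dir(Path::new("/sys/fs/cgroup"), "/kubepods/pod1/c1")` yields `/sys/fs/cgroup/kubepods/pod1/c1`, whereas a plain `join` of the untrimmed path yields `/kubepods/pod1/c1`.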
Fixes: #9701
Signed-off-by: Chris Ayoub <cayoub@openai.com>
Generated-By: OpenAI Codex
When a container process is terminated by a signal, the agent's SIGCHLD
reaper stored the raw signal number as the process exit code. As a result
a process killed by SIGKILL(9) reported exit code 9 instead of the
conventional 137 (128+9).
Apply the standard shell convention of 128+signal_number so that
signal-terminated processes report the expected exit codes, e.g.
SIGKILL(9) -> 137, SIGTERM(15) -> 143, SIGINT(2) -> 130. This mimics
runc, which encodes wait-status exit codes the same way:
https://github.com/opencontainers/runc/blob/v1.4.3/libcontainer/utils/utils.go#L19
Both runc and this new Kata behaviour follow the conventional exit code
semantics documented at https://tldp.org/LDP/abs/html/exitcodes.html.
The conversion is factored into a small helper and covered by a unit
test. The runtime and shim already pass the exit code through unchanged,
so no further changes are needed for the corrected value to surface.
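A minimal sketch of the conversion, assuming a hypothetical `Termination` type in place of the wait status the reaper actually decodes:

```rust
/// How a child terminated, as a reaper might decode it from a wait status.
/// (Illustrative type; not the agent's own.)
enum Termination {
    Exited(i32),
    Signaled(i32),
}

/// Map a termination to a shell-style exit code: 128 + signal number
/// for signal deaths, the plain exit status otherwise.
fn exit_code(t: &Termination) -> i32 {
    match t {
        Termination::Exited(code) => *code,
        Termination::Signaled(signal) => 128 + *signal,
    }
}
```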
Fixes: signal-terminated containers reporting raw signal numbers
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Upgrade the nix crate across the workspace to version 0.30.1 to address
security vulnerabilities and adopt safer file descriptor handling patterns.
### Breaking Changes in nix 0.28.0
1. **File Descriptor Type Changes**
- Functions now return `OwnedFd` instead of `RawFd` (i32)
- Functions requiring file descriptors now expect types implementing `AsFd` trait
- This provides RAII-based automatic cleanup and prevents fd leaks
2. **API Signature Changes**
- `pipe()`, `pipe2()`, `openpty()` now return `OwnedFd` tuples
- `socket()` returns `OwnedFd` instead of `RawFd`
- `open()`, `memfd_create()` return `OwnedFd`
- `setns()`, `write()`, `fcntl()` require `AsFd` trait
- `madvise()` requires `NonNull<c_void>` instead of raw pointer
- `bind()`, `listen()`, `connect()` require `AsFd` and `Backlog` type
3. **Module Feature Flags**
- Modules now require explicit feature flags (mman, reboot, etc.)
### Additional Breaking Changes in nix 0.30.1
1. **symlinkat() API Change**
- `dirfd` parameter now requires `AsFd` trait instead of `Option<RawFd>`
- Use `BorrowedFd::borrow_raw(libc::AT_FDCWD)` for current directory
2. **Type Alias Deprecation**
- `MemFdCreateFlag` renamed to `MFdFlags` for consistency
### Changes Made
**Workspace Configuration (Cargo.toml)**
- Updated nix to 0.30.1 with features: fs, mount, sched, process, ioctl,
signal, socket, feature, user, hostname, term, event, mman, reboot
**File Descriptor Handling Patterns**
- Use `BorrowedFd::borrow_raw(raw_fd)` to wrap RawFd for AsFd requirements
- Use `.as_fd().as_raw_fd()` to extract raw fd without ownership transfer
- Use `.into_raw_fd()` only when ownership transfer is needed
- Use `NonNull::new().unwrap()` for madvise pointer conversion
**Deprecated API Replacements**
- `eventfd()` → `EventFd::from_value_and_flags()`
- `Errno::from_i32()` → `Errno::from_raw()`
- `listen(fd, backlog)` → `listen(&fd, Backlog::new(backlog).unwrap())`
- `MemFdCreateFlag` → `MFdFlags`
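
The file descriptor patterns listed above can be sketched with the standard library alone. This is an illustration under stated assumptions: `raw_number` is a hypothetical stand-in for any API that takes `AsFd`, and `/dev/null` is used only as a convenient file to open on Linux.

```rust
use std::fs::File;
use std::os::fd::{AsFd, AsRawFd, BorrowedFd, FromRawFd, IntoRawFd, OwnedFd, RawFd};

/// Stand-in for an API that accepts any `AsFd` implementor.
fn raw_number(fd: impl AsFd) -> RawFd {
    fd.as_fd().as_raw_fd()
}

fn demo() -> std::io::Result<()> {
    let owned: OwnedFd = File::open("/dev/null")?.into();

    // Peek at the raw fd without giving up ownership.
    let raw = owned.as_fd().as_raw_fd();

    // Wrap a RawFd for an AsFd parameter.
    // SAFETY: `raw` stays open for as long as `owned` is alive.
    let borrowed = unsafe { BorrowedFd::borrow_raw(raw) };
    assert_eq!(raw_number(borrowed), raw);
    assert_eq!(raw_number(&owned), raw);

    // Transfer ownership out; from here on the caller must close the fd.
    let leaked = owned.into_raw_fd();
    // SAFETY: `leaked` is an open fd that nothing else owns.
    drop(unsafe { OwnedFd::from_raw_fd(leaked) });
    Ok(())
}
```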
Generated by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Add comprehensive test coverage for the device handler modules under
src/agent/src/device, including matcher behavior, edge cases, and
shared helper coverage across block, network, nvdimm, scsi, and vfio
device paths.
Assisted-by: IBM Bob
Signed-off-by: SantoshMadhukar-K <SantoshMadhukar.Khandyana@ibm.com>
Add an opt-in `visible_cdi_devices` agent option that lets a container
select which of the VM's CDI-known devices it sees via a
VISIBLE_CDI_DEVICES env var. The schema is `<cdi-kind>=<devices>`
(e.g. "nvidia.com/gpu=all", or "kata.com/gpu=0,1"), with multiple kinds
delimited by ':'.
When enabled, the agent maps the value to CDI device requests and feeds
them through the existing CDI injection path, so device nodes, mounts,
env and createContainer hooks from the guest CDI spec (e.g.
/var/run/cdi/nvidia.yaml, generated by NVRC/nvidia-ctk) are applied.
The variable is intentionally distinct from NVIDIA_VISIBLE_DEVICES and
does not promise identical semantics.
If a requested kind is present in the guest CDI registry but the
specific device index is not, the agent fails fast rather than waiting
for the CDI-spec watch/timeout path. An entirely absent kind falls
through to the existing wait/timeout behavior.
Defaults to false; containers that don't set the env var are unaffected.
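A minimal sketch of parsing the schema described above. The function name is hypothetical, and expanding a comma-separated device list into one `<cdi-kind>=<device>` request per device is an assumption about how the value maps to CDI device requests.

```rust
/// Parse a VISIBLE_CDI_DEVICES value such as "nvidia.com/gpu=all:kata.com/gpu=0,1"
/// into fully qualified CDI device names ("<kind>=<device>").
fn parse_visible_cdi_devices(value: &str) -> Result<Vec<String>, String> {
    let mut requests = Vec::new();
    // Kinds are delimited by ':'; empty segments are ignored.
    for entry in value.split(':').filter(|e| !e.is_empty()) {
        let (kind, devices) = entry
            .split_once('=')
            .ok_or_else(|| format!("missing '=' in {entry:?}"))?;
        if kind.is_empty() {
            return Err(format!("empty CDI kind in {entry:?}"));
        }
        // Devices within one kind are delimited by ','.
        for device in devices.split(',') {
            if device.is_empty() {
                return Err(format!("empty device name in {entry:?}"));
            }
            requests.push(format!("{kind}={device}"));
        }
    }
    Ok(requests)
}
```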
Signed-off-by: LandonTClipp <lclipp@coreweave.com>
When the agent-protocol-forwarder's inbound connection restarts (e.g.
during a Cloud API Adaptor restart in peer pod environments), the shim
re-sends a GetOOMEvent request through the new connection. Since the
forwarder→agent Unix socket survives the restart, the old handler from
the previous connection remains alive, holding the event_rx lock while
blocked in recv().await.
The new handler acquires the sandbox lock, then attempts to acquire the
event_rx lock — which is held by the old handler. Because the sandbox
lock is still held during this wait, every subsequent RPC
(ExecProcess, WaitProcess, StatsContainer, SignalProcess, etc.) blocks
on the sandbox lock, rendering the pod completely unresponsive.
The root cause is a lock ordering violation: get_oom_event held the
sandbox lock while acquiring the event_rx lock. Fix this by scoping the
sandbox lock acquisition so it is dropped before the event_rx lock is
acquired. The sandbox lock is only needed to clone the Arc<Mutex<Receiver>>
— once cloned, it can be released immediately.
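The scoping fix can be sketched as below. This is a simplified illustration using std synchronization primitives rather than the async locks the agent uses; the `Sandbox` struct here is a hypothetical reduction to the one field that matters.

```rust
use std::sync::mpsc::Receiver;
use std::sync::{Arc, Mutex};

/// Reduced stand-in for the agent's sandbox state.
struct Sandbox {
    event_rx: Arc<Mutex<Receiver<String>>>,
}

fn get_oom_event(sandbox: &Arc<Mutex<Sandbox>>) -> Option<String> {
    // Hold the sandbox lock only long enough to clone the Arc.
    let event_rx = {
        let sb = sandbox.lock().unwrap();
        Arc::clone(&sb.event_rx)
    }; // sandbox guard dropped here

    // Waiting on event_rx no longer blocks other users of the sandbox lock.
    let rx = event_rx.lock().unwrap();
    rx.recv().ok()
}
```

Even if an older handler still holds the `event_rx` lock, a second caller now blocks only on that lock and leaves the sandbox lock free for other RPCs.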
Assisted-by: Claude Code <noreply@anthropic.com>
Signed-off-by: Thejas N <thn@redhat.com>
In standalone nydusd mode with virtio-fs passthrough, the guest-side
mkdir may fail with ENOSYS. Update the overlayfs storage handler to
skip directory creation when the directory already exists, logging a
warning instead of failing.
This ensures container rootfs setup succeeds when nydusd's native
overlay manages the directory structure.
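One way to express the tolerant behavior is sketched below. The helper names are hypothetical, and the mkdir operation is injected so the ENOSYS case can be exercised; the real handler may check for the directory before attempting creation instead.

```rust
use std::{fs, io, path::Path};

/// Create `path` with `mkdir`, but treat a failure as a warning when the
/// directory already exists (for example, when it is managed by nydusd).
fn ensure_dir_with(path: &Path, mkdir: impl Fn(&Path) -> io::Result<()>) -> io::Result<()> {
    match mkdir(path) {
        Ok(()) => Ok(()),
        Err(e) if path.is_dir() => {
            eprintln!("warning: mkdir {} failed ({e}), directory exists", path.display());
            Ok(())
        }
        Err(e) => Err(e),
    }
}

/// Convenience wrapper using the real filesystem call.
fn ensure_dir(path: &Path) -> io::Result<()> {
    ensure_dir_with(path, |p| fs::create_dir_all(p))
}
```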
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add update_guest_filesystem_metrics() that collects disk space usage
(total/used/available) for all read-write mounted filesystems inside
the guest VM. This enables monitoring guest disk usage in Kata/CoCo
pods through the existing GetMetrics RPC.
The output metrics look like this:
- kata_guest_filesystem_bytes{mount="/",device="vda",item="total|used|available"}
- kata_guest_filesystem_inodes{mount="/",device="vda",item="total|used|available"}
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add two new GaugeVec metrics to expose guest filesystem space usage:
(1) kata_guest_filesystem_bytes{mount, device, item}: space in bytes
(total/used/available)
(2) kata_guest_filesystem_inodes{mount, device, item}: inode counts
(total/used/available)
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
When a VF is cold-plugged in guest-kernel mode, mlx5_core binds to the
PCI device inside the VM and mlx5_ib creates IB character devices under
/dev/infiniband/ (uverbs*, rdma_cm, umad*). The container cannot reach
these devices unless they are explicitly added to its OCI spec.
Add expose_guest_infiniband_devices(), called from create_devices() when
the container carries at least one VFIO device entry. The function:
- Walks /dev/infiniband/ inside the guest VM.
- Appends each char device to spec.linux.devices.
- Inserts matching cgroup allow rules (rwm).
- Is a no-op if /dev/infiniband/ is absent or empty (no IB driver,
or VF not yet rebound), so non-RDMA pods are unaffected.
Gate the call on container_has_vfio_device() so unrelated containers
sharing the sandbox do not get IB device access widened.
Add is_vfio_device_type() and snapshot_infiniband() to
kata-sys-util/pcilibs. is_vfio_device_type() lets the agent check
device type strings against the VFIO driver name constants without
duplication. snapshot_infiniband() summarises /sys/class/infiniband,
/sys/class/infiniband_verbs, and /dev/infiniband as a single diagnostic
string for log context; it lives in pcilibs because it has no
agent-specific dependencies (pure sysfs/devfs reads).
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Populate missing VFIO guest PCI paths via QMP before serializing
container devices so guest-kernel PCI env translation has the mappings
it needs.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
When a VFIO cold-plugged network device appears in the guest with a
different MAC than the runtime request, resolve the netdev by PCI path
and apply the requested MAC before the normal by-MAC update flow.
This preserves existing behavior while avoiding UpdateInterface
mismatches in SR-IOV cold-plug cases.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Support multi-layer EROFS storage without an explicit ext4 upper
layer. When runtime-rs sends only EROFS lower storage and overlay
metadata, create the overlay upper/work directories under the
container bundle in /run/kata-containers.
Keep the explicit ext4 rwlayer path for disk-backed snapshots, and
only track real temporary mount points for cleanup. The implicit
/run-backed upper is bundle-scoped state and is removed with the
container bundle.
Assisted-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Commit b56313472 ("agent: Align agent OCI spec with oci-spec-rs",
PR #9944) inverted the condition guarding the create_dir_all call
for process.cwd: the leading `!` was dropped during the refactor.
As a result, the CWD is created only when process.cwd is the empty
string.
When the guest then runs chdir(process.cwd) and the CWD doesn't exist,
the call fails with ENOENT. The agent propagates that to the shim, which
surfaces it to containerd as "failed to create shim task: ENOENT:
No such file or directory" — indistinguishable from a missing
argv[0].
This regressed the original fix in PR #2375 (Fixes #2374), which
deliberately mirrored runc's behavior. Put the `!` back.
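The intended guard looks roughly like this. It is a simplified sketch: `ensure_cwd` is a hypothetical name and the real code works on the OCI spec's process object rather than a bare string.

```rust
use std::{fs, io};

/// Create the working directory only when one is actually configured.
fn ensure_cwd(cwd: &str) -> io::Result<()> {
    // The regression dropped the `!`, so the directory was created only
    // for the empty string and never for a real path.
    if !cwd.is_empty() {
        fs::create_dir_all(cwd)?;
    }
    Ok(())
}
```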
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
Use kata_types::mount::Mount for the final multi-layer EROFS
overlay mount instead of calling baremount() directly.
The mount helper detects overlay option strings close to the kernel
mount data limit. When lowerdir entries share a common parent, it
changes into that directory and rewrites lowerdir to relative paths.
That avoids repeating the same long prefix for every layer.
Multi-layer EROFS images can have many lower layers under
/run/kata-containers/<cid>/multi-layer. Passing the raw absolute
lowerdir list can exceed the mount option buffer and fail the final
overlay mount, even after all layer devices mounted successfully.
Reuse the helper so this path follows Kata's normal overlay mount
handling, including lowerdir compaction before mount(2).
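The compaction idea can be sketched as follows. This is not the kata_types helper itself: `compact_lowerdirs` is a hypothetical function that only shows the rewrite, leaving the chdir and mount(2) call to the caller.

```rust
use std::path::{Path, PathBuf};

/// If every lowerdir shares the same parent directory, return that parent and
/// a `lowerdir=` option rewritten with relative names; otherwise return None.
fn compact_lowerdirs(lowerdirs: &[&str]) -> Option<(PathBuf, String)> {
    let parent = Path::new(lowerdirs.first()?).parent()?;
    let mut names = Vec::with_capacity(lowerdirs.len());
    for dir in lowerdirs {
        let path = Path::new(dir);
        if path.parent()? != parent {
            return None; // no common parent, keep the absolute paths
        }
        names.push(path.file_name()?.to_str()?);
    }
    Some((parent.to_path_buf(), format!("lowerdir={}", names.join(":"))))
}
```

The caller would change into the returned parent directory before mounting, so the long prefix appears zero times in the option string instead of once per layer.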
Assisted-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
In GPT mode, all partitions share the same base block device, so
resolving it once per uevent source and caching the result avoids
redundant hotplug waits that would otherwise scale linearly with
layer count.
Layers are sorted by partition number before mounting to guarantee
correct overlay lowerdir precedence regardless of the order the host
emits Storage entries.
Also remove the dead_code attributes now that the code is in use.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The guest agent needs to resolve individual partition devices from a
single GPT-partitioned block device, but the kernel does not always
create partition nodes immediately after the base device appears,
especially when another fd holds the device open during hot-plug.
Add utility functions that handle two problems:
(1) Mapping a base device path to its partition path following the
kernel naming convention (bare suffix vs 'p' separator).
(2) Ensuring the partition node exists before mount.
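The naming convention in (1) can be sketched as below; `partition_path` is a hypothetical name, not necessarily the one used in the agent.

```rust
/// Build the partition device path for `base` following the kernel naming
/// convention: a 'p' separator is inserted when the base name ends in a digit.
fn partition_path(base: &str, partition: u32) -> String {
    let ends_with_digit = base
        .chars()
        .last()
        .map_or(false, |c| c.is_ascii_digit());
    if ends_with_digit {
        format!("{base}p{partition}")
    } else {
        format!("{base}{partition}")
    }
}
```

So `/dev/vda` with partition 1 maps to `/dev/vda1`, while `/dev/nvme0n1` with partition 2 maps to `/dev/nvme0n1p2`.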
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
No functional change: all callers pass None, so every call still
resolves the device via uevent exactly as before.
This only prepares the multi-layer EROFS handler for GPT partition and
dm-verity support by widening the wait_and_mount_layer() interface
without changing behavior.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The previously unused code was only kept alive by dead_code attributes
and is never called, so remove it entirely.
This removes the mode field from the MkdirDirective structure along
with its relevant test cases.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Enhance VirtioBlkPciMatcher to only match whole-disk uevents. This
prevents the matcher from incorrectly matching partition uevents
(e.g., /dev/vdaX) which is critical for partitioned disks where
partition uevents appear alongside whole-disk uevents.
This commit eliminates such false matches.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Refactor ScsiBlockMatcher to only match whole-disk uevents. This
prevents the matcher from incorrectly matching partition uevents
(e.g., block/sdd/sdd9) which is critical for partitioned disks
where partition uevents appear alongside whole-disk uevents.
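A whole-disk check of this kind can be sketched on the uevent devpath. This is an assumption about the shape of the test, with a hypothetical helper name; the actual matchers may rely on other uevent fields.

```rust
/// Return true when `devpath` names a whole disk ("…/block/sdd") rather than
/// a partition nested below it ("…/block/sdd/sdd9").
fn is_whole_disk_devpath(devpath: &str) -> bool {
    match devpath.rsplit_once("/block/") {
        // Exactly one path component may follow the "block" directory.
        Some((_, rest)) => !rest.is_empty() && !rest.contains('/'),
        None => false,
    }
}
```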
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Extend pcipath_from_dev_tree_path() to support the full NUMA-aware path
format "root_complex/bus/device" (e.g. "10/00/02") in addition to the
legacy "bus/device" format, defaulting to root complex "00" for backward
compatibility.
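A minimal sketch of accepting both formats. The helper name and the tuple return type are illustrative only; the real function returns a PCI path type rather than strings.

```rust
/// Split a device tree path into (root_complex, bus, device), accepting both
/// "bus/device" and "root_complex/bus/device".
fn split_dev_tree_path(path: &str) -> Result<(String, String, String), String> {
    let parts: Vec<&str> = path.split('/').collect();
    match parts.as_slice() {
        // Legacy format: default to root complex "00".
        [bus, dev] => Ok(("00".to_string(), (*bus).to_string(), (*dev).to_string())),
        [rc, bus, dev] => Ok(((*rc).to_string(), (*bus).to_string(), (*dev).to_string())),
        _ => Err(format!("unexpected device tree path {path:?}")),
    }
}
```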
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
The rtnetlink crate has had an API for neighbours since 0.11. The last
attempt to use this API caused problems on AKS, but looking at it again
shows that not all functionality was ported back then (state, flags and
lladdr). Attempt the migration again, considering all parameters.
Fixes: #11942
Signed-off-by: Markus Rudy <mr@edgeless.systems>
VirtioBlkCcwHandler::create_device was calling common_storage_handler
directly, bypassing the handle_block_storage function that checks for
the encryption_key=ephemeral driver option. This meant that encrypted
emptyDir volumes on s390x would attempt a plain mount of the raw block
device instead of setting up dm-crypt via the CDH, resulting in an
EINVAL mount error.
Route CCW block devices through handle_block_storage, matching the
pattern used by VirtioBlkPciHandler.
Fixes: failed to mount /dev/vda to .../storage/..., EINVAL
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
If the policy loading encounters an error, we `abort(3)` the agent for
safety. Since abort causes the process to stop immediately, the async
logs might not be flushed yet, and thus won't make it to the runtime,
hiding the reason for the abort. Wait a bit before aborting so that the
logs are fully written.
Fixes: #13031
Signed-off-by: Markus Rudy <mr@edgeless.systems>
Add the core volume handler for block-encrypted emptyDir support
in runtime-rs, bringing it to parity with the Go runtime (PR #10559).
When emptydir_mode is set to "block-encrypted", host emptyDir bind
mounts are intercepted and handled as follows:
1. A sparse disk image (disk.img) is created inside the emptyDir
folder, sized to match the host filesystem capacity.
2. A mountInfo.json is written under the kata direct-volume root
with volume_type "blk", fs_type "ext4", and metadata
encryptionKey=ephemeral.
3. The disk image is plugged into the guest VM as a virtio-blk
device via the hypervisor device manager.
4. An agent::Storage is built with driver_options containing
encryption_key=ephemeral and shared=true, so the kata-agent
delegates formatting and encryption to CDH using LUKS2.
The volume is registered in the dispatch chain before the regular
block-volume check, and ephemeral disk metadata is tracked for
sandbox-level cleanup at teardown.
Also re-exports EMPTYDIR_MODE_* constants from kata-types::config
so downstream crates can reference them.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
A lot of unformatted Rust code leaves uncommitted changes in the Rust
runtime, its libs, and the agent sources whenever `cargo fmt --all` is
run.
Let's format the code to get rid of this noise.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Extend the in-guest agent's VFIO device handler to support the cold-plug
flow. When the runtime cold-plugs a GPU before the VM boots, the agent
needs to bind the device to the vfio-pci driver inside the guest and
set up the correct /dev/vfio/ group nodes so the workload can access
the GPU.
This updates the device discovery logic to handle the PCI topology that
QEMU presents for cold-plugged vfio-pci devices and ensures the IOMMU
group is properly resolved from the guest's sysfs.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
`make vendor` isn't required anymore. People who need vendored code should
use the `tools/packaging/release/generate_vendor.sh` script instead.
Assisted-by: Claude AI
Signed-off-by: Greg Kurz <groug@kaod.org>
libc::S_IF* are u16 on Darwin/BSD and u32 on Linux. The match in
FileType::from and its tests mix both widths and don't compile on
Darwin. Cast everything to u32; on Linux that's a no-op, hence the
clippy::unnecessary_cast allow (rust-lang/rust-clippy#6466).
Fixes: #12916
Signed-off-by: Spyros Seimenis <sse@edgeless.systems>
regorus 0.9.0 introduced a hard, per-engine ceiling on parsed-policy
size (1024 columns / 1 MiB / 20 000 lines, see lexer.rs:30 in
microsoft/regorus). The 1024-column cap rejects realistic policies
emitted by `genpolicy`: the `NVIDIA_REQUIRE_CUDA` environment variable
on `nvcr.io/nvidia/k8s/cuda-sample` is roughly 1.3 KiB on a single line,
so the agent's `set_policy()` returns an error, the agent (PID 1) exits,
the guest kernel reboots, and the runtime eventually times out
connecting to the agent's vsock.
regorus PR #624 ("feat: make policy length limits configurable per
engine") adds `Engine::set_policy_length_config`, but it has not been
released yet -- the latest published version is still 0.9.1, which
predates that change.
Pin `regorus` to the upstream commit that includes #624 and call the
new setter from `AgentPolicy::new_engine()` with values that comfortably
fit any policy we expect to evaluate (64 KiB per line, 16 MiB per file,
200 000 lines) while still rejecting pathological/minified input. Once
a regorus release > 0.9.1 ships with #624, the dependency can be moved
back to crates.io.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The version we used before was released in 2024; it's about time to use
a newer version. The new version of the crate comes with a license,
which addresses a `cargo deny` finding.
Signed-off-by: Markus Rudy <mr@edgeless.systems>
Remove unnecessary let binding for unit value expression to fix clippy
warning in Rust 1.93.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Remove unnecessary reference operator from expression that is
immediately dereferenced by the compiler to fix clippy warning in
Rust 1.93.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>