Add the emptydir_mode configuration option to all runtime-rs config
template files. CoCo configs (snp, tdx, se, coco-dev, nvidia-gpu-snp,
nvidia-gpu-tdx) default to block-encrypted via @DEFEMPTYDIRMODE_COCO@,
while non-CoCo configs (qemu, nvidia-gpu, fc) default to shared-fs
via @DEFEMPTYDIRMODE@.
Also add DEFEMPTYDIRMODE and DEFEMPTYDIRMODE_COCO variables to the
runtime-rs Makefile for template substitution.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
When emptydir_mode is "block-encrypted", host emptyDir paths must
remain as "bind" mounts so the EncryptedEmptyDirVolume handler can
intercept them in the volume dispatch chain. Previously,
update_ephemeral_storage_type() would unconditionally convert them
to "local" type, causing them to be handled as plain local volumes
instead.
Add the emptydir_mode parameter to update_ephemeral_storage_type()
and its call chain (amend_spec in container.rs) and skip the
host-emptyDir-to-local conversion when the mode is block-encrypted.
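The gating described above can be sketched as follows. This is an
illustration in Go, not the runtime-rs code: the Mount type, the
isHostEmptyDir helper, and its path-based detection rule are all
assumptions made for the example.

```go
package main

import "strings"

// Mode names as stated in the commit message.
const (
	modeSharedFS       = "shared-fs"
	modeBlockEncrypted = "block-encrypted"
)

// Mount is a simplified, hypothetical stand-in for an OCI mount entry.
type Mount struct {
	Type   string
	Source string
}

// isHostEmptyDir is an assumed detection rule based on the kubelet
// volume path layout; the real check may differ.
func isHostEmptyDir(source string) bool {
	return strings.Contains(source, "/kubernetes.io~empty-dir/")
}

// updateEphemeralStorageType converts host emptyDir bind mounts to the
// "local" type, except in block-encrypted mode, where they stay "bind"
// so a later handler can pick them up.
func updateEphemeralStorageType(mounts []Mount, emptydirMode string) {
	for i := range mounts {
		m := &mounts[i]
		if m.Type != "bind" || !isHostEmptyDir(m.Source) {
			continue
		}
		if emptydirMode == modeBlockEncrypted {
			continue // leave as "bind" for the encrypted handler
		}
		m.Type = "local"
	}
}
```

With emptydirMode set to "block-encrypted" the mount type is left
untouched; with "shared-fs" it becomes "local" as before.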
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Add the core volume handler for block-encrypted emptyDir support
in runtime-rs, bringing it to parity with the Go runtime (PR #10559).
When emptydir_mode is set to "block-encrypted", host emptyDir bind
mounts are intercepted and handled as follows:
1. A sparse disk image (disk.img) is created inside the emptyDir
folder, sized to match the host filesystem capacity.
2. A mountInfo.json is written under the kata direct-volume root
with volume_type "blk", fs_type "ext4", and metadata
encryptionKey=ephemeral.
3. The disk image is plugged into the guest VM as a virtio-blk
device via the hypervisor device manager.
4. An agent::Storage is built with driver_options containing
encryption_key=ephemeral and shared=true, so the kata-agent
delegates formatting and encryption to CDH using LUKS2.
The volume is registered in the dispatch chain before the regular
block-volume check, and ephemeral disk metadata is tracked for
sandbox-level cleanup at teardown.
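For illustration, step 2 above could produce metadata along these
lines. This is a sketch in Go under stated assumptions: the JSON key
names are assumed to follow the Go runtime's direct-volume MountInfo
layout, and encryptedEmptyDirMountInfo is a hypothetical helper name.

```go
package main

import "encoding/json"

// MountInfo is an assumed shape for mountInfo.json; the key names are
// not confirmed by this commit message.
type MountInfo struct {
	VolumeType string            `json:"volume-type"`
	Device     string            `json:"device"`
	FsType     string            `json:"fstype"`
	Metadata   map[string]string `json:"metadata,omitempty"`
	Options    []string          `json:"options,omitempty"`
}

// encryptedEmptyDirMountInfo (hypothetical) serialises the metadata
// for a block-encrypted emptyDir backed by the given disk image.
func encryptedEmptyDirMountInfo(diskImage string) (string, error) {
	info := MountInfo{
		VolumeType: "blk",
		Device:     diskImage,
		FsType:     "ext4",
		Metadata:   map[string]string{"encryptionKey": "ephemeral"},
	}
	b, err := json.Marshal(info)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```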
Also re-export the EMPTYDIR_MODE_* constants from kata-types::config
so downstream crates can reference them.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
The proto Storage message already has a "shared" field (field 8),
but the runtime-rs agent crate's internal Storage struct was
missing it, so it was never forwarded to the kata-agent.
Add the field to the Rust struct and its From<Storage> translation,
and update all explicit struct initialisers across the resource
crate to include shared: false so the build stays clean.
This is needed for trusted ephemeral data storage, where the
agent uses the shared flag to avoid premature cleanup of volumes
that are shared across containers in a pod.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Add add_volume_mount_info(), is_volume_mounted(), and
remove_volume_path() to the mount module. These mirror the Go
helpers (AddMountInfo, IsVolumeMounted, Remove) in
src/runtime/pkg/direct-volume/utils.go and are needed by the
upcoming EncryptedEmptyDirVolume to write and clean up
mountInfo.json metadata for block-encrypted emptyDir volumes.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Add the emptydir_mode field to the Runtime configuration struct,
allowing runtime-rs to read the emptyDir handling mode from the
TOML config file. This is groundwork for trusted ephemeral data
storage support in runtime-rs (parity with the Go runtime).
Two modes are supported:
- shared-fs (default): share emptyDir via virtio-fs/9p.
- block-encrypted: plug a block device encrypted in-guest via
CDH/LUKS2.
Empty values default to "shared-fs"; unknown values are rejected
during validation.
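The defaulting and validation rule can be sketched as below. This is
an illustration in Go rather than the actual Rust implementation;
normalizeEmptyDirMode and the constant names are hypothetical.

```go
package main

import "fmt"

const (
	emptyDirModeSharedFS       = "shared-fs"
	emptyDirModeBlockEncrypted = "block-encrypted"
)

// normalizeEmptyDirMode (hypothetical) maps an empty value to the
// default and rejects anything that is not a known mode.
func normalizeEmptyDirMode(mode string) (string, error) {
	switch mode {
	case "":
		return emptyDirModeSharedFS, nil
	case emptyDirModeSharedFS, emptyDirModeBlockEncrypted:
		return mode, nil
	default:
		return "", fmt.Errorf("invalid emptydir_mode %q: expected %q or %q",
			mode, emptyDirModeSharedFS, emptyDirModeBlockEncrypted)
	}
}
```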
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
When using static resources, always set the maxvcpus value equal to
the vcpus value. This matches the current runtime-go behaviour
introduced in https://github.com/kata-containers/kata-containers/pull/9195
The static resources case does not support dynamic CPU hotplugging,
so the maximum number of vCPUs should be limited to the number of
vCPUs. Booting with a high number of max vCPUs is also somewhat
slower than booting with a lower number.
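A minimal sketch of the rule, in Go for illustration only. CPUConfig
and applyStaticResources are hypothetical names, not the runtime's
actual types.

```go
package main

// CPUConfig is a simplified, hypothetical view of the vCPU settings.
type CPUConfig struct {
	VCPUs    uint32
	MaxVCPUs uint32
}

// applyStaticResources clamps MaxVCPUs to VCPUs when static resource
// management is in use, since no vCPU can be hotplugged later.
func applyStaticResources(cfg CPUConfig, static bool) CPUConfig {
	if static {
		cfg.MaxVCPUs = cfg.VCPUs
	}
	return cfg
}
```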
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
Update CDH to a newer version and:
- adjust the NVIDIA root filesystem build to reflect the change from
using libcryptsetup to using the cryptsetup binary.
- adjust image-pull test cases to conduct parallel write operations
on the /dev/trusted_store backed guest image pull location since
issue #12721 has been solved on CDH side.
Fixes: #12721
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
- Add click==8.3.3 to docs/requirements.txt
- Click 8.3.3 is the latest version for Python >=3.10
- Required for mkdocs toolchain compatibility; also resolves a
vulnerability in indirect dependencies
- Ref: CVE-2026-7246
Signed-off-by: pavithiran34 <pavithiran.p@ibm.com>
Because intptr() returns a fresh pointer on every call, the checks for
already-listed devices compared pointer addresses, never values, so
every check evaluated to false.
As a result /dev/null, /dev/urandom, /dev/ptmx, /dev/loop-control and
/dev/loop* were appended to the device allowlist for sandbox_cgroup
even when the runtime spec already listed them, producing duplicate
entries.
Switch to nil-safe value comparisons via a type switch on the cgroup device type
and dereferenced *d.Major / *d.Minor,
keeping the same detection semantics but actually matching existing entries.
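The difference between the two comparisons can be reproduced with a
small sketch. The Device type and the buggyMatch / matches helpers are
hypothetical simplifications, not the runtime's actual code.

```go
package main

// intptr returns a pointer to a fresh copy of v on every call.
func intptr(v int64) *int64 { return &v }

// Device is a simplified, hypothetical cgroup device entry.
type Device struct {
	Type  string // "c" for char, "b" for block
	Major *int64
	Minor *int64
}

// buggyMatch compares pointer addresses: intptr() always yields a new
// address, so this can never be true.
func buggyMatch(d Device, major, minor int64) bool {
	return d.Major == intptr(major) && d.Minor == intptr(minor)
}

// matches compares the pointed-to values and tolerates nil pointers.
func matches(d Device, devType string, major, minor int64) bool {
	if d.Major == nil || d.Minor == nil {
		return false
	}
	switch d.Type {
	case devType:
		return *d.Major == major && *d.Minor == minor
	default:
		return false
	}
}
```

For a spec entry describing /dev/null (char 1:3), buggyMatch reports
no match while matches finds it, which is what prevents the duplicate
allowlist entry.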
Assisted-By: Claude 4.7
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
When sandbox_cgroup_only is enabled, the kata shim threads inherit
the sandbox device cgroup. For container rootfs whose mount source
is a regular file backed by a loop device (notably the blockfile
snapshotter), containerd's mount package opens /dev/loop-control to
allocate a free /dev/loopN and then opens that block node to attach
the backing file. Neither device is on the sandbox cgroup allowlist,
so both opens fail with EPERM.
This change adds /dev/loop-control (char 10:237) and the /dev/loopN
block nodes (block major 7, any minor) to the sandbox device cgroup
allowlist when sandbox_cgroup_only is true, mirroring the existing
treatment of /dev/null, /dev/urandom and /dev/ptmx. The additions
are gated on SandboxCgroupOnly because that is the only mode in
which the shim itself is constrained by this cgroup.
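As a sketch, the gated additions could look like this. DeviceRule, the
-1 wildcard convention, and the "rwm" access string are assumptions
made for the example; only the device numbers come from the text
above.

```go
package main

// DeviceRule is a simplified, hypothetical allowlist entry.
// A Minor of -1 stands for "any minor" in this sketch.
type DeviceRule struct {
	Type   string // "c" for char, "b" for block
	Major  int64
	Minor  int64
	Access string
	Allow  bool
}

// loopDeviceRules returns the extra rules needed for loop-backed
// rootfs mounts, and nothing when the shim is not confined by the
// sandbox cgroup.
func loopDeviceRules(sandboxCgroupOnly bool) []DeviceRule {
	if !sandboxCgroupOnly {
		return nil
	}
	return []DeviceRule{
		// /dev/loop-control
		{Type: "c", Major: 10, Minor: 237, Access: "rwm", Allow: true},
		// /dev/loopN, any minor
		{Type: "b", Major: 7, Minor: -1, Access: "rwm", Allow: true},
	}
}
```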
Assisted-By: Claude 4.7
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
Bump the Go version to resolve CVEs:
- GO-2026-4918
- GO-2026-4971
- GO-2026-4976
- GO-2026-4977
- GO-2026-4980
- GO-2026-4981
- GO-2026-4982
- GO-2026-4986
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Assisted-by: IBM Bob
The stale issues workflow was using shell syntax ${AGE} instead of
GitHub Actions syntax ${{ env.AGE }} for the days-before-issue-stale
parameter. This prevented the workflow from correctly reading the
calculated AGE value.
Also add days-before-stale: -1 to disable the default stale behavior
and ensure only issue-specific settings apply.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Assisted-By: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
BATS_TEST_COMPLETED is per-test and remains empty in teardown_file.
Track file-level state so successful NIM runs skip the journal dump
while setup or test failures still include node diagnostics.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
The DAX header (2 MiB of NVDIMM metadata + a duplicate MBR) is
unconditionally prepended to every image by set_dax_header(). NVIDIA
images use virtio-blk-pci with disable_image_nvdimm=true, so the
kernel reads MBR #1 directly and never touches the DAX metadata --
it is dead weight.
Add a SKIP_DAX_HEADER environment variable (default "no") that, when
set to "yes", skips the DAX header entirely:
- Removes the 2 MiB DAX overhead from image size calculations in
both the erofs and ext4 paths
- Skips the set_dax_header() call, avoiding compilation and
execution of the nsdax tool
- Passes the variable through to containerised builds
Enable SKIP_DAX_HEADER=yes for both install_image_nvidia_gpu() and
install_image_nvidia_gpu_confidential() in the build pipeline. All
other image builds are unaffected (default remains "no").
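The effect on the size calculation can be sketched as below. This is a
deliberately simplified illustration in Go (the image builder itself
is a shell script and accounts for other overheads too); imageSizeMiB
is a hypothetical helper.

```go
package main

// daxHeaderMiB is the DAX header size stated above.
const daxHeaderMiB = 2

// imageSizeMiB (hypothetical) adds the DAX header overhead to the
// rootfs size unless SKIP_DAX_HEADER is "yes".
func imageSizeMiB(rootfsMiB int, skipDAXHeader string) int {
	if skipDAXHeader == "yes" {
		return rootfsMiB
	}
	return rootfsMiB + daxHeaderMiB
}
```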
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Fedora 42 reaches end-of-life in May 2026. Move the image-builder
container to Fedora 44, which is the current stable release.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Switch the NVIDIA GPU rootfs images (both standard and confidential)
from ext4 to erofs (Enhanced Read-Only File System).
Unlike ext4, which is a read-write filesystem mounted read-only by
convention, erofs is structurally read-only -- no journal, no write
metadata, no superblock write path. This eliminates accidental
mutation and reduces the attack surface inside the guest VM, which
is particularly important for confidential workloads using dm-verity.
Introduce a DEFROOTFSTYPE_NV Makefile variable (set to erofs) for
both Go and Rust runtimes, keeping the global DEFROOTFSTYPE as ext4
so non-NVIDIA configurations are unaffected.
Update all six NVIDIA GPU configuration templates (base, SNP, TDX
for both runtimes) to use @DEFROOTFSTYPE_NV@ instead of the global
@DEFROOTFSTYPE@.
Export FS_TYPE=erofs in install_image_nvidia_gpu() and
install_image_nvidia_gpu_confidential() so the build pipeline
produces erofs images via the image builder.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add full dm-verity and measured rootfs support to
create_erofs_rootfs_image(), bringing it to parity with the ext4 path.
Unlike ext4, which is a read-write filesystem mounted read-only by
convention, erofs is structurally read-only -- no journal, no write
metadata, no superblock write path.
This is a natural fit for dm-verity: erofs never attempts writes, so
verity never has to reject anything. With ext4, the kernel must skip
journal replay on verity-protected devices, which is a fragile
assumption.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Extract build_kernel_verity_params() and setup_verity() from the
inline block inside create_rootfs_image() into top-level functions.
This is a pure refactoring with no behavior change. The verity logic
is moved verbatim, with the only difference being that
build_kernel_verity_params() now takes the image path as an explicit
parameter instead of capturing it from the enclosing scope.
The extracted functions will be reused by create_erofs_rootfs_image()
in a subsequent commit to add dm-verity support for erofs images.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Containerd 2.3 (config schema v4) uses the top-level [debug] table
for log level configuration, not plugins."io.containerd.server.v1.debug"
as was the case in the RC builds.
Update containerd_debug_level_toml_path() to use .debug.level for all
schema versions, matching the released containerd behavior.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Place the NIM service into our test namespace. We are still observing
various situations where, for some reason, the NIM service appears in
the default namespace in our CI.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
After #12857, the VFIO-AP hotplug test fails because runtime-rs
unconditionally removes all /dev/vfio/* devices from the OCI spec
before sending it to the kata agent. The agent then rejects
the container creation with:
```
Missing devices in OCI spec
```
Filter devices from the OCI spec conditionally based on the
vfio_mode configuration (e.g. guest-kernel). Also factor the
filtering logic out into a separate function and add unit tests.
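A sketch of the factored-out filter, in Go for illustration only. The
Device type and filterVFIODevices are hypothetical, and the mapping
from vfio_mode to the strip flag is left to the caller because it is
not spelled out here.

```go
package main

import "strings"

// Device is a simplified, hypothetical OCI spec device entry.
type Device struct {
	Path string
}

// filterVFIODevices drops /dev/vfio/* entries only when strip is
// true; otherwise the list is returned unchanged so the agent still
// sees the devices it expects.
func filterVFIODevices(devs []Device, strip bool) []Device {
	if !strip {
		return devs
	}
	kept := make([]Device, 0, len(devs))
	for _, d := range devs {
		if strings.HasPrefix(d.Path, "/dev/vfio/") {
			continue
		}
		kept = append(kept, d)
	}
	return kept
}
```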
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>