Module images are now built as part of the kernel-tarball target
via build-kernel.sh build-modules-images, so the separate CI
matrix entry is no longer needed.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Document the kernel_modules_images feature: building modules
volumes, TOML and Helm chart configuration, agent behavior,
and security considerations for both confidential and
non-confidential deployments.
Prominently warn that custom modules will not work with
official Kata kernel releases because the KBUILD_SIGN_PIN
used to sign modules is not public, requiring users to
rebuild the kernel with their own signing key.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
In addition to per-set module images (kata-modules-mlx5.img,
kata-modules-ntfs.img), build a combined image containing all
module sets. This reduces the number of virtio-blk devices and
dm-mod.create kernel command line entries needed when a user
wants all available modules.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add kernel config fragment for the NTFS3 filesystem driver as a
loadable module and register it in the orchestrator script so that
a kata-modules-ntfs.img disk image is produced alongside the MLNX
image in the same CI build.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add config fragment, build script, and CI integration for building
Mellanox MLX5/InfiniBand kernel modules as a standalone disk image.
The orchestrator script (build-kernel-modules-images.sh) builds the
kernel with extra module config fragments, runs modules_install,
filters modules by subsystem into per-set staging trees, and
packages each into its own disk image using build-modules-volume.sh.
Since these modules are built within the Kata CI using the same
KBUILD_SIGN_PIN, they are signed and loadable on the official
released Kata kernel.
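The filtering step can be sketched roughly like this; the directory layout, set names, and patterns below are illustrative stand-ins, not taken from the actual script:

```shell
# Hypothetical sketch of the per-set filtering step; the tree layout,
# set names, and patterns are illustrative, not the real script's.
set -euo pipefail

staging="$(mktemp -d)"
all="${staging}/all"

# Stand-ins for the modules_install output tree.
mkdir -p "${all}/kernel/drivers/net/ethernet/mellanox/mlx5/core" \
         "${all}/kernel/fs/ntfs3"
touch "${all}/kernel/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.ko" \
      "${all}/kernel/fs/ntfs3/ntfs3.ko"

# Copy only the .ko files matching a subsystem pattern into a per-set
# staging tree, preserving their relative paths.
filter_set() {
    local name="$1" pattern="$2"
    mkdir -p "${staging}/${name}"
    ( cd "${all}" && \
      find . -path "${pattern}" -name '*.ko' \
           -exec cp --parents {} "${staging}/${name}/" \; )
}

filter_set mlx5 './kernel/drivers/net/ethernet/mellanox/*'
filter_set ntfs './kernel/fs/ntfs3/*'

find "${staging}/mlx5" "${staging}/ntfs" -name '*.ko'
```

Each per-set tree would then be handed to build-modules-volume.sh for packaging.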
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Allow deploying kernel modules images via the Helm chart. Users
specify a list of images with paths and optional verity params
in values.yaml. These are rendered as a ConfigMap, mounted into
the kata-deploy pod, and used to generate a TOML drop-in with
[[hypervisor.<name>.kernel_modules_images]] array of tables.
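The drop-in generation can be sketched as follows; the `qemu` hypervisor name, the image paths, and the root hash are invented for illustration and are not the chart's actual defaults:

```shell
# Illustrative generation of the TOML drop-in; hypervisor name, paths,
# and root hash are made-up examples.
set -euo pipefail

hypervisor="qemu"
dropin="$(mktemp)"

# Emit one array-of-tables entry per configured modules image.
emit_entry() {
    local path="$1" verity="$2"
    cat <<EOF
[[hypervisor.${hypervisor}.kernel_modules_images]]
path = "${path}"
verity_params = "${verity}"

EOF
}

{
    emit_entry "/opt/kata/share/kata-modules-mlx5.img" ""
    emit_entry "/opt/kata/share/kata-modules-ntfs.img" "root_hash=0123abcd"
} > "${dropin}"

cat "${dropin}"
```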
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Reorder create_sandbox to call add_storages before
load_kernel_module so that modules on separate volumes are
available when modprobe runs.
After mounting, detect any storages targeting
/lib/modules/kata-modules-* and, if any are present, write a
/etc/depmod.d/kata-modules.conf with search directives for
those directories and run depmod -a to rebuild the module
dependency database.
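Rendered as shell for illustration only; the real logic lives in the Rust agent, and the directive layout shown here is a sketch (depmod.d(5) is authoritative):

```shell
# Sketch of the agent's post-mount step; paths are illustrative.
set -euo pipefail

# Pretend these mount points were just created by add_storages.
lib_modules="$(mktemp -d)"
mkdir -p "${lib_modules}/kata-modules-0" "${lib_modules}/kata-modules-1"

conf="$(mktemp)"
for d in "${lib_modules}"/kata-modules-*; do
    [ -d "${d}" ] || continue
    echo "search ${d##*/} built-in" >> "${conf}"
done

cat "${conf}"
# In the guest the file would be /etc/depmod.d/kata-modules.conf and,
# when at least one directory was found, it would be followed by:
#   depmod -a
```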
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The kernel modules images feature requires modprobe and depmod
to be available inside the guest VM. Add the kmod package to
the Ubuntu, Alpine, and CentOS rootfs package lists.
Debian inherits from Ubuntu's config so it picks up kmod
automatically. The NVIDIA rootfs already installs kmod
separately in nvidia_chroot.sh.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add support for attaching multiple kernel modules disk images in
the Rust runtime, mirroring the Go runtime implementation.
Each configured image is cold-plugged as a read-only block device
and a Storage entry is sent to the agent to mount it at
/lib/modules/kata-modules-<N>.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add support for attaching multiple kernel modules disk images to
the guest VM as additional block devices. This enables loading
out-of-tree kernel modules from separate, independently managed
volumes without modifying the dm-verity measured rootfs.
Configuration uses a TOML array of tables:

    [[hypervisor.qemu.kernel_modules_images]]
    path = "/path/to/modules-volume-1.img"
    verity_params = ""

    [[hypervisor.qemu.kernel_modules_images]]
    path = "/path/to/modules-volume-2.img"
    verity_params = "root_hash=..."
Each image is cold-plugged as a virtio-blk device (vdb, vdc, ...)
and a Storage entry is sent to the agent to mount it read-only at
/lib/modules/kata-modules-<N>.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add build-modules-volume.sh to package signed kernel modules
into a standalone ext4 disk image that can be attached to a
kata guest VM as a secondary block device.
This allows loading out-of-tree modules without modifying the
dm-verity measured rootfs. The rootfs image and its root hash
remain unchanged.
The script optionally supports dm-verity on the modules volume
itself (-V flag), providing defense-in-depth alongside kernel
module signing.
Security risks documented in the script header:
- Without dm-verity, the volume relies solely on kernel module
signing (CONFIG_MODULE_SIG_FORCE) for integrity.
- With dm-verity, the hash must be verified during attestation
to provide actual security benefit.
- Host-side file permissions on the volume image must prevent
unauthorized modification.
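A minimal sketch of the packaging step, assuming e2fsprogs' `mkfs.ext4 -d` for rootless population; names and sizes are illustrative, and the real script additionally handles signing expectations and the -V dm-verity path:

```shell
# Minimal sketch: package a staged module tree into an ext4 image.
set -euo pipefail

staged="$(mktemp -d)"
mkdir -p "${staged}/extra"
touch "${staged}/extra/example.ko"   # stand-in for a signed module

img="$(mktemp --suffix=.img)"
truncate -s 4M "${img}"

# -d populates the filesystem from the staging directory at mkfs time,
# so no loop mount (and therefore no root privileges) is needed.
mkfs.ext4 -q -F -d "${staged}" "${img}"

# The optional dm-verity step would look roughly like (not run here):
#   veritysetup format "${img}" "${img}.hash"
ls -l "${img}"
```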
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Remove CONFIG_MODULES, CONFIG_MODULE_UNLOAD, and CONFIG_MODULE_SIG
from the NVIDIA GPU config fragments (nvidia.x86_64.conf.in and
nvidia.arm64.conf.in) since these are now provided by the shared
common/modules/modules.conf and common/signing/module_signing.conf
fragments, which are always included for confidential builds.
NVIDIA GPU builds always use -x (confidential), so these options
were redundant. CONFIG_FW_LOADER is kept as it is specific to
GPU firmware loading needs.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
For confidential builds (-x), always include modules/modules.conf
(CONFIG_MODULES=y, CONFIG_MODULE_UNLOAD=y) and
signing/module_signing.conf (CONFIG_MODULE_SIG_FORCE=y, etc.).
This enables two important capabilities for confidential guests:
1. Loadable module support: allows out-of-tree kernel modules
to be loaded from separate modules volume images without
modifying the dm-verity measured rootfs.
2. Module signature enforcement: the kernel rejects any unsigned
or wrongly-signed module, maintaining the trust chain from
the attested kernel to loaded modules.
Previously, module signing was only included when KBUILD_SIGN_PIN
was set. For non-confidential builds, that behavior is preserved.
For confidential builds, module signing is now always enabled
since it is essential for the security model.
Security notes:
- CONFIG_MODULE_SIG_FORCE=y ensures the kernel rejects unsigned
modules, preventing arbitrary code execution in the guest.
- The signing key is generated during kernel build. Users need
this key (protected by KBUILD_SIGN_PIN) to sign out-of-tree
modules.
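One way to spot-check whether a .ko carries an appended signature is to look for the magic trailer that the kernel's scripts/sign-file appends; the module file below is a fake stand-in, purely to demonstrate the check:

```shell
# Check for the appended-signature trailer on a (fake) module file.
set -euo pipefail

ko="$(mktemp --suffix=.ko)"
printf 'fake-module-bytes' > "${ko}"
printf '~Module signature appended~\n' >> "${ko}"

# The trailer is the last 28 bytes of a signed module.
if tail -c 28 "${ko}" | grep -q 'Module signature appended'; then
    echo "signature trailer present"
else
    echo "no signature trailer"
fi
```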
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add a new conditional kernel config fragment in a subdirectory
(following the pattern of signing/ and confidential_containers/)
so it is not auto-included by the common/*.conf wildcard:
- common/modules/modules.conf: Enables CONFIG_MODULES and
CONFIG_MODULE_UNLOAD for out-of-tree kernel module support.
This is required for loading user-compiled modules delivered
via separate modules volume images.
This fragment will be explicitly included by build-kernel.sh
for confidential builds.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
cleanup_and_fail() prints nothing to stdout and returns 1. The
callers used `return "$(cleanup_and_fail ...)"` which expands to
`return ""`, causing bash to error with "numeric argument required".
Replace the command substitution with a compound command that calls
the cleanup function and propagates its exit code via `$?`.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
regorus 0.9.0 introduced a hard, per-engine ceiling on parsed-policy
size (1024 columns / 1 MiB / 20 000 lines, see lexer.rs:30 in
microsoft/regorus). The 1024-column cap rejects realistic policies
emitted by `genpolicy`: the `NVIDIA_REQUIRE_CUDA` environment variable
on `nvcr.io/nvidia/k8s/cuda-sample` is roughly 1.3 KiB on a single line,
so the agent's `set_policy()` returns an error, the agent (PID 1) exits,
the guest kernel reboots, and the runtime eventually times out
connecting to the agent's vsock.
regorus PR #624 ("feat: make policy length limits configurable per
engine") adds `Engine::set_policy_length_config`, but it has not been
released yet -- the latest published version is still 0.9.1, which
predates that change.
Pin `regorus` to the upstream commit that includes #624 and call the
new setter from `AgentPolicy::new_engine()` with values that comfortably
fit any policy we expect to evaluate (64 KiB per line, 16 MiB per file,
200 000 lines) while still rejecting pathological/minified input. Once
a regorus release > 0.9.1 ships with #624, the dependency can be moved
back to crates.io.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The version we used before was released in 2024; it's about time to use
a newer one. The new version of the crate comes with a license,
which addresses a `cargo deny` finding.
Signed-off-by: Markus Rudy <mr@edgeless.systems>
No need to deviate from how other CoCo targets use Trustee; this also
enables us to add more tests (e.g., RVPS) that the ITA Trustee
implementation does not support.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
For HGX B300 systems we need the 595 driver branch; bump
the guest fs driver to support those systems.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Add the nvidia driver version to the artefact cache keys so that
a driver bump triggers image and initrd rebuilds.
Also rename the helper functions to follow a consistent
get_latest_nvidia_* naming convention.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Without this commit any attempt to exec a command in a container will fail
if SELinux is disabled in the guest but an SELinux label is given for
the new process. That will happen pretty much any time SELinux is enabled
on the host (and the container is not privileged).
Signed-off-by: Pavel Mores <pmores@redhat.com>
We'll need to get the `disable_guest_linux` value in the exec handler, too.
This will allow us to avoid duplicating the lookup.
Signed-off-by: Pavel Mores <pmores@redhat.com>
Simple bump to fix CVE GHSA-82j2-j2ch-gfr8:
Denial of service via panic on malformed CRL BIT STRING
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Device plugins may set PCIDEVICE_* environment variables with
non-PCI identifiers (e.g. "mlx5_core.sf.10" for mlx5 Scalable
Functions). The update_env_pci() function assumed all values were
PCI BDF addresses and failed to parse them, causing container
creation to fail with:
"PCI address mlx5_core.sf.10 should have the format DDDD:BB:SS.F"
Skip PCIDEVICE_* entries whose values don't parse as PCI addresses,
leaving them untouched for the workload. The corresponding _INFO
variable is also left as-is since no mapping is collected.
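The skip logic can be mirrored in a small sketch; the helper name and the regex are assumptions for illustration, not the agent's actual code:

```shell
# Only values matching the DDDD:BB:SS.F BDF format get remapped;
# anything else is passed through to the workload unchanged.
is_pci_bdf() {
    [[ "$1" =~ ^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$ ]]
}

for value in "0000:3b:00.0" "mlx5_core.sf.10"; do
    if is_pci_bdf "${value}"; then
        echo "${value}: remap to guest BDF"
    else
        echo "${value}: skipped, left untouched for the workload"
    fi
done
```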
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Same fix as the Go runtime: interfaces whose drivers do not register
a specific netlink kind (e.g. mlx5 Scalable Functions) are reported
with the generic type "device", which is not handled by the endpoint
creation match, causing sandbox creation to fail with:
"unsupported link type: device"
Add "device" as an alternative pattern alongside "veth" so these
interfaces are connected through a TAP + TC-filter bridge.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Interfaces whose drivers do not register a specific netlink kind
(e.g. mlx5 Scalable Functions) are reported with the generic type
"device". The endpoint creation code did not handle this type,
causing sandbox creation to fail with:
"Unsupported network interface: device"
This is particularly visible on arm64 with Mellanox ConnectX NICs
using Scalable Functions, where the ethtool BusInfo returns a
non-PCI identifier (e.g. "mlx5_core.sf.4") so isPhysicalIface()
cannot classify the interface as physical either.
Handle "device" type interfaces the same way as veth endpoints,
connecting them through a TAP + TC-filter bridge.
Additionally, relax getLinkForEndpoint() for VethEndpoint so it
accepts the concrete link type returned by the kernel instead of
asserting *netlink.Veth. A "device" type interface wrapped in a
VethEndpoint returns *netlink.Device from LinkByName(), which
would fail the strict type assertion. All callers only need
link.Attrs(), so accepting any link type is safe.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
At first we thought this only happened with AKS, but it seems this is a
change in k8s 1.36.0, as the tests have now started failing outside of
AKS as well.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
All the CIs are failing on this test; in order to avoid blocking
upstream while allowing the developers enough time to properly fix it,
let's just not execute the test.
This commit should be reverted once a fix is proposed.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Dragonball is only supported on x86_64 and aarch64, so using it as the
default hypervisor means architectures like s390x, powerpc64le, and
riscv64gc have no working default. Switch to QEMU, which is available
across all supported architectures.
Dragonball is still compiled as a feature on x86_64 and aarch64 via
USE_BUILTIN_DB, and users can still override the default with
HYPERVISOR=dragonball.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>