Remove the Go runtime file_mem_backend and valid_file_mem_backends
config knobs, along with the corresponding sandbox annotation handling.
The runtime still enables file-backed shared memory automatically for
virtio-fs by using /dev/shm as the backing directory. This only removes
the user-selectable backend path.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
Create configuration-clh-azure{,-runtime-rs}.toml from the base CLH
configs during build.
This keeps Mariner-specific defaults in explicit config artifacts
instead of ad-hoc runtime mutation.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
TDX QGS takes raw TD report from QEMU/guest VM and signs it in an SGX
enclave. Historically, QGS has supported two transports: vsock and
unix-domain-socket. The former was necessary before the guest kernel
supported the GetQuote "TDVMCALL" hypercall: DCAP library inside the
guest used vsock to talk to QGS directly.
However, with GetQuote, QEMU gets the TDREPORT and sends it to QGS.
In process-to-process communication, unix-domain-socket is a better
approach. This is also the only transport supported by libvirt by default.
With that, align Kata default configuration to use unix-domain-socket
as well. The change in impacts QEMU commandline:
old:
"quote-generation-socket":{"type":"vsock","cid":"2","port":"4050"}
new:
"quote-generation-socket":{"type":"unix","path":"/var/run/tdx-qgs/qgs.socket"}
Host QGS configuration must be changed to listen unix-domain-sockets.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Enable enable_numa=true in the three nvidia-gpu QEMU configuration
templates (base, SNP, TDX). On single-NUMA hosts this is a no-op since
buildNUMATopology() returns nil when there is only one node. On
multi-NUMA hosts it ensures GPU memory accesses are NUMA-local.
Add documentation to all QEMU config templates explaining the VFIO
device NUMA placement validation that occurs when NUMA is enabled.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Switch the NVIDIA GPU rootfs images (both standard and confidential)
from ext4 to erofs (Enhanced Read-Only File System).
Unlike ext4, which is a read-write filesystem mounted read-only by
convention, erofs is structurally read-only -- no journal, no write
metadata, no superblock write path. This eliminates accidental
mutation and reduces the attack surface inside the guest VM, which
is particularly important for confidential workloads using dm-verity.
Introduce a DEFROOTFSTYPE_NV Makefile variable (set to erofs) for
both Go and Rust runtimes, keeping the global DEFROOTFSTYPE as ext4
so non-NVIDIA configurations are unaffected.
Update all six NVIDIA GPU configuration templates (base, SNP, TDX
for both runtimes) to use @DEFROOTFSTYPE_NV@ instead of the global
@DEFROOTFSTYPE@.
Export FS_TYPE=erofs in install_image_nvidia_gpu() and
install_image_nvidia_gpu_confidential() so the build pipeline
produces erofs images via the image builder.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Keep virtio_fs_extra_args support in code, but remove it from default
enable_annotations and add explicit security warnings in Makefiles and
docs.
Release-note note: mirror this hardening in release notes so operators
know this remains opt-in and carries host-side risk when enabled.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
`make vendor` isn't required anymore. People who need vendored code should
use the `tools/packaging/release/generate_vendor.sh` script instead.
Assisted-by: Claude AI
Signed-off-by: Greg Kurz <groug@kaod.org>
Now shipped in the vendored code tarball.
Drop the git tree status check since it isn't needed anymore.
Also stop building with `-mod=vendor`. This requires to
expose GOMODCACHE as suggested by Fabiano Fidêncio.
Signed-off-by: Greg Kurz <groug@kaod.org>
The hardcoded DEFAULT_LAUNCH_PROCESS_TIMEOUT of 6 seconds in the kata
agent is insufficient for environments with NVIDIA GPUs and NVSwitches,
where the attestation-agent needs significantly more time to collect
evidence during initialization (e.g. ~2 seconds per NVSwitch).
When the timeout expires, the agent (PID 1) exits with an error, causing
the guest kernel to perform an orderly shutdown before the
attestation-agent has finished starting.
Make this timeout configurable via the kernel parameter
agent.launch_process_timeout (in seconds), preserving the 6-second
default for backward compatibility. The Go runtime is wired up to pass
this value from the TOML config's [agent.kata] section through to the
kernel command line.
The NVIDIA GPU configs set the new default to 15 seconds.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
runtime-rs lacks several features needed for CPU hotplug on ARM:
pflash/UEFI firmware passthrough, SMP topology in -smp, nr_cpus
kernel parameter, and QMP vCPU add handling for the virt machine
type (which requires core-id only placement with socket/thread/die
set to -1).
Without static sandbox resource management, these gaps cause
failures in tests like k8s-memory.bats where the VM is not correctly
sized for the workload.
Enable static_sandbox_resource_mgmt for aarch64 in the QEMU
runtime-rs configuration so the VM is pre-sized at creation time,
sidestepping the need for hotplug entirely.
Together with this we're aligning the go runtime to the very same
behaviour.
Fixes: #10928
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
The attestation-agent no longer sets nvidia devices to ready
automatically. Instead, we should use nvrc for this. Since this is
required for all nvidia workloads, add it to the default nv kernel
params.
With bounce buffers, the timing of attesting a device versus setting it
to ready is not so important.
Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
2ba0cb0d4a7 did the ground work for using OVMF even for the
qemu-nvidia-gpu, but missed actually setting the OVMF path to be used,
which we'e fixing now.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
As the NVIDIA stack has shifted to using an image for both the
confidential and non-confidential variants, we retire the initrd
build.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
* Introduces the `emptydir_mode` config flag to allow instructing the runtime
to create a block device for emptyDir volumes.
* The block device is created in the original emptyDir folder on the host
so that Kubelet can monitors its disk usage and evict the pod if it exceeds
its sizeLimit. This matches runc and virtio-fs.
* The block device's disk image file is sparse to minimize host disk
footprint.
Fixes: #10560
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
Sometimes OVFM provides incorrect values to the kernel
we override it by telling the kernel to handle the PCI space setup
like allocating the proper window sizes and assigning the proper busses
to each device.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Different kubernetes distributions, such as k0s, use a different kubelet
root dir location instead of the default /var/lib/kubelet, so ConfigMap
and Secret volume propagation were failing.
This adds a kubelet_root_dir config option that the go runtime uses when
matching volume paths and kata-deploy now sets it automatically for k0s
via a drop-in file.
runtime-rs does not need this option: it identifies ConfigMap/Secret,
projected, and downward-api volumes by volume-type path segment
(kubernetes.io~configmap, etc.), not by kubelet root prefix.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This commit adds logic to properly handle memory hotplug
for QemuCCWVirtio in the ExecMemdevAdd() path.
The new logic is triggered only when virtio-mem is enabled.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
This disables virtio-pmem support for Cloud Hypervisor by changing
Kata config defaults and removing the relevant code paths.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
Build a single kernel for both kernel and kernel-confidential on x86_64
and s390x. The kernel is built with TEE support (-x) on those arches only.
This helps to simplilfy and to maintain the code, and having a single
kernel was the original plan since forever.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Build a single kernel for both nvidia-gpu and nvidia-gpu-confidential,
simplifying and reducing code maintenance.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Similar to the kernel_params annotation, add a
kernel_verity_params annotation and add logic to make these
parameters overwritable. For instance, this can be used in test
logic to provide bogus dm-verity hashes for negative tests.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
This change introduces the kernel_verity_parameters knob to the
Go based shim, picking up dm-verity information in a new config
field (the corresponding build variable is already produced by
the shim build). The change extends the shim to parse dm-verity
information from this parameter and to construct the kernel command
line appropriately, based on the indicated initramfs or kernelinit
build variant.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
This reverts commit 923f97bc66 in
order to re-instantiate the logic from commit
e4a13b9a4a.
The latter commit was previously reverted due to the NVIDIA GPU TEE
handler using an initrd, not an image.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Currently, a working TDX setup expects users to install special
TDX support builds from Canonical/CentOS virt-sig for TDX to
work. kata-deploy configured TDX runtime handler to use QEMU
from the distro's paths.
With TDX support now being available in upstream Linux and
Ubuntu 24.04 having an install candidate (linux-image-generic-6.17)
for a new enough kernel, move TDX configuration to use QEMU from
kata-deploy.
While this is the new default, going back to the original
setup is possible by making manual changes to TDX runtime handlers.
Note: runtime-rs is already using QEMUPATH for TDX.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
1. Add disable_block_device_use to CLH settings file, for parity with
the already existing QEMU settings.
2. Set DEFDISABLEBLOCK := true by default for both QEMU and CLH. After
this change, Kata Guests will use by default virtio-fs to access
container rootfs directories from their Hosts. Hosts that were
designed to use Host block devices attached to the Guests can
re-enable these rootfs block devices by changing the value of
disable_block_device_use back to false in their settings files.
3. Add test using container image without any rootfs layers. Depending
on the container runtime and image snapshotter being used, the empty
container rootfs image might get stored on a host block device that
cannot be safely hotplugged to a guest VM, because the host is using
the same block device.
4. Add block device hotplug safety warning into the Kata Shim
configuration files.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Cameron McDermott <cameron@northflank.com>
Disable NVDIMM. When using GPU passthrough, using NVDIMM would create
a r/o file-backed memory region. When using a GPU, QEMU tries to DMA-
map guest memory for the device, resulting in a mapping error:
memory listener initialization failed: Region mem0:
vfio_container_dma_map ... -22 (Invalid argument).
For the CC configs, NVDIMM is disabled by default in qemu_amd64.go
with a warning, but we also explicitly disable the setting in the
shim configuration file.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Remove the agent hotplug timeout parameter from the kernel
command line. Having shifted to VFIO cold-plug, this parameter is
no longer needed.
Remove the no longer required parameter for TDX and thus align the
SNP and TDX configurations.
Add a parameter to avoid the kernel to mount the /dev tmpfs. NVRC
and later on kata-agent attempt this. While kata-agent does not
panic when mounting /dev fails, NVRC makes mounting /dev a hard
requirement.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
OVMF build for Intel TDX (aka "TDVF") was disabled in favor of Ubuntu/
CentOS pre-upstream releases of Intel TDX.
See 4292c4c3b1.
It's time to re-enable the build and move runtime configurations to
use it (the latter will be done in a later commit).
This is a partial revert of 4292c4c3b with the following changes:
- Stop calling OVMF for Intel TDX "TDVF" and follow the naming distros
use for TDX enabled build: OVMF.inteltdx.fd.
- Single binary OVMF.inteltdx.fd is supported using -bios QEMU param.
- Secure Boot infrastructure is disabled since Kata does not support it.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
This reverts commit e4a13b9a4a, as it
caused some issues with the GPU workflows.
Reverting it is better, as it unblocks other PRs.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Updates to the shim-v2 build and the binaries.sh script.
Makeing sure that both variants "confidential" AND
"nvidia-gpu-confidential" are handled.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
In order to have a better way to set things up using a toml editor, we
should take the containerd approach and actually have everything
uncommnted. This will help us to unify how we deal with such values in
the future from the kata-deploy POV.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This settings is not needed anymore with Ubuntu 25.10
and the newest QEMU releases for TDX by Ubuntu.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Remove the nvrc.smi.srs=1 parameter from the kernel command line.
In CC use cases, the attestation agent is expected to set the GPU
ready state. For the CUDA vectorAdd case where attestation agent
is not being used, we set the ready state by adding the kernel
command line parameter through an annotation.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Utilize Kubelet's Pod Resource API to determine device allocations
for the Pod during sandbox creation. Use CDI files to translate the device
IDs to corresponding device paths and perform device injection.
Fixes#12009
Signed-off-by: Joji Mekkattuparamban <jojim@nvidia.com>
Yes, we're dealing with a combination of large images and image-rs
concurrent image layers being not optimal.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Introduce independent IOThread framework for Kata container.
What is the indep_iothreads:
This new feature introduce a way to pre-alloc IOThreads
for QEMU hypervisor (maybe other hypervisor can support too).
Independent IOThreads enables IO to be processed in a separate thread.
To generally improve the performance of each module, avoid them
running in the QEMU main loop.
Why need indep_iothreads:
In Kata container implementation, many devices based on hotplug
mechanism. The real workload container may not sync the same
lifecycle with the VM. It may require to hotplug/unplug new disks
or other devices without destroying the VM. So we can keep the
IOThread with the VM as a IOThread pool(some devices need multi iothreads
for performance like virtio-blk vq-mapping), the hotplug devices
can attach/detach with the IOThread according to business needs.
At the same time, QEMU also support the "x-blockdev-set-iothread"
to change iothreads(but it need stop VM for data secure).
Current QEMU have many devices support iothread, virtio-blk,
virtio-scsi, virtio-balloon, monitor, colo-compare...etc...
How it works:
Add new item in hypervisor struct named "indep_iothreads" in toml.
The default value is 0, it reused the original "enable_iothreads" as
the switch. If the "indep_iothreads" != 0 and "enable_iothreads" = true
it will add qmp object -iothread indepIOThreadsPrefix_No when VM startup.
The first user is the virtio-blk, it will attach the indep_iothread_0
as default when enable iothread for virtio-blk.
Thanks
Chen
Signed-off-by: zhangchen.kidd <zhangchen.kidd@jd.com>
While the primary goal of this change is to detect regressions to the
NVIDIA SNP GPU scenario, various improvements to reflect a more
realistic CC setting are planned in subsequent changes, such as:
* moving away from the overlayfs snapshotter
* disabling filesystem sharing
* applying a pod security policy
* activating the GPUs only after attestation
* using a refined approach for GPU cold-plugging without requiring
annotations
* revisiting pod timeout and overhead parameters (the podOverhead value
was increased due to CUDA vectorAdd requiring about 6Gi of
podOverhead, as well as the inference and embedqa requiring at least
12Gi, respectively, 14Gi of podOverhead to run without invoking the
host's oom-killer. We will revisit this aspect after addressing
points 1. and 2.)
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>