Commit Graph

18964 Commits

Author SHA1 Message Date
Fabiano Fidêncio
270c8cf8ea ci: adjust workflows to rootfs-image-coco-addon
Add rootfs-image-coco-addon to the build-asset-rootfs matrix in the
amd64, arm64 and s390x static tarball workflows so the CoCo addon image
is built as part of CI alongside the other rootfs images.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-11 19:06:41 +02:00
Fabiano Fidêncio
664c47abe5 build: add CoCo addon image build and remove CoCo deps from confidential image
Add install_image_coco_addon() to kata-deploy-binaries.sh which:
- Unpacks the CoCo guest components and pause image tarballs into a
  temporary rootfs directory (under the repo root so Docker-in-Docker
  volume mounts resolve correctly)
- Calls image_builder.sh with USE_DOCKER=1, FS_TYPE=erofs,
  MEASURED_ROOTFS=yes, SKIP_DAX_HEADER=yes, and SKIP_ROOTFS_CHECK=yes
  to produce kata-containers-coco-addon.img + root_hash_coco-addon.txt

Add the rootfs-image-coco-addon-tarball Makefile target with
dependencies on pause-image-tarball and coco-guest-components-tarball.

Remove pause-image-tarball and coco-guest-components-tarball from the
standard confidential image dependencies -- those components now live
exclusively in the CoCo addon image.  NVIDIA confidential images
retain them until the NVIDIA addon split lands.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-11 19:06:41 +02:00
Fabiano Fidêncio
b30b2e0ab5 config: switch CoCo templates from monolithic image to base + addon
Update all six CoCo configuration templates (coco-dev, snp, tdx for
both Go and Rust runtimes) to use the standard base image instead of
the monolithic confidential image, and add an [[extra_images]] section
for the CoCo addon:

  image = "@IMAGEPATH@"          (was @IMAGECONFIDENTIALPATH@)

  [[hypervisor.qemu.extra_images]]
  name = "coco"
  path = "@COCOIMAGEPATH@"
  verity_params = "@COCOVERITYPARAMS@"

Add COCOIMAGENAME (kata-containers-coco-addon.img), COCOIMAGEPATH, and
COCOVERITYPARAMS to both runtime Makefiles so the placeholders are
substituted at install time.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-11 19:06:41 +02:00
Fabiano Fidêncio
3edbff730d image-builder: support building addon images without /sbin/init check
Addon images (e.g. the CoCo guest components addon) are not full root
filesystems -- they contain only the binaries and configuration that
get bind-mounted into the real rootfs at boot.  The existing
check_rootfs() validation requires /sbin/init and systemd, which are
not present in addon images.

Add a SKIP_ROOTFS_CHECK environment variable that, when set to "yes",
bypasses the check_rootfs() call.  Forward the variable into the
container environment when using the Docker-based build path so it
works in both direct and containerised invocations.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-11 19:06:41 +02:00
Fabiano Fidêncio
e146a24ff5 rootfs-builder: enable kata-addon-mount@coco.service in systemd targets
Create a symlink to enable kata-addon-mount@coco.service in
kata-containers.target.wants during rootfs construction for
systemd-based (non-AGENT_INIT) guests.

The unit's ConditionPathExists guard ensures it only activates when
the virtio-addon-coco block device is actually present in the VM,
so enabling it unconditionally in the base image is safe.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-11 19:06:41 +02:00
Fabiano Fidêncio
7f7d55e2ff agent: add systemd template unit for addon image mount/unmount
Add a systemd template unit kata-addon-mount@.service and companion
helper scripts (kata-addon-mount.sh, kata-addon-umount.sh) that handle
the guest-side setup of addon block device images.

The mount script:
- Discovers the block device via /dev/disk/by-id/virtio-addon-<name>
- Reads verity parameters from kata.addon.<name>.verity_params on the
  kernel command line
- Creates a dm-verity device and mounts the EROFS filesystem
- Bind-mounts addon directories (usr/local/bin, etc, pause_bundle)
  into the rootfs

The unit uses ConditionPathExists to only activate when the
corresponding virtio-blk device is present, and is ordered before
kata-agent.service so addon binaries are available when the agent
starts.

The agent Makefile is updated to install the unit and scripts.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-11 19:06:41 +02:00
Fabiano Fidêncio
f513f60bca runtime: add extra_images cold-plug support for Go runtime
Mirror the Rust runtime's extra_images support in the Go runtime:

- hypervisor.go: add ExtraImage struct and ExtraImages field to
  HypervisorConfig.
- config.go: parse [[hypervisor.qemu.extra_images]] TOML sections
  via toExtraImages(), validating that every addon has a non-empty
  name and valid dm-verity parameters.
- qemu.go: in buildDevices(), attach each extra image as a virtio-blk
  drive with ID addon-<name>; in kernelParameters(), append
  kata.addon.<name>.verity_params=... for each extra image.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-10 22:33:57 +02:00
Fabiano Fidêncio
6b1cf962fd runtime-rs: add serial_override to BlockConfig and wire extra_images cold-plug
Add a serial_override field to BlockConfig and DeviceVirtioBlk so that
addon images get a deterministic QEMU serial (addon-<name>) instead of
the auto-generated image-<id>.  This lets the guest discover each addon
via /dev/disk/by-id/virtio-addon-<name>.

Wire up the full cold-plug path for extra_images:

- sandbox.rs: iterate boot_info.extra_images, create read-only
  BlockConfig entries with the correct serial_override, and push them
  as ResourceConfig::ExtraImage.
- resource lib.rs / manager_inner.rs: handle the new ExtraImage variant
  by delegating to the existing block device path.
- cmdline_generator.rs: use serial_override when present for the QEMU
  serial= parameter; append kata.addon.<name>.verity_params=... to the
  kernel command line for each extra image.
- inner.rs: propagate serial_override from BlockConfig to the cmdline
  generator's add_block_device().

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-10 22:33:57 +02:00
Fabiano Fidêncio
da552c118a kata-types: add ExtraImage struct and extra_images field to BootInfo
Introduce the ExtraImage type to describe addon block-device images
(e.g. CoCo guest components) that are cold-plugged into the VM
alongside the main rootfs.

Each ExtraImage carries a short name (used as the virtio-blk serial
and in kernel cmdline verity parameters), a host-side path, and
dm-verity parameters (required -- addon images must be integrity-
protected).

A new extra_images Vec is added to BootInfo so that both the Rust
and Go runtimes can iterate the configured addons when building the
VM device list and kernel command line.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-10 22:33:57 +02:00
Fabiano Fidêncio
284db5421f image-builder/nvidia: skip DAX header for virtio-blk-pci images
The DAX header (2 MiB of NVDIMM metadata + a duplicate MBR) is
unconditionally prepended to every image by set_dax_header(). NVIDIA
images use virtio-blk-pci with disable_image_nvdimm=true, so the
kernel reads MBR #1 directly and never touches the DAX metadata --
it is dead weight.

Add a SKIP_DAX_HEADER environment variable (default "no") that, when
set to "yes", skips the DAX header entirely:
- Removes the 2 MiB DAX overhead from image size calculations in
  both the erofs and ext4 paths
- Skips the set_dax_header() call, avoiding compilation and
  execution of the nsdax tool
- Passes the variable through to containerised builds

Enable SKIP_DAX_HEADER=yes for both install_image_nvidia_gpu() and
install_image_nvidia_gpu_confidential() in the build pipeline. All
other image builds are unaffected (default remains "no").

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-10 17:16:23 +02:00
Fabiano Fidêncio
e6ab7b84f8 image-builder: bump base image from Fedora 42 to 44
Fedora 42 reaches end-of-life in May 2026. Move the image-builder
container to Fedora 44, which is the current stable release.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-10 17:16:23 +02:00
Fabiano Fidêncio
fccbbdf49d nvidia: switch GPU rootfs images to erofs
Switch the NVIDIA GPU rootfs images (both standard and confidential)
from ext4 to erofs (Enhanced Read-Only File System).

Unlike ext4, which is a read-write filesystem mounted read-only by
convention, erofs is structurally read-only -- no journal, no write
metadata, no superblock write path. This eliminates accidental
mutation and reduces the attack surface inside the guest VM, which
is particularly important for confidential workloads using dm-verity.

Introduce a DEFROOTFSTYPE_NV Makefile variable (set to erofs) for
both Go and Rust runtimes, keeping the global DEFROOTFSTYPE as ext4
so non-NVIDIA configurations are unaffected.

Update all six NVIDIA GPU configuration templates (base, SNP, TDX
for both runtimes) to use @DEFROOTFSTYPE_NV@ instead of the global
@DEFROOTFSTYPE@.

Export FS_TYPE=erofs in install_image_nvidia_gpu() and
install_image_nvidia_gpu_confidential() so the build pipeline
produces erofs images via the image builder.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-10 17:16:23 +02:00
Fabiano Fidêncio
f35661fac2 image-builder: add erofs dm-verity support and lz4hc compression
Add full dm-verity and measured rootfs support to
create_erofs_rootfs_image(), bringing it to parity with the ext4 path.

Unlike ext4, which is a read-write filesystem mounted read-only by
convention, erofs is structurally read-only -- no journal, no write
metadata, no superblock write path.

This is a natural fit for dm-verity: erofs never attempts writes, so
verity never has to reject anything. With ext4, the kernel must skip
journal replay on verity-protected devices, which is a fragile
assumption.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-10 17:16:23 +02:00
Fabiano Fidêncio
0e24b72e0d image-builder: refactor dm-verity setup into shared functions
Extract build_kernel_verity_params() and setup_verity() from the
inline block inside create_rootfs_image() into top-level functions.

This is a pure refactoring with no behavior change. The verity logic
is moved verbatim, with the only difference being that
build_kernel_verity_params() now takes the image path as an explicit
parameter instead of capturing it from the enclosing scope.

The extracted functions will be reused by create_erofs_rootfs_image()
in a subsequent commit to add dm-verity support for erofs images.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-10 17:16:23 +02:00
manuelh-dev
2ffd1538a2 Merge pull request #13021 from fidencio/topic/kata-deploy-log-level-containerd-version-4
kata-deploy: Fix containerd debug level path for config schema v4
2026-05-10 07:28:26 -07:00
Fabiano Fidêncio
341a0d366c kata-deploy: Fix containerd debug level path for config schema v4
Containerd 2.3 (config schema v4) uses the top-level [debug] table
for log level configuration, not plugins."io.containerd.server.v1.debug"
as was the case in the RC builds.

Update containerd_debug_level_toml_path() to use .debug.level for all
schema versions, matching the released containerd behavior.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-10 12:02:24 +02:00
Fabiano Fidêncio
46b46589a6 Merge pull request #13020 from manuelh-dev/mahuber/nim-op-placement
tests: nvidia: place NIM service into namespace
2026-05-10 12:01:58 +02:00
Manuel Huber
1c081ff434 tests: nvidia: place NIM service into namespace
Place the NIM service into our test namespace. We are still observing
various situations where for some reasons, the NIM service appears in
the default namespace in our CI.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-10 07:36:23 +00:00
Fabiano Fidêncio
905303b6b0 Merge pull request #13013 from BbolroC/filter-vfio-gk-only-runtime-rs
runtime-rs: filter VFIO devices only in guest-kernel mode
2026-05-08 23:49:50 +02:00
Fabiano Fidêncio
a447a1fb03 Merge pull request #13015 from stevenhorsman/kernel-6.18.28-bump
version: Bump to latest 6.18 kernel
2026-05-08 21:12:50 +02:00
Fabiano Fidêncio
f7be57efe2 Merge pull request #13007 from manuelh-dev/mahuber/dbg-nim-svc
tests: nvidia: Wait for NIM operator pod and print
2026-05-08 20:58:51 +02:00
stevenhorsman
87664c608d version: Bump to latest 6.18 kernel
Pick up the latest kernel that fixes CVE-2026-43284

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-05-08 17:15:24 +01:00
Hyounggyu Choi
754707fe83 runtime-rs: filter VFIO devices only in guest-kernel mode
After #12857, the VFIO-AP hotplug test fails because runtime-rs
unconditionally removes all /dev/vfio/* devices from the OCI spec
before sending it to the kata agent. The agent then rejects
the container creation with:

```
Missing devices in OCI spec
```

Filter devices from the OCI spec conditionally based on the
vfio_mode configuration (e.g. guest-kernel). Also factor the
filtering logic out into a separate function and add unit tests.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2026-05-08 15:39:16 +02:00
Fabiano Fidêncio
8e65e89ade Merge pull request #13011 from kata-containers/fix-warnings
runtime-rs: Fix warnings in rust runtime
2026-05-08 15:12:53 +02:00
Fabiano Fidêncio
a541827a7e Merge pull request #12984 from fidencio/topic/network-pair-use-name-for-lookup
runtime-rs: network: use provided name for virt interface lookup
2026-05-08 14:31:58 +02:00
Fabiano Fidêncio
09bbc70302 Merge pull request #13002 from manuelh-dev/mahuber/unrequire-nim-svc
gatekeeper: Unrequire NVIDIA GPU test (temporary)
2026-05-08 10:02:00 +02:00
Fabiano Fidêncio
2879619d07 Merge pull request #12981 from fidencio/topic/kata-deploy-reduce-memory-consumption
kata-deploy: reduce memory consumption
2026-05-08 09:51:47 +02:00
Alex Lyn
1441b2b84a runtime-rs: Fix warnings in rust runtime
So many unformatted rust codes cause uncommitted change files in
rust runtime and its libs or agent sources, which can be easily
found just by `cargo fmt --all`.

Let's reduce such noisy bad experiences

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-08 14:56:00 +08:00
Manuel Huber
714adec3f8 tests: nvidia: Wait for NIM operator pod and print
Wait for the NIM operator pod to run before deploying NIM services.
Add a temporary debug function to print resource placement into the
different namespaces. Remove this function again when the NIM tests
are stabilized.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-08 06:27:48 +00:00
Manuel Huber
edfb6f5716 gatekeeper: Unrequire NVIDIA GPU test (temporary)
Temporarily unrequire the NVIDIA GPU test. We are experiencing
situations in which two NIM service instances get deployed almost
at the same time into the kata-containers-k8s-tests namespace
(expected current context) and into the default namespace. This
causes the NIM operator to create two deployments in the two
namespaces and to then schedule two pods at the same time. This
usually causes the NIM pod in the default namespace to fail and to
linger.
We can't explain yet why this does not happen in the TEE CI path
and why this is happening at all.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-07 14:39:24 +02:00
Fabiano Fidêncio
8dde5f39b7 tests: dump kata-deploy pod describe+logs on install timeout
When kubectl wait times out the pod never reached Ready, so the
existing log collection (which runs after wait succeeds) produces
"-- No entries --" with zero useful information.

Capture kubectl describe and kubectl logs (including previous
container) immediately on timeout so the next CI run shows exactly
why the pod is stuck (ImagePullBackOff, OOMKilled, probe failures,
containerd restart hang, etc.).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
0f3160276b ci: k8s: skip no-op Helm uninstall on free runners
In cleanup_kata_deploy, bail out early when no kata-deploy Helm release
exists so baremetal-* pre-deploy cleanup on fresh clusters does not
block on helm uninstall --wait (up to 10m).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
f5533950e6 kata-deploy: helm: cap container RSS via resources block
Plumb a resources block into the kata-deploy DaemonSet container in
the Helm chart so the cluster can size its memory footprint
predictably.

Defaults are sized from real /proc/<pid>/status numbers on an
unpatched 3.30.0 build running on a ~220-vCPU GPU node:

  VmRSS:    9944 kB  (~9.7 MiB)   <- actual physical memory
  RssAnon:  2628 kB  (~2.6 MiB)   <- heap + dirty stack pages
  VmData: 464668 kB  (~454 MiB)   <- tokio multi-thread workers'
                                     reserved-but-untouched stacks
  Threads: 225                    <- num_cpus()-driven worker pool

That VmData number is the source of the original "kata-deploy is
using 400 MB" reports: any monitoring layer that surfaces virtual
data size, committed memory, or memory.usage_in_bytes on a kernel
that includes mapped-but-untouched memory will happily reproduce
~400 MB even though only ~10 MiB is ever made resident. The earlier
commits in this series (current_thread tokio, mimalloc, shared kube
client, JSONPath removal, post-install re-exec) collapse VmData into
the tens of MiB and drop the post-install resident set further.

The defaults below are picked accordingly:

  requests:
    cpu: 25m            # install is mostly I/O wait; the post-install
                        # waiter is genuinely idle
    memory: 16Mi        # ~2x headroom over the unpatched VmRSS we
                        # measured, far more over the patched waiter

Operators who hit OOMKilled on unusually large or churny clusters can
override `resources` directly in their Helm values (or set it to {}
to remove all requests and inherit cluster defaults).

Fixes: https://github.com/kata-containers/kata-containers/discussions/12976

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
9e99b21ec5 kata-deploy: re-exec into a tiny post-install waiter
After install completes the kata-deploy DaemonSet pod has nothing else
to do for the rest of its lifetime — it just blocks on SIGTERM and then
runs cleanup. Up to here, the install path has built up substantial
peak heap (kube clients, deserialised Node/RuntimeClass objects, hyper
+ rustls TLS pools, parsed JSON / YAML), and on musl essentially none
of that is ever returned to the kernel. Idling in the same process
therefore pins the pod's RSS at the install peak indefinitely.

Re-exec the binary into a hidden `internal-post-install-wait` action
the moment install succeeds. execve(2) discards the entire address
space, so the waiter starts up holding only the working set it actually
needs (a config struct, the SIGTERM handler, and the health server).

To avoid a probe-availability gap during the handover the install
process clears FD_CLOEXEC on the health listener and passes the raw
FD to the child via KATA_DEPLOY_HEALTH_FD. The child reattaches the
FD as a tokio TcpListener and resumes serving /healthz and /readyz
without ever closing the socket — the kubelet sees no failure.

The detected container runtime is similarly threaded through
KATA_DEPLOY_DETECTED_RUNTIME so the waiter doesn't have to re-query
the apiserver. The new action is tagged `#[clap(hide = true)]` so
`--help` doesn't expose it; users should never invoke it directly.

Add the FD-inheritance helpers in health.rs:

  - prepare_listener_for_exec(): clears FD_CLOEXEC on a listener and
    returns its raw fd number.
  - listener_from_inherited_fd(): wraps an inherited fd back into a
    tokio::net::TcpListener (and re-sets FD_CLOEXEC so future host
    shellouts don't leak the socket).

Fixes: https://github.com/kata-containers/kata-containers/discussions/12976

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
af03ab2228 kata-deploy: replace JSONPath node lookups with typed accessors
The two pieces of node metadata kata-deploy actually reads are
.status.nodeInfo.containerRuntimeVersion and a single label, both of
which were being fetched through a homegrown JSONPath walker:

  - get_node_field() serialised the entire Node object back into a
    serde_json::Value tree on every call,
  - split_jsonpath() / get_jsonpath_value() then walked that tree by
    string key.

Both the deep clone and the helpers themselves are unnecessary — kube's
Node type is already strongly typed. Replace get_node_field() with two
purpose-built accessors that read straight off the Node struct:

  - get_container_runtime_version(): pulls
    status.node_info.container_runtime_version with a clear error if
    the field isn't populated.
  - get_node_label(key): returns Option<String> directly from
    metadata.labels.

Drop split_jsonpath, get_jsonpath_value, and their unit tests (which
existed only to cover the JSONPath walker we no longer have). Update
the three callers (config.rs, runtime/manager.rs, runtime/containerd.rs)
to use the typed accessors.

This removes the entire serde_json::Value clone-and-walk path from the
hot read path and meaningfully cuts allocator churn during install.

Fixes: https://github.com/kata-containers/kata-containers/discussions/12976

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
52e6a19253 kata-deploy: size-optimise the release profile
Apply per-package release-profile overrides for the kata-deploy crate
only:

  opt-level = "z"     # optimise for size, not speed
  codegen-units = 1   # let LLVM see the whole crate when inlining

The binary is throwaway: it runs once at DaemonSet pod start, finishes
the install in seconds, and then sits idle waiting for SIGTERM. There
is no hot path to optimise for speed, so trading a bit of compile time
and a few percent of CPU for a meaningfully smaller text segment is the
right call here.

These overrides live at the workspace root and are scoped via
[profile.release.package."kata-deploy"], so they do not affect the
agent, runtime-rs, dragonball, or any of the libs / tools crates.

Fixes: https://github.com/kata-containers/kata-containers/discussions/12976

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
6cd842494c kata-deploy: cap the tokio worker pool to 2 threads
The default #[tokio::main] expands with flavor = "multi_thread" and
worker_threads = num_cpus::get(). On a typical NVIDIA GPU node
(200+ vCPUs) that allocates 200+ worker threads with ~2 MiB stacks
each, which is the single largest contributor to the DaemonSet pod's
VmData reservation — hundreds of MiB of address space mapped but never
touched, easily reproducing the "kata-deploy is using ~400 MB" reports
on any monitoring layer that surfaces VSZ / committed virtual memory.

Switch to a fixed two-worker multi-thread runtime instead:

  #[tokio::main(flavor = "multi_thread", worker_threads = 2)]

Two workers is exactly the right number for kata-deploy:

  - the install path is overwhelmingly I/O-bound and runs serially;
    one worker is enough to drive the install future itself,
  - install does shell out to `nsenter --target 1 systemctl restart
    containerd` (and friends) via the synchronous std::process::
    Command::output(), which wedges the worker thread it runs on for
    tens of seconds; the second worker keeps the spawned health-server
    task able to answer kubelet probes inside timeoutSeconds while
    the first is blocked.

flavor = "current_thread" would be tighter still on stacks (~4 MiB
saved) but is fundamentally unsafe here: with a single runtime thread,
any blocking host_systemctl call freezes the health server too, the
kubelet fails the readiness probe, and the pod is restarted long
before install completes. The CI lifecycle test reliably reproduces
this as a 15-minute timeout waiting for the kata-deploy DaemonSet pod
to become Ready.

Net result vs. upstream's num_cpus()-driven pool on a 200-vCPU node:
~200 fewer worker threads, ~400 MiB less VmData reservation, while
keeping kubelet probes responsive across the entire install path.

Add the "sync" tokio feature here too so subsequent commits in the
series can use tokio::sync primitives (OnceCell) without another
features bump.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
346119108e kata-deploy: drop unused kube features
The binary doesn't use kube::runtime (controllers, watchers, reflectors)
or kube::derive (the CustomResource macro). Pulling them in only added
transitive deps (kube-runtime, kube-derive, backon, educe, ahash,
async-broadcast, ...) and inflated the binary's static data segment for
no functional gain.

Set default-features = false and select only what the binary actually
calls into: the kube-client surface plus the rustls-tls backend that
hyper-rustls already pulled in transitively. Behaviour is unchanged.

Fixes: https://github.com/kata-containers/kata-containers/discussions/12976

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
114cacf439 Merge pull request #12857 from kata-containers/topic/runtime-rs-coldplug-gpu
runtime-rs: coldplug GPU support
2026-05-07 12:54:04 +02:00
Fabiano Fidêncio
19c194aa94 ci: Add runtime-rs GPU shims to NVIDIA GPU CI workflow
Add qemu-nvidia-gpu-runtime-rs and qemu-nvidia-gpu-snp-runtime-rs to
the NVIDIA GPU test matrix so CI covers the new runtime-rs shims.

Introduce a `coco` boolean field in each matrix entry and use it for
all CoCo-related conditionals (KBS, snapshotter, KBS deploy/cleanup
steps). This replaces fragile name-string comparisons that were already
broken for the runtime-rs variants: `nvidia-gpu (runtime-rs)` was
incorrectly getting KBS steps, and `nvidia-gpu-snp (runtime-rs)` was
not getting the right env vars.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-07 10:33:26 +02:00
Fabiano Fidêncio
1682b73e38 kata-deploy: Add qemu-nvidia-gpu-tdx-runtime-rs shim
Register the new qemu-nvidia-gpu-tdx-runtime-rs shim across the kata-deploy
stack so it is built, installed, and exposed as a RuntimeClass.

This adds the shim to the Rust binary's RUST_SHIMS list (so it uses the
runtime-rs binary), SHIMS list, the qemu-tdx-experimental share name
mapping, and the x86_64 default shim set. The Helm chart gets the new
shim entry in values.yaml, try-kata-nvidia-gpu.values.yaml, and the
RuntimeClass overhead definition in runtimeclasses.yaml.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-07 10:33:26 +02:00
Fabiano Fidêncio
8a33007806 runtime-rs: Add configuration-qemu-nvidia-gpu-tdx-runtime-rs.toml.in
Add a new runtime-rs configuration template that combines the NVIDIA GPU
cold-plug stack with Intel TDX confidential guest support. This is the
runtime-rs counterpart of the Go runtime's configuration-qemu-nvidia-gpu-tdx
template.

The template merges the GPU NV settings (VFIO cold-plug, Pod Resources API,
NV-specific kernel/image/firmware, extended timeouts) with TDX confidential
guest settings (confidential_guest, OVMF.inteltdx.fd firmware, TDX Quote
Generation Service socket, confidential NV kernel and image).

The Makefile is updated with the new config file registration and the
FIRMWARETDVFPATH_NV variable pointing to OVMF.inteltdx.fd.

Also removes a stray tdx_quote_generation_service_socket_port setting
from the SNP GPU template where it did not belong.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-07 10:33:26 +02:00
Fabiano Fidêncio
2280620cb9 kata-deploy: Add qemu-nvidia-gpu-snp-runtime-rs shim
Register the new qemu-nvidia-gpu-snp-runtime-rs shim across the kata-deploy
stack so it is built, installed, and exposed as a RuntimeClass.

This adds the shim to the Rust binary's RUST_SHIMS list (so it uses the
runtime-rs binary), SHIMS list, the qemu-snp-experimental share name
mapping, and the x86_64 default shim set. The Helm chart gets the new
shim entry in values.yaml, try-kata-nvidia-gpu.values.yaml, and the
RuntimeClass overhead definition in runtimeclasses.yaml.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-07 10:33:26 +02:00
Fabiano Fidêncio
e98a864285 runtime-rs: Add configuration-qemu-nvidia-gpu-snp-runtime-rs.toml.in
Add a new runtime-rs configuration template that combines the NVIDIA GPU
cold-plug stack with AMD SEV-SNP confidential guest support. This is the
runtime-rs counterpart of the Go runtime's configuration-qemu-nvidia-gpu-snp
template.

The template merges the GPU NV settings (VFIO cold-plug, Pod Resources API,
NV-specific kernel/image/firmware, extended timeouts) with the SNP
confidential guest settings (confidential_guest, sev_snp_guest, SNP ID
block/auth, guest policy, AMDSEV.fd firmware, confidential NV kernel and
image).

The Makefile is updated with the new config file registration, the
CONFIDENTIAL_NV image/kernel variables, and FIRMWARESNPPATH_NV pointing
to AMDSEV.fd.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-07 10:33:26 +02:00
Fabiano Fidêncio
92a8cd56d1 kata-deploy: Add qemu-nvidia-gpu-runtime-rs shim
Register the Rust NVIDIA GPU runtime as a kata-deploy shim so it gets
installed and configured alongside the existing Go-based
qemu-nvidia-gpu shim.

Add qemu-nvidia-gpu-runtime-rs to the RUST_SHIMS list and the default
enabled shims, create its RuntimeClass entry in the Helm chart, and
include it in the try-kata-nvidia-gpu values overlay. The kata-deploy
installer will now copy the runtime-rs configuration and create the
containerd runtime entry for it.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-07 10:33:26 +02:00
Fabiano Fidêncio
1ada256581 runtime-rs: Add configuration-qemu-nvidia-gpu-runtime-rs.toml.in
Add a QEMU configuration template for the NVIDIA GPU runtime-rs shim,
mirroring the Go runtime's configuration-qemu-nvidia-gpu.toml.in. The
template uses _NV-suffixed Makefile variables for kernel, image, and
verity params so the GPU-specific rootfs and kernel are selected at
build time.

Wire the new config into the runtime-rs Makefile: define
FIRMWAREPATH_NV with arch-specific OVMF/AAVMF paths (matching the Go
runtime's PR #12780), add EDK2_NAME for x86_64, and register the config
in CONFIGS/CONFIG_PATHS/SYSCONFIG_PATHS so it gets installed alongside
the other runtime-rs configurations.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-07 10:33:26 +02:00
Alex Lyn
a51e0b630e agent: Update VFIO device handling for GPU cold-plug
Extend the in-guest agent's VFIO device handler to support the cold-plug
flow. When the runtime cold-plugs a GPU before the VM boots, the agent
needs to bind the device to the vfio-pci driver inside the guest and
set up the correct /dev/vfio/ group nodes so the workload can access
the GPU.

This updates the device discovery logic to handle the PCI topology that
QEMU presents for cold-plugged vfio-pci devices and ensures the IOMMU
group is properly resolved from the guest's sysfs.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00
Fabiano Fidêncio
cb6fb51920 runtime-rs: Do not pass through audio device from IOMMU group
NVIDIA GPUs often have an HDA audio controller (PCI class 0x0403) in the
same IOMMU group. This device should not be passed through to the guest,
just like Host and PCI bridges.

Change filter_bridge_device() to accept a slice of PCI class bitmasks
and add 0x0403 (audio) to the ignore list alongside 0x0600 (host/PCI
bridge). This matches the Go runtime fix from NVIDIA/kata-containers#26.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00
Alex Lyn
7e2dff8179 runtime-rs: Wire BlockDeviceModern into rawblock volume and container
Use BlockCfgModern for rawblock volumes when the hypervisor supports it,
passing logical and physical sector sizes from the volume metadata.

In the container manager, clear Linux.Resources fields (Pids, BlockIO,
Network) that genpolicy expects to be null, and filter VFIO character
devices from Linux.Devices to avoid policy rejection.

Update Dragonball's inner_device to handle the DeviceType::VfioModern
variant in its no-op match arm.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00
Alex Lyn
eecb1a246c runtime-rs: Add resource manager VFIO modern handling and CDI wiring
Extend the resource manager to handle VfioModern and BlockModern device
types when building the agent's device list and storage list. For VFIO
modern devices, the manager resolves the container path and sets the
agent Device.id to match what genpolicy expects.

Rework CDI device annotation handling in container_device.rs:
- Strip the "vfio" prefix from device names when building CDI annotation
  keys (cdi.k8s.io/vfio0, cdi.k8s.io/vfio1, etc.)
- Remove the per-device index suffix that caused policy mismatches
- Add iommufd cdev path support alongside legacy VFIO group paths

Update the vfio driver to detect iommufd cdev vs legacy group from
the CDI device node path.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-07 10:33:26 +02:00