Merge pull request #13227 from fidencio/topic/rfc-composable-vm-images-update

docs: detail composable image runtime contracts in proposal
This commit is contained in:
Fabiano Fidêncio
2026-06-22 21:07:04 +02:00
committed by GitHub
2 changed files with 272 additions and 16 deletions

View File

@@ -222,21 +222,96 @@ The mount script (`kata-addon-mount.sh`) performs the following steps:
4. **EROFS mount** — mounts the verified device (or the raw partition if
verity is not configured) read-only at `/run/kata-addons/<name>/`.
### Agent-side: dynamic path resolution
### Agent-side: data-driven component manifest
The kata-agent resolves CoCo guest component paths at runtime rather than
relying on hardcoded legacy paths. For each component, the agent checks the
addon mount path first, then falls back to the legacy location:
> **Implementation note.** An earlier revision of this proposal described the
> agent as resolving a small, hardcoded list of component paths. While
> implementing and testing the addon we found this too rigid: the addon
> evolved to ship multiple attestation-agent flavours, per-process environment
> requirements, and ordering constraints that the agent should not have to know
> about. The design below — a **data-driven manifest** owned by the addon — is
> what we converged on so that adding or reconfiguring a bundled component
> requires **no** kata-agent code change.
| Component | Addon path | Legacy path |
Each addon ships a manifest at `etc/kata-addons/components.toml`. When the addon
is mounted, the kata-agent reads it and builds its launch plan from it. The
manifest declares the processes to launch and the resources they expose; all
paths are **relative to the addon mount root** (`/run/kata-addons/<name>`).
A process entry carries:
- `id` and a `level` used to order and gate launches.
- `args` (and `optional_args`, which are appended only when a named context
variable is non-empty).
- `config`, `wait_socket` (the agent blocks until the socket appears), and
`timeout_secs`.
- `env` — extra environment variables for the spawned process.
- An optional `select` selector plus `[process.variants.<name>]` tables, so a
single logical component can ship multiple binaries and the consumer picks
one at runtime.
`${...}` tokens in `args`, `config`, `env` values, and variant fields are
substituted by the agent from a runtime context it assembles (socket and config
paths, the addon mount root `${addon_root}`, the selected `${attester_variant}`,
etc.). Introducing a brand-new variable is the only change that ever needs to
touch the agent.
Abridged manifest (see "Attester variant selection" and "Runtime dependencies"
below for why the `nvidia` variant and the `PATH`/`LD_LIBRARY_PATH` entries are
shaped the way they are):
```toml
schema_version = 1
[paths]
"ocicrypt-config" = "etc/ocicrypt_config.json"
"pause-bundle" = "pause_bundle"
[[process]]
id = "attestation-agent"
level = 1
args = ["--attestation_sock", "${aa_attestation_uri}"]
config = "${aa_config_path}"
wait_socket = "${aa_attestation_socket}"
select = "${attester_variant}"
[process.variants.default]
path = "usr/local/bin/attestation-agent"
[process.variants.nvidia]
path = "usr/local/bin/attestation-agent-nv"
env = { LD_LIBRARY_PATH = "${addon_root}/usr/local/lib:/run/kata-addons/gpu/usr/lib" }
[[process]]
id = "confidential-data-hub"
level = 2
path = "usr/local/bin/confidential-data-hub"
config = "${cdh_config_path}"
env = { OCICRYPT_KEYPROVIDER_CONFIG = "${ocicrypt_config_path}", PATH = "${addon_root}/usr/sbin:/bin:/sbin:/usr/bin:/usr/sbin" }
wait_socket = "${cdh_socket}"
[[process]]
id = "api-server-rest"
level = 3
path = "usr/local/bin/api-server-rest"
args = ["--features", "${rest_api_features}"]
```
When **no** addon is mounted, the agent falls back to a built-in launch plan
that reproduces the legacy behaviour (components launched from
`/usr/local/bin/...` in the rootfs). The same dual-path principle applies to
the non-process resources declared under `[paths]`: the agent resolves them
inside the addon first and falls back to the legacy location otherwise:
| Resource | Addon path | Legacy path |
|------------------------|---------------------------------------------------------|---------------------------------|
| attestation-agent | `/run/kata-addons/coco/usr/local/bin/attestation-agent` | `/usr/local/bin/attestation-agent` |
| attestation-agent(-nv) | `/run/kata-addons/coco/usr/local/bin/attestation-agent[-nv]` | `/usr/local/bin/attestation-agent` |
| confidential-data-hub | `/run/kata-addons/coco/usr/local/bin/confidential-data-hub` | `/usr/local/bin/confidential-data-hub` |
| api-server-rest | `/run/kata-addons/coco/usr/local/bin/api-server-rest` | `/usr/local/bin/api-server-rest` |
| ocicrypt_config.json | `/run/kata-addons/coco/etc/ocicrypt_config.json` | `/etc/ocicrypt_config.json` |
| pause_bundle | `/run/kata-addons/coco/pause_bundle` | `/pause_bundle` |
This dual-path resolution approach:
This approach:
- Preserves **backward compatibility** with existing monolithic rootfs images
where CoCo components are baked into the base image.
@@ -244,12 +319,164 @@ This dual-path resolution approach:
stub files or directories for the addon components.
- Works transparently on a **read-only rootfs** — no bind-mounting, no
remounting, no writes to the root filesystem.
- Keeps the agent **generic** — addon-specific names, env, and ordering live
in the manifest, not in agent code.
### Attester variant selection and the NVRC contract
The CoCo addon ships **two** attestation-agent builds: the stock
`attestation-agent` and an NVIDIA-attester build, `attestation-agent-nv`, that
collects GPU evidence in addition to the TEE evidence. Which one runs is chosen
by the manifest's `select = "${attester_variant}"` selector, and the value of
`${attester_variant}` is driven by the guest init:
- On a plain confidential guest the kata-agent runs init itself and the variable
defaults to `default` → the stock attester launches.
- On a GPU guest, **NVRC** (the NVIDIA runtime config that owns early boot)
detects the GPU addon and exports `KATA_ATTESTER_VARIANT=nvidia` before
exec'ing the kata-agent. The agent forwards that into the substitution context
as `${attester_variant}`, so the `nvidia` variant launches.
This is a small **cross-component contract**: the environment variable name
(`KATA_ATTESTER_VARIANT`) and the `nvidia` value are produced by NVRC and
consumed by the kata-agent, which keeps the agent free of any GPU- or
NVIDIA-specific knowledge — it only knows how to forward a selector into the
manifest. We arrived at this split after first trying to special-case the
attester inside the agent; pushing the decision out to NVRC + the manifest kept
both the agent and the addon generic.
#### Why one addon with two builds (and not two addons)
Shipping **two** attestation-agent builds inside a **single** CoCo addon is a
deliberate choice, and it is worth being explicit about it because addons are
otherwise meant to *eliminate* duplication.
- A single CoCo addon serves **both** plain confidential guests (TEE evidence
only) and confidential GPU guests (TEE + GPU evidence). The only thing that
differs between them is which attestation-agent binary runs; everything else
in the addon (confidential-data-hub, api-server-rest, ocicrypt config, pause
bundle, `cryptsetup`) is shared verbatim. Splitting the NVIDIA attester into
its own addon would duplicate that shared payload and force every confidential
GPU guest to compose two CoCo-flavoured addons instead of one.
- The **cost** of keeping both builds together is precisely the manifest's
`select`/`variants` machinery: the manifest has to be aware that the
attestation-agent comes in two flavours and pick one at runtime. We consider
that a fair trade — the complexity is confined to data (the manifest), the
agent stays generic, and the addon stays a single, coherent "CoCo" unit.
- A separate addon only pays off when its contents are **substantially**
different (e.g. the GPU addon, which carries the driver userspace), not for
two near-identical builds of the same component. No further attester variants
are planned today, but if one appeared it would be another `[process.variants.<name>]`
entry — not a new image.
### Runtime dependencies: dynamic linking and secure-mount tooling
Two classes of runtime dependency surfaced only once the components actually ran
inside a composed VM. Both are resolved by the manifest `env` entries above
rather than by agent code, and both informed where binaries and libraries must
physically live.
#### NVIDIA attester dynamic libraries
`attestation-agent-nv` links `libnvat.so` (the NVIDIA Attestation SDK), which in
turn:
- `dlopen`s `libnvidia-ml.so.1` (NVML) at runtime to gather GPU evidence. NVML
is part of the **GPU addon**, mounted at `/run/kata-addons/gpu` with its driver
libraries under `usr/lib`.
- pulls in a closure of non-glibc libraries (libxml2, zlib, lzma, the C++
runtime, ...) that the guest rootfs does not otherwise ship.
Neither set is present in a stock guest, so:
- The CoCo addon build bundles `libnvat.so` **and every non-glibc transitive
dependency** next to it under `usr/local/lib`.
- The `nvidia` **manifest variant** (the `[process.variants.nvidia]` entry, not
a separate image) sets an `LD_LIBRARY_PATH` that lists **both** the CoCo
addon's `usr/local/lib` (for `libnvat` and its closure) **and** the GPU
addon's `usr/lib` (for NVML). Setting only the first was the cause of an
`NVAT Error 500: NVML Initialization Failed` we hit during bring-up; the
RCAR handshake then never produced GPU evidence and attestation failed.
Because the agent applies manifest `env` on top of the inherited environment
(without clearing it), and because the `nvidia` variant only ever runs when the
GPU addon is present, referencing the GPU addon's well-known mount path here is
safe.
#### CDH secure-mount tooling (encrypted vs plain storage)
The confidential-data-hub `secure_mount` feature shells out — by `PATH`
lookup — to external tools, and these split cleanly into two buckets that belong
in two different places:
- **Plain storage setup** — `mke2fs`/`mkfs.ext4` (and `dd`, plus
`/etc/mke2fs.conf`) format the scratch volume. This is needed for *unencrypted*
ephemeral storage too, so it belongs in the **base image** and ships there
unconditionally.
- **Encrypted storage** — `cryptsetup` LUKS-formats and opens the volume. This is
a CoCo-only capability, so it belongs with the CoCo guest components in the
**coco addon**.
This is the same `veritysetup`-vs-`cryptsetup` reasoning already applied to addon
mounting: the base must always carry `veritysetup` (it opens *every* addon as a
dm-verity device before mounting), and `cryptsetup` shares an identical
shared-library closure, so wherever `veritysetup` lives the libraries for
`cryptsetup` are already present.
Because CDH runs in the **base rootfs namespace** but `cryptsetup` lives in the
addon (which is not on the default search path), the manifest sets CDH's `PATH`
to `${addon_root}/usr/sbin:/bin:/sbin:/usr/bin:/usr/sbin` — the addon's
`cryptsetup` first, then the base directories that carry `mke2fs`/`mkfs.ext4`/`dd`.
(The kata-agent launches components with `PATH=/bin:/sbin:/usr/bin:/usr/sbin`;
since setting any `env` value replaces that variable wholesale, the base
directories are restored explicitly.)
How each tool is provisioned depends on the **base flavour** (see
"Base image flavours" below), but the placement *contract* is identical:
| Tool | Bucket | Full-distro base (Ubuntu) | Distroless base (chiseled NVIDIA + NVRC) |
|------------------------------|--------------------|----------------------------------------------------------|---------------------------------------------------------|
| `veritysetup` | base, always | `cryptsetup-bin` (unconditional in `ubuntu/config.sh`) | copied into the base layout unconditionally |
| `cryptsetup` | coco addon | also present in the full-distro base via `cryptsetup-bin`| binary bundled in the coco addon; libs come from base |
| `mke2fs`/`mkfs.ext4`/`dd` | base, for CoCo | `e2fsprogs` (on `CONFIDENTIAL_GUEST=yes`) + `coreutils` | copied into the base layout |
The distroless path needs explicit copying because nothing lands in a chiseled
image unless placed there deliberately, and the NVIDIA base is never built with
`CONFIDENTIAL_GUEST=yes`. The full-distro base is built with
`CONFIDENTIAL_GUEST=yes`, so the same tools arrive as ordinary packages. In both
cases the addon's `cryptsetup` resolves its libraries against the base, which
requires the base and the coco-addon builder to stay on the **same distro/ABI**
(Ubuntu 24.04 "noble" today).
### Image build pipeline
#### Base image flavours
Kata base images come in two flavours, distinguished by **who owns early boot**.
This distinction — not "Ubuntu vs NVIDIA" — is what drives the differences in
how tooling is provisioned and who mounts the addons:
- **Full-distro base** — ships a complete distribution with **systemd** as init.
systemd discovers and mounts the addons (via `kata-addon-mount@.service`), and
the binaries/libraries the guest needs arrive as ordinary distribution
packages. The standard confidential `kata-containers.img` (Ubuntu) is today's
instance.
- **Distroless base** — a minimal, chiseled image with no full init system. A
dedicated early-boot component takes over the responsibilities systemd would
otherwise have (addon discovery and mounting, attester selection,
orchestration), and any tooling must be placed into the image deliberately
rather than pulled in as packages. The chiseled `base-nvidia` driven by
**NVRC** is today's instance.
These are the only two flavours today and no others are planned, but a new base
would fall into one of these categories and follow the same mechanisms (systemd
units for a full-distro base; an NVRC-like early-boot owner for a distroless
one). The sections below describe the build for both; where they diverge it is
because of the flavour, not because the NVIDIA image is a special case.
#### Base image
The standard `kata-containers.img` is built as before, but **without** CoCo
The full-distro `kata-containers.img` is built as before, but **without** CoCo
guest components (attestation-agent, confidential-data-hub, api-server-rest,
ocicrypt config, pause bundle). It includes:
@@ -257,15 +484,36 @@ ocicrypt config, pause bundle). It includes:
- systemd and the `kata-addon-mount@.service` template unit
- `cryptsetup-bin` (provides `veritysetup`) — required unconditionally so
that the base image can mount verity-protected addons regardless of whether
the base itself was built with `CONFIDENTIAL_GUEST=yes`.
the base itself was built with `CONFIDENTIAL_GUEST=yes`. On Ubuntu this same
package also provides `cryptsetup`, so the full-distro base happens to carry
the encrypted-storage binary too.
- The plain-storage tooling for CDH `secure_mount``mke2fs`/`mkfs.ext4`,
`dd`, and `/etc/mke2fs.conf`. On the standard base these come from
`e2fsprogs` (added when `CONFIDENTIAL_GUEST=yes`) and `coreutils`. See
"Runtime dependencies" for why the encrypted-storage `cryptsetup` lives in
the addon instead.
The **distroless base** (`base-nvidia`, driven by NVRC) is a chiseled,
driver-agnostic image rather than a full distro, so the items above do not
arrive as packages — they are copied into the base layout explicitly:
`veritysetup` and its library closure unconditionally, and the
`mke2fs`/`mkfs.ext4`/`dd`/`mke2fs.conf` plain-storage tooling alongside it.
#### CoCo addon image
The `kata-containers-coco-addon.img` is built by:
1. Unpacking the CoCo guest components tarball into a temporary rootfs.
1. Unpacking the CoCo guest components tarball into a temporary rootfs. Besides
the agent-launched binaries (attestation-agent, attestation-agent-nv,
confidential-data-hub, api-server-rest) this tarball also carries:
- `cryptsetup` under `usr/sbin` — the encrypted-storage binary for CDH
`secure_mount` (its shared libraries are resolved against the base; see
"Runtime dependencies").
- the NVIDIA attester libraries under `usr/local/lib``libnvat.so` plus its
non-glibc transitive closure (libxml2, zlib, lzma, the C++ runtime).
2. Unpacking the pause image tarball into the same rootfs.
3. Running the image builder with:
3. Writing the component manifest to `etc/kata-addons/components.toml`.
4. Running the image builder with:
- `FS_TYPE=erofs` — EROFS filesystem for compact, read-only storage.
- `MEASURED_ROOTFS=yes` — creates a dm-verity hash partition.
- `SKIP_DAX_HEADER=yes` — no DAX header (virtio-blk, not NVDIMM).
@@ -378,12 +626,16 @@ layer, not at the VM assembly layer.
### Additional addon types
The architecture is designed to support multiple addon images. Planned
extensions include:
The architecture is designed to support multiple addon images:
- **GPU addon** — NVIDIA GPU support components (NVRC, vGPU drivers)
in a separate addon image, enabling GPU support without bloating
the base or CoCo images.
- **GPU addon** — the NVIDIA GPU userspace (driver libraries, NVML, the
container-toolkit binaries, kernel modules) lives in a `gpu-addon` image
mounted at `/run/kata-addons/gpu`, carved out of the same build as the
driver-agnostic `base-nvidia` image. NVRC orchestrates early boot, loads the
modules from the addon, and composes the GPU addon with the CoCo addon on
confidential GPU guests. This addon is implemented; its interplay with the
CoCo addon (NVML resolution, attester selection) is covered in
"Runtime dependencies" and "Attester variant selection" above.
- **Custom addons** — users can build their own addon images for
workload-specific libraries, models, or configurations.

View File

@@ -15,6 +15,7 @@ virtcontainers
# Open source projects
bootc
distroless
materialx
ostree
@@ -53,6 +54,7 @@ NCCL
NRAS
NVCR
NVLINK
NVML
nvswitch
nvswitches
NVRC
@@ -60,6 +62,7 @@ nvlib
pgpu
PPCIE
Quadro
RCAR
# Storage & Filesystems
AUFS
@@ -95,6 +98,7 @@ ttrpc
vsock
# Security & Encryption
LUKS
ocicrypt
# Container, Runtime & Misc terms