Rename all host-visible names of the nydus-snapshotter instance managed
by kata-deploy from the generic "nydus-snapshotter" to "nydus-for-kata-tee".
This covers the systemd service name, the containerd proxy plugin key,
the runtime class snapshotter field, the data directory
(/var/lib/nydus-for-kata-tee), the socket path (/run/nydus-for-kata-tee/),
and the host install subdirectory.
The rename makes it immediately clear that this nydus-snapshotter instance
is the one deployed and managed by kata-deploy specifically for Kata TEE
use cases, rather than any general-purpose nydus-snapshotter that might
be present on the host.
Because the old code operated under a completely separate set of paths
(nydus-snapshotter.*), any previously deployed installation continues
to run without interference during the transition to this new naming.
CI pipelines and operators can upgrade kata-deploy on their own schedule
without having to coordinate an atomic cutover: the old service keeps
serving its existing workloads until it is explicitly replaced, and the
new deployment lands cleanly alongside it.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Removing /var/lib/nydus-snapshotter during install or uninstall creates
a split-brain state: the nydus backend starts empty while containerd's
BoltDB (meta.db) still holds snapshot records from the previous run.
Any subsequent image pull then fails with:
"unable to prepare extraction snapshot:
target snapshot \"sha256:...\": already exists"
An earlier attempt cleaned up containerd's BoltDB via `ctr snapshots rm`
before wiping the directory, but that cleanup is inherently fragile:
- It requires the nydus gRPC service to be reachable at cleanup time.
  If the service is stopped, crashed, or not yet running, every `ctr`
  call silently fails and the stale records remain.
- Any workload still actively using a snapshot blocks the entire
  cleanup, making it impossible to guarantee a clean state.
The correct invariant is that meta.db and the nydus backend always
agree. Preserving the data directory unconditionally guarantees this:
- Fresh install: data directory does not exist, nydus starts empty.
- Reinstall: existing snapshots and nydus.db are preserved, meta.db
  and backend remain in sync, new binary starts cleanly.
- After uninstall: containerd is reconfigured without the nydus
  proxy_plugins entry and restarted, so the snapshot records in
  meta.db are completely dormant; nothing will use them. If nydus
  is reinstalled later, the data directory is still present and both
  sides remain in sync, so no split-brain can occur.
Any stale snapshots from previous workloads are garbage-collected by
containerd once the images referencing them are removed.
This also removes the cleanup_containerd_nydus_snapshots,
cleanup_nydus_snapshots, and cleanup_nydus_containers helpers that
were introduced by the earlier (fragile) attempt.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
When /var/lib/nydus-snapshotter is removed, containerd's BoltDB
(meta.db at /var/lib/containerd/) still holds snapshot records for
the nydus snapshotter. On the next install these stale records cause
image pulls to fail with:
"unable to prepare extraction snapshot:
target snapshot \"sha256:...\": already exists"
The failure path in core/unpack/unpacker.go:
1. sn.Prepare() → metadata layer finds the target chainID in BoltDB
→ returns AlreadyExists without touching the nydus backend.
2. sn.Stat() → metadata layer finds the BoltDB record, then calls
s.Snapshotter.Stat(bkey) on the nydus gRPC backend → NotFound
(backend was wiped).
3. The unpacker treats NotFound as a transient key-collision race and
retries 3 times; all 3 attempts hit the same dead end, and the
pull is aborted.
The commit message of 62ad0814c ("nydus: Always start from a clean
state") assumed "containerd will re-pull/re-unpack when it finds non-
existent snapshots", but that is not what happens: the metadata layer
intercepts the Prepare call in BoltDB before the backend is ever
consulted.
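The failure path above can be illustrated with a toy model. This is a
minimal sketch, not containerd code: the MetadataLayer class, the unpack
function, and the snapshot key are hypothetical names chosen for
illustration, and the retry count of 3 is taken from the description above.

```python
# Toy model of the split-brain between meta.db and the nydus backend.
# All names here are hypothetical; this is not containerd's implementation.


class AlreadyExists(Exception):
    pass


class NotFound(Exception):
    pass


class MetadataLayer:
    """Stands in for containerd's BoltDB-backed snapshot metadata."""

    def __init__(self, records, backend):
        self.records = set(records)  # snapshot keys recorded in meta.db
        self.backend = backend       # snapshot keys the backend really holds

    def prepare(self, key):
        # The metadata record is checked first; the backend is not consulted.
        if key in self.records:
            raise AlreadyExists(key)
        self.records.add(key)
        self.backend.add(key)

    def stat(self, key):
        # The record exists, so the call is forwarded to the backend.
        if key not in self.records or key not in self.backend:
            raise NotFound(key)
        return key


def unpack(meta, key, attempts=3):
    """Return True if the snapshot is usable, False if the pull is aborted."""
    for _ in range(attempts):
        try:
            meta.prepare(key)
            return True
        except AlreadyExists:
            try:
                meta.stat(key)
                return True   # the snapshot really exists, reuse it
            except NotFound:
                continue      # treated as a transient race, so retry
    return False
```

With a record present in the metadata but an empty backend, every attempt
takes the AlreadyExists then NotFound path and unpack() gives up. When both
sides agree (both hold the key, or both are empty), it succeeds.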
Fix: call cleanup_containerd_nydus_snapshots() before stopping the
nydus service (and thus before wiping its data directory) in both
install_nydus_snapshotter and uninstall_nydus_snapshotter.
The cleanup must run while the service is still up because ctr
snapshots rm goes through the metadata layer which calls the nydus
gRPC backend to physically remove the snapshot; if the service is
already stopped the backend call fails and the BoltDB record remains.
The cleanup:
- Discovers all containerd namespaces via `ctr namespaces ls -q`
(falls back to k8s.io if that fails).
- Removes containers whose Snapshotter field matches the nydus plugin
name; these become dangling references once snapshots are gone and
can confuse container reconciliation after an aborted CI run.
- Removes snapshots round by round (leaf-first) until either the list
is empty or no progress can be made (see below).
Note: containerd's GC cannot substitute for this explicit cleanup.
The image record (a GC root) references content blobs which reference
the snapshots via gc.ref labels, keeping the entire chain alive in
the GC graph even after the nydus backend is wiped.
Snapshot removal rounds
-----------------------
Snapshot chains are linear: an image with N layers produces a chain
of N snapshots, each parented on the previous. Only the current leaf
can be removed each round, so N layers require exactly N rounds.
There is no fixed round cap — the loop terminates when either the
list reaches zero (success) or a round removes nothing at all
(all remaining snapshots are actively in use by running workloads).
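The round-based loop can be sketched as follows. This is an illustrative
model only: the function name, the parent map, and the in_use set are
assumptions, and the real helper drives `ctr snapshots rm` rather than
editing a dictionary.

```python
def remove_snapshots_leaf_first(parents, in_use):
    """Remove snapshots round by round, leaves first.

    parents maps each snapshot to its parent (None for the base layer).
    in_use is the set of snapshots still held by running workloads.
    Returns (all_removed, rounds_that_made_progress).
    """
    remaining = dict(parents)
    rounds = 0
    while remaining:
        # A snapshot is a leaf when no remaining snapshot names it as parent.
        has_child = {p for p in remaining.values() if p is not None}
        removable = [
            s for s in remaining if s not in has_child and s not in in_use
        ]
        if not removable:
            # No progress this round: everything left is pinned by workloads.
            return False, rounds
        for snapshot in removable:
            del remaining[snapshot]
        rounds += 1
    return True, rounds
```

For a linear chain of three layers with nothing in use, this takes three
rounds. If the leaf itself is in use, the first round removes nothing and
the function reports failure immediately.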
Active workload safety
----------------------
If active workloads still hold nydus snapshots (e.g. during a live
upgrade), no progress is made in a round and cleanup_nydus_snapshots
returns false. Both install_nydus_snapshotter and
uninstall_nydus_snapshotter gate the fs::remove_dir_all on that
return value:
- true → proceed as before: stop service, wipe data dir.
- false → stop service, skip data dir removal, log a warning.
The new nydus instance starts on the existing backend
state; running containers are left intact.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
We've been using `experimental_force_guest_pull`, but now that we have a
containerd release that should work more reliably with the multi
snapshotter setup, we want to give it a try.
Note: We need containerd 2.2.2+.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
With debug/ebpf updates in place, let's bump the kata config version.
Signed-off-by: Agam Dua <agam_dua@apple.com>
Co-authored-by: Eric Ernst <eric_ernst@apple.com>
Add missing terms to the spell check dictionary to fix CI failures
for kernel debug documentation:
- eBPF
- dwarves: Linux package with DWARF/BTF tools (pahole) required for
CONFIG_DEBUG_INFO_BTF kernel option
Also fix the casing of "ebpf" to "eBPF" in the kernel README to match
the official naming convention.
Signed-off-by: Agam Dua <agam_dua@apple.com>
Fix a bug in the debug kernel build where common/ was repeated
after the common path variable, so the debug confs were never
picked up.
Fixing that exposed a second bug: the debug conf was also included
in other builds. This is fixed by creating a separate directory for
debug confs. It currently holds a single file, debug.conf, which
contains the debug and bpf specific configs.
To enable kernel builds with bpf, the dwarves package, which provides
pahole, was added to the kernel dockerfile.
Signed-off-by: Agam Dua <agam_dua@apple.com>
Add the debug kernel to the kata tarball alongside the other kernels.
Also update the kernel README documentation to describe the new debug
kernel build process.
Signed-off-by: Agam Dua <agam_dua@apple.com>
Adds a BPF section in the debug.conf kernel configuration options
to enable eBPF and BTF support for debug kernel builds.
Signed-off-by: Agam Dua <agam_dua@apple.com>
This supersedes https://github.com/kata-containers/kata-containers/pull/12622.
I replaced Zensical with mkdocs-materialx. Materialx is a fork of mkdocs-material
created after mkdocs-material was put into maintenance mode. We'll use this
platform until Zensical is more feature complete.
Added a few of the existing docs into the site to make a more user-friendly flow.
Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
As the NVIDIA stack has shifted to using an image for both the
confidential and non-confidential variants, we retire the initrd
build.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
With the new CDH version, the secure_mount API changes.
Further, the new CDH version no longer uses the luks-encrypt-storage
script; it uses libcryptsetup as well as mkfs.ext4 and dd instead. Hence,
adapt some of the CDH and Kata components build steps.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Moving the genpolicy crate into the root workspace causes the build
outputs to go into the root workspace's target directory, instead of
src/tools/genpolicy/target, invalidating assumptions made by the
kata-deploy-binaries script.
This commit adds a special case for the lookup path of the genpolicy
binary, and fixes two bugs that made identifying this problem harder.
Signed-off-by: Markus Rudy <mr@edgeless.systems>
Let's update the nvidia-container-toolkit to 1.18.1 (from 1.17.6).
From now on, we rely on the version set in the versions.yaml file.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Allow users to override the default RuntimeClass pod overhead for
any shim via shims.<name>.runtimeClass.overhead.{memory,cpu}.
When the field is absent, the existing hardcoded defaults from the
dict are used, so this is fully backward compatible.
Signed-off-by: Zachary Spar <zspar@coreweave.com>
Remove # !confidential from mmio.conf so CONFIG_VIRTIO_MMIO and
CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES are included when building the
unified x86_64/s390x kernel with -x.
Firecracker requires virtio-mmio for block devices; without it the
guest kernel panics (no /dev/vda).
Fixes: #12581
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Allow genpolicy -j to accept a directory instead of a single file.
When given a directory, genpolicy loads genpolicy-settings.json from it
and applies all genpolicy-settings.d/*.json files (sorted by name) as
RFC 6902 JSON Patches. This gives precise control over settings with
explicit operations (add, remove, replace, move, copy, test), including
array index manipulation and assertions.
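The loading order can be sketched as below. This is a minimal
illustration, not genpolicy's implementation: load_settings and
apply_patch are hypothetical names, and only the add, replace, remove,
and test operations of RFC 6902 are covered (move, copy, and the
empty-pointer root path are left out).

```python
import json
from pathlib import Path


def _resolve(doc, pointer):
    """Resolve an RFC 6901 JSON Pointer to (parent container, final key)."""
    tokens = [
        t.replace("~1", "/").replace("~0", "~") for t in pointer.split("/")[1:]
    ]
    for token in tokens[:-1]:
        doc = doc[int(token)] if isinstance(doc, list) else doc[token]
    key = tokens[-1]
    if isinstance(doc, list):
        # "-" addresses the position after the last array element.
        key = len(doc) if key == "-" else int(key)
    return doc, key


def apply_patch(doc, patch):
    """Apply the add/replace/remove/test subset of RFC 6902 in place."""
    for op in patch:
        parent, key = _resolve(doc, op["path"])
        kind = op["op"]
        if kind == "add":
            if isinstance(parent, list):
                parent.insert(key, op["value"])
            else:
                parent[key] = op["value"]
        elif kind == "replace":
            parent[key]  # raises if the target does not exist yet
            parent[key] = op["value"]
        elif kind == "remove":
            del parent[key]
        elif kind == "test":
            if parent[key] != op["value"]:
                raise ValueError(f"test failed at {op['path']}")
        else:
            raise NotImplementedError(kind)
    return doc


def load_settings(directory):
    """Load the base settings, then apply drop-ins sorted by file name."""
    base = Path(directory)
    settings = json.loads((base / "genpolicy-settings.json").read_text())
    for drop_in in sorted((base / "genpolicy-settings.d").glob("*.json")):
        apply_patch(settings, json.loads(drop_in.read_text()))
    return settings
```

Because the drop-ins are applied in name order, a 20-* file sees the
result of every 10-* file and can refine or assert on it with a test
operation.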
Ship composable drop-in examples in drop-in-examples/:
- 10-* files set platform base settings (non-CoCo, AKS, CBL-Mariner)
- 20-* files overlay specific adjustments (OCI version, guest pull)
Users copy the combination they need into genpolicy-settings.d/.
Replace the old adapt_common_policy_settings_* jq-patching functions
in tests_common.sh with install_genpolicy_drop_ins(), which copies the
right combination of 10-* and 20-* drop-ins for the CI scenario.
Tests still generate 99-test-overrides.json on the fly for per-test
request/exec overrides.
Packaging installs 10-* and 20-* drop-ins from drop-in-examples/ into
the tarball; the default genpolicy-settings.d/ is left empty.
Made-with: Cursor
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
kata-deploy restarts the CRI runtime (k3s/containerd) during install,
which can kill the verification job pod or cause transient API server
errors. Bump backoffLimit from 0 to 3 so the job can retry after being
killed, and add a retry loop around kubectl rollout status to handle
transient connection failures.
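The retry idea can be sketched generically as below. This is an
illustration only: the actual change wraps kubectl rollout status, and the
function name, attempt count, and delay here are assumptions.

```python
import time


def retry(action, attempts=5, delay_seconds=10.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts are exhausted.

    Any exception is treated as transient. The last one is re-raised if
    every attempt fails. The sleep function is injectable for testing.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)
    raise last_error
```

A caller would pass a function that runs the rollout status check and
raises on a non-zero exit code.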
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Move the cleanup logic from a preStop lifecycle hook (separate exec)
into the main process's SIGTERM handler. This simplifies the
architecture: the install process now handles its own teardown when
the pod is terminated.
The SIGTERM handler is registered before install begins, and
tokio::select! races install against SIGTERM so cleanup always runs
even if SIGTERM arrives mid-install (e.g. helm uninstall while the
container is restarting after a failed install attempt).
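The race can be sketched with Python's asyncio as an analogue of
tokio::select!. This is not the kata-deploy code: the function names are
hypothetical, and the sketch assumes that the process keeps waiting for
the stop signal after a successful install before cleaning up.

```python
import asyncio


async def run_until_stopped(install, cleanup, stop):
    """Race install() against a stop event; always run cleanup() at the end.

    install and cleanup are coroutine functions; stop is an asyncio.Event
    that a SIGTERM handler would set.
    """
    install_task = asyncio.ensure_future(install())
    stop_task = asyncio.ensure_future(stop.wait())
    await asyncio.wait(
        {install_task, stop_task}, return_when=asyncio.FIRST_COMPLETED
    )
    if not install_task.done():
        # The stop signal arrived mid-install: abandon the install.
        install_task.cancel()
    # Collect the install outcome (result, error, or cancellation)
    # without raising, so cleanup is never skipped.
    await asyncio.gather(install_task, return_exceptions=True)
    # If install finished first, keep waiting for the stop signal.
    await stop_task
    await cleanup()
```

In a real process the event would be wired up before starting the install,
for example with loop.add_signal_handler(signal.SIGTERM, stop.set), so a
signal that arrives early is not lost.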
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Check the rendered containerd config for the versioned drop-in dir import
(config.toml.d or config-v3.toml.d) and bail with a clear error if it is
missing.
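A minimal sketch of such a check is shown below. The function name and the
error text are hypothetical, and the sketch assumes that the imports entry
sits on a single line of the rendered config. The real check may parse the
TOML properly.

```python
DROP_IN_DIRS = ("config.toml.d", "config-v3.toml.d")


def find_drop_in_import(rendered_config):
    """Return the drop-in dir named in the imports line, or raise."""
    for line in rendered_config.splitlines():
        stripped = line.strip()
        if not stripped.startswith("imports"):
            continue
        for drop_in_dir in DROP_IN_DIRS:
            if drop_in_dir in stripped:
                return drop_in_dir
    raise RuntimeError(
        "containerd config does not import a versioned drop-in dir "
        "(expected one of: " + ", ".join(DROP_IN_DIRS) + ")"
    )
```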
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When I added this, I had in mind the period when we still relied on the
SEV module being generated, which we have not done for quite a long time.
This wrong assumption caused the cache to **ALWAYS** fail, increasing
our build time considerably for no reason.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
get_kernel() thinks it knows when it needs to skip sha256sum validation for
RC kernels since sha256sums.asc is not available:
INFO: Config version: 176
INFO: Kernel version: 6.18-rc5
INFO: kernel path does not exist, will download kernel
INFO: Release candidate kernels are not part of the official sha256sums.asc -- skipping sha256sum validation
However, it goes on to check the sha256sum anyway, because the check is
guarded by a -n test on ${rc}, which is true for RC kernels. The sha256sum
should only be checked when ${rc} is NOT set.
This fixes a problem where downloaded RC kernels are always removed
and downloaded again.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
With the containerd v3 config, disable_snapshot_annotations must be set
under the images plugin, not the runtime plugin.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Some Kubernetes distributions, such as k0s, use a kubelet root dir
other than the default /var/lib/kubelet, so ConfigMap and Secret
volume propagation was failing there.
This adds a kubelet_root_dir config option that the Go runtime uses when
matching volume paths; kata-deploy now sets it automatically for k0s
via a drop-in file.
runtime-rs does not need this option: it identifies ConfigMap/Secret,
projected, and downward-api volumes by volume-type path segment
(kubernetes.io~configmap, etc.), not by kubelet root prefix.
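The difference between the two matching strategies can be sketched as
follows. This is an illustration, not the runtime code: the function names
are hypothetical, the prefix check is a simplified stand-in for what the
Go runtime does, and the k0s path in the usage note is an example value.

```python
from pathlib import PurePosixPath

# Volume-type path segments the kubelet uses for these volume kinds.
VOLUME_SEGMENTS = {
    "kubernetes.io~configmap",
    "kubernetes.io~secret",
    "kubernetes.io~projected",
    "kubernetes.io~downward-api",
}


def matches_by_segment(path):
    """Segment-based match: independent of the kubelet root dir."""
    return any(part in VOLUME_SEGMENTS for part in PurePosixPath(path).parts)


def matches_by_prefix(path, kubelet_root_dir="/var/lib/kubelet"):
    """Prefix-based match: needs the correct kubelet root dir."""
    root = PurePosixPath(kubelet_root_dir)
    candidate = PurePosixPath(path)
    under_root = candidate.parts[: len(root.parts)] == root.parts
    return under_root and matches_by_segment(path)
```

For a path such as
/var/lib/k0s/kubelet/pods/<uid>/volumes/kubernetes.io~configmap/cfg, the
segment match succeeds as-is, while the prefix match succeeds only once
kubelet_root_dir is set to /var/lib/k0s/kubelet.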
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When image.reference or kubectlImage.reference already contains a digest
(e.g. quay.io/...@sha256:...), use the reference as-is instead of
appending :tag. This avoids invalid image strings like 'image@sha256:abc:'
when tag is empty and allows users to pin by digest.
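The rule can be sketched as a small function. The chart implements this in
its templates; the Python below is only an illustration, the function name
is hypothetical, and treating any "@" in the reference as a digest marker
is an assumption.

```python
def image_string(reference, tag):
    """Build the image string, leaving digest-pinned references untouched."""
    if "@" in reference:
        # Already pinned by digest, e.g. repo@sha256:...; do not append a tag.
        return reference
    return f"{reference}:{tag}"
```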
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>