containerd uses the proxy plugin root export when reporting CRI image
filesystem paths. Without this export, the CRI plugin falls back to
/var/lib/containerd/io.containerd.snapshotter.v1.<snapshotter>.
For nydus-for-kata-tee this fallback does not match the actual
snapshotter root under /var/lib/nydus-for-kata-tee.
Kubelet/cAdvisor then fails stats collection when it tries to inspect
the nonexistent fallback path.
Export the nydus proxy snapshotter root so containerd reports the real
filesystem path for resource accounting.
When using trusted ephemeral storage or a new ephemeral storage wip
feature for providing plain disks, resource accounting would not kick
in and pods which exhausted their emptyDir sizeLimits would not get
evicted.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
The kata-deploy main image pinned its gcr.io/distroless/static-debian13
base by sha256 digest. distroless does not publish versioned tags, so a
pinned digest just goes stale with no clear upgrade path. Track the
rolling tag instead (guarded with a hadolint DL3007 ignore plus a comment
explaining why), matching the kata-deploy-job-dispatcher image base.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
The verification Job assumed the DaemonSet model: it waited for the
DaemonSet to exist, for its pods, and for `rollout status daemonset/...`,
then required every node in the cluster to be labeled. None of that holds
for deploymentMode: job, where install happens via the dispatcher and the
per-node Jobs it fans out, and only the targeted (worker) nodes get
labeled.
Make the hook mode-aware:
- Hook weight: in job mode the install dispatcher runs as a
post-install hook at weight 5, so verification now runs at weight 10
(after it); daemonset mode keeps weight 0 (the DaemonSet is a normal
resource).
- Readiness wait: in job mode, wait for the install dispatcher Job to
complete and then for the per-node install Jobs
(kata-deploy/stage=install) to finish (with the same CRI-restart
retry logic) instead of a DaemonSet rollout.
- Label check: in job mode, verify exactly the nodes the dispatcher
targeted are labeled, rather than comparing the labeled count against
all nodes in the cluster.
- Grant the verification ClusterRole read access to batch/jobs (used by
the job-mode waits; harmless in daemonset mode).
The daemonset code path is unchanged and the default render (no
verification.pod) is byte-for-byte identical.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Add the uninstall counterpart to the install dispatcher for
deploymentMode: job. On `helm uninstall`, a single pre-delete hook Job
runs the kata-deploy-job-dispatcher, which enumerates the targeted nodes
live and fans out one node-pinned cleanup Job per node that runs the
install pipeline in reverse and exits:
unlabel -> revert-cri (initContainers, run sequentially)
remove-artifacts (main container)
Running as a pre-delete hook means the dispatcher ServiceAccount/RBAC and
the kata-deploy host-mutation RBAC still exist while the Jobs run, so the
unlabel stage retains node get/patch access. revert-cri and
remove-artifacts are host-only operations (privileged nsenter / host
mount) and need no extra cluster RBAC.
Ordering mirrors install in reverse: unlabel first so the scheduler stops
placing kata workloads here, then revert the CRI config + restart the
runtime, then remove the on-host artifacts. Each stage is idempotent and
skips when already undone, so partially-installed nodes and re-runs are
safe.
Uninstall node selection is deliberately SEPARATE from install (a
dedicated job.cleanup.* block) and defaults to every node carrying the
katacontainers.io/kata-runtime label (set by the install label stage)
rather than re-evaluating the install selector. Because the cleanup
dispatcher resolves nodes live when it runs, this stays robust to
install-time selector drift (relabeled nodes, etc.) while remaining fully
overridable via job.cleanup.nodes / job.cleanup.nodeSelector /
job.cleanup.nodeSelectorExpressions. The default (daemonset) mode is
unaffected.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Phase 2 of the DaemonSet -> staged-Job migration: add an opt-in
`deploymentMode: job` that installs Kata via short-lived, per-node
install Jobs instead of the long-running DaemonSet. The DaemonSet remains
the default and is now gated behind `deploymentMode == daemonset`.
Rather than render one Job per node into the Helm release (which grows
the release secret O(nodes) and offers no rollout pacing), job mode ships
a single tiny post-install/post-upgrade hook Job that runs the
kata-deploy-job-dispatcher. The dispatcher enumerates the selected nodes
LIVE from the API server and stamps out one node-pinned install Job per
node from a constant-size ConfigMap of Job templates, keeping at most
`job.parallelism` in flight and refilling as they finish. This guarantees
per-node coverage with a paced rollout while the Helm release stays O(1)
regardless of fleet size. New nodes are picked up by re-running
`helm upgrade`; there is no always-on component.
Each per-node Job runs the staged install pipeline as ordered
initContainers and exits:
host-check -> artifacts -> cri (initContainers, run sequentially)
label (main container)
The privilege split is explicit: the dispatcher pod is a pure
control-plane client (lists nodes, manages Jobs in its own namespace) and
runs fully unprivileged under a dedicated, least-privilege ServiceAccount
(kata-rbac.yaml); only the per-node Jobs it creates carry the privileged
kata-deploy host-mutation rights.
Node selection (templates/_helpers.tpl: nodeLabelSelector / perNodeJob):
- job.nodes: explicit node-name list passed to the dispatcher, and
- job.nodeSelector (equality map) ANDed with
- job.nodeSelectorExpressions (k8s label-selector requirements:
In / NotIn / Exists / DoesNotExist),
compiled into a single label-selector string the dispatcher resolves
live. The default expressions target worker (non-control-plane) nodes, so
no custom node labeling is required; set the expressions to [] to target
all discovered nodes.
Reuses the commonEnv/commonVolume* helpers and adds the stageContainer,
serviceAccountName, dispatcherServiceAccountName, dispatcherImage and
perNodeJob helpers shared by the dispatcher and the staged Jobs. The
default (daemonset) render is unchanged.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Pull the kata-deploy container's environment block and host
volume/volumeMount definitions out of the DaemonSet template into
reusable named templates in _helpers.tpl:
- kata-deploy.commonEnv
- kata-deploy.commonVolumeMounts
- kata-deploy.commonVolumes
These are derived purely from chart values and are independent of the
deployment model, so they can be shared verbatim by upcoming per-node
install/cleanup Jobs without duplicating the (large) env wiring.
Pure refactor: the rendered DaemonSet is byte-for-byte identical to
before (verified via normalized `helm template` diff across default and
multiInstallSuffix/userDropIn/customRuntimes permutations).
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Phase 1 of migrating kata-deploy from a DaemonSet to a staged JobSet
workflow: refactor the binary's install/cleanup flows into discrete,
independently invocable stages while keeping the existing DaemonSet
path fully working.
Add new staged subcommands that each run one step and exit, so a JobSet
can drive them as ordered initContainers/Jobs per node:
install: host-check -> artifacts -> cri -> label
cleanup (reverse): unlabel -> revert-cri -> remove-artifacts
`install` becomes a compatibility wrapper composing the install stages
in the canonical order, so the DaemonSet deployment model is unchanged.
The DaemonSet `cleanup` (with its DaemonSet-presence gating) is left
intact; the staged cleanup actions are added alongside it and skip that
gating since the JobSet workflow only schedules them on a real uninstall.
Each stage has an idempotent skip check so reruns are safe:
- install label / cleanup unlabel: short-circuit via the node label
- cleanup remove-artifacts: skip when the install dir is already gone
- cleanup revert-cri: skip the disruptive runtime restart when the CRI
drop-ins are already absent (new cri_drop_in_present helper)
Introduce a shared KATA_RUNTIME_LABEL constant and add rstest-based
tests covering the subcommand-name -> Action mapping, rejection of
unknown actions, and the visible/hidden help semantics.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Package and ship the dispatcher built in the previous commit so the
job-mode Helm chart has an image to run.
- Dockerfile.components: build kata-deploy and kata-deploy-job-dispatcher
from the same rust-builder stage (one compile), and run fmt/clippy/
test for both crates.
- job-dispatcher/Dockerfile: a minimal distroless/static image containing
only the dispatcher binary and CA certs - it is an API client, so it
needs nothing from the host.
- local-build: kata-deploy-job-dispatcher becomes its own build component
with its own static tarball
(kata-deploy-static-kata-deploy-job-dispatcher.tar.zst); the shared
rust-builder output is reused so the two components do not recompile
the workspace locally. The payload script builds and pushes a separate
"<kata-deploy registry>-job-dispatcher" image with the same tag scheme,
and release.sh publishes its multi-arch manifest symmetrically.
- CI: add kata-deploy-job-dispatcher to the build-kata-deploy-components
matrices (its tarball is picked up by the existing kata-artifacts-*
glob), and gate it in the kata-deploy rust static checks.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Add a small, deployment-agnostic dispatcher binary that runs exactly one
Kubernetes Job per selected node and paces the rollout, so callers get
guaranteed per-node coverage without encoding the fan-out in Helm.
Motivation: templating one Job per node into a Helm release does not
scale (the release Secret hits etcd's 1 MiB limit and hooks run
sequentially), and a single Indexed Job cannot guarantee per-node
coverage when paced - the scheduler ignores completed pods when
evaluating topology spread, so nodes get uneven numbers of pods. A tiny
dispatcher that enumerates nodes live and creates node-pinned Jobs itself
sidesteps both problems and keeps the Helm release O(1) in fleet size.
The dispatcher:
- enumerates target nodes live (explicit --nodes list or
--node-selector label selector), paginating the API;
- stamps out one Job per node from a YAML template, pinning it with
nodeName and an owner label for server-side filtering;
- keeps at most --parallelism Jobs in flight, refilling as they finish,
and sets an OwnerReference to the owner Job so the per-node Jobs are
garbage-collected with it;
- is a plain API client (kube): it never touches the host, so it can
run fully unprivileged.
Node membership is resolved live on each run, not frozen at Helm
template-render time: re-running the dispatcher (e.g. via `helm upgrade`)
picks up nodes added since the last run and skips ones already done, as
the per-node stages are idempotent. The dispatcher is one-shot, however
- it does not watch the API, so nodes added while it is not running are
only covered by the next run.
job.rs holds the pure helpers (node-name sanitization, deterministic Job
naming, template instantiation, status interpretation) with rstest unit
tests; main.rs wires up the CLI and the fan-out loop.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
While checking the content of the vendor tarball artifact in the 3.31.0
release page, I realized that it is lacking most of the rust code and
all the go code. It turns out that the script is badly broken in many
ways :
1. Cargo workspace conflicts: Vendored dependencies were treated as
workspace members, causing "current package believes it's in a
workspace when it's not" errors. Fixed by adding vendor directory
exclusions to root Cargo.toml.
2. Missing Go vendoring: Script only searched for Cargo.lock files,
never processing go.mod files despite having a case statement for
them. Fixed by adding go.mod to the find command with '-o -name go.mod'.
3. Wrong tar execution directory: Script ran tar from release/ directory
but vendor_dir_list contained paths relative to repo root (./vendor,
./src/agent/vendor, etc.), causing "Cannot stat" errors. Fixed by
moving tar command before final popd.
4. Relative tarball path: Since tar now runs from repo root, converted
tarball path to absolute to ensure it's created in the release
directory.
5. Vendored go.mod pollution: Added '-path ./vendor -prune' to find
command to exclude vendor directory, preventing the script from
finding go.mod files inside vendored Rust dependencies.
The fixes are simple enough they can be squashed into a single
commit.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Remove the Go runtime file_mem_backend and valid_file_mem_backends
config knobs, along with the corresponding sandbox annotation handling.
The runtime still enables file-backed shared memory automatically for
virtio-fs by using /dev/shm as the backing directory. This only removes
the user-selectable backend path.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
publish-kata-deploy-payload got renamed in #13107, which broke the CI.
Now, instead of tracking all those intermediate steps, let's make sure
we only track the tests themselves.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add a disabled-by-default kata-monitor DaemonSet to the kata-deploy Helm chart,
including image/configuration values so operators can enable monitor shipping as
part of the same deployment workflow when needed.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
Build kata-monitor images by extracting the binary from the
shim-v2-go tarball and shipping it on top of
gcr.io/distroless/static-debian13.
Because the binary is built inside an Ubuntu (glibc) toolchain it
cannot run on a pure musl/alpine base — users hit __fprintf_chk /
__vfprintf_chk relocation errors. To get a small, distroless
runtime image we use the same pattern as
tools/packaging/kata-deploy/Dockerfile: copy the glibc libraries
the binary needs (plus the dynamic linker) via ldd from a glibc
base image.
In order to do so, we also added a helper script to build and
publish architecture-specific monitor images from tarball
artifacts.
Reported-by: Steve Linde <stevenlinde@google.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
Make the chart pass --log-level debug automatically when debug=true so
CI and troubleshooting runs emit full rendered config dumps without
requiring a separate log-level override.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <noreply@cursor.com>
Allow operators to force kata-deploy log verbosity and emit the fully
rendered containerd/CRI-O config and drop-in files in debug mode so
install troubleshooting can rely on exact effective configuration.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <noreply@cursor.com>
The containerd_version matrix values were renamed from lts/active to
minimum/latest, which changes the generated CI job names. Update the
required-tests list so the gatekeeper waits on the checks that are
actually produced.
The amd64 run-containerd-stability, run-nydus, run-cri-containerd and
free-runner run-k8s-tests jobs map lts -> minimum and active -> latest.
The s390x cri-containerd job maps active -> latest, matching its
updated matrix.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <noreply@cursor.com>
Allow operators to provide per-shim drop-in TOML for built-in runtimes
and reconcile stale override files so upgrades and migrations remain
safe when drop-ins are added or removed.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Codex
Default custom runtime RuntimeClass overhead.podFixed to the selected
baseConfig values, so equivalent runtimes behave consistently without
repeating boilerplate.
In case the user wants to enforce that no overhead is set on the custom
RuntimeClass, disable inheritance with inheritBaseOverhead=false.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
For containerd v2.2+, the flow assumes that the imports directive would be present.
It is better to check it and add if it doesn't exist.
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
The kata-monitor test is currently failing and is running a very EoL
version of cri-o. This area is being actively reworked in #13107,
so remove this and then once kata-monitor tests are stable we
can re-add the new versions
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Bump the go version to resolve CVEs:
- GO-2026-5037
- GO-2026-5038
- GO-2026-5039
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
Since the conf.d migration (containerd >= 2.2.0), kata-deploy writes its
drop-in to the auto-imported /etc/containerd/conf.d/ and no longer manages
the main config's `imports` array. A node upgraded from a pre-conf.d
kata-deploy keeps the legacy `{dest_dir}/containerd/config.d/kata-deploy.toml`
entry in `imports`, since the new code neither adds nor removes it.
On uninstall, remove_artifacts() deletes the artifacts dir (including the
file that import still points at) and then restarts containerd, which fails
to load the now-dangling import and wedges the node: pods get stuck
Terminating and new pods cannot start. This broke the lifecycle-manager E2E
tests (TC-02..TC-07) which repeatedly upgrade then reinstall across the
3.30.0 -> latest version boundary.
Defensively scrub the legacy import from the main containerd config in both
configure_containerd (at conf.d migration time) and cleanup_containerd
(before artifacts are removed and containerd is restarted). The helper is a
no-op when the config is absent, has no `imports` array, or does not contain
the legacy entry.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Enables automations to determine version with a simple read RBAC
on the runtime class. Helpful when versions need to match with other
tools (e.g. genpolicy) or when simple version determination is needed
for other reasons.
Fixes#13123
Signed-off-by: Joji Mekkattuparamban <jojim@nvidia.com>
Mirror the CI payload publish flow in local builds, including image and
helm chart publishing, while reusing the same chart upload helper in
payload-after-push to avoid duplicated chart packaging logic.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Build and package kata-deploy binary and nydus snapshotter component
tarballs as part of nvgpu-tarball so local publish can consume a single
kata-static.tar.zst without rebuilding extra artifacts.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
- Add FAKE_SE_IMAGE mode support in SE image build scripts for CI without real SE setup
- Simplify workflow by removing build-asset-boot-image-se job
- Integrate fake-boot-image-se into build matrix instead of separate job
- Skip attestation for fake-boot-image-se builds
- Update qemu-se and qemu-se-runtime-rs shim components to use:
- rootfs-initrd-confidential instead of rootfs-image-confidential
- boot-image-se component
This change streamlines the s390x SE build process and makes it easier
to test without requiring actual Secure Execution infrastructure.
This fixes deployment issues on non-TEE systems where TEE-specific artifacts
(like boot-image-se for IBM SEL) are not included in the kata-deploy image,
while ensuring TEE systems still get all required components.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Drop cloud-hypervisor-glibc from local and CI kata-deploy build targets
now that Azure CLH uses the standard cloud-hypervisor artifact set.
This removes obsolete build matrix entries and installer target
handling.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Switch AKS Mariner matrix entries to clh-azure handlers and remove the
temporary host-OS based helm value overrides.
Update integration test wiring and required test labels so CI tracks the
new runtime names.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add clh-azure and clh-azure-runtime-rs as first-class shims across
installer logic, helm defaults, runtimeclass overhead mapping, and shim
component catalogs.
This aligns deploy payload selection with the new native Azure-specific
CLH configs.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>