Update the two affected entries in required-tests.yaml accordingly
so the gatekeeper keeps matching them instead of blocking subsequent
PRs after this one merges.
Co-authored-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
kata-deploy now reads EROFS_SNAPSHOTTER_MODE and EROFS_DMVERITY from
the environment to set dmverity_mode and enable_dmverity in the
containerd erofs snapshotter/differ config.
Add validation for the mode value and use an explicit 300s timeout
for node-readiness checks during kata-deploy in GitHub CI.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Install libdevmapper-dev and pkg-config in the agent build container
so devicemapper-sys can link against libdevmapper. Add the GNU libc
rustup target alongside musl since USE_DEVMAPPER forces LIBC=gnu.
Forward USE_DEVMAPPER through build.sh and build-static-agent.sh.
To build kata-agent with device mapper support:
```
$ make LIBC=gnu USE_DEVMAPPER=yes
```
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Pass USE_DEVMAPPER through the Docker environment in the local build
scripts. Extract the OCI tag sanitization logic into a public helper,
sanitize_tag_component, to keep the push and pull paths consistent.
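
For illustration, a tag sanitizer along these lines could look like the sketch below. It is written in Python for readability and is not the helper from the build scripts; the replacement character and the handling of a leading '.' or '-' are assumptions. Only the tag grammar, `[a-zA-Z0-9_][a-zA-Z0-9._-]{0,127}`, comes from the OCI distribution spec.

```python
import re

# Characters that the OCI tag grammar does not allow anywhere in a tag.
_INVALID_CHARS = re.compile(r"[^A-Za-z0-9_.-]")
_MAX_TAG_LEN = 128


def sanitize_tag_component(raw: str) -> str:
    """Map an arbitrary string (e.g. a branch name) to a valid OCI tag.

    Assumed behaviour: invalid characters become '-', and an empty result
    or a leading '.' / '-' is prefixed with '_' because a tag must start
    with a word character.
    """
    tag = _INVALID_CHARS.sub("-", raw)
    if not tag or tag[0] in ".-":
        tag = "_" + tag
    return tag[:_MAX_TAG_LEN]
```

Using one function on both the push and the pull side is what keeps the two paths from computing different tags for the same input.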
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Expose erofsSnapshotterMode in the helm chart values and render it as
the EROFS_SNAPSHOTTER_MODE environment variable in the kata-deploy pod.
Update gha-run-k8s-common.sh to load dm-mod/dm-verity kernel modules
and configure the erofs default size when the mode is "integrity".
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
With sandbox_cgroup_only the shim, QEMU and virtiofsd run inside the
pod's memory cgroup, whose limit is the workload limit plus the
RuntimeClass pod overhead. On aarch64 the VMM host footprint is much
larger than on x86 (QEMU's own anon RSS is ~160Mi+ before any guest
RAM, on top of the shmem-backed guest memory), so the 160Mi overhead
is too small: small-memory-limit pods get their qemu-system process
OOM-killed by the pod cgroup (CONSTRAINT_MEMCG), and the agent vsock
never comes up (ENODEV), so the sandbox fails to start.
Raise the pod overhead to 320Mi for the qemu shims that run on
aarch64 (qemu, qemu-runtime-rs, qemu-coco-dev-runtime-rs). The value
is applied on all architectures for simplicity; x86 is over-provisioned
by ~160Mi, which is acceptable. The TEE/GPU shims already carry far
larger overhead and amd64-only shims (clh*, dragonball, fc) are
unaffected.
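
The budget can be checked with simple arithmetic. The sketch below is illustrative only: the 64Mi workload limit is a made-up example, the 160Mi VMM figure is the lower bound quoted above, and the shim and virtiofsd footprints are ignored.

```python
def guest_memory_budget_mi(workload_limit_mi: int, pod_overhead_mi: int,
                           vmm_rss_mi: int = 160) -> int:
    """MiB left in the pod cgroup after charging QEMU's own anon RSS.

    The pod cgroup limit is the workload limit plus the RuntimeClass
    pod overhead; everything the VMM side uses is charged against it.
    """
    cgroup_limit_mi = workload_limit_mi + pod_overhead_mi
    return cgroup_limit_mi - vmm_rss_mi


# Hypothetical small-memory pod with a 64Mi limit.
old_budget = guest_memory_budget_mi(64, 160)  # overhead before this change
new_budget = guest_memory_budget_mi(64, 320)  # overhead after this change
```

With the old overhead, QEMU's own footprint consumes the entire overhead, so anything else the VMM side needs comes out of the workload's share.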
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
This skill will inform AI agents how to properly write and format
docs in the new docs system. There is nothing too fancy, just reminding
agents to use mkdocs-materialx features instead of treating the
markdown like the legacy GitHub-based format.
Signed-off-by: LandonTClipp <lclipp@coreweave.com>
Add k8s-vm-templating-test.bats which exercises pod create
with the factory initialized on the target node.
Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
Kata sizes VM CPU and memory from OCI limits, not Kubernetes resource
requests. Requests are consumed by the Kubernetes control plane, but
they do not drive Kata VM or sandbox sizing today.
Convert the straightforward Kata workload manifests and kata-deploy
examples from resource requests to limits so the declared resources
match the values Kata uses for VM provisioning. Keep requests where the
fixture intentionally validates Kubernetes request/limit behavior.
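
As a rough sketch of the conversion applied to each container spec (Python used for illustration; the function name is hypothetical, and keeping an existing limit when both are present is an assumption):

```python
import copy


def requests_to_limits(container: dict) -> dict:
    """Return a copy of a container spec with resource requests moved to limits.

    Existing limits win over requests for the same resource name.
    The input dict is left untouched.
    """
    out = copy.deepcopy(container)
    resources = out.get("resources", {})
    requests = resources.pop("requests", None)
    if requests:
        limits = resources.setdefault("limits", {})
        for name, quantity in requests.items():
            limits.setdefault(name, quantity)
    return out
```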
Update fixture expectations affected by the conversion. The LimitRange
fixture is limit-only at 500m.
Raise the policy deployment limits to 500m and 800Mi. These tests boot
CoCo/runtime-rs sandboxes with policy/initdata, and the former
100m/100Mi values became real runtime limits after the conversion,
which is too constrained for the CI environments.
Leave PVC storage requests, explicit request/limit validation fixtures,
the env resourceFieldRef request, and non-Kata workload examples
unchanged where requests are handled outside the Kata shim resource
sizing path.
If Kata later grows request-aware sandbox sizing, for example through
Sandbox API based resource plumbing, these requests can be reintroduced
where they carry the intended semantics.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
The shared _publish_multiarch_manifest() helper always derived a
"-job-dispatcher" registry from the registries it was given. However, the
dispatcher is a kata-deploy-specific sidecar image, so when the helper
was reused to publish the kata-monitor multi-arch manifest it wrongly
tried to push a non-existent kata-monitor-job-dispatcher image.
Let's gate the dispatcher derivation behind
KATA_DEPLOY_PUBLISH_JOB_DISPATCHER (defaulting to true so the
kata-deploy path is unchanged) and opt out of it when publishing the
kata-monitor manifest.
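
The gating logic amounts to the sketch below. It is a Python illustration rather than the shell helper itself; the function name and the accepted spelling of "true" are assumptions.

```python
import os


def manifest_targets(registries, env=None):
    """List the image names a multi-arch manifest should be published for.

    The "-job-dispatcher" companion images are only derived when
    KATA_DEPLOY_PUBLISH_JOB_DISPATCHER is "true" (the default).
    """
    env = os.environ if env is None else env
    flag = env.get("KATA_DEPLOY_PUBLISH_JOB_DISPATCHER", "true")
    targets = list(registries)
    if flag.strip().lower() == "true":
        targets += [f"{registry}-job-dispatcher" for registry in registries]
    return targets
```

A kata-monitor caller would set the variable to "false" so that only the registries it passed in are published.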
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Replace `type=local` with `type=tar` in kata-deploy build to reduce
export time and avoid build hangs during the export-to-client-directory
phase.
Update callers to extract binaries directly from the tar archive instead
of copying from an intermediate directory.
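
Pulling one binary straight out of the exported archive can be sketched as follows. This is a Python illustration, not the caller code; the function name and any member path passed to it are hypothetical.

```python
import tarfile
from pathlib import Path


def extract_member(archive: str, member: str, dest_dir: str) -> Path:
    """Copy a single regular file out of a tar archive into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar:
        info = tar.getmember(member)  # raises KeyError if the member is absent
        src = tar.extractfile(info)
        if src is None:
            raise ValueError(f"{member} is not a regular file")
        target = dest / Path(member).name
        target.write_bytes(src.read())
        # Preserve the permission bits so binaries stay executable.
        target.chmod(info.mode & 0o777)
    return target
```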
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Switch the NVIDIA GPU example values file to install Kata via the
Job-based deployment mode (deploymentMode: job) instead of the
always-on, privileged DaemonSet, so that nothing keeps running on the
node once the install completes.
To exercise this in our CI, make the helm_helper aware of the deployment
mode coming from the (base) values file:
- In "job" mode, clear job.nodeSelectorExpressions so the dispatcher
targets every discovered node. Our CI clusters are typically
single-node, where the only node carries the control-plane label,
and the default selector excludes control-plane/master nodes.
- There is no always-on DaemonSet to wait on in "job" mode. The
dispatcher runs as a blocking post-install hook and the final
per-node stage labels the node, so wait until at least one node
carries the katacontainers.io/kata-runtime label as the
"install complete" signal (dumping Job/pod logs on timeout).
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
containerd uses the proxy plugin root export when reporting CRI image
filesystem paths. Without this export, the CRI plugin falls back to
/var/lib/containerd/io.containerd.snapshotter.v1.<snapshotter>.
For nydus-for-kata-tee this fallback does not match the actual
snapshotter root under /var/lib/nydus-for-kata-tee.
Kubelet/cAdvisor then fails stats collection when it tries to inspect
the nonexistent fallback path.
Export the nydus proxy snapshotter root so containerd reports the real
filesystem path for resource accounting.
Previously, with trusted ephemeral storage or the new work-in-progress
ephemeral storage feature that provides plain disks, resource accounting
did not kick in, so pods that exhausted their emptyDir sizeLimits were
not evicted.
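
A simplified model of the path lookup described above, written in Python for illustration. It is not containerd's code, and the function and parameter names are hypothetical.

```python
def cri_image_fs_path(snapshotter, exported_root=None,
                      containerd_root="/var/lib/containerd"):
    """Return the path reported for a snapshotter's image filesystem.

    When the proxy plugin exports its root, that root is used; otherwise
    the default location under the containerd root is assumed.
    """
    if exported_root:
        return exported_root
    return f"{containerd_root}/io.containerd.snapshotter.v1.{snapshotter}"
```

Without the export, the returned path for nydus-for-kata-tee points at a directory that does not exist, which is what made stats collection fail.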
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
The kata-deploy main image pinned its gcr.io/distroless/static-debian13
base by sha256 digest. distroless does not publish versioned tags, so a
pinned digest just goes stale with no clear upgrade path. Track the
rolling tag instead (guarded with a hadolint DL3007 ignore plus a comment
explaining why), matching the kata-deploy-job-dispatcher image base.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
The verification Job assumed the DaemonSet model: it waited for the
DaemonSet to exist, for its pods, and for `rollout status daemonset/...`,
then required every node in the cluster to be labeled. None of that holds
for deploymentMode: job, where install happens via the dispatcher and the
per-node Jobs it fans out, and only the targeted (worker) nodes get
labeled.
Make the hook mode-aware:
- Hook weight: in job mode the install dispatcher runs as a
post-install hook at weight 5, so verification now runs at weight 10
(after it); daemonset mode keeps weight 0 (the DaemonSet is a normal
resource).
- Readiness wait: in job mode, wait for the install dispatcher Job to
complete and then for the per-node install Jobs
(kata-deploy/stage=install) to finish (with the same CRI-restart
retry logic) instead of a DaemonSet rollout.
- Label check: in job mode, verify exactly the nodes the dispatcher
targeted are labeled, rather than comparing the labeled count against
all nodes in the cluster.
- Grant the verification ClusterRole read access to batch/jobs (used by
the job-mode waits; harmless in daemonset mode).
The daemonset code path is unchanged and the default render (no
verification.pod) is byte-for-byte identical.
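
The two mode-dependent decisions can be summarised in a short sketch. It is a Python illustration of the rules listed above, not the chart template or the hook script; the function names are hypothetical.

```python
def verification_hook_weight(mode: str) -> int:
    """Hook weight for the verification Job in each deployment mode."""
    if mode == "job":
        return 10  # after the install dispatcher hook, which runs at weight 5
    if mode == "daemonset":
        return 0   # the DaemonSet is a normal resource, not a hook
    raise ValueError(f"unknown deploymentMode: {mode}")


def labels_ok(mode, labeled, targeted, all_nodes) -> bool:
    """Check node labels the way each mode expects."""
    if mode == "job":
        # Exactly the targeted nodes must carry the label.
        return set(labeled) == set(targeted)
    # daemonset mode: compare the labeled count against every node.
    return len(set(labeled)) == len(set(all_nodes))
```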
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Add the uninstall counterpart to the install dispatcher for
deploymentMode: job. On `helm uninstall`, a single pre-delete hook Job
runs the kata-deploy-job-dispatcher, which enumerates the targeted nodes
live and fans out one node-pinned cleanup Job per node that runs the
install pipeline in reverse and exits:
unlabel -> revert-cri (initContainers, run sequentially)
remove-artifacts (main container)
Running as a pre-delete hook means the dispatcher ServiceAccount/RBAC and
the kata-deploy host-mutation RBAC still exist while the Jobs run, so the
unlabel stage retains node get/patch access. revert-cri and
remove-artifacts are host-only operations (privileged nsenter / host
mount) and need no extra cluster RBAC.
Ordering mirrors install in reverse: unlabel first so the scheduler stops
placing kata workloads here, then revert the CRI config + restart the
runtime, then remove the on-host artifacts. Each stage is idempotent and
skips when already undone, so partially-installed nodes and re-runs are
safe.
Uninstall node selection is deliberately SEPARATE from install (a
dedicated job.cleanup.* block) and defaults to every node carrying the
katacontainers.io/kata-runtime label (set by the install label stage)
rather than re-evaluating the install selector. Because the cleanup
dispatcher resolves nodes live when it runs, this stays robust to
install-time selector drift (relabeled nodes, etc.) while remaining fully
overridable via job.cleanup.nodes / job.cleanup.nodeSelector /
job.cleanup.nodeSelectorExpressions. The default (daemonset) mode is
unaffected.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Phase 2 of the DaemonSet -> staged-Job migration: add an opt-in
`deploymentMode: job` that installs Kata via short-lived, per-node
install Jobs instead of the long-running DaemonSet. The DaemonSet remains
the default and is now gated behind `deploymentMode == daemonset`.
Rather than render one Job per node into the Helm release (which grows
the release secret O(nodes) and offers no rollout pacing), job mode ships
a single tiny post-install/post-upgrade hook Job that runs the
kata-deploy-job-dispatcher. The dispatcher enumerates the selected nodes
LIVE from the API server and stamps out one node-pinned install Job per
node from a constant-size ConfigMap of Job templates, keeping at most
`job.parallelism` in flight and refilling as they finish. This guarantees
per-node coverage with a paced rollout while the Helm release stays O(1)
regardless of fleet size. New nodes are picked up by re-running
`helm upgrade`; there is no always-on component.
Each per-node Job runs the staged install pipeline as ordered
initContainers and exits:
host-check -> artifacts -> cri (initContainers, run sequentially)
label (main container)
The privilege split is explicit: the dispatcher pod is a pure
control-plane client (lists nodes, manages Jobs in its own namespace) and
runs fully unprivileged under a dedicated, least-privilege ServiceAccount
(kata-rbac.yaml); only the per-node Jobs it creates carry the privileged
kata-deploy host-mutation rights.
Node selection (templates/_helpers.tpl: nodeLabelSelector / perNodeJob):
- job.nodes: explicit node-name list passed to the dispatcher, and
- job.nodeSelector (equality map) ANDed with
- job.nodeSelectorExpressions (k8s label-selector requirements:
In / NotIn / Exists / DoesNotExist),
compiled into a single label-selector string the dispatcher resolves
live. The default expressions target worker (non-control-plane) nodes, so
no custom node labeling is required; set the expressions to [] to target
all discovered nodes.
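
As a sketch of how the equality map and the expressions combine into one selector string (Python for illustration; the chart does this in a Helm template, and the function name and example labels are hypothetical). The output uses the standard Kubernetes label-selector syntax.

```python
def compile_label_selector(node_selector: dict, expressions: list) -> str:
    """AND an equality map with label-selector requirements."""
    parts = [f"{key}={value}" for key, value in sorted(node_selector.items())]
    for expr in expressions:
        key, op = expr["key"], expr["operator"]
        values = ",".join(expr.get("values", []))
        if op == "In":
            parts.append(f"{key} in ({values})")
        elif op == "NotIn":
            parts.append(f"{key} notin ({values})")
        elif op == "Exists":
            parts.append(key)
        elif op == "DoesNotExist":
            parts.append(f"!{key}")
        else:
            raise ValueError(f"unsupported operator: {op}")
    # A comma between requirements means logical AND.
    return ",".join(parts)
```

Passing an empty map and an empty expression list yields an empty selector, which matches every node.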
Reuses the commonEnv/commonVolume* helpers and adds the stageContainer,
serviceAccountName, dispatcherServiceAccountName, dispatcherImage and
perNodeJob helpers shared by the dispatcher and the staged Jobs. The
default (daemonset) render is unchanged.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Pull the kata-deploy container's environment block and host
volume/volumeMount definitions out of the DaemonSet template into
reusable named templates in _helpers.tpl:
- kata-deploy.commonEnv
- kata-deploy.commonVolumeMounts
- kata-deploy.commonVolumes
These are derived purely from chart values and are independent of the
deployment model, so they can be shared verbatim by upcoming per-node
install/cleanup Jobs without duplicating the (large) env wiring.
Pure refactor: the rendered DaemonSet is byte-for-byte identical to
before (verified via normalized `helm template` diff across default and
multiInstallSuffix/userDropIn/customRuntimes permutations).
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Phase 1 of migrating kata-deploy from a DaemonSet to a staged JobSet
workflow: refactor the binary's install/cleanup flows into discrete,
independently invocable stages while keeping the existing DaemonSet
path fully working.
Add new staged subcommands that each run one step and exit, so a JobSet
can drive them as ordered initContainers/Jobs per node:
install: host-check -> artifacts -> cri -> label
cleanup (reverse): unlabel -> revert-cri -> remove-artifacts
`install` becomes a compatibility wrapper composing the install stages
in the canonical order, so the DaemonSet deployment model is unchanged.
The DaemonSet `cleanup` (with its DaemonSet-presence gating) is left
intact; the staged cleanup actions are added alongside it and skip that
gating since the JobSet workflow only schedules them on a real uninstall.
Each stage has an idempotent skip check so reruns are safe:
- install label / cleanup unlabel: short-circuit via the node label
- cleanup remove-artifacts: skip when the install dir is already gone
- cleanup revert-cri: skip the disruptive runtime restart when the CRI
drop-ins are already absent (new cri_drop_in_present helper)
Introduce a shared KATA_RUNTIME_LABEL constant and add rstest-based
tests covering the subcommand-name -> Action mapping, rejection of
unknown actions, and the visible/hidden help semantics.
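
The stage ordering and the skip checks follow the pattern sketched below. This is a Python illustration of the idea, not the Rust implementation; `already_done` and `run` are hypothetical callbacks.

```python
INSTALL_STAGES = ["host-check", "artifacts", "cri", "label"]
CLEANUP_STAGES = ["unlabel", "revert-cri", "remove-artifacts"]


def run_stages(stages, already_done, run):
    """Run stages in order, skipping any whose work is already done.

    Returns the names of the stages that actually executed, so a rerun
    on a fully processed node returns an empty list.
    """
    executed = []
    for stage in stages:
        if already_done(stage):
            continue
        run(stage)
        executed.append(stage)
    return executed
```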
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Package and ship the dispatcher built in the previous commit so the
job-mode Helm chart has an image to run.
- Dockerfile.components: build kata-deploy and kata-deploy-job-dispatcher
from the same rust-builder stage (one compile), and run fmt/clippy/
test for both crates.
- job-dispatcher/Dockerfile: a minimal distroless/static image containing
only the dispatcher binary and CA certs - it is an API client, so it
needs nothing from the host.
- local-build: kata-deploy-job-dispatcher becomes its own build component
with its own static tarball
(kata-deploy-static-kata-deploy-job-dispatcher.tar.zst); the shared
rust-builder output is reused so the two components do not recompile
the workspace locally. The payload script builds and pushes a separate
"<kata-deploy registry>-job-dispatcher" image with the same tag scheme,
and release.sh publishes its multi-arch manifest symmetrically.
- CI: add kata-deploy-job-dispatcher to the build-kata-deploy-components
matrices (its tarball is picked up by the existing kata-artifacts-*
glob), and gate it in the kata-deploy rust static checks.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Add a small, deployment-agnostic dispatcher binary that runs exactly one
Kubernetes Job per selected node and paces the rollout, so callers get
guaranteed per-node coverage without encoding the fan-out in Helm.
Motivation: templating one Job per node into a Helm release does not
scale (the release Secret hits etcd's 1 MiB limit and hooks run
sequentially), and a single Indexed Job cannot guarantee per-node
coverage when paced - the scheduler ignores completed pods when
evaluating topology spread, so nodes get uneven numbers of pods. A tiny
dispatcher that enumerates nodes live and creates node-pinned Jobs itself
sidesteps both problems and keeps the Helm release O(1) in fleet size.
The dispatcher:
- enumerates target nodes live (explicit --nodes list or
--node-selector label selector), paginating the API;
- stamps out one Job per node from a YAML template, pinning it with
nodeName and an owner label for server-side filtering;
- keeps at most --parallelism Jobs in flight, refilling as they finish,
and sets an OwnerReference to the owner Job so the per-node Jobs are
garbage-collected with it;
- is a plain API client (kube): it never touches the host, so it can
run fully unprivileged.
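
The paced fan-out can be sketched as a bounded in-flight set. The Python below only illustrates the loop shape; the real dispatcher is written in Rust on top of kube, and `create_job` and `poll_finished` are hypothetical stand-ins for the Kubernetes API calls.

```python
from collections import deque


def fan_out(nodes, parallelism, create_job, poll_finished):
    """Create one Job per node, keeping at most `parallelism` in flight.

    poll_finished(in_flight) must return the set of nodes whose Jobs
    have finished since the last call; a real loop would wait between
    polls instead of spinning.
    """
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    pending = deque(nodes)
    in_flight = set()
    done = []
    while pending or in_flight:
        # Refill free slots before waiting on running Jobs.
        while pending and len(in_flight) < parallelism:
            node = pending.popleft()
            create_job(node)
            in_flight.add(node)
        finished = set(poll_finished(frozenset(in_flight)))
        in_flight -= finished
        done.extend(sorted(finished))
    return done
```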
Node membership is resolved live on each run, not frozen at Helm
template-render time: re-running the dispatcher (e.g. via `helm upgrade`)
picks up nodes added since the last run and skips ones already done, as
the per-node stages are idempotent. The dispatcher is one-shot, however:
it does not watch the API, so nodes added while it is not running are
only covered by the next run.
job.rs holds the pure helpers (node-name sanitization, deterministic Job
naming, template instantiation, status interpretation) with rstest unit
tests; main.rs wires up the CLI and the fan-out loop.
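
One way to get deterministic, name-safe Job names from node names is sketched below. This Python code is purely illustrative and does not reproduce the scheme in job.rs; the hash suffix, the 63-character limit, and the prefix are all assumptions.

```python
import hashlib
import re


def job_name_for_node(prefix: str, node_name: str, max_len: int = 63) -> str:
    """Derive a stable, DNS-label-safe Job name for a node.

    The same node name always maps to the same Job name, and a short
    hash of the original name keeps distinct nodes from colliding after
    sanitization or truncation.
    """
    slug = re.sub(r"[^a-z0-9-]", "-", node_name.lower()).strip("-")
    digest = hashlib.sha256(node_name.encode()).hexdigest()[:8]
    room = max_len - len(prefix) - len(digest) - 2  # two '-' separators
    if room < 1:
        raise ValueError("prefix too long for the name limit")
    slug = slug[:room].strip("-")
    return f"{prefix}-{slug}-{digest}" if slug else f"{prefix}-{digest}"
```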
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
While checking the content of the vendor tarball artifact on the 3.31.0
release page, I realized that it lacks most of the Rust code and
all of the Go code. It turns out that the script is badly broken in
several ways:
1. Cargo workspace conflicts: Vendored dependencies were treated as
workspace members, causing "current package believes it's in a
workspace when it's not" errors. Fixed by adding vendor directory
exclusions to root Cargo.toml.
2. Missing Go vendoring: Script only searched for Cargo.lock files,
never processing go.mod files despite having a case statement for
them. Fixed by adding go.mod to the find command with '-o -name go.mod'.
3. Wrong tar execution directory: Script ran tar from release/ directory
but vendor_dir_list contained paths relative to repo root (./vendor,
./src/agent/vendor, etc.), causing "Cannot stat" errors. Fixed by
moving tar command before final popd.
4. Relative tarball path: Since tar now runs from repo root, converted
tarball path to absolute to ensure it's created in the release
directory.
5. Vendored go.mod pollution: Added '-path ./vendor -prune' to find
command to exclude vendor directory, preventing the script from
finding go.mod files inside vendored Rust dependencies.
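
The search that fixes 2 and 5 describe behaves roughly like the sketch below. It is a Python rendering of the find semantics for illustration, not the release script; the function name is hypothetical.

```python
import os


def find_manifests(repo_root: str) -> list:
    """Find Cargo.lock and go.mod files, skipping the top-level vendor dir.

    Mirrors `find . -path ./vendor -prune -o -name Cargo.lock -o -name go.mod`:
    only ./vendor at the repo root is pruned.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        if os.path.abspath(dirpath) == os.path.abspath(repo_root):
            # Removing the entry in place stops os.walk from descending into it.
            dirnames[:] = [d for d in dirnames if d != "vendor"]
        for name in filenames:
            if name in ("Cargo.lock", "go.mod"):
                rel = os.path.relpath(os.path.join(dirpath, name), repo_root)
                found.append(rel.replace(os.sep, "/"))
    return sorted(found)
```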
The fixes are simple enough that they can be squashed into a single
commit.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Remove the Go runtime file_mem_backend and valid_file_mem_backends
config knobs, along with the corresponding sandbox annotation handling.
The runtime still enables file-backed shared memory automatically for
virtio-fs by using /dev/shm as the backing directory. This only removes
the user-selectable backend path.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
publish-kata-deploy-payload got renamed in #13107, which broke the CI.
Now, instead of tracking all those intermediate steps, let's make sure
we only track the tests themselves.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add a disabled-by-default kata-monitor DaemonSet to the kata-deploy Helm chart,
including image/configuration values so operators can enable monitor shipping as
part of the same deployment workflow when needed.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
Build kata-monitor images by extracting the binary from the
shim-v2-go tarball and shipping it on top of
gcr.io/distroless/static-debian13.
Because the binary is built inside an Ubuntu (glibc) toolchain it
cannot run on a pure musl/alpine base — users hit __fprintf_chk /
__vfprintf_chk relocation errors. To get a small, distroless
runtime image we use the same pattern as
tools/packaging/kata-deploy/Dockerfile: copy the glibc libraries
the binary needs (plus the dynamic linker) via ldd from a glibc
base image.
In order to do so, we also added a helper script to build and
publish architecture-specific monitor images from tarball
artifacts.
Reported-by: Steve Linde <stevenlinde@google.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
Make the chart pass --log-level debug automatically when debug=true so
CI and troubleshooting runs emit full rendered config dumps without
requiring a separate log-level override.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <noreply@cursor.com>
Allow operators to force kata-deploy log verbosity and emit the fully
rendered containerd/CRI-O config and drop-in files in debug mode so
install troubleshooting can rely on exact effective configuration.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <noreply@cursor.com>
The containerd_version matrix values were renamed from lts/active to
minimum/latest, which changes the generated CI job names. Update the
required-tests list so the gatekeeper waits on the checks that are
actually produced.
The amd64 run-containerd-stability, run-nydus, run-cri-containerd and
free-runner run-k8s-tests jobs map lts -> minimum and active -> latest.
The s390x cri-containerd job maps active -> latest, matching its
updated matrix.
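
The rename is a plain value substitution inside each job name, as in the sketch below. Python is used for illustration only, and the job-name strings in the usage example are hypothetical, not copied from required-tests.yaml.

```python
import re

RENAMES = {"lts": "minimum", "active": "latest"}


def rename_matrix_value(job_name: str) -> str:
    """Replace the old containerd_version values with the new ones.

    Only whole words are replaced, so other parts of the job name that
    merely contain these letters are left alone.
    """
    return re.sub(r"\b(lts|active)\b", lambda m: RENAMES[m.group(1)], job_name)


# Hypothetical job names, for illustration only.
examples = [rename_matrix_value(n) for n in
            ["run-nydus (lts, qemu)", "run-cri-containerd (active, clh)"]]
```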
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <noreply@cursor.com>
Allow operators to provide per-shim drop-in TOML for built-in runtimes
and reconcile stale override files so upgrades and migrations remain
safe when drop-ins are added or removed.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Codex
Default custom runtime RuntimeClass overhead.podFixed to the selected
baseConfig values, so equivalent runtimes behave consistently without
repeating boilerplate.
In case the user wants to enforce that no overhead is set on the custom
RuntimeClass, disable inheritance with inheritBaseOverhead=false.
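
The resolution order can be sketched as follows. This is a Python illustration, not the chart template; the `overhead` key, the shape of the values, and the rule that an explicit overhead takes precedence are assumptions.

```python
def runtimeclass_overhead(custom: dict, base_overheads: dict):
    """Pick the overhead.podFixed value for a custom runtime.

    Returns None when no overhead should be set on the RuntimeClass.
    """
    explicit = custom.get("overhead")
    if explicit is not None:
        return explicit
    if not custom.get("inheritBaseOverhead", True):
        return None
    # Inherit from the built-in runtime named by baseConfig.
    return base_overheads.get(custom["baseConfig"])
```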
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
For containerd v2.2+, the flow assumes that the imports directive is
already present. Check for it and add it if it is missing.
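
A minimal sketch of such a check, in Python for illustration. It is not the kata-deploy code; the function name and the drop-in path used in the usage note are hypothetical. It relies on the TOML rule that top-level keys must appear before the first table header.

```python
import re

_TABLE_HEADER = re.compile(r"^\s*\[")
_IMPORTS_KEY = re.compile(r"^\s*imports\s*=")


def ensure_imports(config_text: str, drop_in_glob: str) -> str:
    """Return config_text with a top-level `imports` key guaranteed.

    Only lines before the first table header are inspected, since a key
    named `imports` inside a table is not the top-level directive.
    """
    for line in config_text.splitlines():
        if _TABLE_HEADER.match(line):
            break
        if _IMPORTS_KEY.match(line):
            return config_text  # already present, leave the file alone
    return f'imports = ["{drop_in_glob}"]\n' + config_text
```

For example, `ensure_imports(text, "/etc/containerd/conf.d/*.toml")` prepends the directive only when the existing config lacks one.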
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>