In cleanup_kata_deploy, bail out early when no kata-deploy Helm release
exists so baremetal-* pre-deploy cleanup on fresh clusters does not
block on helm uninstall --wait (up to 10m).
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Plumb a resources block into the kata-deploy DaemonSet container in
the Helm chart so the cluster can size its memory footprint
predictably.
Defaults are sized from real /proc/<pid>/status numbers on an
unpatched 3.30.0 build running on a ~220-vCPU GPU node:
VmRSS: 9944 kB (~9.7 MiB) <- actual physical memory
RssAnon: 2628 kB (~2.6 MiB) <- heap + dirty stack pages
VmData: 464668 kB (~454 MiB) <- tokio multi-thread workers'
reserved-but-untouched stacks
Threads: 225 <- num_cpus()-driven worker pool
That VmData number is the source of the original "kata-deploy is
using 400 MB" reports: any monitoring layer that surfaces virtual
data size, committed memory, or memory.usage_in_bytes on a kernel
that includes mapped-but-untouched memory will happily reproduce
~400 MB even though only ~10 MiB is ever made resident. The earlier
commits in this series (current_thread tokio, mimalloc, shared kube
client, JSONPath removal, post-install re-exec) collapse VmData into
the tens of MiB and drop the post-install resident set further.
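As a rough cross-check of the numbers above, the sketch below converts the
/proc values to MiB and estimates the stack reservation. It assumes each
thread reserves about 2 MiB of stack (tokio's documented default thread stack
size); the function names are illustrative and are not part of kata-deploy.

```rust
/// Convert a kB value from /proc/<pid>/status to MiB.
fn kb_to_mib(kb: u64) -> f64 {
    kb as f64 / 1024.0
}

/// Estimate the virtual memory reserved for thread stacks, in MiB.
/// Assumption: every thread reserves `stack_mib` MiB of stack.
fn reserved_stack_mib(threads: u64, stack_mib: u64) -> u64 {
    threads * stack_mib
}
```

With 225 threads and 2 MiB per stack this gives 450 MiB, which is in line
with the ~454 MiB VmData figure quoted above.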
The defaults below are picked accordingly:
requests:
cpu: 25m # install is mostly I/O wait; the post-install
# waiter is genuinely idle
memory: 16Mi # ~2x headroom over the unpatched VmRSS we
# measured, far more over the patched waiter
Operators who hit OOMKilled on unusually large or churny clusters can
override `resources` directly in their Helm values (or set it to {}
to remove all requests and inherit cluster defaults).
Fixes: https://github.com/kata-containers/kata-containers/discussions/12976
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Register the new qemu-nvidia-gpu-tdx-runtime-rs shim across the kata-deploy
stack so it is built, installed, and exposed as a RuntimeClass.
This adds the shim to the Rust binary's RUST_SHIMS list (so it uses the
runtime-rs binary), SHIMS list, the qemu-tdx-experimental share name
mapping, and the x86_64 default shim set. The Helm chart gets the new
shim entry in values.yaml, try-kata-nvidia-gpu.values.yaml, and the
RuntimeClass overhead definition in runtimeclasses.yaml.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Register the new qemu-nvidia-gpu-snp-runtime-rs shim across the kata-deploy
stack so it is built, installed, and exposed as a RuntimeClass.
This adds the shim to the Rust binary's RUST_SHIMS list (so it uses the
runtime-rs binary), SHIMS list, the qemu-snp-experimental share name
mapping, and the x86_64 default shim set. The Helm chart gets the new
shim entry in values.yaml, try-kata-nvidia-gpu.values.yaml, and the
RuntimeClass overhead definition in runtimeclasses.yaml.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Register the Rust NVIDIA GPU runtime as a kata-deploy shim so it gets
installed and configured alongside the existing Go-based
qemu-nvidia-gpu shim.
Add qemu-nvidia-gpu-runtime-rs to the RUST_SHIMS list and the default
enabled shims, create its RuntimeClass entry in the Helm chart, and
include it in the try-kata-nvidia-gpu values overlay. The kata-deploy
installer will now copy the runtime-rs configuration and create the
containerd runtime entry for it.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The kata-deploy DaemonSet pod had no Kubernetes health probes, so the
kubelet could not distinguish between "still installing" and "crashed",
and rolling updates would proceed to the next node before install
actually finished.
Add a lightweight HTTP health server (built on raw tokio TcpListener,
no new crate dependencies) that starts immediately in the install path:
/healthz — liveness: returns 200 as soon as the server binds
/readyz — readiness: returns 503 while installing, 200 after
install completes (artifacts extracted, CRI restarted,
node labeled)
Wire the Helm chart with startup, liveness, and readiness probes
(all individually toggleable). The startup probe allows up to 10
minutes for install to complete before the liveness probe takes over.
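The endpoint behaviour described above can be sketched as a pure routing
function. This is only an illustration of the status codes, not the actual
server code: `install_done` is a hypothetical flag, and the 404 for unknown
paths is an assumption.

```rust
/// Map a request path to an HTTP status code.
/// `install_done` is a hypothetical flag that flips to true once install
/// has completed (artifacts extracted, CRI restarted, node labeled).
fn health_status(path: &str, install_done: bool) -> u16 {
    match path {
        // Liveness: healthy as soon as the server is able to answer.
        "/healthz" => 200,
        // Readiness: only ready after install has finished.
        "/readyz" if install_done => 200,
        "/readyz" => 503,
        // Assumption: any other path is rejected.
        _ => 404,
    }
}
```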
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
We should not ship configurations that we do not actively test.
This commit drops the following from the kata-deploy helm chart:
values.yaml:
- arm64 from supportedArches for the clh shim
- arm64 from supportedArches for the cloud-hypervisor shim
- arm64 from supportedArches for the dragonball shim
- arm64 from supportedArches for the fc shim
- arm64 from supportedArches for the qemu-nvidia-gpu shim
- the entire qemu-cca shim definition
try-kata-tee.values.yaml:
- CCA from the file description comment
- qemu-cca from the TEE shims list comment
- the entire qemu-cca shim definition
- arm64: qemu-cca from the defaultShim mapping, replaced with
arm64: qemu-coco-dev-runtime-rs (which is tested)
try-kata-nvidia-gpu.values.yaml:
- arm64 from supportedArches for the qemu-nvidia-gpu shim
- arm64: qemu-nvidia-gpu from the defaultShim mapping
Once arm64 and qemu-cca support are properly tested, they can be
re-added.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Add aarch64/arm64 to the list of supported architectures for
qemu-coco-dev and qemu-coco-dev-runtime-rs shims across kata-deploy
configuration, Helm chart values, and test helper scripts.
Note that guest-components and the related build dependencies are not
yet wired for arm64 in these configurations; those will be addressed
separately.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Add two new Helm values under `containerd`:
- `configDir`: overrides the host directory where the containerd
config lives, taking precedence over the k8sDistribution-based
auto-detection.
- `configFileName`: overrides the containerd config file name,
propagated to the kata-deploy binary via the new
CONTAINERD_CONFIG_FILE_NAME environment variable.
These are useful for non-standard containerd setups that don't match
any of the built-in k8sDistribution presets (k8s, k3s, rke2, k0s,
microk8s).
The config file name override only affects the default runtime branch
in get_containerd_paths(). The k0s/microk8s/k3s/rke2 branches are
left untouched since those runtimes have mandatory file naming
conventions.
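A minimal sketch of that branching, under stated assumptions: the function
below is not get_containerd_paths() itself, `default_name` stands in for
whatever name the default branch already uses, and `None` simply means "the
distribution-specific name is used and the override is ignored".

```rust
/// Decide which containerd config file name applies.
/// Returns `None` for distributions with mandatory naming conventions,
/// whose branches are not modeled in this sketch.
fn config_file_name<'a>(
    distribution: &str,
    name_override: Option<&'a str>,
    default_name: &'a str,
) -> Option<&'a str> {
    match distribution {
        // The override is deliberately ignored for these distributions.
        "k0s" | "microk8s" | "k3s" | "rke2" => None,
        // Default branch: honour the override when one is set.
        _ => Some(name_override.unwrap_or(default_name)),
    }
}
```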
Also fixes a spurious leading space in the k3s containerdConfPath
branch.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Add a global and per-shim configurable switch to enable/disable
the overhead section in generated RuntimeClasses. This allows users
to omit overhead when it's not needed or managed externally.
Priority: per-shim > global > default(true).
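The priority rule can be sketched as follows. The real logic lives in the
Helm templates; this Rust function is only an illustration, with `None`
standing for "value not set" and hypothetical parameter names.

```rust
/// Resolve whether the overhead section is rendered for a shim.
/// Priority: per-shim setting, then global setting, then default (true).
fn overhead_enabled(per_shim: Option<bool>, global: Option<bool>) -> bool {
    per_shim.or(global).unwrap_or(true)
}
```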
Signed-off-by: LizZhang315 <123134987@qq.com>
Joji added the labels to the default values.yaml, but we missed
adding them to the NVIDIA-specific values.yaml file.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When a kata-deploy DaemonSet pod is restarted (e.g. due to a label
change or rolling update), the SIGTERM handler runs cleanup which
unconditionally removes kata artifacts and restarts containerd. This
causes containerd to lose the kata shim binary, crashing all running
kata pods on the node.
Fix this by implementing a three-stage cleanup decision:
1. If this pod's owning DaemonSet still exists (exact name match via
DAEMONSET_NAME env var), this is a pod restart — skip all cleanup.
The replacement pod will re-run install, which is idempotent.
2. If this DaemonSet is gone but other kata-deploy DaemonSets still
exist (multi-install scenario), perform instance-specific cleanup
only (snapshotters, CRI config, artifacts) but skip shared
resources (node label removal, CRI restart) to avoid disrupting
the other instances.
3. If no kata-deploy DaemonSets remain, perform full cleanup including
node label removal and CRI restart.
The Helm chart injects a DAEMONSET_NAME environment variable with the
exact DaemonSet name (including any multi-install suffix), ensuring
instance-aware lookup rather than broadly matching any DaemonSet
containing "kata-deploy".
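The three-stage decision above can be sketched as a small pure function. The
type and parameter names are hypothetical; the actual handler would obtain
both inputs by querying the API server for DaemonSets.

```rust
#[derive(Debug, PartialEq)]
enum CleanupAction {
    /// Pod restart: the owning DaemonSet still exists, do nothing.
    Skip,
    /// Instance-specific cleanup only; shared resources are left alone.
    InstanceOnly,
    /// Full cleanup, including node label removal and CRI restart.
    Full,
}

/// Decide how much cleanup to perform on SIGTERM.
fn decide_cleanup(own_daemonset_exists: bool, other_kata_daemonsets: usize) -> CleanupAction {
    if own_daemonset_exists {
        CleanupAction::Skip
    } else if other_kata_daemonsets > 0 {
        CleanupAction::InstanceOnly
    } else {
        CleanupAction::Full
    }
}
```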
Fixes: #12761
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
We've been using `experimental_force_guest_pull`, but now that we have a
containerd release that should work more reliably with the multi
snapshotter setup, we want to give it a try.
Note: We need containerd 2.2.2+.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This supersedes https://github.com/kata-containers/kata-containers/pull/12622.
I replaced Zensical with mkdocs-materialx. Materialx is a fork of mkdocs-material
created after mkdocs-material was put into maintenance mode. We'll use this
platform until Zensical is more feature complete.
Added a few of the existing docs into the site to make a more user-friendly flow.
Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
Allow users to override the default RuntimeClass pod overhead for
any shim via shims.<name>.runtimeClass.overhead.{memory,cpu}.
When the field is absent, the existing hardcoded defaults from the
dict are used, so this is fully backward compatible.
Signed-off-by: Zachary Spar <zspar@coreweave.com>
kata-deploy restarts the CRI runtime (k3s/containerd) during install,
which can kill the verification job pod or cause transient API server
errors. Bump backoffLimit from 0 to 3 so the job can retry after being
killed, and add a retry loop around kubectl rollout status to handle
transient connection failures.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Move the cleanup logic from a preStop lifecycle hook (separate exec)
into the main process's SIGTERM handler. This simplifies the
architecture: the install process now handles its own teardown when
the pod is terminated.
The SIGTERM handler is registered before install begins, and
tokio::select! races install against SIGTERM so cleanup always runs
even if SIGTERM arrives mid-install (e.g. helm uninstall while the
container is restarting after a failed install attempt).
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When image.reference or kubectlImage.reference already contains a digest
(e.g. quay.io/...@sha256:...), use the reference as-is instead of
appending :tag. This avoids invalid image strings like 'image@sha256:abc:'
when tag is empty and allows users to pin by digest.
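The rule can be sketched like this. The chart implements it in a Helm
template; the Rust function below is only an illustration, and detecting a
digest by looking for `@` is an assumption of this sketch.

```rust
/// Build the final image string from a reference and a tag.
/// A reference that already carries a digest is returned unchanged.
fn image_string(reference: &str, tag: &str) -> String {
    if reference.contains('@') {
        // Already pinned by digest (e.g. ...@sha256:...): use as-is.
        reference.to_string()
    } else {
        format!("{}:{}", reference, tag)
    }
}
```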
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
getDefaultShimForArch previously returned whatever string was set in
defaultShim.<arch> without any validation. A typo, a non-existent shim,
or a shim that is disabled via disableAll would all silently produce a
bogus DEFAULT_SHIM_* env var, causing kata-deploy to fail at runtime.
Guard the return value by checking whether the configured shim is
present in the list of shims that are both enabled and support the
requested architecture. If not, return empty string so the env var is
simply omitted.
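A minimal sketch of the guard, written in Rust for illustration (the real
check is in the getDefaultShimForArch template helper). `usable_shims` is a
hypothetical name for the list of shims that are enabled and support the
requested architecture.

```rust
/// Return the configured default shim only if it is usable on this
/// architecture; otherwise return an empty string so that the caller
/// omits the env var.
fn default_shim_for_arch<'a>(configured: &'a str, usable_shims: &[&str]) -> &'a str {
    if usable_shims.iter().any(|shim| *shim == configured) {
        configured
    } else {
        ""
    }
}
```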
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Using `$runtime.containerd.snapshotter` and `$runtime.crio.pullType`
panics with a nil pointer error when the containerd or crio block is
absent from the custom runtime definition.
Let's use the `dig` function which safely traverses nested keys and
returns an empty string as the default when any key in the path is
missing.
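In the template this corresponds to a call along the lines of
`dig "containerd" "snapshotter" "" $runtime`. The same "missing key yields an
empty string" behaviour can be sketched in Rust with optional fields; the
struct names here are hypothetical and only model the shape of a custom
runtime definition.

```rust
struct Containerd {
    snapshotter: Option<String>,
}

struct Runtime {
    containerd: Option<Containerd>,
}

/// Safely read runtime.containerd.snapshotter, defaulting to "" when
/// either the containerd block or the snapshotter key is absent.
fn snapshotter(runtime: &Runtime) -> &str {
    runtime
        .containerd
        .as_ref()
        .and_then(|c| c.snapshotter.as_deref())
        .unwrap_or("")
}
```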
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When kata-deploy is deployed with cloud-api-adaptor, it
defaults to qemu instead of configuring the remote shim.
Support ppc64le to enable it correctly when shims.remote.enabled=true.
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
The default RuntimeClass (e.g. kata) is meant to point at the default shim
handler (e.g. kata-qemu-$tee). We were building it in a separate block and
only sometimes adding the same TEE nodeSelectors as the shim-specific
RuntimeClass, leading to kata ending up without the SE/SNP/TDX
nodeSelector while kata-qemu-$tee had it.
The fix is to stop duplicating the RuntimeClass definition, having a
single template that renders one RuntimeClass (name, handler, overhead,
nodeSelectors).
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When NFD is detected (deployed by the chart or existing in the cluster),
apply shim-specific nodeSelectors only for TEE runtime classes (snp,
tdx, and se).
Non-TEE shims keep existing behavior (e.g. runtimeClass.nodeSelector for
nvidia GPU from f3bba0885 is unchanged).
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
We depend on GPU Operator v26.3 release, which is not out yet.
Although we have been testing with it, it's not yet publicly available,
which would break anyone actually trying to use the GPU runtime classes.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The CC runtime classes kata-qemu-nvidia-gpu-snp and kata-qemu-nvidia-gpu-tdx
are mutually exclusive with kata-qemu-nvidia-gpu, as dictated by the GPU
CC mode setting. To properly support a cluster that has both CC and
non-CC nodes, we use a node selector so that scheduling is consistent with the
GPU mode. The GPU Operator sets the label nvidia.com/cc.ready.state=[true, false]
to indicate the GPU mode setting.
Fixes: #12431
Signed-off-by: Joji Mekkattuparamban <jojim@nvidia.com>
On helm uninstall let's rely on a preStop hook to run kata-deploy
cleanup so each pod cleans its node before exiting.
We **must** keep RBAC (resource-policy: keep) so pods retain API access
during termination, and then can properly delete the NodeFeatureRules
and remove the labels from the nodes.
The post-delete hook Job, which runs on a single node, is now only
responsible for cleaning up the kept RBAC (cluster-wide resources) after
uninstall, so that no resource or artefact is left behind.
The changes in this commit lead to a "resources were kept" message when
running `helm uninstall`, which we document as being normal, as the
post-delete job will remove them.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
- Trim trailing whitespace and ensure final newline in non-vendor files
- Add .editorconfig-checker.json excluding vendor dirs, *.patch, *.img,
*.dtb, *.drawio, *.svg, and pkg/cloud-hypervisor/client so CI only
checks project code
- Leave generated and binary assets unchanged (excluded from checker)
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
When kata-deploy installs Kata Containers, the base configuration files
should not be modified directly. This change adds documentation explaining
how to use drop-in configuration files for customization, and prepends a
warning comment to all deployed configuration files reminding users to use
drop-in files instead.
The warning is added to both standard shim configurations and custom
runtime configurations. It includes a brief explanation of how drop-in
files work and points users to the documentation for more details.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Allow the updateStrategy to be configured for the kata-deploy Helm
chart, so that administrators can control how aggressively updates roll
out. For a less aggressive approach, the strategy can be set to
`OnDelete`. Alternatively, the update process can be made more
aggressive by adjusting the `maxUnavailable` parameter.
Signed-off-by: Nikolaj Lindberg Lerche <nlle@ambu.com>
Update the Helm chart README to document the new shims.disableAll
option and simplify the examples that previously required listing
every shim to disable.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Simplify the example values files by using the new shims.disableAll
option instead of listing every shim to disable.
Before (try-kata-nvidia-gpu.values.yaml):
shims:
clh:
enabled: false
cloud-hypervisor:
enabled: false
# ... 15 more lines ...
After:
shims:
disableAll: true
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add a new `shims.disableAll` option that disables all standard shims
at once. This is useful when:
- Enabling only specific shims without listing every other shim
- Using custom runtimes only mode (no standard Kata shims)
Usage:
shims:
disableAll: true
qemu:
enabled: true # Only qemu is enabled
All helper templates are updated to check for this flag before
iterating over shims.
Note that Helm recursively merges user values with chart defaults,
which makes a simple `disableAll` flag problematic: if the defaults have
`enabled: true`, the user's `disableAll: true` is merged with those
defaults, and all shims remain enabled.
The workaround is to use null (`~`) as the default for the `enabled`
field. The template logic interprets null differently based on
disableAll:
| enabled value | disableAll: false | disableAll: true |
|---------------|-------------------|------------------|
| ~ (null) | Enabled | Disabled |
| true | Enabled | Enabled |
| false | Disabled | Disabled |
This is backward compatible:
- Default behavior unchanged: all shims enabled when disableAll: false
- Users can set `disableAll: true` to disable all, then explicitly
enable specific shims with `enabled: true`
- Explicit `enabled: false` always disables, regardless of disableAll
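The truth table above can be sketched as a single function. The chart
implements this in its helper templates; the Rust version below is only an
illustration, with `None` standing for the null (`~`) default.

```rust
/// Resolve whether a shim is enabled.
/// An explicit `enabled` value always wins; null falls back to the
/// inverse of `disable_all`.
fn shim_enabled(enabled: Option<bool>, disable_all: bool) -> bool {
    match enabled {
        Some(explicit) => explicit,
        None => !disable_all,
    }
}
```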
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add the necessary configuration and code changes to support QEMU
on arm64 architecture in runtime-rs.
Changes:
- Set MACHINETYPE to "virt" for arm64
- Add machine accelerators "usb=off,gic-version=host" required for
proper arm64 virtualization
- Add arm64-specific kernel parameter "iommu.passthrough=0"
- Guard vIOMMU (Intel IOMMU) to skip on arm64 since it's not supported
These changes align runtime-rs with the Go runtime's arm64 QEMU support.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Kevin Zhao <kevin.zhao@linaro.org>
The verification job mounts a ConfigMap containing the pod spec for
the Kata runtime test. Previously, both the ConfigMap and the Job were
Helm hooks with different weights (-5 and 0 respectively).
On k3s, a race condition was observed where the Job pod would be
scheduled before the kubelet's informer cache had registered the
ConfigMap, causing a FailedMount error:
MountVolume.SetUp failed for volume "pod-spec": object
"kube-system"/"kata-deploy-verification-spec" not registered
This happened because k3s's lightweight architecture schedules pods
very quickly, and the hook weight difference only controls Helm's
ordering, not actual timing between resource creation and cache sync.
By making the ConfigMap a regular chart resource (removing hook
annotations), it is created during the main chart installation phase,
well before any post-install hooks run. This guarantees the ConfigMap
is fully propagated to all kubelets before the verification Job starts.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The verification job needs to list nodes to check for the
katacontainers.io/kata-runtime label and list events to detect
FailedCreatePodSandBox errors during pod creation.
This was discovered when testing with k0s, where the service account
lacked the required cluster-scope permissions to list nodes.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The verification job now supports configurable timeouts to accommodate
different environments and network conditions. The daemonset timeout
defaults to 1200 seconds (20 minutes) to allow for large image downloads,
while the verification pod timeout defaults to 180 seconds.
The job now waits for the DaemonSet to exist, pods to be scheduled,
rollout to complete, and nodes to be labeled before creating the
verification pod. A 15-second delay is added after node labeling to
allow kubelet time to refresh runtime information.
Retry logic with 3 attempts and a 10-second delay handles transient
FailedCreatePodSandBox errors that can occur during runtime
initialization. The job only fails on pod errors after a 30-second
grace period to avoid false positives from timing issues.
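The retry behaviour can be sketched generically as below. This is an
illustration of the pattern only: the verification job's actual
implementation may differ, and `retry` is a hypothetical helper. For the
values above it would be called with 3 attempts and a 10-second delay.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Run `op` up to `attempts` times (at least once), sleeping `delay`
/// between failed attempts. Returns the first success or the last error.
fn retry<T, E>(
    attempts: u32,
    delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut last = op();
    for _ in 1..attempts {
        if last.is_ok() {
            break;
        }
        sleep(delay);
        last = op();
    }
    last
}
```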
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>