kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-05-18 13:46:06 +00:00

Author	SHA1	Message	Date
Fabiano Fidêncio	0f3160276b	ci: k8s: skip no-op Helm uninstall on free runners In cleanup_kata_deploy, bail out early when no kata-deploy Helm release exists so baremetal-* pre-deploy cleanup on fresh clusters does not block on helm uninstall --wait (up to 10m). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-07 13:40:55 +02:00
Fabiano Fidêncio	f5533950e6	kata-deploy: helm: cap container RSS via resources block Plumb a resources block into the kata-deploy DaemonSet container in the Helm chart so the cluster can size its memory footprint predictably. Defaults are sized from real /proc/<pid>/status numbers on an unpatched 3.30.0 build running on a ~220-vCPU GPU node: VmRSS: 9944 kB (~9.7 MiB) <- actual physical memory RssAnon: 2628 kB (~2.6 MiB) <- heap + dirty stack pages VmData: 464668 kB (~454 MiB) <- tokio multi-thread workers' reserved-but-untouched stacks Threads: 225 <- num_cpus()-driven worker pool That VmData number is the source of the original "kata-deploy is using 400 MB" reports: any monitoring layer that surfaces virtual data size, committed memory, or memory.usage_in_bytes on a kernel that includes mapped-but-untouched memory will happily reproduce ~400 MB even though only ~10 MiB is ever made resident. The earlier commits in this series (current_thread tokio, mimalloc, shared kube client, JSONPath removal, post-install re-exec) collapse VmData into the tens of MiB and drop the post-install resident set further. The defaults below are picked accordingly: requests: cpu: 25m # install is mostly I/O wait; the post-install # waiter is genuinely idle memory: 16Mi # ~2x headroom over the unpatched VmRSS we # measured, far more over the patched waiter Operators who hit OOMKilled on unusually large or churny clusters can override `resources` directly in their Helm values (or set it to {} to remove all requests and inherit cluster defaults). Fixes: https://github.com/kata-containers/kata-containers/discussions/12976 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-07 13:40:55 +02:00
Fabiano Fidêncio	1682b73e38	kata-deploy: Add qemu-nvidia-gpu-tdx-runtime-rs shim Register the new qemu-nvidia-gpu-tdx-runtime-rs shim across the kata-deploy stack so it is built, installed, and exposed as a RuntimeClass. This adds the shim to the Rust binary's RUST_SHIMS list (so it uses the runtime-rs binary), SHIMS list, the qemu-tdx-experimental share name mapping, and the x86_64 default shim set. The Helm chart gets the new shim entry in values.yaml, try-kata-nvidia-gpu.values.yaml, and the RuntimeClass overhead definition in runtimeclasses.yaml. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-07 10:33:26 +02:00
Fabiano Fidêncio	2280620cb9	kata-deploy: Add qemu-nvidia-gpu-snp-runtime-rs shim Register the new qemu-nvidia-gpu-snp-runtime-rs shim across the kata-deploy stack so it is built, installed, and exposed as a RuntimeClass. This adds the shim to the Rust binary's RUST_SHIMS list (so it uses the runtime-rs binary), SHIMS list, the qemu-snp-experimental share name mapping, and the x86_64 default shim set. The Helm chart gets the new shim entry in values.yaml, try-kata-nvidia-gpu.values.yaml, and the RuntimeClass overhead definition in runtimeclasses.yaml. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-07 10:33:26 +02:00
Fabiano Fidêncio	92a8cd56d1	kata-deploy: Add qemu-nvidia-gpu-runtime-rs shim Register the Rust NVIDIA GPU runtime as a kata-deploy shim so it gets installed and configured alongside the existing Go-based qemu-nvidia-gpu shim. Add qemu-nvidia-gpu-runtime-rs to the RUST_SHIMS list and the default enabled shims, create its RuntimeClass entry in the Helm chart, and include it in the try-kata-nvidia-gpu values overlay. The kata-deploy installer will now copy the runtime-rs configuration and create the containerd runtime entry for it. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-07 10:33:26 +02:00
Fabiano Fidêncio	49396b7991	kata-deploy: add HTTP health probes (healthz/readyz) The kata-deploy DaemonSet pod had no Kubernetes health probes, so the kubelet could not distinguish between "still installing" and "crashed", and rolling updates would proceed to the next node before install actually finished. Add a lightweight HTTP health server (built on raw tokio TcpListener, no new crate dependencies) that starts immediately in the install path: /healthz — liveness: returns 200 as soon as the server binds /readyz — readiness: returns 503 while installing, 200 after install completes (artifacts extracted, CRI restarted, node labeled) Wire the Helm chart with startup, liveness, and readiness probes (all individually toggleable). The startup probe allows up to 10 minutes for install to complete before the liveness probe takes over. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-03 22:09:08 +02:00
stevenhorsman	a1a6a9a150	release: Bump version to 3.30.0 Bump VERSION and helm-charts versions. Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-05-02 17:57:39 +01:00
Aurélien Bombo	e4fbddb91a	ci: rename cloud-hypervisor to clh-runtime-rs This aligns on qemu-runtime-rs and makes more sense. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-04-28 10:58:01 -05:00
Fabiano Fidêncio	ed3f8b4efe	release: Bump version to 3.29.0 Bump VERSION and helm-charts versions. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-22 15:57:39 +02:00
Fabiano Fidêncio	9b62021049	kata-deploy: Remove untested arm64 and qemu-cca shim support We should not ship configurations that we do not actively test. This commit drops the following from the kata-deploy helm chart: values.yaml: - arm64 from supportedArches for the clh shim - arm64 from supportedArches for the cloud-hypervisor shim - arm64 from supportedArches for the dragonball shim - arm64 from supportedArches for the fc shim - arm64 from supportedArches for the qemu-nvidia-gpu shim - the entire qemu-cca shim definition try-kata-tee.values.yaml: - CCA from the file description comment - qemu-cca from the TEE shims list comment - the entire qemu-cca shim definition - arm64: qemu-cca from the defaultShim mapping, replaced with arm64: qemu-coco-dev-runtime-rs (which is tested) try-kata-nvidia-gpu.values.yaml: - arm64 from supportedArches for the qemu-nvidia-gpu shim - arm64: qemu-nvidia-gpu from the defaultShim mapping Once arm64 and qemu-cca support are properly tested, they can be re-added. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-04-22 10:55:29 +02:00
Fabiano Fidêncio	588a67a3fb	kata-deploy: add arm64 support for qemu-coco-dev shims Add aarch64/arm64 to the list of supported architectures for qemu-coco-dev and qemu-coco-dev-runtime-rs shims across kata-deploy configuration, Helm chart values, and test helper scripts. Note that guest-components and the related build dependencies are not yet wired for arm64 in these configurations; those will be addressed separately. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-04-18 00:48:13 +02:00
Fabiano Fidêncio	df1d02d3cf	kata-deploy: Allow overriding containerd config path and file name Add two new Helm values under `containerd`: - `configDir`: overrides the host directory where the containerd config lives, taking precedence over the k8sDistribution-based auto-detection. - `configFileName`: overrides the containerd config file name, propagated to the kata-deploy binary via the new CONTAINERD_CONFIG_FILE_NAME environment variable. These are useful for non-standard containerd setups that don't match any of the built-in k8sDistribution presets (k8s, k3s, rke2, k0s, microk8s). The config file name override only affects the default runtime branch in get_containerd_paths(). The k0s/microk8s/k3s/rke2 branches are left untouched since those runtimes have mandatory file naming conventions. Also fixes a spurious leading space in the k3s containerdConfPath branch. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-04-13 22:31:55 +02:00
LizZhang315	2312f67c9b	helm: add overheadEnabled switch for runtimeclass Add a global and per-shim configurable switch to enable/disable the overhead section in generated RuntimeClasses. This allows users to omit overhead when it's not needed or managed externally. Priority: per-shim > global > default(true). Signed-off-by: LizZhang315 <123134987@qq.com>	2026-04-10 10:26:11 +02:00
Fabiano Fidêncio	bc719a66eb	kata-deploy: nvidia: Align force_guest_pull with default values.yaml The default is already false, but let's keep those aligned by explicitly setting the default. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-08 14:41:21 +02:00
Fabiano Fidêncio	78f02f2155	kata-deploy: nvidia: Align labels with default values.yaml Joji's added the labels for the default values.yaml, but we missed adding those to the nvidia specific values.yaml file. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-08 14:41:21 +02:00
Fabiano Fidêncio	f00b589ccd	Revert "kata-deploy: Temporarily comment GPU specific labels" This reverts commit `02c9a4b23c`, as GPU Operator v26.3.0 is out, and becomes a requirement. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-08 14:41:21 +02:00
Fabiano Fidêncio	47770daa3b	helm: Align values.yaml with try-kata-nvidia-gpu.values.yaml We've switched to nydus there, but never did for the values.yaml. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-06 18:51:54 +02:00
Fabiano Fidêncio	b4b62417ed	kata-deploy: skip cleanup on pod restart to avoid crashing kata pods When a kata-deploy DaemonSet pod is restarted (e.g. due to a label change or rolling update), the SIGTERM handler runs cleanup which unconditionally removes kata artifacts and restarts containerd. This causes containerd to lose the kata shim binary, crashing all running kata pods on the node. Fix this by implementing a three-stage cleanup decision: 1. If this pod's owning DaemonSet still exists (exact name match via DAEMONSET_NAME env var), this is a pod restart — skip all cleanup. The replacement pod will re-run install, which is idempotent. 2. If this DaemonSet is gone but other kata-deploy DaemonSets still exist (multi-install scenario), perform instance-specific cleanup only (snapshotters, CRI config, artifacts) but skip shared resources (node label removal, CRI restart) to avoid disrupting the other instances. 3. If no kata-deploy DaemonSets remain, perform full cleanup including node label removal and CRI restart. The Helm chart injects a DAEMONSET_NAME environment variable with the exact DaemonSet name (including any multi-install suffix), ensuring instance-aware lookup rather than broadly matching any DaemonSet containing "kata-deploy". Fixes: #12761 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-01 15:20:52 +02:00
Fabiano Fidêncio	514a2b1a7c	Merge pull request #12264 from fidencio/topic/nvidia-gpu-cc-use-nydus-snapshotter nvidia: cc: Use nydus-snapshotter	2026-03-23 12:50:15 +01:00
Fabiano Fidêncio	6194510e90	nvidia: cc: Use nydus-snapshotter We've been using `experimental_force_guest_pull`, but now that we have a containerd release that should work more reliably with the multi snapshotter setup, we want to give it a try. Note: We need containerd 2.2.2+. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-22 10:10:34 +01:00
LandonTClipp	795869152d	docs: Move to mkdocs-material, port Helm to docs site This supersedes https://github.com/kata-containers/kata-containers/pull/12622. I replaced Zensical with mkdocs-materialx. Materialx is a fork of mkdocs-material created after mkdocs-material was put into maintenance mode. We'll use this platform until Zensical is more feature complete. Added a few of the existing docs into the site to make a more user-friendly flow. Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>	2026-03-20 14:51:39 -05:00
Manuel Huber	5210584f95	release: Bump version to 3.28.0 Bump VERSION and helm-charts versions. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-16 09:52:35 -07:00
Zvonko Kaiser	99f32de1e5	kata-deploy: Update RuntimeClass PodOverhead Align the podOverhead with the default_memory updated in the previous commit. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-03-15 09:53:32 -07:00
Zachary Spar	bda9f6491f	kata-deploy: add per-shim configurable pod overhead Allow users to override the default RuntimeClass pod overhead for any shim via shims.<name>.runtimeClass.overhead.{memory,cpu}. When the field is absent the existing hardcoded defaults from the dict are used, so this is fully backward compatible. Signed-off-by: Zachary Spar <zspar@coreweave.com>	2026-03-05 08:00:01 +01:00
Fabiano Fidêncio	ebe75cc3e3	kata-deploy: make verification job resilient to CRI runtime restarts kata-deploy restarts the CRI runtime (k3s/containerd) during install, which can kill the verification job pod or cause transient API server errors. Bump backoffLimit from 0 to 3 so the job can retry after being killed, and add a retry loop around kubectl rollout status to handle transient connection failures. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 11:26:31 +01:00
Fabiano Fidêncio	7a08ef2f8d	kata-deploy: run cleanup on SIGTERM instead of preStop hook Move the cleanup logic from a preStop lifecycle hook (separate exec) into the main process's SIGTERM handler. This simplifies the architecture: the install process now handles its own teardown when the pod is terminated. The SIGTERM handler is registered before install begins, and tokio::select! races install against SIGTERM so cleanup always runs even if SIGTERM arrives mid-install (e.g. helm uninstall while the container is restarting after a failed install attempt). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 11:26:31 +01:00
Fabiano Fidêncio	8c91e7889c	helm-chart: support digest pinning for images When image.reference or kubectlImage.reference already contains a digest (e.g. quay.io/...@sha256:...), use the reference as-is instead of appending :tag. This avoids invalid image strings like 'image@sha256:abc:' when tag is empty and allows users to pin by digest. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-02-26 13:39:51 +01:00
Mathieu Parent	b61d169472	kata-deploy: allow to configure kubectl image This can be used to: - pin tag (current is 20260112) - pin digest - use another image Signed-off-by: Mathieu Parent <mathieu.parent@insee.fr>	2026-02-26 13:12:03 +01:00
Fabiano Fidêncio	b082cf1708	kata-deploy: validate defaultShim is enabled before propagating it getDefaultShimForArch previously returned whatever string was set in defaultShim.<arch> without any validation. A typo, a non-existent shim, or a shim that is disabled via disableAll would all silently produce a bogus DEFAULT_SHIM_* env var, causing kata-deploy to fail at runtime. Guard the return value by checking whether the configured shim is present in the list of shims that are both enabled and support the requested architecture. If not, return empty string so the env var is simply omitted. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-02-21 14:01:11 +01:00
Fabiano Fidêncio	4ff7f67278	kata-deploy: fix nil pointer when custom runtime omits containerd/crio Using `$runtime.containerd.snapshotter` and `$runtime.crio.pullType` panics with a nil pointer error when the containerd or crio block is absent from the custom runtime definition. Let's use the `dig` function which safely traverses nested keys and returns an empty string as the default when any key in the path is missing. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-02-21 13:59:41 +01:00
Fabiano Fidêncio	855f4dc7fa	release: Bump version to 3.27.0 Bump VERSION and helm-charts versions. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-02-19 14:01:26 +01:00
Amulyam24	a22c59a204	kata-deploy: enable kata-remote for ppc64le When kata-deploy is deployed with cloud-api-adaptor, it defaults to qemu instead of configuring the remote shim. Support ppc64le to enable it correctly when shims.remote.enabled=true Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>	2026-02-19 11:14:27 +01:00
Fabiano Fidêncio	0e8e30d6b5	kata-deploy: fix default RuntimeClass + nodeSelectors The default RuntimeClass (e.g. kata) is meant to point at the default shim handler (e.g. kata-qemu-$tee). We were building it in a separate block and only sometimes adding the same TEE nodeSelectors as the shim-specific RuntimeClass, leading to kata ending up without the SE/SNP/TDX nodeSelector while kata-qemu-$tee had it. The fix is to stop duplicating the RuntimeClass definition, having a single template that renders one RuntimeClass (name, handler, overhead, nodeSelectors). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-02-16 13:09:03 +01:00
Fabiano Fidêncio	80a175d09b	kata-deploy: Add TEE nodeSelectors for TEE shims when NFD is detected When NFD is detected (deployed by the chart or existing in the cluster), apply shim-specific nodeSelectors only for TEE runtime classes (snp, tdx, and se). Non-TEE shims keep existing behavior (e.g. runtimeClass.nodeSelector for nvidia GPU from `f3bba0885` is unchanged). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-02-16 12:07:51 +01:00
Fabiano Fidêncio	02c9a4b23c	kata-deploy: Temporarily comment GPU specific labels We depend on GPU Operator v26.3 release, which is not out yet. Although we have been testing with it, it's not yet publicly available, which would break anyone actually trying to use the GPU runtime classes. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-02-14 09:25:14 +01:00
Joji Mekkattuparamban	f3bba08851	kata-deploy: add node selector to nvidia runtime classes The CC runtime classes kata-qemu-nvidia-gpu-snp and kata-qemu-nvidia-gpu-tdx are mutually exclusive with kata-qemu-nvidia-gpu, as dictated by the gpu cc mode setting. In order to properly support a cluster that has both CC and non-CC nodes, we use a node selector so the scheduling is consistent with the GPU mode. The GPU operator sets a label nvidia.com/cc.ready.state=[true, false] to indicate the gpu mode setting Fixes #12431 Signed-off-by: Joji Mekkattuparamban <jojim@nvidia.com>	2026-02-13 15:58:06 +01:00
Fabiano Fidêncio	50923b6d62	kata-deploy: run cleanup on uninstall via DaemonSet preStop On helm uninstall let's rely on a preStop hook to run kata-deploy cleanup so each pod cleans its node before exiting. We must keep RBAC (resource-policy: keep) so pods retain API access during termination, and then can properly delete the NodeFeatureRules and remove the labels from the nodes. The post-delete hook Job, which runs on a single node, is now only responsible for cleaning the kept RBAC (cluster-wide resource) after uninstall, not leaving any resource or artefact behind. The changes in this commit lead to a "resources were kept" message when running `helm uninstall`, which we document as being normal, as the post-delete job will remove those. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-02-11 22:05:10 +01:00
Fabiano Fidêncio	5c0269881e	tests: Make editorconfig-checker happy - Trim trailing whitespace and ensure final newline in non-vendor files - Add .editorconfig-checker.json excluding vendor dirs, .patch, .img, .dtb, .drawio, *.svg, and pkg/cloud-hypervisor/client so CI only checks project code - Leave generated and binary assets unchanged (excluded from checker) Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-10 21:58:28 +01:00
Fabiano Fidêncio	4cb2aea9dd	kata-deploy: Document drop-in configuration and add warning to config files When kata-deploy installs Kata Containers, the base configuration files should not be modified directly. This change adds documentation explaining how to use drop-in configuration files for customization, and prepends a warning comment to all deployed configuration files reminding users to use drop-in files instead. The warning is added to both standard shim configurations and custom runtime configurations. It includes a brief explanation of how drop-in files work and points users to the documentation for more details. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-02-10 18:12:17 +01:00
Nikolaj Lindberg Lerche	6e98df2bac	kata-deploy: Make update strategy configurable for kata-deploy DaemonSet This allows the updateStrategy to be configured for the kata-deploy Helm chart, enabling administrators to control the aggressiveness of updates. For a less aggressive approach, the strategy can be set to `OnDelete`. Alternatively, the update process can be made more aggressive by adjusting the `maxUnavailable` parameter. Signed-off-by: Nikolaj Lindberg Lerche <nlle@ambu.com>	2026-02-01 20:14:29 +01:00
Fabiano Fidêncio	b85393e70b	release: Bump version to 3.26.0 Bump VERSION and helm-charts versions. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-01-29 00:23:26 +01:00
Fabiano Fidêncio	04f45a379c	kata-deploy: docs: Document shims.disableAll option Update the Helm chart README to document the new shims.disableAll option and simplify the examples that previously required listing every shim to disable. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-01-26 20:50:01 +01:00
Fabiano Fidêncio	c9e9a682ab	kata-deploy: Use disableAll in example values files Simplify the example values files by using the new shims.disableAll option instead of listing every shim to disable. Before (try-kata-nvidia-gpu.values.yaml): shims: clh: enabled: false cloud-hypervisor: enabled: false # ... 15 more lines ... After: shims: disableAll: true Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-01-26 20:50:01 +01:00
Fabiano Fidêncio	cfe9bcbaf1	kata-deploy: Add shims.disableAll option to Helm chart Add a new `shims.disableAll` option that disables all standard shims at once. This is useful when: - Enabling only specific shims without listing every other shim - Using custom runtimes only mode (no standard Kata shims) Usage: shims: disableAll: true qemu: enabled: true # Only qemu is enabled All helper templates are updated to check for this flag before iterating over shims. One thing that's super important to note here is that helm recursively merges user values with chart defaults, making a simple `disableAll` flag problematic: if defaults have `enabled: true`, user's `disableAll: true` gets merged with those defaults, resulting in all shims still being enabled. The workaround found is to use null (`~`) as the default for `enabled` field. The template logic interprets null differently based on disableAll: \| enabled value \| disableAll: false \| disableAll: true \| \|---------------\|-------------------\|------------------\| \| ~ (null) \| Enabled \| Disabled \| \| true \| Enabled \| Enabled \| \| false \| Disabled \| Disabled \| This is backward compatible: - Default behavior unchanged: all shims enabled when disableAll: false - Users can set `disableAll: true` to disable all, then explicitly enable specific shims with `enabled: true` - Explicit `enabled: false` always disables, regardless of disableAll Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-01-26 20:50:01 +01:00
Fabiano Fidêncio	3be57bb501	kata-deploy: Add Helm chart support for custom runtimes Add Helm chart configuration for defining custom RuntimeClasses with base configuration and drop-in overrides. Usage: helm install kata-deploy ./kata-deploy \ -f custom-runtimes.values.yaml Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-01-26 20:50:01 +01:00
Fabiano Fidêncio	5b82b160e2	runtime-rs: Add arm64 QEMU support Add the necessary configuration and code changes to support QEMU on arm64 architecture in runtime-rs. Changes: - Set MACHINETYPE to "virt" for arm64 - Add machine accelerators "usb=off,gic-version=host" required for proper arm64 virtualization - Add arm64-specific kernel parameter "iommu.passthrough=0" - Guard vIOMMU (Intel IOMMU) to skip on arm64 since it's not supported These changes align runtime-rs with the Go runtime's arm64 QEMU support. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Kevin Zhao <kevin.zhao@linaro.org>	2026-01-23 19:48:31 +01:00
Fabiano Fidêncio	dacb14619d	kata-deploy: Make verification ConfigMap a regular resource The verification job mounts a ConfigMap containing the pod spec for the Kata runtime test. Previously, both the ConfigMap and the Job were Helm hooks with different weights (-5 and 0 respectively). On k3s, a race condition was observed where the Job pod would be scheduled before the kubelet's informer cache had registered the ConfigMap, causing a FailedMount error: MountVolume.SetUp failed for volume "pod-spec": object "kube-system"/"kata-deploy-verification-spec" not registered This happened because k3s's lightweight architecture schedules pods very quickly, and the hook weight difference only controls Helm's ordering, not actual timing between resource creation and cache sync. By making the ConfigMap a regular chart resource (removing hook annotations), it is created during the main chart installation phase, well before any post-install hooks run. This guarantees the ConfigMap is fully propagated to all kubelets before the verification Job starts. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-01-21 20:14:33 +01:00
Fabiano Fidêncio	89e287c3b2	kata-deploy: Add more permissions to verification job's RBAC The verification job needs to list nodes to check for the katacontainers.io/kata-runtime label and list events to detect FailedCreatePodSandBox errors during pod creation. This was discovered when testing with k0s, where the service account lacked the required cluster-scope permissions to list nodes. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-01-21 20:14:33 +01:00
Fabiano Fidêncio	86e0b08b13	kata-deploy: Improve verification job timing and failure detection The verification job now supports configurable timeouts to accommodate different environments and network conditions. The daemonset timeout defaults to 1200 seconds (20 minutes) to allow for large image downloads, while the verification pod timeout defaults to 180 seconds. The job now waits for the DaemonSet to exist, pods to be scheduled, rollout to complete, and nodes to be labeled before creating the verification pod. A 15-second delay is added after node labeling to allow kubelet time to refresh runtime information. Retry logic with 3 attempts and a 10-second delay handles transient FailedCreatePodSandBox errors that can occur during runtime initialization. The job only fails on pod errors after a 30-second grace period to avoid false positives from timing issues. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-01-21 20:14:33 +01:00
Fabiano Fidêncio	5aff81198f	helm-chart: Fix warnings on README nydus -> `nydus` erofs -> `erofs` Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-01-19 22:41:50 +01:00
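
The truth table in commit cfe9bcbaf1 (`shims.disableAll`) can be read as a small tri-state rule: a null `enabled` value defers to the global flag, while an explicit `true` or `false` always wins. The sketch below only illustrates that rule; the chart itself implements it in Helm templates, and the function name `shim_enabled` and its parameters are hypothetical, not taken from the repository.

```python
from typing import Optional


def shim_enabled(enabled: Optional[bool], disable_all: bool) -> bool:
    """Resolve a shim's effective state (hypothetical helper, not chart code).

    enabled     -- the per-shim `enabled` value; None models the null (~) default
    disable_all -- the global `shims.disableAll` flag
    """
    if enabled is None:
        # Null defers to the global flag: enabled unless disableAll is set.
        return not disable_all
    # An explicit true/false always takes precedence over disableAll.
    return enabled


if __name__ == "__main__":
    # Print every row of the table from the commit message.
    for value in (None, True, False):
        for flag in (False, True):
            state = "Enabled" if shim_enabled(value, flag) else "Disabled"
            print(f"enabled={value!r:5} disableAll={flag!r:5} -> {state}")
```

Modelling the default as null rather than `true` is what lets a user-supplied `disableAll: true` survive Helm's recursive merge with the chart defaults, as the commit message explains.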

121 Commits