Update the release workflow and scripts to package and publish
the kata-lifecycle-manager Helm chart alongside kata-deploy.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This chart installs an Argo WorkflowTemplate for orchestrating controlled,
node-by-node upgrades of kata-deploy with verification and automatic
rollback on failure.
The workflow processes nodes sequentially rather than in parallel to
ensure fleet consistency. This design choice prevents ending up with a
mixed-version fleet where some nodes run the new version while others
remain on the old version. If verification fails on any node, the
workflow stops immediately before touching remaining nodes.
Alternative approaches considered:
- withParam loop with semaphore (max-concurrent: 1): Provides cleaner UI
with all nodes visible at the same level, but Argo's semaphore only
controls concurrency, not failure propagation. When one node fails and
releases the lock, other nodes waiting on the semaphore still proceed.
- withParam with failFast: true: Would be ideal, but Argo only supports
failFast for DAG tasks, not for steps with withParam. Attempting to use
it results in "unknown field" errors.
- Single monolithic script: Would guarantee sequential execution and
fail-fast, but loses per-node visibility in the Argo UI and makes
debugging harder.
The chosen approach uses recursive Argo templates (upgrade-node-chain)
which naturally provides fail-fast behavior because if any step in the
chain fails, the recursion stops. Despite the nesting in the Argo UI,
each node's upgrade steps remain visible for monitoring.
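The fail-fast property of the recursive chain can be pictured with a
small shell analogue (upgrade_single_node is a hypothetical stand-in for
the per-node steps, not a function in the chart):

# Each call handles exactly one node and only recurses on success, so a
# failure anywhere leaves the remaining nodes untouched.
upgrade_node_chain() {
    local node="$1"; shift
    [ -z "${node}" ] && return 0
    upgrade_single_node "${node}" || return 1   # fail-fast: stop the chain here
    upgrade_node_chain "$@"                     # recurse over the remaining nodes
}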
A verification pod is required to validate that Kata is functioning
correctly on each node after upgrade. The chart will fail to install
without one. Users must provide the verification pod when installing
kata-lifecycle-manager using --set-file
defaults.verificationPod=./pod.yaml. The pod can also be overridden at
workflow submission time using a base64-encoded workflow parameter.
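A minimal verification pod could look like the sketch below; ${NODE} and
${TEST_POD} are the placeholders substituted by the workflow for each
node, while the runtime class and image are purely illustrative:

cat <<'EOF' > my-verification-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ${TEST_POD}
spec:
  nodeName: ${NODE}              # pin the check to the node that was just upgraded
  restartPolicy: Never
  runtimeClassName: kata-qemu    # illustrative; use whichever shim the fleet runs
  containers:
    - name: check
      image: busybox
      command: ["true"]
EOF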
When passing the verification pod as a workflow parameter, base64
encoding is required because multi-line YAML with special characters
does not survive the journey through Argo CLI and shell script parsing.
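For example (the parameter name below is illustrative; check the
WorkflowTemplate's parameter list for the real one):

argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.25.0 \
  -p verification-pod-b64="$(base64 -w0 < ./my-verification-pod.yaml)"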
The workflow validates prerequisites before touching any nodes. If no
verification pod is configured, the workflow fails immediately with a
clear error message. This prevents partial upgrades that would leave
the cluster in an inconsistent state.
During helm upgrade, kata-deploy's verification is explicitly disabled
(--set verification.pod="") because:
- kata-deploy's verification is cluster-wide, designed for initial install
- kata-lifecycle-manager does per-node verification with proper
placeholder substitution (${NODE}, ${TEST_POD})
- Running kata-deploy's verification on each node would be redundant and
could fail due to unsubstituted placeholders
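The per-node upgrade step therefore runs something along these lines
(release name, chart reference, and namespace are simplified for
illustration):

helm upgrade kata-deploy kata-containers/kata-deploy \
  --namespace kata-system \
  --version "${TARGET_VERSION}" \
  --reuse-values \
  --set verification.pod=""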
On verification failure, the workflow triggers an automatic helm
rollback, waits for kata-deploy to stabilize, uncordons the node, and
marks it with a rolled-back status annotation. The workflow then exits
with an error so the failure is clearly visible.
The upgrade flow per node:
1. Prepare: Annotate node with upgrade status
2. Cordon: Mark node unschedulable
3. Drain (optional): Evict pods if enabled
4. Upgrade: Run helm upgrade with --reuse-values
5. Wait: Wait for kata-deploy DaemonSet pod ready
6. Verify: Run verification pod with substituted placeholders
7. Complete: Uncordon and update annotations
Draining is disabled by default because running Kata VMs continue using
their in-memory binaries after upgrade. Only new workloads use the
upgraded binaries. Users who prefer to evict all workloads before
maintenance can enable draining.
Known limitations:
- Fleet consistency during rollback: Because kata-deploy uses a DaemonSet
that is updated cluster-wide, nodes that pass verification are
uncordoned and can accept new workloads before all nodes are verified.
If a later node fails verification and triggers a rollback, workloads
that started on already-verified nodes continue running with the new
version's in-memory binaries while the cluster reverts to the old
version. This is generally acceptable since running VMs continue
functioning and new workloads use the rolled-back version. A future
improvement could implement a two-phase approach that cordons all nodes
upfront and only uncordons after all verifications pass.
The chart requires Argo Workflows v3.4+ and uses multi-arch container
images supporting amd64, arm64, s390x, and ppc64le.
Usage:
# Install kata-lifecycle-manager with verification pod (required)
helm install kata-lifecycle-manager ./kata-lifecycle-manager \
  --set-file defaults.verificationPod=./my-verification-pod.yaml
# Label nodes for upgrade
kubectl label node worker-1 katacontainers.io/kata-lifecycle-manager-window=true
# Trigger upgrade
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
  -p target-version=3.25.0 \
  -p node-selector="katacontainers.io/kata-lifecycle-manager-window=true" \
  -p helm-namespace=kata-system
# Monitor progress
argo watch @latest -n argo
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add a Dockerfile and GitHub Actions workflow to build and publish
a multi-arch helm container image to quay.io/kata-containers/helm.
The image is based on quay.io/kata-containers/kubectl and adds:
- helm (latest stable version)
The image supports the following architectures:
- linux/amd64
- linux/arm64
- linux/s390x
- linux/ppc64le
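A manual multi-arch build of the same image could look roughly like the
following (the CI workflow drives buildx with these platforms):

docker buildx build \
  --platform linux/amd64,linux/arm64,linux/s390x,linux/ppc64le \
  -t quay.io/kata-containers/helm:latest \
  --push .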
The workflow runs:
- Weekly (every Sunday at 12:00 UTC, 12 hours after kubectl image)
- On manual trigger
- When the Dockerfile or workflow changes
Image tags:
- latest
- Date-based (YYYYMMDD)
- Helm version (e.g., v3.17.0)
- Git SHA
This image is used by the kata-lifecycle-manager Helm chart for
orchestrating kata-deploy upgrades via Argo Workflows.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Convert the NGC_API_KEY from a regular Kubernetes secret to a sealed
secret for the CC GPU tests. This ensures the API key is only accessible
within the confidential enclave after successful attestation.
The sealed secret uses the "vault" type which points to a resource stored
in the Key Broker Service (KBS). The Confidential Data Hub (CDH) inside
the guest will unseal this secret by fetching it from KBS after
attestation.
The initdata file is created AFTER create_tmp_policy_settings_dir()
copies the empty default file, and BEFORE auto_generate_policy() runs.
This allows genpolicy to add the generated policy.rego to our custom
CDH configuration.
The sealed secret format follows the CoCo specification:
sealed.<JWS header>.<JWS payload>.<signature>
Where the payload contains:
- version: "0.1.0"
- type: "vault" (pointer to KBS resource)
- provider: "kbs"
- resource_uri: KBS path to the actual secret
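Sketched in shell, a vault-type sealed secret string can be assembled
like this (the resource_uri is a hypothetical KBS path, and the JWS
header/signature segments are shown as opaque placeholders):

payload='{"version":"0.1.0","type":"vault","provider":"kbs","resource_uri":"kbs:///default/ngc/api-key"}'
payload_b64=$(printf '%s' "${payload}" | base64 -w0 | tr '+/' '-_' | tr -d '=')
# Only the (base64url-encoded) payload carries meaning for a vault secret.
kubectl create secret generic ngc-api-key \
  --from-literal=NGC_API_KEY="sealed.fakejwsheader.${payload_b64}.fakesignature"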
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Increase the sleep time after kata-deploy deployment from 10s to 60s
to give more time for runtimes to be configured. This helps avoid
race conditions on slower K8s distributions like k3s where the
RuntimeClass may not be immediately available after the DaemonSet
rollout completes.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Merge the two E2E tests ("Custom RuntimeClass exists with correct
properties" and "Custom runtime can run a pod") into a single test, as
the two are strongly dependent on each other.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Replace fail() calls with die() which is already provided by
common.bash. The fail() function doesn't exist in the test
infrastructure, causing "command not found" errors when tests fail.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
We cannot overwrite a binary that's currently in use, which is why
elsewhere we first remove / unlink the binary (the running process keeps
its file descriptor, so that is safe) and only then copy the new binary
into place. However, we missed doing this for the nydus-snapshotter
deployment.
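In shell terms the fix boils down to the pattern below (paths are
illustrative):

# Unlink first: the running snapshotter keeps its open file descriptor,
# so removing the path is safe; only then copy the new binary into place.
rm -f /usr/local/bin/containerd-nydus-grpc
cp /opt/kata-artifacts/opt/kata/bin/containerd-nydus-grpc /usr/local/bin/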
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Clean up trailing whitespace, making life easier for those who have
configured their IDE to remove it automatically.
Also, suggest not adding new code with trailing whitespace.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Add support for CRI-O annotations when fetching pod identifiers for
device cold plug. The code now checks containerd CRI annotations first,
then falls back to CRI-O annotations if they are empty.
This enables device cold plug to work with both containerd and CRI-O
container runtimes.
Annotations supported:
- containerd: io.kubernetes.cri.sandbox-name, io.kubernetes.cri.sandbox-namespace
- CRI-O: io.kubernetes.cri-o.KubeName, io.kubernetes.cri-o.Namespace
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
Clean up existing nydus-snapshotter state to ensure fresh start with new
version.
This is safe across all K8s distributions (k3s, rke2, k0s, microk8s,
etc.) because we only touch the nydus data directory, not containerd's
internals.
When containerd tries to use non-existent snapshots, it will
re-pull/re-unpack.
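A hedged sketch of the cleanup (the directory below is the snapshotter's
usual default root; the actual path depends on how nydus-snapshotter is
configured):

# Wipe only the snapshotter's own state; containerd's content store and
# metadata are untouched, so missing snapshots are simply re-pulled.
rm -rf /var/lib/containerd-nydus/*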
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
As we have moved to use QEMU (and OVMF already earlier) from
kata-deploy, the custom tdx configurations and distro checks
are no longer needed.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Currently, a working TDX setup expects users to install special
TDX support builds from Canonical/CentOS virt-sig for TDX to
work. kata-deploy configured TDX runtime handler to use QEMU
from the distro's paths.
With TDX support now being available in upstream Linux and
Ubuntu 24.04 having an install candidate (linux-image-generic-6.17)
for a new enough kernel, move TDX configuration to use QEMU from
kata-deploy.
While this is the new default, going back to the original
setup is possible by making manual changes to TDX runtime handlers.
Note: runtime-rs is already using QEMUPATH for TDX.
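For example, pointing the TDX handler back at a distro-provided QEMU
could look like the following (binary and config paths are illustrative
and distro-specific):

sed -i 's|^path = .*|path = "/usr/bin/qemu-system-x86_64"|' \
  /opt/kata/share/defaults/kata-containers/configuration-qemu-tdx.toml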
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Allow the updateStrategy to be configured for the kata-deploy Helm
chart, enabling administrators to control how aggressively updates roll
out. For a less aggressive approach, the strategy can be set to
`OnDelete`. Alternatively, the update process can be made more
aggressive by adjusting the `maxUnavailable` parameter.
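For example (the exact value paths live in the chart's values.yaml; the
fields themselves mirror the standard DaemonSet updateStrategy):

# Less aggressive: pods are only replaced when deleted manually.
helm upgrade kata-deploy kata-containers/kata-deploy \
  --reuse-values \
  --set updateStrategy.type=OnDelete
# More aggressive: let several DaemonSet pods roll at once.
helm upgrade kata-deploy kata-containers/kata-deploy \
  --reuse-values \
  --set updateStrategy.rollingUpdate.maxUnavailable=3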
Signed-off-by: Nikolaj Lindberg Lerche <nlle@ambu.com>
Avoid redundant and confusing teardown_common() debug output for
k8s-policy-pod.bats and k8s-policy-pvc.bats.
The Policy tests skip the Message field when printing information about
their pods because, unfortunately, that field might contain a truncated
Policy log for the test cases that intentionally cause Policy failures.
The non-truncated Policy log is already available from other
"kubectl describe" fields.
So, avoid the redundant pod information from teardown_common(), which
also includes the confusing Message field.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Delete the pause_bundle directory before running the umoci unpack
operation. This will make builds idempotent and not fail with
errors like "create runtime bundle: config.json already exists in
.../build/pause-image/destdir/pause_bundle". This will make life
better when building locally.
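The change amounts to something like this right before the unpack
(variable names are illustrative):

# Remove any bundle left over from a previous run so umoci can unpack cleanly.
rm -rf "${destdir}/pause_bundle"
umoci unpack --rootless --image "${pause_image}:${pause_tag}" "${destdir}/pause_bundle"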
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Update Go from 1.24.11 to 1.24.12 to address security vulnerabilities
in the standard library:
- GO-2026-4342: Excessive CPU consumption in archive/zip
- GO-2026-4341: Memory exhaustion in net/url query parsing
- GO-2026-4340: TLS handshake encryption level issue in crypto/tls
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
1. Add disable_block_device_use to CLH settings file, for parity with
the already existing QEMU settings.
2. Set DEFDISABLEBLOCK := true by default for both QEMU and CLH. After
this change, Kata Guests will use by default virtio-fs to access
container rootfs directories from their Hosts. Hosts that were
designed to use Host block devices attached to the Guests can
re-enable these rootfs block devices by changing the value of
disable_block_device_use back to false in their settings files (see
the sketch after this list).
3. Add test using container image without any rootfs layers. Depending
on the container runtime and image snapshotter being used, the empty
container rootfs image might get stored on a host block device that
cannot be safely hotplugged to a guest VM, because the host is using
the same block device.
4. Add block device hotplug safety warning into the Kata Shim
configuration files.
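A possible way to flip that setting back on a kata-deploy-managed node
(file path shown for CLH; adjust for the shim actually in use):

sed -i 's/^disable_block_device_use = true/disable_block_device_use = false/' \
  /opt/kata/share/defaults/kata-containers/configuration-clh.toml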
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Cameron McDermott <cameron@northflank.com>
Remove the initrd function and add the image function, to align with
the functions that actually exist in this file.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Confidential guests cannot use traditional IOMMU Group based VFIO.
Instead, they need to use IOMMUFD. This is mainly because the group
abstraction is incompatible with a confidential device model.
If traditional VFIO is specified for a confidential guest, detect
the error and bail out early.
Fixes: #12393
Signed-off-by: Joji Mekkattuparamban <jojim@nvidia.com>
In CI we are testing the latest kata-deploy, which requires the latest
Helm chart. The previous query doesn't work anymore, but these days we
should be able to rely on the "0.0.0-dev" tag and on helm printing the
to-be-installed version to the console.
Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
I keep struggling to find the debug images, so let's include them in
the peer-pods-azure.sh script to make them easier to find.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
This comment was first introduced in e111093 with secure_join()
but then we forgot to remove it when we switched to the safe-path
lib in c0ceaf6
Signed-off-by: Qingyuan Hou <lenohou@gmail.com>
We want to enable local and remote CUDA repository builds. Move the
CUDA and tools repos to versions.yaml, with a unified build for both
types.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Fix empty string handling in format conversion
When HELM_ALLOWED_HYPERVISOR_ANNOTATIONS, HELM_AGENT_HTTPS_PROXY, or
HELM_AGENT_NO_PROXY are empty, the pattern matching condition
`!= *:*` or `!= *=*` evaluates to true, causing the conversion loop
to create invalid entries like "qemu-tdx: qemu-snp:".
Add -n checks to ensure conversion only runs when variables are
non-empty.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Update the CI and functional test helpers to use the new
shims.disableAll option instead of iterating over every shim
to disable them individually.
Also add the node-feature-discovery helm repo before building
dependencies, to fix CI failures on some distributions.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Update the Helm chart README to document the new shims.disableAll
option and simplify the examples that previously required listing
every shim to disable.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Simplify the example values files by using the new shims.disableAll
option instead of listing every shim to disable.
Before (try-kata-nvidia-gpu.values.yaml):
shims:
  clh:
    enabled: false
  cloud-hypervisor:
    enabled: false
  # ... 15 more lines ...
After:
shims:
  disableAll: true
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add a new `shims.disableAll` option that disables all standard shims
at once. This is useful when:
- Enabling only specific shims without listing every other shim
- Using custom runtimes only mode (no standard Kata shims)
Usage:
shims:
  disableAll: true
  qemu:
    enabled: true  # Only qemu is enabled
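The command-line equivalent (chart reference illustrative):

helm install kata-deploy kata-containers/kata-deploy \
  --set shims.disableAll=true \
  --set shims.qemu.enabled=true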
All helper templates are updated to check for this flag before
iterating over shims.
One important subtlety is that Helm recursively merges user values with
chart defaults, making a simple `disableAll` flag problematic: if the
defaults had `enabled: true`, a user's `disableAll: true` would get
merged with those defaults, leaving all shims enabled.
The workaround found is to use null (`~`) as the default for `enabled`
field. The template logic interprets null differently based on
disableAll:
| enabled value | disableAll: false | disableAll: true |
|---------------|-------------------|------------------|
| ~ (null) | Enabled | Disabled |
| true | Enabled | Enabled |
| false | Disabled | Disabled |
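The decision rule from the table, spelled out as a tiny shell function
(the chart implements this in its Helm template helpers, not in shell):

shim_is_enabled() {
    local enabled="$1" disable_all="$2"   # enabled is "true", "false", or "" (null / ~)
    case "${enabled}" in
        true)  return 0 ;;                         # explicit enable always wins
        false) return 1 ;;                         # explicit disable always wins
        *)     [ "${disable_all}" != "true" ] ;;   # null: follow disableAll
    esac
}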
This is backward compatible:
- Default behavior unchanged: all shims enabled when disableAll: false
- Users can set `disableAll: true` to disable all, then explicitly
enable specific shims with `enabled: true`
- Explicit `enabled: false` always disables, regardless of disableAll
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add Bats tests to verify the custom runtimes Helm template rendering,
and that we can start a pod with the custom runtime.
Tests were written with Cursor's help.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>