mirror of
https://github.com/kata-containers/kata-containers.git
synced 2026-03-17 18:22:14 +00:00
Compare commits
28 Commits
3.26.0
...
topic/kata
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
622b912369 | ||
|
|
62fef5a5e4 | ||
|
|
2e9ed9aa4c | ||
|
|
e8a896aaa2 | ||
|
|
e430b2641c | ||
|
|
e257430976 | ||
|
|
dda1b30c34 | ||
|
|
c9061f9e36 | ||
|
|
0fb2c500fd | ||
|
|
fef93f1e08 | ||
|
|
f90c12d4df | ||
|
|
30c7325e75 | ||
|
|
30494abe48 | ||
|
|
8a449d358f | ||
|
|
6bb77a2f13 | ||
|
|
6702b48858 | ||
|
|
0530a3494f | ||
|
|
93dcaee965 | ||
|
|
62ad0814c5 | ||
|
|
870630c421 | ||
|
|
927be7b8ad | ||
|
|
6e98df2bac | ||
|
|
d7ff54769c | ||
|
|
4d860dcaf5 | ||
|
|
dc8d9e056d | ||
|
|
8b0c199f43 | ||
|
|
4d1095e653 | ||
|
|
6438fe7f2d |
75
.github/workflows/build-helm-image.yaml
vendored
Normal file
75
.github/workflows/build-helm-image.yaml
vendored
Normal file
@@ -0,0 +1,75 @@
|
||||
name: Build helm multi-arch image
|
||||
|
||||
on:
|
||||
schedule:
|
||||
# Run every Sunday at 12:00 UTC (12 hours after kubectl image build)
|
||||
- cron: '0 12 * * 0'
|
||||
workflow_dispatch:
|
||||
# Allow manual triggering
|
||||
push:
|
||||
branches:
|
||||
- main
|
||||
paths:
|
||||
- 'tools/packaging/helm/Dockerfile'
|
||||
- '.github/workflows/build-helm-image.yaml'
|
||||
|
||||
permissions: {}
|
||||
|
||||
env:
|
||||
REGISTRY: quay.io
|
||||
IMAGE_NAME: kata-containers/helm
|
||||
|
||||
jobs:
|
||||
build-and-push:
|
||||
name: Build and push multi-arch image
|
||||
runs-on: ubuntu-24.04
|
||||
permissions:
|
||||
contents: read
|
||||
packages: write
|
||||
steps:
|
||||
- name: Checkout repository
|
||||
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
|
||||
with:
|
||||
persist-credentials: false
|
||||
|
||||
- name: Set up QEMU
|
||||
uses: docker/setup-qemu-action@29109295f81e9208d7d86ff1c6c12d2833863392 # v3.6.0
|
||||
|
||||
- name: Set up Docker Buildx
|
||||
uses: docker/setup-buildx-action@b5ca514318bd6ebac0fb2aedd5d36ec1b5c232a2 # v3.10.0
|
||||
|
||||
- name: Login to Quay.io
|
||||
uses: docker/login-action@74a5d142397b4f367a81961eba4e8cd7edddf772 # v3.4.0
|
||||
with:
|
||||
registry: ${{ env.REGISTRY }}
|
||||
username: ${{ vars.QUAY_DEPLOYER_USERNAME }}
|
||||
password: ${{ secrets.QUAY_DEPLOYER_PASSWORD }}
|
||||
|
||||
- name: Get helm version
|
||||
id: helm-version
|
||||
run: |
|
||||
HELM_VERSION=$(curl -s https://api.github.com/repos/helm/helm/releases/latest | grep '"tag_name"' | sed -E 's/.*"([^"]+)".*/\1/')
|
||||
echo "version=${HELM_VERSION}" >> "$GITHUB_OUTPUT"
|
||||
|
||||
- name: Generate image metadata
|
||||
id: meta
|
||||
uses: docker/metadata-action@902fa8ec7d6ecbf8d84d538b9b233a880e428804 # v5.7.0
|
||||
with:
|
||||
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
|
||||
tags: |
|
||||
type=raw,value=latest
|
||||
type=raw,value={{date 'YYYYMMDD'}}
|
||||
type=raw,value=${{ steps.helm-version.outputs.version }}
|
||||
type=sha,prefix=
|
||||
|
||||
- name: Build and push multi-arch image
|
||||
uses: docker/build-push-action@ca052bb54ab0790a636c9b5f226502c73d547a25 # v5.4.0
|
||||
with:
|
||||
context: tools/packaging/helm/
|
||||
file: tools/packaging/helm/Dockerfile
|
||||
platforms: linux/amd64,linux/arm64,linux/s390x,linux/ppc64le
|
||||
push: true
|
||||
tags: ${{ steps.meta.outputs.tags }}
|
||||
labels: ${{ steps.meta.outputs.labels }}
|
||||
cache-from: type=gha
|
||||
cache-to: type=gha,mode=max
|
||||
6
.github/workflows/release.yaml
vendored
6
.github/workflows/release.yaml
vendored
@@ -284,11 +284,15 @@ jobs:
|
||||
echo "${QUAY_DEPLOYER_PASSWORD}" | helm registry login quay.io --username "${QUAY_DEPLOYER_USERNAME}" --password-stdin
|
||||
echo "${GITHUB_TOKEN}" | helm registry login ghcr.io --username "${GITHUB_ACTOR}" --password-stdin
|
||||
|
||||
- name: Push helm chart to the OCI registries
|
||||
- name: Push helm charts to the OCI registries
|
||||
run: |
|
||||
release_version=$(./tools/packaging/release/release.sh release-version)
|
||||
# Push kata-deploy chart
|
||||
helm push "kata-deploy-${release_version}.tgz" oci://quay.io/kata-containers/kata-deploy-charts
|
||||
helm push "kata-deploy-${release_version}.tgz" oci://ghcr.io/kata-containers/kata-deploy-charts
|
||||
# Push kata-lifecycle-manager chart
|
||||
helm push "kata-lifecycle-manager-${release_version}.tgz" oci://quay.io/kata-containers/kata-deploy-charts
|
||||
helm push "kata-lifecycle-manager-${release_version}.tgz" oci://ghcr.io/kata-containers/kata-deploy-charts
|
||||
|
||||
publish-release:
|
||||
name: publish-release
|
||||
|
||||
2
.github/workflows/zizmor.yaml
vendored
2
.github/workflows/zizmor.yaml
vendored
@@ -21,7 +21,7 @@ jobs:
|
||||
persist-credentials: false
|
||||
|
||||
- name: Run zizmor
|
||||
uses: zizmorcore/zizmor-action@e673c3917a1aef3c65c972347ed84ccd013ecda4 # v0.2.0
|
||||
uses: zizmorcore/zizmor-action@135698455da5c3b3e55f73f4419e481ab68cdd95 # v0.4.1
|
||||
with:
|
||||
advanced-security: false
|
||||
annotations: true
|
||||
|
||||
118
docs/Kata-Containers-Lifecycle-Management.md
Normal file
118
docs/Kata-Containers-Lifecycle-Management.md
Normal file
@@ -0,0 +1,118 @@
|
||||
# Kata Containers Lifecycle Management
|
||||
|
||||
## Overview
|
||||
|
||||
Kata Containers lifecycle management in Kubernetes consists of two operations:
|
||||
|
||||
1. **Installation** - Deploy Kata Containers to cluster nodes
|
||||
2. **Upgrades** - Update Kata Containers to newer versions without disrupting workloads
|
||||
|
||||
The Kata Containers project provides two Helm charts to address these needs:
|
||||
|
||||
| Chart | Purpose |
|
||||
|-------|---------|
|
||||
| `kata-deploy` | Initial installation and configuration |
|
||||
| `kata-lifecycle-manager` | Orchestrated rolling upgrades with verification |
|
||||
|
||||
---
|
||||
|
||||
## Installation with kata-deploy
|
||||
|
||||
The `kata-deploy` Helm chart installs Kata Containers across all (or selected) nodes using a Kubernetes DaemonSet. When deployed, it:
|
||||
|
||||
- Installs Kata runtime binaries on each node
|
||||
- Configures the container runtime (containerd) to use Kata
|
||||
- Registers RuntimeClasses (`kata-qemu-nvidia-gpu-snp`, `kata-qemu-nvidia-gpu-tdx`, `kata-qemu-nvidia-gpu`, etc.)
|
||||
|
||||
After installation, workloads can use Kata isolation by specifying `runtimeClassName: kata-qemu-nvidia-gpu-snp` (or another Kata RuntimeClass) in their pod spec.
|
||||
|
||||
---
|
||||
|
||||
## Upgrades with kata-lifecycle-manager
|
||||
|
||||
### The Problem
|
||||
|
||||
Standard `helm upgrade kata-deploy` updates all nodes simultaneously via the DaemonSet. This approach:
|
||||
|
||||
- Provides no per-node verification
|
||||
- Offers no controlled rollback mechanism
|
||||
- Can leave the cluster in an inconsistent state if something fails
|
||||
|
||||
### The Solution
|
||||
|
||||
The `kata-lifecycle-manager` Helm chart uses Argo Workflows to orchestrate upgrades with the following guarantees:
|
||||
|
||||
| Guarantee | Description |
|
||||
|-----------|-------------|
|
||||
| **Sequential Processing** | Nodes are upgraded one at a time |
|
||||
| **Per-Node Verification** | A user-provided pod validates Kata functionality after each node upgrade |
|
||||
| **Fail-Fast** | If verification fails, the workflow stops immediately |
|
||||
| **Automatic Rollback** | On failure, Helm rollback is executed and the node is restored |
|
||||
|
||||
### Upgrade Flow
|
||||
|
||||
For each node in the cluster:
|
||||
|
||||
1. **Cordon** - Mark node as unschedulable
|
||||
2. **Drain** (optional) - Evict existing workloads
|
||||
3. **Upgrade** - Run `helm upgrade kata-deploy` targeting this node
|
||||
4. **Wait** - Ensure kata-deploy DaemonSet pod is ready
|
||||
5. **Verify** - Run verification pod to confirm Kata works
|
||||
6. **Uncordon** - Mark node as schedulable again
|
||||
|
||||
If verification fails on any node, the workflow:
|
||||
- Rolls back the Helm release
|
||||
- Uncordons the node
|
||||
- Stops processing (remaining nodes are not upgraded)
|
||||
|
||||
### Verification Pod
|
||||
|
||||
Users must provide a verification pod that tests Kata functionality. This pod:
|
||||
|
||||
- Uses a Kata RuntimeClass
|
||||
- Is scheduled on the specific node being verified
|
||||
- Runs whatever validation logic the user requires (smoke tests, attestation checks, etc.)
|
||||
|
||||
**Basic GPU Verification Example:**
|
||||
|
||||
For clusters with NVIDIA GPUs, the CUDA VectorAdd sample provides a more comprehensive verification:
|
||||
|
||||
```yaml
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: ${TEST_POD}
|
||||
spec:
|
||||
runtimeClassName: kata-qemu-nvidia-gpu-snp # or kata-qemu-nvidia-gpu-tdx
|
||||
restartPolicy: Never
|
||||
nodeSelector:
|
||||
kubernetes.io/hostname: ${NODE}
|
||||
containers:
|
||||
- name: cuda-vectoradd
|
||||
image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04
|
||||
resources:
|
||||
limits:
|
||||
nvidia.com/pgpu: "1"
|
||||
memory: 16Gi
|
||||
```
|
||||
|
||||
This verifies that GPU passthrough works correctly with the upgraded Kata runtime.
|
||||
|
||||
The placeholders `${NODE}` and `${TEST_POD}` are substituted at runtime.
|
||||
|
||||
---
|
||||
|
||||
## Demo Recordings
|
||||
|
||||
| Demo | Description | Link |
|
||||
|------|-------------|------|
|
||||
| Sunny Path | Successful upgrade from 3.24.0 to 3.25.0 | [TODO] |
|
||||
| Rainy Path | Failed verification triggers rollback | [TODO] |
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- [kata-deploy Helm Chart](../tools/packaging/kata-deploy/helm-chart/README.md)
|
||||
- [kata-lifecycle-manager Helm Chart](../tools/packaging/kata-deploy/helm-chart/kata-lifecycle-manager/README.md)
|
||||
- [kata-lifecycle-manager Design Document](design/kata-lifecycle-manager-design.md)
|
||||
@@ -28,13 +28,15 @@ Bug fixes are released as part of `MINOR` or `MAJOR` releases only. `PATCH` is a
|
||||
|
||||
## Release Process
|
||||
|
||||
### Bump the `VERSION` and `Chart.yaml` file
|
||||
### Bump the `VERSION` and `Chart.yaml` files
|
||||
|
||||
When the `kata-containers/kata-containers` repository is ready for a new release,
|
||||
first create a PR to set the release in the [`VERSION`](./../VERSION) file and update the
|
||||
`version` and `appVersion` in the
|
||||
[`Chart.yaml`](./../tools/packaging/kata-deploy/helm-chart/kata-deploy/Chart.yaml) file and
|
||||
have it merged.
|
||||
`version` and `appVersion` in the following `Chart.yaml` files:
|
||||
- [`kata-deploy/Chart.yaml`](./../tools/packaging/kata-deploy/helm-chart/kata-deploy/Chart.yaml)
|
||||
- [`kata-lifecycle-manager/Chart.yaml`](./../tools/packaging/kata-deploy/helm-chart/kata-lifecycle-manager/Chart.yaml)
|
||||
|
||||
Have the PR merged before proceeding.
|
||||
|
||||
### Lock the `main` branch
|
||||
|
||||
|
||||
@@ -19,6 +19,7 @@ Kata Containers design documents:
|
||||
- [Design for direct-assigned volume](direct-blk-device-assignment.md)
|
||||
- [Design for core-scheduling](core-scheduling.md)
|
||||
- [Virtualization Reference Architecture](kata-vra.md)
|
||||
- [Design for kata-lifecycle-manager Helm chart](kata-lifecycle-manager-design.md)
|
||||
---
|
||||
|
||||
- [Design proposals](proposals)
|
||||
|
||||
502
docs/design/kata-lifecycle-manager-design.md
Normal file
502
docs/design/kata-lifecycle-manager-design.md
Normal file
@@ -0,0 +1,502 @@
|
||||
# Kata Containers Lifecycle Manager Design
|
||||
|
||||
## Summary
|
||||
|
||||
This document proposes a Helm chart-based orchestration solution for Kata Containers that
|
||||
enables controlled, node-by-node upgrades with verification and rollback capabilities
|
||||
using Argo Workflows.
|
||||
|
||||
## Motivation
|
||||
|
||||
### Problem Statement
|
||||
|
||||
Upgrading Kata Containers in a production Kubernetes cluster presents several challenges:
|
||||
|
||||
1. **Workload Scheduling Control**: New Kata workloads should not be scheduled on a node
|
||||
during upgrade until the new runtime is verified.
|
||||
|
||||
2. **Verification Gap**: There is no standardized way to verify that Kata is working correctly
|
||||
after an upgrade before allowing workloads to return to the node. This solution addresses
|
||||
the gap by running a user-provided verification pod on each upgraded node.
|
||||
|
||||
3. **Rollback Complexity**: If an upgrade fails, administrators must manually coordinate
|
||||
rollback across multiple nodes.
|
||||
|
||||
4. **Controlled Rollout**: Operators need the ability to upgrade nodes incrementally
|
||||
(canary approach) with fail-fast behavior if any node fails verification.
|
||||
|
||||
5. **Multi-Architecture Support**: The upgrade tooling must work across all architectures
|
||||
supported by Kata Containers (amd64, arm64, s390x, ppc64le).
|
||||
|
||||
### Current State
|
||||
|
||||
The `kata-deploy` Helm chart provides installation and configuration of Kata Containers,
|
||||
including a post-install verification job. However, there is no built-in mechanism for
|
||||
orchestrating upgrades across nodes in a controlled manner.
|
||||
|
||||
## Goals
|
||||
|
||||
1. Provide a standardized, automated way to upgrade Kata Containers node-by-node
|
||||
2. Ensure each node is verified before returning to service
|
||||
3. Support user-defined verification logic
|
||||
4. Automatically rollback if verification fails
|
||||
5. Work with the existing `kata-deploy` Helm chart
|
||||
6. Support all Kata-supported architectures
|
||||
|
||||
## Non-Goals
|
||||
|
||||
1. Initial Kata Containers installation (use kata-deploy Helm chart for that)
|
||||
2. Managing Kubernetes cluster upgrades
|
||||
3. Providing Kata-specific verification logic (this is user responsibility)
|
||||
4. Managing Argo Workflows installation
|
||||
|
||||
## Argo Workflows Dependency
|
||||
|
||||
### What Works Without Argo
|
||||
|
||||
The following components work independently of Argo Workflows:
|
||||
|
||||
| Component | Description |
|
||||
|-----------|-------------|
|
||||
| **kata-deploy Helm chart** | Full installation, configuration, `RuntimeClasses` |
|
||||
| **Post-install verification** | Helm hook runs verification pod after install |
|
||||
| **Label-gated deployment** | Progressive rollout via node labels |
|
||||
| **Manual upgrades** | User can script: cordon, helm upgrade, verify, `uncordon` |
|
||||
|
||||
Users who do not want Argo can still:
|
||||
- Install and configure Kata via kata-deploy
|
||||
- Perform upgrades manually or with custom scripts
|
||||
- Use the verification pod pattern in their own automation
|
||||
|
||||
### What Requires Argo
|
||||
|
||||
The kata-lifecycle-manager Helm chart provides orchestration via Argo Workflows:
|
||||
|
||||
| Feature | Description |
|
||||
|---------|-------------|
|
||||
| **Automated node-by-node upgrades** | Sequential processing with fail-fast |
|
||||
| **Taint-based node selection** | Select nodes by taint key/value |
|
||||
| **`WorkflowTemplate`** | Reusable upgrade workflow |
|
||||
| **Rollback entrypoint** | `argo submit --entrypoint rollback-node` |
|
||||
| **Status tracking** | Node annotations updated at each phase |
|
||||
|
||||
### For Users Already Using Argo
|
||||
|
||||
If your cluster already has Argo Workflows installed:
|
||||
|
||||
```bash
|
||||
# Install kata-lifecycle-manager - integrates with your existing Argo installation
|
||||
helm install kata-lifecycle-manager oci://ghcr.io/kata-containers/kata-deploy-charts/kata-lifecycle-manager \
|
||||
--set argoNamespace=argo \
|
||||
--set-file defaults.verificationPod=./verification-pod.yaml
|
||||
|
||||
# Trigger upgrades via argo CLI or integrate with existing workflows
|
||||
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager -p target-version=3.25.0
|
||||
```
|
||||
|
||||
kata-lifecycle-manager can also be triggered by other Argo workflows, CI/CD pipelines, or `GitOps`
|
||||
tools that support Argo.
|
||||
|
||||
### For Users Not Wanting Argo
|
||||
|
||||
If you prefer not to use Argo Workflows:
|
||||
|
||||
1. **Use kata-deploy directly** - handles installation and basic verification
|
||||
2. **Script your own orchestration** - example approach:
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# Manual upgrade script (no Argo required)
|
||||
set -euo pipefail
|
||||
|
||||
VERSION="3.25.0"
|
||||
|
||||
# Upgrade each node with Kata runtime
|
||||
kubectl get nodes -l katacontainers.io/kata-runtime=true -o name | while read -r node_path; do
|
||||
NODE="${node_path#node/}"
|
||||
echo "Upgrading $NODE..."
|
||||
kubectl cordon "$NODE"
|
||||
|
||||
helm upgrade kata-deploy oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy \
|
||||
--namespace kube-system \
|
||||
--version "$VERSION" \
|
||||
--reuse-values \
|
||||
--wait
|
||||
|
||||
# Wait for DaemonSet pod on this node
|
||||
kubectl rollout status daemonset/kata-deploy -n kube-system
|
||||
|
||||
# Run verification (apply your pod, wait, check exit code)
|
||||
kubectl apply -f verification-pod.yaml
|
||||
kubectl wait pod/kata-verify --for=jsonpath='{.status.phase}'=Succeeded --timeout=180s
|
||||
kubectl delete pod/kata-verify
|
||||
|
||||
kubectl uncordon "$NODE"
|
||||
echo "$NODE upgraded successfully"
|
||||
done
|
||||
```
|
||||
|
||||
This approach requires more manual effort but avoids the Argo dependency.
|
||||
|
||||
## Proposed Design
|
||||
|
||||
### Architecture Overview
|
||||
|
||||
```text
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ Argo Workflows Controller │
|
||||
│ (pre-installed) │
|
||||
└────────────────────────────┬────────────────────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────────────────────────────────────┐
|
||||
│ kata-lifecycle-manager Helm Chart │
|
||||
│ ┌────────────────────────────────────────────────────────┐ │
|
||||
│ │ WorkflowTemplate │ │
|
||||
│ │ - upgrade-all-nodes (entrypoint) │ │
|
||||
│ │ - upgrade-single-node (per-node steps) │ │
|
||||
│ │ - rollback-node (manual recovery) │ │
|
||||
│ └────────────────────────────────────────────────────────┘ │
|
||||
│ ┌────────────────────────────────────────────────────────┐ │
|
||||
│ │ RBAC Resources │ │
|
||||
│ │ - ServiceAccount │ │
|
||||
│ │ - ClusterRole (node, pod, helm operations) │ │
|
||||
│ │ - ClusterRoleBinding │ │
|
||||
│ └────────────────────────────────────────────────────────┘ │
|
||||
└──────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
▼
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ kata-deploy Helm Chart │
|
||||
│ (existing installation) │
|
||||
└─────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Upgrade Flow
|
||||
|
||||
For each node selected by the upgrade label:
|
||||
|
||||
```text
|
||||
┌────────────┐ ┌──────────────┐ ┌────────────┐ ┌────────────┐
|
||||
│ Prepare │───▶│ Cordon │───▶│ Upgrade │───▶│Wait Ready │
|
||||
│ (annotate) │ │ (mark │ │ (helm │ │(kata-deploy│
|
||||
│ │ │unschedulable)│ │ upgrade) │ │ DaemonSet) │
|
||||
└────────────┘ └──────────────┘ └────────────┘ └────────────┘
|
||||
│
|
||||
▼
|
||||
┌────────────┐ ┌──────────────┐ ┌────────────┐
|
||||
│ Complete │◀───│ Uncordon │◀───│ Verify │
|
||||
│ (annotate │ │ (mark │ │ (user pod)│
|
||||
│ version) │ │schedulable) │ │ │
|
||||
└────────────┘ └──────────────┘ └────────────┘
|
||||
```
|
||||
|
||||
**Note:** Drain is not required for Kata upgrades. Running Kata VMs continue using
|
||||
the in-memory binaries. Only new workloads use the upgraded binaries. Cordon ensures
|
||||
the verification pod runs before any new workloads are scheduled with the new runtime.
|
||||
|
||||
**Optional Drain:** For users who prefer to evict workloads before any maintenance
|
||||
operation, an optional drain step can be enabled via `drain-enabled=true`. When
|
||||
enabled, an additional drain step runs after cordon and before upgrade.
|
||||
|
||||
### Node Selection Model
|
||||
|
||||
Nodes can be selected for upgrade using **labels**, **taints**, or **both**.
|
||||
|
||||
**Label-based selection:**
|
||||
|
||||
```bash
|
||||
# Select nodes by label
|
||||
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
|
||||
-p target-version=3.25.0 \
|
||||
-p node-selector="katacontainers.io/kata-lifecycle-manager-window=true"
|
||||
```
|
||||
|
||||
**Taint-based selection:**
|
||||
|
||||
Some organizations use taints to mark nodes for maintenance. The workflow supports
|
||||
selecting nodes by taint key and optionally taint value:
|
||||
|
||||
```bash
|
||||
# Select nodes with a specific taint
|
||||
kubectl taint nodes worker-1 kata-lifecycle-manager=pending:NoSchedule
|
||||
|
||||
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
|
||||
-p target-version=3.25.0 \
|
||||
-p node-taint-key=kata-lifecycle-manager \
|
||||
-p node-taint-value=pending
|
||||
```
|
||||
|
||||
**Combined selection:**
|
||||
|
||||
Labels and taints can be used together for precise targeting:
|
||||
|
||||
```bash
|
||||
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
|
||||
-p target-version=3.25.0 \
|
||||
-p node-selector="node-pool=kata-pool" \
|
||||
-p node-taint-key=maintenance
|
||||
```
|
||||
|
||||
This allows operators to:
|
||||
1. Upgrade a single canary node first
|
||||
2. Gradually add nodes to the upgrade window
|
||||
3. Control upgrade timing via `GitOps` or automation
|
||||
4. Integrate with existing taint-based maintenance workflows
|
||||
|
||||
### Node Pool Support
|
||||
|
||||
The node selector and taint selector parameters enable basic node pool targeting:
|
||||
|
||||
```bash
|
||||
# Upgrade only nodes matching a specific node pool label
|
||||
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
|
||||
-p target-version=3.25.0 \
|
||||
-p node-selector="node-pool=kata-pool"
|
||||
```
|
||||
|
||||
**Current Capabilities:**
|
||||
|
||||
| Feature | Status | Chart | Notes |
|
||||
|---------|--------|-------|-------|
|
||||
| Label-based selection | Supported | kata-lifecycle-manager | Works with any label combination |
|
||||
| Taint-based selection | Supported | kata-lifecycle-manager | Select by taint key/value |
|
||||
| Sequential upgrades | Supported | kata-lifecycle-manager | One node at a time with fail-fast |
|
||||
| Pool-specific verification pods | Not supported | kata-lifecycle-manager | Same verification for all nodes |
|
||||
| Pool-ordered upgrades | Not supported | kata-lifecycle-manager | Upgrade pool A before pool B |
|
||||
|
||||
See the [Potential Enhancements](#potential-enhancements) section for future work.
|
||||
|
||||
### Verification Model
|
||||
|
||||
**Verification runs on each node that is upgraded.** The node is only `uncordoned` after
|
||||
its verification pod succeeds. If verification fails, automatic rollback is triggered
|
||||
to restore the previous version before `uncordoning` the node.
|
||||
|
||||
**Common failure modes detected by verification:**
|
||||
- Pod stuck in Pending/`ContainerCreating` (runtime can't start VM)
|
||||
- Pod crashes immediately (containerd/CRI-O configuration issues)
|
||||
- Pod times out (resource issues, image pull failures)
|
||||
- Pod exits with non-zero code (verification logic failed)
|
||||
|
||||
All of these trigger automatic rollback. The workflow logs include pod status, events,
|
||||
and logs to help diagnose the issue.
|
||||
|
||||
The user provides a complete Pod YAML that:
|
||||
- Uses the Kata runtime class they want to verify
|
||||
- Contains their verification logic (e.g., attestation checks)
|
||||
- Exits 0 on success, non-zero on failure
|
||||
- Includes tolerations for cordoned nodes (verification runs while node is cordoned)
|
||||
- Includes a `nodeSelector` to ensure it runs on the specific node being upgraded
|
||||
|
||||
When upgrading multiple nodes (via label selector), nodes are processed sequentially.
|
||||
For each node, the following placeholders are substituted with that node's specific values,
|
||||
ensuring the verification pod runs on the exact node that was just upgraded:
|
||||
|
||||
- `${NODE}` - The hostname of the node being upgraded/verified
|
||||
- `${TEST_POD}` - A generated unique pod name
|
||||
|
||||
Example verification pod:
|
||||
|
||||
```yaml
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: ${TEST_POD}
|
||||
spec:
|
||||
runtimeClassName: kata-qemu
|
||||
restartPolicy: Never
|
||||
nodeSelector:
|
||||
kubernetes.io/hostname: ${NODE}
|
||||
tolerations:
|
||||
- operator: Exists # Required: node is cordoned during verification
|
||||
containers:
|
||||
- name: verify
|
||||
image: quay.io/kata-containers/alpine-bash-curl:latest
|
||||
command: ["uname", "-a"]
|
||||
```
|
||||
|
||||
This design keeps verification logic entirely in the user's domain, supporting:
|
||||
- Different runtime classes (`kata-qemu`, `kata-qemu-snp`, `kata-qemu-tdx`, etc.)
|
||||
- TEE-specific attestation verification
|
||||
- GPU/accelerator validation
|
||||
- Custom application smoke tests
|
||||
|
||||
### Sequential Execution with Fail-Fast
|
||||
|
||||
Nodes are upgraded strictly sequentially using recursive Argo templates. This design
|
||||
ensures that if any node fails verification, the workflow stops immediately before
|
||||
touching remaining nodes, preventing a mixed-version fleet.
|
||||
|
||||
Alternative approaches considered:
|
||||
- **`withParam` + semaphore**: Provides cleaner UI but semaphore only controls concurrency,
|
||||
not failure propagation. Other nodes would still proceed after one fails.
|
||||
- **`withParam` + `failFast`**: Would be ideal, but Argo only supports `failFast` for DAG
|
||||
tasks, not for steps with `withParam`.
|
||||
|
||||
The recursive template approach (`upgrade-node-chain`) naturally provides fail-fast
|
||||
behavior because if any step in the chain fails, the recursion stops.
|
||||
|
||||
### Status Tracking
|
||||
|
||||
Node upgrade status is tracked via Kubernetes annotations:
|
||||
|
||||
| Annotation | Values |
|
||||
|------------|--------|
|
||||
| `katacontainers.io/kata-lifecycle-manager-status` | preparing, cordoned, draining, upgrading, verifying, completed, rolling-back, rolled-back |
|
||||
| `katacontainers.io/kata-current-version` | Version string (e.g., "3.25.0") |
|
||||
|
||||
This enables:
|
||||
- Monitoring upgrade progress via `kubectl get nodes`
|
||||
- Integration with external monitoring systems
|
||||
- Recovery from interrupted upgrades
|
||||
|
||||
### Rollback Support
|
||||
|
||||
**Automatic rollback on verification failure:** If the verification pod fails (non-zero exit),
|
||||
kata-lifecycle-manager automatically:
|
||||
1. Runs `helm rollback` to revert to the previous Helm release
|
||||
2. Waits for kata-deploy DaemonSet to be ready with the previous version
|
||||
3. `Uncordons` the node
|
||||
4. Annotates the node with `rolled-back` status
|
||||
|
||||
This ensures nodes are never left in a broken state.
|
||||
|
||||
**Manual rollback:** For cases where you need to rollback a successfully upgraded node:
|
||||
|
||||
```bash
|
||||
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
|
||||
--entrypoint rollback-node \
|
||||
-p node-name=worker-1
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
### Container Images
|
||||
|
||||
Two multi-architecture container images are built and published:
|
||||
|
||||
| Image | Purpose | Architectures |
|
||||
|-------|---------|---------------|
|
||||
| `quay.io/kata-containers/kubectl:latest` | Kubernetes operations | amd64, arm64, s390x, ppc64le |
|
||||
| `quay.io/kata-containers/helm:latest` | Helm operations | amd64, arm64, s390x, ppc64le |
|
||||
|
||||
Images are rebuilt weekly to pick up security updates and tool version upgrades.
|
||||
|
||||
### Helm Chart Structure
|
||||
|
||||
```text
|
||||
kata-lifecycle-manager/
|
||||
├── Chart.yaml # Chart metadata
|
||||
├── values.yaml # Configurable defaults
|
||||
├── README.md # Usage documentation
|
||||
└── templates/
|
||||
├── _helpers.tpl # Template helpers
|
||||
├── rbac.yaml # ServiceAccount, ClusterRole, ClusterRoleBinding
|
||||
└── workflow-template.yaml # Argo `WorkflowTemplate`
|
||||
```
|
||||
|
||||
### RBAC Requirements
|
||||
|
||||
The workflow requires the following permissions:
|
||||
|
||||
| Resource | Verbs | Purpose |
|
||||
|----------|-------|---------|
|
||||
| nodes | get, list, watch, patch | `cordon`/`uncordon`, annotations |
|
||||
| pods | get, list, watch, create, delete | Verification pods |
|
||||
| pods/log | get | Verification output |
|
||||
| `daemonsets` | get, list, watch | Wait for `kata-deploy` |
|
||||
|
||||
## User Experience
|
||||
|
||||
### Installation
|
||||
|
||||
```bash
|
||||
# Install kata-lifecycle-manager with verification config
|
||||
helm install kata-lifecycle-manager oci://ghcr.io/kata-containers/kata-deploy-charts/kata-lifecycle-manager \
|
||||
--set-file defaults.verificationPod=/path/to/verification-pod.yaml
|
||||
```
|
||||
|
||||
### Triggering an Upgrade
|
||||
|
||||
```bash
|
||||
# Label nodes for upgrade
|
||||
kubectl label node worker-1 katacontainers.io/kata-lifecycle-manager-window=true
|
||||
|
||||
# Submit upgrade workflow
|
||||
argo submit -n argo --from workflowtemplate/kata-lifecycle-manager \
|
||||
-p target-version=3.25.0
|
||||
|
||||
# Watch progress
|
||||
argo watch @latest
|
||||
```
|
||||
|
||||
### Monitoring
|
||||
|
||||
```bash
|
||||
kubectl get nodes \
|
||||
-L katacontainers.io/kata-runtime \
|
||||
-L katacontainers.io/kata-lifecycle-manager-status \
|
||||
-L katacontainers.io/kata-current-version
|
||||
```
|
||||
|
||||
## Security Considerations
|
||||
|
||||
1. **Namespace-Scoped Templates**: The chart creates a `WorkflowTemplate` (namespace-scoped)
|
||||
rather than `ClusterWorkflowTemplate` by default, reducing blast radius.
|
||||
|
||||
2. **Required Verification**: The chart fails to install if `defaults.verificationPod` is
|
||||
not provided, ensuring upgrades are always verified.
|
||||
|
||||
3. **Minimal RBAC**: The `ServiceAccount` has only the permissions required for upgrade
|
||||
operations.
|
||||
|
||||
4. **User-Controlled Verification**: Verification logic is entirely user-defined, avoiding
|
||||
any hardcoded assumptions about what "working" means.
|
||||
|
||||
## Integration with Release Process
|
||||
|
||||
The `kata-lifecycle-manager` chart is:
|
||||
- Packaged alongside `kata-deploy` during releases
|
||||
- Published to the same OCI registries (`quay.io`, `ghcr.io`)
|
||||
- Versioned to match `kata-deploy`
|
||||
|
||||
## Potential Enhancements
|
||||
|
||||
The following enhancements could be considered if needed:
|
||||
|
||||
### kata-lifecycle-manager
|
||||
|
||||
1. **Pool-Specific Verification**: Different verification pods for different node pools
|
||||
(e.g., GPU nodes vs. CPU-only nodes).
|
||||
|
||||
2. **Ordered Pool Upgrades**: Upgrade node pool A completely before starting pool B.
|
||||
|
||||
## Alternatives Considered
|
||||
|
||||
### 1. DaemonSet-Based Upgrades
|
||||
|
||||
Using a DaemonSet to coordinate upgrades on each node.
|
||||
|
||||
**Rejected because**: DaemonSets don't provide the node-by-node sequencing and
|
||||
verification workflow needed for controlled upgrades.
|
||||
|
||||
### 2. Operator Pattern
|
||||
|
||||
Building a Kubernetes Operator to manage upgrades.
|
||||
|
||||
**Rejected because**: Adds significant complexity and maintenance burden. Argo Workflows
|
||||
is already widely adopted and provides the orchestration primitives needed.
|
||||
|
||||
### 3. Shell Script Orchestration
|
||||
|
||||
Providing a shell script that loops through nodes.
|
||||
|
||||
**Rejected because**: Less reliable, harder to monitor, no built-in retry/recovery,
|
||||
and doesn't integrate with Kubernetes-native tooling.
|
||||
|
||||
## References
|
||||
|
||||
- [kata-deploy Helm Chart](https://github.com/kata-containers/kata-containers/tree/main/tools/packaging/kata-deploy/helm-chart/kata-deploy)
|
||||
- [Argo Workflows](https://argoproj.github.io/argo-workflows/)
|
||||
- [Helm Documentation](https://helm.sh/docs/)
|
||||
@@ -19,7 +19,7 @@ image = "@IMAGEPATH@"
|
||||
# - xfs
|
||||
# - erofs
|
||||
rootfs_type = @DEFROOTFSTYPE@
|
||||
|
||||
|
||||
# Block storage driver to be used for the VM rootfs is backed
|
||||
# by a block device.
|
||||
vm_rootfs_driver = "@VMROOTFSDRIVER_CLH@"
|
||||
@@ -41,7 +41,7 @@ valid_hypervisor_paths = @CLHVALIDHYPERVISORPATHS@
|
||||
|
||||
# List of valid annotations values for ctlpath
|
||||
# The default if not set is empty (all annotations rejected.)
|
||||
# Your distribution recommends:
|
||||
# Your distribution recommends:
|
||||
valid_ctlpaths = []
|
||||
|
||||
# Optional space-separated list of options to pass to the guest kernel.
|
||||
|
||||
@@ -23,7 +23,7 @@ image = "@IMAGEPATH@"
|
||||
# - erofs
|
||||
rootfs_type = @DEFROOTFSTYPE@
|
||||
|
||||
|
||||
|
||||
# Block storage driver to be used for the VM rootfs is backed
|
||||
# by a block device. This is virtio-blk-pci, virtio-blk-mmio or nvdimm
|
||||
vm_rootfs_driver = "@VMROOTFSDRIVER_DB@"
|
||||
@@ -41,7 +41,7 @@ valid_hypervisor_paths = @DBVALIDHYPERVISORPATHS@
|
||||
|
||||
# List of valid annotations values for ctlpath
|
||||
# The default if not set is empty (all annotations rejected.)
|
||||
# Your distribution recommends:
|
||||
# Your distribution recommends:
|
||||
valid_ctlpaths = []
|
||||
|
||||
# Optional space-separated list of options to pass to the guest kernel.
|
||||
|
||||
@@ -373,16 +373,16 @@ disable_image_nvdimm = false
|
||||
# Default false
|
||||
hotplug_vfio_on_root_bus = false
|
||||
|
||||
# Enable hot-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# Enable hot-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# The default setting is "no-port"
|
||||
hot_plug_vfio = "no-port"
|
||||
|
||||
# In a confidential compute environment hot-plugging can compromise
|
||||
# security.
|
||||
# Enable cold-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# The default setting is "no-port", which means disabled.
|
||||
# security.
|
||||
# Enable cold-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# The default setting is "no-port", which means disabled.
|
||||
cold_plug_vfio = "no-port"
|
||||
|
||||
# Before hot plugging a PCIe device, you need to add a pcie_root_port device.
|
||||
|
||||
@@ -767,4 +767,4 @@ dan_conf = "@DEFDANCONF@"
|
||||
# to non-k8s cases)
|
||||
# cold_plug_vfio != no_port AND pod_resource_api_sock != "" => kubelet
|
||||
# based cold plug.
|
||||
pod_resource_api_sock = "@DEFPODRESOURCEAPISOCK@"
|
||||
pod_resource_api_sock = "@DEFPODRESOURCEAPISOCK@"
|
||||
|
||||
@@ -39,7 +39,7 @@ vm_rootfs_driver = "virtio-blk-pci"
|
||||
#
|
||||
# Known limitations:
|
||||
# * Does not work by design:
|
||||
# - CPU Hotplug
|
||||
# - CPU Hotplug
|
||||
# - Memory Hotplug
|
||||
# - NVDIMM devices
|
||||
#
|
||||
|
||||
@@ -304,7 +304,7 @@ debug_console_enabled = false
|
||||
|
||||
# Agent connection dialing timeout value in seconds
|
||||
# (default: 45)
|
||||
dial_timeout = 45
|
||||
dial_timeout = 45
|
||||
|
||||
# Confidential Data Hub API timeout value in seconds
|
||||
# (default: 50)
|
||||
|
||||
@@ -174,10 +174,6 @@ HYPERVISORS := $(HYPERVISOR_FC) $(HYPERVISOR_QEMU) $(HYPERVISOR_CLH) $(HYPERVISO
|
||||
QEMUPATH := $(QEMUBINDIR)/$(QEMUCMD)
|
||||
QEMUVALIDHYPERVISORPATHS := [\"$(QEMUPATH)\"]
|
||||
|
||||
#QEMUTDXPATH := $(QEMUBINDIR)/$(QEMUTDXCMD)
|
||||
QEMUTDXPATH := PLACEHOLDER_FOR_DISTRO_QEMU_WITH_TDX_SUPPORT
|
||||
QEMUTDXVALIDHYPERVISORPATHS := [\"$(QEMUTDXPATH)\"]
|
||||
|
||||
QEMUTDXEXPERIMENTALPATH := $(QEMUBINDIR)/$(QEMUTDXEXPERIMENTALCMD)
|
||||
QEMUTDXEXPERIMENTALVALIDHYPERVISORPATHS := [\"$(QEMUTDXEXPERIMENTALPATH)\"]
|
||||
|
||||
@@ -702,18 +698,15 @@ USER_VARS += PROJECT_TYPE
|
||||
USER_VARS += PROJECT_URL
|
||||
USER_VARS += QEMUBINDIR
|
||||
USER_VARS += QEMUCMD
|
||||
USER_VARS += QEMUTDXCMD
|
||||
USER_VARS += QEMUTDXEXPERIMENTALCMD
|
||||
USER_VARS += QEMUCCAEXPERIMENTALCMD
|
||||
USER_VARS += QEMUSNPCMD
|
||||
USER_VARS += QEMUPATH
|
||||
USER_VARS += QEMUTDXPATH
|
||||
USER_VARS += QEMUTDXEXPERIMENTALPATH
|
||||
USER_VARS += QEMUTDXQUOTEGENERATIONSERVICESOCKETPORT
|
||||
USER_VARS += QEMUSNPPATH
|
||||
USER_VARS += QEMUCCAEXPERIMENTALPATH
|
||||
USER_VARS += QEMUVALIDHYPERVISORPATHS
|
||||
USER_VARS += QEMUTDXVALIDHYPERVISORPATHS
|
||||
USER_VARS += QEMUTDXEXPERIMENTALVALIDHYPERVISORPATHS
|
||||
USER_VARS += QEMUCCAVALIDHYPERVISORPATHS
|
||||
USER_VARS += QEMUCCAEXPERIMENTALVALIDHYPERVISORPATHS
|
||||
|
||||
@@ -251,9 +251,9 @@ guest_hook_path = ""
|
||||
# and we strongly advise users to refer the Cloud Hypervisor official
|
||||
# documentation for a better understanding of its internals:
|
||||
# https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/docs/io_throttling.md
|
||||
#
|
||||
#
|
||||
# Bandwidth rate limiter options
|
||||
#
|
||||
#
|
||||
# net_rate_limiter_bw_max_rate controls network I/O bandwidth (size in bits/sec
|
||||
# for SB/VM).
|
||||
# The same value is used for inbound and outbound bandwidth.
|
||||
@@ -287,9 +287,9 @@ net_rate_limiter_ops_one_time_burst = 0
|
||||
# and we strongly advise users to refer the Cloud Hypervisor official
|
||||
# documentation for a better understanding of its internals:
|
||||
# https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/docs/io_throttling.md
|
||||
#
|
||||
#
|
||||
# Bandwidth rate limiter options
|
||||
#
|
||||
#
|
||||
# disk_rate_limiter_bw_max_rate controls disk I/O bandwidth (size in bits/sec
|
||||
# for SB/VM).
|
||||
# The same value is used for inbound and outbound bandwidth.
|
||||
@@ -476,9 +476,9 @@ enable_pprof = false
|
||||
|
||||
# Indicates the CreateContainer request timeout needed for the workload(s)
|
||||
# It using guest_pull this includes the time to pull the image inside the guest
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.
|
||||
create_container_timeout = @DEFCREATECONTAINERTIMEOUT@
|
||||
|
||||
|
||||
@@ -367,9 +367,9 @@ enable_pprof = false
|
||||
|
||||
# Indicates the CreateContainer request timeout needed for the workload(s)
|
||||
# It using guest_pull this includes the time to pull the image inside the guest
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.
|
||||
create_container_timeout = @DEFCREATECONTAINERTIMEOUT@
|
||||
|
||||
|
||||
@@ -636,9 +636,9 @@ enable_pprof = false
|
||||
|
||||
# Indicates the CreateContainer request timeout needed for the workload(s)
|
||||
# It using guest_pull this includes the time to pull the image inside the guest
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.
|
||||
create_container_timeout = @DEFCREATECONTAINERTIMEOUT@
|
||||
|
||||
|
||||
@@ -362,17 +362,17 @@ msize_9p = @DEFMSIZE9P@
|
||||
# nvdimm is not supported when `confidential_guest = true`.
|
||||
disable_image_nvdimm = @DEFDISABLEIMAGENVDIMM@
|
||||
|
||||
# Enable hot-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# Enable hot-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# The default setting is "no-port"
|
||||
hot_plug_vfio = "no-port"
|
||||
hot_plug_vfio = "no-port"
|
||||
|
||||
# In a confidential compute environment hot-plugging can compromise
|
||||
# security.
|
||||
# Enable cold-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# The default setting is "no-port", which means disabled.
|
||||
cold_plug_vfio = "no-port"
|
||||
# security.
|
||||
# Enable cold-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# The default setting is "no-port", which means disabled.
|
||||
cold_plug_vfio = "no-port"
|
||||
|
||||
# Before hot plugging a PCIe device, you need to add a pcie_root_port device.
|
||||
# Use this parameter when using some large PCI bar devices, such as Nvidia GPU
|
||||
@@ -694,9 +694,9 @@ enable_pprof = false
|
||||
|
||||
# Indicates the CreateContainer request timeout needed for the workload(s)
|
||||
# It using guest_pull this includes the time to pull the image inside the guest
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.
|
||||
create_container_timeout = @DEFCREATECONTAINERTIMEOUT@
|
||||
|
||||
|
||||
@@ -34,7 +34,7 @@ rootfs_type = @DEFROOTFSTYPE@
|
||||
#
|
||||
# Known limitations:
|
||||
# * Does not work by design:
|
||||
# - CPU Hotplug
|
||||
# - CPU Hotplug
|
||||
# - Memory Hotplug
|
||||
# - NVDIMM devices
|
||||
#
|
||||
@@ -75,7 +75,7 @@ snp_id_auth = ""
|
||||
|
||||
# SNP Guest Policy, the ‘POLICY’ parameter to the SNP_LAUNCH_START command.
|
||||
# If unset, the QEMU default policy (0x30000) will be used.
|
||||
# Notice that the guest policy is enforced at VM launch, and your pod VMs
|
||||
# Notice that the guest policy is enforced at VM launch, and your pod VMs
|
||||
# won't start at all if the policy denys it. This will be indicated by a
|
||||
# 'SNP_LAUNCH_START' error.
|
||||
snp_guest_policy = 196608
|
||||
@@ -394,10 +394,10 @@ disable_image_nvdimm = @DEFDISABLEIMAGENVDIMM_NV@
|
||||
pcie_root_port = 0
|
||||
|
||||
# In a confidential compute environment hot-plugging can compromise
|
||||
# security.
|
||||
# Enable cold-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# The default setting is "no-port", which means disabled.
|
||||
# security.
|
||||
# Enable cold-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# The default setting is "no-port", which means disabled.
|
||||
cold_plug_vfio = "@DEFAULTVFIOPORT_NV@"
|
||||
|
||||
# If vhost-net backend for virtio-net is not desired, set to true. Default is false, which trades off
|
||||
@@ -710,9 +710,9 @@ enable_pprof = false
|
||||
|
||||
# Indicates the CreateContainer request timeout needed for the workload(s)
|
||||
# It using guest_pull this includes the time to pull the image inside the guest
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.
|
||||
create_container_timeout = @DEFAULTTIMEOUT_NV@
|
||||
|
||||
|
||||
@@ -34,7 +34,7 @@ rootfs_type = @DEFROOTFSTYPE@
|
||||
#
|
||||
# Known limitations:
|
||||
# * Does not work by design:
|
||||
# - CPU Hotplug
|
||||
# - CPU Hotplug
|
||||
# - Memory Hotplug
|
||||
# - NVDIMM devices
|
||||
#
|
||||
@@ -371,10 +371,10 @@ disable_image_nvdimm = @DEFDISABLEIMAGENVDIMM_NV@
|
||||
pcie_root_port = 0
|
||||
|
||||
# In a confidential compute environment hot-plugging can compromise
|
||||
# security.
|
||||
# Enable cold-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# The default setting is "no-port", which means disabled.
|
||||
# security.
|
||||
# Enable cold-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# The default setting is "no-port", which means disabled.
|
||||
cold_plug_vfio = "@DEFAULTVFIOPORT_NV@"
|
||||
|
||||
# If vhost-net backend for virtio-net is not desired, set to true. Default is false, which trades off
|
||||
@@ -687,9 +687,9 @@ enable_pprof = false
|
||||
|
||||
# Indicates the CreateContainer request timeout needed for the workload(s)
|
||||
# It using guest_pull this includes the time to pull the image inside the guest
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.
|
||||
create_container_timeout = @DEFAULTTIMEOUT_NV@
|
||||
|
||||
|
||||
@@ -361,16 +361,16 @@ msize_9p = @DEFMSIZE9P@
|
||||
# nvdimm is not supported when `confidential_guest = true`.
|
||||
disable_image_nvdimm = @DEFDISABLEIMAGENVDIMM_NV@
|
||||
|
||||
# Enable hot-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# Enable hot-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# The default setting is "no-port"
|
||||
hot_plug_vfio = "no-port"
|
||||
|
||||
# In a confidential compute environment hot-plugging can compromise
|
||||
# security.
|
||||
# Enable cold-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# The default setting is "no-port", which means disabled.
|
||||
# security.
|
||||
# Enable cold-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# The default setting is "no-port", which means disabled.
|
||||
cold_plug_vfio = "@DEFAULTVFIOPORT_NV@"
|
||||
|
||||
# Before hot plugging a PCIe device, you need to add a pcie_root_port device.
|
||||
@@ -689,9 +689,9 @@ enable_pprof = false
|
||||
|
||||
# Indicates the CreateContainer request timeout needed for the workload(s)
|
||||
# It using guest_pull this includes the time to pull the image inside the guest
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.
|
||||
create_container_timeout = @DEFAULTTIMEOUT_NV@
|
||||
|
||||
|
||||
@@ -25,7 +25,7 @@ machine_type = "@MACHINETYPE@"
|
||||
#
|
||||
# Known limitations:
|
||||
# * Does not work by design:
|
||||
# - CPU Hotplug
|
||||
# - CPU Hotplug
|
||||
# - Memory Hotplug
|
||||
# - NVDIMM devices
|
||||
#
|
||||
@@ -349,7 +349,7 @@ msize_9p = @DEFMSIZE9P@
|
||||
# nvdimm is not supported when `confidential_guest = true`.
|
||||
disable_image_nvdimm = @DEFDISABLEIMAGENVDIMM@
|
||||
|
||||
# Enable hot-plugging of VFIO devices to a bridge-port,
|
||||
# Enable hot-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# The default setting is "no-port"
|
||||
hot_plug_vfio = "no-port"
|
||||
@@ -677,9 +677,9 @@ enable_pprof = false
|
||||
|
||||
# Indicates the CreateContainer request timeout needed for the workload(s)
|
||||
# It using guest_pull this includes the time to pull the image inside the guest
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.
|
||||
create_container_timeout = @DEFCREATECONTAINERTIMEOUT@
|
||||
|
||||
|
||||
@@ -33,7 +33,7 @@ rootfs_type = @DEFROOTFSTYPE@
|
||||
#
|
||||
# Known limitations:
|
||||
# * Does not work by design:
|
||||
# - CPU Hotplug
|
||||
# - CPU Hotplug
|
||||
# - Memory Hotplug
|
||||
# - NVDIMM devices
|
||||
#
|
||||
@@ -74,7 +74,7 @@ snp_id_auth = ""
|
||||
|
||||
# SNP Guest Policy, the ‘POLICY’ parameter to the SNP_LAUNCH_START command.
|
||||
# If unset, the QEMU default policy (0x30000) will be used.
|
||||
# Notice that the guest policy is enforced at VM launch, and your pod VMs
|
||||
# Notice that the guest policy is enforced at VM launch, and your pod VMs
|
||||
# won't start at all if the policy denys it. This will be indicated by a
|
||||
# 'SNP_LAUNCH_START' error.
|
||||
snp_guest_policy = 196608
|
||||
@@ -702,9 +702,9 @@ enable_pprof = false
|
||||
|
||||
# Indicates the CreateContainer request timeout needed for the workload(s)
|
||||
# It using guest_pull this includes the time to pull the image inside the guest
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.
|
||||
create_container_timeout = @DEFCREATECONTAINERTIMEOUT@
|
||||
|
||||
|
||||
@@ -12,7 +12,7 @@
|
||||
# XXX: Type: @PROJECT_TYPE@
|
||||
|
||||
[hypervisor.qemu]
|
||||
path = "@QEMUTDXPATH@"
|
||||
path = "@QEMUPATH@"
|
||||
kernel = "@KERNELCONFIDENTIALPATH@"
|
||||
image = "@IMAGECONFIDENTIALPATH@"
|
||||
machine_type = "@MACHINETYPE@"
|
||||
@@ -33,7 +33,7 @@ rootfs_type = @DEFROOTFSTYPE@
|
||||
#
|
||||
# Known limitations:
|
||||
# * Does not work by design:
|
||||
# - CPU Hotplug
|
||||
# - CPU Hotplug
|
||||
# - Memory Hotplug
|
||||
# - NVDIMM devices
|
||||
#
|
||||
@@ -54,7 +54,7 @@ enable_annotations = @DEFENABLEANNOTATIONS_COCO@
|
||||
# Each member of the list is a path pattern as described by glob(3).
|
||||
# The default if not set is empty (all annotations rejected.)
|
||||
# Your distribution recommends: @QEMUVALIDHYPERVISORPATHS@
|
||||
valid_hypervisor_paths = @QEMUTDXVALIDHYPERVISORPATHS@
|
||||
valid_hypervisor_paths = @QEMUVALIDHYPERVISORPATHS@
|
||||
|
||||
# Optional space-separated list of options to pass to the guest kernel.
|
||||
# For example, use `kernel_params = "vsyscall=emulate"` if you are having
|
||||
@@ -679,9 +679,9 @@ enable_pprof = false
|
||||
|
||||
# Indicates the CreateContainer request timeout needed for the workload(s)
|
||||
# It using guest_pull this includes the time to pull the image inside the guest
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.
|
||||
create_container_timeout = @DEFCREATECONTAINERTIMEOUT@
|
||||
|
||||
|
||||
@@ -361,17 +361,17 @@ msize_9p = @DEFMSIZE9P@
|
||||
# nvdimm is not supported when `confidential_guest = true`.
|
||||
disable_image_nvdimm = @DEFDISABLEIMAGENVDIMM@
|
||||
|
||||
# Enable hot-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# Enable hot-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# The default setting is "no-port"
|
||||
hot_plug_vfio = "no-port"
|
||||
|
||||
# In a confidential compute environment hot-plugging can compromise
|
||||
# security.
|
||||
# Enable cold-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# The default setting is "no-port", which means disabled.
|
||||
cold_plug_vfio = "no-port"
|
||||
# security.
|
||||
# Enable cold-plugging of VFIO devices to a bridge-port,
|
||||
# root-port or switch-port.
|
||||
# The default setting is "no-port", which means disabled.
|
||||
cold_plug_vfio = "no-port"
|
||||
|
||||
# Before hot plugging a PCIe device, you need to add a pcie_root_port device.
|
||||
# Use this parameter when using some large PCI bar devices, such as Nvidia GPU
|
||||
@@ -693,9 +693,9 @@ enable_pprof = false
|
||||
|
||||
# Indicates the CreateContainer request timeout needed for the workload(s)
|
||||
# It using guest_pull this includes the time to pull the image inside the guest
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.
|
||||
create_container_timeout = @DEFCREATECONTAINERTIMEOUT@
|
||||
|
||||
|
||||
@@ -410,9 +410,9 @@ enable_pprof = false
|
||||
|
||||
# Indicates the CreateContainer request timeout needed for the workload(s)
|
||||
# It using guest_pull this includes the time to pull the image inside the guest
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# Defaults to @DEFCREATECONTAINERTIMEOUT@ second(s)
|
||||
# Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
|
||||
# (https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
|
||||
# In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.
|
||||
create_container_timeout = @DEFCREATECONTAINERTIMEOUT@
|
||||
|
||||
|
||||
@@ -19,8 +19,13 @@ import (
|
||||
)
|
||||
|
||||
const (
|
||||
// containerd CRI annotations
|
||||
nameAnnotation = "io.kubernetes.cri.sandbox-name"
|
||||
namespaceAnnotation = "io.kubernetes.cri.sandbox-namespace"
|
||||
|
||||
// CRI-O annotations
|
||||
crioNameAnnotation = "io.kubernetes.cri-o.KubeName"
|
||||
crioNamespaceAnnotation = "io.kubernetes.cri-o.Namespace"
|
||||
)
|
||||
|
||||
// coldPlugDevices handles cold plug of CDI devices into the sandbox
|
||||
@@ -78,8 +83,7 @@ func coldPlugWithAPI(ctx context.Context, s *service, ociSpec *specs.Spec) error
|
||||
// the Kubelet does not pass the device information via CRI during
|
||||
// Sandbox creation.
|
||||
func getDeviceSpec(ctx context.Context, socket string, ann map[string]string) ([]string, error) {
|
||||
podName := ann[nameAnnotation]
|
||||
podNs := ann[namespaceAnnotation]
|
||||
podName, podNs := getPodIdentifiers(ann)
|
||||
|
||||
// create dialer for unix socket
|
||||
dialer := func(ctx context.Context, target string) (net.Conn, error) {
|
||||
@@ -111,7 +115,7 @@ func getDeviceSpec(ctx context.Context, socket string, ann map[string]string) ([
|
||||
}
|
||||
resp, err := client.Get(ctx, prr)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("cold plug: GetPodResources failed: %w", err)
|
||||
return nil, fmt.Errorf("cold plug: GetPodResources failed for pod(%s) in namespace(%s): %w", podName, podNs, err)
|
||||
}
|
||||
podRes := resp.PodResources
|
||||
if podRes == nil {
|
||||
@@ -141,6 +145,24 @@ func formatCDIDevIDs(specName string, devIDs []string) []string {
|
||||
return result
|
||||
}
|
||||
|
||||
func debugPodID(ann map[string]string) string {
|
||||
return fmt.Sprintf("%s/%s", ann[namespaceAnnotation], ann[nameAnnotation])
|
||||
// getPodIdentifiers returns the pod name and namespace from annotations.
|
||||
// It first checks containerd CRI annotations, then falls back to CRI-O annotations.
|
||||
func getPodIdentifiers(ann map[string]string) (podName, podNamespace string) {
|
||||
podName = ann[nameAnnotation]
|
||||
podNamespace = ann[namespaceAnnotation]
|
||||
|
||||
// Fall back to CRI-O annotations if containerd annotations are empty
|
||||
if podName == "" {
|
||||
podName = ann[crioNameAnnotation]
|
||||
}
|
||||
if podNamespace == "" {
|
||||
podNamespace = ann[crioNamespaceAnnotation]
|
||||
}
|
||||
|
||||
return podName, podNamespace
|
||||
}
|
||||
|
||||
func debugPodID(ann map[string]string) string {
|
||||
podName, podNamespace := getPodIdentifiers(ann)
|
||||
return fmt.Sprintf("%s/%s", podNamespace, podName)
|
||||
}
|
||||
|
||||
@@ -155,13 +155,13 @@ EOF
|
||||
# End-to-End Tests (require cluster with kata-deploy)
|
||||
# =============================================================================
|
||||
|
||||
@test "E2E: Custom RuntimeClass exists with correct properties" {
|
||||
@test "E2E: Custom RuntimeClass exists and can run a pod" {
|
||||
# Check RuntimeClass exists
|
||||
run kubectl get runtimeclass "${CUSTOM_RUNTIME_HANDLER}" -o name
|
||||
if [[ "${status}" -ne 0 ]]; then
|
||||
echo "# RuntimeClass not found. kata-deploy logs:" >&3
|
||||
kubectl -n kube-system logs -l name=kata-deploy --tail=50 2>/dev/null || true
|
||||
fail "Custom RuntimeClass ${CUSTOM_RUNTIME_HANDLER} not found"
|
||||
die "Custom RuntimeClass ${CUSTOM_RUNTIME_HANDLER} not found"
|
||||
fi
|
||||
|
||||
echo "# RuntimeClass ${CUSTOM_RUNTIME_HANDLER} exists" >&3
|
||||
@@ -195,15 +195,6 @@ EOF
|
||||
echo "# Label app.kubernetes.io/managed-by: ${label}" >&3
|
||||
[[ "${label}" == "Helm" ]]
|
||||
|
||||
BATS_TEST_COMPLETED=1
|
||||
}
|
||||
|
||||
@test "E2E: Custom runtime can run a pod" {
|
||||
# Check if the custom RuntimeClass exists
|
||||
if ! kubectl get runtimeclass "${CUSTOM_RUNTIME_HANDLER}" &>/dev/null; then
|
||||
skip "Custom RuntimeClass ${CUSTOM_RUNTIME_HANDLER} not found"
|
||||
fi
|
||||
|
||||
# Create a test pod using the custom runtime
|
||||
cat <<EOF | kubectl apply -f -
|
||||
apiVersion: v1
|
||||
@@ -239,7 +230,7 @@ EOF
|
||||
Failed)
|
||||
echo "# Pod failed" >&3
|
||||
kubectl describe pod "${TEST_POD_NAME}" >&3
|
||||
fail "Pod failed to run with custom runtime"
|
||||
die "Pod failed to run with custom runtime"
|
||||
;;
|
||||
*)
|
||||
local current_time
|
||||
@@ -247,7 +238,7 @@ EOF
|
||||
if (( current_time - start_time > timeout )); then
|
||||
echo "# Timeout waiting for pod" >&3
|
||||
kubectl describe pod "${TEST_POD_NAME}" >&3
|
||||
fail "Timeout waiting for pod to be ready"
|
||||
die "Timeout waiting for pod to be ready"
|
||||
fi
|
||||
sleep 5
|
||||
;;
|
||||
@@ -262,7 +253,7 @@ EOF
|
||||
echo "# Pod ran successfully with custom runtime" >&3
|
||||
BATS_TEST_COMPLETED=1
|
||||
else
|
||||
fail "Pod did not complete successfully (exit code: ${exit_code})"
|
||||
die "Pod did not complete successfully (exit code: ${exit_code})"
|
||||
fi
|
||||
}
|
||||
|
||||
|
||||
@@ -115,7 +115,7 @@ deploy_kata() {
|
||||
kubectl -n "${HELM_NAMESPACE}" rollout status daemonset/kata-deploy --timeout=300s
|
||||
|
||||
# Give it a moment to configure runtimes
|
||||
sleep 10
|
||||
sleep 60
|
||||
|
||||
return 0
|
||||
}
|
||||
|
||||
@@ -48,12 +48,59 @@ KBS_AUTH_CONFIG_JSON=$(
|
||||
)
|
||||
export KBS_AUTH_CONFIG_JSON
|
||||
|
||||
# Base64 encoding for use as Kubernetes Secret in pod manifests
|
||||
# Base64 encoding for use as Kubernetes Secret in pod manifests (non-TEE)
|
||||
NGC_API_KEY_BASE64=$(
|
||||
echo -n "${NGC_API_KEY}" | base64 -w0
|
||||
)
|
||||
export NGC_API_KEY_BASE64
|
||||
|
||||
# Sealed secret format for TEE pods (vault type pointing to KBS resource)
|
||||
# Format: sealed.<base64url JWS header>.<base64url payload>.<base64url signature>
|
||||
# IMPORTANT: JWS uses base64url encoding WITHOUT padding (no trailing '=')
|
||||
# We use tr to convert standard base64 (+/) to base64url (-_) and remove padding (=)
|
||||
# For vault type, header and signature can be placeholders since the payload
|
||||
# contains the KBS resource path where the actual secret is stored.
|
||||
#
|
||||
# Vault type sealed secret payload for instruct pod:
|
||||
# {
|
||||
# "version": "0.1.0",
|
||||
# "type": "vault",
|
||||
# "name": "kbs:///default/ngc-api-key/instruct",
|
||||
# "provider": "kbs",
|
||||
# "provider_settings": {},
|
||||
# "annotations": {}
|
||||
# }
|
||||
NGC_API_KEY_SEALED_SECRET_INSTRUCT_PAYLOAD=$(
|
||||
echo -n '{"version":"0.1.0","type":"vault","name":"kbs:///default/ngc-api-key/instruct","provider":"kbs","provider_settings":{},"annotations":{}}' |
|
||||
base64 -w0 | tr '+/' '-_' | tr -d '='
|
||||
)
|
||||
NGC_API_KEY_SEALED_SECRET_INSTRUCT="sealed.fakejwsheader.${NGC_API_KEY_SEALED_SECRET_INSTRUCT_PAYLOAD}.fakesignature"
|
||||
export NGC_API_KEY_SEALED_SECRET_INSTRUCT
|
||||
|
||||
# Base64 encode the sealed secret for use in Kubernetes Secret data field
|
||||
# (genpolicy only supports the 'data' field which expects base64 values)
|
||||
NGC_API_KEY_SEALED_SECRET_INSTRUCT_BASE64=$(echo -n "${NGC_API_KEY_SEALED_SECRET_INSTRUCT}" | base64 -w0)
|
||||
export NGC_API_KEY_SEALED_SECRET_INSTRUCT_BASE64
|
||||
|
||||
# Vault type sealed secret payload for embedqa pod:
|
||||
# {
|
||||
# "version": "0.1.0",
|
||||
# "type": "vault",
|
||||
# "name": "kbs:///default/ngc-api-key/embedqa",
|
||||
# "provider": "kbs",
|
||||
# "provider_settings": {},
|
||||
# "annotations": {}
|
||||
# }
|
||||
NGC_API_KEY_SEALED_SECRET_EMBEDQA_PAYLOAD=$(
|
||||
echo -n '{"version":"0.1.0","type":"vault","name":"kbs:///default/ngc-api-key/embedqa","provider":"kbs","provider_settings":{},"annotations":{}}' |
|
||||
base64 -w0 | tr '+/' '-_' | tr -d '='
|
||||
)
|
||||
NGC_API_KEY_SEALED_SECRET_EMBEDQA="sealed.fakejwsheader.${NGC_API_KEY_SEALED_SECRET_EMBEDQA_PAYLOAD}.fakesignature"
|
||||
export NGC_API_KEY_SEALED_SECRET_EMBEDQA
|
||||
|
||||
NGC_API_KEY_SEALED_SECRET_EMBEDQA_BASE64=$(echo -n "${NGC_API_KEY_SEALED_SECRET_EMBEDQA}" | base64 -w0)
|
||||
export NGC_API_KEY_SEALED_SECRET_EMBEDQA_BASE64
|
||||
|
||||
setup_langchain_flow() {
|
||||
# shellcheck disable=SC1091 # Sourcing virtual environment activation script
|
||||
source "${HOME}"/.cicd/venv/bin/activate
|
||||
@@ -66,18 +113,56 @@ setup_langchain_flow() {
|
||||
[[ "$(pip show beautifulsoup4 2>/dev/null | awk '/^Version:/{print $2}')" = "4.13.4" ]] || pip install beautifulsoup4==4.13.4
|
||||
}
|
||||
|
||||
setup_kbs_credentials() {
|
||||
# Get KBS address and export it for pod template substitution
|
||||
export CC_KBS_ADDR="$(kbs_k8s_svc_http_addr)"
|
||||
# Create initdata TOML file for genpolicy with CDH configuration.
|
||||
# This file is used by genpolicy via --initdata-path. Genpolicy will add the
|
||||
# generated policy.rego to it and set it as the cc_init_data annotation.
|
||||
# We must overwrite the default empty file AFTER create_tmp_policy_settings_dir()
|
||||
# copies it to the temp directory.
|
||||
create_nim_initdata_file() {
|
||||
local output_file="$1"
|
||||
local cc_kbs_address
|
||||
cc_kbs_address=$(kbs_k8s_svc_http_addr)
|
||||
|
||||
kbs_set_gpu0_resource_policy
|
||||
cat > "${output_file}" << EOF
|
||||
version = "0.1.0"
|
||||
algorithm = "sha256"
|
||||
|
||||
[data]
|
||||
"aa.toml" = '''
|
||||
[token_configs]
|
||||
[token_configs.kbs]
|
||||
url = "${cc_kbs_address}"
|
||||
'''
|
||||
|
||||
"cdh.toml" = '''
|
||||
[kbc]
|
||||
name = "cc_kbc"
|
||||
url = "${cc_kbs_address}"
|
||||
|
||||
[image]
|
||||
authenticated_registry_credentials_uri = "kbs:///default/credentials/nvcr"
|
||||
'''
|
||||
EOF
|
||||
}
|
||||
|
||||
setup_kbs_credentials() {
|
||||
# Export KBS address for use in pod YAML templates (aa_kbc_params)
|
||||
CC_KBS_ADDR=$(kbs_k8s_svc_http_addr)
|
||||
export CC_KBS_ADDR
|
||||
|
||||
# Set up Kubernetes secret for the containerd metadata pull
|
||||
kubectl delete secret ngc-secret-instruct --ignore-not-found
|
||||
kubectl create secret docker-registry ngc-secret-instruct --docker-server="nvcr.io" --docker-username="\$oauthtoken" --docker-password="${NGC_API_KEY}"
|
||||
|
||||
kbs_set_gpu0_resource_policy
|
||||
|
||||
# KBS_AUTH_CONFIG_JSON is already base64 encoded
|
||||
kbs_set_resource_base64 "default" "credentials" "nvcr" "${KBS_AUTH_CONFIG_JSON}"
|
||||
|
||||
# Store the actual NGC_API_KEY in KBS for sealed secret unsealing.
|
||||
# The sealed secrets in the pod YAML point to these KBS resource paths.
|
||||
kbs_set_resource "default" "ngc-api-key" "instruct" "${NGC_API_KEY}"
|
||||
kbs_set_resource "default" "ngc-api-key" "embedqa" "${NGC_API_KEY}"
|
||||
}
|
||||
|
||||
create_inference_pod() {
|
||||
@@ -122,10 +207,6 @@ setup_file() {
|
||||
export POD_EMBEDQA_YAML_IN="${pod_config_dir}/${POD_NAME_EMBEDQA}.yaml.in"
|
||||
export POD_EMBEDQA_YAML="${pod_config_dir}/${POD_NAME_EMBEDQA}.yaml"
|
||||
|
||||
if [ "${TEE}" = "true" ]; then
|
||||
setup_kbs_credentials
|
||||
fi
|
||||
|
||||
dpkg -s jq >/dev/null 2>&1 || sudo apt -y install jq
|
||||
|
||||
export PYENV_ROOT="${HOME}/.pyenv"
|
||||
@@ -140,6 +221,14 @@ setup_file() {
|
||||
policy_settings_dir="$(create_tmp_policy_settings_dir "${pod_config_dir}")"
|
||||
add_requests_to_policy_settings "${policy_settings_dir}" "ReadStreamRequest"
|
||||
|
||||
if [ "${TEE}" = "true" ]; then
|
||||
setup_kbs_credentials
|
||||
# Overwrite the empty default-initdata.toml with our CDH configuration.
|
||||
# This must happen AFTER create_tmp_policy_settings_dir() copies the empty
|
||||
# file and BEFORE auto_generate_policy() runs.
|
||||
create_nim_initdata_file "${policy_settings_dir}/default-initdata.toml"
|
||||
fi
|
||||
|
||||
create_inference_pod
|
||||
|
||||
if [ "${SKIP_MULTI_GPU_TESTS}" != "true" ]; then
|
||||
|
||||
@@ -282,7 +282,7 @@ teardown() {
|
||||
|
||||
# Debugging information. Don't print the "Message:" line because it contains a truncated policy log.
|
||||
kubectl describe pod "${pod_name}" | grep -v "Message:"
|
||||
teardown_common "${node}" "${node_start_time:-}"
|
||||
|
||||
# Clean-up
|
||||
kubectl delete pod "${pod_name}"
|
||||
kubectl delete configmap "${configmap_name}"
|
||||
@@ -291,4 +291,6 @@ teardown() {
|
||||
rm -f "${incorrect_configmap_yaml}"
|
||||
rm -f "${testcase_pre_generate_pod_yaml}"
|
||||
rm -f "${testcase_pre_generate_configmap_yaml}"
|
||||
|
||||
teardown_common "${node}" "${node_start_time:-}"
|
||||
}
|
||||
|
||||
@@ -62,9 +62,11 @@ teardown() {
|
||||
|
||||
# Debugging information. Don't print the "Message:" line because it contains a truncated policy log.
|
||||
kubectl describe pod "${pod_name}" | grep -v "Message:"
|
||||
teardown_common "${node}" "${node_start_time:-}"
|
||||
|
||||
# Clean-up
|
||||
kubectl delete -f "${correct_pod_yaml}"
|
||||
kubectl delete -f "${pvc_yaml}"
|
||||
rm -f "${incorrect_pod_yaml}"
|
||||
|
||||
teardown_common "${node}" "${node_start_time:-}"
|
||||
}
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: busybox
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
shareProcessNamespace: true
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: POD_NAME
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
shareProcessNamespace: true
|
||||
containers:
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: initcontainer-shared-volume
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
initContainers:
|
||||
- name: first
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: busybox
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
shareProcessNamespace: true
|
||||
runtimeClassName: kata
|
||||
initContainers:
|
||||
|
||||
@@ -16,7 +16,6 @@ spec:
|
||||
labels:
|
||||
jobgroup: jobtest
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: test
|
||||
|
||||
@@ -10,7 +10,6 @@ metadata:
|
||||
spec:
|
||||
template:
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: pi
|
||||
|
||||
@@ -23,7 +23,6 @@ spec:
|
||||
role: master
|
||||
tier: backend
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
securityContext:
|
||||
runAsUser: 2000
|
||||
|
||||
@@ -23,7 +23,6 @@ spec:
|
||||
role: master
|
||||
tier: backend
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
securityContext:
|
||||
runAsUser: 2000
|
||||
|
||||
@@ -23,7 +23,6 @@ spec:
|
||||
role: master
|
||||
tier: backend
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
securityContext:
|
||||
runAsUser: 65534
|
||||
|
||||
@@ -23,7 +23,6 @@ spec:
|
||||
role: master
|
||||
tier: backend
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
securityContext:
|
||||
runAsUser: 2000
|
||||
|
||||
@@ -23,7 +23,6 @@ spec:
|
||||
role: master
|
||||
tier: backend
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
securityContext:
|
||||
runAsUser: 1000
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: hard-coded-policy-pod
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
shareProcessNamespace: true
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
|
||||
@@ -10,7 +10,6 @@ metadata:
|
||||
spec:
|
||||
template:
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: hello
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: policy-pod-pvc
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: busybox
|
||||
|
||||
@@ -9,7 +9,6 @@ metadata:
|
||||
name: policy-pod
|
||||
uid: policy-pod-uid
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: prometheus
|
||||
|
||||
@@ -17,7 +17,6 @@ spec:
|
||||
labels:
|
||||
app: policy-nginx-rc
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: nginxtest
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: set-keys-test
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
shareProcessNamespace: true
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
|
||||
@@ -9,7 +9,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: handlers
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: handlers-container
|
||||
|
||||
@@ -17,7 +17,6 @@ spec:
|
||||
labels:
|
||||
app: nginx
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: nginx
|
||||
|
||||
@@ -10,7 +10,11 @@ metadata:
|
||||
labels:
|
||||
app: ${POD_NAME_INSTRUCT}
|
||||
annotations:
|
||||
io.katacontainers.config.hypervisor.kernel_params: "agent.image_registry_auth=kbs:///default/credentials/nvcr agent.aa_kbc_params=cc_kbc::${CC_KBS_ADDR}"
|
||||
# Start CDH process and configure AA for KBS communication
|
||||
# aa_kbc_params tells the Attestation Agent where KBS is located
|
||||
io.katacontainers.config.hypervisor.kernel_params: "agent.guest_components_procs=confidential-data-hub agent.aa_kbc_params=cc_kbc::${CC_KBS_ADDR}"
|
||||
# cc_init_data annotation will be added by genpolicy with CDH configuration
|
||||
# from the custom default-initdata.toml created by create_nim_initdata_file()
|
||||
spec:
|
||||
restartPolicy: Never
|
||||
runtimeClassName: kata
|
||||
@@ -58,7 +62,7 @@ spec:
|
||||
- name: NGC_API_KEY
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: ngc-api-key-instruct
|
||||
name: ngc-api-key-sealed-instruct
|
||||
key: api-key
|
||||
# GPU resource limit (for NVIDIA GPU)
|
||||
resources:
|
||||
@@ -78,7 +82,9 @@ data:
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata:
|
||||
name: ngc-api-key-instruct
|
||||
name: ngc-api-key-sealed-instruct
|
||||
type: Opaque
|
||||
data:
|
||||
api-key: "${NGC_API_KEY_BASE64}"
|
||||
# Sealed secret pointing to kbs:///default/ngc-api-key/instruct
|
||||
# CDH will unseal this by fetching the actual key from KBS
|
||||
api-key: "${NGC_API_KEY_SEALED_SECRET_INSTRUCT_BASE64}"
|
||||
|
||||
@@ -10,7 +10,11 @@ metadata:
|
||||
labels:
|
||||
app: ${POD_NAME_EMBEDQA}
|
||||
annotations:
|
||||
io.katacontainers.config.hypervisor.kernel_params: "agent.image_registry_auth=kbs:///default/credentials/nvcr agent.aa_kbc_params=cc_kbc::${CC_KBS_ADDR}"
|
||||
# Start CDH process and configure AA for KBS communication
|
||||
# aa_kbc_params tells the Attestation Agent where KBS is located
|
||||
io.katacontainers.config.hypervisor.kernel_params: "agent.guest_components_procs=confidential-data-hub agent.aa_kbc_params=cc_kbc::${CC_KBS_ADDR}"
|
||||
# cc_init_data annotation will be added by genpolicy with CDH configuration
|
||||
# from the custom default-initdata.toml created by create_nim_initdata_file()
|
||||
spec:
|
||||
restartPolicy: Always
|
||||
runtimeClassName: kata
|
||||
@@ -29,7 +33,7 @@ spec:
|
||||
- name: NGC_API_KEY
|
||||
valueFrom:
|
||||
secretKeyRef:
|
||||
name: ngc-api-key-embedqa
|
||||
name: ngc-api-key-sealed-embedqa
|
||||
key: api-key
|
||||
- name: NIM_HTTP_API_PORT
|
||||
value: "8000"
|
||||
@@ -88,7 +92,9 @@ data:
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata:
|
||||
name: ngc-api-key-embedqa
|
||||
name: ngc-api-key-sealed-embedqa
|
||||
type: Opaque
|
||||
data:
|
||||
api-key: "${NGC_API_KEY_BASE64}"
|
||||
# Sealed secret pointing to kbs:///default/ngc-api-key/embedqa
|
||||
# CDH will unseal this by fetching the actual key from KBS
|
||||
api-key: "${NGC_API_KEY_SEALED_SECRET_EMBEDQA_BASE64}"
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: besteffort-test
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: qos-besteffort
|
||||
|
||||
@@ -3,7 +3,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: pod-block-pv
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: my-container
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: burstable-test
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: qos-burstable
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: pod-caps
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: test-container
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: config-env-test-pod
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: test-container
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: default-cpu-test
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: default-cpu-demo-ctr
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: constraints-cpu-test
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: first-cpu-container
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: custom-dns-test
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: test
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: sharevol-kata
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: test
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: test-env
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: test-container
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: test-file-volume
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
restartPolicy: Never
|
||||
nodeName: NODE
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: footubuntu
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
volumes:
|
||||
- name: runv
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: qos-test
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: qos-guaranteed
|
||||
|
||||
@@ -9,7 +9,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: test-pod-hostname
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
restartPolicy: Never
|
||||
containers:
|
||||
|
||||
@@ -9,7 +9,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: hostpath-kmsg
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
restartPolicy: Never
|
||||
volumes:
|
||||
|
||||
@@ -10,7 +10,6 @@ metadata:
|
||||
test: liveness-test
|
||||
name: liveness-http
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: liveness
|
||||
|
||||
@@ -10,7 +10,6 @@ metadata:
|
||||
test: liveness
|
||||
name: liveness-exec
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: liveness
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: memory-test
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: memory-test-ctr
|
||||
|
||||
@@ -23,7 +23,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: nested-configmap-secret-pod
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: test-container
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: cpu-test
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: c1
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: optional-empty-config-test-pod
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: test-container
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: optional-empty-secret-test-pod
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: test-container
|
||||
|
||||
@@ -9,7 +9,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: privileged
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
restartPolicy: Never
|
||||
containers:
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: test-projected-volume
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: test-projected-volume
|
||||
|
||||
@@ -17,7 +17,6 @@ spec:
|
||||
labels:
|
||||
purpose: quota-demo
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: pod-quota-demo
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: test-readonly-volume
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
restartPolicy: Never
|
||||
volumes:
|
||||
|
||||
@@ -11,7 +11,6 @@ metadata:
|
||||
io.katacontainers.config.runtime.disable_guest_seccomp: "false"
|
||||
spec:
|
||||
runtimeClassName: kata
|
||||
terminationGracePeriodSeconds: 0
|
||||
restartPolicy: Never
|
||||
containers:
|
||||
- name: busybox
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: secret-envars-test-pod
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: envars-test-container
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: secret-test-pod
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: test-container
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: security-context-test
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
securityContext:
|
||||
runAsUser: 1000
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: test-shared-volume
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
restartPolicy: Never
|
||||
volumes:
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: sysctl-test
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
securityContext:
|
||||
sysctls:
|
||||
|
||||
@@ -10,7 +10,6 @@ metadata:
|
||||
labels:
|
||||
app: tcp-liveness
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: tcp-liveness
|
||||
|
||||
@@ -8,7 +8,6 @@ apiVersion: v1
|
||||
metadata:
|
||||
name: pv-pod
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
nodeName: NODE
|
||||
volumes:
|
||||
|
||||
@@ -23,7 +23,6 @@ spec:
|
||||
role: master
|
||||
tier: backend
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: master
|
||||
|
||||
@@ -17,7 +17,6 @@ spec:
|
||||
labels:
|
||||
app: nginx-rc-test
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: nginxtest
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: vfio
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: c1
|
||||
|
||||
@@ -114,29 +114,29 @@ adapt_common_policy_settings_for_non_coco() {
|
||||
|
||||
# Using UpdateEphemeralMountsRequest - instead of CopyFileRequest.
|
||||
jq '.request_defaults.UpdateEphemeralMountsRequest = true' "${settings_dir}/genpolicy-settings.json" > temp.json
|
||||
sudo mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
|
||||
# Using a different path to container container root.
|
||||
jq '.common.root_path = "/run/kata-containers/shared/containers/$(bundle-id)/rootfs"' "${settings_dir}/genpolicy-settings.json" > temp.json
|
||||
sudo mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
|
||||
# Using CreateContainer Storage input structs for configMap & secret volumes - instead of using CopyFile like CoCo.
|
||||
jq '.kata_config.enable_configmap_secret_storages = true' "${settings_dir}/genpolicy-settings.json" > temp.json
|
||||
sudo mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
|
||||
# Using watchable binds for configMap volumes - instead of CopyFileRequest.
|
||||
jq '.volumes.configMap.mount_point = "^$(cpath)/watchable/$(bundle-id)-[a-z0-9]{16}-" | .volumes.configMap.driver = "watchable-bind"' \
|
||||
"${settings_dir}/genpolicy-settings.json" > temp.json
|
||||
sudo mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
|
||||
# Using a Storage input struct for paths shared with the Host using virtio-fs.
|
||||
jq '.sandbox.storages += [{"driver":"virtio-fs","driver_options":[],"fs_group":null,"fstype":"virtiofs","mount_point":"/run/kata-containers/shared/containers/","options":[],"source":"kataShared"}]' \
|
||||
"${settings_dir}/genpolicy-settings.json" > temp.json
|
||||
sudo mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
|
||||
# Disable guest pull.
|
||||
jq '.cluster_config.guest_pull = false' "${settings_dir}/genpolicy-settings.json" > temp.json
|
||||
sudo mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
}
|
||||
|
||||
# adapt common policy settings for AKS Hosts
|
||||
@@ -144,16 +144,16 @@ adapt_common_policy_settings_for_aks() {
|
||||
info "Adapting common policy settings for AKS Hosts"
|
||||
|
||||
jq '.pause_container.Process.User.UID = 0' "${settings_dir}/genpolicy-settings.json" > temp.json
|
||||
sudo mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
|
||||
jq '.pause_container.Process.User.GID = 0' "${settings_dir}/genpolicy-settings.json" > temp.json
|
||||
sudo mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
|
||||
jq '.cluster_config.pause_container_image = "mcr.microsoft.com/oss/v2/kubernetes/pause:3.6"' "${settings_dir}/genpolicy-settings.json" > temp.json
|
||||
sudo mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
|
||||
jq '.cluster_config.pause_container_id_policy = "v2"' "${settings_dir}/genpolicy-settings.json" > temp.json
|
||||
sudo mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
}
|
||||
|
||||
# adapt common policy settings for CBL-Mariner Hosts
|
||||
@@ -161,7 +161,7 @@ adapt_common_policy_settings_for_cbl_mariner() {
|
||||
local settings_dir=$1
|
||||
|
||||
info "Adapting common policy settings for KATA_HOST_OS=cbl-mariner"
|
||||
jq '.kata_config.oci_version = "1.2.0"' "${settings_dir}/genpolicy-settings.json" > temp.json && sudo mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
jq '.kata_config.oci_version = "1.2.0"' "${settings_dir}/genpolicy-settings.json" > temp.json && mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
}
|
||||
|
||||
# Adapt common policy settings for NVIDIA GPU platforms (CI runners use containerd 2.x).
|
||||
@@ -169,7 +169,7 @@ adapt_common_policy_settings_for_nvidia_gpu() {
|
||||
local settings_dir=$1
|
||||
|
||||
info "Adapting common policy settings for NVIDIA GPU platform (${KATA_HYPERVISOR})"
|
||||
jq '.kata_config.oci_version = "1.2.1"' "${settings_dir}/genpolicy-settings.json" > temp.json && sudo mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
jq '.kata_config.oci_version = "1.2.1"' "${settings_dir}/genpolicy-settings.json" > temp.json && mv temp.json "${settings_dir}/genpolicy-settings.json"
|
||||
}
|
||||
|
||||
# adapt common policy settings for various platforms
|
||||
@@ -195,10 +195,10 @@ create_common_genpolicy_settings() {
|
||||
|
||||
auto_generate_policy_enabled || return 0
|
||||
|
||||
adapt_common_policy_settings "${default_genpolicy_settings_dir}"
|
||||
|
||||
cp "${default_genpolicy_settings_dir}/genpolicy-settings.json" "${genpolicy_settings_dir}"
|
||||
cp "${default_genpolicy_settings_dir}/rules.rego" "${genpolicy_settings_dir}"
|
||||
|
||||
adapt_common_policy_settings "${genpolicy_settings_dir}"
|
||||
}
|
||||
|
||||
# If auto-generated policy testing is enabled, make a copy of the common genpolicy settings
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: test-sysbench
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: test-sysbench
|
||||
|
||||
@@ -26,4 +26,3 @@ spec:
|
||||
- name: iperf3-client
|
||||
image: networkstatic/iperf3
|
||||
command: ['/bin/sh', '-c', 'sleep infinity']
|
||||
terminationGracePeriodSeconds: 0
|
||||
|
||||
@@ -38,7 +38,6 @@ spec:
|
||||
ports:
|
||||
- containerPort: 5201
|
||||
name: server
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
|
||||
---
|
||||
|
||||
@@ -7,7 +7,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: latency-client
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: client-container
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: latency-server
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: server-container
|
||||
|
||||
@@ -16,7 +16,6 @@ spec:
|
||||
labels:
|
||||
app: nginx
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: nginx
|
||||
|
||||
@@ -16,7 +16,6 @@ spec:
|
||||
labels:
|
||||
purpose: pod-test
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: pod-test
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: stability-test
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: stability-test
|
||||
|
||||
@@ -8,7 +8,6 @@ kind: Pod
|
||||
metadata:
|
||||
name: stressng-test
|
||||
spec:
|
||||
terminationGracePeriodSeconds: 0
|
||||
runtimeClassName: kata
|
||||
containers:
|
||||
- name: stress-test
|
||||
|
||||
Some files were not shown because too many files have changed in this diff Show More
Reference in New Issue
Block a user