mirror of https://github.com/kata-containers/kata-containers.git synced 2026-07-01 22:50:54 +00:00

Files

Fabiano Fidêncio fefc0b75ab kata-deploy: nvidia: Default to the Job-based deployment mode

Switch the NVIDIA GPU example values file to install Kata via the
Job-based deployment mode (deploymentMode: job) instead of the
always-on, privileged DaemonSet, so that nothing keeps running on the
node once the install completes.

To exercise this in our CI, make the helm_helper aware of the deployment
mode coming from the (base) values file:

  - In "job" mode, clear job.nodeSelectorExpressions so the dispatcher
    targets every discovered node.  Our CI clusters are typically
    single-node, where the only node carries the control-plane label,
    and the default selector excludes control-plane/master nodes.

  - There is no always-on DaemonSet to wait on in "job" mode.  The
    dispatcher runs as a blocking post-install hook and the final
    per-node stage labels the node, so wait until at least one node
    carries the katacontainers.io/kata-runtime label as the
    "install complete" signal (dumping Job/pod logs on timeout).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>

2026-06-12 22:55:11 +02:00

16 KiB

Raw Blame History

Helm Configuration

Parameters

The helm chart provides a comprehensive set of configuration options. You may view the parameters and their descriptions by going to the GitHub source or by using helm:

# List available kata-deploy chart versions:
#   helm search repo kata-deploy-charts/kata-deploy --versions
#
# Then replace X.Y.Z below with the desired chart version:
helm show values --version X.Y.Z oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy

shims

Kata ships with a number of pre-built artifacts and runtimes. You may selectively enable or disable specific shims. For example:

shims:
  disableAll: true
  qemu:
    enabled: true
  qemu-nvidia-gpu:
    enabled: true
  qemu-nvidia-gpu-snp:
    enabled: false

Shims can also have configuration options specific to them:

  qemu-nvidia-gpu:
    enabled: ~
    supportedArches:
      - amd64
    dropIn: |
      [agent.kata]
      dial_timeout = 999
    allowedHypervisorAnnotations: []
    containerd:
      snapshotter: ""
    runtimeClass:
      # This label is automatically added by gpu-operator. Override it
      # if you want to use a different label.
      # Uncomment once GPU Operator v26.3 is out
      # nodeSelector:
        # nvidia.com/cc.ready.state: "false"

The optional shims.<shim>.dropIn field lets you add a custom Kata drop-in for a default (non-custom) runtime. kata-deploy writes it as config.d/50-user-overrides.toml for that shim.

It's best to reference the default values.yaml file above for more details.

Custom Runtimes

Kata allows you to create custom runtime configurations. This is done by overlaying one of the pre-existing runtime configs with user-provided configs. For example, we can use the qemu-nvidia-gpu as a base config and overlay our own parameters to it:

customRuntimes:
  enabled: false
  runtimes:
    my-gpu-runtime:
      baseConfig: "qemu-nvidia-gpu"  # Required: existing config to use as base
      dropIn: |                      # Optional: overrides via config.d mechanism
        [hypervisor.qemu]
        default_memory = 1024
        default_vcpus = 4
      runtimeClass: |
        kind: RuntimeClass
        apiVersion: node.k8s.io/v1
        metadata:
          name: kata-my-gpu-runtime
          labels:
            app.kubernetes.io/managed-by: kata-deploy
        handler: kata-my-gpu-runtime
        overhead:
          podFixed:
            memory: "640Mi"
            cpu: "500m"
        scheduling:
          nodeSelector:
            katacontainers.io/kata-runtime: "true"
      # Optional: CRI-specific configuration
      containerd:
        snapshotter: "nydus"  # Configure containerd snapshotter (nydus, erofs, etc.)
      crio:
        pullType: "guest-pull"  # Configure CRI-O runtime_pull_image = true

Again, view the default values.yaml file for more details.

Drop-In Runtime Configuration

The base runtime configuration shipped with Kata Containers can be modified using an overlay method. This can be done directly on the filesystem using the instructions found here. You can also use the customRuntimes.runtimes.[name].dropIn configuration in the helm chart to achieve the same results.

Deployment Modes (DaemonSet vs Job)

The chart can install Kata on nodes in one of two ways, selected with the top-level deploymentMode value:

daemonset (default): the long-running kata-deploy DaemonSet installs Kata on every matching node and reverts it when the pod is terminated (i.e. on uninstall). This is the historical behavior and is unchanged.
job: there is no always-on component. A tiny dispatcher Job (the dispatcher, kata-deploy-job-dispatcher) runs as a post-install/post-upgrade hook, enumerates the selected nodes live via the Kubernetes API, and creates one node-pinned install Job per node. Each per-node Job runs the staged install pipeline as ordered initContainers and then exits:
```
host-check -> artifacts -> cri   (initContainers)  ->  label (main)
```
On helm uninstall, a pre-delete dispatcher fans out per-node Jobs that run the pipeline in reverse (unlabel -> revert-cri -> remove-artifacts). Unlike the DaemonSet, nothing keeps running on the node after installation completes, and the dispatcher itself only ever talks to the API server — it never touches the host (so it ships as a separate, minimal image, job.dispatcherImage).

The privilege split is explicit: the dispatcher pod runs fully unprivileged (runAsNonRoot, all capabilities dropped, no privilege escalation, read-only root filesystem, RuntimeDefault seccomp) under a dedicated minimal ServiceAccount whose only rights are nodes: list (cluster-scoped) and managing Jobs in the release namespace. All privileged, host-mutating work stays in the per-node Jobs, which continue to use the kata-deploy ServiceAccount.

deploymentMode: job

Why a dispatcher instead of Helm-rendered per-node Jobs

Rendering one Job per node directly in the chart does not scale: Helm stores the whole rendered release in a single (~1 MiB) Secret and runs hook resources sequentially, so large fleets blow the size limit and/or take far too long. A single Indexed Job or a JobSet removes those limits but cannot guarantee one pod per node once parallelism < node-count: Kubernetes' topology-spread and affinity scheduling ignore completed pods, so as paced pods finish, later pods pile onto a subset of nodes and leave others uncovered.

The dispatcher sidesteps both problems: the Helm release stays O(1) (just the dispatcher + a constant-size ConfigMap holding the per-node Job templates), node membership is resolved at run time, and the dispatcher itself paces the rollout (at most job.parallelism per-node Jobs in flight) while guaranteeing one Job per node. Per-node Jobs are garbage-collected via an ownerReference to the dispatcher and job.ttlSecondsAfterFinished.

Adding nodes in `job` mode

The dispatcher only runs on helm install / helm upgrade / helm uninstall. There is no dispatcher watching for new nodes, so when you add nodes later, re-run helm upgrade; the dispatcher re-enumerates the cluster and installs the new nodes:

helm upgrade kata-deploy "${CHART}" --version "${VERSION}" --reuse-values

Each per-node stage is idempotent (it skips when already applied), so the upgrade only does real work on the newly added nodes.

Recovering from a failed or deleted dispatcher

The dispatcher runs as a blocking post-install/post-upgrade hook Job with restartPolicy: Never and backoffLimit: 0, so if its pod is evicted, drained, or deleted mid-rollout the Job is marked failed and is not restarted automatically — helm install/helm upgrade surfaces the failure rather than leaving you silently half-installed.

What survives the dispatcher dying:

Per-node Jobs already created keep running. They are independent, nodeName-pinned Jobs, not children of the dispatcher pod, so installs that were already dispatched run to completion and those nodes get labeled. Only nodes still queued (never dispatched) are skipped, so at worst you get partial coverage — never a half-mutated host, because each stage is idempotent.
Those per-node Jobs carry a (non-controller) ownerReference to the dispatcher Job, so they survive pod deletion but are garbage-collected once the dispatcher Job itself is removed or its job.ttlSecondsAfterFinished elapses. Keep that TTL comfortably larger than a single node's install so in-flight Jobs are not reaped early.

Recovery is the same one-liner as adding nodes — re-run helm upgrade:

helm upgrade kata-deploy "${CHART}" --version "${VERSION}" --reuse-values

The before-hook-creation delete policy first removes the stale dispatcher Job (cascading away any leftover per-node Jobs); the fresh dispatcher then re-enumerates nodes live, recreates the per-node Jobs (adopting any that still exist rather than duplicating them), and because every stage is idempotent the already-installed nodes are fast no-ops. Coverage converges on the re-run.

Choosing which nodes get a Job

In job mode, node selection is configured under the job key, with the following precedence (highest first):

job.nodes: an explicit list of node names, passed to the dispatcher verbatim.
job.nodeSelector (an equality map) ANDed with job.nodeSelectorExpressions (Kubernetes label-selector requirements using the operators In, NotIn, Exists, DoesNotExist). These are compiled into a single label-selector string that the dispatcher resolves live.
If both are empty, all nodes are targeted.

By default the expressions target worker (non-control-plane) nodes, so no custom node labeling is required (this differs from the DaemonSet nodeSelector examples above, which rely on you labeling nodes). Override as needed:

# Target nodes carrying a specific label:
job:
  nodeSelector:
    kata-containers: "enabled"

# Target every node, including control-plane (e.g. single-node clusters / CI):
job:
  nodeSelectorExpressions: []

# Richer expressions:
job:
  nodeSelectorExpressions:
    - { key: kubernetes.io/os, operator: In, values: ["linux"] }
    - { key: node-role.kubernetes.io/control-plane, operator: DoesNotExist }

# Pin to explicit nodes:
job:
  nodes: ["worker-1", "worker-2"]

Use job.parallelism to pace the rollout — it caps how many per-node Jobs run concurrently (e.g. to limit how many CRI runtimes restart at once on a big fleet). It is effectively capped at the number of targeted nodes.

Choosing which nodes are cleaned up on uninstall

Because the cleanup dispatcher resolves nodes live when it runs at helm uninstall (the dispatcher does the lookup, not Helm at render time), the node set is not frozen into the stored release. This means the default cleanup selector can simply be "nodes carrying the katacontainers.io/kata-runtime label" — i.e. exactly the nodes the install actually labeled, regardless of how the install selector has drifted since.

Override it under job.cleanup, with the same precedence/semantics as install (cleanup.nodes, then cleanup.nodeSelector ANDed with cleanup.nodeSelectorExpressions, else all nodes):

# Only uninstall from specific nodes:
job:
  cleanup:
    nodes: ["worker-1"]

# Use an explicit selector instead of the kata-runtime label default:
job:
  cleanup:
    nodeSelectorExpressions:
      - { key: node-role.kubernetes.io/control-plane, operator: DoesNotExist }

See the default values.yaml for the remaining job.* options (e.g. dispatcherImage, parallelism, ttlSecondsAfterFinished, backoffLimit).

Examples

We provide a few examples that you can pass to helm via the -f/--values flag.

`try-kata-tee.values.yaml`

This file enables only the TEE (Trusted Execution Environment) shims for confidential computing:

helm install kata-deploy oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy \
  --version VERSION \
  -f try-kata-tee.values.yaml

Includes:

qemu-snp - AMD SEV-SNP (amd64)
qemu-tdx - Intel TDX (amd64)
qemu-se - IBM Secure Execution for Linux (SEL) (s390x)
qemu-se-runtime-rs - IBM Secure Execution for Linux (SEL) Rust runtime (s390x)
qemu-coco-dev - Confidential Containers development (amd64, s390x)
qemu-coco-dev-runtime-rs - Confidential Containers development Rust runtime (amd64, arm64, s390x)

`try-kata-nvidia-gpu.values.yaml`

This file enables only the NVIDIA GPU-enabled shims and installs them using the job deployment mode (no always-on DaemonSet on the node):

helm install kata-deploy oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy \
  --version VERSION \
  -f try-kata-nvidia-gpu.values.yaml

Includes:

qemu-nvidia-gpu - Standard NVIDIA GPU support (amd64)
qemu-nvidia-gpu-snp - NVIDIA GPU with AMD SEV-SNP (amd64)
qemu-nvidia-gpu-tdx - NVIDIA GPU with Intel TDX (amd64)

`nodeSelector`

We can deploy Kata only to specific nodes using nodeSelector

# First, label the nodes where you want kata-containers to be installed
$ kubectl label nodes worker-node-1 kata-containers=enabled
$ kubectl label nodes worker-node-2 kata-containers=enabled

# Then install the chart with `nodeSelector`
$ helm install kata-deploy \
  --set nodeSelector.kata-containers="enabled" \
  "${CHART}" --version  "${VERSION}"

You can also use a values file:

nodeSelector:
  kata-containers: "enabled"
  node-type: "worker"

$ helm install kata-deploy -f values.yaml "${CHART}" --version "${VERSION}"

Multiple Kata installations on the Same Node

For debugging, testing and other use-case it is possible to deploy multiple versions of Kata on the very same node. All the needed artifacts are getting the multiInstallSuffix appended to distinguish each installation. BEWARE that one needs at least containerd-2.0 since this version has drop-in conf support which is a prerequisite for the multiInstallSuffix to work properly.

$ helm install kata-deploy-cicd       \
  -n kata-deploy-cicd                 \
  --set env.multiInstallSuffix=cicd   \
  --set env.debug=true                \
  "${CHART}" --version  "${VERSION}"

Note: runtimeClasses are automatically created by Helm (via runtimeClasses.enabled=true, which is the default).

Now verify the installation by examining the runtimeClasses:

$ kubectl get runtimeClasses
NAME                            HANDLER                         AGE
kata-clh-cicd                   kata-clh-cicd                   77s
kata-clh-runtime-rs-cicd        kata-clh-runtime-rs-cicd        77s
kata-dragonball-cicd            kata-dragonball-cicd            77s
kata-fc-cicd                    kata-fc-cicd                    77s
kata-qemu-cicd                  kata-qemu-cicd                  77s
kata-qemu-coco-dev-cicd         kata-qemu-coco-dev-cicd         77s
kata-qemu-nvidia-gpu-cicd       kata-qemu-nvidia-gpu-cicd       77s
kata-qemu-nvidia-gpu-snp-cicd   kata-qemu-nvidia-gpu-snp-cicd   77s
kata-qemu-nvidia-gpu-tdx-cicd   kata-qemu-nvidia-gpu-tdx-cicd   76s
kata-qemu-runtime-rs-cicd       kata-qemu-runtime-rs-cicd       77s
kata-qemu-se-runtime-rs-cicd    kata-qemu-se-runtime-rs-cicd    77s
kata-qemu-snp-cicd              kata-qemu-snp-cicd              77s
kata-qemu-tdx-cicd              kata-qemu-tdx-cicd              77s
kata-stratovirt-cicd            kata-stratovirt-cicd            77s

RuntimeClass Node Selectors for TEE Shims

Manual configuration: Any nodeSelector you set under shims.<shim>.runtimeClass.nodeSelector is always applied to that shim's RuntimeClass, whether or not NFD is present. Use this when you want to pin TEE workloads to specific nodes (e.g. without NFD, or with custom labels).

Auto-inject when NFD is present: If you do not set a runtimeClass.nodeSelector for a TEE shim, the chart can automatically inject NFD-based labels when NFD is detected in the cluster (deployed by this chart with node-feature-discovery.enabled=true or found externally):

AMD SEV-SNP shims: amd.feature.node.kubernetes.io/snp: "true"
Intel TDX shims: intel.feature.node.kubernetes.io/tdx: "true"
IBM Secure Execution for Linux (SEL) shims (s390x): feature.node.kubernetes.io/cpu-security.se.enabled: "true"

The chart uses Helm's lookup function to detect NFD (by looking for the node-feature-discovery-worker DaemonSet). Auto-inject only runs when NFD is detected and no manual runtimeClass.nodeSelector is set for that shim.

Note: NFD detection requires cluster access. During helm template (dry-run without a cluster), external NFD is not seen, so auto-injected labels are not added. Manual runtimeClass.nodeSelector values are still applied in all cases.

16 KiB Raw Blame History