Update Go from 1.24.11 to 1.24.12 to address security vulnerabilities
in the standard library:
- GO-2026-4342: Excessive CPU consumption in archive/zip
- GO-2026-4341: Memory exhaustion in net/url query parsing
- GO-2026-4340: TLS handshake encryption level issue in crypto/tls
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Remove the initrd function and add the image function to align
with the functions that actually exist in this file.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
We want to enable local and remote CUDA repository builds.
Move the cuda and tools repos to versions.yaml and use a
unified build for both types.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Update the Helm chart README to document the new shims.disableAll
option and simplify the examples that previously required listing
every shim to disable.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Simplify the example values files by using the new shims.disableAll
option instead of listing every shim to disable.
Before (try-kata-nvidia-gpu.values.yaml):
  shims:
    clh:
      enabled: false
    cloud-hypervisor:
      enabled: false
    # ... 15 more lines ...
After:
  shims:
    disableAll: true
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add a new `shims.disableAll` option that disables all standard shims
at once. This is useful when:
- Enabling only specific shims without listing every other shim
- Using custom runtimes only mode (no standard Kata shims)
Usage:
  shims:
    disableAll: true
    qemu:
      enabled: true  # Only qemu is enabled
All helper templates are updated to check for this flag before
iterating over shims.
Note that Helm recursively merges user values with chart defaults,
which makes a simple `disableAll` flag problematic: if the defaults set
`enabled: true`, the user's `disableAll: true` is merged with those
defaults and all shims stay enabled.
The workaround is to use null (`~`) as the default for the `enabled`
field. The template logic interprets null differently based on
disableAll:
| enabled value | disableAll: false | disableAll: true |
|---------------|-------------------|------------------|
| ~ (null) | Enabled | Disabled |
| true | Enabled | Enabled |
| false | Disabled | Disabled |
This is backward compatible:
- Default behavior unchanged: all shims enabled when disableAll: false
- Users can set `disableAll: true` to disable all, then explicitly
enable specific shims with `enabled: true`
- Explicit `enabled: false` always disables, regardless of disableAll
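The table above can be sketched as a small function. This is only an
illustration in Rust, not the Helm template code itself; the name
shim_enabled and the use of Option<bool> to model null are assumptions.

```rust
/// Resolve whether a shim ends up enabled, following the table above.
/// `enabled` is `None` when the value is left at the null (`~`) default.
/// (Hypothetical helper; the chart implements this in its templates.)
fn shim_enabled(enabled: Option<bool>, disable_all: bool) -> bool {
    match enabled {
        // An explicit true/false always wins, regardless of disableAll.
        Some(explicit) => explicit,
        // Null falls back to the global switch.
        None => !disable_all,
    }
}
```

For example, shim_enabled(None, true) is false, while
shim_enabled(Some(true), true) is true.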
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add functions to install and remove custom runtime configuration files.
Each custom runtime gets an isolated directory structure:
  custom-runtimes/{handler}/
    configuration-{baseConfig}.toml  # Copied from base config
    config.d/
      50-overrides.toml              # User's drop-in overrides
The base config is copied AFTER kata-deploy has applied its modifications
(debug settings, proxy configuration, annotations), so custom runtimes
inherit these settings.
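A minimal sketch of how the paths in this layout could be computed. The
struct and function names are hypothetical, and the root directory is
passed in by the caller because its actual location is not stated here.

```rust
use std::path::{Path, PathBuf};

/// Paths that make up one custom runtime's isolated directory.
/// (Hypothetical type, for illustration only.)
struct CustomRuntimeLayout {
    dir: PathBuf,
    config_file: PathBuf,
    drop_in_file: PathBuf,
}

/// Build the layout for `handler` under `root`, where `root` is whatever
/// directory contains `custom-runtimes/` (an assumption of this sketch).
fn custom_runtime_layout(root: &Path, handler: &str, base_config: &str) -> CustomRuntimeLayout {
    let dir = root.join("custom-runtimes").join(handler);
    CustomRuntimeLayout {
        // configuration-{baseConfig}.toml, copied from the base config
        config_file: dir.join(format!("configuration-{base_config}.toml")),
        // config.d/50-overrides.toml, the user's drop-in overrides
        drop_in_file: dir.join("config.d").join("50-overrides.toml"),
        dir,
    }
}
```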
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add functions to configure custom runtimes in containerd and CRI-O.
Custom runtimes use an isolated config directory under:
custom-runtimes/{handler}/
Custom runtimes automatically derive the shim binary path from the
baseConfig field using the existing is_rust_shim() logic.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add support for parsing custom runtime configurations from a mounted
ConfigMap. This allows users to define their own RuntimeClasses with
custom Kata configurations.
The ConfigMap format uses a custom-runtimes.list file with entries:
handler:baseConfig:containerd_snapshotter:crio_pulltype
Drop-in files are read from dropin-{handler}.toml, if present.
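As an illustration of the entry format, a line could be parsed roughly
as follows. The type and function names are hypothetical, and treating
empty trailing fields as "not set" is an assumption of this sketch, not
something stated above.

```rust
/// One entry of custom-runtimes.list (hypothetical type).
#[derive(Debug, PartialEq)]
struct CustomRuntime {
    handler: String,
    base_config: String,
    containerd_snapshotter: Option<String>,
    crio_pulltype: Option<String>,
}

/// Map a missing or empty field to `None` (assumed convention).
fn non_empty(field: Option<&str>) -> Option<String> {
    field.filter(|v| !v.is_empty()).map(String::from)
}

/// Parse `handler:baseConfig:containerd_snapshotter:crio_pulltype`.
/// Returns `None` for blank lines or when handler/baseConfig is missing.
fn parse_custom_runtime_line(line: &str) -> Option<CustomRuntime> {
    let line = line.trim();
    if line.is_empty() {
        return None;
    }
    // At most four fields; anything after the third ':' stays in field four.
    let mut fields = line.splitn(4, ':').map(str::trim);
    let handler = fields.next().filter(|s| !s.is_empty())?;
    let base_config = fields.next().filter(|s| !s.is_empty())?;
    let containerd_snapshotter = non_empty(fields.next());
    let crio_pulltype = non_empty(fields.next());
    Some(CustomRuntime {
        handler: handler.to_string(),
        base_config: base_config.to_string(),
        containerd_snapshotter,
        crio_pulltype,
    })
}
```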
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Let's extract the common logic from configure_containerd_runtime and
configure_crio_runtime into reusable helper functions. This reduces
code duplication and prepares for adding custom runtime support.
For containerd:
- Add ContainerdRuntimeParams struct to encapsulate common parameters
- Add get_containerd_pluginid() to extract version detection logic
- Add get_containerd_output_path() to extract file path resolution
- Add write_containerd_runtime_config() to write common TOML values
For CRI-O:
- Add CrioRuntimeParams struct to encapsulate common parameters
- Add write_crio_runtime_config() to write common configuration
While here, let's also simplify pod_annotations to always use
"[\"io.katacontainers.*\"]" for all runtimes, as the NVIDIA specific
case has been removed from the shell script, but we forgot to do so
here.
No functional changes intended.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add the necessary configuration and code changes to support QEMU
on arm64 architecture in runtime-rs.
Changes:
- Set MACHINETYPE to "virt" for arm64
- Add machine accelerators "usb=off,gic-version=host" required for
proper arm64 virtualization
- Add arm64-specific kernel parameter "iommu.passthrough=0"
- Guard vIOMMU (Intel IOMMU) to skip on arm64 since it's not supported
These changes align runtime-rs with the Go runtime's arm64 QEMU support.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Kevin Zhao <kevin.zhao@linaro.org>
The verification job mounts a ConfigMap containing the pod spec for
the Kata runtime test. Previously, both the ConfigMap and the Job were
Helm hooks with different weights (-5 and 0 respectively).
On k3s, a race condition was observed where the Job pod would be
scheduled before the kubelet's informer cache had registered the
ConfigMap, causing a FailedMount error:
MountVolume.SetUp failed for volume "pod-spec": object
"kube-system"/"kata-deploy-verification-spec" not registered
This happened because k3s's lightweight architecture schedules pods
very quickly, and the hook weight difference only controls Helm's
ordering, not actual timing between resource creation and cache sync.
By making the ConfigMap a regular chart resource (removing hook
annotations), it is created during the main chart installation phase,
well before any post-install hooks run. This guarantees the ConfigMap
is fully propagated to all kubelets before the verification Job starts.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The verification job needs to list nodes to check for the
katacontainers.io/kata-runtime label and list events to detect
FailedCreatePodSandBox errors during pod creation.
This was discovered when testing with k0s, where the service account
lacked the required cluster-scope permissions to list nodes.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Remove k0s-worker and k0s-controller from
RUNTIMES_WITHOUT_CONTAINERD_DROP_IN_SUPPORT and always return true for
k0s in is_containerd_capable_of_using_drop_in_files since k0s auto-loads
from containerd.d/ directory regardless of containerd version.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add microk8s case to get_containerd_paths() method and remove microk8s
from RUNTIMES_WITHOUT_CONTAINERD_DROP_IN_SUPPORT to enable dynamic
containerd version checking.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Introduce ContainerdPaths struct and get_containerd_paths() method to
centralize the complex logic for determining containerd configuration
file paths across different Kubernetes distributions.
The new ContainerdPaths struct includes:
- config_file: File to read containerd version from and write to
- backup_file: Backup file path before modification
- imports_file: File to add/remove drop-in imports from (Option<String>)
- drop_in_file: Path to the drop-in configuration file
- use_drop_in: Whether drop-in files can be used
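For orientation, the struct might look roughly like this. Only the
field names and the Option<String> type of imports_file come from the
list above; the other field types, the derives, and the example values
are assumptions made for this sketch.

```rust
/// Sketch of the struct described above (field types partly assumed).
#[derive(Debug, Clone, PartialEq)]
struct ContainerdPaths {
    /// File to read the containerd version from and write to.
    config_file: String,
    /// Backup file path used before modification.
    backup_file: String,
    /// File to add/remove drop-in imports from, if any.
    imports_file: Option<String>,
    /// Path to the drop-in configuration file.
    drop_in_file: String,
    /// Whether drop-in files can be used.
    use_drop_in: bool,
}

/// Placeholder values only; the real paths depend on the distribution
/// and are computed by get_containerd_paths().
fn example_paths() -> ContainerdPaths {
    ContainerdPaths {
        config_file: "/etc/containerd/config.toml".to_string(),
        backup_file: "/etc/containerd/config.toml.bak".to_string(),
        imports_file: Some("/etc/containerd/config.toml".to_string()),
        drop_in_file: "/etc/containerd/kata-deploy.toml".to_string(),
        use_drop_in: true,
    }
}
```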
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The JSONPath parser was incorrectly splitting on escaped dots (\.)
causing microk8s detection to fail. Labels like "microk8s.io/cluster"
were being split into ["microk8s\", "io/cluster"] instead of being
treated as a single key.
This adds a split_jsonpath() helper that properly handles escaped dots,
allowing the automatic microk8s detection via the node label to work
correctly.
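A minimal sketch of escape-aware splitting, to illustrate the idea. It
is not necessarily how the actual split_jsonpath() is written; in
particular, unescaping `\.` to a plain dot in the returned segments is
an assumption.

```rust
/// Split a JSONPath-like string on '.', treating "\." as a literal dot.
/// (Illustrative version; the real helper may differ in details.)
fn split_jsonpath(path: &str) -> Vec<String> {
    let mut segments = Vec::new();
    let mut current = String::new();
    let mut chars = path.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // Escaped dot: consume the '.' and keep it in the current key.
            '\\' if chars.peek() == Some(&'.') => {
                chars.next();
                current.push('.');
            }
            // Unescaped dot: end of the current segment.
            '.' => segments.push(std::mem::take(&mut current)),
            other => current.push(other),
        }
    }
    segments.push(current);
    segments
}
```

With this, the path `metadata.labels.microk8s\.io/cluster` splits into
three segments, the last being `microk8s.io/cluster`.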
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The verification job now supports configurable timeouts to accommodate
different environments and network conditions. The daemonset timeout
defaults to 1200 seconds (20 minutes) to allow for large image downloads,
while the verification pod timeout defaults to 180 seconds.
The job now waits for the DaemonSet to exist, pods to be scheduled,
rollout to complete, and nodes to be labeled before creating the
verification pod. A 15-second delay is added after node labeling to
allow kubelet time to refresh runtime information.
Retry logic with 3 attempts and a 10-second delay handles transient
FailedCreatePodSandBox errors that can occur during runtime
initialization. The job only fails on pod errors after a 30-second
grace period to avoid false positives from timing issues.
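The retry behaviour can be sketched generically as below. This is an
illustration in Rust rather than the job's actual script; the function
name and signature are hypothetical, and the delay is a parameter so
the 10-second value above is just one possible argument.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Run `op` up to `attempts` times (at least once), sleeping `delay`
/// between failed attempts. `op` receives the 1-based attempt number.
/// Returns the first success, or the error from the last attempt.
fn retry<T, E>(
    attempts: u32,
    delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error.
            Err(err) if attempt >= attempts => return Err(err),
            // Transient failure: wait, then try again.
            Err(_) => {
                sleep(delay);
                attempt += 1;
            }
        }
    }
}
```

The values described above would correspond to a call such as
retry(3, Duration::from_secs(10), ...).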
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The runk tool hasn't been supported for a few years, with no maintainers
since ManaSugi stopped being involved in the project and the CI was
disabled in 2024.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This reverts commit 6130d7330f, as we're
officially switching to the Rust version of kata-deploy.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
a2534e7bc8 introduced the logic to also
release a kata-tools tarball, but it missed allowing
KATA_TOOLS_STATIC_TARBALL env var to be passed to the release script,
leading to the following error during the release process:
```
ERROR: Invalid environment variable "KATA_TOOLS_STATIC_TARBALL"
```
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add optional verification that runs after kata-deploy installation.
When a pod spec is provided via --set-file verification.pod=<file>,
a verification job runs after install/upgrade to validate deployment.
The user is fully responsible for the verification pod content:
- Pod name, runtimeClassName, annotations, and verification logic
- Pod must exit 0 on success, non-zero on failure
The verification job simply:
1. Waits for kata-deploy DaemonSet to be ready
2. Applies the user-provided pod spec
3. Waits for the pod to complete
4. Shows logs and cleans up
Usage:
helm install kata-deploy ... \
--set-file verification.pod=/path/to/your-pod.yaml
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The new NVRC version works for both CC and non-CC use cases, so
--feature confidential is no longer needed.
Bump versions.yaml and adjust deployment instructions.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
We used to determine the driver with fragile file-based checks.
Now, with versions.yaml, there is a single source of truth.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
We need to package the build modules for the rootfs
to be able to consume it. We package the whole
/lib/modules/$(uname -r) directory strip=2.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
We never actually installed yq in the kernel build. Some paths use
yq but were never hit. For the GPU use-case we need to read values
from versions.yaml, so yq is now required.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
This is needed as the 580 driver doesn't build against 6.18.x, and the
590 driver is not yet fully working for our case, thus we stick to the
previous version that worked before.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Bump both the kernel and kernel-confidential versions from v6.12.x and
v6.16.x to v6.18.4, aligning with the new LTS release.
Kernel 6.18 introduced several configuration changes that required
updates to our kernel config fragments:
* CRYPTO_FIPS dependencies changed:
- In 6.12: depended on !CRYPTO_MANAGER_DISABLE_TESTS
- In 6.18: now depends on CRYPTO_SELFTESTS (which requires EXPERT)
Added CONFIG_EXPERT=y and CONFIG_CRYPTO_SELFTESTS=y to crypto.conf
to satisfy the new dependency chain.
* CONFIG_EXPERT is a naughty one, as it disables / enables a bunch
of things behind one's back, probably just to prove a point that
it is for experts ;-) ... regardless, a reasonable number of
options had to be re-added in order to make sure nothing ends
up broken.
* Legacy iptables support:
Kernel 6.18 requires explicit legacy xtables/iptables configs for
IP_NF_* options. Added CONFIG_NETFILTER_XTABLES_LEGACY,
CONFIG_IP_NF_IPTABLES_LEGACY, and CONFIG_IP6_NF_IPTABLES_LEGACY
to netfilter.conf.
* Module signing dependencies:
Added CONFIG_MODULES=y and other required dependencies to
module_signing.conf to ensure MODULE_SIG can be properly enabled.
* Whitelist updates:
- Added CONFIG_NF_CT_PROTO_DCCP (removed in 6.18+)
- Added CONFIG_CRYPTO_SELFTESTS, CONFIG_NETFILTER_XTABLES_LEGACY,
CONFIG_IP_NF_IPTABLES_LEGACY, CONFIG_IP6_NF_IPTABLES_LEGACY
(added in 6.18+, not present in older kernels like 6.12)
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This image will be used by our helm charts to verify that a
kata-containers deployment is correct.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
In some builds we are seeing:
```
error: could not create temp file /opt/rustup/tmp/r2xu46kwuyc7k2kr_file: Permission denied (os error 13)
```
in the agent-ctl build, so port the fix from #12313 to the tools build
to try to resolve this.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Fix deploying kata-containers on k3s. The deploy script fails with:
/opt/kata-artifacts/scripts/kata-deploy.sh: line 397: [: too many arguments
Signed-off-by: Federico A. Corazza <git@facorazza.com>
The following error was observed during virtiofsd static build:
```
error: could not create temp file /opt/rustup/tmp/p44enysfaxwdbvw4_file:
Permission denied (os error 13)
```
This occurs because RUSTUP_HOME and CARGO_HOME were initialized by the
root user during `docker build`, but `cargo build` is executed as a
non-root user via 'docker run --user'.
Ensure these directories are writable by adjusting the permission after
the toolchain installation is complete.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>